This post contains the latest paper listing retrieved from Arxiv.org on 2025-10-09, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.


Overview (2025-10-09)

A total of 592 papers were updated today (cross-listed papers are counted once in each category below), including:

  • Natural Language Processing: 126 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 185 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 114 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 188 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] Artificial Hippocampus Networks for Efficient Long-Context Modeling

Quick read: This paper tackles the efficiency-fidelity trade-off in long-sequence modeling between the fixed-size compressive memory of RNN-like models and the lossless growing memory of attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, it proposes a memory framework for artificial neural networks: a sliding window preserves the Transformer's key-value (KV) cache as lossless short-term memory, while a learnable module, the Artificial Hippocampus Network (AHN), recurrently compresses out-of-window information into a compact fixed-size long-term memory. The design keeps accuracy high while substantially cutting compute and memory costs; experiments show that AHN-augmented models beat sliding-window baselines on long-context benchmarks such as LV-Eval and InfiniteBench and approach or even surpass full-attention models.

Link: https://arxiv.org/abs/2510.07318
Authors: Yunhao Fang,Weihao Yu,Shu Zhong,Qinghao Ye,Xuehan Xiong,Lai Wei
Affiliations: ByteDance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer’s KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: this https URL.
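
To make the memory design concrete, here is a minimal Python sketch of the sliding-window-plus-compression idea, assuming a toy GRU-style update; the class, weight matrices, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class ToyAHNMemory:
    """Toy sketch of the AHN idea: keep a sliding window of exact
    states and recurrently compress evicted states into a fixed-size
    long-term memory. The GRU-style update is an assumption."""

    def __init__(self, window: int, d: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.window = window
        self.kv_window = []                  # lossless short-term memory
        self.long_term = np.zeros(d)         # fixed-size compressed memory
        self.W_z = rng.normal(scale=0.1, size=(d, d))  # update-gate weights
        self.W_h = rng.normal(scale=0.1, size=(d, d))  # candidate weights

    def _compress(self, evicted: np.ndarray) -> None:
        # GRU-like recurrent update: long_term absorbs the evicted state.
        z = 1 / (1 + np.exp(-(self.W_z @ evicted)))   # update gate
        h = np.tanh(self.W_h @ evicted)               # candidate state
        self.long_term = (1 - z) * self.long_term + z * h

    def append(self, kv_state: np.ndarray) -> None:
        self.kv_window.append(kv_state)
        if len(self.kv_window) > self.window:
            self._compress(self.kv_window.pop(0))  # out-of-window -> compress

mem = ToyAHNMemory(window=4, d=8)
for t in range(10):                          # stream 10 token states
    mem.append(np.random.default_rng(t).normal(size=8))
print(len(mem.kv_window), mem.long_term.shape)  # 4 (window), (8,) fixed size
```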

[NLP-1] Vibe Checker: Aligning Code Evaluation with Human Preference

Quick read: This paper addresses a blind spot in current code-generation evaluation: metrics such as pass@k measure only functional correctness and ignore the non-functional instructions behind users' "vibe check" in real programming, such as readability, intent preservation, and style consistency. The key contribution is VeriCode, a taxonomy of 30 verifiable code instructions with corresponding deterministic verifiers, and Vibe Checker, a testbed built on it that jointly measures functional correctness and instruction following. Experiments show that a composite score of the two correlates best with human preference, and that instruction following is the primary differentiator among strong models on real-world programming tasks.

Link: https://arxiv.org/abs/2510.07315
Authors: Ming Zhong,Xiang Zhou,Ting-Yun Chang,Qingze Wang,Nan Xu,Xiance Si,Dan Garrette,Shyam Upadhyay,Jeremiah Liu,Jiawei Han,Benoit Schillings,Jiao Sun
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Preprint

Abstract:Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models’ code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
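
As an illustration of what a deterministic verifier for a non-functional instruction can look like, here is a short Python sketch; the two checks (snake_case function names, a line-length limit) are hypothetical examples in the spirit of VeriCode, not items from its actual taxonomy.

```python
import ast
import re

def verify_snake_case_functions(code: str) -> bool:
    """Deterministic check: every function name is snake_case.
    An illustrative verifier, not one of VeriCode's 30 instructions."""
    tree = ast.parse(code)
    names = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    return all(re.fullmatch(r"[a-z_][a-z0-9_]*", name) for name in names)

def verify_max_line_length(code: str, limit: int = 79) -> bool:
    """Deterministic check: no source line exceeds `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())

sample = "def add_two_numbers(a, b):\n    return a + b\n"
print(verify_snake_case_functions(sample))  # True
print(verify_max_line_length(sample))       # True
```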

[NLP-2] Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain ACL

Quick read: This paper targets the gap between current text-to-SQL models and real business scenarios, where complex queries demand multi-step agentic intelligence such as causal reasoning, time-series forecasting, and strategic recommendation. The key contribution is CORGI, a new benchmark built on synthetic databases inspired by enterprises such as DoorDash, Airbnb, and Lululemon, with questions spanning four increasingly complex categories: descriptive, explanatory, predictive, and recommendational. This enables systematic evaluation of LLMs' multi-level reasoning and execution in realistic business contexts; measured by execution success rate, CORGI is about 21% harder than the BIRD benchmark, exposing a clear performance gap on high-level business-intelligence tasks.

Link: https://arxiv.org/abs/2510.07309
Authors: Yue Li,Ran Tao,Derek Hommel,Yusuf Denizay Dönder,Sungyong Chang,David Mimno,Unso Eun Seo Jo
Affiliations: Cornell University; Gena AI
Subjects: Computation and Language (cs.CL)
Comments: 20 pages, 6 figures, under review for ACL ARR

Abstract:In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between popular LLMs and the need for real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.

[NLP-3] Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning

Quick read: This paper addresses two key problems Large Reasoning Models (LRMs) face on non-English languages: maintaining language consistency across input, thought, and output, and the higher rate of wrong reasoning paths and markedly lower answer accuracy outside English, which hurt non-English users and limit global deployment. The proposed M-Thinker is trained with the GRPO algorithm and two rewards: a Language Consistency (LC) reward that strictly constrains the input, thought, and answer to share one language, and a novel Cross-lingual Thinking Alignment (CTA) reward that compares the model's non-English reasoning paths against its English ones to transfer its English reasoning ability to other languages, improving accuracy and robustness in non-English settings.

Link: https://arxiv.org/abs/2510.07300
Authors: Xue Zhang,Yunlong Liang,Fandong Meng,Songming Zhang,Kaiyu Huang,Yufeng Chen,Jinan Xu,Jie Zhou
Affiliations: Beijing Jiaotong University; Pattern Recognition Center, WeChat AI, Tencent Inc
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 8 tables, 4 figures

Abstract:Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the “think-then-answer” paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) They often struggle to maintain input-output language consistency; (2) They generally perform poorly with wrong reasoning paths and lower answer accuracy compared to English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained by the GRPO algorithm that involves a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward defines a strict constraint on the language consistency between the input, thought, and answer. Besides, the CTA reward compares the model’s non-English reasoning paths with its English reasoning path to transfer its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization on out-of-domain languages.
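
A minimal sketch of what a language-consistency reward could look like, with a crude character-range heuristic standing in for a real language identifier; the function names and the binary reward form are assumptions, not the paper's exact formulation.

```python
import re

def toy_lang(text: str) -> str:
    """Crude stand-in for a language identifier (a real system would use
    a proper LangID model): 'zh' if the text contains CJK characters,
    else 'en'."""
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"

def language_consistency_reward(question: str, thought: str, answer: str) -> float:
    """Sketch of an LC-style reward: 1.0 only if the question, the
    reasoning, and the answer are all in the same language (the strict
    constraint described in the abstract)."""
    langs = {toy_lang(question), toy_lang(thought), toy_lang(answer)}
    return 1.0 if len(langs) == 1 else 0.0

print(language_consistency_reward("12加7等于几?", "12加7是19。", "答案是19。"))  # 1.0
print(language_consistency_reward("12加7等于几?", "12 plus 7 is 19.", "19"))    # 0.0
```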

[NLP-4] AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs

Quick read: This paper addresses two core challenges large audio language models (LALMs) face on long-form audio: the quadratic O(N^2) cost of attention and the difficulty of modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and cannot assess performance in realistic long-context settings. The key contribution is AudioMarathon, a benchmark of diverse tasks with long audio inputs of 90.0 to 300.0 seconds (2,250 to 7,500 audio tokens), full domain coverage across speech, sound, and music, and complex tasks requiring multi-hop reasoning. Evaluating state-of-the-art LALMs on it reveals clear performance drops as audio length grows, and an analysis of token pruning and KV cache eviction exposes the accuracy-speed trade-offs, underscoring the need for better temporal reasoning and memory-efficient architectures.

Link: https://arxiv.org/abs/2510.07293
Authors: Peize He,Zichen Wen,Yubo Wang,Yuxuan Wang,Xiaoqian Liu,Jiajie Huang,Zehui Lei,Zhuangcheng Gu,Xiangqi Jin,Jiabing Yang,Kai Li,Zhifei Liu,Weijia Li,Cunxiang Wang,Conghui He,Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Northeastern University; Carnegie Mellon University; University of Chinese Academy of Sciences; Tsinghua University; Sun Yat-sen University
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 26 pages, 23 figures, the code is available at \url{ this https URL }

Abstract:Processing long-form audio is a major challenge for Large Audio Language models (LALMs). These models struggle with the quadratic cost of attention (O(N^2)) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do not evaluate models in realistic long context settings. To address this gap, we introduce AudioMarathon, a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. AudioMarathon provides a diverse set of tasks built upon three pillars: long-context audio inputs with durations ranging from 90.0 to 300.0 seconds, which correspond to encoded sequences of 2,250 to 7,500 audio tokens, respectively, full domain coverage across speech, sound, and music, and complex reasoning that requires multi-hop inference. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. We also study acceleration techniques and analyze the trade-offs of token pruning and KV cache eviction. The results show large gaps across current LALMs and highlight the need for better temporal reasoning and memory-efficient architectures. We believe AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.

[NLP-5] On the Convergence of Moral Self-Correction in Large Language Models

Quick read: This paper asks how large language models (LLMs) achieve intrinsic self-correction, improving responses without being told the specific errors. Its key finding is a mechanism behind the convergence of moral self-correction: consistently injected self-correction instructions activate the model's internal moral concepts, which reduce model uncertainty; as the activated concepts stabilize over successive rounds of interaction, performance converges. This offers a mechanistic account of LLMs' capacity for intrinsic self-improvement and shows the potential of moral self-correction for raising response quality.

Link: https://arxiv.org/abs/2510.07290
Authors: Guangliang Liu,Haitao Mao,Bochuan Cao,Zhiyu Xue,Xitong Zhang,Rongrong Wang,Kristen Marie Johnson
Affiliations: Michigan State University; Amazon; Pennsylvania State University; University of California, Santa Barbara
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Abstract:Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
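
The multi-round setup the paper analyzes can be sketched as a simple loop that re-injects the same abstract instruction each round and stops once the answer stabilizes; `generate` below is a placeholder for any LLM call, and the stopping rule is an assumption for illustration.

```python
def self_correction_rounds(generate, question: str, instruction: str,
                           max_rounds: int = 5):
    """Sketch of multi-round intrinsic self-correction: the same abstract
    instruction is injected every round, and we stop once the answer
    stops changing (the convergence behavior the paper analyzes)."""
    history, answer = [], None
    prompt = question
    for _ in range(max_rounds):
        new_answer = generate(prompt)
        history.append(new_answer)
        if new_answer == answer:        # answer has converged
            break
        answer = new_answer
        prompt = f"{question}\n{instruction}\nPrevious answer: {answer}"
    return answer, history

# Toy model that "converges" after one correction round.
canned = iter(["a biased answer", "a fairer answer", "a fairer answer"])
ans, hist = self_correction_rounds(lambda p: next(canned),
                                   "Q?", "Please answer without bias.")
print(ans, len(hist))  # 'a fairer answer' 3
```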

[NLP-6] Online Rubrics Elicitation from Pairwise Comparisons

Quick read: This paper addresses the limitations of static rubrics in LLM post-training: they invite reward-hacking behaviors and fail to capture desiderata that emerge during training. The key idea, Online Rubrics Elicitation (OnlineRubrics), dynamically curates evaluation criteria online via pairwise comparisons between responses from the current policy and a reference policy, enabling continuous identification and mitigation of errors as training proceeds. Empirically it yields gains of up to 8% over training with static rubrics alone on AlpacaEval, GPQA, ArenaHard, and validation sets of expert questions and rubrics.

Link: https://arxiv.org/abs/2510.07284
Authors: MohammadHossein Rezaei,Robert Vacareanu,Zihao Wang,Clinton Wang,Yunzhong He,Afra Feyza Akyürek
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
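
A minimal sketch of the online elicitation loop, with placeholder callables for the two policies and the judge; the real method's criterion extraction and curation are richer than what is shown here.

```python
def elicit_rubrics_online(prompts, current_policy, reference_policy,
                          judge, rubric_set):
    """Sketch of OnlineRubrics-style elicitation: pairwise-compare
    responses from the current and reference policies, let a judge model
    propose criteria that explain the preference, and fold new criteria
    into the live rubric set. All callables are stand-ins; `judge`
    should return a list of criterion strings."""
    for prompt in prompts:
        a = current_policy(prompt)
        b = reference_policy(prompt)
        for criterion in judge(prompt, a, b):
            rubric_set.add(criterion)      # dynamically curated criteria
    return rubric_set

rubrics = elicit_rubrics_online(
    prompts=["Explain overfitting."],
    current_policy=lambda p: "Overfitting is... (terse)",
    reference_policy=lambda p: "Overfitting happens when... (organized)",
    judge=lambda p, a, b: ["organization: uses a clear structure"],
    rubric_set={"transparency: states assumptions"},
)
print(sorted(rubrics))
```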

[NLP-7] Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models

Quick read: This paper tackles a key bottleneck for small language models (SLMs) in tool-augmented AI systems: tool-name hallucination caused by schema misalignment, i.e., a mismatch between the provided tool schema and the model's pretrained knowledge. The proposed PA-Tool (Pretraining-Aligned Tool Schema Generation) is training-free: it uses peakedness, a signal from contamination detection that indicates pretraining familiarity, to identify naming patterns the model already knows, generating multiple candidates and selecting those with the highest output concentration across samples to automatically rename tool components. Experiments on MetaTool and RoTBench show improvements of up to 17 percentage points, with schema-misalignment errors cut by 80%.

Link: https://arxiv.org/abs/2510.07248
Authors: Jonggeun Lee,Woojung Song,Jongwook Han,Haesung Pyun,Yohan Jo
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 4 figures

Abstract:Small language models (SLMs) offer significant computational advantages for tool-augmented AI systems, yet they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible but non-existent tool names that reflect naming conventions internalized during pretraining but absent from the provided tool schema. Rather than forcing models to adapt to arbitrary schemas, we propose adapting schemas to align with models’ pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness-a signal from contamination detection indicating pretraining familiarity-to automatically rename tool components. By generating multiple candidates and selecting those with highest output concentration across samples, PA-Tool identifies pretrain-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17% points, with schema misalignment errors reduced by 80%. PA-Tool enables small models to approach state-of-the-art performance while maintaining computational efficiency for adaptation to new tools without retraining. Our work demonstrates that schema-level interventions can unlock the tool-use potential of resource-efficient models by adapting schemas to models rather than models to schemas.
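
The selection rule can be sketched in a few lines: sample several candidate names for a tool component and keep the one with the highest concentration across samples; the tool name and samples below are hypothetical.

```python
from collections import Counter

def pick_peaked_name(candidates: list[str]) -> tuple[str, float]:
    """Sketch of a PA-Tool-style selection rule: among sampled candidate
    names, pick the one with the highest output concentration
    ("peakedness"); a peaked distribution suggests the name matches
    patterns familiar from pretraining."""
    counts = Counter(candidates)
    name, freq = counts.most_common(1)[0]
    return name, freq / len(candidates)   # peakedness as mode share

# Hypothetical samples from a model asked to rename `get_wx_forecast`.
samples = ["get_weather_forecast", "get_weather_forecast",
           "fetch_weather", "get_weather_forecast", "weather_lookup"]
print(pick_peaked_name(samples))  # ('get_weather_forecast', 0.6)
```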

[NLP-8] LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation EMNLP

Quick read: This paper addresses a core problem in evaluating large language model (LLM) outputs in the legal domain: existing approaches either depend on costly reference data or use standardized scoring that fails to reflect how legal professionals actually judge answers, and they are unreliable and inconsistent. The key idea is a new reference-free evaluation paradigm: long legal responses are broken into "Legal Data Points" (LDPs), self-contained units of information, and the evaluation mirrors how lawyers actually review legal question-answering. The method outperforms a variety of baselines on both a proprietary dataset and the open-source LegalBench, correlates more closely with human expert evaluations, improves inter-annotator agreement, and the LDPs for a subset of LegalBench are open-sourced for reproducibility.

Link: https://arxiv.org/abs/2510.07243
Authors: Joseph Enguehard,Morgane Van Ermengem,Kate Atkinson,Sujeong Cha,Arijit Ghosh Chowdhury,Prashanth Kallur Ramaswamy,Jeremy Roghair,Hannah R Marlowe,Carina Suzana Negreanu,Kitty Boxall,Diana Mincu
Affiliations: Robin AI; Amazon Web Services
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in Natural Legal Language Processing - EMNLP Workshop 2025

Abstract:Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into ‘Legal Data Points’ (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.

[NLP-9] Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Quick read: This paper addresses the narrowness of supervision when post-training large language models (LLMs) for reasoning with deterministic verifiers: 0-1 correctness feedback is reliable but misses partially correct or alternative answers, limiting learning. The proposed HERO (Hybrid Ensemble Reward Optimization) structurally integrates verifier signals with reward-model (RM) scores: stratified normalization bounds RM scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting strengthens supervision on challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently beats RM-only and verifier-only baselines, combining the stability of verifiers with the nuance of reward models.

Link: https://arxiv.org/abs/2510.07242
Authors: Leitian Tao,Ilia Kulikov,Swarnadeep Saha,Tianlu Wang,Jing Xu,Yixuan Li,Jason E Weston,Ping Yu
Affiliations: Meta
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages

Abstract:Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle–many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
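
A toy sketch of the hybrid reward idea for one prompt's rollouts, assuming min-max normalization within verifier groups and a simple variance-based weight; the paper's exact normalization and weighting scheme may differ.

```python
import numpy as np

def hero_rewards(verifier: np.ndarray, rm_scores: np.ndarray,
                 alpha: float = 0.2) -> np.ndarray:
    """Sketch of HERO-style hybrid rewards for one prompt's rollouts.
    `verifier` holds 0/1 correctness, `rm_scores` dense reward-model
    scores. RM scores are min-max normalized *within* each verifier
    group and squeezed into a band of width `alpha`, so dense quality
    distinctions refine, but never overturn, correctness."""
    reward = verifier.astype(float)
    for group in (0, 1):
        mask = verifier == group
        if mask.sum() >= 2:
            s = rm_scores[mask]
            span = s.max() - s.min()
            norm = (s - s.min()) / span if span > 0 else np.zeros_like(s)
            reward[mask] += alpha * (norm - 0.5)
    # Variance-aware weighting: prompts whose rollouts disagree more get
    # more weight, since dense signals matter most on hard prompts.
    weight = 1.0 + verifier.std()
    return weight * reward

v = np.array([1, 1, 0, 0])            # verifier correctness per rollout
rm = np.array([0.9, 0.6, 0.4, 0.1])   # dense reward-model scores
print(hero_rewards(v, rm))            # correct answers stay ranked on top
```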

[NLP-10] Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Quick read: This paper addresses the lack of online adaptation to target-specific vulnerabilities in automated red-teaming, which audits large language models (LLMs) before deployment. The key idea of Red-Bandit is to post-train a set of parameter-efficient LoRA experts, each specialized for an attack style (e.g., manipulation, slang), via reinforcement learning that rewards generating unsafe prompts under a rule-based safety model; at inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. This efficiently uncovers and exploits failure modes, while the bandit policy doubles as an interpretable diagnostic of model-specific vulnerabilities.

Link: https://arxiv.org/abs/2510.07239
Authors: Christos Ziakas,Nicholas Loo,Nishita Jain,Alessandra Russo
Affiliations: Imperial College London
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model’s response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit’s bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
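
The inference-time selection can be sketched as a classic UCB1 bandit over attack-style experts, with stand-ins for the LoRA experts and the rule-based safety model; the style list and the toy target's vulnerability profile below are hypothetical.

```python
import math
import random

STYLES = ["manipulation", "slang", "roleplay"]   # hypothetical expert set

def ucb_select(counts, sums, c=1.0):
    """UCB1: balance exploring attack-style experts with exploiting the
    style that most often elicits unsafe responses."""
    total = sum(counts)
    scores = [float("inf") if n == 0
              else s / n + c * math.sqrt(math.log(total) / n)
              for n, s in zip(counts, sums)]
    return scores.index(max(scores))

def red_team_loop(attack_with, is_unsafe, steps=30):
    """Sketch of a Red-Bandit-style loop: each step, the bandit picks an
    expert (one per attack style), the expert generates a prompt, and
    the safety label of the target's response is the reward.
    `attack_with` and `is_unsafe` stand in for the LoRA experts and the
    rule-based safety model."""
    counts, sums = [0] * len(STYLES), [0.0] * len(STYLES)
    for _ in range(steps):
        arm = ucb_select(counts, sums)
        response = attack_with(STYLES[arm])
        counts[arm] += 1
        sums[arm] += 1.0 if is_unsafe(response) else 0.0
    return counts   # skewed counts diagnose the most effective styles

random.seed(0)
vuln = {"manipulation": 0.5, "slang": 0.1, "roleplay": 0.2}  # toy target
print(red_team_loop(lambda s: s, lambda s: random.random() < vuln[s]))
```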

[NLP-11] When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Quick read: This paper addresses the unreliability of LLM factuality evaluation caused by the static nature of widely used benchmarks, which have fallen out of step with a changing world and with modern LLMs. The key contribution is an up-to-date fact-retrieval pipeline together with three tailored metrics that systematically quantify benchmark aging and its impact on LLM factuality evaluation, providing an actionable testbed for assessing whether a benchmark remains reliable.

Link: https://arxiv.org/abs/2510.07238
Authors: Xunyi Jiang,Dingyi Chang,Julian McAuley,Xin Xu
Affiliations: University of California, San Diego
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While substantial works continue to rely on the popular but old benchmarks, their temporal misalignment with real-world facts and modern LLMs, and their effects on LLM factuality evaluation remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis illustrate that a considerable portion of samples in the widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed to assess the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Codes are available in this https URL.

[NLP-12] LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding

Quick read: This paper addresses two failures of conventional retrieval-augmented generation (RAG) on visually rich documents (VRDs): encoding content as isolated chunks loses layout structure and cross-page dependencies, and retrieving a fixed number of pages at inference ignores the demands of the specific question, leading to incomplete evidence and degraded answers on multi-page reasoning. The key idea of LAD-RAG (Layout-Aware Dynamic RAG) is to build, at ingestion, a symbolic document graph capturing layout structure and cross-page dependencies alongside standard neural embeddings, and, at inference, to let an LLM agent dynamically interact with the neural and symbolic indices to adaptively retrieve the evidence each query needs. This yields over 90% perfect recall on average without any top-k tuning and higher QA accuracy on multi-page reasoning tasks with minimal latency.

Link: https://arxiv.org/abs/2510.07233
Authors: Zhivar Sourati,Zheng Wang,Marianne Menglin Liu,Yazhe Hu,Mengqing Guo,Sujeeth Bharadwaj,Kyu Han,Tao Sheng,Sujith Ravi,Morteza Dehghani,Dan Roth
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents’ structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.
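
A minimal sketch of the dynamic-retrieval idea: start from seed pages found by dense search, then walk a symbolic document graph while an agent judges relevance, instead of returning a fixed top-k. The dict-based graph and the `relevant` callable are simplifications of the paper's symbolic index and LLM agent.

```python
def dynamic_retrieve(query, seed_pages, graph, relevant, max_pages=10):
    """Sketch of LAD-RAG-style dynamic retrieval: expand along layout
    and cross-page links while neighbors still look relevant, rather
    than stopping at a fixed page budget chosen in advance."""
    selected, frontier, seen = [], list(seed_pages), set()
    while frontier and len(selected) < max_pages:
        page = frontier.pop(0)
        if page in seen:
            continue
        seen.add(page)
        if relevant(query, page):
            selected.append(page)
            frontier.extend(graph.get(page, []))  # follow cross-page links
    return selected

# Toy 5-page document: page 2's table continues on page 3.
graph = {1: [2], 2: [3], 3: [4], 4: [5]}
evidence = {2, 3}                        # pages that answer the query
print(dynamic_retrieve("q", seed_pages=[2], graph=graph,
                       relevant=lambda q, p: p in evidence))  # [2, 3]
```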

[NLP-13] Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Quick read: This paper addresses the gap between current large language models' (LLMs) causal-reasoning ability and genuine understanding of cause and effect beyond pattern matching, noting that existing benchmarks rely on synthetic data and narrow domains. The key contribution is a benchmark built from causally identified relationships extracted from top-tier economics and finance journals, using rigorous identification strategies such as instrumental variables, difference-in-differences, and regression discontinuity designs; it comprises 40,379 evaluation items across five task types spanning health, environment, technology, law, and culture. Results on eight state-of-the-art LLMs reveal substantial limitations: the best model reaches only 57.6% accuracy, scale does not reliably translate into better performance, and even advanced reasoning models struggle with basic causal identification, underscoring the gap for high-stakes applications.

Link: https://arxiv.org/abs/2510.07231
Authors: Donggyu Lee,Sungwon Park,Yerin Hwang,Hyunwoo Oh,Hyoshin Kim,Jungwon Kim,Meeyoung Cha,Sangyoon Park,Jihee Kim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and demands of reliable causal reasoning in high-stakes applications.

[NLP-14] Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Quick read: This paper addresses the lack of personalization when large language models (LLMs) simulate user behavior: prompting, supervised fine-tuning, and reinforcement learning mostly learn population-level policies that are not conditioned on a user's persona, yielding generic rather than individualized simulations. The key idea of Customer-R1, an RL-based method for personalized step-wise user behavior simulation in online shopping, is to condition the policy explicitly on a persona and optimize next-step rationale and action generation with action-correctness reward signals, producing simulations that better match real users' action distributions.

Link: https://arxiv.org/abs/2510.07230
Authors: Ziyi Wang,Yuxuan Lu,Yimeng Zhang,Jing Huang,Dakuo Wang
Affiliations: Northeastern University; Michigan State University; Amazon
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user’s persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users’ action distribution, indicating higher fidelity in personalized behavior simulation.

[NLP-15] Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation

Quick read: This paper tackles the efficiency and performance limits of pretraining small language models (SLMs), which must compete with large language models (LLMs) under far tighter compute and data budgets. The proposed framework has three key components: structurally sparse sub-network initializations that consistently beat random initializations of similar size under the same compute budget; evolutionary search that automatically discovers high-quality sub-network initializations as better starting points for pretraining; and knowledge distillation from larger teacher models to speed up training and improve generalization. Together these make SLM pretraining substantially more efficient: the best model matches the validation perplexity of a comparable Pythia SLM while requiring 9.2x fewer pretraining tokens.

Link: https://arxiv.org/abs/2510.07227
Authors: Arjun Krishnakumar,Rhea Sanjay Sukthanker,Hannan Javed Mahadik,Gabriela Kadlecová,Vladyslav Moroshan,Timur Carstensen,Frank Hutter,Aaron Klein
Affiliations: University of Freiburg, Germany; ELLIS Institute Tübingen, Germany; Charles University, Faculty of Mathematics and Physics; PriorLabs; The Czech Academy of Sciences, Institute of Computer Science
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Small Language models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse sub-network initializations that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use evolutionary search to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply knowledge distillation from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered using evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring 9.2x fewer pretraining tokens. We release all code and models at this https URL, offering a practical and reproducible path toward cost-efficient small language model development at scale.
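
A toy sketch of evolutionary search over sparse sub-network initializations, with binary masks as individuals and a placeholder fitness function; in practice, fitness would score a short pretraining run of the masked model, and the mutation and selection operators here are illustrative assumptions.

```python
import random

def evolve_subnetwork_mask(fitness, n_units=64, pop_size=8,
                           generations=10, keep=0.25, seed=0):
    """Sketch of evolutionary search over structurally sparse sub-network
    initializations: individuals are binary masks over units, the fittest
    fraction survives each generation, and children are bit-flip-mutated
    copies of survivors."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_units)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: max(1, int(keep * pop_size))]
        children = []
        while len(survivors) + len(children) < pop_size:
            child = list(rng.choice(survivors))
            i = rng.randrange(n_units)
            child[i] = not child[i]          # single bit-flip mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: prefer masks keeping ~half the units (purely illustrative).
best = evolve_subnetwork_mask(lambda m: -abs(sum(m) - 32))
print(sum(best))  # close to 32
```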

[NLP-16] Machines in the Crowd? Measuring the Footprint of Machine-Generated Text on Reddit

Quick read: This paper asks how machine-generated text (MGT), now cheaply produced at scale by generative AI, integrates into the social environment of Reddit: its distribution, the social signals it expresses, and the engagement it receives. The key approach is a large-scale analysis using a state-of-the-art statistical MGT detector over two years of activity (2022-2024) across 51 representative subreddits spanning community types such as information seeking, social support, and discussion. The study finds MGT marginally present overall but peaking near 9% in some communities in some months, unevenly distributed, more prevalent in technical-knowledge and social-support subreddits, and concentrated among a small fraction of users; it carries distinct social signals of warmth and status giving typical of AI assistants, yet achieves engagement comparable to, and occasionally above, human-authored content, suggesting AI-generated text is becoming an organic part of online social discourse.

Link: https://arxiv.org/abs/2510.07226
Authors: Lucio La Cava,Luca Maria Aiello,Andrea Tagarelli
Affiliations: DIMES Dept., University of Calabria; IT University of Copenhagen
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
Comments:

Abstract:Generative Artificial Intelligence is reshaping online communication by enabling large-scale production of Machine-Generated Text (MGT) at low cost. While its presence is rapidly growing across the Web, little is known about how MGT integrates into social media environments. In this paper, we present the first large-scale characterization of MGT on Reddit. Using a state-of-the-art statistical method for detection of MGT, we analyze over two years of activity (2022-2024) across 51 subreddits representative of Reddit’s main community types such as information seeking, social support, and discussion. We study the concentration of MGT across communities and over time, and compared MGT to human-authored text in terms of social signals it expresses and engagement it receives. Our very conservative estimate of MGT prevalence indicates that synthetic text is marginally present on Reddit, but it can reach peaks of up to 9% in some communities in some months. MGT is unevenly distributed across communities, more prevalent in subreddits focused on technical knowledge and social support, and often concentrated in the activity of a small fraction of users. MGT also conveys distinct social signals of warmth and status giving typical of language of AI assistants. Despite these stylistic differences, MGT achieves engagement levels comparable to human-authored content and in a few cases even higher, suggesting that AI-generated text is becoming an organic component of online social discourse. This work offers the first perspective on the MGT footprint on Reddit, paving the way for new investigations involving platform governance, detection strategies, and community dynamics.

[NLP-17] How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

Quick read: This paper addresses two core questions for building automatic speech recognition (ASR) for low-resource African languages: the minimum data volume needed for viable performance, and the dominant failure modes in production systems. Through systematic experiments with OpenAI's Whisper on two Bantu languages, Kinyarwanda (data scaling from 1 to 1,400 hours) and Kikuyu (error characterization on 270 hours), it shows that practical performance (WER below 13%) is achievable with as little as 50 hours of training data, improving further through 200 hours (WER below 10%). Error analysis also reveals that noisy ground-truth transcriptions account for 38.6% of high-error cases, indicating that data quality matters as much as data volume and yielding actionable benchmarks and deployment guidance for similar low-resource settings.

Link: https://arxiv.org/abs/2510.07221
Authors: Benjamin Akera,Evelyn Nafula,Patrick Walukagga,Gilbert Yiga,John Quinn,Ernest Mwebaze
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI’s Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper’s performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release accompanying data and models; see this https URL

[NLP-18] Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Quick read: This paper investigates how large language models internally handle cross-lingual transfer and how to control it efficiently, without parallel corpora or retraining. Its key finding is that the transition from source to target language is governed by a small, sparse, and stable set of dimensions occurring at consistent indices across the intermediate-to-final layers. Building on this, it introduces a simple training-free method to identify and manipulate these dimensions using as few as 50 parallel or monolingual sentences; interventions on them switch the output language while preserving semantic content, surpassing prior neuron-level approaches at substantially lower cost and with better interpretability.

Link: https://arxiv.org/abs/2510.07213
Authors: Chengzhi Zhong,Fei Cheng,Qianying Liu,Yugo Murawaki,Chenhui Chu,Sadao Kurohashi
Affiliations: Kyoto University; National Institute of Informatics
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress. Our code will be available at: this https URL

Abstract:Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.
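
A numpy sketch of the two steps, under the simplifying assumptions that mean-activation gaps over ~50 sentences per language suffice to locate the language dimensions and that patching them with target-language means performs the switch; the paper's actual identification and intervention procedure may differ.

```python
import numpy as np

def find_language_dims(acts_a, acts_b, k=4):
    """Rank dimensions by the gap between mean hidden activations on two
    languages (arrays of shape [n_sentences, d]) and return the top-k
    indices; only these sparse dims are later manipulated."""
    gap = np.abs(acts_a.mean(axis=0) - acts_b.mean(axis=0))
    return np.argsort(gap)[-k:]

def switch_language(hidden, dims, target_mean):
    """Training-free intervention: overwrite only the identified
    dimensions with the target language's mean values."""
    patched = hidden.copy()
    patched[dims] = target_mean[dims]
    return patched

rng = np.random.default_rng(0)
d = 16
en = rng.normal(size=(50, d))
ja = rng.normal(size=(50, d))
ja[:, [3, 7]] += 4.0                      # toy "target language" dims
dims = find_language_dims(en, ja)
print(3 in dims and 7 in dims)            # True: planted dims recovered
print(switch_language(en[0], dims, ja.mean(axis=0))[3])  # patched value
```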

[NLP-19] Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

Quick read: This paper addresses the mismatch between Africa's linguistic diversity and language-technology coverage: current LLMs prioritize the languages with the most speakers (e.g., Swahili or Yoruba), leaving support for the continent's many smaller languages piecemeal and inconsistent. The key idea is a regionally focused strategy, demonstrated on Uganda, a country of high linguistic diversity: Sunflower 14B and 32B, built on Qwen 3, achieve state-of-the-art comprehension in the majority of Ugandan languages and are released as open source to help reduce language barriers in practical applications.

Link: https://arxiv.org/abs/2510.07203
Authors: Benjamin Akera,Evelyn Nafula Ouma,Gilbert Yiga,Patrick Walukagga,Phionah Natukunda,Trevor Saaka,Solomon Nsumba,Lilian Teddy Nabukeera,Joel Muhanguzi,Imran Sekalala,Nimpamya Janat Namara,Engineer Bainomugisha,Ernest Mwebaze,John Quinn
Affiliations: Sunbird AI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state of the art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.

[NLP-20] Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible

Quick read: This paper asks whether large language models (LLMs) are sensitive to the distinction between humanly possible and humanly impossible languages, a question bearing on whether LLMs and humans share the same innate learning biases. Using the established methodology of comparing LLM learning curves on natural-language datasets and "impossible" counterparts derived via perturbation functions, but over a wider set of languages and perturbations, and adding a more lenient criterion of whether GPT-2 separates the set of natural languages from the set of impossible ones at all (via cross-linguistic variance in metrics computed on perplexity curves), the study finds that GPT-2 mostly learns each language and its impossible counterpart equally easily and provides no systematic separation, indicating that LLMs do not share the human innate biases that shape linguistic typology.

Link: https://arxiv.org/abs/2510.07178
Authors: Imry Ziv,Nur Lan,Emmanuel Chemla,Roni Katzir
Affiliations: Tel Aviv University; CNRS, ENS, EHESS, PSL University; Earth Species Project
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 4 figures

Abstract:Are large language models (LLMs) sensitive to the distinction between humanly possible languages and humanly impossible languages? This question is taken by many to bear on whether LLMs and humans share the same innate learning biases. Previous work has attempted to answer it in the positive by comparing LLM learning curves on existing language datasets and on “impossible” datasets derived from them via various perturbation functions. Using the same methodology, we examine this claim on a wider set of languages and impossible perturbations. We find that in most cases, GPT-2 learns each language and its impossible counterpart equally easily, in contrast to previous claims. We also apply a more lenient condition by testing whether GPT-2 provides any kind of separation between the whole set of natural languages and the whole set of impossible languages. By considering cross-linguistic variance in various metrics computed on the perplexity curves, we show that GPT-2 provides no systematic separation between the possible and the impossible. Taken together, these perspectives show that LLMs do not share the human innate biases that shape linguistic typology.

[NLP-21] CARPAS: Towards Content-Aware Refinement of Provided Aspects for Summarization in Large Language Models

Quick read: This paper addresses a practical failure of aspect-based summarization: the predefined aspects supplied as input may be incomplete, irrelevant, or entirely missing from the document, and systems should adapt them to the actual content so summaries match user needs. It introduces this new task setting, Content-Aware Refinement of Provided Aspects for Summarization (CARPAS), whose key solution is a preliminary subtask that predicts the number of relevant aspects and uses the prediction as a constraint to guide LLMs toward the most pertinent aspects, reducing inference difficulty. Experiments on three newly constructed datasets show the approach significantly improves performance across the board, and deeper analyses reveal how LLMs comply when the requested number of aspects differs from their own estimate, an important insight for real-world deployment.

Link: https://arxiv.org/abs/2510.07177
Authors: Yong-En Tian,Yu-Chien Tang,An-Zi Yen,Wen-Chih Peng
Affiliations: National Yang Ming Chiao Tung University
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, 17 figures

Abstract:Aspect-based summarization has attracted significant attention for its ability to generate more fine-grained and user-aligned summaries. While most existing approaches assume a set of predefined aspects as input, real-world scenarios often present challenges where these given aspects may be incomplete, irrelevant, or entirely missing from the document. Users frequently expect systems to adaptively refine or filter the provided aspects based on the actual content. In this paper, we initiate this novel task setting, termed Content-Aware Refinement of Provided Aspects for Summarization (CARPAS), with the aim of dynamically adjusting the provided aspects based on the document context before summarizing. We construct three new datasets to facilitate our pilot experiments, and by using LLMs with four representative prompting strategies in this task, we find that LLMs tend to predict an overly comprehensive set of aspects, which often results in excessively long and misaligned summaries. Building on this observation, we propose a preliminary subtask to predict the number of relevant aspects, and demonstrate that the predicted number can serve as effective guidance for the LLMs, reducing the inference difficulty, and enabling them to focus on the most pertinent aspects. Our extensive experiments show that the proposed approach significantly improves performance across all datasets. Moreover, our deeper analyses uncover LLMs’ compliance when the requested number of aspects differs from their own estimations, establishing a crucial insight for the deployment of LLMs in similar real-world applications.

[NLP-22] Quantifying Data Contamination in Psychometric Evaluations of LLMs

Quick read: This paper addresses data contamination in psychometric evaluations of large language models (LLMs): psychometric inventories may appear in training data, and their memorization can bias assessments, yet the extent of this contamination had never been quantified systematically. The key contribution is a framework that measures contamination along three dimensions: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying it to 21 models from major families and four widely used inventories shows that popular instruments such as the Big Five Inventory (BFI-44) and the Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, with models not only memorizing items but also adjusting their responses to hit specific target scores, exposing limits of current psychometric evaluation of LLMs.

Link: https://arxiv.org/abs/2510.07175
Authors: Jongwook Han,Woojung Song,Jonggeun Lee,Yohan Jo
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 1 figure

Abstract:Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.

[NLP-23] NurseLLM: The First Specialized Language Model for Nursing EMNLP2025

Quick read: This paper addresses the underexplored application of large language models (LLMs) to the nursing domain, where existing general-purpose and medical LLMs perform only modestly on nursing multiple-choice question (MCQ) tasks and no nursing-specialized model exists. The key solution is NurseLLM, the first nursing-specialized LLM, trained on the first large-scale nursing MCQ dataset built via a multi-stage data generation pipeline covering a broad spectrum of nursing topics, together with multiple nursing benchmarks for rigorous evaluation. Experiments show NurseLLM outperforms comparably sized state-of-the-art general and medical LLMs, underscoring the value of domain specialization, and the paper further explores reasoning and multi-agent collaboration for nursing.

Link: https://arxiv.org/abs/2510.07173
Authors: Md Tawkat Islam Khondaker,Julia Harrington,Shady Shehata
Affiliations: Yourika Labs; The University of British Columbia; Western University; University of Waterloo
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: EMNLP 2025 Industry Track

Abstract:Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple choice question-answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.

[NLP-24] More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning EMNLP2025

Quick read: This paper addresses the unclear practical utility of existing data construction methods for mathematical reasoning in large language models (LLMs), whose performance depends heavily on training-data quality in real industrial pipelines. The key approach is a comprehensive evaluation of open-source datasets and data synthesis techniques under a unified pipeline mirroring training and deployment scenarios, from which effective data selection strategies and industrially practical methods are distilled. The findings show that structuring data in more interpretable formats, or distilling from stronger models, often outweighs simply scaling up data volume, offering actionable guidance on balancing "more data" versus "better data" for real-world reasoning tasks.

Link: https://arxiv.org/abs/2510.07169
Authors: Yike Zhao,Simin Guo,Ziqing Yang,Shifan Han,Dahua Lin,Fei Tan
Affiliations: East China Normal University; University of Chicago; Independent Researcher; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, submitted to EMNLP 2025 Industry Track

Abstract:The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats, or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance “more data” versus “better data” for real-world reasoning tasks.

[NLP-25] Reasoning for Hierarchical Text Classification: The Case of Patents

Quick read: This paper addresses key challenges in hierarchical text classification (HTC), taking automated patent subject classification as one of its hardest cases due to domain-knowledge difficulty and a huge label space; prior approaches output only a flat label set, offering little insight into the reasoning behind predictions. The proposed Reasoning for Hierarchical Classification (RHC) reformulates HTC as a step-by-step reasoning task that sequentially deduces hierarchical labels, training LLMs in two stages: a cold-start stage aligning outputs to chain-of-thought (CoT) format, and a reinforcement learning (RL) stage strengthening multi-step reasoning. RHC improves accuracy and macro F1 by about 3% over supervised fine-tuning, produces natural-language justifications, scales favorably with model size, and achieves state-of-the-art results on other widely used HTC benchmarks beyond patents.

Link: https://arxiv.org/abs/2510.07167
Authors: Lekang Jiang,Wenjun Sun,Stephan Goetz
Affiliations: University of Cambridge; National Science Library, Chinese Academy of Sciences; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 10 tables, 3 figures

Abstract:Hierarchical text classification (HTC) assigns documents to multiple levels of a pre-defined taxonomy. Automated patent subject classification represents one of the hardest HTC scenarios because of domain knowledge difficulty and a huge number of labels. Prior approaches only output a flat label set, which offers little insight into the reason behind predictions. Therefore, we propose Reasoning for Hierarchical Classification (RHC), a novel framework that reformulates HTC as a step-by-step reasoning task to sequentially deduce hierarchical labels. RHC trains large language models (LLMs) in two stages: a cold-start stage that aligns outputs with chain-of-thought (CoT) reasoning format and a reinforcement learning (RL) stage to enhance multi-step reasoning ability. RHC demonstrates four advantages in our experiments. (1) Effectiveness: RHC surpasses previous baselines and outperforms the supervised fine-tuning counterparts by approximately 3% in accuracy and macro F1. (2) Explainability: RHC produces natural-language justifications before prediction to facilitate human inspection. (3) Scalability: RHC scales favorably with model size with larger gains compared to standard fine-tuning. (4) Applicability: Beyond patents, we further demonstrate that RHC achieves state-of-the-art performance on other widely used HTC benchmarks, which highlights its broad applicability.

[NLP-26] A Multi-Agent Framework for Stateful Inference-Time Search

Quick read: This paper addresses the bottleneck of stateless inference on multi-step reasoning tasks, where the absence of persistent state makes agents brittle, and task-specific fine-tuning or instruction-tuning yields surface-level generation that fails on tasks requiring deeper reasoning and long-horizon dependencies. The key solution is a training-free stateful multi-agent evolutionary search framework combining (i) persistent inference-time state so information carries across steps, (ii) adversarial mutation to explore diverse candidates, and (iii) evolutionary preservation to maintain population diversity and exploration. Demonstrated on automated unit-test generation, it discovers robust, high-coverage edge cases and substantially outperforms stateless single-step baselines on benchmarks such as HumanEval and TestGenEvalMini across the Llama, Gemma, and GPT model families.

Link: https://arxiv.org/abs/2510.07147
Authors: Arshika Lalan,Rajat Ghosh,Aditya Kolsur,Debojyoti Dutta
Affiliations: Carnegie Mellon University; Nutanix
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:

Abstract:Recent work explores agentic inference-time techniques to perform structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks due to the absence of persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieve surface-level code generation but remain brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families - Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.

[NLP-27] Comparing human and language models sentence processing difficulties on complex structures

Quick read: This paper asks whether large language models (LLMs) experience human-like processing difficulties in sentence comprehension, focusing on seven challenging syntactic structures. The key approach is a unified experimental framework comparing humans with five families of state-of-the-art LLMs of varying size and training procedure on the same materials, contrasting target sentences containing the difficult structures with matched baselines without them. Results show LLMs struggle especially on garden-path (GP) sentences: the strongest models approach human-level accuracy on non-GP structures (93.7% for GPT-5) but fall to 46.8% on GP structures, and the human-model rank correlation over structure difficulty grows with parameter count, revealing both convergence and divergence between human and LLM sentence processing.

Link: https://arxiv.org/abs/2510.07141
Authors: Samuel Joseph Amouyal,Aya Meltzer-Asscher,Jonathan Berant
Affiliations: Blavatnik School of Computer Science, Tel Aviv University, Israel; Department of Linguistics, Tel Aviv University, Israel; Sagol School of Neuroscience, Tel Aviv University, Israel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Data and code will be released soon

Abstract:Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.
zh

[NLP-28] TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

【速读】: 该论文旨在解决大规模指令微调数据集在实际应用中存在计算成本高、效率低的问题,尤其针对传统方法依赖梯度等粗粒度样本级信号导致的资源消耗大且忽视细粒度结构特征的局限性。其解决方案的关键在于提出一种前向传播的、以token为中心的框架TRIM(Token Relevance via Interpretable Multi-layer Attention),通过匹配少量目标样本中由多层注意力机制提取的“指纹”模式来识别关键token,从而构建高质量的coreset(核心子集)。该方法避免了昂贵的反向传播过程,在显著降低计算开销的同时,更敏感地捕捉任务特异性结构信息,实验证明所选coreset在下游任务上性能优于现有最优基线,甚至超越全量数据微调效果。

链接: https://arxiv.org/abs/2510.07118
作者: Manish Nagaraj,Sakshi Choudhary,Utkarsh Saxena,Deepak Ravikumar,Kaushik Roy
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based “fingerprints” from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
zh
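
TRIM 的指纹构造细节以论文为准;下面用 NumPy 给出一个假设性的简化示意(指纹取多层注意力下各 token 平均接收权重的直方图,为本文示意所设),演示“仅前向、按注意力显著性指纹匹配目标样本”的选样流程:

```python
import numpy as np

# 概念示意(非官方实现):用注意力导出的显著性向量作样本“指纹”,
# 以与少量目标样本指纹的余弦相似度为候选样本打分、选出 coreset。

def fingerprint(attn, bins=16):
    # attn: [层数, 序列长, 序列长] 的注意力矩阵(已按头平均)
    saliency = attn.mean(axis=(0, 1))            # 每个 token 被关注的平均强度
    hist, _ = np.histogram(saliency, bins=bins,
                           range=(0.0, saliency.max() + 1e-8))
    v = hist.astype(np.float64)
    return v / (np.linalg.norm(v) + 1e-8)        # 归一化,便于余弦相似度

def select_coreset(candidate_attns, target_attns, k):
    targets = np.stack([fingerprint(a) for a in target_attns])
    scores = []
    for a in candidate_attns:
        f = fingerprint(a)
        scores.append(float((targets @ f).max()))  # 与最相近目标指纹的相似度
    return np.argsort(scores)[::-1][:k]            # 相似度最高的 k 个样本

# 用随机注意力张量演示接口(真实场景中注意力来自一次前向传播,无需反向)
rng = np.random.default_rng(0)
cands = [rng.random((4, 12, 12)) for _ in range(100)]
tgts = [rng.random((4, 12, 12)) for _ in range(5)]
print(select_coreset(cands, tgts, k=10))
```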

[NLP-29] Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning EMNLP2025

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)任务中普遍存在的主观性、模糊性以及标注者之间合法分歧的问题,即如何建模人类标注变异性。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的上下文学习(in-context learning)能力,并结合两阶段元学习训练策略:首先在多个需上下文学习的数据集上进行后训练(post-training),其次通过上下文元学习(in-context meta-learning)对特定数据分布进行专业化微调。实验表明,将评分者示例纳入上下文是系统性能的关键因素,且在较大数据集上进行数据集特异性微调、在其他上下文学习数据集上进行后训练,以及模型规模扩大均能显著提升性能。

链接: https://arxiv.org/abs/2510.07105
作者: Taylor Sorensen,Yejin Choi
机构: University of Washington (华盛顿大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NLPerspectives: The 4th Workshop on Perspectivist Approaches to Natural Language Processing at EMNLP 2025

点击查看摘要

Abstract:Many natural language processing (NLP) tasks involve subjectivity, ambiguity, or legitimate disagreement between annotators. In this paper, we outline our system for modeling human variation. Our system leverages language models’ (LLMs) in-context learning abilities, along with a two-step meta-learning training procedure for 1) post-training on many datasets requiring in-context learning and 2) specializing the model via in-context meta-learning to the particular data distribution of interest. We also evaluate the performance of our system submission to the Learning With Disagreements (LeWiDi) competition, where it was the overall winner on both tasks. Additionally, we perform an ablation study to measure the importance of each system component. We find that including rater examples in-context is crucial for our system’s performance, dataset-specific fine-tuning is helpful on the larger datasets, post-training on other in-context datasets is helpful on one of the competition datasets, and that performance improves with model scale.
zh

[NLP-30] TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription

【速读】: 该论文旨在解决Table Visual Question Answering (Table VQA)任务中大型视觉语言模型(VLM)计算成本高、难以部署于移动端,而轻量级方案因结构化表示不兼容大语言模型(LLM)导致推理误差大的问题。其解决方案的关键在于提出TALENT框架,通过双模态表征增强:让小型VLM同时输出OCR文本和自然语言叙述(narration),并将二者与问题一同输入LLM进行联合推理,从而将Table VQA重构为以LLM为核心的多模态推理任务,使VLM仅承担感知与叙事功能,显著降低计算开销并提升准确性。

链接: https://arxiv.org/abs/2510.07098
作者: Guo Yutong,Wanying Wang,Yue Wu,Zichen Miao,Haoyu Wang
机构: Johns Hopkins University (约翰霍普金斯大学); Purdue University (普渡大学); University at Albany (阿尔巴尼大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.
zh

[NLP-31] Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

【速读】: 该论文旨在解决生成式语音合成中对讽刺(sarcasm)语用特征建模不足的问题,因其依赖细微的语义、上下文及韵律线索,而现有研究多集中于宽泛的情绪类别。解决方案的关键在于提出一种基于大型语言模型(Large Language Model, LLM)增强的检索增强框架(Retrieval-Augmented framework),通过两个核心模块实现:一是利用LoRA微调的LLaMA 3模型提取语义嵌入以捕捉讽刺中的语用不一致性和话语层面线索;二是借助检索增强生成(Retrieval-Augmented Generation, RAG)模块获取韵律示例,提供讽刺表达的风格参考模式。二者共同作为条件信息注入VITS声码器架构,显著提升讽刺语音的自然度与语境适配性,在客观指标和主观评测中均优于基线方法。

链接: https://arxiv.org/abs/2510.07096
作者: Zhu Li,Yuqing Zhang,Xiyuan Gao,Shekhar Nayak,Matt Coler
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.
zh

[NLP-32] The Cognitive Bandwidth Bottleneck: Shifting Long-Horizon Agent from Planning with Actions to Planning with Schemas

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放世界自主性任务中长期规划能力不足的问题,特别是当环境动作空间呈组合爆炸式增长时,传统基于具体动作的规划方法(Planning with Actions, PwA)难以维持有效性和可扩展性。其核心解决方案是提出一种基于动作模式(Action Schema)的替代表示方式——规划与模式(Planning with Schemas, PwS),通过将抽象动作模板(如“将[OBJ]移动到[OBJ]”)实例化为结构化动作列表,从而压缩动作空间、提升可扩展性,并更贴近人类认知逻辑和环境约束。关键创新在于引入“认知带宽”视角作为理论框架,实证发现存在一个动作表示选择的拐点(inflection point),在ALFWorld(约35个动作)与SciWorld(约500个动作)之间,表明随着动作空间扩大,PwS相比PwA更具优势;进一步实验揭示该拐点位置受模型规划能力与模式实例化质量的影响:更强的规划能力使拐点右移,而更优的模式实例化则使其左移,为构建具备可扩展自主性的PwS代理提供了可操作的优化路径。

链接: https://arxiv.org/abs/2510.07091
作者: Baixuan Xu,Tianshi Zheng,Zhaowei Wang,Hong Ting Tsang,Weiqi Wang,Tianqing Fang,Yangqiu Song
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages

点击查看摘要

Abstract:Enabling LLMs to effectively operate on long-horizon tasks, which require long-term planning and multiple interactions, is essential for open-world autonomy. Conventional methods adopt planning with actions, where an executable action list is provided as reference. However, this action representation choice becomes impractical when the environment action space explodes combinatorially (e.g., the open-ended real world). This naturally leads to a question: As environmental action space scales, what is the optimal action representation for long-horizon agents? In this paper, we systematically study the effectiveness of two different action representations. The first one is conventional planning with actions (PwA), which is predominantly adopted for its effectiveness on existing benchmarks. The other one is planning with schemas (PwS), which instantiates action schemas into action lists (e.g., “move [OBJ] to [OBJ]” - “move apple to desk”) to ensure a concise action space and reliable scalability. This alternative is motivated by its alignment with human cognition and its compliance with environment-imposed action format restrictions. We propose the cognitive bandwidth perspective as a conceptual framework to qualitatively understand the differences between these two action representations and empirically observe a representation-choice inflection point between ALFWorld (~35 actions) and SciWorld (~500 actions), which serves as evidence of the need for scalable representations. We further conduct controlled experiments to study how the location of this inflection point interacts with different model capacities: stronger planning proficiency shifts the inflection rightward, whereas better schema instantiation shifts it leftward. Finally, noting the suboptimal performance of PwS agents, we provide an actionable guide for building more capable PwS agents for better scalable autonomy.
zh
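
模式实例化本身很直接,下面的小例子复现摘要中的 “move [OBJ] to [OBJ]” → “move apple to desk” 过程(真实环境还需按对象类型过滤非法绑定,此处从简):

```python
from itertools import product

# 概念示意:PwS 的“动作模式实例化”。给定含占位符的模式与当前
# 环境中的对象列表,枚举出具体动作;动作空间随对象数而非预枚举
# 动作数增长,因而更易扩展。

def instantiate(schema: str, objects: list[str]) -> list[str]:
    n_slots = schema.count("[OBJ]")
    actions = []
    for combo in product(objects, repeat=n_slots):
        action, remaining = schema, list(combo)
        while "[OBJ]" in action:
            action = action.replace("[OBJ]", remaining.pop(0), 1)
        actions.append(action)
    return actions

schemas = ["move [OBJ] to [OBJ]", "open [OBJ]"]
objects = ["apple", "desk", "drawer"]
for s in schemas:
    print(instantiate(s, objects))
# 输出中包含 "move apple to desk",与摘要中的示例一致
```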

[NLP-33] All Claims Are Equal but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

【速读】: 该论文旨在解决现有大语言模型(Large Language Model, LLM)事实性评估方法对关键信息错误不敏感的问题,即当前评估指标将所有陈述视为同等重要,导致在关键信息缺失或错误时仍给出较高得分,从而误导评估结果。解决方案的关键在于引入VITAL(Value-based Importance-aware Truthfulness Assessment),通过量化每个陈述与查询的相关性和重要性,使评估指标能够更敏感地识别和区分关键信息的错误,从而提升事实性评估的准确性与可靠性。

链接: https://arxiv.org/abs/2510.07083
作者: Miriam Wanner,Leif Azzopardi,Paul Thomas,Soham Dan,Benjamin Van Durme,Nick Craswell
机构: Johns Hopkins University (约翰霍普金斯大学); Microsoft (微软); University of Strathclyde (斯特拉斯克莱德大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivities of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.
zh
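
其核心思想可以用一个极简打分函数说明:把每条陈述的核实结果按与查询的重要性加权平均,使关键信息错误受到更大惩罚。重要性权重在论文中由模型评估得到,此处假设已给定:

```python
# 概念示意(非官方 VITAL 实现):重要性加权的事实性评分。

def weighted_factuality(claims):
    # claims: [(是否属实 bool, 重要性权重 float), ...]
    total = sum(w for _, w in claims)
    if total == 0:
        return 0.0
    return sum(w for ok, w in claims if ok) / total

def unweighted_factuality(claims):
    # 传统做法:所有陈述一视同仁
    return sum(1 for ok, _ in claims) / len(claims)

# 同样是 4 条陈述错 1 条:错在关键信息 vs. 错在边缘细节
key_error = [(False, 0.7), (True, 0.1), (True, 0.1), (True, 0.1)]
minor_error = [(True, 0.7), (True, 0.1), (True, 0.1), (False, 0.1)]
print(unweighted_factuality(key_error), unweighted_factuality(minor_error))  # 均为 0.75
print(weighted_factuality(key_error), weighted_factuality(minor_error))      # 0.3 vs 0.9
```

可以看到,不加权的指标无法区分两种错误,而加权指标对关键信息错误显著更敏感。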

[NLP-34] Accelerating Diffusion LLM Inference via Local Determinism Propagation

【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在实际部署中面临的质量-速度权衡问题,尤其是由保守采样策略(如贪婪解码)导致的冗余迭代和延迟解码(delayed decoding)现象。其解决方案的关键在于提出一种无需训练的自适应并行解码策略——LocalLeap,该策略基于两个核心经验原则:围绕高置信度锚点的局部确定性传播与渐进的空间一致性衰减。通过识别锚点并在有限邻域内执行局部松弛并行解码,LocalLeap 实现了对已确定token的早期承诺,从而显著减少推理步数(降至原始需求的14.2%),同时保持输出质量几乎不变,并带来6.94倍的吞吐量提升。

链接: https://arxiv.org/abs/2510.07081
作者: Fanheng Kong,Jingyuan Zhang,Yahui Liu,Zirui Wu,Yu Tian,Victoria W.,Guorui Zhou
机构: Klear Team, Kuaishou Technology (快手科技)
类目: Computation and Language (cs.CL)
备注: 21 pages, 4 figures. Under review

点击查看摘要

Abstract:Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations–a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94× throughput improvements and reduces decoding steps to just 14.2% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: this https URL.
zh
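
下面以 NumPy 给出一步解码决策的概念示意(阈值与邻域半径为假设超参数,非论文原值):先找高置信“锚点”,再在锚点邻域内用放宽后的阈值并行提交 token,其余位置留待后续迭代:

```python
import numpy as np

# 概念示意(非官方实现):LocalLeap 式的自适应并行提交决策。

def parallel_commit(conf, anchor_th=0.9, local_th=0.7, radius=2):
    conf = np.asarray(conf)
    commit = conf >= anchor_th                      # 锚点:直接提交
    for i in np.flatnonzero(commit):                # 只遍历初始锚点
        lo, hi = max(0, i - radius), min(len(conf), i + radius + 1)
        # 局部确定性传播:锚点邻域内使用更宽松的阈值
        commit[lo:hi] |= conf[lo:hi] >= local_th
    return commit

conf = [0.95, 0.75, 0.40, 0.72, 0.30, 0.92, 0.80, 0.50]
print(parallel_commit(conf))
# 贪婪解码每步只提交 1 个最高置信 token;此处一步可提交多个,
# 从而减少总迭代步数
```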

[NLP-35] LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

【速读】: 该论文旨在解决低资源语言(如卢森堡语)在指令微调(instruction tuning)过程中因缺乏高质量标注数据而导致性能受限的问题。传统方法依赖机器翻译生成指令数据,常引入语义错位和文化偏差。其解决方案的关键在于不使用机器翻译,而是通过利用英语、法语和德语的对齐数据构建跨语言指令微调数据集,从而保留语言与文化的细微差别,提升模型在卢森堡语中的生成能力和跨语言表征对齐效果。

链接: https://arxiv.org/abs/2510.07074
作者: Fred Philippy,Laura Bernardy,Siwen Guo,Jacques Klein,Tegawendé F. Bissyandé
机构: SnT, University of Luxembourg (SnT,卢森堡大学); Zortify Labs, Zortify S.A., Luxembourg (Zortify 实验室,Zortify 公司,卢森堡)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Paper under review; Dataset available at this https URL

点击查看摘要

Abstract:Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish face severe limitations due to the lack of high-quality instruction datasets. Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish, without resorting to machine-generated translations into it. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that cross-lingual instruction tuning not only improves representational alignment across languages but also the model’s generative capabilities in Luxembourgish. This highlights how cross-lingual data curation can avoid the common pitfalls of machine-translated data and directly benefit low-resource language development.
zh

[NLP-36] Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

【速读】: 该论文旨在解决当前自动评估指标(automatic metrics)在印度语种中缺乏有效性验证的问题,这些指标此前主要针对英语等高资源语言开发和验证,导致对超过15亿人口使用的印度语言存在评估盲区。其解决方案的关键在于构建ITEM基准,这是一个大规模、多语言的评估体系,系统性地衡量26种自动指标在六种主要印度语言中与人类判断的一致性,并辅以细粒度标注。通过覆盖一致性、异常值敏感性、语言特异性可靠性、指标间相关性及对抗扰动下的鲁棒性等多个维度的实证分析,论文揭示了大语言模型(LLM)驱动的评估器在段落和系统层面均表现出最强的人类判断一致性,为未来面向印度语言的指标设计与优化提供了关键依据。

链接: https://arxiv.org/abs/2510.07061
作者: Amir Hossein Yari,Kalmit Kulkarni,Ahmad Raza Khan,Fajri Koto
机构: Sharif University of Technology(谢里夫理工大学); Vellore Institute of Technology(韦洛尔理工学院); IIT Kharagpur(印度理工学院克哈格布尔分校); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注: 18 pages, 14 figures

点击查看摘要

Abstract:While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.
zh

[NLP-37] Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations

【速读】: 该论文试图解决的问题是:当地新闻电视台被辛克莱广播集团(Sinclair Broadcast Group)收购后,其新闻报道内容是否发生改变,尤其是本地议题与全国性议题之间的平衡是否被打破,以及是否加剧了政治极化话题的报道频率。解决方案的关键在于采用计算方法对收购前后地方新闻台在互联网上发布的内容进行定量分析,并将其与全国性新闻媒体的内容进行对比,从而识别出报道结构的变化趋势,发现收购后的新闻台更频繁地报道全国性新闻,且对极化议题的关注显著增加。

链接: https://arxiv.org/abs/2510.07060
作者: Miriam Wanner,Sophia Hager,Anjalie Field
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Local news stations are often considered to be reliable sources of non-politicized information, particularly local concerns that residents care about. Because these stations are trusted news sources, viewers are particularly susceptible to the information they report. The Sinclair Broadcast group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We use computational methods to investigate changes in internet content put out by local news stations before and after being acquired by Sinclair and in comparison to national news outlets. We find that there is clear evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases.
zh

[NLP-38] Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在信息检索任务中利用率不足的问题,即尽管LLMs具备强大的自然语言理解能力,但其在生成高质量搜索嵌入(search embeddings)方面尚未得到充分挖掘。解决方案的关键在于提出Search-R3框架,该框架通过将LLMs的推理过程(chain-of-thought)与嵌入生成直接结合,使模型能够基于逐步的语义分析生成更有效的嵌入表示。其核心创新包括三个互补机制:(1) 监督学习阶段提升嵌入质量,(2) 强化学习(Reinforcement Learning, RL)方法同步优化推理与嵌入生成,(3) 设计专用RL环境以高效处理嵌入表示的动态更新,避免每次训练迭代时对整个语料库重新编码。这一整合后训练策略显著提升了复杂知识密集型任务中的检索性能。

链接: https://arxiv.org/abs/2510.07048
作者: Yuntao Gui,James Cheng
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to generate search embeddings as a direct output of their reasoning process. Our approach exploits LLMs’ chain-of-thought capabilities, allowing them to produce more effective embeddings by reasoning step-by-step through complex semantic analyses. We implement this through three complementary mechanisms: (1) a supervised learning stage that enables the model to produce quality embeddings, (2) a reinforcement learning (RL) methodology that optimizes embedding generation alongside reasoning, and (3) a specialized RL environment that efficiently handles evolving embedding representations without requiring complete corpus re-encoding at each training iteration. Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes. This integrated post-training approach represents a substantial advancement in handling complex knowledge-intensive tasks that require both sophisticated reasoning and effective information retrieval. Project page: this https URL
zh

[NLP-39] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

【速读】: 该论文旨在解决多语言自然语言处理(Natural Language Processing, NLP)中代码切换(Code-Switching, CSW)建模的挑战,尤其是在大型语言模型(Large Language Models, LLMs)背景下,现有模型对混合语言输入的处理能力有限、相关数据集稀缺以及评估方法存在偏差等问题。其解决方案的关键在于系统性地梳理和分类近年来面向CSW的LLM研究进展,涵盖架构设计、训练策略与评估方法,并提出一个包含30余个数据集、80多种语言的资源集合,强调构建包容性数据集、公平评估机制及基于语言学原理的模型设计,以推动实现真正具备多语言智能的系统。

链接: https://arxiv.org/abs/2510.07037
作者: Rajvee Sheth,Samridhi Raj Sinha,Mahavir Patil,Himanshu Beniwal,Mayank Singh
机构: IIT Gandhinagar (印度理工学院甘地纳格尔分校); NMIMS Mumbai (印度管理研究所孟买分校); SVNIT Surat (萨尔特文德大学); LINGO Research Group (LINGO 研究组)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at this https URL.
zh

[NLP-40] Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

【速读】: 该论文旨在解决当前对大语言模型(Large Language Models, LLMs)事实性知识(factual knowledge 或 beliefs)的理解不足问题,特别是以往研究多基于有偏样本进行分析,难以全面揭示模型知识的真实特性。其解决方案的关键在于利用 GPTKB v1.5 这一递归提取的、包含 1 亿条信念的数据集,系统性地剖析当前最强前沿 LLM 之一 GPT-4.1 的事实性知识分布与质量。研究表明,模型的事实知识与权威知识库存在显著差异,且准确率低于以往基准测试所反映的水平,同时揭示了不一致性、模糊性和幻觉(hallucination)是核心挑战,为未来提升 LLM 事实准确性提供了关键研究方向。

链接: https://arxiv.org/abs/2510.07024
作者: Shrestha Ghosh,Luca Giordano,Yujia Hu,Tuan-Phong Nguyen,Simon Razniewski
机构: University of Tübingen (图宾根大学); ScaDS.AI Dresden/Leipzig & TU Dresden (德累斯顿工业大学); VNU University of Engineering and Technology (越南国家大学工程与技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs are remarkable artifacts that have revolutionized a range of NLP and AI tasks. A significant contributor is their factual knowledge, which, to date, remains poorly understood, and is usually analyzed from biased samples. In this paper, we take a deep tour into the factual knowledge (or beliefs) of a frontier LLM, based on GPTKB v1.5 (Hu et al., 2025a), a recursively elicited set of 100 million beliefs of one of the strongest currently available frontier LLMs, GPT-4.1. We find that the models’ factual knowledge differs quite significantly from established knowledge bases, and that its accuracy is significantly lower than indicated by previous benchmarks. We also find that inconsistency, ambiguity and hallucinations are major issues, shedding light on future research opportunities concerning factual LLM knowledge.
zh

[NLP-41] Native Hybrid Attention for Efficient Sequence Modeling

【速读】: 该论文旨在解决Transformer模型在长序列建模中面临的计算复杂度高(二次方复杂度)与线性注意力机制在长上下文下召回准确率下降之间的矛盾问题。解决方案的关键在于提出一种原生混合注意力机制(Native Hybrid Attention, NHA),其核心创新是将线性RNN用于维护长期上下文的键值槽,并通过滑动窗口引入短期token,再以单一softmax操作对所有键值进行加权,从而实现无需额外融合参数的逐token和逐头的上下文依赖权重分配;同时,通过控制滑动窗口大小这一单一超参数,可在纯线性与全注意力之间平滑调节,保持层结构统一,兼顾效率与准确性。

链接: https://arxiv.org/abs/2510.07019
作者: Jusen Du,Jiaxi Hu,Tao Zhang,Weigao Sun,Yu Cheng
机构: Tsinghua University (清华大学); Shanghai AI Laboratory; The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical report, 16 pages

点击查看摘要

Abstract:Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra- and inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at this https URL.
zh
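
下面用 NumPy 给出单步注意力的概念示意。其中“RNN 压缩槽”用窗口外 token 的指数衰减加权和作占位(真实实现为 Mamba2/DeltaNet 等线性 RNN 的递归更新),重点展示两类 KV 拼接后只做一次 softmax、无需额外融合参数:

```python
import numpy as np

# 概念示意(非官方实现):NHA 的单次 softmax 融合。
# 长期记忆 = 固定数量的压缩 KV 槽;短期记忆 = 滑动窗口内的原始 KV。

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nha_attention(q, K, V, window=4, n_slots=2, decay=0.9):
    # q: [d];K, V: [T, d]。窗口内 token 保持无损。
    K_win, V_win = K[-window:], V[-window:]
    K_out, V_out = K[:-window], V[:-window]
    slots_k, slots_v = [], []
    for s in range(n_slots):                            # 占位的“RNN 压缩”:
        w = decay ** (np.arange(len(K_out))[::-1] + s)  # 不同衰减率 = 不同槽
        w = w / (w.sum() + 1e-8)
        slots_k.append(w @ K_out)
        slots_v.append(w @ V_out)
    K_all = np.vstack(slots_k + [K_win])                # [n_slots + window, d]
    V_all = np.vstack(slots_v + [V_win])
    attn = softmax(q @ K_all.T / np.sqrt(q.shape[-1]))  # 单一 softmax
    return attn @ V_all

rng = np.random.default_rng(0)
T, d = 16, 8
K, V, q = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=d)
print(nha_attention(q, K, V).shape)  # (8,)
```

滑动窗口大小即摘要所述的唯一层间超参数:窗口取 0 退化为纯线性注意力,取全长则退化为全注意力。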

[NLP-42] Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages EMNLP2025

【速读】: 该论文旨在解决当前开源后训练数据在多语言覆盖、文化适配性以及任务多样性方面存在的不足,这一不足在印度语种(Indic languages)中尤为突出。其解决方案的关键在于提出了一种“人在回路”(human-in-the-loop)的数据生成管道,该管道结合了翻译与合成扩展技术,以生成高质量、多样化且文化敏感的印度语言后训练数据。通过该方法,研究者构建了两个数据集:Pragyaan-IT(22.5K条样本)和Pragyaan-Align(100K条样本),涵盖10种印度语言、13个广泛类别和56个子类别,显著提升了多语言大语言模型(LLMs)在印度语境下的适应性和性能。

链接: https://arxiv.org/abs/2510.07000
作者: Neel Prabhanjan Rachamalla,Aravind Konakalla,Gautam Rajeev,Ashish Kulkarni,Chandra Khatri,Shubham Agarwal
机构: Krutrim AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025

点击查看摘要

Abstract:The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage, cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
zh

[NLP-43] Towards Reliable Retrieval in RAG Systems for Large Legal Datasets EMNLP2025

【速读】: 该论文旨在解决法律领域中检索增强生成(Retrieval-Augmented Generation, RAG)系统因文档级检索不匹配(Document-Level Retrieval Mismatch, DRM)而导致的可靠性问题,即检索器错误地从不相关源文档中提取信息,从而引发生成内容的偏差或幻觉。解决方案的关键在于提出一种名为摘要增强分块(Summary-Augmented Chunking, SAC)的技术:通过为每个文本块注入文档级别的合成摘要,将原本在标准分块过程中丢失的全局上下文信息重新引入,从而显著降低DRM的发生率,并提升细粒度检索的精度与召回率。实验表明,即使采用通用摘要策略,其效果也优于依赖法律专家知识的特定元素聚焦方法,凸显了SAC在实际应用中的有效性与可扩展性。

链接: https://arxiv.org/abs/2510.06999
作者: Markus Reuter,Tobias Lingenberg,Rūta Liepiņa,Francesca Lagioia,Marco Lippi,Giovanni Sartor,Andrea Passerini,Burcu Sayin
机构: Technical University of Darmstadt (达姆施塔特工业大学); University of Florence (佛罗伦萨大学); University of Bologna (博洛尼亚大学); University of Trento (特伦托大学); European University Institute (欧洲大学学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted for the 7th Natural Legal Language Processing Workshop (NLLP 2025), co-located with EMNLP 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
zh
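
SAC 技术本身实现很轻量,下面是一个最小示意(summarize 为假设的占位函数,真实实现由 LLM 生成文档级摘要):

```python
# 概念示意:摘要增强分块(SAC)。为每个文本块前置同一份文档级
# 合成摘要,把标准分块过程中丢失的全局上下文重新注入,
# 以降低“文档级检索不匹配”(DRM)。

def summarize(document: str) -> str:
    # 占位:真实实现应调用 LLM 生成文档级摘要
    return document.split(".")[0] + "."

def summary_augmented_chunks(document: str, chunk_size: int = 200):
    summary = summarize(document)
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    # 每个块都携带文档级摘要,检索时可据此区分结构相似的不同文档
    return [f"[DOC SUMMARY] {summary}\n[CHUNK] {c}" for c in chunks]

doc = ("Judgment of the Court in Case C-123/45 concerning data protection. "
       "The applicant argued that ... " * 20)
for c in summary_augmented_chunks(doc)[:2]:
    print(c[:120], "...")
```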

[NLP-44] RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在AI辅助软件开发场景中安全性不足的问题,尤其是针对其对话式越狱(conversational jailbreaks)的鲁棒性评估缺乏系统性和多样性。解决方案的关键在于提出RedTWIZ框架,该框架融合了三项核心贡献:(1) 系统化评估LLM对话越狱攻击的鲁棒性;(2) 构建多样化的多轮生成式攻击套件,支持组合性、现实性和目标导向的越狱策略;(3) 设计分层攻击规划器(hierarchical attack planner),能够自适应地识别特定LLM的漏洞并序列化触发定制化攻击。这一统一框架实现了从评估、攻击生成到战略规划的闭环,显著提升了对LLM安全弱点的探测能力。

链接: https://arxiv.org/abs/2510.06994
作者: Artur Horal,Daniel Pina,Henrique Paz,Iago Paulo,João Soares,Rafael Ferreira,Diogo Tavares,Diogo Glória-Silva,João Magalhães,David Semedo
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework, to audit the robustness of Large Language Models (LLMs) in AI-assisted software development. Our work is driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic and goal-oriented jailbreak conversational strategies; and (3) a hierarchical attack planner, which adaptively plans, serializes, and triggers attacks tailored to specific LLM’s vulnerabilities. Together, these contributions form a unified framework – combining assessment, attack generation, and strategic planning – to comprehensively evaluate and expose weaknesses in LLMs’ robustness. Extensive evaluation is conducted to systematically assess and analyze the performance of the overall system and each component. Experimental results demonstrate that our multi-turn adversarial attack strategies can successfully lead state-of-the-art LLMs to produce unsafe generations, highlighting the pressing need for more research into enhancing LLM’s robustness.
zh

[NLP-45] VelLMes: A high-interaction AI-based deception framework

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的欺骗系统(deception systems)稀缺且功能单一的问题,特别是现有系统多局限于模拟SSH shell服务,缺乏对多种协议和服务的支持,同时缺少面向真实人类攻击者的全面评估。解决方案的关键在于提出一种名为VelLMes的AI驱动欺骗框架,该框架能够模拟包括SSH Linux shell、MySQL、POP3和HTTP在内的多种协议与服务,并以交互性强、逼真度高的方式部署为蜜罐(honeypots),从而增强欺骗策略的多样性与实用性。通过单位测试验证LLM生成能力,并结合89名真实人类攻击者参与的实验及10个公网部署实例的实际流量分析,证明了VelLMes在生成真实响应和诱导人类攻击者误判方面具有显著效果,其核心优势在于利用精心设计的提示(prompting)实现高保真度的交互式欺骗行为。

链接: https://arxiv.org/abs/2510.06975
作者: Muris Sladić(1),Veronica Valeros(1),Carlos Catania(2),Sebastian Garcia(1) ((1) Czech Technical University in Prague, (2) CONICET, UNCuyo)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages. 9 figures. 1 table. This is a preprint of a paper that was presented at the Active Defense and Deception Workshop colocated with IEEE EuroSP 2025 conference

点击查看摘要

Abstract:There are very few SotA deception systems based on Large Language Models. The existing ones are limited only to simulating one type of service, mainly SSH shells. These systems - but also the deception technologies not based on LLMs - lack an extensive evaluation that includes human attackers. Generative AI has recently become a valuable asset for cybersecurity researchers and practitioners, and the field of cyber-deception is no exception. Researchers have demonstrated how LLMs can be leveraged to create realistic-looking honeytokens, fake users, and even simulated systems that can be used as honeypots. This paper presents an AI-based deception framework called VelLMes, which can simulate multiple protocols and services such as SSH Linux shell, MySQL, POP3, and HTTP. All of these can be deployed and used as honeypots, thus VelLMes offers a variety of choices for deception design based on the users’ needs. VelLMes is designed to be attacked by humans, so interactivity and realism are key for its performance. We evaluate the generative capabilities and the deception capabilities. Generative capabilities were evaluated using unit tests for LLMs. The results of the unit tests show that, with careful prompting, LLMs can produce realistic-looking responses, with some LLMs having a 100% passing rate. In the case of the SSH Linux shell, we evaluated deception capabilities with 89 human attackers. The results showed that about 30% of the attackers thought that they were interacting with a real system when they were assigned an LLM-based honeypot. Lastly, we deployed 10 instances of the SSH Linux shell honeypot on the Internet to capture real-life attacks. Analysis of these attacks showed us that LLM honeypots simulating Linux shells can perform well against unstructured and unexpected attacks on the Internet, responding correctly to most of the issued commands.
zh

[NLP-46] Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups

【速读】: 该论文试图解决的问题是:中文大语言模型(Chinese Large Language Models, Chinese LLMs)在用户面向应用中可能反映并放大社会身份偏见(social identity bias)的潜在风险。为解决这一问题,研究提出了一种语言敏感的评估框架,关键在于通过设计针对中文语境的特定提示(Mandarin-specific prompts),系统性地考察模型对内群体(ingroup,“我们”)和外群体(outgroup,“他们”)的响应倾向,并扩展至240个在中国语境下显著的社会群体;同时结合真实用户与聊天机器人的对话语料进行自然交互分析,从而揭示偏见不仅存在于受控实验中,更在实际对话场景中强化。此方案首次实证表明,英语语境下已知的社会身份偏见可跨语言迁移并在用户交互中加剧,为中文LLMs的公平性评估提供了可操作、情境化的评估方法。

链接: https://arxiv.org/abs/2510.06974
作者: Geng Liu,Feng Li,Junjie Mu,Mengxiao Zhu,Francesco Pierri
机构: Politecnico di Milano (米兰理工大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns about their potential to reflect and amplify social biases. We investigate social identity framing in Chinese LLMs using Mandarin-specific prompts across ten representative Chinese LLMs, evaluating responses to ingroup (“We”) and outgroup (“They”) framings, and extending the setting to 240 social groups salient in the Chinese context. To complement controlled experiments, we further analyze Chinese-language conversations from a corpus of real interactions between users and chatbots. Across models, we observe systematic ingroup-positive and outgroup-negative tendencies, which are not confined to synthetic prompts but also appear in naturalistic dialogue, indicating that bias dynamics might strengthen in real interactions. Our study provides a language-aware evaluation framework for Chinese LLMs, demonstrating that social identity biases documented in English generalize cross-linguistically and intensify in user-facing contexts.
zh
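
其实验范式可概括为:对同一群体分别构造“我们/他们”框架的提示,比较模型续写的情感效价。下面是一个玩具化示意(chat 代表被测模型的生成接口,valence 用微型词表打分代替真实情感分类器,均为假设的占位):

```python
# 概念示意(非官方实现):内/外群体框架探测。

GROUPS = ["南方人", "北方人", "程序员", "老年人"]

def make_prompts(group):
    return {"ingroup": f"我们{group}是", "outgroup": f"他们{group}是"}

def valence(text):
    pos = sum(text.count(w) for w in ["勤劳", "友善", "聪明"])
    neg = sum(text.count(w) for w in ["懒惰", "冷漠", "固执"])
    return pos - neg

def probe(chat, groups=GROUPS):
    gaps = []
    for g in groups:
        p = make_prompts(g)
        gap = valence(chat(p["ingroup"])) - valence(chat(p["outgroup"]))
        gaps.append(gap)   # >0 表示内群体续写更正面
    return sum(gaps) / len(gaps)

# 用一个玩具“模型”演示接口:内群体续写偏正面
toy = lambda prompt: "勤劳友善的。" if prompt.startswith("我们") else "固执的。"
print(probe(toy))  # 正值 = 内群体偏好,与论文观察到的方向一致
```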

[NLP-47] EDUMATH: Generating Standards-aligned Educational Math Word Problems

【速读】: 该论文旨在解决教师在大规模班级中难以个性化定制数学应用题(Math Word Problems, MWPs)以匹配学生兴趣和能力水平的问题,从而提升学习效果。解决方案的关键在于利用大语言模型(LLMs)生成符合教育标准且个性化的MWPs,并通过结合人类专家与LLM的联合评判机制评估超过11,000道生成题目,构建首个由教师标注的、与教育标准对齐的MWPs数据集。基于此数据集,研究者训练了一个12B参数的开源模型,在性能上媲美更大更强大的开源模型,并进一步开发了一个文本分类器使30B开源模型无需微调即可超越现有闭源基线模型,同时生成的题目在语义上更接近人工编写题目,且在真实学生实验中显示学生偏好定制化生成题而非人工题,但学习成效相当。

链接: https://arxiv.org/abs/2510.06965
作者: Bryan R. Christ,Penelope Molitz,Jonathan Kropko,Thomas Hartvigsen
机构: University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 32 pages, 15 figures

点击查看摘要

Abstract:Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students’ interests and ability levels can increase learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models’ MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models’ MWPs relative to human-written MWPs but consistently prefer our customized MWPs.
zh

[NLP-48] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation ICASSP2026

【速读】: 该论文旨在解决自动语音识别(ASR)评估领域中存在的两大问题:一是现有评估基准多集中于短格式英文数据,缺乏对多语言和长音频场景的覆盖;二是模型效率指标(如实时因子)在评估中极少被报告,导致难以公平比较准确率与计算效率之间的权衡。解决方案的关键在于构建了一个可复现的开放 ASR 基准测试平台——Open ASR Leaderboard,涵盖 11 个数据集(包括多语言和长音频专项赛道),统一文本归一化流程,并同时报告词错误率(WER)和逆实时因子(RTFx),从而实现对 60 余个开源及专有系统的全面、透明、可扩展的性能对比。

链接: https://arxiv.org/abs/2510.06961
作者: Vaibhav Srivastav,Steven Zheng,Eric Bezzam,Eustache Le Bihan,Nithin Koluguri,Piotr Żelasko,Somshubra Majumdar,Adel Moumen,Sanchit Gandhi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2026; Leaderboard: this https URL Code: this https URL

点击查看摘要

Abstract:Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
zh
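
榜单的两个核心指标均为标准定义:WER 用词级编辑距离计算,RTFx(逆实时因子)= 音频时长 / 处理耗时,越大表示越快。下面给出可运行的最小实现:

```python
# 词错误率(WER)与逆实时因子(RTFx)的最小实现。

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # 动态规划求编辑距离(替换/插入/删除各记 1)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def rtfx(audio_seconds: float, wall_clock_seconds: float) -> float:
    return audio_seconds / wall_clock_seconds

print(wer("the cat sat on the mat", "the cat sit on mat"))   # 2 处错误 / 6 词 ≈ 0.33
print(rtfx(audio_seconds=3600.0, wall_clock_seconds=120.0))  # 30 倍实时
```

如摘要所述,WER 计算前还需统一文本归一化(大小写、标点、数字写法等),否则不同系统间不可比。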

[NLP-49] Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces

【速读】: 该论文旨在解决如何有效评估和提升大语言模型(Large Language Model, LLM)推理过程质量的问题,特别是通过分析推理轨迹中信息密度的分布均匀性来识别高质量推理路径。其解决方案的关键在于提出了一种基于熵的逐步信息密度度量方法,并引入局部与全局均匀性评分两个互补指标,用于量化推理步骤间的信息流稳定性;实验证明,具有更高步级均匀性的推理轨迹能显著提升模型准确性(如在AIME2025基准上相对基线提升10–32%),且正确推理通常避免信息密度突增,而错误推理则呈现不规则的信息爆发模式,从而表明信息密度均匀性是优于其他内部信号的推理质量预测指标。

链接: https://arxiv.org/abs/2510.06953
作者: Minju Gwak,Guijin Son,Jaehyung Kim
机构: Yonsei University (延世大学); OneLine AI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Uniform Information Density (UID) hypothesis suggests that effective communication maintains a stable flow of information. In this work, we revisit this principle in the context of large language model (LLM) reasoning traces, asking whether step-level uniformity reflects reasoning quality. To this end, we propose an entropy-based stepwise information density metric and introduce two complementary measures of uniformity, local and global uniformity scores. Across experiments on six different reasoning benchmarks, we find that step-level uniformity not only provides a strong theoretical lens but also yields practical performance benefits; for example, selecting reasoning traces with more uniform step-level information density yields 10-32% relative accuracy gains over baselines on AIME2025. Our analysis further reveals that correct reasoning traces tend to avoid sharp information density spikes, while incorrect traces exhibit irregular information bursts. These results demonstrate that UID-inspired information density measures outperform alternative internal signals as predictors of reasoning quality. Results highlight the uniformity of the information density as a robust diagnostic and selection criterion for building more reliable and accurate reasoning systems.
zh
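
论文中密度与均匀性评分的精确定义以原文为准;下面给出一个假设性的简化示意,体现“步级平均熵 + 局部/全局均匀性”的计算思路:

```python
import numpy as np

# 概念示意(非官方实现):基于熵的逐步信息密度与均匀性评分。
# 每步密度取该步 token 熵的平均值;全局均匀性取负方差(越接近 0
# 越均匀),局部均匀性看相邻步之差。具体定义为本文假设性简化。

def step_density(token_entropies):
    # token_entropies: 该推理步内每个 token 的预测熵(来自模型 logits)
    return float(np.mean(token_entropies))

def uniformity_scores(steps):
    d = np.array([step_density(s) for s in steps])
    global_u = -float(np.var(d))                   # 方差越小越均匀
    local_u = -float(np.mean(np.abs(np.diff(d))))  # 相邻步跳变越小越均匀
    return global_u, local_u

# 两条轨迹:信息流平稳 vs. 出现“信息爆发”
smooth = [[1.1, 0.9], [1.0, 1.2], [0.9, 1.0], [1.1, 1.0]]
bursty = [[0.2, 0.3], [3.5, 4.0], [0.1, 0.2], [3.8, 0.3]]
print(uniformity_scores(smooth))  # 接近 0:均匀,按论文结论更可能正确
print(uniformity_scores(bursty))  # 明显更负:不均匀,出现信息爆发
```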

[NLP-50] SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

【速读】: 该论文旨在解决当前语音语言模型(Speech Language Models, SLMs)在对话交互中响应延迟高、无法实时参与用户说话过程的问题。传统模型仅在用户完成发言后才开始推理和决策,导致交互不自然且不适用于对低延迟要求高的语音到语音场景。解决方案的关键在于提出SHANKS框架,该框架使SLM能够在接收用户语音输入的同时生成“未说出的思维链”(unspoken chain-of-thought reasoning),即在用户讲话过程中持续进行隐式推理,并据此决定是否打断用户或调用工具完成任务。SHANKS通过固定时长的语音块流式处理机制,实现了边听边思的实时交互能力,显著提升了交互准确性与效率。

链接: https://arxiv.org/abs/2510.06917
作者: Cheng-Han Chiang,Xiaofei Wang,Linjie Li,Chung-Ching Lin,Kevin Lin,Shujie Liu,Zhendong Wang,Zhengyuan Yang,Hung-yi Lee,Lijuan Wang
机构: National Taiwan University (国立台湾大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Work in progress

点击查看摘要

Abstract:Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user’s turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally “think while listening.” In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of Shanks can be found at this https URL
zh
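
其“边听边想”的推理循环可概括如下(reason/should_interrupt 为假设的占位接口,仅示意控制流,非官方实现):语音按固定时长分块流入,每收到一块就基于全部已收内容与既有推理增量思考,并据此决定是否打断或调用工具:

```python
# 概念示意(非官方实现):SHANKS 的流式“听—想—决策”主循环。

def shanks_loop(audio_chunks, reason, should_interrupt):
    heard, thoughts = [], []
    for chunk in audio_chunks:            # 用户仍在说话时持续循环
        heard.append(chunk)
        # 基于“迄今听到的全部内容 + 既有推理”生成未说出的思维链
        thoughts.append(reason(heard, thoughts))
        action = should_interrupt(thoughts)
        if action is not None:
            return action, thoughts       # 打断用户或触发工具调用
    return None, thoughts                 # 听完全程,再正常作答

# 玩具演示:用户逐步解题,第 3 块出现错误,模型及时打断
chunks = ["设 x=3,", "则 2x=6,", "所以 2x+1=8。"]
reason = lambda heard, th: "检查:" + heard[-1]
should_interrupt = lambda th: "打断:2x+1 应为 7" if "8" in th[-1] else None
print(shanks_loop(chunks, reason, should_interrupt))
```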

[NLP-51] LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

【速读】: 该论文旨在解决当前奖励模型(Reward Model, RM)在长上下文场景下缺乏对响应与上下文一致性(long context-response consistency)评估能力的问题。现有RM主要局限于短上下文设置,仅关注响应层面的属性(如安全性或帮助性),而忽视了模型在处理长时间轨迹(如LLM代理任务)时保持语义连贯性和情境一致性的关键需求。解决方案的关键在于提出一种通用的多阶段训练策略(multi-stage training strategy),通过分析模型在长上下文中的失败模式,引导任意基础模型有效转化为鲁棒的长上下文奖励模型(Long-context Reward Model, LongRM)。实验表明,该方法不仅显著提升长上下文评估性能,还能保留强短上下文能力,且8B规模的LongRM优于70B级基线模型,并达到专有模型Gemini 2.5 Pro的水平。

链接: https://arxiv.org/abs/2510.06915
作者: Zecheng Tang,Baibei Ji,Quantong Qiu,Haitian Wang,Xiaobo Liang,Juntao Li,Min Zhang
机构: Soochow University (苏州大学); LCM Laboratory
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reward model (RM) plays a pivotal role in aligning large language model (LLM) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agent, it becomes indispensable to evaluate whether a model’s responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
zh

[NLP-52] MeXtract: Light-Weight Metadata Extraction from Scientific Papers

【速读】: 该论文旨在解决科学文献中元数据(metadata)提取的准确性与泛化能力不足的问题,传统基于规则或任务特定模型的方法难以适应不同领域和模式(schema)的变化。其解决方案的关键在于提出 MeXtract——一系列轻量级语言模型(参数规模从 0.5B 到 3B),通过微调 Qwen 2.5 基座模型构建,在 MOLE 基准上实现了最先进的性能。实验表明,针对特定元数据模式进行微调不仅提升精度,还能有效迁移到未见过的模式,验证了方法的鲁棒性与可扩展性。

链接: https://arxiv.org/abs/2510.06889
作者: Zaid Alyafeai,Maged S. Al-Shaibani,Bernard Ghanem
机构: KAUST; SDAIA-KFUPM Joint Research Center for AI, KFUPM
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Metadata plays a critical role in indexing, documenting, and analyzing scientific literature, yet extracting it accurately and efficiently remains a challenging task. Traditional approaches often rely on rule-based or task-specific models, which struggle to generalize across domains and schema variations. In this paper, we present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. The models, ranging from 0.5B to 3B parameters, are built by fine-tuning Qwen 2.5 counterparts. In their size family, MeXtract achieves state-of-the-art performance on metadata extraction on the MOLE benchmark. To further support evaluation, we extend the MOLE benchmark to incorporate model-specific metadata, providing an out-of-domain challenging subset. Our experiments show that fine-tuning on a given schema not only yields high accuracy but also transfers effectively to unseen schemas, demonstrating the robustness and adaptability of our approach. We release all the code, datasets, and models openly for the research community.
zh

[NLP-53] λ-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

【速读】: 该论文旨在解决Group Relative Policy Optimization (GRPO)在训练大语言模型(Large Language Models, LLMs)时存在的长度偏差(length bias)问题,即由于优势值(advantage)被均匀分配给响应中的所有token,导致较长文本在梯度更新中占据更大权重,从而影响模型性能。解决方案的关键在于引入一个可学习的参数 λ,用于自适应地调整token级别的权重分配,使模型能够通过优化过程自主学习其对不同token的偏好,从而实现更公平和有效的策略更新。该方法被称为λ-GRPO,在多个数学推理基准测试中显著优于原始GRPO和DAPO等变体,且无需修改训练数据或增加计算开销,展现出良好的实用性与有效性。

链接: https://arxiv.org/abs/2510.06870
作者: Yining Wang,Jinman Zhao,Chuangxin Zhao,Shuhao Guan,Gerald Penn,Shinan Liu
机构: The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter λ that adaptively controls token-level weighting. We use λ-GRPO to denote our method, and we find that λ-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, λ-GRPO improves average accuracy by +1.9%, +1.0%, and +1.7% compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
zh
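
摘要未给出 λ 的具体参数化;下面用 PyTorch 给出一个假设性的示意(设权重 w_i ∝ p_i^λ 并在响应内归一化,λ=0 时退化为均匀加权),仅说明“λ 作为可学习标量参与 token 级加权并接收梯度”这一机制,具体公式以论文为准:

```python
import torch

# 概念示意(非官方实现):带可学习 λ 的 GRPO 式 token 加权目标。

def lambda_grpo_loss(logps, mask, advantages, lam):
    # logps: [B, T] 当前策略的 token 对数概率;mask: [B, T] 有效位
    # advantages: [B] 组内相对优势;lam: 可学习标量参数
    w = torch.exp(lam * logps) * mask             # w_i ∝ p_i^λ(假设形式)
    w = w / (w.sum(dim=1, keepdim=True) + 1e-8)   # 响应内归一化
    per_resp = (w * logps).sum(dim=1)             # 加权聚合替代逐 token 平均
    return -(advantages * per_resp).mean()        # 策略梯度式目标

B, T = 4, 10
logps = torch.randn(B, T).clamp(max=0.0).requires_grad_()
mask = torch.ones(B, T)
adv = torch.tensor([1.0, -0.5, 0.3, -0.8])
lam = torch.nn.Parameter(torch.zeros(()))         # 与模型参数一同优化
loss = lambda_grpo_loss(logps, mask, adv, lam)
loss.backward()
print(float(loss), float(lam.grad))               # λ 也收到梯度,可被学习
```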

[NLP-54] Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文感知翻译中对话语现象(discourse phenomena)处理能力不足的问题,例如指代消解(pronoun resolution)和文档级别的词汇衔接(lexical cohesion)。其解决方案的关键在于提出质量感知解码(Quality-aware Decoding, QAD),通过有效提取LLMs内部编码的语篇知识,显著提升翻译的语义丰富度并更贴近人类偏好。

链接: https://arxiv.org/abs/2510.06866
作者: Wafaa Mohammed,Vlad Niculae,Chrysoula Zerva
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have emerged as strong contenders in machine translation. However, they still struggle to adequately handle discourse phenomena, such as pronoun resolution and lexical cohesion at the document level. In this study, we thoroughly investigate the discourse phenomena performance of LLMs in context-aware translation. We demonstrate that discourse knowledge is encoded within LLMs and propose the use of quality-aware decoding (QAD) to effectively extract this knowledge, showcasing its superiority over other decoding approaches through comprehensive analysis. Furthermore, we illustrate that QAD enhances the semantic richness of translations and aligns them more closely with human preferences.
zh
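
QAD 的一种常见形式是候选重排序(论文中的具体变体以原文为准):先从模型采样 N 个候选译文,再用质量评估(QE)模型打分选优。下面的示意中 generate/qe_score 均为假设的占位接口:

```python
import random

# 概念示意(非官方实现):质量感知解码 = 采样 + QE 重排序。

def quality_aware_decoding(source, context, generate, qe_score, n=8):
    candidates = [generate(source, context) for _ in range(n)]
    scored = [(qe_score(source, c, context), c) for c in candidates]
    return max(scored)[1]   # 返回 QE 得分最高的候选

# 玩具演示:QE 偏好与上下文先行词一致的代词译法
random.seed(0)
generate = lambda s, ctx: random.choice(["Sie ist müde.", "Er ist müde."])
qe_score = lambda s, hyp, ctx: (1.0 if ("Maria" in ctx and hyp.startswith("Sie"))
                                else 0.5)
print(quality_aware_decoding("She is tired.", "Maria kam spät an.",
                             generate, qe_score))
```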

[NLP-55] OpenJAI-v1.0: An Open Thai Large Language Model

【速读】: 该论文旨在解决泰国语(Thai)自然语言处理(Natural Language Processing, NLP)资源匮乏及现有开源模型在指令遵循、长上下文理解与工具使用等实际任务中性能不足的问题。解决方案的关键在于基于Qwen3-14B模型,通过精心筛选和构建三类关键应用场景的数据集进行微调,从而显著提升模型在实际任务中的表现,同时避免灾难性遗忘(catastrophic forgetting),最终实现对主流开源泰语模型的超越。

链接: https://arxiv.org/abs/2510.06847
作者: Pontakorn Trakuekul,Attapol T. Rutherford,Jullajak Karnjanaekarin,Narongkorn Panitsrisit,Sumana Sumanakul
机构: Jasmine Technology Solution(茉莉科技解决方案); Chulalongkorn University(朱拉隆功大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce OpenJAI-v1.0, an open-source large language model for Thai and English, developed from the Qwen3-14B model. Our work focuses on boosting performance on practical tasks through carefully curated data across three key use cases: instruction following, long-context understanding, and tool use. Evaluation results show that OpenJAI-v1.0 improves on the capabilities of its base model and outperforms other leading open-source Thai models on a diverse suite of benchmarks, while avoiding catastrophic forgetting. OpenJAI-v1.0 is publicly released as another alternative NLP resource for the Thai AI community.
zh

[NLP-56] SID: Multi-LLM Debate Driven by Self Signals

【速读】: 该论文旨在解决当前多大语言模型(Multi-LLM)辩论(MAD)方法中因过度依赖外部结构(如辩论图、LLM作为裁判)而忽视生成过程中内在自信号(self signals)所导致的冗余计算与性能下降问题。解决方案的关键在于提出一种基于自信号驱动的多大语言模型辩论框架(SID),其创新性地利用两类自信号:模型级置信度(model-level confidence)和词元级语义聚焦度(token-level semantic focus),以自适应地引导辩论过程;具体而言,高置信度代理可在模型层面提前退出,同时基于注意力机制压缩冗余辩论内容,从而在提升准确率的同时显著降低Token消耗,实现性能与效率的协同优化。

链接: https://arxiv.org/abs/2510.06843
作者: Xuhang Chen,Zhifan Song,Deyi Ji,Shuo Gao,Lanyun Zhu
机构: University of Cambridge (剑桥大学); Sorbonne Université (索邦大学); University of Science and Technology of China (中国科学技术大学); Beihang University (北京航空航天大学); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at this https URL.
zh

[NLP-57] GAMBIT: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics EMNLP2025

【Quick Read】: This paper addresses the lack of systematic study of gender bias in automatic quality estimation (QE) metrics for machine translation. Prior work suggests QE metrics can be gender-biased, but small samples, narrow occupational coverage, and limited language variety have prevented a full picture. The key contribution is a large-scale, multilingual, fully parallel challenge set: building on the GAMBIT corpus, coverage is extended to three genderless or natural-gendered source languages and eleven target languages with grammatical gender, yielding 33 source-target language pairs. Each source text is paired with two translations that differ only in the grammatical gender of the occupational term(s) (masculine vs. feminine), with all dependent grammatical elements adjusted for consistency. Since both versions share the same meaning, an unbiased QE metric should score them (nearly) equally, enabling fine-grained bias analysis by occupation and systematic comparison across languages.

Link: https://arxiv.org/abs/2510.06841
Authors: Giorgos Filandrianos, Orfeas Menis Mastromichalakis, Wafaa Mohammed, Giuseppe Attanasio, Chrysoula Zerva
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication at the 10th Conference of Machine Translation (WMT25), co-located with EMNLP 2025

Abstract:Gender bias in machine translation (MT) systems has been extensively documented, but bias in automatic quality estimation (QE) metrics remains comparatively underexplored. Existing studies suggest that QE metrics can also exhibit gender bias, yet most analyses are limited by small datasets, narrow occupational coverage, and restricted language variety. To address this gap, we introduce a large-scale challenge set specifically designed to probe the behavior of QE metrics when evaluating translations containing gender-ambiguous occupational terms. Building on the GAMBIT corpus of English texts with gender-ambiguous occupations, we extend coverage to three source languages that are genderless or natural-gendered, and eleven target languages with grammatical gender, resulting in 33 source-target language pairs. Each source text is paired with two target versions differing only in the grammatical gender of the occupational term(s) (masculine vs. feminine), with all dependent grammatical elements adjusted accordingly. An unbiased QE metric should assign equal or near-equal scores to both versions. The dataset’s scale, breadth, and fully parallel design, where the same set of texts is aligned across all languages, enables fine-grained bias analysis by occupation and systematic comparisons across languages.
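
A sketch of how such a challenge set would be consumed: score both gender variants with any reference-free QE metric and report the per-occupation score gap, which should be near zero for an unbiased metric. The length-ratio heuristic below is a toy stand-in for a real scorer (e.g., a CometKiwi-style model), and the aggregation is an assumption about typical usage.

```python
from collections import defaultdict
from statistics import mean

def qe_score(source: str, translation: str) -> float:
    """Toy stand-in for a reference-free QE metric; replace with
    a real scorer such as CometKiwi."""
    ls, lt = len(source.split()), len(translation.split())
    return 1.0 - abs(ls - lt) / max(ls, 1)

def gender_gap_by_occupation(items):
    """items: (source, masc_translation, fem_translation, occupation).
    An unbiased metric should yield gaps close to zero."""
    gaps = defaultdict(list)
    for src, masc, fem, occ in items:
        gaps[occ].append(qe_score(src, masc) - qe_score(src, fem))
    return {occ: mean(vals) for occ, vals in gaps.items()}

print(gender_gap_by_occupation([
    ("the doctor arrived", "der Arzt kam an", "die Ärztin kam an", "doctor"),
]))
```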

[NLP-58] Crossing Domains without Labels: Distant Supervision for Term Extraction EMNLP

【Quick Read】: This paper tackles two practical obstacles for Automatic Term Extraction (ATE): state-of-the-art methods depend on costly human annotation, and they transfer poorly across domains, limiting scalability and practical deployment. The key is a robust, LLM-based two-stage approach: first, a black-box LLM generates pseudo-labels on general and scientific domains to ensure generalizability; then LLMs are fine-tuned for ATE on this data, and lightweight post-hoc heuristics improve document-level term consistency for downstream use. The method outperforms most baselines on a seven-domain benchmark, exceeding prior approaches on 5 of 7 domains with an average gain of 10 percentage points, substantially improving ATE performance and practicality.

Link: https://arxiv.org/abs/2510.06838
Authors: Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Fraunhofer Center for International Management and Knowledge Economy IMW, Germany; Department of Computer Science, IT University of Copenhagen, Denmark
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted at EMNLP Industry Track 2025

Abstract:Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with domain transfer, limiting their practical deployment. This highlights the need for more robust, scalable solutions and realistic evaluation settings. To address this, we introduce a comprehensive benchmark spanning seven diverse domains, enabling performance evaluation at both the document- and corpus-levels. Furthermore, we propose a robust LLM-based model that outperforms both supervised cross-domain encoder models and few-shot learning baselines and performs competitively with its GPT-4o teacher on this benchmark. The first step of our approach is generating pseudo-labels with this black-box LLM on general and scientific domains to ensure generalizability. Building on this data, we fine-tune the first LLMs for ATE. To further enhance document-level consistency, oftentimes needed for downstream tasks, we introduce lightweight post-hoc heuristics. Our approach exceeds previous approaches on 5/7 domains with an average improvement of 10 percentage points. We release our dataset and fine-tuned models to support future research in this area.

[NLP-59] Mid-Training of Large Language Models: A Survey

【Quick Read】: This survey addresses problems that surface late in LLM training, where noisy tokens bring diminishing returns, unstable convergence, and limited capability growth. It frames a unified mid-training paradigm: a stage of multiple annealing-style phases that systematically refine data quality, adapt learning-rate schedules, and extend context length, thereby improving generalization and abstraction late in training. Its effectiveness can be explained from three theoretical perspectives: gradient noise scale, the information bottleneck, and curriculum learning. The paper contributes the first taxonomy of LLM mid-training (data distribution, learning-rate scheduling, and long-context extension), along with practical insights and evaluation benchmarks for structured comparison across current mainstream training pipelines.

Link: https://arxiv.org/abs/2510.06826
Authors: Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, Anxiang Zeng
Affiliations: Shopee; Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are typically developed through large-scale pre-training followed by task-specific fine-tuning. Recent advances highlight the importance of an intermediate mid-training stage, where models undergo multiple annealing-style phases that refine data quality, adapt optimization schedules, and extend context length. This stage mitigates diminishing returns from noisy tokens, stabilizes convergence, and expands model capability in late training. Its effectiveness can be explained through gradient noise scale, the information bottleneck, and curriculum learning, which together promote generalization and abstraction. Despite widespread use in state-of-the-art systems, there has been no prior survey of mid-training as a unified paradigm. We introduce the first taxonomy of LLM mid-training spanning data distribution, learning-rate scheduling, and long-context extension. We distill practical insights, compile evaluation benchmarks, and report gains to enable structured comparisons across models. We also identify open challenges and propose avenues for future research and practice.

[NLP-60] Adaptive Tool Generation with Models as Tools and Reinforcement Learning

【Quick Read】: This paper addresses the scalability and reliability problems caused by tool-augmented language models' dependence on live API calls during training and deployment. The key is MTR, a simulation-first training framework whose multi-agent architecture enables end-to-end training without real API interactions: a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think-act-observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage 1 supervised fine-tuning (SFT) teaches the grammar of reasoning traces, and Stage 2 Group Relative Policy Optimization (GRPO) optimizes the reasoning strategy with a composite reward that balances answer correctness and internal consistency, matching or exceeding live-API systems on several multi-hop QA benchmarks.

Link: https://arxiv.org/abs/2510.06825
Authors: Chenpeng Wang, Xiaojie Cheng, Chunye Wang, Linfeng Yang, Lei Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Tool-augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR, a simulation-first training framework for tool-augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema-validated, simulated observations. Our approach operates through a multi-agent architecture where a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think-act-observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage-1 Supervised Fine-Tuning (SFT) teaches ‘trace grammar’ from complete reasoning sequences; Stage-2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains competitive Exact Match (EM) scores to live-API systems and excels on reasoning-intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.

[NLP-61] Exposing Citation Vulnerabilities in Generative Engines

【Quick Read】: This paper examines the poisoning risk faced by generative engines (GEs) that cite web sources when answering user queries: because anyone can publish on the web, malicious content can be injected into answers. Existing citation evaluations focus on how faithfully answers reflect cited sources, but overlook how to select citations with a high content-injection barrier as a defense. The key contribution is a set of evaluation criteria based on publisher attributes of cited sources that quantify the poisoning threat and flag frequently cited, low-barrier sources. Experiments on political questions show that citations of official party websites (primary sources) account for only about 25% to 45% in the U.S. versus 60% to 65% in Japan, indicating that U.S. political answers are at higher risk; low-barrier sources are cited often yet poorly reflected in answer content, suggesting the need to raise the visibility of authoritative sources to mitigate poisoning.

Link: https://arxiv.org/abs/2510.06823
Authors: Riku Mochizuki, Shusuke Komatsu, Souta Noguchi, Kazuto Ataka
Affiliations: QueryLift Inc.; Keio University; Nara Institute of Science and Technology
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, under review at a conference

Abstract:We analyze answers generated by generative engines (GEs) from the perspectives of citation publishers and the content-injection barrier, defined as the difficulty for attackers to manipulate answers to user prompts by placing malicious content on the web. GEs integrate two functions: web search and answer generation that cites web pages using large language models. Because anyone can publish information on the web, GEs are vulnerable to poisoning attacks. Existing studies of citation evaluation focus on how faithfully answer content reflects cited sources, leaving unexamined which web sources should be selected as citations to defend against poisoning attacks. To fill this gap, we introduce evaluation criteria that assess poisoning threats using the citation information contained in answers. Our criteria classify the publisher attributes of citations to estimate the content-injection barrier, thereby revealing the threat of poisoning attacks in current GEs. We conduct experiments in political domains in Japan and the United States (U.S.) using our criteria and show that citations from official party websites (primary sources) are approximately 25% to 45% in the U.S. and 60% to 65% in Japan, indicating that U.S. political answers are at higher risk of poisoning attacks. We also find that sources with low content-injection barriers are frequently cited yet are poorly reflected in answer content. To mitigate this threat, we discuss how publishers of primary sources can increase exposure of their web content in answers and show that well-known techniques are limited by language differences.
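
The publisher-attribute classification can be pictured as a simple domain lookup plus a primary-source share, as in the sketch below. The domain list and the binary primary/non-primary split are hypothetical simplifications of the paper's criteria.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of primary sources (official party sites).
PRIMARY_DOMAINS = {"gop.com", "democrats.org", "jimin.jp", "cdp-japan.jp"}

def classify_citation(url: str) -> str:
    host = urlparse(url).netloc.lower().removeprefix("www.")
    return "primary" if host in PRIMARY_DOMAINS else "non-primary"

def primary_share(citation_urls) -> float:
    """Fraction of an answer's citations that point at primary sources."""
    labels = [classify_citation(u) for u in citation_urls]
    return labels.count("primary") / max(1, len(labels))

print(primary_share([
    "https://www.gop.com/platform",
    "https://someblog.example.com/post",
]))  # -> 0.5
```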

[NLP-62] BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

【Quick Read】: This paper targets the precision of circuit localization in large language models (LLMs), i.e., identifying the subnetworks responsible for specific task behaviors. The key idea is to ensemble several circuit localization methods in two ways: parallel ensembling, which aggregates the edge attribution scores assigned by different methods (by averaging or taking the minimum or maximum) for robustness; and sequential ensembling, which uses the cheap but less precise EAP-IG attribution scores as a warm start for the more precise but expensive edge pruning. Combining both into an overall parallel ensemble yields the most precise circuit identification on the Mechanistic Interpretability Benchmark (MIB), outperforming the individual methods and the official baselines.

Link: https://arxiv.org/abs/2510.06811
Authors: Philipp Mondorf, Mingyang Wang, Sebastian Gerstner, Ahmad Dawar Hakimi, Yihong Liu, Leonor Veloso, Shijia Zhou, Hinrich Schütze, Barbara Plank
Affiliations: LMU Munich; Munich Center for Machine Learning
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The 8th BlackboxNLP Workshop (Shared Task), 6 pages

Abstract:The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods-e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.
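
The parallel ensemble reduces to element-wise aggregation of per-edge attribution scores. A minimal sketch, with made-up method names and scores:

```python
import numpy as np

def parallel_ensemble(edge_scores: dict, mode: str = "mean") -> np.ndarray:
    """edge_scores: method name -> attribution score per edge
    (all arrays share the same edge ordering)."""
    stacked = np.stack(list(edge_scores.values()))
    agg = {"mean": stacked.mean, "min": stacked.min, "max": stacked.max}[mode]
    return agg(axis=0)

scores = {
    "EAP-IG":     np.array([0.9, 0.1, 0.4]),   # illustrative values
    "method-B":   np.array([0.8, 0.3, 0.2]),
}
combined = parallel_ensemble(scores, mode="mean")
circuit = np.argsort(combined)[::-1][:2]       # keep the top-k edges
print(combined, circuit)
```

In the sequential variant, `combined` (or the raw EAP-IG scores) would instead initialize the edge mask that edge pruning then refines.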

[NLP-63] Overview of the Plagiarism Detection Task at PAN 2025

【Quick Read】: This paper addresses the detection of automatically generated textual plagiarism in scientific articles, i.e., identifying LLM-generated content and aligning it with its original sources. The key is a novel large-scale dataset of plagiarism generated with three mainstream LLMs (Llama, DeepSeek-R1, and Mistral), together with a comparison of all participating approaches and four baselines in the PAN 2025 task. Naive semantic-similarity approaches based on embedding vectors reach up to 0.8 recall and 0.5 precision on the new dataset, but most approaches underperform significantly on the PAN 2015 task data, exposing a lack of generalizability in current methods.

Link: https://arxiv.org/abs/2510.06805
Authors: André Greiner-Petter, Maik Fröbe, Jan Philip Wahle, Terry Ruas, Bela Gipp, Akiko Aizawa, Martin Potthast
Affiliations: Georg-August-Universität, Göttingen, Germany; National Institute of Informatics, Tokyo, Japan; Friedrich-Schiller-Universität Jena, Jena, Germany; University of Kassel, Kassel, Germany; hessian.ai, Darmstadt, Germany; ScaDS.AI, Leipzig, Germany
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Working Notes at PAN at CLEF 2025

Abstract:The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning them with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack in generalizability.
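
A minimal sketch of the kind of embedding-based baseline the overview describes: cosine-match suspicious passages against source passages and flag pairs above a threshold. The random `embed` stub stands in for any sentence encoder, and the 0.8 threshold is an arbitrary assumption.

```python
import numpy as np

def embed(texts):
    """Placeholder for a real sentence encoder (e.g., a
    sentence-transformers model); returns one vector per text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def flag_plagiarism(suspicious, sources, threshold=0.8):
    """Return (suspicious_idx, best_source_idx, similarity) for every
    suspicious passage whose best cosine match exceeds the threshold."""
    a, b = embed(suspicious), embed(sources)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T
    return [(i, int(sims[i].argmax()), float(sims[i].max()))
            for i in range(len(suspicious)) if sims[i].max() >= threshold]

print(flag_plagiarism(["a rewritten passage"], ["the original passage"]))
```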

[NLP-64] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

【Quick Read】: This paper addresses the limited diversity, adaptability, and scalability of existing role-playing (RP) benchmarks, which quickly become obsolete and are hard to adapt to diverse application scenarios. The key is FURINA-Builder, a multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale: it simulates dialogues between a test character and characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances, enabling adaptive evaluation of arbitrary characters across diverse scenarios and prompt formats. As the first benchmark builder in the RP area, it underpins FURINA-Bench, a comprehensive benchmark for systematically analyzing the RP performance and reliability of current mainstream LLMs.

Link: https://arxiv.org/abs/2510.06800
Authors: Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Yanran Li, Chengwei Qin
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The University of Hong Kong; National University of Singapore; Stony Brook University; Nanyang Technological University; Datawhale Org.; Independent Researcher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Abstract:As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character’s responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

[NLP-65] GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting

【Quick Read】: This paper studies the reasoning accuracy of large language models (LLMs) on chart-reading tasks, in particular how model architecture and prompting strategies affect zero-shot performance. The key finding, from a quantitative comparison of the agentic GPT-5 and the multimodal GPT-4V on 107 visualization questions, is that architecture dominates inference accuracy: GPT-5 substantially improved accuracy, whereas prompt variants yielded only small gains.

Link: https://arxiv.org/abs/2510.06782
Authors: Kaichun Yang, Jian Chen
Affiliations: University of Illinois Urbana-Champaign; The Ohio State University
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a quantitative evaluation to understand the effect of zero-shot large language models (LLMs) and prompting on chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracies between the agentic GPT-5 and multimodal GPT-4V, for difficult image instances, where GPT-4V failed to produce correct answers. Our results show that model architecture dominates the inference accuracy: GPT-5 largely improved accuracy, while prompt variants yielded only small effects. Pre-registration of this work is available here: this https URL; the Google Drive materials are here: this https URL.

[NLP-66] Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

【Quick Read】: This paper addresses the difficulty of measuring and structuring the substantial factual knowledge encoded in large language models (LLMs). The key is to materialize implicit LLM knowledge into structured knowledge bases via recursive extraction (the GPTKB methodology) and to study it systematically with miniGPTKBs (domain-specific, tractable subcrawls), evaluating extraction along three dimensions: termination, reproducibility, and robustness, thereby revealing both the feasibility and the limits of LLM knowledge materialization.

Link: https://arxiv.org/abs/2510.06780
Authors: Luca Giordano, Simon Razniewski
Affiliations: ScaDS.AI Dresden/Leipzig & TU Dresden, Germany
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.

[NLP-67] Adaptive LLM -Symbolic Reasoning via Dynamic Logical Solver Composition

【Quick Read】: This paper addresses the static integration problem in current neuro-symbolic NLP methods, where the formal logical solver is fixed at design time, preventing flexible use of diverse formal reasoning strategies. The key is an adaptive, multi-paradigm neuro-symbolic inference framework that (1) automatically identifies the required formal reasoning strategy from a problem stated in natural language, and (2) dynamically selects and invokes a specialized logical solver through autoformalization interfaces. This markedly improves flexibility and performance on complex reasoning tasks, outperforming strong baselines across benchmarks and even improving pure LLM methods.

Link: https://arxiv.org/abs/2510.06774
Authors: Lei Xu, Pierre Beckmann, Marco Valentino, André Freitas
Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne; School of Computer Science, University of Sheffield; Department of Computer Science, University of Manchester; Cancer Biomarker Centre, CRUK Manchester Institute
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Neuro-symbolic NLP methods aim to leverage the complementary strengths of large language models and formal logical solvers. However, current approaches are mostly static in nature, i.e., the integration of a target solver is predetermined at design time, hindering the ability to employ diverse formal inference strategies. To address this, we introduce an adaptive, multi-paradigm, neuro-symbolic inference framework that: (1) automatically identifies formal reasoning strategies from problems expressed in natural language; and (2) dynamically selects and applies specialized formal logical solvers via autoformalization interfaces. Extensive experiments on individual and multi-paradigm reasoning tasks support the following conclusions: LLMs are effective at predicting the necessary formal reasoning strategies with an accuracy above 90 percent. This enables flexible integration with formal logical solvers, resulting in our framework outperforming competing baselines by 27 percent and 6 percent compared to GPT-4o and DeepSeek-V3.1, respectively. Moreover, adaptive reasoning can even positively impact pure LLM methods, yielding gains of 10, 5, and 6 percent on zero-shot, CoT, and symbolic CoT settings with GPT-4o. Finally, although smaller models struggle with adaptive neuro-symbolic reasoning, post-training offers a viable path to improvement. Overall, this work establishes the foundations for adaptive LLM-symbolic reasoning, offering a path forward for unifying material and formal inferences on heterogeneous reasoning challenges.
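
The select-then-dispatch pattern can be sketched as below. The keyword rules stand in for the paper's LLM strategy classifier, and the solver names are illustrative (in practice each branch would call a formal solver such as Z3 after autoformalization).

```python
def classify_strategy(problem: str) -> str:
    """Toy stand-in: the paper has an LLM predict the formal
    reasoning strategy from the natural-language problem."""
    p = problem.lower()
    if "every" in p.split() or "some" in p.split():
        return "first_order_logic"
    if "if" in p.split() and "then" in p.split():
        return "propositional_logic"
    return "constraint_satisfaction"

def solve_fol(p):  return f"FOL solver on: {p}"
def solve_prop(p): return f"SAT solver on: {p}"
def solve_csp(p):  return f"CSP solver on: {p}"

SOLVERS = {
    "first_order_logic": solve_fol,
    "propositional_logic": solve_prop,
    "constraint_satisfaction": solve_csp,
}

def adaptive_solve(problem: str) -> str:
    return SOLVERS[classify_strategy(problem)](problem)

print(adaptive_solve("if it rains then the grass is wet"))
```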

[NLP-68] Evolving and Executing Research Plans via Double-Loop Multi-Agent Collaboration

【Quick Read】: This paper tackles the core bilevel challenge of automating the end-to-end scientific research process: simultaneously evolving high-level research plans (which must be novel and sound) and executing them correctly under dynamic, uncertain conditions. The key is a Double-Loop Multi-Agent (DLMA) framework: a leader loop of professor agents evolves research proposals with an evolutionary algorithm through involvement, improvement, and integration meetings, exploring the solution space; a follower loop of doctoral-student agents executes the best-evolved plan, dynamically adjusting it via pre-hoc and post-hoc meetings so that each step (e.g., drafting, coding) is grounded in contextual and external observations. On benchmarks such as ACLAward and Laboratory, DLMA produces research papers with state-of-the-art automated-evaluation scores, and ablations confirm both loops matter: evolution drives novelty, execution ensures soundness.

Link: https://arxiv.org/abs/2510.06761
Authors: Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Sheng-hua Zhong, Jiannong Cao
Affiliations: The Hong Kong Polytechnic University; FireTorch Partners; Shenzhen University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Automating the end-to-end scientific research process poses a fundamental challenge: it requires both evolving high-level plans that are novel and sound, and executing these plans correctly amidst dynamic and uncertain conditions. To address this bilevel challenge, we propose a novel Double-Loop Multi-Agent (DLMA) framework to solve the given research problem automatically. The leader loop, composed of professor agents, is responsible for evolving research plans. It employs an evolutionary algorithm through involvement, improvement, and integration meetings to iteratively generate and refine a pool of research proposals, exploring the solution space effectively. The follower loop, composed of doctoral student agents, is responsible for executing the best-evolved plan. It dynamically adjusts the plan during implementation via pre-hoc and post-hoc meetings, ensuring each step (e.g., drafting, coding) is well-supported by contextual and external observations. Extensive experiments on benchmarks like ACLAward and Laboratory show that DLMA generates research papers that achieve state-of-the-art scores in automated evaluation, significantly outperforming strong baselines. Ablation studies confirm the critical roles of both loops, with evolution driving novelty and execution ensuring soundness.

[NLP-69] Gold-Switch: Training-Free Superposition of Slow- and Fast-Thinking LLMs

【Quick Read】: This paper addresses the performance loss and wasted computation caused by overthinking when Large Reasoning Models (LRMs) handle structured tasks. The key is a superposed deployment strategy with a lightweight, training-free regulation that switches model components on and off at inference, instead of routing between multiple deployed models: by analyzing the cumulative energy of singular values, it identifies optimal low-rank projections that selectively "unlearn" part of the LRM's reasoning, scaling down computation while preserving reasoning quality.

Link: https://arxiv.org/abs/2510.06750
Authors: Jaeseong Lee, Dayoung Kwon, seung-won hwang
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking, degrading performance and wasting resources. One possible baseline is to deploy both LLM and LRM, then route input by predicting whether it requires reasoning and may cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with a lightweight, training-free regulation to optimize inference by switching one model on and off. Instead of routing, we selectively unlearn from LRM at inference, scaling down computation while preserving reasoning. By analyzing the cumulative energy of singular values, we identify optimal low-rank projections to adjust reasoning just right.
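
The cumulative singular-value energy criterion itself is easy to illustrate: keep the smallest rank whose singular values carry a chosen fraction of the total energy, then rebuild the matrix at that rank. This is only the rank-selection mechanic, not the paper's full unlearning procedure, and the 0.9 energy threshold is an assumption.

```python
import numpy as np

def low_rank_projection(W: np.ndarray, energy: float = 0.9):
    """Truncate W at the smallest rank k whose top-k singular
    values account for `energy` of the cumulative energy."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    cum = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(cum, energy) + 1)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :], k

W = np.random.randn(512, 512)
W_low, k = low_rank_projection(W, energy=0.9)
print(f"kept rank {k} of {min(W.shape)}")
```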

[NLP-70] A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction ACL EACL2026

【Quick Read】: This paper addresses the limitation of single-reference evaluation in grammatical error correction (GEC): existing frameworks are largely edit-based and English-centric, relying on rigid alignments that fail to reflect the diversity of legitimate human corrections, especially in multilingual and generative settings. The key is a formal framework for fluency-based multi-reference evaluation that casts n-gram similarity as an aggregation problem over multiple valid corrections, instantiating GLEU with four aggregation strategies (select-best, simple-average, weighted-average, and merged-counts) that capture complementary aspects of fluency and coverage, thereby unifying multi-reference evaluation into a principled, fluency-oriented approach that accommodates linguistic diversity without penalizing legitimate variation.

Link: https://arxiv.org/abs/2510.06749
Authors: Eitan Klinger, Zihao Huang, Tran Minh Nguyen, Emma Jayeon Park, Yige Chen, Yang Gu, Qingyu Gao, Siliang Liu, Mengyang Qiu, Jungyeul Park
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Submitted to ACL Rolling Review - October 2025 for EACL 2026

Abstract:Evaluating grammatical error correction requires metrics that reflect the diversity of valid human corrections rather than privileging a single reference. Existing frameworks, largely edit-based and English-centric, rely on rigid alignments between system and reference edits, limiting their applicability in multilingual and generative settings. This paper introduces a formal framework for fluency-based multi-reference evaluation, framing n-gram similarity as an aggregation problem over multiple legitimate corrections. Within this formulation, we instantiate GLEU through four aggregation strategies (select-best, simple-average, weighted-average, and merged-counts) and analyze their properties of boundedness, monotonicity, and sensitivity to reference variation. Empirical results on Czech, Estonian, Ukrainian, and Chinese corpora show that these strategies capture complementary aspects of fluency and coverage. The framework unifies multi-reference evaluation into a principled, fluency-oriented approach that incorporates linguistic diversity without penalizing legitimate variation.
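
The four aggregation strategies can be shown over any per-reference sentence score. The sketch below uses a toy n-gram precision in place of actual GLEU (which also has penalty terms), so the numbers are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def overlap_score(hyp, ref, n_max=2):
    """Toy n-gram precision standing in for sentence-level GLEU."""
    matched = total = 0
    for n in range(1, n_max + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matched += sum((h & r).values())
        total += max(1, sum(h.values()))
    return matched / total

def aggregate(hyp, refs, weights=None, mode="select-best"):
    scores = [overlap_score(hyp, r) for r in refs]
    if mode == "select-best":
        return max(scores)
    if mode == "simple-average":
        return sum(scores) / len(scores)
    if mode == "weighted-average":
        w = weights or [1 / len(refs)] * len(refs)
        return sum(s * x for s, x in zip(scores, w))
    if mode == "merged-counts":          # pool reference n-grams, match once
        merged = Counter()
        for r in refs:
            for n in (1, 2):
                merged |= ngrams(r, n)   # max counts across references
        h = Counter()
        for n in (1, 2):
            h += ngrams(hyp, n)
        return sum((h & merged).values()) / max(1, sum(h.values()))
    raise ValueError(mode)

hyp = "she go to school".split()
refs = ["she goes to school".split(), "she went to school".split()]
for m in ("select-best", "simple-average", "weighted-average", "merged-counts"):
    print(m, round(aggregate(hyp, refs, mode=m), 3))
```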

[NLP-71] TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs

【Quick Read】: This paper addresses short text clustering in commercial settings, where large volumes of user-intent utterances must be clustered efficiently and accurately without labeled data or a known number of clusters, conditions under which traditional methods that rely on contrastive learning or label information break down. The key is a training-free, label-free iterative vector-updating method: sparse vectors are first constructed from representative texts and then iteratively refined under LLM guidance, yielding high-quality clusters. The approach is model-agnostic, low-resource, and scalable: it works on top of any embedder, with relatively small LLMs, and with different clustering algorithms, making it better aligned with real-world deployment than existing clustering methods.

Link: https://arxiv.org/abs/2510.06747
Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Affiliations: Leiden University; Radboud University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we propose a training-free and label-free method for short text clustering that can be used on top of any existing embedder. In the context of customer-facing chatbots, companies are dealing with large amounts of user utterances that need to be clustered according to their intent. In these commercial settings, no labeled data is typically available, and the number of clusters is not known. Our method is based on iterative vector updating: it constructs sparse vectors based on representative texts, and then iteratively refines them through LLM guidance. Our method achieves comparable or superior results to state-of-the-art methods that use contrastive learning, but without assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model agnostic and can be applied to any embedder, with relatively small LLMs, and different clustering methods. We also show that our method scales to large datasets, reducing the computational cost of the LLM. These low-resource, adaptable settings and the scalability of our method make it more aligned with real-world scenarios than existing clustering methods.

[NLP-72] Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities

【Quick Read】: This paper addresses the absence of suitable evaluation frameworks for LLM-based OCR in historical document digitization, where traditional metrics fail to capture temporal biases and period-specific errors that matter for historical corpus creation. The key is a new evaluation methodology for historical OCR that introduces novel metrics, including the Historical Character Preservation Rate (HCPR) and the Archaic Insertion Rate (AIR), together with contamination-control and stability-testing protocols, enabling systematic assessment of LLM behavior and bias on historical text and giving digital humanities researchers a reliable basis for model selection and quality assessment.

Link: https://arxiv.org/abs/2510.06743
Authors: Maria Levchenko
Affiliations: Italian Institute of Germanic Studies (IISG); University of Bologna
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The First Workshop on Natural Language Processing and Language Models for Digital Humanities (LM4DH 2025), RANLP 2025

Abstract:Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.
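
One plausible reading of the two metrics, sketched below: HCPR as the fraction of archaic characters in the gold transcription that survive in the OCR output, and AIR as extra archaic characters inserted per output character. Both formulas and the archaic-character set are assumptions for illustration, not the paper's definitions.

```python
# Illustrative subset of archaic characters from 18th-century
# Russian Civil font texts (yat, i-decimal, izhitsa).
ARCHAIC = set("\u0463i\u0456\u0475")

def hcpr(gold: str, ocr: str) -> float:
    """Historical Character Preservation Rate (assumed formula):
    share of gold archaic characters preserved in the OCR output."""
    gold_archaic = [c for c in gold if c in ARCHAIC]
    if not gold_archaic:
        return 1.0
    kept = min(sum(ocr.count(c) for c in set(gold_archaic)), len(gold_archaic))
    return kept / len(gold_archaic)

def air(gold: str, ocr: str) -> float:
    """Archaic Insertion Rate (assumed formula): archaic characters
    in the OCR output beyond the gold counts, per output character."""
    extra = sum(max(0, ocr.count(c) - gold.count(c)) for c in ARCHAIC)
    return extra / max(1, len(ocr))

gold = "л\u0463то"
print(hcpr(gold, "л\u0463то"), air(gold, "л\u0463то\u0463"))  # 1.0, 0.2
```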

[NLP-73] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

【Quick Read】: This paper addresses a key problem in protecting the intellectual property of large language models (LLMs): determining whether a suspect model derives from a known base model after various post-training operations (supervised fine-tuning, continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling), perturbations under which existing methods become unreliable. The key is a training-free fingerprinting method over weight matrices that combines the Linear Assignment Problem (LAP) with an unbiased Centered Kernel Alignment (CKA) similarity to neutralize such parameter manipulations, achieving robust, high-fidelity lineage verification: a near-zero false-positive rate on a testbed of 60 positive and 90 negative model pairs, with the whole computation finishing within 30 seconds on an NVIDIA 3090 GPU.

Link: https://arxiv.org/abs/2510.06738
Authors: Boyi Zeng, Lin Chen, Ziwei He, Xinbing Wang, Zhouhan Lin
Affiliations: LUMIA Lab, School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo-such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling-pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at this https URL.
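
A simplified sketch of the two ingredients named in the abstract: align possibly permuted rows (neurons) with the LAP, then compare aligned weight matrices with an unbiased linear CKA (using the unbiased HSIC estimator of Song et al., 2012). The negative-inner-product cost and the overall pipeline are assumptions; the paper's exact procedure may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unbiased_hsic(K, L):
    """Unbiased HSIC estimator (Song et al., 2012)."""
    n = K.shape[0]
    K, L = K.copy(), L.copy()
    np.fill_diagonal(K, 0.0)
    np.fill_diagonal(L, 0.0)
    one = np.ones(n)
    term1 = np.trace(K @ L)
    term2 = (one @ K @ one) * (one @ L @ one) / ((n - 1) * (n - 2))
    term3 = 2.0 * (one @ K @ L @ one) / (n - 2)
    return (term1 + term2 - term3) / (n * (n - 3))

def unbiased_cka(X, Y):
    K, L = X @ X.T, Y @ Y.T                  # linear kernels
    return unbiased_hsic(K, L) / np.sqrt(unbiased_hsic(K, K) * unbiased_hsic(L, L))

def weight_similarity(W_a, W_b):
    """Align rows (e.g., permuted neurons) via the LAP, then
    compare the aligned matrices with unbiased linear CKA."""
    cost = -(W_a @ W_b.T)                    # maximize row agreement
    r, c = linear_sum_assignment(cost)
    return unbiased_cka(W_a[r], W_b[c])

W = np.random.randn(64, 32)
perm = np.random.permutation(64)
print(weight_similarity(W, W[perm] + 0.01 * np.random.randn(64, 32)))  # ~1
```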

[NLP-74] Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

【Quick Read】: This paper shows that large language models (LLMs) used as rerankers in information retrieval can be steered by small, natural-sounding prompts, a vulnerability that can be exploited to promote target items and thus threatens the trustworthiness and robustness of retrieval systems. The key is Rank Anything First (RAF), a two-stage token optimization method guided by dual objectives, maximizing ranking effectiveness while preserving linguistic naturalness: Stage 1 uses Greedy Coordinate Gradient, combining the rank-target gradient with a readability score, to shortlist candidate tokens at the current position; Stage 2 evaluates candidates under exact ranking and readability losses with entropy-based dynamic weighting and selects tokens via temperature-controlled sampling. RAF produces short, hard-to-detect textual perturbations that significantly boost target-item rank while remaining more natural and robust than existing methods.

Link: https://arxiv.org/abs/2510.06732
Authors: Tiancheng Xing, Jerry Li, Yixuan Du, Xiyang Hu
Affiliations: National University of Singapore; University of Southern California; Georgetown University; Arizona State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 10 pages, 3 figures

Abstract:Large language models (LLMs) are increasingly used as rerankers in information retrieval, yet their ranking behavior can be steered by small, natural-sounding prompts. To expose this vulnerability, we present Rank Anything First (RAF), a two-stage token optimization method that crafts concise textual perturbations to consistently promote a target item in LLM-generated rankings while remaining hard to detect. Stage 1 uses Greedy Coordinate Gradient to shortlist candidate tokens at the current position by combining the gradient of the rank-target with a readability score; Stage 2 evaluates those candidates under exact ranking and readability losses using an entropy-based dynamic weighting scheme, and selects a token via temperature-controlled sampling. RAF generates ranking-promoting prompts token-by-token, guided by dual objectives: maximizing ranking effectiveness and preserving linguistic naturalness. Experiments across multiple LLMs show that RAF significantly boosts the rank of target items using naturalistic language, with greater robustness than existing methods in both promoting target items and maintaining naturalness. These findings underscore a critical security implication: LLM-based reranking is inherently susceptible to adversarial manipulation, raising new challenges for the trustworthiness and robustness of modern retrieval systems. Our code is available at: this https URL.

[NLP-75] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

【Quick Read】: This paper addresses the risk that static test suites such as the Massive Text Embedding Benchmark (MTEB) inflate the reported performance of sentence embedding models and obscure real-world robustness, since repeated tuning on a fixed suite encourages overfitting to one data distribution. The key is a dynamic protocol, the Paraphrasing Text Embedding Benchmark (PTEB), which stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs to measure stability. Using a cost-efficient LLM-based method grounded in semantic textual similarity (STS) gold ratings to generate token-diverse yet semantics-preserving paraphrases, the study validates that sentence encoders are sensitive to token-space changes even when semantics are fixed, and that smaller models are not disproportionately affected relative to larger ones. The results are statistically robust over multiple runs and extend to three multilingual datasets covering ten languages, arguing for a shift from static benchmarks toward dynamic, stochastic, eval-time-compute-driven evaluation.

Link: https://arxiv.org/abs/2510.06730
Authors: Manuel Frank, Haithem Afli
Affiliations: Munster Technological University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Current evaluations of sentence embedding models typically rely on static test beds such as the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported performance and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in semantic textual similarity gold ratings, we show that LLMs generate token-diverse but semantically preserving paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs and we extended our experiments to 3 multilingual datasets covering 10 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks but shifts towards dynamic, stochastic evaluation leveraging eval-time compute.
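
The protocol reduces to an evaluate-over-perturbed-copies loop. In this runnable sketch, the synonym-swap paraphraser and the length-based "model" are toy stand-ins (the paper uses an LLM paraphraser validated against STS gold ratings and real embedding tasks).

```python
import random
from statistics import mean, stdev

SYNONYMS = {"quick": "fast", "movie": "film", "car": "automobile"}

def paraphrase(text: str, rng: random.Random) -> str:
    """Toy meaning-preserving paraphraser (word swaps)."""
    return " ".join(SYNONYMS.get(w, w) if rng.random() < 0.8 else w
                    for w in text.split())

def pteb_score(evaluate, dataset, runs: int = 5):
    """Evaluate on stochastically paraphrased copies of the test
    set and aggregate across runs instead of one static pass."""
    scores = []
    for seed in range(runs):
        rng = random.Random(seed)
        scores.append(evaluate([paraphrase(t, rng) for t in dataset]))
    return mean(scores), stdev(scores)

# Dummy "evaluation": average sentence length (illustration only).
dummy = lambda data: mean(len(t.split()) for t in data)
print(pteb_score(dummy, ["the quick movie", "a car chase"], runs=3))
```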

[NLP-76] Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

【Quick Read】: This paper addresses the degradation that context-length limits cause in reinforcement learning (RL) fine-tuning for long-horizon multi-turn tool use: degraded instruction following, excessive rollout costs, and hard context-window constraints. The key is to bring summarization-based context management into training: the tool-use history is periodically compressed by LLM-generated summaries that retain task-relevant information, keeping the context compact while letting the agent scale beyond the fixed window. On this basis the authors derive a policy-gradient representation that lets standard LLM RL infrastructure optimize both tool-use behavior and the summarization strategy end to end, instantiated as SUPO (Summarization-augmented Policy Optimization). Experiments on interactive function-calling and search tasks show SUPO significantly improves success rates while keeping the working context the same length or shorter than baselines, and it scales further on complex search tasks when the test-time summarization budget exceeds that of training.

Link: https://arxiv.org/abs/2510.06727
Authors: Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. In specific, it periodically compresses the tool using history by LLM-generated summaries that retain task-relevant information to keep a compact context while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors as well as summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization-augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, SUPO can further improve the evaluation performance when scaling test-time maximum round of summarization beyond that of training time. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
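
The rollout-side mechanic can be sketched as a budget check that folds old tool-use turns into a summary. The truncating `summarize` stub stands in for the LLM summarizer, and the budget and number of verbatim turns kept are illustrative assumptions.

```python
def summarize(history: list[str]) -> str:
    """Stand-in for the LLM summarizer that keeps task-relevant facts."""
    return "SUMMARY(" + " | ".join(h[:30] for h in history) + ")"

def count_tokens(messages) -> int:
    return sum(len(m.split()) for m in messages)   # crude token proxy

def manage_context(messages, budget=64, keep=2):
    """When the running context exceeds the budget, fold the oldest
    tool-use turns into a compact summary and keep recent turns."""
    if count_tokens(messages) <= budget:
        return messages
    return [summarize(messages[:-keep])] + messages[-keep:]

ctx = [f"turn {i}: tool call and long observation text ..." for i in range(20)]
print(manage_context(ctx, budget=40))
```

During SUPO training, the summarization step itself sits inside the rollout, so the derived policy gradient covers both the tool-calling tokens and the summaries.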

[NLP-77] Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)

【Quick Read】: This paper addresses the privacy risks that limit Retrieval-Augmented Generation (RAG) in sensitive domains. Existing private RAG methods typically apply differential privacy (DP) at query time, which requires repeated noise injection and accumulates privacy loss. The key is DP-SynRAG, a framework that uses LLMs to generate a differentially private synthetic RAG database; once generated, the synthetic text can be reused, avoiding repeated noise injection and additional privacy cost. To retain the information needed for downstream RAG tasks, DP-SynRAG extends private prediction, instructing LLMs to generate text that mimics subsampled database records in a DP manner, and it outperforms state-of-the-art private RAG systems under a fixed privacy budget, offering a scalable solution for privacy-preserving RAG.

Link: https://arxiv.org/abs/2510.06719
Authors: Junki Mori, Kazuya Kakizaki, Taiki Miyagawa, Jun Sakuma
Affiliations: NEC Corporation; Institute of Science Tokyo; RIKEN Center for Advanced Intelligence Project
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under review

Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding them in external knowledge. However, its application in sensitive domains is limited by privacy risks. Existing private RAG methods typically rely on query-time differential privacy (DP), which requires repeated noise injection and leads to accumulated privacy loss. To address this issue, we propose DP-SynRAG, a framework that uses LLMs to generate differentially private synthetic RAG databases. Unlike prior methods, the synthetic text can be reused once created, thereby avoiding repeated noise injection and additional privacy costs. To preserve essential information for downstream RAG tasks, DP-SynRAG extends private prediction, which instructs LLMs to generate text that mimics subsampled database records in a DP manner. Experiments show that DP-SynRAG achieves superior performance to the state-of-the-art private RAG systems while maintaining a fixed privacy budget, offering a scalable solution for privacy-preserving RAG.

[NLP-78] XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection

【Quick Read】: This paper addresses the threat that increasingly sophisticated speech-synthesis spoofing attacks pose to automatic speaker verification (ASV), noting that even strong self-supervised learning (SSL) systems such as XLSR-Conformer leave room for architectural improvement in synthetic speech detection. The key is to replace the traditional multi-layer perceptron (MLP) in XLSR-Conformer with a Kolmogorov-Arnold Network (KAN), a powerful universal approximator based on the Kolmogorov-Arnold representation theorem. On ASVspoof2021 this yields a relative improvement of 60.55% in Equal Error Rate (EER) on the LA and DF sets and reaches 0.70% EER on the 21LA set, and the replacement proves robust across various SSL architectures, suggesting that integrating KANs into SSL-based models is a promising direction for synthetic speech detection.

Link: https://arxiv.org/abs/2510.06706
Authors: Phuong Tuan Dat, Tran Huy Dat
Affiliations: Hanoi University of Science and Technology; Institute for Infocomm Research (I2R), A*STAR
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to the 2025 IEEE International Conference on Advanced Video and Signal-Based Surveillance

Abstract:Recent advancements in speech synthesis technologies have led to increasingly sophisticated spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer architecture, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron (MLP) in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a powerful universal approximator based on the Kolmogorov-Arnold representation theorem. Our experimental results on ASVspoof2021 demonstrate that integrating KAN into the XLSR-Conformer model yields a relative improvement of 60.55% in Equal Error Rate (EER) on the LA and DF sets, further achieving 0.70% EER on the 21LA set. Besides, the proposed replacement is also robust to various SSL architectures. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.

[NLP-79] How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

【Quick Read】: This paper investigates the unclear mechanism behind content effects in large language models (LLMs), whereby the semantic plausibility of a reasoning problem biases judgments of its logical validity. The key finding is that validity and plausibility are both linearly represented and strongly aligned in the models' representational geometry, causing the two to be conflated; using steering vectors, the authors show that plausibility vectors causally bias validity judgments and vice versa, and that the degree of alignment between the two concepts predicts the magnitude of the behavioral content effect across models. They further construct debiasing vectors that disentangle the two concepts, reducing content effects and improving reasoning accuracy, indicating that representational interventions are a viable path toward more logical systems.

Link: https://arxiv.org/abs/2510.06700
Authors: Leonardo Bertolazzi, Sandro Pezzelle, Raffaelle Bernardi
Affiliations: University of Trento; University of Amsterdam; Free University of Bozen-Bolzano
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
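
A common way to build such vectors, sketched here under the assumption of a difference-of-means construction (the paper's exact recipe may differ): estimate a concept direction from hidden states of contrasting examples, steer by adding it, and debias by projecting one concept's direction out of the other.

```python
import numpy as np

def direction(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Difference-of-means concept vector from two sets of hidden
    states (rows are examples)."""
    v = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h: np.ndarray, v: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    return h + alpha * v                    # push activations along v

def debias(v_validity: np.ndarray, v_plaus: np.ndarray) -> np.ndarray:
    """Validity direction with its plausibility component projected
    out, disentangling the two aligned concepts."""
    v = v_validity - (v_validity @ v_plaus) * v_plaus
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
h_valid, h_invalid = rng.normal(1, 1, (100, 768)), rng.normal(0, 1, (100, 768))
h_plaus, h_implaus = rng.normal(0.5, 1, (100, 768)), rng.normal(0, 1, (100, 768))
v_val = direction(h_valid, h_invalid)
v_pla = direction(h_plaus, h_implaus)
print(float(debias(v_val, v_pla) @ v_pla))  # ~0: orthogonal to plausibility
```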

[NLP-80] Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks

【Quick Read】: This paper addresses the limited applicability of existing prompt-engineering methods to NLG tasks such as machine translation: they mostly optimize the instruction component for general tasks (often requiring large auxiliary LLMs), while neglecting the input component that plays the pivotal role in translation. The key is a prompt optimization method designed specifically for machine translation that employs a small-parameter model trained with a back-translation-based strategy, greatly reducing the training overhead of single-task optimization while delivering highly effective performance; with certain adaptations, the method also extends to other downstream tasks.

Link: https://arxiv.org/abs/2510.06695
Authors: Qinhao Zhou, Xiang Xiang, Kun He, John E. Hopcroft
Affiliations: Huazhong University of Science and Technology; HUST AI & Visual Learning (HAIV) Lab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In recent years, the growing interest in Large Language Models (LLMs) has significantly advanced prompt engineering, transitioning from manual design to model-based optimization. Prompts for LLMs generally comprise two components: the instruction, which defines the task or objective, and the input, which is tailored to the instruction type. In natural language generation (NLG) tasks such as machine translation, the input component is particularly critical, while the instruction component tends to be concise. Existing prompt engineering methods primarily focus on optimizing the instruction component for general tasks, often requiring large-parameter LLMs as auxiliary tools. However, these approaches exhibit limited applicability for tasks like machine translation, where the input component plays a more pivotal role. To address this limitation, this paper introduces a novel prompt optimization method specifically designed for machine translation tasks. The proposed approach employs a small-parameter model trained using a back-translation-based strategy, significantly reducing training overhead for single-task optimization while delivering highly effective performance. With certain adaptations, this method can also be extended to other downstream tasks.

[NLP-81] Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback EMNLP2025

【Quick Read】: This paper addresses the inefficiency customer-support agents face on complex conversations due to frequent context switching and redundant review. The key is an incremental summarization system that intelligently decides when to generate concise bullet notes during a conversation: a fine-tuned Mixtral-8x7B model produces continuous notes, a DeBERTa-based classifier filters trivial content, and agents' edits both refine online note generation and periodically trigger offline model retraining, closing the feedback loop. Deployed in production, the system cut average case handling time by 3% versus bulk summarization (up to 9% on highly complex cases) with high agent satisfaction, showing that incremental summarization with continuous feedback improves summary quality and agent productivity at scale.

Link: https://arxiv.org/abs/2510.06677
Authors: Yisha Wu, Cen (Mia) Zhao, Yuanpei Cao, Xiaoqing Su, Yashar Mehdad, Mindy Ji, Claire Na Cheng
Affiliations: Airbnb Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at EMNLP 2025 Industry Track

Abstract:We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents’ context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online notes generation and regularly inform offline model retraining, closing the agent edits feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.

[NLP-82] PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

【Quick Read】: This paper addresses the heavy dependence of current LLM alignment on high-quality instruction data, its acquisition cost, and the large sample sizes typically required: supervised fine-tuning (SFT) commonly uses over 300k examples yet still trails proprietary models, creating barriers for academic and resource-limited communities. The key is PiKa, a data-efficient family of expert-level alignment datasets; in particular, PiKa-SFT needs only 30k SFT examples, far fewer than state-of-the-art datasets such as Magpie, to outperform models trained on much larger public data, and on AlpacaEval 2.0 and Arena-Hard it even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. This shows that carefully constructed high-quality data can drastically reduce the need for massive annotation and offers the open-source community a scalable, reproducible path to LLM alignment.

Link: https://arxiv.org/abs/2510.06670
Authors: Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi, Hongzhi Li, Yutao Xie
Affiliations: Microsoft AI; University of California, Riverside
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness depends on high-quality instruction data. Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Even with Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is unclear how much data is actually required to fine-tune a base model into a strong instruction-following model. Current approaches often rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform compared to proprietary models, creating barriers for academic and resource-limited communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, far fewer than state-of-the-art datasets like Magpie. Through evaluations by fine-tuning Llama-3-8B-Base on PiKa and other public datasets, we show that PiKa-SFT outperforms models trained on much larger data. On AlpacaEval 2.0 and Arena-Hard benchmarks, PiKa-SFT fine-tuning even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B to 7B) on PiKa-SFT, achieving consistent gains. These findings demonstrate that high-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment. Code and data: this https URL.

[NLP-83] ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

【Quick Read】: This paper addresses the rigidity of current LLM/VLM-based agents in using neural tools: existing agents typically rely on fixed tool sets and cannot adapt tool choice to task characteristics, even though different tools excel in different scenarios. The key is ToolMem, which builds a retrievable memory of tool capabilities by summarizing the strengths and weaknesses each tool exhibited in past interactions; at inference, the agent retrieves relevant entries and selects the best tool for the task at hand. Experiments on text and text-to-image generation tools show ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately than memory-free, generic agents in text and multimodal generation scenarios, and improve optimal tool selection among multiple choices by 21% and 24% absolute, respectively.

Link: https://arxiv.org/abs/2510.06664
Authors: Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, Zora Zhiruo Wang
Affiliations: Carnegie Mellon University; University of Rochester
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Agents utilizing tools powered by large language models (LLMs) or vision-language models (VLMs) have demonstrated remarkable progress in diverse tasks across text and visual modalities. Unlike traditional tools such as calculators, which give deterministic outputs, neural tools perform with varying reliability across task scenarios. While different tools for a task may excel in varied scenarios, existing agents typically rely on fixed tools, thus limiting the flexibility in selecting the most suitable tool for specific tasks. In contrast, humans snowball their understanding of the capabilities of different tools by interacting with them, and apply this knowledge to select the optimal tool when solving a future task. To build agents that similarly benefit from this process, we propose ToolMem that enables agents to develop memories of tool capabilities from previous interactions, by summarizing their strengths and weaknesses and storing them in memory; at inference, the agent can retrieve relevant entries from ToolMem, and select the best tool to solve individual tasks more accurately. We evaluate ToolMem on learning varied text generation and text-to-image generation neural tools. Compared to no-memory, generic agents, we find ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately across text and multimodal generation scenarios. Moreover, ToolMem facilitates optimal tool selection among multiple choices by 21% and 24% absolute increases in respective scenarios.

[NLP-84] Aligning Large Language Models via Fully Self-Synthetic Data

【Quick Read】: This paper addresses large language models' dependence on expensive human-annotated data or external reward models: classic RLHF requires costly human labeling, and RLAIF still relies on collecting diverse prompts and on proprietary models such as GPT-4 to annotate preference pairs. The key is Self-Alignment Optimization (SAO), a fully self-synthetic alignment framework in which all training data (user queries, responses, and preference labels) are generated by the model itself: the LLM first engages in persona role-play to produce diverse prompts and responses, which it then self-evaluates for preference optimization. Experiments show SAO effectively enhances chat capability on benchmarks such as AlpacaEval 2.0 while maintaining strong performance on downstream objective tasks (e.g., question answering, math reasoning), offering a practical route to self-improving LLM alignment.

Link: https://arxiv.org/abs/2510.06652
Authors: Shangjian Yin, Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Yu Meng
Affiliations: University of California, Riverside; University of Virginia
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets, while Reinforcement Learning from AI Feedback (RLAIF) also incurs significant costs, requiring the collection of diverse prompts and corresponding responses, often necessitating external reward models or proprietary models like GPT-4 to annotate preference pairs. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. Specifically, SAO first instructs the LLM to engage in persona role-play and generate diverse prompts and responses, which are then self-evaluated for preference optimization. Extensive experiments demonstrate that SAO effectively enhances the model’s chat capabilities on standard benchmarks like AlpacaEval~2.0, while maintaining strong performance on downstream objective tasks (e.g., question-answering, math reasoning). Our work provides a practical solution for self-improvement in aligning LLMs, and the code for reproducing our results is available at: this https URL.

[NLP-85] A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures

【Quick Read】: This paper addresses the understudied question of how contextual information propagates across layers and tokens in State Space Models (SSMs) versus Transformer-based models (TBMs) for long-sequence processing. The key is the first unified token- and layer-level analysis, combining centered kernel alignment, stability metrics, and probing to characterize how representations evolve within and across layers. The study finds a key divergence: TBMs rapidly homogenize token representations, with diversity reemerging only in later layers, whereas SSMs preserve token uniqueness early but converge to homogenization deeper; theoretical analysis and parameter randomization further reveal that oversmoothing in TBMs stems from architectural design, while homogenization in SSMs arises mainly from training dynamics. These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.

Link: https://arxiv.org/abs/2510.06640
Authors: Nhat M. Hoang, Do Xuan Long, Cong-Duy Nguyen, Min-Yen Kan, Luu Anh Tuan
Affiliations: Nanyang Technological University; National University of Singapore
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:State Space Models (SSMs) have recently emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing, offering linear scaling and lower memory use. Yet, how contextual information flows across layers and tokens in these architectures remains understudied. We present the first unified, token- and layer-level analysis of representation propagation in SSMs and TBMs. Using centered kernel alignment, stability metrics, and probing, we characterize how representations evolve within and across layers. We find a key divergence: TBMs rapidly homogenize token representations, with diversity reemerging only in later layers, while SSMs preserve token uniqueness early but converge to homogenization deeper. Theoretical analysis and parameter randomization further reveal that oversmoothing in TBMs stems from architectural design, whereas in SSMs it arises mainly from training dynamics. These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.
zh
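摘要中用于刻画层间表示相似度的中心核对齐(CKA)可以用几行 numpy 实现。以下为线性 CKA 的最小示意(数据为随机占位,并非论文的实验设置):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """线性 CKA:衡量两组表示 (n_tokens, dim) 的相似度,取值范围 [0, 1]。"""
    X = X - X.mean(axis=0)  # 按特征维中心化
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # 线性核下的 HSIC
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

# 示例:比较同一批 token 在相邻两层的表示
rng = np.random.default_rng(0)
h3 = rng.normal(size=(256, 768))
h4 = h3 + 0.1 * rng.normal(size=(256, 768))  # 第4层 = 第3层加小扰动
print(round(linear_cka(h3, h4), 3))           # 接近 1,说明两层表示高度相似
```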

[NLP-86] Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在版权保护中的指纹识别难题,即如何在不访问模型内部参数的情况下,通过黑盒方式有效提取并验证模型的唯一性特征(“指纹”),以识别非法复制行为。现有方法因依赖模型输出而难以生成具有区分度的指纹,原因在于非线性函数导致关键参数信息丢失。其解决方案的关键在于:基于Fisher信息理论证明模型输入梯度比输出更具信息量,并提出ZeroPrint方法,在黑盒场景下利用零阶估计近似计算这些梯度;针对离散文本特性,通过语义保持的词替换模拟输入扰动,从而估计出模型的雅可比矩阵(Jacobian matrix)作为唯一指纹,显著提升了指纹识别的有效性和鲁棒性。

链接: https://arxiv.org/abs/2510.06605
作者: Shuo Shao,Yiming Li,Hongwei Yao,Yifei Chen,Yuchen Yang,Zhan Qin
机构: Zhejiang University (浙江大学); Nanyang Technological University (南洋理工大学); City University of Hong Kong (香港城市大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The substantial investment required to develop Large Language Models (LLMs) makes them valuable intellectual property, raising significant concerns about copyright protection. LLM fingerprinting has emerged as a key technique to address this, which aims to verify a model’s origin by extracting an intrinsic, unique signature (a “fingerprint”) and comparing it to that of a source model to identify illicit copies. However, existing black-box fingerprinting methods often fail to generate distinctive LLM fingerprints. This ineffectiveness arises because black-box methods typically rely on model outputs, which lose critical information about the model’s unique parameters due to the usage of non-linear functions. To address this, we first leverage Fisher Information Theory to formally demonstrate that the gradient of the model’s input is a more informative feature for fingerprinting than the output. Based on this insight, we propose ZeroPrint, a novel method that approximates these information-rich gradients in a black-box setting using zeroth-order estimation. ZeroPrint overcomes the challenge of applying this to discrete text by simulating input perturbations via semantic-preserving word substitutions. This operation allows ZeroPrint to estimate the model’s Jacobian matrix as a unique fingerprint. Experiments on the standard benchmark show ZeroPrint achieves a state-of-the-art effectiveness and robustness, significantly outperforming existing black-box methods.
zh
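ZeroPrint 的核心操作(用语义保持的词替换模拟离散输入扰动,再以输出差分近似梯度)可以用如下最小示意表达(`score()` 与同义词表均为假设的占位组件,非论文官方实现):

```python
SYNONYMS = {"quick": "fast", "happy": "glad", "big": "large"}  # 假设的语义保持替换表

def score(text: str) -> float:
    """假设的黑盒模型打分接口(实际可为某个目标 token 的对数概率)。"""
    words = text.split()
    return len(set(words)) / len(words)

def zeroth_order_fingerprint(text: str) -> list:
    """对每个可替换词做一次扰动,用有限差分近似该方向的输入梯度。"""
    base = score(text)
    grads = []
    for i, word in enumerate(text.split()):
        if word in SYNONYMS:
            words = text.split()
            words[i] = SYNONYMS[word]  # 语义保持的离散扰动
            grads.append((word, score(" ".join(words)) - base))  # 零阶估计
    return grads  # 在多条探针输入上汇总即得雅可比式指纹

print(zeroth_order_fingerprint("the quick fox is happy and big"))
```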

[NLP-87] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)面临的越狱攻击(jailbreaking)问题,即恶意用户通过精心设计的提示词诱导模型输出受限制或敏感内容。现有防御机制难以应对不断演进的攻击手法,且尚无模型具备完全抗攻击能力。论文的关键解决方案在于从模型内部表示入手,系统分析LLM在不同隐藏层对越狱提示与良性提示的响应差异,聚焦于GPT-J和状态空间模型Mamba2的层间行为特征,从而为基于内部动态建模的鲁棒越狱检测与防御提供新的研究方向。

链接: https://arxiv.org/abs/2510.06594
作者: Sri Durga Sai Sowmya Kadali,Evangelos E. Papalexakis
机构: University of California, Riverside (加州大学河滨分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs. Adversarial users often exploit these models through carefully engineered prompts to elicit restricted or sensitive outputs, a strategy widely referred to as jailbreaking. While numerous defense mechanisms have been proposed, attackers continuously develop novel prompting techniques, and no existing model can be considered fully resistant. In this study, we investigate the jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM GPT-J and the state-space model Mamba2, presenting preliminary findings that highlight distinct layer-wise behaviors. Our results suggest promising directions for further research on leveraging internal model dynamics for robust jailbreak detection and defense.
zh

[NLP-88] TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents EMNLP2025

【速读】: 该论文旨在解决自动研究(automatic research)中因多智能体系统、规划、工具调用、代码执行及人机交互等复杂工作流所带来的扩展性和维护性难题,尤其在算法与架构持续演进背景下愈发突出。其解决方案的关键在于提出一个名为TinyScientist的交互式、可扩展且可控的框架,该框架通过识别自动研究工作流的核心组件,实现了对新工具的灵活适配和迭代式增长能力,从而显著降低复杂工作流的构建与维护门槛。

链接: https://arxiv.org/abs/2510.06579
作者: Haofei Yu,Keyang Xuan,Fenghai Li,Kunlun Zhu,Zijie Lei,Jiaxun Zhang,Ziheng Qi,Kyle Richardson,Jiaxuan You
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Allen Institute for Artificial Intelligence (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注: 7 pages, EMNLP 2025 Demo track

点击查看摘要

Abstract:Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi-agent systems, planning, tool usage, code execution, and human-agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth. We provide an open-source codebase, an interactive web demonstration, and a PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to every researcher and developer.
zh

[NLP-89] The Algebra of Meaning: Why Machines Need Montague More Than Moore's Law

【速读】: 该论文旨在解决当前语言模型在生成内容时频繁出现幻觉(hallucination)、对指令的脆弱响应以及合规性结果不透明等问题,这些问题本质上源于缺乏类型理论语义(type-theoretic semantics),而非数据量或模型规模的限制。其核心解决方案是将对齐(alignment)重新建模为一个解析(parsing)问题:通过构建一个神经符号架构 Savassan,将自然语言输入编译为蒙塔古风格(Montague-style)的逻辑形式,并映射到扩展了道义算子(deontic operators)与管辖权上下文(jurisdictional contexts)的类型化本体(typed ontologies)。关键创新在于“一次解析、多域投影”机制——即系统首先识别输入的语义结构(如缺陷索赔),再将其自动映射至多个法律本体(如韩国/日本的诽谤风险、美国的受保护言论、欧盟GDPR合规),从而生成可解释且跨司法管辖区一致的合规建议,而非简单的二元屏蔽决策。这一方法实现了描述性、规范性和法律责任维度的统一建模,为可信自主系统提供了基于组合类型的语义推理框架。

链接: https://arxiv.org/abs/2510.06559
作者: Cheonkam Jeong,Sungdo Kim,Jewoo Park
机构: Savassan
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Contemporary language models are fluent yet routinely mis-handle the types of meaning their outputs entail. We argue that hallucination, brittle moderation, and opaque compliance outcomes are symptoms of missing type-theoretic semantics rather than data or scale limitations. Building on Montague’s view of language as typed, compositional algebra, we recast alignment as a parsing problem: natural-language inputs must be compiled into structures that make explicit their descriptive, normative, and legal dimensions under context. We present Savassan, a neuro-symbolic architecture that compiles utterances into Montague-style logical forms and maps them to typed ontologies extended with deontic operators and jurisdictional contexts. Neural components extract candidate structures from unstructured inputs; symbolic components perform type checking, constraint reasoning, and cross-jurisdiction mapping to produce compliance-aware guidance rather than binary censorship. In cross-border scenarios, the system “parses once” (e.g., defect claim(product x, company y)) and projects the result into multiple legal ontologies (e.g., defamation risk in KR/JP, protected opinion in US, GDPR checks in EU), composing outcomes into a single, explainable decision. This paper contributes: (i) a diagnosis of hallucination as a type error; (ii) a formal Montague-ontology bridge for business/legal reasoning; and (iii) a production-oriented design that embeds typed interfaces across the pipeline. We outline an evaluation plan using legal reasoning benchmarks and synthetic multi-jurisdiction suites. Our position is that trustworthy autonomy requires compositional typing of meaning, enabling systems to reason about what is described, what is prescribed, and what incurs liability within a unified algebra of meaning.
zh

[NLP-90] he Markovian Thinker

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在训练生成式 AI(Generative AI)模型进行长链思维(Long Chain of Thought, LongCoT)推理时所面临的计算复杂度问题:传统RL环境的状态空间随推理长度线性增长,导致注意力机制的计算量呈二次增长,严重限制了可扩展性。解决方案的关键在于提出**马尔可夫式思维(Markovian Thinking)**范式,其核心是将推理过程结构化为固定大小的状态块(chunk),在每个块内模型按常规方式推理,在边界处重置上下文并仅保留一个短文本状态作为“携带信息”(carryover),从而实现状态维度恒定、推理长度与上下文长度解耦。这一设计使计算复杂度从二次降至线性,同时通过RL训练策略学习如何生成有效的状态表示,使得模型可在有限上下文下完成远超上下文长度的推理任务(如8K token chunk中完成24K token推理),显著提升效率与可扩展性。

链接: https://arxiv.org/abs/2510.06557
作者: Milad Aghajohari,Kamran Chitsaz,Amirhossein Kazemnejad,Sarath Chandar,Alessandro Sordoni,Aaron Courville,Siva Reddy
机构: Mila; Microsoft Research; McGill University; ServiceNow Research; Canada CIFAR AI Chair; Chandar Research Lab; Polytechnique Montréal; Université de Montréal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL “thinking environment”, where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
zh
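Delethink 环境“分块推理 + 边界重置 + 短携带状态”的控制流可以用下面的最小示意说明(`llm_generate()` 为假设的占位接口,状态标记格式系笔者虚构,非论文官方实现):

```python
def llm_generate(prompt: str, max_tokens: int) -> str:
    """假设的模型接口:在 prompt 上续写至多 max_tokens 个 token(占位实现)。"""
    return "...分块推理内容... [STATE] 已将方程化简为 x = 3 [/STATE]"

def extract_carryover(chunk: str) -> str:
    """取出块尾的文本状态作为携带信息(此处用显式标记示意)。"""
    if "[STATE]" in chunk:
        return chunk.split("[STATE]")[-1].split("[/STATE]")[0].strip()
    return chunk[-200:]  # 退化情形:直接截取块尾文本

def markovian_think(question: str, chunk_size: int = 8000, n_chunks: int = 3) -> str:
    carryover = ""
    for _ in range(n_chunks):
        # 每个块的上下文 = 原始问题 + 固定长度的携带状态,而非全部推理历史
        prompt = f"{question}\n[之前的推理状态] {carryover}"
        chunk = llm_generate(prompt, max_tokens=chunk_size)
        carryover = extract_carryover(chunk)  # 块边界处重置上下文
    return carryover

print(markovian_think("求解方程 2x + 1 = 7"))
```

由于每块的状态大小恒定,注意力计算量随总推理长度线性增长,而非二次增长。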

[NLP-91] Flipping the Dialogue: Training and Evaluating User Language Models

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在多轮对话场景下性能评估失真的问题,即传统用户模拟方法使用原本训练为“助手机器人”的LLM充当用户角色,导致仿真环境过于理想化,无法真实反映人类用户的复杂交互行为。解决方案的关键在于提出专门设计的“用户语言模型”(User Language Models, User LMs),这些模型通过后训练(post-training)被优化为模拟真实人类用户在多轮对话中的不完美、动态且多样化的表达方式,从而显著提升模拟的真实性与鲁棒性。实验表明,采用User LMs进行模拟时,强助理模型(如GPT-4o)在代码和数学任务上的表现从74.6%下降至57.4%,验证了更贴近现实的用户行为对助手能力构成实质性挑战。

链接: https://arxiv.org/abs/2510.06552
作者: Tarek Naous,Philippe Laban,Wei Xu,Jennifer Neville
机构: Microsoft Research (微软研究院); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user’s request. To satisfy this specific role, LMs are post-trained to be helpful assistants – optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often prompting an LLM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.
zh

[NLP-92] From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining NEURIPS2025

【速读】: 该论文旨在解决多阶段预训练(multi-stage pretraining)中 bootstrapped pretraining(即利用已预训练的基础模型进行进一步预训练)的有效性问题,尤其是当基础模型已经过量预训练时,其提升效果是否仍具性价比。解决方案的关键在于通过实证研究揭示了 bootstrapped pretraining 的缩放行为规律:第二阶段预训练的性能提升随基础模型预训练 token 数量增加而呈现对数衰减趋势,且整体缩放关系可由一个简洁的 scaling law 准确建模。这一发现揭示了多阶段预训练中的根本权衡——基础模型越充分预训练,后续 bootstrapping 所带来的边际收益越低,从而为高效语言模型训练提供了实用指导,并引发对过度预训练模型再利用价值的重新审视。

链接: https://arxiv.org/abs/2510.06548
作者: Seng Pei Liew,Takuya Kato
机构: SB Intuitions
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages, 11 figures, an abridged version to appear in NeurIPS 2025 LLM Evaluation Workshop

点击查看摘要

Abstract:Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: The scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. Such saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
zh
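摘要只给出定性描述(二阶段缩放指数随一阶段预训练 token 数对数下降),未给出公式原型。下式是按这一描述写出的一种假设性参数化,仅作示意,并非论文原式:

```latex
% 假设性参数化:D_1、D_2 分别为一、二阶段预训练 token 数,L 为二阶段损失
L(D_1, D_2) = E + A \cdot D_2^{-\beta(D_1)}, \qquad
\beta(D_1) = \beta_0 - \beta_1 \log D_1
```

其中 \beta_1 > 0 刻画“基础模型预训练越充分,二阶段 bootstrapping 的边际收益越低”的饱和效应。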

[NLP-93] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在模仿学习(imitation learning)范式下存在的训练-生成差距(training-generation gap)问题,该差距限制了模型的鲁棒推理能力。传统方法依赖大规模文本语料进行预训练,但缺乏对行为策略的显式优化,导致性能瓶颈。解决方案的关键在于提出Webscale-RL数据流水线,该流水线能够将大规模预训练文档系统性地转化为百万级多样化、可验证的问答对,用于强化学习(Reinforcement Learning, RL)训练,从而构建包含120万样本、覆盖9个以上领域的Webscale-RL数据集。实验证明,基于该数据集的RL训练显著优于持续预训练和强数据精炼基线,且训练效率大幅提升:最多仅需1/100的token用量即可达到持续预训练的性能,为实现与预训练规模相当的强化学习训练提供了可行路径。

链接: https://arxiv.org/abs/2510.06499
作者: Zhepeng Cen,Haolin Chen,Shiyu Wang,Zuxin Liu,Zhiwei Liu,Ding Zhao,Silvio Savarese,Caiming Xiong,Huan Wang,Weiran Yao
机构: Salesforce AI Research (Salesforce 人工智能研究中心); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
zh
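将预训练文档批量转化为“可验证问答对”的流水线骨架可如下示意(`llm()` 为假设的占位接口,过滤规则为极简版本,非官方实现):

```python
def llm(prompt: str) -> str:
    """假设的模型调用接口(占位实现)。"""
    return "Q: 该观测站从哪一年开始记录气温数据?\nA: 1998"

def doc_to_qa(doc: str, n: int = 1) -> list:
    """示意:从预训练文档生成问答对,并做最简单的可验证性过滤。"""
    pairs = []
    for _ in range(n):
        out = llm(f"基于以下文档生成一个答案可在原文中核对的问答对:\n{doc}")
        q, a = out.split("\nA:")
        q, a = q.removeprefix("Q:").strip(), a.strip()
        if a and a in doc:  # 可验证性检查:答案必须能在原文中找到
            pairs.append({"question": q, "answer": a})
    return pairs

doc = "该观测站自1998年起记录气温数据,覆盖三个大洲。"
print(doc_to_qa(doc))
```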

[NLP-94] PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

【速读】: 该论文旨在解决基础模型(foundation models)在复杂动态环境中推理与规划能力的评估难题,以及其可扩展性限制问题。解决方案的关键在于提出一个名为PuzzlePlex的基准测试框架,该框架通过15类多样化的谜题(包括确定性和随机性游戏、单人与双人场景)构建了全面且可扩展的评估环境,支持生成更具挑战性的实例以适配模型演进。同时,论文设计了细粒度性能指标,并对比指令驱动与代码执行两种设置下前沿模型的表现,揭示出推理模型在指令设置中更优,而代码执行虽更具挑战但提供了更高效、可扩展的替代路径,从而为未来基础模型在推理、规划和泛化能力上的改进提供定向指导。

链接: https://arxiv.org/abs/2510.06475
作者: Yitao Long,Yuru Jiang,Hongjun Liu,Yilun Zhao,Jingchen Sun,Yiqiu Shen,Chen Zhao,Arman Cohan,Dennis Shasha
机构: New York University (纽约大学); Zhejiang University (浙江大学); Yale University (耶鲁大学); University at Buffalo, SUNY (纽约州立大学布法罗分校); NYU Grossman School of Medicine (纽约大学格罗斯曼医学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.
zh

[NLP-95] Test-Time Scaling of Reasoning Models for Machine Translation

【速读】: 该论文试图解决的问题是:在机器翻译(Machine Translation, MT)任务中,推理时扩展(Test-time Scaling, TTS)是否能够提升翻译质量。以往研究表明TTS能显著改善生成式AI(Generative AI)在数学和编码等推理任务中的表现,但其在MT领域的有效性尚未被充分探索。论文通过评估12个通用型推理模型(Reasoning Models, RMs)在多种领域MT基准上的表现,发现直接翻译场景下TTS带来的收益有限且不稳定;而关键解决方案在于对模型进行领域特定微调(domain-specific fine-tuning),使模型的推理过程与任务需求对齐,从而释放TTS的潜力,实现稳定且可达到最优推理深度的性能提升。此外,强制模型超出其自然停止点会损害翻译质量,但在后编辑(post-editing)场景中,TTS可有效促进自我修正,体现出其在多步自纠错工作流中的价值。

链接: https://arxiv.org/abs/2510.06471
作者: Zihao Li,Shaoxiong Ji,Jörg Tiedemann
机构: University of Helsinki (赫尔辛基大学); University of Turku (图尔库大学); ELLIS Institute Finland (芬兰ELLIS研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model’s reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.
zh

[NLP-96] Linguistically Informed Tokenization Improves ASR for Underresourced Languages

【速读】: 该论文旨在解决生成式 AI 在低资源语言(如澳大利亚原住民语言 Yan-nhangu)中自动语音识别(ASR)性能受限的问题。其关键解决方案在于采用语言学驱动的音素级(phonemic)标记化策略,相较于传统的正字法(orthographic)标记化方法,显著降低了词错误率(WER)和字符错误率(CER),从而提升了ASR在濒危语言文献记录中的实用性。此外,研究还证明了对ASR输出进行人工校正比从零开始手写转录音频更为高效,验证了ASR作为语言文档化流程中可行工具的价值。

链接: https://arxiv.org/abs/2510.06461
作者: Massimo Daul,Alessio Tosolini,Claire Bowern
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
zh

[NLP-97] A Survey on Agentic Security: Applications, Threats, and Defenses

【速读】: 该论文旨在解决生成式 AI(Generative AI)代理(LLM-agents)在网络安全领域中因自主性增强而引入的新型安全风险问题。其解决方案的关键在于构建一个全面的、结构化的安全研究框架,围绕“应用(Applications)”、“威胁(Threats)”和“防御(Defenses)”三个相互依赖的核心支柱,对超过150篇相关文献进行系统分类与分析,从而揭示当前代理架构中的新兴趋势及模型与模态覆盖方面的关键研究空白。

链接: https://arxiv.org/abs/2510.06445
作者: Asif Shahriar,Md Nafiu Rahman,Sadif Ahmed,Farig Sadeque,Md Rizwan Parvez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The rapid shift from passive LLMs to autonomous LLM-agents marks a new paradigm in cybersecurity. While these agents can act as powerful tools for both offensive and defensive operations, the very agentic context introduces a new class of inherent security risks. In this work we present the first holistic survey of the agentic security landscape, structuring the field around three interdependent pillars: Applications, Threats, and Defenses. We provide a comprehensive taxonomy of over 150 papers, explaining how agents are used, the vulnerabilities they possess, and the countermeasures designed to protect them. A detailed cross-cutting analysis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage.
zh

[NLP-98] MathRobust-LV: Evaluation of Large Language Models Robustness to Linguistic Variations in Mathematical Reasoning

【速读】: 该论文旨在解决大语言模型在数学推理任务中对语言形式变化的鲁棒性不足的问题,即当题目表述(如名称、背景、变量等)发生变化但数值结构和答案保持不变时,模型性能是否会下降。其解决方案的关键在于构建了一个名为MathRobust-LV的测试集与评估方法,该方法通过在保持问题难度和解题逻辑一致的前提下,系统性地重述高中数学题目的表面细节,从而模拟教师在教学评估中常用的变式出题方式,以更贴近真实教育场景下对模型语言鲁棒性的要求。实验表明,即使在当前广泛使用的MATH数据集上,多数模型在面对此类语言变异时准确率仍显著下降,凸显了语言鲁棒性作为模型推理能力基础挑战的重要性。

链接: https://arxiv.org/abs/2510.06430
作者: Neeraja Kirtane,Yuvraj Khanna,Peter Relan
机构: Got It Education (Got It 教育)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although MATH data benchmarking is often regarded as saturated, our experiment on 34 models reveals that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%) while stronger models also show measurable degradation. Frontier models like GPT-5, Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
zh
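“只改表面细节、不改数值结构与答案”的变体构造方式可如下示意(题目模板与词表均为笔者虚构,非官方数据构造代码):

```python
import random

TEMPLATE = "{name}买了{n}{item},每{item}{p}元,一共要花多少钱?"
NAMES = ["小明", "Sarah", "Ravi"]
ITEMS = ["个苹果", "本笔记本", "支铅笔"]

def make_variants(n: int, p: int, k: int = 3) -> list:
    """生成 k 个语言变体:人名与情境变化,数值结构与答案保持不变。"""
    answer = n * p
    rng = random.Random(0)
    return [(TEMPLATE.format(name=rng.choice(NAMES), n=n,
                             item=rng.choice(ITEMS), p=p), answer)
            for _ in range(k)]

for question, answer in make_variants(n=6, p=4):
    print(question, "->", answer)
```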

[NLP-99] Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser

【速读】: 该论文旨在解决多语言、多树库(treebank)的语篇结构分析(discourse structure analysis)中因关系类型(relation inventory)不一致导致的统一建模难题。现有方法通常需针对每个树库单独训练模型,难以实现跨资源的端到端统一解析。其关键解决方案是提出两种训练策略:一是Multi-Head方法,为每个树库独立分配关系分类层以保留原始关系定义;二是Masked-Union方法,通过选择性标签掩码实现共享参数训练,从而在不修改原有关系词表的前提下兼容不同树库的关系类型。实验表明,Masked-Union不仅参数效率更高,且整体性能优于16个单树库基线模型,验证了统一多语言语篇解析框架的有效性。

链接: https://arxiv.org/abs/2510.06427
作者: Elena Chistova
机构: Institute of System Architecture (ISA)
类目: Computation and Language (cs.CL)
备注: Accepted to CODI CRAC 2025

点击查看摘要

Abstract:We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns separate relation classification layer per inventory, and Masked-Union, which enables shared parameter training through selective label masking. We first benchmark monotreebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of a single-model, multilingual end-to-end discourse parsing across diverse resources.
zh
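Masked-Union 的“共享并集分类层 + 按树库清单选择性屏蔽”可如下示意(标签与清单均为简化示例,非官方实现):

```python
import numpy as np

UNION_LABELS = ["Elaboration", "Attribution", "Cause", "Joint", "Background"]
TREEBANK_INVENTORY = {
    "rst-dt": {"Elaboration", "Attribution", "Cause", "Background"},
    "gum":    {"Elaboration", "Joint", "Cause"},
}

def masked_predict(logits: np.ndarray, treebank: str) -> str:
    """在并集标签空间上计算 logits,再屏蔽不属于当前树库清单的标签。"""
    mask = np.array([lab in TREEBANK_INVENTORY[treebank] for lab in UNION_LABELS])
    masked = np.where(mask, logits, -np.inf)  # 选择性标签屏蔽
    return UNION_LABELS[int(np.argmax(masked))]

logits = np.array([1.2, 2.5, 0.3, 1.9, -0.5])
print(masked_predict(logits, "gum"))     # 在 gum 清单内取最大值 -> Joint
print(masked_predict(logits, "rst-dt"))  # 同一组参数服务不同清单 -> Attribution
```

训练时对屏蔽后的 logits 计算交叉熵,即可让所有树库共享同一组分类参数。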

[NLP-100] FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering EMNLP2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在回答长文本金融类问题时频繁产生幻觉(hallucination)的问题,即生成看似合理但事实错误的答案。现有方法多依赖简单的引用检索作为归因手段,但在真实金融场景中,可靠的归因需涵盖更复杂的要素。解决方案的关键在于提出FinLFQA基准,用于评估LLMs在生成长文本答案时提供可靠且细致归因的能力,具体包括从财务报告中提取支持证据、中间数值推理步骤以及领域特定的金融知识。该基准通过人工标注验证归因质量,并辅以自动评估框架,从而更全面地衡量模型在答案准确性与归因精细度上的表现。

链接: https://arxiv.org/abs/2510.06426
作者: Yitao Long,Tiansheng Hu,Yilun Zhao,Arman Cohan,Chen Zhao
机构: NYU Shanghai (纽约大学上海分校); Yale University (耶鲁大学); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) frequently hallucinate to long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback.
zh

[NLP-101] Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?

【速读】: 该论文旨在解决教师在使用虚拟实验室(Virtual Labs)进行科学教学时,难以根据自身教学目标生成与实验模拟情境一致且具有教育意义的问题这一难题。现有第三方资源常与课堂需求不匹配,而自主开发则耗时且难于扩展。其解决方案的关键在于提出了一种面向教学目标对齐的问题生成框架,通过教师与大型语言模型(Large Language Models, LLMs)的自然语言交互实现教学意图理解,并结合知识单元与关系分析实现对实验场景的理解;同时引入问题分类体系(question taxonomy)以结构化认知与教学意图,以及TELeR分类体系控制提示细节,从而提升生成问题的质量、格式契合度和可解析性。实证表明,该框架能显著提高问题质量(平均提升0.8 Likert点),并使大模型在格式遵循性和可解析性方面表现最优。

链接: https://arxiv.org/abs/2510.06411
作者: R. Alexander Knipper,Indrani Dey,Souvika Sarkar,Hari Narayanan,Sadhana Puntambekar,Santu Karmaker
机构: Bridge-AI Lab@UCF(桥接人工智能实验室@中佛罗里达大学); Department of EdPsych, University of Wisconsin-Madison(教育心理学系, 威斯康星大学麦迪逊分校); Department of CS, Wichita State University(计算机科学系, 威奇托州立大学); Department of CSSE, Auburn University(计算机与软件工程系, 奥本大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Virtual Labs offer valuable opportunities for hands-on, inquiry-based science learning, yet teachers often struggle to adapt them to fit their instructional goals. Third-party materials may not align with classroom needs, and developing custom resources can be time-consuming and difficult to scale. Recent advances in Large Language Models (LLMs) offer a promising avenue for addressing these limitations. In this paper, we introduce a novel alignment framework for instructional goal-aligned question generation, enabling teachers to leverage LLMs to produce simulation-aligned, pedagogically meaningful questions through natural language interaction. The framework integrates four components: instructional goal understanding via teacher-LLM dialogue, lab understanding via knowledge unit and relationship analysis, a question taxonomy for structuring cognitive and pedagogical intent, and the TELeR taxonomy for controlling prompt detail. Early design choices were informed by a small teacher-assisted case study, while our final evaluation analyzed over 1,100 questions from 19 open-source LLMs. With goal and lab understanding grounding questions in teacher intent and simulation context, the question taxonomy elevates cognitive demand (open-ended formats and relational types raise quality by 0.29-0.39 points), and optimized TELeR prompts enhance format adherence (80% parsability, 90% adherence). Larger models yield the strongest gains: parsability +37.1%, adherence +25.7%, and average quality +0.8 Likert points.
zh

[NLP-102] Reward Model Perspectives: Whose Opinions Do Reward Models Reward? EMNLP2025

【速读】: 该论文旨在解决奖励模型(Reward Model, RM)在语言模型对齐过程中存在的社会偏见问题,尤其是其与不同人口统计群体之间意见一致性不足及系统性强化有害刻板印象的现象。解决方案的关键在于构建一个形式化的框架以量化RM在争议话题上的主观态度和价值观,并通过提示工程(prompting)尝试引导奖励偏向特定目标群体;然而研究发现,仅靠提示调整无法有效克服RM固有的偏见,强调在偏好学习阶段需更审慎地设计和评估RM行为,以防止社会偏见在语言技术中的传播。

链接: https://arxiv.org/abs/2510.06391
作者: Elle
机构: University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published at EMNLP 2025 under the full author name “Elle”

点击查看摘要

Abstract:Reward models (RMs) are central to the alignment of language models (LMs). An RM often serves as a proxy for human preferences to guide downstream LM behavior. However, our understanding of RM behavior is limited. Our work (i) formalizes a framework for measuring the alignment of opinions captured by RMs, (ii) investigates the extent to which RMs demonstrate sociodemographic biases, and (iii) explores the effects of prompting to steer rewards towards the preferences of a target group. We study the subjective and diverse perspectives on controversial topics, which allows us to quantify RM perspectives in terms of their opinions, attitudes, and values. We show that RMs are poorly aligned with several demographic groups and can systematically reward harmful stereotypes, and steering alone is not enough to overcome these limitations. Our findings underscore the need for more careful consideration of RM behavior in model alignment during preference learning to prevent the propagation of unwanted social biases in the language technologies that we use.
zh

[NLP-103] Controllable Stylistic Text Generation with Train-Time Attribute-Regularized Diffusion

【速读】: 该论文旨在解决生成式文本中风格属性可控性不足的问题,即如何在保持语义内容不变的前提下,精确控制生成文本的特定风格属性(如情感倾向、文体特征等)。现有方法主要分为无分类器引导(Classifier-Free Guidance, CFG)和有分类器引导(Classifier Guidance, CG),其中CFG虽能较好保留语义但难以实现有效属性控制,而CG虽可提升属性对齐效果但存在采样计算开销大及分类器泛化能力弱的问题。论文提出RegDiff框架,其核心创新在于通过引入基于VAE的编码-解码架构与带属性监督的潜空间扩散模型,在训练阶段注入属性特征信息,而在推理阶段无需依赖预训练分类器即可实现高效且可控的文本生成,从而在保证重建保真度的同时显著降低计算成本并提升属性控制精度。

链接: https://arxiv.org/abs/2510.06386
作者: Fan Zhou,Chang Tian,Tim Van de Cruys
机构: KU Leuven (鲁汶大学)
类目: Computation and Language (cs.CL)
备注: Preprint under review

点击查看摘要

Abstract:Generating stylistic text with specific attributes is a key problem in controllable text generation. Recently, diffusion models have emerged as a powerful paradigm for both visual and textual generation. Existing approaches can be broadly categorized into classifier-free guidance (CFG) and classifier guidance (CG) methods. While CFG effectively preserves semantic content, it often fails to provide effective attribute control. In contrast, CG modifies the denoising trajectory using classifier gradients, enabling better attribute alignment but incurring high computational costs during sampling and suffering from classifier generalization issues. In this work, we propose RegDiff, a regularized diffusion framework that leverages attribute features without requiring a pretrained classifier during sampling, thereby achieving controllable generation with reduced computational costs. Specifically, RegDiff employs a VAE-based encoder–decoder architecture to ensure reconstruction fidelity and a latent diffusion model trained with attribute supervision to enable controllable text generation. Attribute information is injected only during training. Experiments on five datasets spanning multiple stylistic attributes demonstrate that RegDiff outperforms strong baselines in generating stylistic texts. These results validate the effectiveness of RegDiff as an efficient solution for attribute-controllable text diffusion. Our code, datasets, and resources will be released upon publication at this https URL.
zh
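RegDiff“训练时注入属性监督、采样时无需分类器”的损失结构可如下概念化示意(各模块均以线性层占位,加噪过程简化为单步,非官方实现):

```python
import torch
import torch.nn.functional as F

dim = 64
encoder   = torch.nn.Linear(128, dim)   # 占位:VAE 编码器(文本表示 -> 潜向量)
decoder   = torch.nn.Linear(dim, 128)   # 占位:VAE 解码器(保证重建保真度)
denoiser  = torch.nn.Linear(dim, dim)   # 占位:潜空间扩散去噪网络
attr_head = torch.nn.Linear(dim, 2)     # 属性分类头,仅在训练阶段使用

x    = torch.randn(8, 128)              # 一批文本表示(随机占位)
attr = torch.randint(0, 2, (8,))        # 风格属性标签

z       = encoder(x)
noise   = torch.randn_like(z)
z_noisy = z + 0.1 * noise               # 单步加噪示意(实际按噪声日程采样时间步)

loss_recon = F.mse_loss(decoder(z), x)            # 重建项:保证保真度
loss_diff  = F.mse_loss(denoiser(z_noisy), noise) # 扩散去噪目标
loss_attr  = F.cross_entropy(attr_head(z), attr)  # 属性正则:仅训练时注入
loss = loss_recon + loss_diff + 0.1 * loss_attr
loss.backward()
print(float(loss))
```

采样时丢弃 attr_head,直接在潜空间去噪后解码,因此推理阶段无需调用任何分类器。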

[NLP-104] Protecting De-identified Documents from Search-based Linkage Attacks

【速读】: 该论文旨在解决去标识化(de-identification)模型在文本处理中未能有效防范链接风险(linkage risk)的问题,即攻击者可通过提取去标识文本中的短语并在原始数据集中进行匹配,从而重新识别个体身份。解决方案的关键在于提出一种两阶段方法:首先构建文档集合中N-gram的倒排索引(inverted index),高效识别出出现在少于k个文档中的N-gram组合;随后利用大语言模型(LLM)迭代重写这些高风险片段,直至无法通过搜索实现链接攻击,同时保持文本语义完整性。实验结果表明,该方法在真实法庭案例数据集上能有效阻止基于搜索的链接攻击,且对原文内容忠实度较高。

链接: https://arxiv.org/abs/2510.06383
作者: Pierre Lison,Mark Anderson
机构: Norwegian Computing Center (挪威计算中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in less than k documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method is able to effectively prevent search-based linkages while remaining faithful to the original content.
zh
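第一步的“N-gram 倒排索引 + 低频片段识别”可用纯 Python 实现如下(示例文档为虚构,阈值 k 取 2 仅作演示):

```python
from collections import defaultdict

def ngrams(text: str, n: int = 3) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_inverted_index(docs: list) -> dict:
    """N-gram -> 出现该 N-gram 的文档 id 集合。"""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for g in ngrams(doc):
            index[g].add(doc_id)
    return index

def risky_spans(doc: str, index: dict, k: int = 2) -> list:
    """返回出现在少于 k 篇文档中的 N-gram:这些片段可被搜索链接回原文。"""
    return [g for g in ngrams(doc) if len(index[g]) < k]

docs = ["the court ruled in favor of the plaintiff on tuesday",
        "the court ruled against the appeal last week",
        "a new hearing was scheduled for next month"]
index = build_inverted_index(docs)
print(risky_spans(docs[0], index))
```

第二步再将这些高风险片段交给 LLM 迭代改写,改写后重新检索,直至 risky_spans 为空。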

[NLP-105] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)特征可解释性中的自然语言描述模糊、不一致且需人工重命名的问题。其核心解决方案是引入语义正则表达式(semantic regexes),通过组合捕捉语言与语义模式的基元(primitives)以及用于上下文化、组合和量化修饰符(modifiers),生成精确且表达力强的特征描述。该方法在定量基准和定性分析中表现出与自然语言相当的准确性,同时提供更简洁、一致的描述,并因其结构化特性支持对模型各层特征复杂度的量化分析及从个体特征到整体模型模式的自动化解释扩展。

链接: https://arxiv.org/abs/2510.06378
作者: Angie Boggust,Donghao Ren,Yannick Assogba,Dominik Moritz,Arvind Satyanarayan,Fred Hohman
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); Apple (苹果公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated interpretability aims to translate large language model (LLM) features into human understandable descriptions. However, these natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.
zh

[NLP-106] EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

【速读】: 该论文旨在解决当前大规模多模态模型在处理需要文化背景和日常常识的视觉问答(Visual Question Answering, VQA)任务时表现不佳的问题,尤其是在低资源和代表性不足的语言中。其关键解决方案是提出了一种名为Everyday Multimodal and Multilingual QA (EverydayMMQA) 的框架,并基于此构建了OASIS数据集——一个包含语音、图像与文本的多模态、多语言数据集,涵盖约92万张图像和1480万组问答对,其中370万为口语问题,支持四种输入组合(纯语音、纯文本、语音+图像、文本+图像)。该数据集聚焦英语和阿拉伯语变体,覆盖18个国家,内容设计反映真实世界情境,从而推动模型在文化感知、常识推理和语用理解等非对象识别任务上的能力提升。

链接: https://arxiv.org/abs/2510.06371
作者: Firoj Alam,Ali Ezzat Shahroor,Md. Arid Hasan,Zien Sheikh Ali,Hunzalah Hassan Bhatti,Mohamed Bayan Kmainasi,Shammur Absar Chowdhury,Basel Mousi,Fahim Dalvi,Nadir Durrani,Natasa Milic-Frayling
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Contextual Understanding, Culturally Informed

点击查看摘要

Abstract:Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally-grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With over ~0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focusing on English and Arabic varieties across 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
zh

[NLP-107] EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)和奖励模型(Reward Models, RMs)在面对全球用户多样化价值取向与风格偏好时,缺乏有效可衡量的评估基准问题。为填补现有数据集无法支持对RM引导能力进行受控评估的空白,作者提出EVALUESTEER基准,其关键在于通过合成生成165,888组偏好样本,系统性地在4个价值观维度(传统型、世俗理性型、生存导向型、自我表达型)和4个风格维度(冗余度、可读性、自信度、亲和力)上构建可控变量,从而实现对LLMs和RMs根据用户完整偏好配置选择最优响应的能力进行量化评估。实验表明,当提供完整用户画像时,最佳模型仅达到75%准确率,远低于仅提供相关偏好信息时的99%,揭示了当前RMs在识别与适配用户相关信息方面的显著局限性,为开发更具人类价值观适应性的奖励模型提供了挑战性测试平台。

链接: https://arxiv.org/abs/2510.06370
作者: Kshitish Ghate,Andy Liu,Devansh Jain,Taylor Sorensen,Atoosa Kasirzadeh,Aylin Caliskan,Mona T. Diab,Maarten Sap
机构: University of Washington (华盛顿大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: Preprint under review

点击查看摘要

Abstract:As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs’ and reward models’ (RMs) steerability towards users’ value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets that do not support controlled evaluations of RM steering, we synthetically generated 165,888 preference pairs – systematically varying pairs along 4 value dimensions (traditional, secular-rational, survival, and self-expression) and 4 style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user’s preferences. We evaluate six open-source and proprietary LLMs and RMs under sixteen systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user’s full profile of values and stylistic preferences, the best models achieve 75% accuracy at choosing the correct response, in contrast to 99% accuracy when only relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs at identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.
zh

[NLP-108] LLM Bias Detection and Mitigation through the Lens of Desired Distributions EMNLP2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)输出分布偏离预期目标的问题,即模型在性别-职业关联等社会敏感维度上未能准确反映现实世界分布或公平分配目标。传统偏见缓解方法多聚焦于促进社会平等和人口统计学上的均等性,而本文提出将偏见定义为模型输出分布与期望分布之间的偏差,该期望分布可为真实世界数据分布或理想化的平等分布。其解决方案的关键在于提出一种基于加权自适应损失的微调方法,通过动态调整训练过程中不同样本的权重,使模型在保持语言建模能力的同时,有效对齐至指定的目标分布;实验表明,在美国劳工统计数据构建的三类职业集合(男性主导、女性主导、性别平衡)下,该方法可在平等目标下实现近乎完全的偏见消除,在真实世界目标下实现30%-75%的偏见降低,尤其在自回归LLM中表现出显著效果。

链接: https://arxiv.org/abs/2510.06354
作者: Ingroj Shrestha,Padmini Srinivasan
机构: University of Iowa (爱荷华大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025

点击查看摘要

Abstract:Although prior work on bias mitigation has focused on promoting social equality and demographic parity, less attention has been given to aligning LLM’s outputs to desired distributions. For example, we might want to align a model with real-world distributions to support factual grounding. Thus, we define bias as deviation from a desired distribution, which may be an equal or real-world distribution, depending on application goals. We propose a weighted adaptive loss based fine-tuning method that aligns LLM’s gender-profession output distribution with the desired distribution, while preserving language modeling capability. Using 3 profession sets – male-dominated, female-dominated, and gender-balanced – derived from U.S. labor statistics (2024), we assess both our adaptive method for reflecting reality and a non-adaptive variant for equality. Across three masked language models, bias is observed under both distributions. We achieve near-complete mitigation under equality and 30-75% reduction under real-world settings. Autoregressive LLMs show no bias under equality but notable bias under real-world settings, with the Llama Instruct models (3.2-3B, 3.1-8B) achieving a 50-62% reduction.
zh
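“按模型当前分布与期望分布的偏差自适应加权”的核心思想可如下示意(数值与分布均为假设示例,非论文官方实现):

```python
def adaptive_weights(model_dist: dict, desired: dict) -> dict:
    """欠代表的类别权重上调,过代表的下调;权重 = 期望占比 / 当前占比。"""
    return {g: desired[g] / max(model_dist[g], 1e-6) for g in desired}

# 期望分布可取平等分布(0.5/0.5),也可换成劳工统计等真实世界分布
desired    = {"she": 0.5, "he": 0.5}
model_dist = {"she": 0.9, "he": 0.1}  # 模型当前对某职业语境的性别预测分布

weights = adaptive_weights(model_dist, desired)
print({g: round(w, 2) for g, w in weights.items()})  # he 被上调,she 被下调
# 微调时,每个 (职业, 性别) 样本的交叉熵损失乘以对应权重,
# 并随训练过程用最新的 model_dist 重新计算权重,即“自适应”之所在。
```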

[NLP-109] Asking For It: Question-Answering for Predicting Rule Infractions in Online Content Moderation

【速读】: 该论文旨在解决在线社区中内容 moderation(内容审核)的透明性、治理一致性与自动化难题,这些问题源于社区规则(community rules)在不同平台间差异大、随时间演变且执行不一致。解决方案的关键在于提出 ModQ,一个基于问答(question-answering)范式的规则敏感型内容审核框架,其核心创新在于推理时条件化于完整的社区规则集,从而精准识别适用于特定评论的违规规则。该方法通过抽取式(extractive)和多选式(multiple-choice)两种 QA 模型实现,并在 Reddit 和 Lemmy 的大规模数据集上训练,展现出对未见过的社区和规则的良好泛化能力,同时保持轻量级与可解释性,为低资源和动态治理场景提供有效支持。

链接: https://arxiv.org/abs/2510.06350
作者: Mattia Samory,Diana Pamfile,Andrew To,Shruti Phadke
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted at ICWSM 2026

点击查看摘要

Abstract:Online communities rely on a mix of platform policies and community-authored rules to define acceptable behavior and maintain order. However, these rules vary widely across communities, evolve over time, and are enforced inconsistently, posing challenges for transparency, governance, and automation. In this paper, we model the relationship between rules and their enforcement at scale, introducing ModQ, a novel question-answering framework for rule-sensitive content moderation. Unlike prior classification or generation-based approaches, ModQ conditions on the full set of community rules at inference time and identifies which rule best applies to a given comment. We implement two model variants - extractive and multiple-choice QA - and train them on large-scale datasets from Reddit and Lemmy, the latter of which we construct from publicly available moderation logs and rule descriptions. Both models outperform state-of-the-art baselines in identifying moderation-relevant rule violations, while remaining lightweight and interpretable. Notably, ModQ models generalize effectively to unseen communities and rules, supporting low-resource moderation settings and dynamic governance environments.
zh

[NLP-110] Type and Complexity Signals in Multilingual Question Representations EMNLP2025

【速读】: 该论文旨在解决多语言Transformer模型如何表征疑问句的形态句法特性这一问题,特别是不同语言中疑问类型与复杂度的表示能力。其解决方案的关键在于构建了一个包含七种语言的Question Type and Complexity (QTC) 数据集,并引入依赖长度(dependency length)、树深度(tree depth)和词汇密度(lexical density)等复杂度指标进行标注;同时,通过扩展探针(probing)方法至回归标签并引入选择性控制(selectivity controls),量化模型在不同层上对结构复杂性的捕捉能力,从而系统比较冻结的Glot500-m预训练表示、子词TF-IDF基线和微调模型的表现,揭示了神经探针在捕捉细粒度结构复杂性上的优势及参数更新对预训练语言信息可用性的影响。

链接: https://arxiv.org/abs/2510.06304
作者: Robin Kokot,Wessel Poelman
机构: 未知
类目: Computation and Language (cs.CL)
备注: Workshop on Multilingual Representation Learning at EMNLP 2025

点击查看摘要

Abstract:This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce the availability of pre-trained linguistic information.
zh
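“回归标签探针 + 选择性控制”的评估流程可如下示意(表示与标签均为合成占位数据,并非 QTC 数据集;选择性定义为真实任务与打乱标签控制任务的成绩差):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 1000, 256
H = rng.normal(size=(n, dim))                     # 某一层的冻结句表示(随机占位)
y = H[:, :5].sum(axis=1) + rng.normal(0, 0.1, n)  # 复杂度标签,如依赖长度(合成)
y_control = rng.permutation(y)                    # 控制任务:打乱标签

def probe_r2(H: np.ndarray, y: np.ndarray) -> float:
    Htr, Hte, ytr, yte = train_test_split(H, y, random_state=0)
    return r2_score(yte, Ridge(alpha=1.0).fit(Htr, ytr).predict(Hte))

real, control = probe_r2(H, y), probe_r2(H, y_control)
print(f"真实任务 R2={real:.3f}, 控制任务 R2={control:.3f}, 选择性={real - control:.3f}")
```

选择性越高,说明探针成绩反映的是表示中真实编码的信息,而非探针自身的拟合能力。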

[NLP-111] Reproducibility Study of “XRec: Large Language Models for Explainable Recommendation”

【速读】: 该论文旨在解决推荐系统中解释性不足的问题,即如何利用大语言模型(Large Language Models, LLMs)生成用户可理解且个性化的推荐解释。其解决方案的关键在于提出XRec框架,这是一个模型无关的协同指令微调方法,通过引入协作信息(collaborative information)增强LLM的解释能力,从而生成更准确、更具个性化特征的推荐理由。该框架的核心创新在于利用Mixture of Experts(MoE)模块中的嵌入结构来融合协同信号与语言建模能力,有效提升解释的稳定性和相关性。

链接: https://arxiv.org/abs/2510.06275
作者: Ranjan Mishra,Julian I. Bibo,Quinten van Engelen,Henk Schaapman
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this study, we reproduced the work done in the paper “XRec: Large Language Models for Explainable Recommendation” by Ma et al. (2024). The original authors introduced XRec, a model-agnostic collaborative instruction-tuning framework that enables large language models (LLMs) to provide users with comprehensive explanations of generated recommendations. Our objective was to replicate the results of the original paper, albeit using Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. We built on the source code provided by Ma et al. (2024) to achieve our goal. Our work extends the original paper by modifying the input embeddings or deleting the output embeddings of XRec’s Mixture of Experts module. Based on our results, XRec effectively generates personalized explanations and its stability is improved by incorporating collaborative information. However, XRec did not consistently outperform all baseline models in every metric. Our extended analysis further highlights the importance of the Mixture of Experts embeddings in shaping the explanation structures, showcasing how collaborative signals interact with language modeling. Through our work, we provide an open-source evaluation implementation that enhances accessibility for researchers and practitioners alike. Our complete code repository can be found at this https URL.
zh

[NLP-112] Language models for longitudinal analysis of abusive content in Billboard Music Charts

【速读】: 该论文旨在解决流行音乐中滥用和性内容显著增加但缺乏实证研究支持的问题,以推动有效政策制定。其关键解决方案是采用深度学习方法对过去七十年美国公告牌(Billboard)榜单歌曲歌词进行纵向分析,结合情感分析与暴力/不当语言检测技术,量化并识别歌词内容演变趋势,从而揭示社会规范和语言使用随时间变化的细微模式。

链接: https://arxiv.org/abs/2510.06266
作者: Rohitash Chandra,Yathin Suresh,Divyansh Raj Sinha,Sanchit Jindal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:There is no doubt that there has been a drastic increase in abusive and sexually explicit content in music, particularly in Billboard Music Charts. However, there is a lack of studies that validate the trend for effective policy development, as such content is linked to harmful behavioural changes in children and youths. In this study, we utilise deep learning methods to analyse songs (lyrics) from Billboard Charts of the United States in the last seven decades. We provide a longitudinal study using deep learning and language models and review the evolution of content using sentiment analysis and abuse detection, including sexually explicit content. Our results show a significant rise in explicit content in popular music from 1990 onwards. Furthermore, we find an increasing prevalence of songs with lyrics containing profane, sexually explicit, and otherwise inappropriate language. The longitudinal analysis demonstrates the ability of language models to capture nuanced patterns in lyrical content, reflecting shifts in societal norms and language use over time.
zh

[NLP-113] A Comprehensive Survey of Hallucination in Large Language Models: Causes, Detection, and Mitigation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的幻觉(hallucination)问题,即模型生成内容虽语法正确、表达流畅,但事实性错误或缺乏外部证据支持,从而损害其在高可靠性场景下的可信度。解决方案的关键在于系统性地梳理和分类幻觉的成因、检测方法与缓解策略:首先构建涵盖数据收集、模型架构设计到推理阶段的幻觉根源分析框架;其次提出结构化的检测方法分类体系与缓解策略分类体系,并评估现有基准测试与评价指标的有效性;最终识别当前研究的局限性并指明未来提升LLMs真实性与可信度的研究方向。

链接: https://arxiv.org/abs/2510.06265
作者: Aisha Alansari,Hamzah Luqman
机构: King Fahd University of Petroleum and Minerals (国王法赫德石油矿产大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed natural language processing, achieving remarkable performance across diverse tasks. However, their impressive fluency often comes at the cost of producing false or fabricated information, a phenomenon known as hallucination. Hallucination refers to the generation of content by an LLM that is fluent and syntactically correct but factually inaccurate or unsupported by external evidence. Hallucinations undermine the reliability and trustworthiness of LLMs, especially in domains requiring factual accuracy. This survey provides a comprehensive review of research on hallucination in LLMs, with a focus on causes, detection, and mitigation. We first present a taxonomy of hallucination types and analyze their root causes across the entire LLM development lifecycle, from data collection and architecture design to inference. We further examine how hallucinations emerge in key natural language generation tasks. Building on this foundation, we introduce a structured taxonomy of detection approaches and another taxonomy of mitigation strategies. We also analyze the strengths and limitations of current detection and mitigation approaches and review existing evaluation benchmarks and metrics used to quantify LLMs hallucinations. Finally, we outline key open challenges and promising directions for future research, providing a foundation for the development of more truthful and trustworthy LLMs.
zh

[NLP-114] Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians

【速读】: 该论文旨在解决电子健康记录(Electronic Health Records, EHRs)中海量非结构化临床数据对急诊医生造成的信息过载问题,提出了一种完全离线运行的两阶段摘要生成系统,以在不依赖云端计算的前提下实现高效、隐私安全的临床信息提炼。解决方案的关键在于采用双设备架构:第一阶段由Jetson Nano-R设备负责基于本地存储的EHR数据检索与语义分段后的相关文本片段;第二阶段由Jetson Nano-S设备利用本地部署的小语言模型(Small Language Model, SLM)生成结构化摘要,包括固定格式的关键发现列表和针对临床查询的上下文相关叙述。整个流程通过轻量级套接字通信连接两个设备,并通过LLM-as-Judge评估机制确保摘要在事实准确性、完整性与清晰度上的质量,实验证明该系统可在30秒内完成有效摘要生成。

链接: https://arxiv.org/abs/2510.06263
作者: Jiajun Wu,Swaleh Zaidi,Braden Teitge,Henry Leung,Jiayu Zhou,Jessalyn Holodinsky,Steve Drew
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the IEEE Annual Congress on Artificial Intelligence of Things (IEEE AIoT) 2025

点击查看摘要

Abstract:Electronic health records (EHRs) contain extensive unstructured clinical data that can overwhelm emergency physicians trying to identify critical information. We present a two-stage summarization system that runs entirely on embedded devices, enabling offline clinical summarization while preserving patient privacy. In our approach, a dual-device architecture first retrieves relevant patient record sections using the Jetson Nano-R (Retrieve), then generates a structured summary on another Jetson Nano-S (Summarize), communicating via a lightweight socket link. The summarization output is two-fold: (1) a fixed-format list of critical findings, and (2) a context-specific narrative focused on the clinician’s query. The retrieval stage uses locally stored EHRs, splits long notes into semantically coherent sections, and searches for the most relevant sections per query. The generation stage uses a locally hosted small language model (SLM) to produce the summary from the retrieved text, operating within the constraints of two NVIDIA Jetson devices. We first benchmarked six open-source SLMs under 7B parameters to identify viable models. We incorporated an LLM-as-Judge evaluation mechanism to assess summary quality in terms of factual accuracy, completeness, and clarity. Preliminary results on MIMIC-IV and de-identified real EHRs demonstrate that our fully offline system can effectively produce useful summaries in under 30 seconds.
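
A rough sketch of the retrieve-then-summarize split described above: EHR sections are ranked against the clinician's query and the top hits are shipped to the summarizer device over a plain TCP socket. TF-IDF retrieval, the JSON message format, the hostname, and the port are placeholder assumptions, not the paper's actual components.

```python
import json
import socket
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_sections(sections, query, k=3):
    """Rank semantically split EHR sections against the clinician's query."""
    vec = TfidfVectorizer().fit(sections + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(sections))[0]
    top = sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)[:k]
    return [sections[i] for i in top]

def send_to_summarizer(query, sections, host="jetson-nano-s.local", port=9000):
    """Lightweight socket link from the Retrieve device to the Summarize device."""
    payload = json.dumps({"query": query, "sections": sections}).encode()
    with socket.create_connection((host, port)) as s:
        s.sendall(len(payload).to_bytes(4, "big") + payload)  # length-prefixed frame
```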

[NLP-115] Prakriti200: A Questionnaire-Based Dataset of 200 Ayurvedic Prakriti Assessments

【Quick Read】: This paper addresses the lack of standardized, structured data for assessing individual constitution (Prakriti) in traditional Ayurveda, which has hindered its use in computational intelligence and personalized health analytics. The key to the solution is a standardized bilingual (English-Hindi) Prakriti assessment questionnaire of 24 multiple-choice items, designed according to AYUSH/CCRAS guidelines to ensure comprehensive and accurate data collection; dosha labels (Vata, Pitta, Kapha) are hidden from participants to reduce cognitive bias, and automated scoring via Google Forms maps individual traits to dosha-specific scores. The result is a structured dataset suitable for machine learning modeling and statistical analysis, providing a reliable data foundation for personalized health prediction and the development of intelligent health applications.

Link: https://arxiv.org/abs/2510.06262
Authors: Aryan Kumar Singh, Janvi Singh
Institutions: Indian Institute of Science (IISc); Jawaharlal Nehru University (JNU)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 4 figures

Abstract:This dataset provides responses to a standardized, bilingual (English-Hindi) Prakriti Assessment Questionnaire designed to evaluate the physical, physiological, and psychological characteristics of individuals according to classical Ayurvedic principles. The questionnaire consists of 24 multiple-choice items covering body features, appetite, sleep patterns, energy levels, and temperament. It was developed following AYUSH/CCRAS guidelines to ensure comprehensive and accurate data collection. All questions are mandatory and neutrally phrased to minimize bias, and dosha labels (Vata, Pitta, Kapha) are hidden from participants. Data were collected via a Google Forms deployment, enabling automated scoring of responses to map individual traits to dosha-specific scores. The resulting dataset provides a structured platform for research in computational intelligence, Ayurvedic studies, and personalized health analytics, supporting analysis of trait distributions, correlations, and predictive modeling. It can also serve as a reference for future Prakriti-based studies and the development of intelligent health applications.
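
Since the questionnaire hides dosha labels and scores responses automatically, the mapping step reduces to counting which dosha each chosen option points to. The option-to-dosha key below is purely illustrative; the dataset's actual answer key is not reproduced here.

```python
from collections import Counter

# Illustrative fragment of a 24-item answer key: question id -> option -> dosha.
ANSWER_KEY = {
    1: {"A": "Vata", "B": "Pitta", "C": "Kapha"},
    2: {"A": "Vata", "B": "Pitta", "C": "Kapha"},
}

def score_prakriti(responses):
    """Map one participant's answers to Vata/Pitta/Kapha scores."""
    return dict(Counter(ANSWER_KEY[q][opt] for q, opt in responses.items()))

print(score_prakriti({1: "A", 2: "B"}))  # {'Vata': 1, 'Pitta': 1}
```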

[NLP-116] AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning

【Quick Read】: This paper tackles two bottlenecks in foundation model (FM) reasoning: limited model-intrinsic capacity and unreliable test-time iteration. The key to the solution is AlphaApollo, a self-evolving agentic reasoning system that orchestrates multiple models equipped with professional tools to enable deliberate, verifiable reasoning. The system couples computation tools (Python with numerical and symbolic libraries) and retrieval tools (task-relevant external information) to execute exact calculations and ground decisions in external evidence, and it supports multi-round, multi-model solution evolution via a shared state map that records candidates, executable checks, and feedback for iterative refinement, substantially improving FM reasoning performance and reliability.

Link: https://arxiv.org/abs/2510.06261
Authors: Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Linrui Xu, Tian Cheng, Guanyu Jiang, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, Bo Han
Institutions: TMLR Group; Department of Computer Science; Hong Kong Baptist University; RIKEN AIP; Cooperative Medianet Innovation Center; Shanghai Jiao Tong University; Stanford University; Sydney AI Centre; The University of Sydney; The University of Tokyo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Ongoing project

Abstract:We present AlphaApollo, a self-evolving agentic reasoning system that aims to address two bottlenecks in foundation model (FM) reasoning: limited model-intrinsic capacity and unreliable test-time iteration. AlphaApollo orchestrates multiple models with professional tools to enable deliberate, verifiable reasoning. It couples (i) a computation tool (Python with numerical and symbolic libraries) and (ii) a retrieval tool (task-relevant external information) to execute exact calculations and ground decisions. The system further supports multi-round, multi-model solution evolution via a shared state map that records candidates, executable checks, and feedback for iterative refinement. In evaluations on AIME 2024/2025 across multiple models, AlphaApollo delivers consistent gains: +5.15% Average@32 and +23.34% Pass@32 for Qwen2.5-14B-Instruct, and +8.91% Average@32 with +26.67% Pass@32 for Llama-3.3-70B-Instruct. Tool-use analysis shows that more than 80% of tool calls are successfully executed, with consistent outperformance of non-tool baselines, thereby lifting the capability ceiling of FMs. More empirical results and implementation details will be updated at this https URL.

[NLP-117] Scalable multilingual PII annotation for responsible AI in LLMs

【Quick Read】: This paper addresses the unreliability of large language models (LLMs) in handling personally identifiable information (PII) across multilingual and multi-regulatory environments. The key to the solution is a scalable multilingual data curation framework that uses a phased, human-in-the-loop annotation methodology, combining linguistic expertise with rigorous quality assurance, to produce high-quality annotations for roughly 336 locale-specific PII types across 13 underrepresented locales. By leveraging inter-annotator agreement metrics and root-cause analysis, the framework systematically uncovers and resolves annotation inconsistencies, yielding high-fidelity datasets for supervised LLM fine-tuning that markedly improve recall and reduce false-positive rates.

Link: https://arxiv.org/abs/2510.06250
Authors: Bharti Meena, Joanna Skubisz, Harshit Rajgarhia, Nand Dave, Kiran Ganesh, Shivali Dalmia, Abhishek Mukherji, Vasudevan Sundarababu, Olga Pospelova
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:As Large Language Models (LLMs) gain wider adoption, ensuring their reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts has become essential. This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales, covering approximately 336 locale-specific PII types. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, leading to substantial improvements in recall and false positive rates from pilot, training, and production phases. By leveraging inter-annotator agreement metrics and root-cause analysis, the framework systematically uncovers and resolves annotation inconsistencies, resulting in high-fidelity datasets suitable for supervised LLM fine-tuning. Beyond reporting empirical gains, we highlight common annotator challenges in multilingual PII labeling and demonstrate how iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability.

[NLP-118] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B

【Quick Read】: This paper addresses the scarcity of high-quality training data for machine translation of India's many low-resource languages (LRLs), in particular improving translation quality from LRLs into high-resource languages (HRLs). The key to the solution is TRepLiNa, a joint method that combines Centered Kernel Alignment (CKA) with the REPINA regularization technique, enforcing cross-lingual representation similarity in specific mid-level decoder layers to improve LRL translation at low computational cost, with especially clear gains in data-scarce zero-shot, few-shot, and fine-tuning settings.

Link: https://arxiv.org/abs/2510.06249
Authors: Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi
Institutions: Saarland University; German Research Center for Artificial Intelligence (DFKI)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India’s most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
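
A minimal sketch of a TRepLiNa-style objective, assuming linear CKA over centered mid-layer hidden states and a simple L2 pull toward the pretrained weights as a stand-in for REPINA; the layer choice and loss weights are illustrative, not the paper's settings.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between two [tokens, hidden] representation matrices."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = (y.T @ x).norm() ** 2
    den = (x.T @ x).norm() * (y.T @ y).norm()
    return num / (den + 1e-8)

def treplina_loss(task_loss, h_lrl, h_hrl, params, params_pretrained,
                  lam_cka=0.1, lam_rep=0.01):
    align = 1.0 - linear_cka(h_lrl, h_hrl)       # pull LRL/HRL states together
    drift = sum((p - p0).pow(2).sum()            # stay close to the pretrained model
                for p, p0 in zip(params, params_pretrained))
    return task_loss + lam_cka * align + lam_rep * drift
```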

[NLP-119] Evaluating Embedding Frameworks for Scientific Domain

【Quick Read】: This paper addresses the choice of word representation and tokenization methods for the scientific domain, where the same word can carry different meanings across domains and contexts, and where pre-training complex models (e.g., generative AI or Transformer architectures) from scratch is computationally prohibitive. The key to the solution is a comprehensive evaluation suite, consisting of several downstream tasks and corresponding scientific-domain datasets, that is used to systematically test and compare word representation and tokenization algorithms (including newly introduced ones), providing a reusable and extensible benchmarking framework for scientific text processing.

Link: https://arxiv.org/abs/2510.06244
Authors: Nouman Ahmed, Ronin Wu, Victor Botev
Institutions: Iris.ai; University of Oxford; QunaSys
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Finding an optimal word representation algorithm is particularly important in terms of domain-specific data, as the same word can have different meanings and hence, different representations depending on the domain and context. While generative AI and transformer architectures do a great job at generating contextualized embeddings for any given word, they are quite time- and compute-intensive, especially if we were to pre-train such a model from scratch. In this work, we focus on the scientific domain and on finding the optimal word representation algorithm along with the tokenization method that could be used to represent words in the scientific domain. The goal of this research is twofold: 1) finding the optimal word representation and tokenization methods that can be used in downstream scientific domain NLP tasks, and 2) building a comprehensive evaluation suite that could be used to evaluate various word representation and tokenization algorithms (even as new ones are introduced) in the scientific domain. To this end, we build an evaluation suite consisting of several downstream tasks and relevant datasets for each task. Furthermore, we use the constructed evaluation suite to test various word representation and tokenization algorithms.

[NLP-120] CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

【Quick Read】: This paper addresses the shortcomings of multimodal large language models (MLLMs) on referring expression comprehension and segmentation, especially the accuracy of semantic alignment and cross-modal reasoning under complex queries. The key to the solution is CoT Referring, a strategy that builds a structured chain-of-thought (CoT) training data format, parsing textual structure into a sequence of referring steps that identify inter-object relationships and ensure reference consistency at each step, thereby improving accuracy in complex query scenarios. The method also restructures the training data to enforce a new output format, provides new annotations for existing datasets, and integrates detection and segmentation into a unified MLLM framework trained with a novel adaptive weighted loss, yielding gains of over 2.5% against baseline models on a curated benchmark and on RefCOCO/+/g.

Link: https://arxiv.org/abs/2510.06243
Authors: Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, Yun Fu
Institutions: Adobe Research; Northeastern University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: MLLM, Referring Expression Segmentation

Abstract:Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, serving as benchmarks for Multimodal Large Language Models (MLLMs) capabilities. To address these challenges, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data structure. Our approach systematically parses textual structures to a sequential referring step, where in each step it identifies relationships and ensures consistent reference alignment, thereby improving accuracy in complex query scenarios. We restructure the training data to enforce a new output form, providing new annotations for existing datasets and compiling an evaluation benchmark from existing resources. This benchmark is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and RefCOCO/+/g demonstrate the effectiveness of our approach, with a notable increase of 2.5%+ over baseline models.

[NLP-121] Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses EMNLP

【Quick Read】: This paper addresses quality assessment of human-written answers to open-ended survey questions; existing automatic evaluation methods are designed for LLM-generated text and struggle to capture the distinct characteristics of human responses. The key to the solution is a two-stage evaluation framework: gibberish filtering first removes clearly nonsensical responses, after which LLM capabilities, grounded in empirical analysis of real-world survey data, score the remaining responses along three dimensions: effort, relevance, and completeness. This enables effective quantification and prediction of human-written response quality; validation on English and Korean datasets shows the framework outperforms existing metrics while remaining highly applicable in practice.

Link: https://arxiv.org/abs/2510.06242
Authors: Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP Industry Track

Abstract:Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions-effort, relevance, and completeness-are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.
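
The two-stage split above is easy to prototype: a cheap gibberish gate rejects nonsensical text outright, and only survivors are scored by an LLM on effort, relevance, and completeness. The heuristics, rubric wording, and `call_llm` client below are assumptions, not the paper's implementation.

```python
import re

def is_gibberish(text, min_words=2):
    words = re.findall(r"[A-Za-z가-힣]+", text)  # English or Korean tokens
    # Flags near-empty answers and low-diversity strings such as "aaaa".
    return len(words) < min_words or len(set(text.strip())) <= 3

RUBRIC = ("Rate this survey response from 1-5 on each of: "
          "effort, relevance, completeness. Return JSON.")

def evaluate_response(question, response, call_llm):
    if is_gibberish(response):
        return {"effort": 0, "relevance": 0, "completeness": 0}
    return call_llm(f"{RUBRIC}\nQuestion: {question}\nResponse: {response}")
```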

[NLP-122] Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

【Quick Read】: This paper addresses the higher safety and reliability demands of industrial question-answering (QA) systems, particularly the trust deficit of multi-agent large language models in high-risk scenarios such as equipment fault diagnosis, where uncontrolled iterations and unverifiable outputs arise, as well as the difficulty conventional knowledge distillation has in transferring collaborative reasoning capabilities to lightweight deployable models. The key to the solution is Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD), which formulates distillation as a Markov decision process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and guarantee convergence. By integrating collaborative reasoning with knowledge grounding, it generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suited to edge deployment, improving accuracy by up to 20.1% and significantly enhancing reliability.

Link: https://arxiv.org/abs/2510.06240
Authors: Jiqun Pan, Zhenke Duan, Jiani Tu, Anzhi Cheng, Yanqing Wang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 41 pages, 12 figures, 6 tables

Abstract:Industrial question-answering (QA) systems require higher safety and reliability than general-purpose dialogue models, as errors in high-risk scenarios such as equipment fault diagnosis can have severe consequences. Although multi-agent large language models enhance reasoning depth, they suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning capabilities to lightweight, deployable student models. To address these challenges, we propose Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD). Our approach formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By integrating collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG-MASD improves accuracy by 2.4 per cent to 20.1 per cent over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety-critical industrial scenarios. Code and data are available at this https URL.

[NLP-123] OpenStaxQA: A multilingual dataset based on open-source college textbooks

【Quick Read】: This paper addresses the lack of targeted evaluation benchmarks for large language models (LLMs) in higher education, especially question answering over college textbook content. The authors build OpenStaxQA, a multilingual benchmark based on 43 open-source college textbooks in English, Spanish, and Polish, and fine-tune and evaluate LLMs of roughly 7 billion parameters using quantized low-rank adapters (QLoRa). The key to the solution is twofold: a high-quality, education-specific dataset that strengthens model understanding and reasoning in academic scenarios, and efficient low-resource fine-tuning via QLoRa, complemented by a zero-shot evaluation on the AI2 Reasoning Challenge dev set to check whether OpenStaxQA yields cross-task gains.

Link: https://arxiv.org/abs/2510.06239
Authors: Pranav Gupta
Institutions: Lowe's
Subjects: Computation and Language (cs.CL)

Abstract:We present OpenStaxQA, an evaluation benchmark specific to college-level educational applications based on 43 open-source college textbooks in English, Spanish, and Polish, available under a permissive Creative Commons license. We finetune and evaluate large language models (LLMs) with approximately 7 billion parameters on this dataset using quantized low rank adapters (QLoRa). Additionally, we perform a zero-shot evaluation on the AI2 Reasoning Challenge dev dataset in order to check if OpenStaxQA can lead to improved performance on other tasks. We also discuss broader impacts relevant to datasets such as OpenStaxQA.

[NLP-124] CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation

【Quick Read】: This paper addresses the lack of narrative depth and emotional resonance when large language models (LLMs) generate movie scripts: models can produce highly structured text but fail to capture the "soul" that compelling cinema requires, including dialogue coherence, character consistency, and plot plausibility. The key to the solution is to first build CML-Dataset, analyzing the intrinsic multi-shot continuity and narrative structure of high-quality scripts to distill three assessment dimensions: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR); then to propose CML-Bench, a benchmark with quantitative metrics across these dimensions, together with the CML-Instruction prompting strategy, which provides detailed instructions on character dialogue and event logic to guide LLMs toward more structured, cinematically sound scripts. Experiments show the approach markedly improves generated script quality, with results aligned with human preferences.

Link: https://arxiv.org/abs/2510.06231
Authors: Mingzhe Zheng, Dingjie Song, Guanyu Zhou, Jun You, Jiahao Zhan, Xuran Ma, Xinyuan Song, Ser-Nam Lim, Qifeng Chen, Harry Yang
Institutions: The Hong Kong University of Science and Technology; Lehigh University; Fudan University; Emory University; University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 24 pages, 9 figures

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in generating highly structured texts. However, while exhibiting a high degree of structural organization, movie scripts demand an additional layer of nuanced storytelling and emotional depth-the ‘soul’ of compelling cinema-that LLMs often fail to capture. To investigate this deficiency, we first curated CML-Dataset, a dataset comprising (summary, content) pairs for Cinematic Markup Language (CML), where ‘content’ consists of segments from esteemed, high-quality movie scripts and ‘summary’ is a concise description of the content. Through an in-depth analysis of the intrinsic multi-shot continuity and narrative structures within these authentic scripts, we identified three pivotal dimensions for quality assessment: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR). Informed by these findings, we propose the CML-Bench, featuring quantitative metrics across these dimensions. CML-Bench effectively assigns high scores to well-crafted, human-written scripts while concurrently pinpointing the weaknesses in screenplays generated by LLMs. To further validate our benchmark, we introduce CML-Instruction, a prompting strategy with detailed instructions on character dialogue and event logic, to guide LLMs to generate more structured and cinematically sound scripts. Extensive experiments validate the effectiveness of our benchmark and demonstrate that LLMs guided by CML-Instruction generate higher-quality screenplays, with results aligned with human preferences.

[NLP-125] WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

【Quick Read】: This paper addresses the difficulty of converting the qualitative narratives in historical weather archives into structured knowledge for climate science, especially for identifying indicators of societal vulnerability and resilience. The key to the solution is WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems in this setting, comprising two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages among over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether large language models (LLMs) can correctly classify vulnerability and resilience indicators in extreme-weather narratives. Experiments show that dense retrievers often fail on historical terminology and that LLMs frequently misinterpret vulnerability and resilience concepts, exposing key limitations in reasoning about complex societal indicators and offering insights for designing more robust climate-focused RAG systems over archival contexts.

Link: https://arxiv.org/abs/2510.05336
Authors: Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo
Institutions: McGill University; University of Waterloo; Université de Montréal; ETH Zurich; Beijing Jiaotong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system’s ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at this https URL.

Computer Vision

[CV-0] Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

【Quick Read】: This paper addresses the heavy computation and poor scalability of existing referring video object segmentation (RVOS) methods that rely on end-to-end training with dense mask annotations. The key to the solution is to decouple RVOS into referring, video, and segmentation factors and to propose the Temporal Prompt Generation and Selection (Tenet) framework, which handles the first two while delegating segmentation to pretrained image-based foundation segmentation models. Concretely, off-the-shelf object detectors and trackers produce temporal prompts associated with the referring sentence, and a Prompt Preference Learning mechanism assesses the quality of these prompts, so that high-quality prompts can efficiently instruct the foundation segmentation model to produce accurate masks for the referred object, enabling efficient model adaptation to RVOS.

Link: https://arxiv.org/abs/2510.07319
Authors: Ci-Siang Lin, Min-Hung Chen, I-Jieh Liu, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang
Institutions: National Taiwan University; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts could be produced, they can not be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By taking such prompts to instruct image-based foundation segmentation models, we would be able to produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.

[CV-1] Quantum-enhanced Computer Vision: Going Beyond Classical Algorithms

【Quick Read】: This paper addresses the absence of a systematic survey of quantum-enhanced computer vision (QeCV) to advance this interdisciplinary field. The core challenge is combining the unique advantages of quantum computing, which exploits quantum-mechanical effects to reach computational regimes inaccessible to classical computers, with computer vision tasks, and developing fundamentally new algorithms compatible with quantum hardware. The key contribution is a holistic review covering methodologies for formulations compatible with the two main quantum computational paradigms (gate-based quantum computing and quantum annealing), a discussion of parametrised quantum circuits as a potential long-term alternative to classical neural networks, and a guide to quantum computing tools, programming interfaces, and learning resources for computer vision researchers, laying theoretical groundwork and practical directions for QeCV research.

Link: https://arxiv.org/abs/2510.07317
Authors: Natacha Kuete Meli, Shuteng Wang, Marcel Seelbach Benkner, Michele Sasdelli, Tat-Jun Chin, Tolga Birdal, Michael Moeller, Vladislav Golyanik
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 44 pages, 23 figures and 6 tables

Abstract:Quantum-enhanced Computer Vision (QeCV) is a new research field at the intersection of computer vision, optimisation theory, machine learning and quantum computing. It has high potential to transform how visual signals are processed and interpreted with the help of quantum computing that leverages quantum-mechanical effects in computations inaccessible to classical (i.e. non-quantum) computers. In scenarios where existing non-quantum methods cannot find a solution in a reasonable time or compute only approximate solutions, quantum computers can provide, among others, advantages in terms of better time scalability for multiple problem classes. Parametrised quantum circuits can also become, in the long term, a considerable alternative to classical neural networks in computer vision. However, specialised and fundamentally new algorithms must be developed to enable compatibility with quantum hardware and unveil the potential of quantum computational paradigms in computer vision. This survey contributes to the existing literature on QeCV with a holistic review of this research field. It is designed as a quantum computing reference for the computer vision community, targeting computer vision students, scientists and readers with related backgrounds who want to familiarise themselves with QeCV. We provide a comprehensive introduction to QeCV, its specifics, and methodologies for formulations compatible with quantum hardware and QeCV methods, leveraging two main quantum computational paradigms, i.e. gate-based quantum computing and quantum annealing. We elaborate on the operational principles of quantum computers and the available tools to access, program and simulate them in the context of QeCV. Finally, we review existing quantum computing tools and learning materials and discuss aspects related to publishing and reviewing QeCV papers, open challenges and potential social implications.

[CV-2] Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers NEURIPS2025

【Quick Read】: This paper addresses the "flying pixels" problem in current generative monocular depth estimation methods, where artifacts appear at edges and fine details because a variational autoencoder (VAE) compresses depth maps into a latent space for diffusion. The key to the solution is the Pixel-Perfect Depth model, which performs diffusion generation directly in pixel space to avoid VAE-induced distortion, together with two core components: Semantics-Prompted Diffusion Transformers (SP-DiT), which inject semantic representations from vision foundation models into the DiT to prompt the diffusion process, preserving global semantic consistency while enhancing fine-grained visual detail; and a Cascade DiT design that progressively increases the number of tokens for better efficiency and accuracy. Experiments show the model achieves the best performance among all published generative models on five benchmarks and significantly outperforms all other models in edge-aware point cloud evaluation.

Link: https://arxiv.org/abs/2510.07316
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang
Institutions: Huazhong University of Science and Technology; Xiaomi EV; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025. Project page: this https URL

Abstract:This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces \textitflying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.

[CV-3] WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

【Quick Read】: This paper addresses the performance bottleneck in vision-language-action (VLA) models caused by the scarcity of wrist-view data. Existing world models cannot generate wrist-view videos from anchor views because they require a wrist-view first frame and thus cannot bridge the extreme viewpoint gap between anchor and wrist views. The key to the solution is WristWorld, the first 4D world model that generates wrist-view videos from anchor views alone, operating in two stages: (i) a reconstruction stage that extends VGGT and incorporates a Spatial Projection Consistency (SPC) loss to estimate geometrically consistent wrist-view poses and 4D point clouds; and (ii) a generation stage that uses a video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda show state-of-the-art video generation with superior spatial consistency and improved VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

Link: https://arxiv.org/abs/2510.07313
Authors: Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, Shanghang Zhang
Institutions: Peking University; Hong Kong University of Science and Technology; National University of Singapore; Beijing Innovation Center of Humanoid Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

[CV-4] MATRIX: Mask Track Alignment for Interaction-aware Video Generation

【Quick Read】: This paper addresses video diffusion models' (video DiTs') weakness in modeling multi-instance and subject-object interactions, and the open question of how such models internally represent interactions. The key to the solution is to first curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks, and use it to systematically analyze the models' attention from two perspectives: semantic grounding (via video-to-text attention) and semantic propagation (via video-to-video attention). The analysis finds that interaction information concentrates in a small subset of interaction-dominant layers. Building on this, MATRIX is a simple and effective regularization that aligns attention in those specific layers of video DiTs with the multi-instance mask tracks from MATRIX-11K, improving interaction fidelity and semantic alignment while reducing drift and hallucination during generation.

Link: https://arxiv.org/abs/2510.07310
Authors: Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
Institutions: KAIST AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page is available at: this https URL

Abstract:Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.
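
A sketch of the alignment idea in the spirit of MATRIX, under assumptions: `attn` is a video-to-text attention map from one interaction-dominant layer for a single instance token, reshaped to [frames, height*width], and `mask` is that instance's binary track at the same resolution; the paper's exact loss and layer selection may differ.

```python
import torch
import torch.nn.functional as F

def mask_alignment_loss(attn, mask, eps=1e-8):
    """KL regularizer pushing per-frame attention mass inside the mask track."""
    p = attn / (attn.sum(dim=-1, keepdim=True) + eps)   # attention as distribution
    m = mask.float()
    m = m / (m.sum(dim=-1, keepdim=True) + eps)         # uniform target over mask
    return F.kl_div(p.clamp_min(eps).log(), m, reduction="batchmean")
```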

[CV-5] SpecGuard: Spectral Projection-based Advanced Invisible Watermarking ICCV2025

【Quick Read】: This paper addresses the insufficient robustness of image watermarking against diverse transformations, including distortions, image regeneration, and adversarial perturbations; existing methods struggle to resist complex attacks while keeping the watermark imperceptible, so copyright information is easily destroyed or tampered with in practice. The key to the solution is SpecGuard, a novel watermarking method that embeds the message inside hidden convolution layers, converting from the spatial to the frequency domain using spectral projection of the higher frequency band obtained from a wavelet decomposition, with an efficient Fast Fourier Transform (FFT) approximation. In the encoding phase, a strength factor improves resistance to geometric, regeneration, and adversarial attacks, while the decoder leverages Parseval's theorem to learn and extract the watermark pattern, recovering the message accurately even under challenging transformations.

Link: https://arxiv.org/abs/2510.07302
Authors: Inzamamul Alam, Md Tanvir Islam, Khan Muhammad, Simon S. Woo
Institutions: Sungkyunkwan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Accepted Paper

Abstract:Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against various transformations primarily including distortions, image regeneration, and adversarial perturbation, creating real-world challenges. In this work, we introduce SpecGuard, a novel watermarking approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain using spectral projection of a higher frequency band that is decomposed by wavelet projection. Spectral projection employs Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval's theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard based on the embedded watermark's invisibility, capacity, and robustness. Comprehensive experiments demonstrate the proposed SpecGuard outperforms the state-of-the-art models. To ensure reproducibility, the full code is released on GitHub: this https URL.
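
A rough sketch of the embedding path the abstract describes (wavelet split, spectral projection via FFT, strength-scaled message), assuming a single-level Haar DWT and a +/-1 bit pattern; SpecGuard's learned hidden convolution layers are not reproduced here.

```python
import numpy as np
import pywt

def embed(image, bits, strength=2.0):
    """Embed a bit pattern in the FFT of the high-frequency wavelet band."""
    ll, (lh, hl, hh) = pywt.dwt2(image, "haar")   # wavelet projection
    spec = np.fft.fft2(hh)                        # spectral projection
    pattern = np.resize(np.where(np.asarray(bits) > 0, 1.0, -1.0), spec.shape)
    spec += strength * pattern                    # strength factor vs. robustness
    hh_marked = np.real(np.fft.ifft2(spec))
    return pywt.idwt2((ll, (lh, hl, hh_marked)), "haar")
```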

[CV-6] Evaluating Fundus-Specific Foundation Models for Diabetic Macular Edema Detection

【Quick Read】: This paper addresses the limited performance of deep learning models for automated detection of diabetic macular edema (DME) caused by the scarcity of annotated data. The key to the solution is a systematic comparison of foundation models (FMs) and standard transfer learning for DME detection, covering the two most popular fundus-image foundation models, RETFound and FLAIR, and a lightweight EfficientNet-B0 CNN, across different training regimes and evaluation settings on the IDRiD, MESSIDOR-2, and OEFI datasets. The results show that despite their scale, FMs do not consistently outperform fine-tuned CNNs: EfficientNet-B0 ranks first or second in most settings, with RETFound only showing promise on OEFI, while FLAIR demonstrates competitive zero-shot potential when prompted appropriately, suggesting that lightweight CNNs remain strong baselines in data-scarce environments.

Link: https://arxiv.org/abs/2510.07277
Authors: Franco Javier Arellano, José Ignacio Orlando
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at SIPAIM 2025

Abstract:Diabetic Macular Edema (DME) is a leading cause of vision loss among patients with Diabetic Retinopathy (DR). While deep learning has shown promising results for automatically detecting this condition from fundus images, its application remains challenging due the limited availability of annotated data. Foundation Models (FM) have emerged as an alternative solution. However, it is unclear if they can cope with DME detection in particular. In this paper, we systematically compare different FM and standard transfer learning approaches for this task. Specifically, we compare the two most popular FM for retinal images–RETFound and FLAIR–and an EfficientNet-B0 backbone, across different training regimes and evaluation settings in IDRiD, MESSIDOR-2 and OCT-and-Eye-Fundus-Images (OEFI). Results show that despite their scale, FM do not consistently outperform fine-tuned CNNs in this task. In particular, an EfficientNet-B0 ranked first or second in terms of area under the ROC and precision/recall curves in most evaluation settings, with RETFound only showing promising results in OEFI. FLAIR, on the other hand, demonstrated competitive zero-shot performance, achieving notable AUC-PR scores when prompted appropriately. These findings reveal that FM might not be a good tool for fine-grained ophthalmic tasks such as DME detection even after fine-tuning, suggesting that lightweight CNNs remain strong baselines in data-scarce environments.

[CV-7] TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

【Quick Read】: This paper addresses controllability and visual coherence in multi-shot human speech video generation; existing datasets are largely single-shot with static viewpoints and cannot support generation with complex camera motion and diverse poses. The key to the solution is TalkCuts, a large-scale, high-quality multi-shot speech video dataset with 164k clips (over 500 hours) covering nearly 10,000 identities, annotated with detailed textual descriptions, 2D keypoints, and 3D SMPL-X body poses, providing rich semantic and motion supervision for multimodal learning. On top of it, the proposed Orator framework uses an LLM as a multi-faceted director to jointly orchestrate camera transitions, speaker gesticulations, and vocal modulation, and an integrated multimodal generation module synthesizes coherent long-form videos, significantly improving the cinematographic coherence and visual appeal of generated multi-shot speech videos.

Link: https://arxiv.org/abs/2510.07249
Authors: Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, Chuang Gan
Institutions: UMass Amherst; Tencent AI; Fudan University; Sony AI; UC San Diego
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

[CV-8] GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation EMNLP2025

【Quick Read】: This paper addresses the difficulty of accurately interpreting long, complex prompts in text-to-image generation, which often causes semantic inconsistencies and missing details. Existing approaches such as fine-tuning are model-specific and require training, while prior automatic prompt optimization (APO) lacks systematic error analysis and refinement strategies, limiting reliability and effectiveness; test-time scaling methods operate on fixed prompts, adjusting only noise or sample counts, with limited interpretability and adaptability. The key to the solution is a flexible and efficient test-time prompt optimization strategy operating directly on the input text: a plug-and-play multi-agent system named GenPilot that integrates error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. The approach is model-agnostic and interpretable, handles long and complex prompts well, and markedly improves text-image consistency and the structural coherence of generated images.

Link: https://arxiv.org/abs/2510.07217
Authors: Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang
Institutions: New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences; Baichuan Inc.; Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences (CASIA); Objecteye.Inc; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages, 21 figures, accepted to EMNLP 2025 findings

Abstract:Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods operate on fixed prompts and on noise or sample numbers, limiting their interpretability and adaptability. To solve these, we introduce a flexible and efficient test-time prompt optimization strategy that operates directly on the input text. We propose a plug-and-play multi-agent system called GenPilot, integrating error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. Our approach is model-agnostic, interpretable, and well-suited for handling long and complex prompts. Simultaneously, we summarize the common patterns of errors and the refinement strategy, offering more experience and encouraging further exploration. Experiments on DPG-bench and Geneval with improvements of up to 16.9% and 5.7% demonstrate the strong capability of our methods in enhancing the text and image consistency and structural coherence of generated images, revealing the effectiveness of our test-time prompt optimization strategy. The code is available at this https URL.

[CV-9] EigenScore: OOD Detection using Covariance in Diffusion Models

【Quick Read】: This paper addresses out-of-distribution (OOD) detection for the safe deployment of machine learning systems in safety-sensitive domains, i.e., how to reliably identify inputs outside the training distribution. The key to the solution is EigenScore, which uses the eigenvalue spectrum of the posterior covariance induced by a diffusion model as a consistent signal of distribution shift: OOD inputs produce a larger trace and larger leading eigenvalues, yielding a clear spectral signature. For tractability, a Jacobian-free subspace iteration estimates the leading eigenvalues using only forward evaluations of the denoiser, keeping the method both accurate and scalable. Empirically, EigenScore achieves state-of-the-art performance on several benchmarks, with up to 5% AUROC improvement over the best baseline, and remains robust in near-OOD settings such as CIFAR-10 vs. CIFAR-100, where existing diffusion-based methods often fail.

Link: https://arxiv.org/abs/2510.07206
Authors: Shirin Shoushtari, Yi Wang, Xiao Shi, M. Salman Asif, Ulugbek S. Kamilov
Institutions: Washington University in St. Louis; University of California, Riverside
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have recently emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection. We propose EigenScore, a new OOD detection method that leverages the eigenvalue spectrum of the posterior covariance induced by a diffusion model. We argue that posterior covariance provides a consistent signal of distribution shift, leading to larger trace and leading eigenvalues on OOD inputs, yielding a clear spectral signature. We further provide analysis explicitly linking posterior covariance to distribution mismatch, establishing it as a reliable signal for OOD detection. To ensure tractability, we adopt a Jacobian-free subspace iteration method to estimate the leading eigenvalues using only forward evaluations of the denoiser. Empirically, EigenScore achieves SOTA performance, with up to 5% AUROC improvement over the best baseline. Notably, it remains robust in near-OOD settings such as CIFAR-10 vs CIFAR-100, where existing diffusion-based methods often fail.
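
A sketch of the Jacobian-free idea: by Tweedie's formula the posterior covariance is proportional to the denoiser's Jacobian, so covariance-vector products can be probed with forward differences and fed to subspace (orthogonal) iteration. The step size, iteration counts, and the sigma^2 scaling convention below are assumptions, not the paper's exact recipe.

```python
import torch

def cov_vec(denoiser, x, sigma, v, h=1e-3):
    """Approximate (posterior covariance) @ v via a forward-difference probe."""
    v = v / (v.norm() + 1e-12)
    jv = (denoiser(x + h * v, sigma) - denoiser(x, sigma)) / h
    return sigma ** 2 * jv                        # Tweedie: Cov ~ sigma^2 * J_D

def eigen_score(denoiser, x, sigma, k=4, iters=10):
    """Estimate the k leading eigenvalues with subspace iteration."""
    n = x.numel()
    q, _ = torch.linalg.qr(torch.randn(n, k, device=x.device))
    r = None
    for _ in range(iters):
        z = torch.stack(
            [cov_vec(denoiser, x, sigma, q[:, i].reshape_as(x)).flatten()
             for i in range(k)], dim=1)
        q, r = torch.linalg.qr(z)
    return r.diagonal().abs()                     # Ritz estimates; larger => OOD
```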

[CV-10] Resolution scaling governs DINOv3 transfer performance in chest radiograph classification

【Quick Read】: This paper investigates whether self-supervised learning (SSL) transfers better than conventional ImageNet pretraining to chest radiography, a high-volume modality rich in fine-grained findings, focusing on Meta's recent DINOv3 and whether its Gram-anchored self-distillation improves chest X-ray representations. The key to the solution is a systematic comparison of DINOv3 against DINOv2 and ImageNet initialization across multiple chest X-ray datasets, input resolutions (224x224, 512x512, and 1024x1024 pixels), backbones (ViT-B/16 and ConvNeXt-B), and frozen-feature versus fine-tuning strategies. Results show that a DINOv3-initialized ConvNeXt-B backbone at 512x512 performs best, with the clearest advantages on boundary-dependent and small focal abnormalities, while scaling inputs to 1024x1024 brings no substantive gains, underscoring the importance of an appropriate resolution paired with domain-adaptive fine-tuning.

Link: https://arxiv.org/abs/2510.07191
Authors: Soroosh Tayebi Arasteh, Mina Shaigan, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta's DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n=814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in the pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.

[CV-11] MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis SIGGRAPH

【Quick Read】: This paper addresses the difficulty current video diffusion models have with 360-degree novel view synthesis in human-centric scenarios, in particular generating synchronized multi-view videos from monocular full-body captures. The key to the solution is the MV-Performer framework, which conditions on camera-dependent normal maps rendered from oriented partial point clouds to relieve the ambiguity between seen and unseen regions, and a multi-view human-centric video diffusion model that fuses information from the reference video, partial renderings, and different viewpoints to keep the generated videos temporally synchronized. A robust inference procedure for in-the-wild videos further mitigates artifacts caused by imperfect monocular depth estimation; experiments on three datasets demonstrate state-of-the-art effectiveness and robustness for human-centric 4D novel view synthesis.

Link: https://arxiv.org/abs/2510.07190
Authors: Yihao Zhi, Chenghong Li, Hongjie Liao, Xihe Yang, Zhengwentai Sun, Jiahao Chang, Xiaodong Cun, Wensen Feng, Xiaoguang Han
Institutions: Shenzhen University; CUHKSZ; Great Bay University; FNii-Shenzhen; Guangdong Provincial Key Laboratory of Future Networks of Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH Asia 2025 conference track

Abstract:Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting camera trajectory within the front view while struggling to generate 360-degree viewpoint changes. In this paper, we focus on human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve a 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use the camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate our MV-Performer’s state-of-the-art effectiveness and robustness, setting a strong model for human-centric 4D novel view synthesis.

[CV-12] TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

【Quick Read】: This paper addresses the lack of computational precision in the spatial reasoning of vision-language models (VLMs): existing approaches reduce geometric problems to pattern recognition, cannot exploit metric cues from depth sensors and camera calibration, and thus fall short of the centimeter-level accuracy required for robotic manipulation. The key to the solution is TIGeR (Tool-Integrated Geometric Reasoning), a framework that turns VLMs from perceptual estimators into geometric computers by letting them generate and execute precise geometric computations through external tools: rather than internalizing complex geometric operations in the network, the model recognizes when geometric reasoning is needed, synthesizes the corresponding computational code, and invokes specialized libraries for exact calculation. The work also introduces TIGeR-300K, a tool-invocation-oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, and a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with a hierarchical reward design, achieving state-of-the-art performance on geometric reasoning benchmarks and centimeter-level precision in real-world robotic manipulation.

Link: https://arxiv.org/abs/2510.07181
Authors: Yi Han, Cheng Chi, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
Institutions: Beihang University; Beijing Academy of Artificial Intelligence; Peking University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures

Abstract:Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
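
An illustration of the kind of exact computation TIGeR delegates to tools rather than estimating in-network: back-projecting a pixel with metric depth through the camera intrinsics. The intrinsic values below are made up; the paper's actual tool library and call format are not shown here.

```python
import numpy as np

def pixel_to_point(u, v, depth, K):
    """Metric 3D point in the camera frame from pixel (u, v) and depth in meters."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

K = np.array([[600.0, 0.0, 320.0],   # made-up intrinsics for illustration
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
print(pixel_to_point(400, 300, 0.75, K))  # exact, calibrated, not pattern-matched
```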

[CV-13] Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

【Quick Read】: This paper addresses the task mismatch and benchmark noise in evaluating visual token compression for multimodal large language models (MLLMs): existing benchmarks were designed to measure model perception and reasoning rather than compression techniques, making the evaluation unreliable, and simple image downsampling in fact outperforms many sophisticated compression methods on several widely used benchmarks. The key to the solution is VTC-Bench, an evaluation framework that introduces a data filtering mechanism to denoise existing benchmarks, enabling a fairer and more accurate assessment of visual token compression methods.

Link: https://arxiv.org/abs/2510.07143
Authors: Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu
Institutions: Hong Kong University of Science and Technology (Guangzhou); Shanghai Jiao Tong University; Northeastern University; INSAIT, Sofia University "St. Kliment Ohridski"; Shanghai AI Laboratory; Hong Kong University of Science and Technology; University of Pisa; University of Trento
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent endeavors to accelerate inference in Multimodal Large Language Models (MLLMs) have primarily focused on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks are originally designed to assess the perception and reasoning capabilities of MLLMs, rather than to evaluate compression techniques. As a result, directly applying them to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Down-sampling is able to serve as a data filter to evaluate the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at this https URL.
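
A sketch of the downsampling observation turned into a filter: if a model still answers correctly from a heavily downsampled image, the sample carries little signal for comparing token-compression methods and can be treated as noise. The keep ratio and the `model.answer` interface are assumptions, not VTC-Bench's implementation.

```python
from PIL import Image

def downsample(img: Image.Image, keep_ratio: float = 0.25) -> Image.Image:
    """Shrink to keep_ratio of the pixels, then restore the original size."""
    w, h = img.size
    s = keep_ratio ** 0.5
    small = img.resize((max(1, int(w * s)), max(1, int(h * s))), Image.BICUBIC)
    return small.resize((w, h), Image.BICUBIC)

def is_easy_sample(model, img, question, answer) -> bool:
    return model.answer(downsample(img), question) == answer  # placeholder API
```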

[CV-14] Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

【Quick Read】: This paper addresses the insufficiently explored generalization of remote sensing vision-language models (RSVLMs) in low-data regimes such as few-shot learning; despite strong zero-shot performance from large-scale pretraining, their few-shot adaptation has not been systematically evaluated. The key to the solution is the first structured benchmark for few-shot adaptation, covering ten remote sensing scene classification datasets and applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Results show that models with similar zero-shot performance can behave very differently under few-shot adaptation and that no existing method clearly dominates, highlighting the need for more robust, remote-sensing-specific few-shot adaptation; to facilitate future research, the authors release a reproducible benchmarking framework and open-source code for systematic evaluation under few-shot conditions.

Link: https://arxiv.org/abs/2510.07135
Authors: Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq
Institutions: UCLouvain (Université catholique de Louvain); UMons (Université de Mons)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: this https URL

[CV-15] TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

【Quick Read】: This paper addresses tracking failures in embodied visual tracking (EVT) caused by the absence of explicit spatial reasoning and effective temporal memory, especially under severe occlusion or in the presence of similar-looking distractors. The key to the solution is TrackVLA++, a new vision-language-action (VLA) model with two core modules: Polar-CoT, a chain-of-thought (CoT) spatial reasoning mechanism that infers the target's relative position and encodes it as a compact polar-coordinate token to guide action prediction; and a Target Identification Memory (TIM) that uses a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and reducing target loss during extended occlusions.

Link: https://arxiv.org/abs/2510.07134
Authors: Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang
Institutions: Peking University; Galbot; USTC; BAAI; Beihang University; SUSTech; Beijing Normal University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target’s relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.

[CV-16] Graph Conditioned Diffusion for Controllable Histopathology Image Generation

【Quick Read】: This paper addresses the limits of Diffusion Probabilistic Models (DPMs) for controlled generation, particularly fine-grained semantic control in structure-sensitive settings such as medical imaging. Existing DPMs operate in noisy latent spaces that lack explicit semantic structure and strong priors, making it hard to guarantee consistent, interpretable anatomical or pathological structure. The key idea, Graph-Conditioned-Diffusion, is a graph-based object-level representation: graph nodes are generated for each major structure in the image, encoding its features and relationships; these graph representations are processed by a transformer module and injected into the diffusion model via the text-conditioning mechanism, enabling fine-grained control over generation. Experiments on a real-world histopathology use case show the generated data can reliably substitute for annotated patient data in downstream segmentation tasks.

Link: https://arxiv.org/abs/2510.07129
Authors: Sarah Cechnicka,Matthew Baugh,Weitong Zhang,Mischa Dombrowski,Zhe Li,Johannes C. Paetzold,Candice Roufosse,Bernhard Kainz
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in Diffusion Probabilistic Models (DPMs) have set new standards in high-quality image synthesis. Yet, controlled generation remains challenging, particularly in sensitive areas such as medical imaging. Medical images feature inherent structure such as consistent spatial arrangement, shape or texture, all of which are critical for diagnosis. However, existing DPMs operate in noisy latent spaces that lack semantic structure and strong priors, making it difficult to ensure meaningful control over generated content. To address this, we propose graph-based object-level representations for Graph-Conditioned-Diffusion. Our approach generates graph nodes corresponding to each major structure in the image, encapsulating their individual features and relationships. These graph representations are processed by a transformer module and integrated into a diffusion model via the text-conditioning mechanism, enabling fine-grained control over generation. We evaluate this approach using a real-world histopathology use case, demonstrating that our generated data can reliably substitute for annotated patient data in downstream segmentation tasks. The code is available here.

[CV-17] Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?

【Quick Read】: This paper tackles the privacy, storage, and transfer challenges of applying deep learning to medical imaging, especially in multi-center settings where data across institutions is often non-independent and identically distributed (non-IID), degrading model performance. The key is a Federated Learning (FL) framework that trains models locally on each client and exchanges only model parameters rather than raw data, protecting patient privacy. Experiments show that even when clients apply different MRI intensity normalization methods, creating data heterogeneity, FL maintains high segmentation accuracy (a 3D Dice score of 92%, comparable to a centrally trained model), demonstrating robustness to non-IID data and offering a practical path toward distributed, efficient training of medical AI models.

Link: https://arxiv.org/abs/2510.07126
Authors: Jan Fiszer,Dominika Ciupek,Maciej Malawski
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Deep learning (DL) has been increasingly applied in medical imaging, however, it requires large amounts of data, which raises many challenges related to data privacy, storage, and transfer. Federated learning (FL) is a training paradigm that overcomes these issues, though its effectiveness may be reduced when dealing with non-independent and identically distributed (non-IID) data. This study simulates non-IID conditions by applying different MRI intensity normalization techniques to separate data subsets, reflecting a common cause of heterogeneity. These subsets are then used for training and testing models for brain tumor segmentation. The findings provide insights into the influence of the MRI intensity normalization methods on segmentation models, both training and inference. Notably, the FL methods demonstrated resilience to inconsistently normalized data across clients, achieving the 3D Dice score of 92%, which is comparable to a centralized model (trained using all data). These results indicate that FL is a solution to effectively train high-performing models without violating data privacy, a crucial concern in medical applications. The code is available at: this https URL.
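
As an illustration of how the non-IID setting in this study can be simulated, the sketch below assigns a different MRI intensity normalization to each federated client; the three normalizers are common choices, and the exact set of techniques used in the paper may differ.

```python
import numpy as np

def zscore(vol: np.ndarray) -> np.ndarray:
    """Standard-score normalization of MRI intensities."""
    return (vol - vol.mean()) / (vol.std() + 1e-8)

def minmax(vol: np.ndarray) -> np.ndarray:
    """Rescale intensities to [0, 1]."""
    lo, hi = vol.min(), vol.max()
    return (vol - lo) / (hi - lo + 1e-8)

def percentile_clip(vol: np.ndarray, p_lo: float = 1, p_hi: float = 99) -> np.ndarray:
    """Clip to robust percentiles, then rescale to [0, 1]."""
    lo, hi = np.percentile(vol, [p_lo, p_hi])
    return minmax(np.clip(vol, lo, hi))

# Each federated client gets a different normalizer, emulating
# feature-distribution skew (non-IID data) across institutions.
CLIENT_NORMS = {"client_0": zscore, "client_1": minmax, "client_2": percentile_clip}
```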

[CV-18] MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency

【Quick Read】: This paper addresses the challenges of cross-view consistency and scale alignment in monocular 3D foundation models, which limit their use in broader 3D vision tasks. The key is MoRe, a training-free monocular geometry refinement method: feature matching across frames establishes correspondences, and a graph-based optimization framework performs local planar approximation over the estimated 3D points and the surface normals inferred by monocular foundation models. This mitigates the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure, improving both 3D reconstruction and novel view synthesis, particularly in sparse-view rendering scenarios.

Link: https://arxiv.org/abs/2510.07119
Authors: Dongki Jung,Jaehoon Choi,Yonghan Lee,Sungmin Eum,Heesung Kwon,Dinesh Manocha
Affiliations: University of Maryland, College Park (马里兰大学学院公园分校); DEVCOM Army Research Laboratory (陆军研究实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the estimated 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse view rendering scenarios.
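
MoRe's graph optimization is more involved, but the scale-ambiguity issue it tackles can be illustrated with the simplest possible correction: a closed-form least-squares scale aligning one frame's monocular depths to matched reference depths. This is a toy stand-in, not the paper's planar graph formulation.

```python
import numpy as np

def align_depth_scale(depth_src: np.ndarray, depth_ref: np.ndarray) -> float:
    """Closed-form least-squares scale s minimizing ||s*depth_src - depth_ref||^2
    over depths sampled at matched correspondences (both 1D arrays).
    A robust variant would use the median of depth_ref / depth_src instead."""
    return float(depth_src @ depth_ref / (depth_src @ depth_src + 1e-12))
```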

[CV-19] Enhancing Concept Localization in CLIP-based Concept Bottleneck Models

【Quick Read】: This paper addresses unfaithful explanations in Concept Bottleneck Model (CBM) based explainable AI (XAI) caused by CLIP's concept hallucination: when extracting concepts zero-shot, CLIP may incorrectly predict the presence or absence of concepts in an image, undermining explanation reliability. The key is Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes the pixels corresponding to target concepts, enabling accurate concept identification and visualization while supporting more interpretable saliency-based explanations.

Link: https://arxiv.org/abs/2510.07115
Authors: Rémi Kazmierczak,Steve Azzolin,Eloïse Berthier,Goran Frehse,Gianni Franchi
Affiliations: ENSTA Paris, Institut Polytechnique de Paris(巴黎国立高等先进技术学院,巴黎综合理工学院); University of Trento(特伦托大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination, incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.

[CV-20] DADO: A Depth-Attention framework for Object Discovery

【Quick Read】: This paper addresses unsupervised object discovery: identifying and localizing objects in images without human-annotated labels. The core challenges are extracting candidate object regions from complex scenes while coping with noisy attention maps and varying depth planes. The key is DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism with a depth estimation module and uses dynamic weighting to adaptively emphasize attention or depth features according to each image's global characteristics. This improves the accuracy and robustness of object discovery, outperforming state-of-the-art methods on standard benchmarks without fine-tuning.

Link: https://arxiv.org/abs/2510.07089
Authors: Federico Gonzalez,Estefania Talavera,Petia Radeva
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21st International Conference in Computer Analysis of Images and Patterns (CAIP 2025)

Abstract:Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism and a depth model to identify potential objects in images. To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.

[CV-21] Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications

【Quick Read】: This survey addresses how robotics can leverage Vision-Language-Action (VLA) models to generalize across tasks, objects, embodiments, and environments, reducing reliance on task-specific data and improving flexibility and scalability in real-world deployment. Its key contribution is a comprehensive, end-to-end, full-stack review that integrates the architectural evolution of VLA models, modality-specific processing techniques, and learning paradigms with the hardware and software side: robot platforms, data collection strategies, public datasets, data augmentation methods, and evaluation benchmarks, offering practical guidance for researchers and engineers deploying VLA systems in real robotic applications.

Link: https://arxiv.org/abs/2510.07077
Authors: Kento Kawaharazuka,Jihoon Oh,Jun Yamada,Ingmar Posner,Yuke Zhu
Affiliations: The University of Tokyo (东京大学); University of Oxford (牛津大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to IEEE Access, website: this https URL

Abstract:Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention. By unifying vision, language, and action data at scale, which have traditionally been studied separately, VLA models aim to learn policies that generalise across diverse tasks, objects, embodiments, and environments. This generalisation capability is expected to enable robots to solve novel downstream tasks with minimal or no additional task-specific data, facilitating more flexible and scalable real-world deployment. Unlike previous surveys that focus narrowly on action representations or high-level model architectures, this work offers a comprehensive, full-stack review, integrating both software and hardware components of VLA systems. In particular, this paper provides a systematic review of VLAs, covering their strategy and architectural transition, architectures and building blocks, modality-specific processing techniques, and learning paradigms. In addition, to support the deployment of VLAs in real-world robotic applications, we also review commonly used robot platforms, data collection strategies, publicly available datasets, data augmentation methods, and evaluation benchmarks. Throughout this comprehensive survey, this paper aims to offer practical guidance for the robotics community in applying VLAs to real-world robotic systems. All references categorized by training approach, evaluation method, modality, and dataset are available in the table on our project website: this https URL .

[CV-22] Concept Retrieval - What and How?

【Quick Read】: This paper addresses image retrieval that goes beyond visual or semantic similarity: given a query image, retrieve other images in the embedding space that share its central concepts, capturing aspects of the underlying narrative. The key rests on two observations: (1) while each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another; and (2) modeling the neighborhood with a bimodal Gaussian distribution uncovers meaningful concept groupings, improving concept identification.

Link: https://arxiv.org/abs/2510.07058
Authors: Ori Nizan,Oren Shrout,Ayellet Tal
Affiliations: Technion(以色列理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: this https URL
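
A minimal sketch of the second observation: fit a two-component (bimodal) Gaussian to query-neighbor similarities and split the neighborhood into two concept groups. The cosine-similarity feature and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_neighborhood(query: np.ndarray, neighbors: np.ndarray):
    """Fit a two-component (bimodal) Gaussian to query-neighbor cosine
    similarities and split the neighborhood into two concept groups."""
    sims = neighbors @ query / (
        np.linalg.norm(neighbors, axis=1) * np.linalg.norm(query) + 1e-8)
    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(
        sims.reshape(-1, 1))
    return neighbors[labels == 0], neighbors[labels == 1]
```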

[CV-23] Introspection in Learned Semantic Scene Graph Localisation FAST IROS2025

【Quick Read】: This work investigates how semantics influence localisation performance and robustness in a self-supervised, contrastive semantic localisation framework. The key is a thorough post-hoc introspection analysis probing whether the trained model filters environmental noise and prioritises distinctive landmarks over routine clutter; Integrated Gradients and attention weights consistently emerge as the most reliable interpretability probes, and a semantic class ablation reveals an implicit weighting in which frequently occurring objects are often down-weighted. Overall, the results indicate the model learns noise-robust, semantically salient relations about place definition, enabling explainable registration under challenging visual and structural variations.

Link: https://arxiv.org/abs/2510.07053
Authors: Manshika Charvi Bissessur,Efimia Panagiotaki,Daniele De Martini
Affiliations: University of Oxford (牛津大学)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: IEEE IROS 2025 Workshop FAST

Abstract:This work investigates how semantics influence localisation performance and robustness in a learned self-supervised, contrastive semantic localisation framework. After training a localisation network on both original and perturbed maps, we conduct a thorough post-hoc introspection analysis to probe whether the model filters environmental noise and prioritises distinctive landmarks over routine clutter. We validate various interpretability methods and present a comparative reliability analysis. Integrated gradients and Attention Weights consistently emerge as the most reliable probes of learned behaviour. A semantic class ablation further reveals an implicit weighting in which frequent objects are often down-weighted. Overall, the results indicate that the model learns noise-robust, semantically salient relations about place definition, thereby enabling explainable registration under challenging visual and structural variations.

[CV-24] U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

【Quick Read】: This paper addresses the lack of a systematic, statistically rigorous benchmark of U-Net variants for medical image segmentation that also covers efficiency and generalization: prior studies often rely on small-scale validation and neglect robustness across datasets and imaging modalities as well as deployment practicality. The key is U-Bench, the first large-scale, statistically rigorous benchmark evaluating 100 U-Net variants across 28 datasets and 10 imaging modalities along three dimensions: statistical robustness, zero-shot generalization, and computational efficiency. It introduces U-Score, a joint performance-efficiency metric offering a deployment-oriented view of model progress, and provides a model advisor agent to guide researchers toward the most suitable models for specific datasets and tasks, establishing a fair, reproducible, and practically relevant evaluation standard for the next decade of U-Net-based segmentation models.

Link: https://arxiv.org/abs/2510.07041
Authors: Fenghe Tang,Chengqi Dong,Wenxin Ma,Zikang Xu,Heqin Zhu,Zihang Jiang,Rongsheng Wang,Yuhao Wang,Chenxu Wu,Shaohua Kevin Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 54 pages. The project can be accessed at: this https URL . Code is available at: this https URL

Abstract:Over the past decade, U-Net has been the dominant architecture in medical image segmentation, leading to the development of thousands of U-shaped variants. Despite its widespread adoption, there is still no comprehensive benchmark to systematically evaluate their performance and utility, largely because of insufficient statistical validation and limited consideration of efficiency and generalization across diverse datasets. To bridge this gap, we present U-Bench, the first large-scale, statistically rigorous benchmark that evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. Our contributions are threefold: (1) Comprehensive Evaluation: U-Bench evaluates models along three key dimensions: statistical robustness, zero-shot generalization, and computational efficiency. We introduce a novel metric, U-Score, which jointly captures the performance-efficiency trade-off, offering a deployment-oriented perspective on model progress. (2) Systematic Analysis and Model Selection Guidance: We summarize key findings from the large-scale evaluation and systematically analyze the impact of dataset characteristics and architectural paradigms on model performance. Based on these insights, we propose a model advisor agent to guide researchers in selecting the most suitable models for specific datasets and tasks. (3) Public Availability: We provide all code, models, protocols, and weights, enabling the community to reproduce our results and extend the benchmark with future methods. In summary, U-Bench not only exposes gaps in previous evaluations but also establishes a foundation for fair, reproducible, and practically relevant benchmarking in the next decade of U-Net-based segmentation models. The project can be accessed at: this https URL. Code is available at: this https URL.

[CV-25] Sharpness-Aware Data Generation for Zero-shot Quantization

【Quick Read】: This paper addresses poor generalization in zero-shot quantization: how to generate high-quality synthetic data for training well-generalizing low-bit quantized models without access to the original training data. The key is to treat the sharpness of the quantized model as an optimization target during synthetic data generation. Under certain assumptions, sharpness minimization can be attained by maximizing gradient matching between reconstruction-loss gradients computed on synthetic and real validation data; to avoid requiring a real validation set, this objective is approximated by gradient matching between each generated sample and its neighbors, effectively guiding data synthesis without any real data.

Link: https://arxiv.org/abs/2510.07018
Authors: Dung Hoang-Anh,Cuong Pham,Trung Le,Jianfei Cai,Thanh-Toan Do
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing the full-precision model. While it is well-known that deep neural networks with low sharpness have better generalization ability, none of the previous zero-shot quantization works considers the sharpness of the quantized model as a criterion for generating training data. This paper introduces a novel methodology that takes into account quantized model sharpness in synthetic data generation to enhance generalization. Specifically, we first demonstrate that sharpness minimization can be attained by maximizing gradient matching between the reconstruction loss gradients computed on synthetic and real validation data, under certain assumptions. We then circumvent the problem of the gradient matching without real validation set by approximating it with the gradient matching between each generated sample and its neighbors. Experimental evaluations on CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed method over the state-of-the-art techniques in low-bit quantization settings.
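
A hedged sketch of the neighbor-based gradient-matching proxy described above. The reconstruction-loss interface and the choice of cosine distance are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, recon_loss, sample, neighbors):
    """Cosine distance between the reconstruction-loss gradient of one
    synthetic sample and the mean gradient over its neighbors; used here
    as a proxy objective for sharpness minimization."""
    def flat_grad(x):
        g = torch.autograd.grad(recon_loss(model, x), model.parameters(),
                                create_graph=True)
        return torch.cat([t.reshape(-1) for t in g])

    g_s = flat_grad(sample)
    g_n = torch.stack([flat_grad(n) for n in neighbors]).mean(dim=0)
    return 1.0 - F.cosine_similarity(g_s, g_n, dim=0)
```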

[CV-26] Bayesian Modelling of Multi-Year Crop Type Classification Using Deep Neural Networks and Hidden Markov Models

【Quick Read】: This paper addresses the lack of temporal consistency in yearly land-cover mapping, aiming to better model how land cover evolves and changes over the years. The key is a novel approach combining deep learning with Bayesian modelling: a Hidden Markov Model (HMM) is integrated with a Transformer Encoder (TE) based deep neural network to capture both the intricate temporal correlations in yearly satellite image time series (Sentinel-2 SITS) and specific patterns in multi-year crop type sequences. Cascade classification with an HMM layer on top of the TE discerns temporally consistent crop-type sequences, improving overall performance and F1 scores on a dataset spanning 47 crop types and six years of Sentinel-2 acquisitions.

Link: https://arxiv.org/abs/2510.07008
Authors: Gianmarco Perantoni,Giulio Weikmann,Lorenzo Bruzzone
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure, accepted conference paper at IEEE International Geoscience and Remote Sensing Symposium, 7-12 July 2024, Athens, Greece

Abstract:The temporal consistency of yearly land-cover maps is of great importance to model the evolution and change of the land cover over the years. In this paper, we focus the attention on a novel approach to classification of yearly satellite image time series (SITS) that combines deep learning with Bayesian modelling, using Hidden Markov Models (HMMs) integrated with Transformer Encoder (TE) based DNNs. The proposed approach aims to capture both i) intricate temporal correlations in yearly SITS and ii) specific patterns in multiyear crop type sequences. It leverages the cascade classification of an HMM layer built on top of the TE, discerning consistent yearly crop-type sequences. Validation on a multiyear crop type classification dataset spanning 47 crop types and six years of Sentinel-2 acquisitions demonstrates the importance of modelling temporal consistency in the predicted labels. HMMs enhance the overall performance and F1 scores, emphasising the effectiveness of the proposed approach.
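
The role of an HMM layer on top of per-year class posteriors can be illustrated with plain Viterbi decoding. This is a generic sketch assuming known log transition probabilities; the paper's cascade formulation and transition estimation are not reproduced here.

```python
import numpy as np

def viterbi_smooth(log_post: np.ndarray, log_trans: np.ndarray) -> np.ndarray:
    """Viterbi decoding over a yearly sequence.
    log_post:  (T, C) per-year log class posteriors from the TE network
    log_trans: (C, C) log transition probabilities between crop types
    Returns the most likely temporally consistent label sequence."""
    T, C = log_post.shape
    score = log_post[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: prev i -> next j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack
        path.append(int(back[t][path[-1]]))
    return np.array(path[::-1])
```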

[CV-27] No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

【Quick Read】: This paper addresses the problem that adapting motion diffusion models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and hard to scale. The key is a Reinforcement Learning based post-training framework that fine-tunes pretrained motion diffusion models using only textual prompts, without any motion ground truth: a pretrained text-motion retrieval network serves as the reward signal, and the diffusion policy is optimized with Denoising Diffusion Policy Optimization (DDPO), effectively shifting the generative distribution toward the target domain without paired motion data. Experiments on cross-dataset adaptation and leave-one-out motion settings over HumanML3D and KIT-ML show improved quality and diversity of generated motions while preserving performance on the original distribution.

Link: https://arxiv.org/abs/2510.06988
Authors: Girolamo Macaluso,Lorenzo Mandelli,Mirko Bicchierai,Stefano Berretti,Andrew D. Bagdanov
Affiliations: University of Florence (佛罗伦萨大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model’s generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.
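
The reward signal is conceptually simple: score each generated motion by its similarity to the prompt under a frozen text-motion retrieval model. A minimal sketch, where the encoder interfaces are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def retrieval_reward(text_encoder, motion_encoder, prompts, motions):
    """DDPO-style reward: cosine similarity between prompt and generated-motion
    embeddings under a frozen text-motion retrieval network, so no motion
    ground truth is needed (encoder interfaces are hypothetical)."""
    with torch.no_grad():
        t = F.normalize(text_encoder(prompts), dim=-1)
        m = F.normalize(motion_encoder(motions), dim=-1)
    return (t * m).sum(dim=-1)  # per-sample reward in [-1, 1]
```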

[CV-28] Revisiting Mixout: An Overlooked Path to Robust Finetuning

【Quick Read】: This paper addresses the loss of robustness under distribution shift after fine-tuning vision foundation models: fine-tuning improves in-domain accuracy but hurts resilience to covariate shift, corruption, and class imbalance. The key is to revisit Mixout, a stochastic regularizer, through the lens of a single-run, weight-sharing implicit ensemble, which exposes three levers governing robustness: the masking anchor, resampling frequency, and mask sparsity. Building on this analysis, GMixout (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates the masking period via an explicit resampling-frequency hyperparameter. A sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. Across ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C, GMixout improves in-domain accuracy beyond zero-shot performance while surpassing Model Soups and strong parameter-efficient fine-tuning baselines under distribution shift.

Link: https://arxiv.org/abs/2510.06982
Authors: Masih Aminbeidokhti,Heitor Rapela Medeiros,Eric Granger,Marco Pedersoli
Affiliations: École de technologie supérieure (高等技术学院)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the masking anchor, resampling frequency, and mask sparsity. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. In experiments on benchmarks covering covariate shift, corruption, and class imbalance (ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C), GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
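
A minimal sketch of the GMixout idea: an EMA anchor plus periodically resampled Bernoulli masks that swap a fraction of fine-tuned weights back to the anchor. Rates, the schedule, and the omission of Mixout's rescaling correction are simplifications for illustration.

```python
import torch

@torch.no_grad()
def gmixout_step(params: dict, anchor: dict, masks: dict, step: int,
                 p: float = 0.9, resample_every: int = 100, decay: float = 0.999):
    """Apply after each optimizer step. `params` and `anchor` map names to
    tensors; `masks` is filled on the first call and every `resample_every` steps."""
    for name, w in params.items():
        # (i) the anchor tracks an exponential moving average of the weights
        anchor[name].mul_(decay).add_(w, alpha=1 - decay)
        # (ii) periodically redraw which coordinates are masked
        if step % resample_every == 0 or name not in masks:
            masks[name] = torch.bernoulli(torch.full_like(w, p)).bool()
        # (iii) swap masked coordinates back to the anchor
        w[masks[name]] = anchor[name][masks[name]]
```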

[CV-29] Addressing the ID-Matching Challenge in Long Video Captioning

【Quick Read】: This paper targets the ID-Matching problem in long video captioning, i.e., accurately recognizing the same individuals across different frames, which matters for text-to-video generation and multimodal understanding. Prior work relies on point-wise matching with limited generalization. The key here is to unlock the ID-Matching ability latent in large vision-language models (LVLMs) themselves: two strategies, making better use of image information and increasing the amount of information in individual descriptions, lead to RICE (Recognizing Identities for Captioning Effectively). Implemented on GPT-4o, RICE improves ID-Matching precision from 50% to 90% and recall from 15% to 80% over the baseline.

Link: https://arxiv.org/abs/2510.06973
Authors: Zhantao Yang,Huangji Wang,Ruili Feng,Han Zhang,Yuting Hu,Shangwen Zhu,Junyan Li,Yu Liu,Fan Cheng
Affiliations: Shanghai Jiao Tong University (上海交通大学); Alibaba group (阿里巴巴集团)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that do usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon LVLMs to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities within LVLMs themselves to enhance the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs including GPT-4o, revealing key insights that the performance of ID-Matching can be improved through two methods: 1) enhancing the usage of image information and 2) increasing the quantity of information in individual descriptions. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments, including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, our RICE improves the precision of ID-Matching from 50% to 90% and improves the recall of ID-Matching from 15% to 80% compared to baseline. RICE makes it possible to continuously track different individuals in the captions of long videos.

[CV-30] Learning Global Representation from Queries for Vectorized HD Map Construction

【Quick Read】: This paper addresses a limitation of DETR-style online vectorized HD map construction: reliance on independent, learnable object queries yields a predominantly local query perspective that fails to exploit the global structure inherent in HD maps. The key is the MapGR architecture with two synergistic modules: a Global Representation Learning (GRL) module, which uses a carefully designed holistic segmentation task to encourage the distribution of all queries to better align with the global map, and a Global Representation Guidance (GRG) module, which injects explicit global-level context into each individual query to facilitate its optimization. Evaluations on nuScenes and Argoverse2 show substantial improvements in mean Average Precision (mAP) over leading baselines.

Link: https://arxiv.org/abs/2510.06969
Authors: Shoumeng Qiu,Xinrun Li,Yang Long,Xiangyang Xue,Varun Ojha,Jian Pu
Affiliations: Fudan University (复旦大学); Newcastle University (纽卡斯尔大学); Durham University (杜伦大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract:The online construction of vectorized high-definition (HD) maps is a cornerstone of modern autonomous driving systems. State-of-the-art approaches, particularly those based on the DETR framework, formulate this as an instance detection problem. However, their reliance on independent, learnable object queries results in a predominantly local query perspective, neglecting the inherent global representation within HD maps. In this work, we propose MapGR (Global Representation learning for HD Map construction), an architecture designed to learn and utilize global representations from queries. Our method introduces two synergistic modules: a Global Representation Learning (GRL) module, which encourages the distribution of all queries to better align with the global map through a carefully designed holistic segmentation task, and a Global Representation Guidance (GRG) module, which endows each individual query with explicit, global-level contextual information to facilitate its optimization. Evaluations on the nuScenes and Argoverse2 datasets validate the efficacy of our approach, demonstrating substantial improvements in mean Average Precision (mAP) compared to leading baselines.

[CV-31] Generating Surface for Text-to-3D using 2D Gaussian Splatting

【Quick Read】: This paper addresses the difficulty of 3D content generation caused by the complex geometry of natural objects. Existing methods either leverage 2D diffusion priors to recover 3D structure or train directly on specific 3D representations, limiting generation quality or generalization. The key is DirectGaussian, which represents 3D object surfaces with surfels and renders them via 2D Gaussian splatting, using conditional text generation models with multi-view normal and texture priors; curvature constraints imposed on the generated surface during optimization address multi-view geometric consistency, enabling diverse and high-fidelity 3D content creation.

Link: https://arxiv.org/abs/2510.06967
Authors: Huanning Dong,Fan Li,Ping Kuang,Jianwen Min
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in Text-to-3D modeling have shown significant potential for the creation of 3D content. However, due to the complex geometric shapes of objects in the natural world, generating 3D content remains a challenging task. Current methods either leverage 2D diffusion priors to recover 3D geometry, or train the model directly based on specific 3D representations. In this paper, we propose a novel method named DirectGaussian, which focuses on generating the surfaces of 3D objects represented by surfels. In DirectGaussian, we utilize conditional text generation models and the surface of a 3D object is rendered by 2D Gaussian splatting with multi-view normal and texture priors. For multi-view geometric consistency problems, DirectGaussian incorporates curvature constraints on the generated surface during optimization process. Through extensive experiments, we demonstrate that our framework is capable of achieving diverse and high-fidelity 3D content creation.

[CV-32] High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization WACV2026

【Quick Read】: This paper addresses robustness of pre-trained models under distribution shift while avoiding the heavy cost of ensembles: Dropout, though a lightweight ensemble simulation, tends to over-regularize pre-trained models and disrupt critical representations, whereas traditional ensembles require training and storing multiple models. The key, High-rate Mixout, is a stochastic weight-swapping regularizer: during fine-tuning, fine-tuned weights are probabilistically replaced with their pre-trained counterparts at a notably high masking rate (0.9 for ViTs, 0.8 for ResNets). This both strongly penalizes deviation from the pre-trained parameters, promoting generalization to unseen domains, and substantially cuts overhead, reducing gradient computation by up to 45% and gradient memory by up to 90%. Across five domain generalization benchmarks (PACS, VLCS, OfficeHome, TerraIncognita, DomainNet) it matches the out-of-domain accuracy of ensemble-based methods at a far lower training cost.

Link: https://arxiv.org/abs/2510.06955
Authors: Masih Aminbeidokhti,Heitor Rapela Medeiros,Eric Granger,Marco Pedersoli
Affiliations: École de technologie supérieure
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026: Winter Conference on Applications of Computer Vision 2026

Abstract:Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.

[CV-33] OBJVanish: Physically Realizable Text-to-3D Adv. Generation of LiDAR-Invisible Objects

【Quick Read】: This paper addresses two shortcomings of adversarial attacks on LiDAR-based 3D object detectors: existing perturbation-based attacks rarely make objects disappear entirely, and they are hard to realize physically. The key is Phy3DAdvGen, a physically-informed text-to-3D adversarial generation method that iteratively optimizes the verbs, objects, and poses in text prompts to produce 3D pedestrian models completely ignored by LiDAR detectors, constrained by a pool of 13 3D models of real objects so that generated objects remain manufacturable and deployable in the real world. Experiments show the generated pedestrians evade six state-of-the-art LiDAR 3D detectors in both CARLA simulation and physical environments, exposing vulnerabilities in safety-critical autonomous driving systems.

Link: https://arxiv.org/abs/2510.06952
Authors: Bing Li,Wuqi Wang,Yanan Zhang,Jingzheng Li,Haigen Min,Wei Feng,Xingyu Zhao,Jie Zhang,Qing Guo
Affiliations: CFAR, A*STAR (新加坡科技研究局); Chang'an University (长安大学); Hefei University of Technology (合肥工业大学); Zhongguancun Laboratory (中关村实验室); Tianjin University (天津大学); University of Warwick (华威大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:LiDAR-based 3D object detectors are fundamental to autonomous driving, where failing to detect objects poses severe safety risks. Developing effective 3D adversarial attacks is essential for thoroughly testing these detection systems and exposing their vulnerabilities before real-world deployment. However, existing adversarial attacks that add optimized perturbations to 3D points have two critical limitations: they rarely cause complete object disappearance and prove difficult to implement in physical environments. We introduce the text-to-3D adversarial generation method, a novel approach enabling physically realizable attacks that can generate 3D models of objects truly invisible to LiDAR detectors and be easily realized in the real world. Specifically, we present the first empirical study that systematically investigates the factors influencing detection vulnerability by manipulating the topology, connectivity, and intensity of individual pedestrian 3D models and combining pedestrians with multiple objects within the CARLA simulation environment. Building on the insights, we propose the physically-informed text-to-3D adversarial generation (Phy3DAdvGen) that systematically optimizes text prompts by iteratively refining verbs, objects, and poses to produce LiDAR-invisible pedestrians. To ensure physical realizability, we construct a comprehensive object pool containing 13 3D models of real objects and constrain Phy3DAdvGen to generate 3D objects based on combinations of objects in this set. Extensive experiments demonstrate that our approach can generate 3D pedestrians that evade six state-of-the-art (SOTA) LiDAR 3D detectors in both CARLA simulation and physical environments, thereby highlighting vulnerabilities in safety-critical applications.

[CV-34] IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

【Quick Read】: This paper addresses autoregressive image generation models' neglect of the intrinsic structure of visual data, in particular the rigidity of pre-trained codebooks and the inaccuracy of hard, uniform clustering. The core of IAR2 is a Semantic-Detail Associated Dual Codebook that decouples image representation into a semantic codebook for global semantics and a detail codebook for fine-grained refinement, expanding quantization capacity from linear to polynomial scale and significantly enhancing expressiveness. To match this dual representation, a Semantic-Detail Autoregressive Prediction scheme with a Local-Context Enhanced Autoregressive Head predicts hierarchically (first the semantic token, then the detail token) while a local context window preserves spatial coherence; a Progressive Attention-Guided Adaptive CFG mechanism further modulates the guidance scale per token based on its relevance to the condition and its temporal position, improving conditional alignment without sacrificing realism. IAR2 reaches an FID of 1.50 on ImageNet, a new state of the art for autoregressive image generation.

Link: https://arxiv.org/abs/2510.06928
Authors: Ran Yi,Teng Hu,Zihan Su,Lizhuang Ma
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction-first the semantic token, then the detail token-while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.
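
One plausible reading of the dual-codebook design is residual-style quantization: a semantic codebook captures the coarse code and a detail codebook refines the residual, so K_sem × K_det composite codes become expressible with only K_sem + K_det entries. A sketch under that assumption (not the paper's exact scheme):

```python
import torch

def dual_codebook_quantize(z: torch.Tensor, sem_cb: torch.Tensor, det_cb: torch.Tensor):
    """Residual-style dual quantization: pick the nearest semantic code,
    then quantize the residual with the detail codebook.
    z: (N, D); sem_cb: (K_sem, D); det_cb: (K_det, D)."""
    sem_idx = torch.cdist(z, sem_cb).argmin(dim=1)           # global semantics
    residual = z - sem_cb[sem_idx]
    det_idx = torch.cdist(residual, det_cb).argmin(dim=1)    # fine-grained detail
    return sem_cb[sem_idx] + det_cb[det_idx], sem_idx, det_idx
```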

[CV-35] Label-frugal satellite image change detection with generative virtual exemplar learning

【Quick Read】: This paper addresses the dependence of remote-sensing change detection, especially deep learning methods, on large amounts of hand-labeled data that must capture diverse acquisition conditions and the subjectivity of the oracle. The key is a novel active-learning algorithm built around a model that measures the importance of each unlabeled sample and submits only the most critical ones, referred to as virtual exemplars, to the oracle for labeling. These exemplars are generated by an invertible graph convnet as the optimum of an adversarial loss that jointly measures representativity, diversity, and ambiguity of the data, thereby challenging the current change detection criteria the most and leading to better re-estimates of those criteria in subsequent active-learning iterations, maximizing label efficiency.

Link: https://arxiv.org/abs/2510.06926
Authors: Hichem Sahbi
Affiliations: Sorbonne University (索邦大学); CNRS (法国国家科学研究中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Change detection is a major task in remote sensing which consists in finding all the occurrences of changes in multi-temporal satellite or aerial images. The success of existing methods, and particularly deep learning ones, is tributary to the availability of hand-labeled training data that capture the acquisition conditions and the subjectivity of the user (oracle). In this paper, we devise a novel change detection algorithm, based on active learning. The main contribution of our work resides in a new model that measures how important is each unlabeled sample, and provides an oracle with only the most critical samples (also referred to as virtual exemplars) for further labeling. These exemplars are generated, using an invertible graph convnet, as the optimum of an adversarial loss that (i) measures representativity, diversity and ambiguity of the data, and thereby (ii) challenges (the most) the current change detection criteria, leading to a better re-estimate of these criteria in the subsequent iterations of active learning. Extensive experiments show the positive impact of our label-efficient learning model against comparative methods.

[CV-36] Angular Constraint Embedding via SpherePair Loss for Constrained Clustering NEURIPS2025

【Quick Read】: This paper addresses two core problems in existing deep constrained clustering (DCC): anchors inherent in end-to-end modeling limit flexibility, and learning discriminative Euclidean embeddings is difficult, restricting scalability and practical applicability. The key is SpherePair, a novel angular constraint embedding approach: a geometrically formulated SpherePair loss faithfully encodes pairwise constraints and yields embeddings that are clustering-friendly in angular space, decoupling representation learning from clustering. The method preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the cluster number, and comes with rigorous theoretical guarantees, improving performance, scalability, and real-world effectiveness over state-of-the-art DCC methods.

Link: https://arxiv.org/abs/2510.06907
Authors: Shaojie Zhang,Ke Chen
Affiliations: The University of Manchester (曼彻斯特大学)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025, 6 Figures and 1 Table in Main text, 18 Figures and 5 Tables in Appendices

Abstract:Constrained clustering integrates domain knowledge through pairwise constraints. However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available in our repository: this https URL
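
To make "angular constraint embedding" concrete, here is a generic angular pairwise loss in the same spirit: hinge margins on the cosine between L2-normalized embeddings. The margins and hinge form are illustrative; the actual SpherePair loss has its own geometric formulation.

```python
import torch
import torch.nn.functional as F

def angular_pair_loss(z_i: torch.Tensor, z_j: torch.Tensor,
                      must_link: torch.Tensor,
                      margin_ml: float = 0.9, margin_cl: float = 0.3):
    """Hinge-style angular pairwise loss on L2-normalized embeddings:
    must-link pairs are pulled above cosine margin_ml, cannot-link pairs
    pushed below margin_cl. `must_link` is a boolean tensor per pair."""
    cos = (F.normalize(z_i, dim=-1) * F.normalize(z_j, dim=-1)).sum(-1)
    ml = torch.clamp(margin_ml - cos, min=0)   # pull together
    cl = torch.clamp(cos - margin_cl, min=0)   # push apart
    return torch.where(must_link, ml, cl).mean()
```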

[CV-37] Lung Infection Severity Prediction Using Transformers with Conditional TransMix Augmentation and Cross-Attention

【Quick Read】: This paper addresses accurate severity assessment for lung infections such as pneumonia, where fast, reliable AI-assisted prediction from medical imaging is critical, especially during pandemics. The key lies in two contributions: QCross-Att-PVT, a Transformer-based architecture integrating parallel encoders, a cross-gated attention mechanism, and a feature aggregator to capture rich multi-scale features; and Conditional Online TransMix, a custom online data augmentation strategy for dataset imbalance that generates mixed-label image patches during training to improve robustness and generalization. Evaluated on the RALO CXR and Per-COVID-19 CT benchmarks, the method consistently outperforms several state-of-the-art deep learning models, underscoring the role of data augmentation and gated attention in predictive accuracy and clinical applicability.

Link: https://arxiv.org/abs/2510.06887
Authors: Bouthaina Slika,Fadi Dornaika,Fares Bougourzi,Karim Hammoudi
Affiliations: University of the Basque Country (巴斯克大学); Ho Chi Minh City Open University (胡志明市开放大学); IKERBASQUE; Junia, UMR 8520, CNRS, Central Lille (Junia,UMR 8520,法国国家科学研究中心,中央里尔)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Lung infections, particularly pneumonia, pose serious health risks that can escalate rapidly, especially during pandemics. Accurate AI-based severity prediction from medical imaging is essential to support timely clinical decisions and optimize patient outcomes. In this work, we present a novel method applicable to both CT scans and chest X-rays for assessing lung infection severity. Our contributions are twofold: (i) QCross-Att-PVT, a Transformer-based architecture that integrates parallel encoders, a cross-gated attention mechanism, and a feature aggregator to capture rich multi-scale features; and (ii) Conditional Online TransMix, a custom data augmentation strategy designed to address dataset imbalance by generating mixed-label image patches during training. Evaluated on two benchmark datasets, RALO CXR and Per-COVID-19 CT, our method consistently outperforms several state-of-the-art deep learning models. The results emphasize the critical role of data augmentation and gated attention in improving both robustness and predictive accuracy. This approach offers a reliable, adaptable tool to support clinical diagnosis, disease monitoring, and personalized treatment planning. The source code of this work is available at this https URL.

[CV-38] HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation IROS2025

【Quick Read】: This paper addresses the trade-off between accuracy and real-time performance in LiDAR semantic segmentation on resource-constrained embedded systems: point-based and sparse-convolution methods are accurate but slow, projection methods are fast but lose geometric information, and many recent methods rely on test-time augmentation (TTA), further increasing latency. The keys are a novel pre-processing methodology that significantly reduces computational overhead; a Conv-SE-NeXt feature extraction block that captures representations efficiently without deep layer stacking per stage; and a multi-scale range-point fusion backbone that fuses information at multiple abstraction levels to preserve essential geometric detail and improve accuracy. On nuScenes and SemanticKITTI, HARP-NeXt achieves a superior speed-accuracy trade-off, matching the top-ranked PTv3 without ensembles or TTA while running 24× faster.

Link: https://arxiv.org/abs/2510.06876
Authors: Samir Abou Haidar,Alexandre Chariot,Mehdi Darouich,Cyril Joly,Jean-Emmanuel Deschaud
Affiliations: Paris-Saclay University, CEA, List (法国替代能源与原子能委员会); Mines Paris, PSL University, Centre for Robotics (CAOR) (巴黎矿业大学,PSL大学机器人中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at IROS 2025 (IEEE/RSJ International Conference on Intelligent Robots and Systems)

Abstract:LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24× faster. The code is available at this https URL

[CV-39] SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

【Quick Read】: This paper addresses the "Reasoning Tax" of Multimodal Large Reasoning Models (MLRMs): unconstrained reasoning amplifies safety risks under adversarial or unsafe prompts, while existing defenses act only at the output level and leave the reasoning process ungoverned. The key of SaFeR-VLM is to embed safety directly into multimodal reasoning, shifting it from a passive safeguard to an active driver, via four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, in which unsafe generations undergo reflection and correction rather than being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization reinforcing both safe and corrected reasoning trajectories, enabling scalable, generalizable safety-aware reasoning.

Link: https://arxiv.org/abs/2510.06871
Authors: Huahui Yi,Kun Wang,Qiankun Li,Miao Yu,Liang Lin,Gongli Xi,Hao Wu,Xuming Hu,Kang Li,Yang Liu
Affiliations: West China Biomedical Big Data Center, West China Hospital, SCU; NTU; TeleAI, China Telecom; BUPT; Tsinghua University; HKUST(Guangzhou)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance of 70.13 and 78.97 on safety and helpfulness across six benchmarks, surpassing both same-scale and 10× larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at this https URL.

[CV-40] Explaining raw data complexity to improve satellite onboard processing

【Quick Read】: This paper investigates the challenge of deploying AI directly onboard remote-sensing satellites using raw, unprocessed sensor data rather than preprocessed ground-based products, asking how object detection and classification on raw data affect accuracy and explainability. The key is a simulation workflow that generates raw-like products from high-resolution L1 imagery, enabling systematic evaluation of how the data form affects deep models. Comparing two detectors (YOLOv11s and YOLOX-S) trained on raw versus L1 data, using standard detection metrics and explainability tools, reveals that the raw-data model struggles with object boundary identification at high confidence levels, suggesting that AI architectures with improved contouring methods could enhance object detection on raw images and advance practical onboard AI for remote sensing.

Link: https://arxiv.org/abs/2510.06858
Authors: Adrien Dorise,Marjorie Bellizzi,Adrien Girard,Benjamin Francesconi,Stéphane May
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint: European Data Handling Data Processing Conference (EDHPC) 2025

Abstract:With increasing processing power, deploying AI models for remote sensing directly onboard satellites is becoming feasible. However, new constraints arise, mainly when using raw, unprocessed sensor data instead of preprocessed ground-based products. While current solutions primarily rely on preprocessed sensor images, few approaches directly leverage raw data. This study investigates the effects of utilising raw data on deep learning models for object detection and classification tasks. We introduce a simulation workflow to generate raw-like products from high-resolution L1 imagery, enabling systemic evaluation. Two object detection models (YOLOv11s and YOLOX-S) are trained on both raw and L1 datasets, and their performance is compared using standard detection metrics and explainability tools. Results indicate that while both models perform similarly at low to medium confidence thresholds, the model trained on raw data struggles with object boundary identification at high confidence levels. It suggests that adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.

[CV-41] Online Generic Event Boundary Detection ICCV2025

【Quick Read】: This paper addresses the latency problem of Generic Event Boundary Detection (GEBD) on streaming video: existing methods need complete video frames before predicting, unlike humans who perceive event changes online and in real time. The authors introduce a new task, Online Generic Event Boundary Detection (On-GEBD), whose core challenge is identifying subtle, taxonomy-free event transitions in real time without access to future frames. The key is the Estimator framework, inspired by Event Segmentation Theory (EST), with two components: a Consistent Event Anticipator (CEA) that predicts the future frame reflecting current event dynamics from prior frames only, and an Online Boundary Discriminator (OBD) that measures the prediction error and adaptively adjusts the threshold via statistical tests on past errors to capture diverse, subtle transitions. Estimator outperforms baselines adapted from recent online video understanding models and performs comparably to prior offline GEBD methods on Kinetics-GEBD and TAPOS.

Link: https://arxiv.org/abs/2510.06855
Authors: Hyungrok Jung,Daneul Kim,Seunggyun Lim,Jeany Son,Jonghyun Choi
Affiliations: GIST(韩国科学技术院); Seoul National University(首尔国立大学); POSTECH(浦项工科大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: ICCV 2025

Abstract:Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans processing data online and in real-time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), aiming to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real-time, without the access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST) which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA), and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on the Kinetics-GEBD and TAPOS datasets.
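
The OBD component can be approximated with running error statistics: flag a boundary when the anticipator's prediction error is an outlier relative to past errors. The sketch below uses a simple z-score rule as a stand-in for the paper's adaptive statistical tests.

```python
import numpy as np

class OnlineBoundaryDiscriminator:
    """Flag an event boundary when the anticipator's prediction error is a
    statistical outlier relative to running error statistics (a z-score
    stand-in for the paper's adaptive statistical testing)."""
    def __init__(self, z_thresh: float = 3.0, warmup: int = 30):
        self.errors, self.z, self.warmup = [], z_thresh, warmup

    def step(self, pred_frame: np.ndarray, actual_frame: np.ndarray) -> bool:
        err = float(np.mean((pred_frame - actual_frame) ** 2))
        past = np.array(self.errors)
        is_boundary = (len(past) >= self.warmup and
                       err > past.mean() + self.z * past.std())
        self.errors.append(err)   # update running statistics
        return is_boundary
```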

[CV-42] Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization ECCV2024

【Quick Read】: This paper addresses the poor generalization of Action Quality Assessment (AQA) under the non-stationary quality distributions of real-world scenarios, i.e., how to adapt to evolving distributions while avoiding catastrophic forgetting. It introduces Continual AQA (CAQA) and solves it with Adaptive Manifold-Aligned Graph Regularization (MAGR++). The key is a two-step feature rectification pipeline coupled with backbone fine-tuning that stabilizes shallow layers while adapting deeper ones: a manifold projector translates deviated historical features into the current representation space, and a graph regularizer aligns local and global feature distributions, curbing the overfitting and feature-manifold shift that uncontrolled full fine-tuning induces. On four CAQA benchmarks built from three datasets, MAGR++ achieves state-of-the-art results with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline.

Link: https://arxiv.org/abs/2510.06842
Authors: Kanglei Zhou,Qingyi Pan,Xingxing Zhang,Hubert P. H. Shum,Frederick W. B. Li,Xiaohui Liang,Liyuan Wang
Affiliations: Tsinghua University (清华大学); Durham University (杜伦大学); Beihang University (北京航空航天大学); Zhongguancun Laboratory (中关村实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended Version of MAGR (ECCV 2024 Oral Presentation)

Abstract:Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness. Our code is available at this https URL.

[CV-43] Lattice-allocated Real-time Line Segment Feature Detection and Tracking Using Only an Event-based Camera ICCV

【Quick Read】: This paper addresses real-time line segment detection and tracking using only a modern, high-resolution (i.e., high event rate) event camera, where prior methods either depend on an additional frame camera or degrade at high event rates. The key is a lattice-allocated pipeline with three modules: (i) a velocity-invariant event representation for robustness to motion changes; (ii) line segment detection based on a fitting score for accurate geometric feature extraction; and (iii) line segment tracking by perturbing endpoints for continuity and stability. Evaluations on an ad-hoc recorded dataset and public datasets demonstrate real-time performance and higher accuracy than state-of-the-art event-only and event-frame hybrid baselines, enabling fully stand-alone event-camera operation in real-world settings.

Link: https://arxiv.org/abs/2510.06829
Authors: Mikihiro Ikura,Arren Glover,Masayoshi Mizuno,Chiara Bartolozzi
Affiliations: Istituto Italiano di Tecnologia (意大利技术研究院); Sony Interactive Entertainment Inc. (索尼互动娱乐公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 13 figures, 6 tables, ICCV Workshop NeVi2025

Abstract:Line segment extraction is effective for capturing geometric features of human-made environments. Event-based cameras, which asynchronously respond to contrast changes along edges, enable efficient extraction by reducing redundant data. However, recent methods often rely on additional frame cameras or struggle with high event rates. This research addresses real-time line segment detection and tracking using only a modern, high-resolution (i.e., high event rate) event-based camera. Our lattice-allocated pipeline consists of (i) velocity-invariant event representation, (ii) line segment detection based on a fitting score, (iii) and line segment tracking by perturbating endpoints. Evaluation using ad-hoc recorded dataset and public datasets demonstrates real-time performance and higher accuracy compared to state-of-the-art event-only and event-frame hybrid baselines, enabling fully stand-alone event camera operation in real-world settings.
zh

[CV-44] StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance ICCV2025 CVPR

【速读】: This paper targets content leakage in visual prompting for text-to-image generation: when an image is used as a style prompt, unintended content elements are transferred along with the desired style. The key of the solution is twofold: first, extending classifier-free guidance (CFG) to use swapping self-attention, separating style from content more precisely; second, negative visual query guidance (NVQG), which estimates and suppresses unwanted content transfer by deliberately simulating content-leakage scenarios, swapping the queries rather than the keys and values of self-attention layers, thereby significantly reducing leakage. Across a wide range of styles and text prompts, the method outperforms existing approaches, faithfully reflecting the style of the reference while keeping the generated images consistent with the text prompt.
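
The guidance arithmetic can be sketched compactly. Below is a hedged PyTorch reconstruction of a CFG-style update extended with a negative leak term, assuming three noise predictions are available per denoising step (unconditional, style via swapped keys/values, and simulated leakage via swapped queries); the weights and exact combination are assumptions, not the paper's published equation:

```python
import torch

def nvqg_combine(eps_uncond, eps_style, eps_leak, w_style=5.0, w_neg=2.0):
    """CFG-style combination with a negative visual query term.

    eps_style: prediction with self-attention K/V swapped in from the style image.
    eps_leak : prediction with the queries swapped instead, i.e. a deliberate
               simulation of content leakage whose direction is subtracted.
    """
    return (eps_uncond
            + w_style * (eps_style - eps_uncond)
            - w_neg * (eps_leak - eps_uncond))

eps = [torch.randn(1, 4, 64, 64) for _ in range(3)]
print(nvqg_combine(*eps).shape)   # torch.Size([1, 4, 64, 64])
```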

链接: https://arxiv.org/abs/2510.06827
作者: Jaeseok Jeong,Junho Kim,Gayoung Lee,Yunjey Choi,Youngjung Uh
机构: Yonsei University (延世大学); NAVER AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025; CVPRW AI4CC 2024 (Best Paper + Oral)

点击查看摘要

Abstract:In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and propose 2) negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employs a negative score by intentionally simulating content leakage scenarios that swap queries instead of the keys and values of self-attention layers from visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as visual style prompts. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references, and ensuring that resulting images match the text prompts. Our code is available at this https URL.
zh

[CV-45] Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

【速读】: This paper addresses the poor deployment efficiency of vision-language rerankers in multimodal retrieval: existing joint encoders such as BLIP are bottlenecked by an expensive visual feature-extraction stage and are therefore hard to apply at scale. The key of EDJE (Efficient Discriminative Joint Encoder) is to precompute and compress vision tokens offline with a lightweight attention-based adapter, so that online inference only runs a compact joint encoder over a small set of visual tokens plus the text, drastically cutting storage and online compute while preserving strong retrieval performance.
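
A minimal sketch of the offline-compression idea, assuming a Perceiver-style cross-attention adapter with learned queries (dimensions, query count, and module layout are our assumptions; EDJE's actual adapter may differ):

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Compress precomputed vision tokens into a small fixed set via
    cross-attention with learned queries, a common adapter pattern."""
    def __init__(self, dim=768, n_queries=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_tokens):          # (B, N, dim), precomputed offline
        B = vision_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        compressed, _ = self.attn(q, vision_tokens, vision_tokens)
        return self.norm(compressed)            # (B, n_queries, dim)

# The small compressed token set is what the online joint encoder
# consumes together with the text tokens.
x = torch.randn(2, 196, 768)
print(TokenCompressor()(x).shape)               # torch.Size([2, 16, 768])
```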

链接: https://arxiv.org/abs/2510.06820
作者: Mitchell Keren Taraday,Shahaf Wagner,Chaim Baskin
机构: INSIGHT Lab, Ben-Gurion University of the Negev, Israel
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
zh

[CV-46] VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

【速读】: This paper addresses the difficulty of acquiring high-quality images in echocardiography, where the exceptionally high operational difficulty leaves junior sonographers without expert guidance and patients without timely examinations. The key of the solution is a parameter-efficient Vision-Action Adapter (VA-Adapter) that extends the image encoder of a pretrained ultrasound foundation model to encode vision-action sequences, so that the model learns precise probe adjustment strategies and produces real-time operating recommendations while fine-tuning only a small subset of parameters. Inspired by how experts refine their decisions from past exploration, the compact VA-Adapter has built-in sequential reasoning capabilities and surpasses strong probe guidance models.

链接: https://arxiv.org/abs/2510.06809
作者: Teng Wang,Haojun Jiang,Yuxuan Wang,Zhenguo Sun,Shiji Song,Gao Huang
机构: Tsinghua University (清华大学); Xidian University (西安电子科技大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Echocardiography is a critical tool for detecting heart diseases. Recently, ultrasound foundation models have demonstrated remarkable capabilities in cardiac ultrasound image analysis. However, obtaining high-quality ultrasound images is a prerequisite for accurate diagnosis. Due to the exceptionally high operational difficulty of cardiac ultrasound, there is a shortage of highly skilled personnel, which hinders patients from receiving timely examination services. In this paper, we aim to adapt the medical knowledge learned by foundation models from vast datasets to the probe guidance task, which is designed to provide real-time operational recommendations for junior sonographers to acquire high-quality ultrasound images. Moreover, inspired by the practice where experts optimize action decisions based on past explorations, we meticulously design a parameter-efficient Vision-Action Adapter (VA-Adapter) to enable foundation model’s image encoder to encode vision-action sequences, thereby enhancing guidance performance. With built-in sequential reasoning capabilities in a compact design, the VA-Adapter enables a pre-trained ultrasound foundation model to learn precise probe adjustment strategies by fine-tuning only a small subset of parameters. Extensive experiments demonstrate that the VA-Adapter can surpass strong probe guidance models. Our code will be released after acceptance.
zh

[CV-47] Capture and Interact: Rapid 3D Object Acquisition and Rendering with Gaussian Splatting in Unity

【速读】: This paper addresses the challenge of capturing and rendering three-dimensional (3D) objects in real time for augmented reality, digital twin systems, remote collaboration, and prototyping. The key of the solution is an end-to-end pipeline built on 3D Gaussian Splatting (3D GS) that integrates smartphone video scanning, automated cloud-based reconstruction, and interactive rendering on a local computer: users scan an object with a phone, upload the video for automated 3D reconstruction (roughly 10 minutes on a GPU), and then visualize it interactively in Unity at an average of 150 frames per second (fps) on a laptop, enabling efficient real-time reconstruction and visualization of real-world objects.

链接: https://arxiv.org/abs/2510.06802
作者: Islomjon Shukhratov,Sergey Gorinsky
机构: IMDEA Networks Institute(IMDEA 网络研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Capturing and rendering three-dimensional (3D) objects in real time remain a significant challenge, yet hold substantial potential for applications in augmented reality, digital twin systems, remote collaboration and prototyping. We present an end-to-end pipeline that leverages 3D Gaussian Splatting (3D GS) to enable rapid acquisition and interactive rendering of real-world objects using a mobile device, cloud processing and a local computer. Users scan an object with a smartphone video, upload it for automated 3D reconstruction, and visualize it interactively in Unity at an average of 150 frames per second (fps) on a laptop. The system integrates mobile capture, cloud-based 3D GS and Unity rendering to support real-time telepresence. Our experiments show that the pipeline processes scans in approximately 10 minutes on a graphics processing unit (GPU) achieving real-time rendering on the laptop.
zh

[CV-48] Extreme Amodal Face Detection

【速读】: This paper addresses extreme amodal detection: inferring the full 2D location of objects that are only partially visible or entirely invisible in the input image, such as faces cropped out of the frame, by exploiting contextual cues. Unlike conventional amodal detection, which handles objects occluded but still inside the image, the target here may lie completely outside the image bounds, making the task considerably harder. The key of the solution is a heatmap-based, single-image detector with a selective coarse-to-fine decoder that efficiently uses local and global context to predict object locations in the out-of-frame region, without requiring image sequences or sampling from generative models, yielding stronger and more efficient detection than generative alternatives.

链接: https://arxiv.org/abs/2510.06791
作者: Changlin Song,Yunzhong Hou,Michael Randall Barnes,Rahul Shome,Dylan Campbell
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.
zh

[CV-49] Bionetta: Efficient Client-Side Zero-Knowledge Machine Learning Proving

【速读】: This paper addresses the efficiency and deployability of zero-knowledge proofs (ZKPs) for machine learning models, in particular fast proving on resource-constrained devices and native deployment on the Ethereum Virtual Machine (EVM). The key is Bionetta, an UltraGroth-based zero-knowledge machine learning framework that optimizes proof generation so that custom-crafted neural networks can be proven markedly faster, even on mobile devices; although one-time preprocessing such as circuit compilation and trusted setup becomes more expensive, the scheme keeps proof sizes and verification overhead small enough to be deployed directly on native EVM smart contracts without extra intermediate layers, which the authors believe is unique among comparable tools.

链接: https://arxiv.org/abs/2510.06784
作者: Dmytro Zakharov,Oleksandr Kurbatov,Artem Sdobnov,Lev Soukhanov,Yevhenii Sekhin,Vitalii Volovyk,Mykhailo Velykodnyi,Mark Cherepovskyi,Kyrylo Baibula,Lasha Antadze,Pavlo Kravchenko,Volodymyr Dubinin,Yaroslav Panasenko
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this report, we compare the performance of our UltraGroth-based zero-knowledge machine learning framework Bionetta to other tools of similar purpose such as EZKL, Lagrange’s deep-prove, or zkml. The results show a significant boost in the proving time for custom-crafted neural networks: they can be proven even on mobile devices, enabling numerous client-side proving applications. While our scheme increases the cost of one-time preprocessing steps, such as circuit compilation and generating trusted setup, our approach is, to the best of our knowledge, the only one that is deployable on the native EVM smart contracts without overwhelming proof size and verification overheads.
zh

[CV-50] TTRV: Test-Time Reinforcement Learning for Vision Language Models

【速读】: This paper addresses the dependence of reward-signal extraction in reinforcement learning on labeled data and dedicated training splits, a setup at odds with how humans learn directly from their environment. To enable online adaptation without labels, the authors propose TTRV, whose key is a modification of the Group Relative Policy Optimization (GRPO) framework: rewards are designed from the output frequency of the base model, obtained by sampling each test example multiple times at inference, and an additional reward term based on the entropy of the output empirical distribution controls output diversity. The method delivers consistent gains on object recognition and visual question answering (VQA), up to 52.4% and 29.8% respectively, still yields non-trivial improvements in extremely data-constrained settings, and shows that test-time reinforcement learning can match or exceed the strongest proprietary models.
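
The frequency-based reward plus entropy penalty can be sketched in a few lines; the normalization and entropy coefficient below are assumptions, and in practice these per-sample rewards would feed a GRPO-style advantage computation:

```python
from collections import Counter
import math

def ttrv_rewards(samples, entropy_coef=0.5):
    """Per-sample rewards from output frequency, minus an entropy term on
    the empirical answer distribution (a sketch of the idea; the paper's
    exact normalization may differ).

    samples: list of k decoded answers for one test input.
    """
    k = len(samples)
    counts = Counter(samples)
    probs = {a: c / k for a, c in counts.items()}
    entropy = -sum(p * math.log(p) for p in probs.values())
    # Majority-consistent answers earn high reward; low entropy is encouraged.
    return [probs[a] - entropy_coef * entropy for a in samples]

print(ttrv_rewards(["cat", "cat", "dog", "cat"]))
```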

链接: https://arxiv.org/abs/2510.06783
作者: Akshit Singh,Shyam Marjit,Wei Lin,Paul Gavrikov,Serena Yeung-Levy,Hilde Kuehne,Rogerio Feris,Sivan Doveh,James Glass,M. Jehanzeb Mirza
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Furthermore, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
zh

[CV-51] A deep multiple instance learning approach based on coarse labels for high-resolution land-cover mapping

【速读】: This paper addresses the limited quantity and quality of training labels in high-resolution land-cover mapping, where precise pixel-level annotations for high-resolution imagery (e.g., Sentinel-2) are hard to obtain. The key of the solution is a Deep Multiple Instance Learning (DMIL) method that trains pixel-level multi-class classifiers under weak low-resolution reference data (e.g., MODIS-derived land-cover maps), using flexible pooling layers to link the semantics of high-resolution pixels to the low-resolution labels. The MIL problem is then re-framed in two settings: a multi-class setting, where the low-resolution label is assumed to represent the majority of pixels in a patch, and a multi-label setting, where a patch may carry several valid labels while the low-resolution reference provides only one, so the classifier is trained with a Positive-Unlabeled Learning (PUL) strategy. Experiments on the 2020 IEEE GRSS Data Fusion Contest dataset show that the proposed framework outperforms standard training strategies.
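
For the multi-class setting, the pooling idea reduces to aggregating pixel logits into one patch-level prediction trained against the coarse label. A minimal PyTorch sketch, where mean pooling stands in for the paper's flexible pooling layers:

```python
import torch
import torch.nn.functional as F

def patch_loss_multiclass(pixel_logits, coarse_label):
    """Train a pixel-level classifier from one low-resolution label per patch.

    pixel_logits: (B, C, H, W) high-resolution per-pixel class scores.
    coarse_label: (B,) class index from the low-resolution reference map.
    Mean pooling implements the 'label = majority of pixels' assumption;
    other MIL pooling choices (max, log-sum-exp) fit the same slot.
    """
    patch_logits = pixel_logits.mean(dim=(2, 3))        # (B, C)
    return F.cross_entropy(patch_logits, coarse_label)

logits = torch.randn(4, 10, 32, 32, requires_grad=True)
labels = torch.randint(0, 10, (4,))
patch_loss_multiclass(logits, labels).backward()        # pixel-level gradients
```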

链接: https://arxiv.org/abs/2510.06769
作者: Gianmarco Perantoni,Lorenzo Bruzzone
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures, accepted conference paper at SPIE REMOTE SENSING, 3-7 September 2023, Amsterdam, Netherlands

点击查看摘要

Abstract:The quantity and the quality of the training labels are central problems in high-resolution land-cover mapping with machine-learning-based solutions. In this context, weak labels can be gathered in large quantities by leveraging on existing low-resolution or obsolete products. In this paper, we address the problem of training land-cover classifiers using high-resolution imagery (e.g., Sentinel-2) and weak low-resolution reference data (e.g., MODIS-derived land-cover maps). Inspired by recent works in Deep Multiple Instance Learning (DMIL), we propose a method that trains pixel-level multi-class classifiers and predicts low-resolution labels (i.e., patch-level classification), where the actual high-resolution labels are learned implicitly without direct supervision. This is achieved with flexible pooling layers that are able to link the semantics of the pixels in the high-resolution imagery to the low-resolution reference labels. Then, the Multiple Instance Learning (MIL) problem is re-framed in a multi-class and in a multi-label setting. In the former, the low-resolution annotation represents the majority of the pixels in the patch. In the latter, the annotation only provides us information on the presence of one of the land-cover classes in the patch and thus multiple labels can be considered valid for a patch at a time, whereas the low-resolution labels provide us only one label. Therefore, the classifier is trained with a Positive-Unlabeled Learning (PUL) strategy. Experimental results on the 2020 IEEE GRSS Data Fusion Contest dataset show the effectiveness of the proposed framework compared to standard training strategies.
zh

[CV-52] Transforming Noise Distributions with Histogram Matching: Towards a Single Denoiser for All

【速读】: This paper addresses the limited generalization of supervised Gaussian denoisers to out-of-distribution noise, caused by the very different distributional characteristics of different noise types. The key of the solution is a histogram matching method that transforms arbitrary noise toward a target Gaussian distribution of known intensity, together with a mutually reinforcing cycle between noise transformation and denoising: the cycle progressively refines the noise to be converted so that it approximates the real noise, which improves the transformation and in turn the denoising performance. Specific noise complexities are handled by three targeted strategies: local histogram matching for signal-dependent noise, intra-patch permutation for channel-related noise, and frequency-domain histogram matching combined with pixel-shuffle down-sampling to break spatial correlation. A single Gaussian denoiser thereby gains the ability to handle diverse synthetic noises (e.g., Poisson, salt-and-pepper, repeating patterns) as well as complex real-world noise.
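
The core global transformation is classic quantile-based histogram matching of the noise residual onto Gaussian samples. A minimal NumPy sketch under the assumption that a rough clean estimate is available (in the paper's cycle it comes from the current denoiser output); the local and frequency-domain variants follow the same pattern:

```python
import numpy as np

def match_to_gaussian(noisy, clean_est, sigma=25.0):
    """Transform arbitrary noise toward Gaussian noise of known strength
    by histogram (quantile) matching of the residual.

    noisy, clean_est: float arrays of the same shape (0-255 range assumed).
    """
    residual = (noisy - clean_est).ravel()
    order = np.argsort(residual)
    # Target: sorted Gaussian samples with the desired std, same count.
    target = np.sort(np.random.default_rng(0).normal(0.0, sigma, residual.size))
    matched = np.empty_like(residual)
    matched[order] = target          # rank-i residual -> rank-i Gaussian value
    return clean_est + matched.reshape(noisy.shape)

# Toy usage with skewed (gamma) noise; the matched residual becomes ~N(0, 25).
img = np.full((64, 64), 128.0)
noisy = img + np.random.default_rng(1).standard_gamma(2.0, (64, 64)) * 10
print(match_to_gaussian(noisy, img).std())
```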

链接: https://arxiv.org/abs/2510.06757
作者: Sheng Fu,Junchao Zhang,Kailun Yang
机构: Hunan Provincial Key Laboratory of Optic-Electronic Intelligent Measurement and Control, School of Automation, Central South University (中南大学自动化学院); School of Artificial Intelligence and Robotics, Hunan University (湖南大学人工智能与机器人学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:Supervised Gaussian denoisers exhibit limited generalization when confronted with out-of-distribution noise, due to the diverse distributional characteristics of different noise types. To bridge this gap, we propose a histogram matching approach that transforms arbitrary noise towards a target Gaussian distribution with known intensity. Moreover, a mutually reinforcing cycle is established between noise transformation and subsequent denoising. This cycle progressively refines the noise to be converted, making it approximate the real noise, thereby enhancing the noise transformation effect and further improving the denoising performance. We tackle specific noise complexities: local histogram matching handles signal-dependent noise, intrapatch permutation processes channel-related noise, and frequency-domain histogram matching coupled with pixel-shuffle down-sampling breaks spatial correlation. By applying these transformations, a single Gaussian denoiser gains remarkable capability to handle various out-of-distribution noises, including synthetic noises such as Poisson, salt-and-pepper and repeating pattern noises, as well as complex real-world noises. Extensive experiments demonstrate the superior generalization and effectiveness of our method.
zh

[CV-53] UniFField: A Generalizable Unified Neural Feature Field for Visual Semantic and Spatial Uncertainties in Any Scene

【速读】: This paper addresses two key limitations of current 3D neural feature fields for robotic tasks: existing methods are typically scene-specific with little generalization, and they cannot model the uncertainty of their predictions, leaving robots unable to judge the reliability of perceived information. The key of the solution is UniFField, a unified uncertainty-aware neural feature field that fuses visual, semantic, and geometric features into a single generalizable representation and simultaneously updates uncertainty estimates for each modality as RGB-D images are integrated incrementally. The approach transfers zero-shot to any new environment, accurately describes model prediction errors in scene reconstruction and semantic feature prediction, and is successfully applied to an active object search task with a mobile manipulator, demonstrating its value for robust decision-making.
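
The incremental fusion bookkeeping can be illustrated with a per-voxel running mean and variance (Welford's update), using variance as a crude uncertainty proxy; UniFField's uncertainty estimates are learned, so this sketch only shows the integration pattern:

```python
import numpy as np

class VoxelFeatureGrid:
    """Incremental per-voxel feature fusion with running variance as a
    simple uncertainty proxy (illustrative bookkeeping only)."""
    def __init__(self, shape=(64, 64, 64), dim=16):
        self.count = np.zeros(shape)
        self.mean = np.zeros(shape + (dim,))
        self.m2 = np.zeros(shape + (dim,))

    def integrate(self, idx, feats):
        """idx: (N, 3) voxel indices; feats: (N, dim) features from one frame."""
        for (i, j, k), f in zip(idx, feats):
            self.count[i, j, k] += 1
            delta = f - self.mean[i, j, k]
            self.mean[i, j, k] += delta / self.count[i, j, k]
            self.m2[i, j, k] += delta * (f - self.mean[i, j, k])

    def uncertainty(self, i, j, k):
        n = max(self.count[i, j, k], 2)
        return (self.m2[i, j, k] / (n - 1)).mean()  # high variance = less reliable

grid = VoxelFeatureGrid()
idx = np.array([[1, 2, 3], [1, 2, 3]])
grid.integrate(idx, np.random.default_rng(0).normal(size=(2, 16)))
print(grid.uncertainty(1, 2, 3))
```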

链接: https://arxiv.org/abs/2510.06754
作者: Christian Maurer,Snehal Jauhri,Sophie Lueth,Georgia Chalvatzaki
机构: TU Darmstadt (达姆施塔特工业大学); Hessian.AI; Robotics Institute Germany (德国机器人研究所); Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, it is necessary for the robot to evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimation. We evaluate our uncertainty estimations to accurately describe the model prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage our feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.
zh

[CV-54] OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot

【速读】: This paper addresses the prohibitive inference cost of large-scale text-to-image diffusion models and the fact that existing one-shot pruning methods can hardly be applied to them directly because of their iterative denoising nature. The key of the proposed OBS-Diff framework is training-free compression through three innovations: first, it revitalizes the classic Optimal Brain Surgeon (OBS) algorithm for the complex architectures of modern diffusion models and supports diverse pruning granularity, including unstructured, N:M semi-structured, and structured sparsity (MHA heads and FFN neurons); second, from an error-accumulation perspective it introduces a timestep-aware Hessian construction with a logarithmically decreasing weighting scheme, assigning greater importance to earlier timesteps to suppress error accumulation; third, a computationally efficient group-wise sequential pruning strategy amortizes the expensive calibration process, yielding accurate, training-free one-shot compression.
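
One plausible instantiation of the timestep-aware Hessian is to accumulate the usual OBS proxy H = X^T X over denoising steps with logarithmically decreasing weights, so that early steps dominate. A hedged PyTorch sketch (the weight formula is our assumption, not the paper's exact scheme):

```python
import torch

def timestep_weights(T):
    """Logarithmically decreasing weights over T denoising steps, so the
    earliest steps receive the largest weight."""
    w = torch.log(torch.arange(T, 0, -1).float() + 1.0)
    return w / w.sum()

def accumulate_hessian(activations_per_t):
    """OBS-style proxy Hessian H = sum_t w_t * X_t^T X_t from calibration
    activations; activations_per_t: list over timesteps of (N, d) tensors."""
    T = len(activations_per_t)
    w = timestep_weights(T)
    d = activations_per_t[0].shape[1]
    H = torch.zeros(d, d)
    for t, X in enumerate(activations_per_t):
        H += w[t] * (X.T @ X) / X.shape[0]
    return H

acts = [torch.randn(128, 64) for _ in range(10)]
print(accumulate_hessian(acts).shape)   # torch.Size([64, 64])
```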

链接: https://arxiv.org/abs/2510.06751
作者: Junhan Zhu,Hesong Wang,Mingluo Su,Zefang Wang,Huan Wang
机构: Westlake University (西湖大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
zh

[CV-55] DeRainMamba: A Frequency-Aware State Space Model with Detail Enhancement for Image Deraining

【速读】: This paper addresses the limitations of existing state-space (Mamba-based) models for single image deraining: weak capture of fine-grained details and a lack of frequency-domain awareness, which upsets the balance between rain removal and detail preservation. The key of DeRainMamba is the combination of a Frequency-Aware State-Space Module (FASSM) and Multi-Directional Perception Convolution (MDPConv): FASSM uses the Fourier transform to distinguish rain streaks from high-frequency image details, jointly optimizing streak removal and structural fidelity, while MDPConv restores local structure by capturing anisotropic gradient features and efficiently fusing multiple convolution branches. On four public benchmarks the method consistently surpasses state-of-the-art approaches in PSNR and SSIM with fewer parameters and lower computational cost, validating the combination of frequency-domain modeling and spatial detail enhancement.

链接: https://arxiv.org/abs/2510.06746
作者: Zhiliang Zhu,Tao Zeng,Tao Yang,Guoliang Luo,Jiyong Zeng
机构: East China Jiaotong University (华东交通大学); Lianchuang Electronic Technology Company, Ltd. (联创电子科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by IEEE SPL

点击查看摘要

Abstract:Image deraining is crucial for improving visual quality and supporting reliable downstream vision tasks. Although Mamba-based models provide efficient sequence modeling, their limited ability to capture fine-grained details and lack of frequency-domain awareness restrict further improvements. To address these issues, we propose DeRainMamba, which integrates a Frequency-Aware State-Space Module (FASSM) and Multi-Directional Perception Convolution (MDPConv). FASSM leverages Fourier transform to distinguish rain streaks from high-frequency image details, balancing rain removal and detail preservation. MDPConv further restores local structures by capturing anisotropic gradient features and efficiently fusing multiple convolution branches. Extensive experiments on four public benchmarks demonstrate that DeRainMamba consistently outperforms state-of-the-art methods in PSNR and SSIM, while requiring fewer parameters and lower computational costs. These results validate the effectiveness of combining frequency-domain modeling and spatial detail enhancement within a state-space framework for single image deraining.
zh

[CV-56] SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis

【速读】: This paper addresses the difficulty of modeling dynamic scenes with both accurate deformation capture and computational efficiency, particularly for object tracking and novel view synthesis. The key of the solution is SCas4D, a cascaded optimization framework that exploits structural patterns in 3D Gaussian Splatting: real-world deformations are often hierarchical, with groups of Gaussians sharing similar transformations, so deformation parameters are refined progressively from coarse part level to fine point level. SCas4D converges within about 100 iterations per time frame and matches existing methods with only one-twentieth of their training iterations, while also proving effective for self-supervised articulated object segmentation, novel view synthesis, and dense point tracking.

链接: https://arxiv.org/abs/2510.06694
作者: Jipeng Lyu,Jiahua Dong,Yu-Xiong Wang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in Transactions on Machine Learning Research (06/2025)

点击查看摘要

Abstract:Persistent dynamic scene modeling for tracking and novel-view synthesis remains challenging due to the difficulty of capturing accurate deformations while maintaining computational efficiency. We propose SCas4D, a cascaded optimization framework that leverages structural patterns in 3D Gaussian Splatting for dynamic scenes. The key idea is that real-world deformations often exhibit hierarchical patterns, where groups of Gaussians share similar transformations. By progressively refining deformations from coarse part-level to fine point-level, SCas4D achieves convergence within 100 iterations per time frame and produces results comparable to existing methods with only one-twentieth of the training iterations. The approach also demonstrates effectiveness in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking tasks.
zh

[CV-57] Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion

【速读】: This paper addresses semantic segmentation under complex autonomous-driving conditions, especially occlusion, and the difficulty of effectively fusing the light field and LiDAR modalities given limited viewpoint diversity and inherent modality discrepancies. The key of the solution is the first multimodal semantic segmentation dataset integrating light field data with point clouds, together with a multimodal light field point-cloud fusion segmentation network (Mlpfseg) that incorporates feature completion and depth perception to segment camera images and LiDAR point clouds simultaneously: the feature completion module resolves the density mismatch between point clouds and image pixels via differential reconstruction of point-cloud feature maps, while the depth perception module raises attention scores for occluded objects to improve their segmentation. The method outperforms image-only segmentation by 1.71 mIoU and point-cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.

链接: https://arxiv.org/abs/2510.06687
作者: Jie Luo,Yuxuan Jiang,Xin Jin,Mingyu Liu,Yihui Fan
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, the first multimodal semantic segmentation dataset integrating light field data and point cloud data is proposed. Based on this dataset, we proposed a multi-modal light field point-cloud fusion segmentation network (Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point-cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image-only segmentation by 1.71 Mean Intersection over Union (mIoU) and point cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.
zh

[CV-58] DreamOmni2: Multimodal Instruction-based Editing and Generation

【速读】: This paper addresses two limitations of current instruction-based image editing and subject-driven generation: purely textual instructions struggle to express editing details precisely and need reference images for supplementary information, and existing methods can only compose concrete objects or people, leaving abstract concepts out of reach. The authors therefore propose two new tasks, multimodal instruction-based editing and generation, which accept both text and image inputs and cover concrete as well as abstract concepts. The key of the solution is threefold: a three-stage data synthesis pipeline that uses feature mixing to create extraction data for abstract and concrete concepts and then builds multimodal training samples; an index encoding and position encoding shift scheme that effectively distinguishes multiple input images and avoids pixel confusion (sketched below); and joint training of a vision-language model (VLM) with the editing/generation model to better understand and execute complex instructions.
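
The index/position encoding shift can be illustrated as follows: each reference image gets an index embedding, and its position ids are offset by a per-image stride so tokens from different images never share coordinates. All sizes and the module layout below are illustrative assumptions; DreamOmni2's exact scheme may differ:

```python
import torch
import torch.nn as nn

class MultiImageEmbedder(nn.Module):
    """Disambiguate tokens coming from several reference images via
    (i) a per-image index embedding and (ii) position ids shifted by a
    per-image stride."""
    def __init__(self, dim=512, max_images=4, shift=1024):
        super().__init__()
        self.index_emb = nn.Embedding(max_images, dim)
        self.pos_emb = nn.Embedding(shift * max_images, dim)
        self.shift = shift

    def forward(self, image_tokens):                    # (B, n_img, T, dim)
        B, n_img, T, _ = image_tokens.shape
        out = []
        for i in range(n_img):
            pos = torch.arange(T) + i * self.shift      # shifted position ids
            tok = (image_tokens[:, i]
                   + self.pos_emb(pos)[None]            # (1, T, dim)
                   + self.index_emb(torch.tensor(i))[None, None])
            out.append(tok)
        return torch.cat(out, dim=1)                    # (B, n_img*T, dim)

x = torch.randn(2, 3, 256, 512)
print(MultiImageEmbedder()(x).shape)                    # torch.Size([2, 768, 512])
```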

链接: https://arxiv.org/abs/2510.06679
作者: Bin Xia,Bohao Peng,Yuechen Zhang,Junjia Huang,Jiyang Liu,Jingyao Li,Haoru Tan,Sitong Wu,Chengyao Wang,Yitong Wang,Xinglong Wu,Bei Yu,Jiaya Jia
机构: CUHK (香港中文大学); HKUST (香港科技大学); HKU (香港大学); ByteDance Inc (字节跳动公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.
zh

[CV-59] Heptapod: Language Modeling on Visual Signals

【速读】: This paper addresses the tension between modeling efficiency and semantic completeness in conventional autoregressive image generation, in particular the reliance on classifier-free guidance (CFG) and on semantic tokenizers. The key of Heptapod is next 2D distribution prediction: a causal Transformer paired with a reconstruction-focused visual tokenizer learns, at each timestep, to predict the pixel distribution over the entire 2D spatial grid of the image, unifying the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, so that the model captures comprehensive image semantics through generative training. Heptapod achieves an FID of 2.70 on the ImageNet generation benchmark, significantly outperforming previous causal autoregressive approaches.

链接: https://arxiv.org/abs/2510.06673
作者: Yongxin Zhu,Jiawei Chen,Yuanzhe Chen,Zhuo Chen,Dongya Jia,Jian Cong,Xiaobin Zhuang,Yuping Wang,Yuxuan Wang
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on CFG, and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
zh

[CV-60] Automated Neural Architecture Design for Industrial Defect Detection

【速读】: This paper targets the two main challenges of industrial surface defect detection (SDD) arising from the diverse shapes and sizes of defects: large intra-class difference and strong inter-class similarity. The key of AutoNAD, an automated neural architecture design framework for SDD, is fourfold: (1) a joint search over convolutions, transformers, and multi-layer perceptrons (MLPs), so that hybrid architectures capture both fine-grained local variation and long-range semantic context; (2) a cross weight sharing strategy that accelerates supernet convergence and improves subnet performance; (3) a searchable multi-level feature aggregation module (MFAM) that strengthens multi-scale feature learning; and (4) a latency-aware prior that guides the selection of efficient architectures, balancing detection accuracy against deployment latency.

链接: https://arxiv.org/abs/2510.06669
作者: Yuxi Liu,Yunfeng Ma,Yi Tang,Min Liu,Shuai Jiang,Yaonan Wang
机构: Hunan University (湖南大学); National Engineering Research Center for Robot Visual Perception and Control Technology (国家机器人视觉感知与控制技术工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code will be available at this https URL.
zh

[CV-61] The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators

【速读】: This paper tackles a central question for machine-learned operators (MLOs) in scientific machine learning: whether architectures that nominally support inference at arbitrary resolution can actually perform "zero-shot super-resolution," i.e., accurate inference on data of higher resolution than seen in training. Decoupling multi-resolution inference into extrapolation to new frequency content and interpolation across resolutions, the study shows empirically that MLOs fail at both in a zero-shot manner, are brittle away from the training resolution, and are susceptible to aliasing. The key remedy is a simple, computationally efficient, data-driven multi-resolution training protocol that overcomes aliasing and delivers robust multi-resolution generalization.
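
At its core, such a protocol exposes the operator to several discretizations during training with proper anti-aliased resampling. A minimal PyTorch sketch of one such loop (the resolution set and sampling rule are our assumptions):

```python
import random

import torch
import torch.nn.functional as F

def multires_batch(u, resolutions=(32, 64, 128)):
    """Resample a batch of fields to a randomly drawn training resolution
    with an anti-aliased filter.

    u: (B, C, H, W) fields stored at the finest available resolution.
    """
    r = random.choice(resolutions)
    return F.interpolate(u, size=(r, r), mode="bilinear",
                         align_corners=False, antialias=True)

# Each optimization step then sees a different discretization:
fine = torch.randn(8, 1, 128, 128)
for step in range(3):
    batch = multires_batch(fine)
    # loss = criterion(model(batch), targets_at(batch.shape[-1])) ...
    print(step, batch.shape)
```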

链接: https://arxiv.org/abs/2510.06646
作者: Mansi Sakarvadia,Kareem Hegazy,Amin Totounferoush,Kyle Chard,Yaoqing Yang,Ian Foster,Michael W. Mahoney
机构: University of Chicago (芝加哥大学); Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室); University of Stuttgart (斯图加特大学); Dartmouth College (达特茅斯学院); International Computer Science Institute (国际计算机科学研究所); University of California, Berkeley (加州大学伯克利分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A core challenge in scientific machine learning, and scientific computing more generally, is modeling continuous phenomena which (in practice) are represented discretely. Machine-learned operators (MLOs) have been introduced as a means to achieve this modeling goal, as this class of architecture can perform inference at arbitrary resolution. In this work, we evaluate whether this architectural innovation is sufficient to perform “zero-shot super-resolution,” namely to enable a model to serve inference on higher-resolution data than that on which it was originally trained. We comprehensively evaluate both zero-shot sub-resolution and super-resolution (i.e., multi-resolution) inference in MLOs. We decouple multi-resolution inference into two key behaviors: 1) extrapolation to varying frequency information; and 2) interpolating across varying resolutions. We empirically demonstrate that MLOs fail to do both of these tasks in a zero-shot manner. Consequently, we find MLOs are not able to perform accurate inference at resolutions different from those on which they were trained, and instead they are brittle and susceptible to aliasing. To address these failure modes, we propose a simple, computationally-efficient, and data-driven multi-resolution training protocol that overcomes aliasing and that provides robust multi-resolution generalization.
zh

[CV-62] StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

【速读】: This paper studies implicit-knowledge KVQA (IK-KVQA), where a multimodal large language model (MLLM) serves as the sole knowledge source without external retrieval, but lacks explicit reasoning supervision, produces inconsistent justifications, and generalizes poorly after standard supervised fine-tuning (SFT). The key of StaR-KVQA is to construct and select path-grounded structured reasoning traces, dual symbolic relation paths plus path-anchored natural-language explanations, making reasoning transparent and verifiable: a single open-source MLLM builds these traces offline to form a trace-enriched dataset, and fine-tuning via structured self-distillation aligns generation with this supervision, with no external retrievers, verifiers, or knowledge bases required and inference remaining a single autoregressive pass. StaR-KVQA improves both accuracy and interpretability, with up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline and robust cross-domain generalization.

链接: https://arxiv.org/abs/2510.06638
作者: Zhihao Wen,Wenkang Wei,Yuan Fang,Xingtong Yu,Hui Zhang,Weicheng Zhu,Xin Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source, without external retrieval. Yet, MLLMs lack explicit reasoning supervision and produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces - dual symbolic relation paths plus path-grounded natural-language explanations - so that reasoning becomes transparent and verifiable. With one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.
zh

[CV-63] Control-Augmented Autoregressive Diffusion for Data Assimilation

【速读】: This paper addresses the underexplored problem of guidance in Auto-Regressive Diffusion Models (ARDMs), focusing on data assimilation (DA) for chaotic spatiotemporal partial differential equations (PDEs), where existing methods are often computationally prohibitive and prone to forecast drift under sparse observations. The key of the solution is a generalizable framework that augments a pretrained ARDM with a lightweight controller network, trained offline by previewing future ARDM rollouts and learning stepwise controls that minimize a terminal cost objective. This reduces DA inference to a single forward rollout with on-the-fly corrections, avoiding expensive adjoint computations or optimization at inference time, and it outperforms four state-of-the-art baselines in stability, accuracy, and physical fidelity across two canonical PDEs and six observation regimes.

链接: https://arxiv.org/abs/2510.06637
作者: Prakhar Srivastava,Farrin Marouf Sofian,Francesco Immorlano,Kushagra Pandey,Stephan Mandt
机构: University of California, Irvine (加州大学欧文分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent advances in test-time scaling and finetuning of diffusion models, guidance in Auto-Regressive Diffusion Models (ARDMs) remains underexplored. We introduce an amortized framework that augments pretrained ARDMs with a lightweight controller network, trained offline by previewing future ARDM rollouts and learning stepwise controls that anticipate upcoming observations under a terminal cost objective. We evaluate this framework in the context of data assimilation (DA) for chaotic spatiotemporal partial differential equations (PDEs), a setting where existing methods are often computationally prohibitive and prone to forecast drift under sparse observations. Our approach reduces DA inference to a single forward rollout with on-the-fly corrections, avoiding expensive adjoint computations and/or optimizations during inference. We demonstrate that our method consistently outperforms four state-of-the-art baselines in stability, accuracy, and physical fidelity across two canonical PDEs and six observation regimes. We will release code and checkpoints publicly.
zh

[CV-64] StruSR: Structure-Aware Symbolic Regression with Physics-Informed Taylor Guidance

【速读】: This paper addresses the inability of traditional symbolic regression to extract structured physical priors from time-series observations, which prevents discovered symbolic expressions from accurately reflecting a system's global behavior. The key of the proposed structure-aware framework StruSR is to exploit trained Physics-Informed Neural Networks (PINNs) to extract locally structured physical priors: local Taylor expansions of the PINN output supply derivative-based structural information that guides the evolution of symbolic expressions; a masking-based attribution mechanism quantifies each subtree's contribution to structural alignment and physics-residual reduction, steering mutation and crossover in genetic programming so that substructures of high physical or structural importance are preserved while less informative components are modified; and a hybrid fitness function jointly minimizes the physics residual and the Taylor-coefficient mismatch, keeping expressions consistent with both the governing equations and the local analytic behavior encoded by the PINN. On benchmark PDE systems, StruSR improves convergence speed, structural fidelity, and expression interpretability over conventional baselines.

链接: https://arxiv.org/abs/2510.06635
作者: Yunpeng Gong,Sihan Lan,Can Yang,Kunpeng Xu,Min Jiang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Symbolic regression aims to find interpretable analytical expressions by searching over mathematical formula spaces to capture underlying system behavior, particularly in scientific modeling governed by physical laws. However, traditional methods lack mechanisms for extracting structured physical priors from time series observations, making it difficult to capture symbolic expressions that reflect the system’s global behavior. In this work, we propose a structure-aware symbolic regression framework, called StruSR, that leverages trained Physics-Informed Neural Networks (PINNs) to extract locally structured physical priors from time series data. By performing local Taylor expansions on the outputs of the trained PINN, we obtain derivative-based structural information to guide symbolic expression evolution. To assess the importance of expression components, we introduce a masking-based attribution mechanism that quantifies each subtree’s contribution to structural alignment and physical residual reduction. These sensitivity scores steer mutation and crossover operations within genetic programming, preserving substructures with high physical or structural significance while selectively modifying less informative components. A hybrid fitness function jointly minimizes physics residuals and Taylor coefficient mismatch, ensuring consistency with both the governing equations and the local analytical behavior encoded by the PINN. Experiments on benchmark PDE systems demonstrate that StruSR improves convergence speed, structural fidelity, and expression interpretability compared to conventional baselines, offering a principled paradigm for physics-grounded symbolic discovery.
zh

[CV-65] Unsupervised Backdoor Detection and Mitigation for Spiking Neural Networks RAID2025

【速读】: This paper addresses the security of Spiking Neural Networks (SNNs) against backdoor attacks, where defenses designed for Artificial Neural Networks (ANNs) perform poorly or are easily bypassed due to SNNs' event-driven, temporally dependent behavior. The key of the solution is an unsupervised post-training detection framework, Temporal Membrane Potential Backdoor Detection (TMPBD), which exploits the maximum margin statistics of the temporal membrane potential (TMP) in the final spiking layer to identify the target label without any attack knowledge or data access, plus a robust mitigation mechanism, Neural Dendrites Suppression Backdoor Mitigation (NDSBM), which clamps dendritic connections between early convolutional layers to suppress malicious neurons while preserving benign behavior, guided by TMP extracted from a small clean unlabeled dataset. Experiments show TMPBD achieves 100% detection accuracy, while NDSBM reduces the attack success rate from 100% to 8.44%, and to 2.81% when combined with detection, without degrading clean accuracy.
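
A hedged sketch of the max-margin idea: for each candidate label, measure how dominant its peak membrane potential can become over the runner-up across clean unlabeled inputs, then flag a robust outlier. The exact statistic and decision rule in TMPBD may differ from this reconstruction:

```python
import numpy as np

def tmp_max_margin_scores(tmp):
    """Max-margin statistic per label from final-layer temporal membrane
    potentials.

    tmp: (N, T, C) potentials for N unlabeled clean inputs, T timesteps,
         C classes.
    """
    peak = tmp.max(axis=1)                        # (N, C) per-class peak potential
    scores = []
    for c in range(peak.shape[1]):
        margin = peak[:, c] - np.max(np.delete(peak, c, axis=1), axis=1)
        scores.append(margin.max())               # how dominant class c can get
    return np.array(scores)

def detect_target(scores, z_thresh=3.5):
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-8
    z = 0.6745 * (scores - med) / mad             # robust z-score
    c = int(np.argmax(z))
    return c if z[c] > z_thresh else None         # suspected backdoor target label

tmp = np.random.default_rng(0).normal(size=(200, 8, 10))
tmp[:, :, 3] += 2.0                               # crude planted dominance on class 3
print(detect_target(tmp_max_margin_scores(tmp))) # likely flags class 3
```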

链接: https://arxiv.org/abs/2510.06629
作者: Jiachen Li,Bang Wu,Xiaoyu Xia,Xiaoning Liu,Xun Yi,Xiuzhen Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear in The 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2025)

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have gained increasing attention for their superior energy efficiency compared to Artificial Neural Networks (ANNs). However, their security aspects, particularly under backdoor attacks, have received limited attention. Existing defense methods developed for ANNs perform poorly or can be easily bypassed in SNNs due to their event-driven and temporal dependencies. This paper identifies the key blockers that hinder traditional backdoor defenses in SNNs and proposes an unsupervised post-training detection framework, Temporal Membrane Potential Backdoor Detection (TMPBD), to overcome these challenges. TMPBD leverages the maximum margin statistics of temporal membrane potential (TMP) in the final spiking layer to detect target labels without any attack knowledge or data access. We further introduce a robust mitigation mechanism, Neural Dendrites Suppression Backdoor Mitigation (NDSBM), which clamps dendritic connections between early convolutional layers to suppress malicious neurons while preserving benign behaviors, guided by TMP extracted from a small, clean, unlabeled dataset. Extensive experiments on multiple neuromorphic benchmarks and state-of-the-art input-aware dynamic trigger attacks demonstrate that TMPBD achieves 100% detection accuracy, while NDSBM reduces the attack success rate from 100% to 8.44%, and to 2.81% when combined with detection, without degrading clean accuracy.
zh

[CV-66] MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking

【速读】: This paper addresses the limitations of RGB-based trackers in real-world scenarios with occlusion, interference from similar objects, and complex backgrounds. The key of the solution is MSITrack, the largest and most diverse multispectral single-object tracking dataset to date: it leverages pixel-level spectral reflectance to improve target discriminability, spans 55 object categories and 300 distinct natural scenes with more challenging attributes such as color and texture similarity between targets and backgrounds, and provides over 129k frames, each meticulously processed, manually labeled, and verified in multiple stages. Evaluations with representative trackers show that the multispectral data in MSITrack delivers significant improvements over RGB-only baselines, highlighting its potential to drive future advances in multispectral tracking.

链接: https://arxiv.org/abs/2510.06619
作者: Tao Feng,Tingfa Xu,Haolin Qin,Tianhao Li,Shuaihao Han,Xuyang Zou,Zhan Lv,Jianan Li
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual object tracking in real-world scenarios presents numerous challenges including occlusion, interference from similar objects and complex backgrounds-all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes-including interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes-spanning 55 object categories and 300 distinct natural scenes, MSITrack far exceeds the scope of existing benchmarks. Many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale-300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: this https URL.
zh

[CV-67] A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages

【速读】: This paper addresses the poor performance of multilingual speech-driven talking face synthesis (TFS), wrong mouth shapes and rigid expressions in non-English languages, caused by English-dominated training data and weak cross-language generalization. The key of the solution is Multilingual Experts (MuEx), a framework built around a Phoneme-Guided Mixture-of-Experts (PG-MoE) that uses phonemes and visemes, the basic units of speech sounds and mouth movements, as universal intermediaries bridging audio and video: audio features are extracted as phonemes and video features as visemes, and a Phoneme-Viseme Alignment Mechanism (PV-Align) establishes robust cross-modal correspondences, mitigating the effects of linguistic differences and dataset bias. The authors also build a Multilingual Talking Face Benchmark (MTFB) covering 12 diverse languages with 95.04 hours of high-quality video; MuEx achieves superior performance across all languages and generalizes zero-shot to unseen languages without additional training.

链接: https://arxiv.org/abs/2510.06612
作者: Zibo Su,Kun Wei,Jiahua Li,Xu Yang,Cheng Deng
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. The terrible performance is caused by the English-dominated training datasets and the lack of cross-language generalization abilities. Thus, we propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge audio and video modalities, achieving lifelike multilingual TFS. To alleviate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes respectively, which are the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages with 95.04 hours of high-quality videos for training and evaluating multilingual TFS performance. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and exhibits effective zero-shot generalization to unseen languages without additional training.
zh

[CV-68] Self-supervised Physics-guided Model with Implicit Representation Regularization for Fast MRI Reconstruction

【速读】: This paper addresses the long scan times that limit the wider use of magnetic resonance imaging (MRI), proposing a zero-shot self-supervised reconstruction framework, UnrollINR, that requires no external training data. The key of the solution is to combine a physics-guided unrolled iterative reconstruction architecture with an Implicit Neural Representation (INR) serving as a regularization prior that effectively constrains the solution space, enhancing both the interpretability and the reconstruction performance of the model. Even at a high acceleration factor of 10, UnrollINR achieves reconstruction superior to a supervised learning method, validating the approach.

链接: https://arxiv.org/abs/2510.06611
作者: Jingran Xu,Yuanyuan Liu,Yanjie Zhu
机构: Paul C. Lauterbur Research Center for Biomedical Imaging (保罗·劳特伯生物医学成像研究中心); Shenzhen Institute of Advanced Technology (深圳先进技术研究院); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is a vital clinical diagnostic tool, yet its widespread application is limited by prolonged scan times. Fast MRI reconstruction techniques effectively reduce acquisition duration by reconstructing high-fidelity MR images from undersampled k-space data. In recent years, deep learning-based methods have demonstrated remarkable progress in this field, with self-supervised and unsupervised learning approaches proving particularly valuable in scenarios where fully sampled data are difficult to obtain. This paper proposes a novel zero-shot self-supervised reconstruction framework named UnrollINR, which enables scan-specific MRI reconstruction without relying on external training data. The method adopts a physics-guided unrolled iterative reconstruction architecture and introduces Implicit Neural Representation (INR) as a regularization prior to effectively constrain the solution space. By combining a deep unrolled structure with the powerful implicit representation capability of INR, the model’s interpretability and reconstruction performance are enhanced. Experimental results demonstrate that even at a high acceleration rate of 10, UnrollINR achieves superior reconstruction performance compared to the supervised learning method, validating the superiority of the proposed method.
zh

[CV-69] AIM 2025 Challenge on Real-World RAW Image Denoising

【速读】: This paper introduces the AIM 2025 Real-World RAW Image Denoising Challenge, which targets low-light RAW denoising with models trained only on synthetic data yet robust across camera models. The key lies in a newly established benchmark of challenging low-light RAW images captured in the wild with five different DSLR cameras, with participants encouraged to develop novel noise synthesis pipelines, network architectures, and training methodologies; rankings combine full-reference metrics (PSNR, SSIM, LPIPS) with non-reference ones (ARNIQA, TOPIQ), pushing camera-agnostic low-light RAW denoising toward practical use in domains from image restoration to night-time autonomous driving.

链接: https://arxiv.org/abs/2510.06601
作者: Feiran Li,Jiacheng Li,Marcos V. Conde,Beril Besbinar,Vlad Hosu,Daisuke Iso,Radu Timofte
机构: Sony Research(索尼研究); University of Würzburg, Computer Vision Lab(维尔茨堡大学计算机视觉实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce the AIM 2025 Real-World RAW Image Denoising Challenge, aiming to advance efficient and effective denoising techniques grounded in data synthesis. The competition is built upon a newly established evaluation benchmark featuring challenging low-light noisy images captured in the wild using five different DSLR cameras. Participants are tasked with developing novel noise synthesis pipelines, network architectures, and training methodologies to achieve high performance across different camera models. Winners are determined based on a combination of performance metrics, including full-reference measures (PSNR, SSIM, LPIPS), and non-reference ones (ARNIQA, TOPIQ). By pushing the boundaries of camera-agnostic low-light RAW image denoising trained on synthetic data, the competition promotes the development of robust and practical models aligned with the rapid progress in digital photography. We expect the competition outcomes to influence multiple domains, from image restoration to night-time autonomous driving.
zh

[CV-70] SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

【速读】: This paper addresses the difficulty of evaluating synthetic data quality, especially for object detection under resource constraints, where existing practice judges data quality through model metrics such as mAP only after training converges, which is costly and inefficient. The key of the solution is the Synthetic Dataset Quality Metric (SDQM), which assesses dataset quality without training a model to convergence: it correlates strongly with the mAP of YOLOv11, a leading detector, where previous metrics show only moderate or weak correlations, and it yields actionable insights for improving dataset quality, enabling far more efficient generation and selection of synthetic datasets and setting a new standard for synthetic data evaluation in object detection.

链接: https://arxiv.org/abs/2510.06596
作者: Ayush Zenith,Arnold Zumbrun,Neel Raut,Jing Lin
机构: Khoury College of Computer Sciences, Northeastern University (东北大学计算机科学学院); School of Computing, Binghamton University (宾汉姆顿大学计算机学院); Mission Applications & Infrastructure Section, Air Force Research Laboratory (美国空军研究实验室任务应用与基础设施部分)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at this https URL
zh

[CV-71] Adaptive Stain Normalization for Cross-Domain Medical Histology MICCAI2025

【速读】: This paper addresses color inconsistency in digital pathology caused by differing staining protocols and imaging conditions, which induces domain shift and substantially degrades deployed deep learning models. The key of the solution is a trainable color normalization model, grounded in the Beer-Lambert law of the imaging process, whose architecture is derived by algorithmically unrolling the optimization of a nonnegative matrix factorization (NMF) model; it extracts stain-invariant structural information from the original pathology images as input for downstream tasks such as object detection and classification, avoiding the artifacts and template-image dependence of traditional normalization methods and achieving more robust cross-domain generalization. On public pathology datasets and an internally curated collection of malaria blood smears, it outperforms many state-of-the-art stain normalization methods.
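
The physics behind the unrolled architecture is standard Beer-Lambert optical density plus NMF. The sketch below shows the plain iterative multiplicative updates that the paper unrolls into network layers; initialization, iteration count, and the two-stain assumption are illustrative:

```python
import numpy as np

def rgb_to_od(rgb, eps=1e-6):
    """Beer-Lambert: optical density OD = -log(I / I0), with I0 = 255."""
    return -np.log(np.clip(rgb.astype(float), 1, 255) / 255.0 + eps)

def nmf_stains(od, n_stains=2, iters=200, seed=0):
    """Multiplicative-update NMF, OD ~ W @ H: W = stain color basis,
    H = per-pixel stain concentrations.

    od: (P, 3) optical-density pixels.
    """
    rng = np.random.default_rng(seed)
    V = od.T                                   # (3, P)
    W = rng.random((3, n_stains)) + 0.1
    H = rng.random((n_stains, V.shape[1])) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-8)  # update concentrations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-8)  # update stain basis
    return W, H                                # H carries stain-invariant structure

pixels = rgb_to_od(np.random.randint(50, 255, (1024, 3)))
W, H = nmf_stains(pixels)
print(W.shape, H.shape)                        # (3, 2) (2, 1024)
```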

链接: https://arxiv.org/abs/2510.06592
作者: Tianyue Xu,Yanlin Wu,Abhai K. Tripathi,Matthew M. Ippolito,Benjamin D. Haeffele
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 28th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2025)

点击查看摘要

Abstract:Deep learning advances have revolutionized automated digital pathology analysis. However, differences in staining protocols and imaging conditions can introduce significant color variability. In deep learning, such color inconsistency often reduces performance when deploying models on data acquired under different conditions from the training data, a challenge known as domain shift. Many existing methods attempt to address this problem via color normalization but suffer from several notable drawbacks such as introducing artifacts or requiring careful choice of a template image for stain mapping. To address these limitations, we propose a trainable color normalization model that can be integrated with any backbone network for downstream tasks such as object detection and classification. Based on the physics of the imaging process per the Beer-Lambert law, our model architecture is derived via algorithmic unrolling of a nonnegative matrix factorization (NMF) model to extract stain-invariant structural information from the original pathology images, which serves as input for further processing. Experimentally, we evaluate the method on publicly available pathology datasets and an internally curated collection of malaria blood smears for cross-domain object detection and classification, where our method outperforms many state-of-the-art stain normalization methods. Our code is available at this https URL.
zh

[CV-72] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

【速读】: This paper addresses a core challenge in unifying visual understanding and generation under the autoregressive paradigm: existing visual tokenizers with discrete latent spaces suffer quantization errors that limit semantic expressiveness and weaken vision-language understanding. The key of the solution is MingTok, a family of visual tokenizers with a continuous latent space using a three-stage sequential architecture, low-level encoding, semantic expansion, and visual reconstruction, to reconcile the discriminative high-dimensional features favored by understanding with the compact low-level codes favored by generation. Built on top of it, Ming-UniVision removes the need for task-specific visual representations and formulates both understanding and generation as next-token prediction in a shared continuous space, enabling unified multi-round, in-context autoregressive processing such as iterative understanding, generation, and editing; experiments show this reconciles the competing demands of the two task families and yields state-of-the-art-level performance in both domains.

链接: https://arxiv.org/abs/2510.06590
作者: Ziyuan Huang,DanDan Zheng,Cheng Zou,Rui Liu,Xiaolong Wang,Kaixiang Ji,Weilong Chai,Jianxin Sun,Libin Wang,Yongjie Lv,Taozhi Huang,Jiajia Liu,Qingpei Guo,Ming Yang,Jingdong Chen,Jun Zhou
机构: Inclusion AI; Ant Group
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code released at this https URL

点击查看摘要

Abstract:Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit community.
zh

[CV-73] Improving Artifact Robustness for CT Deep Learning Models Without Labeled Artifact Images via Domain Adaptation

【速读】:该论文试图解决深度学习模型在医学影像中因分布偏移(distribution shift)导致性能下降的问题,特别是当CT图像引入训练数据中未见的新伪影(如探测器增益误差引起的环状伪影)时,模型分类准确率显著降低。解决方案的关键在于采用域自适应(domain adaptation)技术,具体使用域对抗神经网络(Domain Adversarial Neural Networks, DANN),通过在训练过程中引入未标注的伪影图像,使模型学习到与源域(干净图像)无关的特征表示,从而在无需额外专家标注新伪影数据的情况下,保持对未知伪影域的高分类准确性。实验表明,DANN方法不仅在环状伪影测试集上达到与显式标注伪影数据训练模型相当的性能,还展现出对均匀噪声的意外泛化能力,验证了域自适应在临床场景中应对新型伪影的有效性。

链接: https://arxiv.org/abs/2510.06584
作者: Justin Cheung,Samuel Savine,Calvin Nguyen,Lin Lu,Alhassan S. Yasin
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
备注: 8 pages, 12 figures, 1 table

点击查看摘要

Abstract:Deep learning models which perform well on images from their training distribution can degrade substantially when applied to new distributions. If a CT scanner introduces a new artifact not present in the training labels, the model may misclassify the images. Although modern CT scanners include design features which mitigate these artifacts, unanticipated or difficult-to-mitigate artifacts can still appear in practice. The direct solution of labeling images from this new distribution can be costly. As a more accessible alternative, this study evaluates domain adaptation as an approach for training models that maintain classification performance despite new artifacts, even without corresponding labels. We simulate ring artifacts from detector gain error in sinogram space and evaluate domain adversarial neural networks (DANN) against baseline and augmentation-based approaches on the OrganAMNIST abdominal CT dataset. Our results demonstrate that baseline models trained only on clean images fail to generalize to images with ring artifacts, and traditional augmentation with other distortion types provides no improvement on unseen artifact domains. In contrast, the DANN approach successfully maintains high classification accuracy on ring artifact images using only unlabeled artifact data during training, demonstrating the viability of domain adaptation for artifact robustness. The domain-adapted model achieved classification performance on ring artifact test data comparable to models explicitly trained with labeled artifact images, while also showing unexpected generalization to uniform noise. These findings provide empirical evidence that domain adaptation can effectively address distribution shift in medical imaging without requiring expensive expert labeling of new artifact distributions, suggesting promise for deployment in clinical settings where novel artifacts may emerge.
zh
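DANN 的核心机制是梯度反转层(Gradient Reversal Layer):前向为恒等映射,反向将梯度乘以负系数,使特征提取器学到域不可分的表示。下面是一个基于 PyTorch 的最小示意(网络结构与超参数为示例设定,并非论文官方实现):

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DANN(nn.Module):
    def __init__(self, feat_dim=128, n_classes=11, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(16 * 16, feat_dim), nn.ReLU(),
        )
        self.label_head = nn.Linear(feat_dim, n_classes)   # organ classes
        self.domain_head = nn.Linear(feat_dim, 2)          # clean vs. artifact domain

    def forward(self, x):
        f = self.features(x)
        y = self.label_head(f)
        d = self.domain_head(GradientReversal.apply(f, self.lambd))
        return y, d
```

训练时,分类损失仅在有标签的干净图像上计算,而域判别损失同时使用干净图像与未标注的伪影图像,从而在不需要伪影标签的情况下获得对新伪影域的鲁棒性。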

[CV-74] Through the Perspective of LiDAR: A Feature-Enriched and Uncertainty-Aware Annotation Pipeline for Terrestrial Point Cloud Segmentation

【速读】:该论文旨在解决地面激光扫描(TLS)点云语义分割中因人工标注成本高昂而导致的准确率受限问题。其核心解决方案是一个半自动化、不确定性感知的处理流程,关键在于:首先将3D点云投影至2D球面网格并融合多源特征,进而训练集成学习模型生成伪标签与不确定性图,后者用于引导对模糊区域的重点标注;随后通过反投影将2D结果映射回3D空间,并借助三层次可视化工具(2D特征图、3D彩色点云及紧凑虚拟球体)实现快速筛查与标注指导。该方法显著降低了标注工作量,同时保持高精度(mIoU达0.76),并通过构建Mangrove3D数据集和跨数据集验证,为生态监测等场景提供了可扩展、高质量的TLS点云分割方案。

链接: https://arxiv.org/abs/2510.06582
作者: Fei Zhang,Rob Chancia,Josie Clapp,Amirhossein Hassanzadeh,Dimah Dera,Richard MacKenzie,Jan van Aardt
机构: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, NY, USA; U.S. Forest Service, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi-automated, uncertainty-aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi-source features, and trains an ensemble of segmentation networks to produce pseudo-labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back-projected to 3D, yielding densely annotated point clouds supported by a three-tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine-channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature-enrichment strategy through cross-dataset tests on ForestSemantic and Semantic3D. Our contributions include: (i) a robust, uncertainty-aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high-quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at this https URL.
zh
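论文流程的第一步是把 3D 点云投影到二维球面网格。下面是该投影步骤的一个 numpy 最小示意(假设扫描仪位于原点,每个网格单元保留最近点的特征;分辨率与特征维度均为示例设定):

```python
import numpy as np

def spherical_projection(points, features, H=512, W=2048):
    """Project 3D points (N, 3) with per-point features onto an HxW spherical grid.

    Simplified sketch of the paper's projection step: keeps the nearest
    point per cell, scanner assumed at the origin.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-9
    azimuth = np.arctan2(y, x)                     # [-pi, pi]
    elevation = np.arcsin(z / r)                   # [-pi/2, pi/2]

    u = ((azimuth + np.pi) / (2 * np.pi) * W).astype(int) % W
    v = ((1.0 - (elevation + np.pi / 2) / np.pi) * H).astype(int).clip(0, H - 1)

    grid = np.zeros((H, W, features.shape[1]), dtype=np.float32)
    depth = np.full((H, W), np.inf)
    for i in range(points.shape[0]):               # nearest point per cell wins
        if r[i] < depth[v[i], u[i]]:
            depth[v[i], u[i]] = r[i]
            grid[v[i], u[i]] = features[i]
    return grid, depth
```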

[CV-75] HSNet: Heterogeneous Subgraph Network for Single Image Super-resolution

【速读】:该论文旨在解决现有图像超分辨率(Image Super-Resolution, ISR)方法中深度学习模型结构僵化与图模型计算复杂度高的问题。具体而言,基于卷积神经网络(CNN)和注意力机制的方法在建模复杂空间关系时缺乏灵活性,而纯图结构方法虽具更强表达能力却常因计算开销过大难以实用。其解决方案的关键在于提出异质子图网络(Heterogeneous Subgraph Network, HSNet),通过将全局图分解为可管理的子图组件实现高效建模:首先设计构造性子图集模块(Constructive Subgraph Set Block, CSSB)生成多样互补的子图以捕获局部与全局的异质特征;其次利用子图聚合模块(Subgraph Aggregation Block, SAB)自适应融合多图特征,构建判别性强的综合表示;并引入节点采样策略(Node Sampling Strategy, NSS)保留关键特征,在提升精度的同时降低计算负担。

链接: https://arxiv.org/abs/2510.06564
作者: Qiongyang Hu,Wenyang Liu,Wenbin Zou,Yuejiao Su,Lap-Pui Chau,Yi Wang
机构: The Hong Kong Polytechnic University (香港理工大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing deep learning approaches for image super-resolution, particularly those based on CNNs and attention mechanisms, often suffer from structural inflexibility. Although graph-based methods offer greater representational adaptability, they are frequently impeded by excessive computational complexity. To overcome these limitations, this paper proposes the Heterogeneous Subgraph Network (HSNet), a novel framework that efficiently leverages graph modeling while maintaining computational feasibility. The core idea of HSNet is to decompose the global graph into manageable sub-components. First, we introduce the Constructive Subgraph Set Block (CSSB), which generates a diverse set of complementary subgraphs. Rather than relying on a single monolithic graph, CSSB captures heterogeneous characteristics of the image by modeling different relational patterns and feature interactions, producing a rich ensemble of both local and global graph structures. Subsequently, the Subgraph Aggregation Block (SAB) integrates the representations embedded across these subgraphs. Through adaptive weighting and fusion of multi-graph features, SAB constructs a comprehensive and discriminative representation that captures intricate interdependencies. Furthermore, a Node Sampling Strategy (NSS) is designed to selectively retain the most salient features, thereby enhancing accuracy while reducing computational overhead. Extensive experiments demonstrate that HSNet achieves state-of-the-art performance, effectively balancing reconstruction quality with computational efficiency. The code will be made publicly available.
zh

[CV-76] Cluster Paths: Navigating Interpretability in Neural Networks

【速读】:该论文旨在解决深度神经网络在视觉任务中表现优异但决策过程缺乏可解释性的问题,这可能导致不恰当的信任、未被发现的偏见以及意外失败。其解决方案的关键在于提出“聚类路径”(cluster paths)这一后处理可解释性方法:通过在选定层对激活值进行聚类,并将每个输入表示为其对应的聚类ID序列,从而生成简洁且人类可读的解释。该方法不仅能够揭示模型在不同网络深度所依赖的视觉概念(如颜色调色板、纹理或物体上下文),还能通过四个量化指标评估其认知负荷、类别对齐度、预测保真度及扰动稳定性,最终实现对分布外样本的有效检测,且不影响模型准确性。

链接: https://arxiv.org/abs/2510.06541
作者: Nicholas M. Kroeger,Vincent Bindschaedler
机构: University of Florida (佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While modern deep neural networks achieve impressive performance in vision tasks, they remain opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures. We propose cluster paths, a post-hoc interpretability method that clusters activations at selected layers and represents each input as its sequence of cluster IDs. To assess these cluster paths, we introduce four metrics: path complexity (cognitive load), weighted-path purity (class alignment), decision-alignment faithfulness (predictive fidelity), and path agreement (stability under perturbations). In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts and collapse when the cue is removed. On a five-class CelebA hair-color task, they achieve 90% faithfulness and maintain 96% agreement under Gaussian noise without sacrificing accuracy. Scaling to a Vision Transformer pretrained on ImageNet, we extend cluster paths to concept paths derived from prompting a large language model on minimal path divergences. Finally, we show that cluster paths can serve as an effective out-of-distribution (OOD) detector, reliably flagging anomalous samples before the model generates over-confident predictions. Cluster paths uncover visual concepts, such as color palettes, textures, or object contexts, at multiple network depths, demonstrating that cluster paths scale to large vision models while generating concise and human-readable explanations.
zh
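cluster paths 的构造可以概括为:在若干选定层上对激活做聚类,并把每个输入表示为各层聚类 ID 的序列。以下为 sklearn 最小示意(假设各层激活已抽取为矩阵,聚类数等为示例设定),并附 path agreement 指标的一个简化计算:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_cluster_paths(layer_acts, k=8, seed=0):
    """layer_acts: list of (N, d_l) activation matrices from selected layers."""
    models = [KMeans(n_clusters=k, random_state=seed, n_init=10).fit(a)
              for a in layer_acts]
    paths = np.stack([m.labels_ for m in models], axis=1)  # (N, n_layers)
    return models, paths

def path_agreement(paths_a, paths_b):
    """Fraction of layer-wise cluster IDs unchanged under perturbation."""
    return float((paths_a == paths_b).mean())

# Toy demo: 100 inputs, activations taken from 3 layers
rng = np.random.default_rng(0)
acts = [rng.normal(size=(100, d)) for d in (64, 128, 256)]
models, paths = fit_cluster_paths(acts)
print(paths[0])   # e.g. [3 1 5] -- this input's cluster path
```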

[CV-77] VUGEN: Visual Understanding priors for GENeration

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在具备强大图文理解能力的同时,难以高效且高质量地实现图像生成的问题。现有方法通常依赖于重建导向的自编码器(autoencoders)或复杂的桥梁机制,导致理解与生成表征之间存在偏差或架构复杂度高。其解决方案的关键在于提出VUGEN框架,通过将VLM预训练视觉编码器的高维潜在空间映射到低维、可处理的分布以最大化保留视觉信息,并在此基础上训练VLM直接在该缩减空间中采样,从而确保生成过程与原始视觉理解能力对齐;同时引入无需变分自编码器(VAE-free)的像素级扩散解码器,显著简化结构并提升生成质量,在保持原有理解性能的前提下实现了更优的图像生成效果。

链接: https://arxiv.org/abs/2510.06529
作者: Xiangyi Chen,Théophane Vallaeys,Maha Elbayad,John Nguyen,Jakob Verbeek
机构: Meta Fundamental AI Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages VLM’s pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM’s native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find a VAE-free pixel diffusion decoder to be on par with or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM’s original understanding capabilities.
zh

[CV-78] Real-Time Glass Detection and Reprojection using Sensor Fusion Onboard Aerial Robots ICRA2026

【速读】:该论文旨在解决透明障碍物(如玻璃)对自主飞行机器人导航与建图带来的挑战,因其缺乏明显特征且会导致传统深度传感器失效,从而引发地图误差和碰撞风险。解决方案的关键在于提出一种计算高效的融合框架,通过将飞行器上的飞行时间(Time-of-Flight, ToF)相机与超声波传感器数据相结合,并利用一个轻量级二维卷积模型检测镜面反射并将其深度信息传播至深度图中的空区域,使透明障碍物在感知层面“可见”。该方法可在嵌入式处理器上实时运行,仅占用极小的CPU资源,是首个在低SWaP(尺寸、重量与功耗)四旋翼飞行器上实现基于纯CPU的实时透明障碍物建图系统的工作。

链接: https://arxiv.org/abs/2510.06518
作者: Malakhi Hopkins,Varun Murali,Vijay Kumar,Camillo J Taylor
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 8 pages, 8 figures, submitted to ICRA 2026

点击查看摘要

Abstract:Autonomous aerial robots are increasingly being deployed in real-world scenarios, where transparent obstacles present significant challenges to reliable navigation and mapping. These materials pose a unique problem for traditional perception systems because they lack discernible features and can cause conventional depth sensors to fail, leading to inaccurate maps and potential collisions. To ensure safe navigation, robots must be able to accurately detect and map these transparent obstacles. Existing methods often rely on large, expensive sensors or algorithms that impose high computational burdens, making them unsuitable for low Size, Weight, and Power (SWaP) robots. In this work, we propose a novel and computationally efficient framework for detecting and mapping transparent obstacles onboard a sub-300g quadrotor. Our method fuses data from a Time-of-Flight (ToF) camera and an ultrasonic sensor with a custom, lightweight 2D convolution model. This specialized approach accurately detects specular reflections and propagates their depth into corresponding empty regions of the depth map, effectively rendering transparent obstacles visible. The entire pipeline operates in real-time, utilizing only a small fraction of a CPU core on an embedded processor. We validate our system through a series of experiments in both controlled and real-world environments, demonstrating the utility of our method through experiments where the robot maps indoor environments containing glass. Our work is, to our knowledge, the first of its kind to demonstrate a real-time, onboard transparent obstacle mapping system on a low-SWaP quadrotor using only the CPU.
zh

[CV-79] Limited-Angle Tomography Reconstruction via Projector Guided 3D Diffusion

【速读】:该论文旨在解决有限角度电子断层扫描(Limited-angle electron tomography)中因缺失楔形问题(missing-wedge problem)导致的重建伪影严重、图像质量下降的问题。现有深度学习方法虽能缓解此类伪影,但通常依赖大量带有已知三维真实结构(3D ground truth)的高质量训练数据,而这类数据在电子显微镜领域难以获取。论文提出了一种基于扩散模型的三维迭代重建框架 TEMDiff,其关键在于利用易于获得的 FIB-SEM(聚焦离子束-扫描电子显微镜)三维数据作为训练基础,并通过一个模拟器将这些数据映射为虚拟的 TEM 倾斜系列(tilt series),从而让模型在无需清洁 TEM 真实标签的情况下学习到真实的结构先验信息。该方法直接在三维体积上操作,隐式地保证了切片间的一致性,无需额外正则化项,在低角度覆盖(如仅8度倾斜范围)下仍可实现高精度重建,且具备良好的泛化能力,无需再训练或微调即可应用于不同实验条件下的真实TEM数据。

链接: https://arxiv.org/abs/2510.06516
作者: Zhantao Deng,Mériem Er-Rafik,Anna Sushko,Cécile Hébert,Pascal Fua
机构: École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 11 figures

点击查看摘要

Abstract:Limited-angle electron tomography aims to reconstruct 3D shapes from 2D projections of Transmission Electron Microscopy (TEM) within a restricted range and number of tilting angles, but it suffers from the missing-wedge problem that causes severe reconstruction artifacts. Deep learning approaches have shown promising results in alleviating these artifacts, yet they typically require large high-quality training datasets with known 3D ground truth which are difficult to obtain in electron microscopy. To address these challenges, we propose TEMDiff, a novel 3D diffusion-based iterative reconstruction framework. Our method is trained on readily available volumetric FIB-SEM data using a simulator that maps them to TEM tilt series, enabling the model to learn realistic structural priors without requiring clean TEM ground truth. By operating directly on 3D volumes, TEMDiff implicitly enforces consistency across slices without the need for additional regularization. On simulated electron tomography datasets with limited angular coverage, TEMDiff outperforms state-of-the-art methods in reconstruction quality. We further demonstrate that a trained TEMDiff model generalizes well to real-world TEM tilts obtained under different conditions and can recover accurate structures from tilt ranges as narrow as 8 degrees, with 2-degree increments, without any retraining or fine-tuning.
zh

[CV-80] LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval

【速读】:该论文旨在解决如何将局部属性得分(如视频中检测到的物体或音频中识别出的情绪)有效提升为序列层面的时序属性得分问题,即为时序属性赋分(Scores for TempOral Properties, STOPs),以支持下游任务如时序查询匹配和基于时序逻辑的排序检索。其核心挑战在于处理来自潜在噪声的局部预测器所输出的分数,并准确建模跨时间步的时序依赖关系。解决方案的关键是提出了一种名为 LogSTOP 的评分函数,它能够高效计算由线性时序逻辑(Linear Temporal Logic, LTL)表示的时序属性得分,通过结合局部检测模型(如 YOLO、HuBERT、Grounding DINO 和 SlowR50)与 LTL 语义的对数空间优化策略,在多个基准任务上显著优于大型视觉/语音语言模型及其他基于时序逻辑的基线方法。

链接: https://arxiv.org/abs/2510.06512
作者: Avishree Khare,Hideki Okamoto,Bardh Hoxha,Georgios Fainekos,Rajeev Alur
机构: University of Pennsylvania (宾夕法尼亚大学); Toyota (丰田)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural models such as YOLO and HuBERT can be used to detect local properties such as objects (“car”) and emotions (“angry”) in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., “does the speaker eventually sound happy in this audio clip?”), and ranked retrieval (e.g., “retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected”). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
zh
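把逐帧检测分数提升为时序属性得分的思路,可以用常见的 min/max 量化(鲁棒性)语义做一个概念性示意。注意:这只是该类语义的标准写法,并非 LogSTOP 论文中具体的评分函数:

```python
import numpy as np

def eventually(p):            # F p : p holds at some point
    return float(np.max(p))

def always(p):                # G p : p holds at every point
    return float(np.min(p))

def until(p, q):              # p U q : p holds until q becomes true
    best = 0.0
    prefix_min = 1.0          # min of p over the (initially empty) prefix
    for t in range(len(q)):
        best = max(best, min(q[t], prefix_min))
        prefix_min = min(prefix_min, p[t])
    return best

# Per-frame scores from local detectors, e.g. "car" and "pedestrian"
car = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
ped = np.array([0.0, 0.1, 0.3, 0.9, 0.8])
print(until(car, ped))  # 0.7: "car is detected until a pedestrian is detected"
```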

[CV-81] From Captions to Keyframes: Efficient Video Summarization via Caption- and Context-Aware Frame Scoring

【速读】:该论文旨在解决长视频中语义信息冗余导致的计算效率低下问题,即如何从长时间视频中高效选取少量关键帧以保留语义和上下文信息,从而提升视频-语言理解任务的性能与可扩展性。其解决方案的关键在于提出KeyScore框架,该框架通过联合利用字幕(caption)与视觉上下文信息,从语义相似度、时间多样性以及上下文损失影响三个维度对帧进行评分,从而精准识别出对下游任务(如检索、字幕生成和视频-语言推理)最具信息量的帧;同时引入STACFP(Spatio-Temporal Adaptive Clustering for Frame Proposals)模块生成紧凑且多样化的候选帧集,二者协同实现高达99%的帧数压缩率,并显著优于传统8帧编码器在MSRVTT、MSVD和DiDeMo数据集上的表现,证明了多模态对齐在无需显式视频摘要的情况下即可实现高效、可扩展的视频理解。

链接: https://arxiv.org/abs/2510.06509
作者: Shih-Yao Lin,Sibendu Paul,Caren Chen
机构: Amazon Prime Video (亚马逊Prime视频)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Efficient video-language understanding requires selecting a small set of frames that retain semantic and contextual information from long videos. We propose KeyScore, a multimodal frame scoring framework that jointly leverages captions and visual context to estimate frame-level importance. By combining semantic similarity, temporal diversity, and contextual drop impact, KeyScore identifies the most informative frames for downstream tasks such as retrieval, captioning, and video-language reasoning. To complement KeyScore, we introduce STACFP (Spatio-Temporal Adaptive Clustering for Frame Proposals), which generates compact and diverse frame candidates for long-form videos. Together, these modules achieve up to 99% frame reduction compared to full-frame inference and substantially outperform standard 8-frame encoders on MSRVTT, MSVD, and DiDeMo. Our results demonstrate that emphasizing multimodal alignment between visual and textual signals enables scalable, efficient, and caption-grounded video understanding – without explicit video summarization.
zh
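KeyScore 将语义相似度、时间多样性与上下文丢弃影响组合为帧级得分。以下是一个假设性的加权组合草图(三项的具体计算方式与权重均为示例设定,并非论文原始公式):

```python
import numpy as np

def keyscore_like(frame_emb, caption_emb, w=(0.5, 0.3, 0.2)):
    """frame_emb: (T, d) L2-normalized frame features; caption_emb: (d,) normalized."""
    semantic = frame_emb @ caption_emb                       # caption similarity
    # temporal diversity: 1 - similarity to the previous frame
    prev_sim = np.sum(frame_emb[1:] * frame_emb[:-1], axis=1)
    diversity = np.concatenate([[1.0], 1.0 - prev_sim])
    # context drop impact: shift of the video-level mean when frame t is removed
    mean_all = frame_emb.mean(axis=0)
    T = len(frame_emb)
    drop = np.array([np.linalg.norm(mean_all - (mean_all * T - f) / (T - 1))
                     for f in frame_emb])
    drop = drop / (drop.max() + 1e-9)
    return w[0] * semantic + w[1] * diversity + w[2] * drop

rng = np.random.default_rng(0)
f = rng.normal(size=(16, 64)); f /= np.linalg.norm(f, axis=1, keepdims=True)
c = rng.normal(size=64); c /= np.linalg.norm(c)
top_frames = np.argsort(keyscore_like(f, c))[::-1][:4]  # keep 4 best-scoring frames
```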

[CV-82] Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

【速读】:该论文旨在解决从文本中建模人类间交互(Human-Human Interaction, HHIs)的两大挑战:一是双人交互训练数据有限,难以覆盖多样化的人类互动模式;二是文本到交互的建模粒度不足,现有方法将复杂的语言提示压缩为单一句子嵌入,导致交互细节丢失。解决方案的关键在于提出Text2Interact框架,其核心由两个模块构成:一是InterCompose,一种基于合成组合的可扩展数据生成管道,通过结合大语言模型(LLM)生成的交互描述与强单人动作先验,实现无需额外采集即可扩展交互覆盖范围;二是InterActor,一种具备词级条件控制的文本到交互模型,保留token级别的语义线索(如发起、响应、接触顺序),并引入自适应交互损失函数以强化上下文相关的关节对耦合,从而提升交互的时空一致性与物理合理性。

链接: https://arxiv.org/abs/2510.06504
作者: Qingxuan Wu,Zhiyang Dou,Chuan Guo,Yiming Huang,Qiao Feng,Bing Zhou,Jian Wang,Lingjie Liu
机构: University of Pennsylvania (宾夕法尼亚大学); The University of Hong Kong (香港大学); Snap Inc (Snap公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Currently, progress is hindered by 1) limited two-person training data, inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose our Text2Interact framework, designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for an agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for another agent, and uses a neural motion evaluator to filter weak or misaligned samples-expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.
zh

[CV-83] Superpixel Integrated Grids for Fast Image Segmentation

【速读】:该论文旨在解决超像素(superpixel)在深度学习图像分割任务中因不规则空间分布而导致的计算效率低下问题,这一特性迫使模型依赖特殊训练算法和架构,削弱了超像素原本用于简化数据、提升处理效率的设计初衷。解决方案的关键在于提出一种新的数据结构SIGRID(Superpixel-Integrated Grid),通过引入经典形状描述符对超像素的颜色与形状信息进行编码,在大幅降低输入维度的同时保留关键语义特征,从而实现与像素级表示相当甚至更优的分割性能,并显著加速模型训练过程。

链接: https://arxiv.org/abs/2510.06487
作者: Jack Roberts,Jeova Farias Sales Rocha Neto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Superpixels have long been used in image simplification to enable more efficient data processing and storage. However, despite their computational potential, their irregular spatial distribution has often forced deep learning approaches to rely on specialized training algorithms and architectures, undermining the original motivation for superpixelations. In this work, we introduce a new superpixel-based data structure, SIGRID (Superpixel-Integrated Grid), as an alternative to full-resolution images in segmentation tasks. By leveraging classical shape descriptors, SIGRID encodes both color and shape information of superpixels while substantially reducing input dimensionality. We evaluate SIGRIDs on four benchmark datasets using two popular convolutional segmentation architectures. Our results show that, despite compressing the original data, SIGRIDs not only match but in some cases surpass the performance of pixel-level representations, all while significantly accelerating model training. This demonstrates that SIGRIDs achieve a favorable balance between accuracy and computational efficiency.
zh

[CV-84] Active Next-Best-View Optimization for Risk-Averse Path Planning

【速读】:该论文旨在解决在不确定环境中实现安全导航的问题,核心挑战在于如何将风险规避与主动感知(active perception)有效融合,以生成局部安全且可行的轨迹。解决方案的关键在于提出一个统一框架:首先利用在线更新的三维高斯泼溅辐射场(3D Gaussian-splat Radiance Field)计算平均风险价值(Average Value-at-Risk, AVaR)统计量,构建尾部敏感的风险图(tail-sensitive risk maps),从而对粗略参考路径进行风险规避优化;同时,将下一最佳视角(Next-Best-View, NBV)选择建模为SE(3)位姿流形上的优化问题,通过黎曼梯度下降最大化期望信息增益目标,显著降低对近期运动最关键的不确定性。该方法通过引入可扩展的梯度分解策略,支持复杂环境中的高效在线更新,实现了风险规避路径精化与NBV规划的协同推进,显著优于现有方法。

链接: https://arxiv.org/abs/2510.06481
作者: Amirhossein Mollaei Khass,Guangyi Liu,Vivek Pandey,Wen Jiang,Boshu Lei,Kostas Daniilidis,Nader Motee
机构: Lehigh University (莱赫igh大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Safe navigation in uncertain environments requires planning methods that integrate risk aversion with active perception. In this work, we present a unified framework that refines a coarse reference path by constructing tail-sensitive risk maps from Average Value-at-Risk statistics on an online-updated 3D Gaussian-splat Radiance Field. These maps enable the generation of locally safe and feasible trajectories. In parallel, we formulate Next-Best-View (NBV) selection as an optimization problem on the SE(3) pose manifold, where Riemannian gradient descent maximizes an expected information gain objective to reduce uncertainty most critical for imminent motion. Our approach advances the state-of-the-art by coupling risk-averse path refinement with NBV planning, while introducing scalable gradient decompositions that support efficient online updates in complex environments. We demonstrate the effectiveness of the proposed framework through extensive computational studies.
zh
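其中的平均风险价值(Average Value-at-Risk, AVaR,亦称 CVaR)在离散样本上有一个标准的经验估计:最坏 α 分位样本的均值。以下为最小 numpy 示意(采样分布仅为演示假设):

```python
import numpy as np

def average_value_at_risk(costs, alpha=0.1):
    """AVaR_alpha: mean of the worst alpha-fraction of sampled costs."""
    costs = np.sort(np.asarray(costs))[::-1]   # descending: worst costs first
    k = max(1, int(np.ceil(alpha * len(costs))))
    return float(costs[:k].mean())

# Collision costs sampled per cell from an occupancy model (toy distribution)
samples = np.random.default_rng(0).gamma(2.0, 0.5, size=1000)
print(average_value_at_risk(samples, alpha=0.05))  # tail-sensitive risk value
```

与均值不同,AVaR 只看分布尾部,因此由它构成的风险图对低概率但高后果的碰撞事件更敏感,这正是“tail-sensitive”一词的含义。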

[CV-85] SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation

【速读】:该论文旨在解决多身份保持的图像生成问题,即在单次生成过程中同时保留多个主体的身份特征,并在结构和空间约束下实现高质量图像合成。此前方法难以在单一框架中兼顾多主体的身份一致性与灵活的空间控制,且通常需要多阶段处理或依赖复杂的人工干预。解决方案的关键在于提出SIGMA-GEN框架,其创新性地实现了基于统一模型的单次多主体身份保持生成,支持从粗粒度(如2D/3D边界框)到细粒度(如像素级分割和深度图)的多层次用户引导;同时引入SIGMA-SET27K这一新型合成数据集,为超过10万唯一主体提供身份、结构和空间信息,从而显著提升生成质量与效率,实现在身份保真度、图像质量和生成速度上的最优表现。

链接: https://arxiv.org/abs/2510.06469
作者: Oindrila Saha,Vojtech Krs,Radomir Mech,Subhransu Maji,Kevin Blackburn-Matzen,Matheus Gadelha
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Webpage: this https URL

点击查看摘要

Abstract:We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision – from coarse 2D or 3D boxes to pixel-level segmentations and depth – with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed. Code and visualizations at this https URL
zh

[CV-86] TDiff: Thermal Plug-And-Play Prior with Patch-Based Diffusion

【速读】:该论文旨在解决低成本热成像相机获取的图像中存在的低分辨率、固定模式噪声及其他局部退化问题,同时应对现有热成像数据集在规模和多样性上的局限性。其解决方案的关键在于提出一种基于图像块(patch-based)的扩散框架(TDiff),通过在小尺寸热图像块上训练模型来利用退化的局部特性,并采用重叠块去噪与平滑空间加权融合策略实现全分辨率图像恢复。该方法首次在多个热图像恢复任务(去噪、超分辨率、去模糊)中建模了热图像的先验知识,构建了一个统一的修复流程,在模拟和真实热数据上均取得了优异性能。

链接: https://arxiv.org/abs/2510.06460
作者: Piyush Dashpute,Niki Nezakati,Wolfgang Heidrich,Vishwanath Saragadam
机构: University of California, Riverside (加州大学河滨分校); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Thermal images from low-cost cameras often suffer from low resolution, fixed pattern noise, and other localized degradations. Available datasets for thermal imaging are also limited in both size and diversity. To address these challenges, we propose a patch-based diffusion framework (TDiff) that leverages the local nature of these distortions by training on small thermal patches. In this approach, full-resolution images are restored by denoising overlapping patches and blending them using smooth spatial windowing. To our knowledge, this is the first patch-based diffusion framework that models a learned prior for thermal image restoration across multiple tasks. Experiments on denoising, super-resolution, and deblurring demonstrate strong results on both simulated and real thermal data, establishing our method as a unified restoration pipeline.
zh
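“重叠块去噪 + 平滑空间窗加权融合”这一重建步骤可以用如下 numpy 草图示意(用 Hann 窗加权,去噪器以恒等函数占位,块尺寸与步长为示例设定):

```python
import numpy as np

def blend_patches(image, denoise, patch=64, stride=32):
    """Denoise overlapping patches and blend them with a smooth 2D window."""
    H, W = image.shape
    win1d = np.hanning(patch) + 1e-3               # avoid zero weight at patch edges
    window = np.outer(win1d, win1d)
    out = np.zeros_like(image, dtype=np.float64)
    weight = np.zeros_like(out)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            p = denoise(image[i:i + patch, j:j + patch])
            out[i:i + patch, j:j + patch] += window * p
            weight[i:i + patch, j:j + patch] += window
    return out / np.maximum(weight, 1e-9)

img = np.random.rand(256, 256)
restored = blend_patches(img, denoise=lambda p: p)  # identity stands in for the diffusion denoiser
```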

[CV-87] Road Surface Condition Detection with Machine Learning using New York State Department of Transportation Camera Images and Weather Forecast Data

【速读】:该论文旨在解决冬季天气条件下道路状况评估的劳动密集型问题,即纽约州交通部(NYSDOT)依赖人工巡检和实时摄像头观察来判断路面状态,难以高效支持决策。解决方案的关键在于利用机器学习模型自动分类道路表面状况:研究采用卷积神经网络(Convolutional Neural Networks, CNNs)和随机森林(Random Forests)模型,基于约22,000张人工标注的摄像头图像及气象数据进行训练,将路面状态划分为六类(严重积雪、积雪、湿滑、干燥、能见度差、遮挡),最终在未见过的摄像头场景下实现了81.5%的准确率,显著提升了模型的泛化能力以满足实际业务需求。

链接: https://arxiv.org/abs/2510.06440
作者: Carly Sutter,Kara J. Sulia,Nick P. Bassill,Christopher D. Wirz,Christopher D. Thorncroft,Jay C. Rothenberger,Vanessa Przybylo,Mariana G. Cains,Jacob Radford,David Aaron Evans
机构: University at Albany (纽约州立大学阿尔巴尼分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The New York State Department of Transportation (NYSDOT) has a network of roadside traffic cameras that are used by both the NYSDOT and the public to observe road conditions. The NYSDOT evaluates road conditions by driving on roads and observing live cameras, tasks which are labor-intensive but necessary for making critical operational decisions during winter weather events. However, machine learning models can provide additional support for the NYSDOT by automatically classifying current road conditions across the state. In this study, convolutional neural networks and random forests are trained on camera images and weather data to predict road surface conditions. Models are trained on a hand-labeled dataset of ~22,000 camera images, each classified by human labelers into one of six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed. Model generalizability is prioritized to meet the operational needs of the NYSDOT decision makers, and the weather-related road surface condition model in this study achieves an accuracy of 81.5% on completely unseen cameras.
zh

[CV-88] TransFIRA: Transfer Learning for Face Image Recognizability Assessment

【速读】:该论文旨在解决无约束环境下人脸图像可识别性评估(Face Image Recognizability Assessment, FIQA)的问题,即如何在姿态、模糊、光照和遮挡等极端变化条件下准确预测输入图像是否能被部署的特征编码器(encoder)有效识别。传统FIQA方法依赖视觉启发式规则、人工标注或计算密集型生成流程,其预测结果与编码器决策边界脱节。解决方案的关键在于提出TransFIRA框架,通过在嵌入空间(embedding space)中直接建模可识别性:首先定义基于类中心相似度(Class-Center Similarity, CCS)和类中心角分离度(Class-Center Angular Separation, CCAS)的可识别性准则,实现与决策边界对齐的过滤与加权机制;其次设计一种受可识别性启发的聚合策略,在无需外部标签、启发式规则或特定主干网络训练的情况下,显著提升验证准确率并增强与真实可识别性的相关性;最后扩展至人体识别等场景,提供编码器驱动的可解释性分析,揭示退化因素与个体特异性因素对可识别性的影响。

链接: https://arxiv.org/abs/2510.06353
作者: Allen Tu,Kartik Narayan,Joshua Gleason,Jennifer Xu,Matthew Meyn,Tom Goldstein,Vishal M. Patel
机构: Systems and Technology Research; University of Maryland, College Park; Johns Hopkins University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder’s decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary–aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment – encoder-specific, accurate, interpretable, and extensible across modalities – significantly advancing FIQA in accuracy, explainability, and scope.
zh
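类中心相似度(CCS)与类中心角分离度(CCAS)可以在单位化嵌入空间中直接计算。下面是一种可能的最小实现示意(此处将 CCAS 理解为最近冒充类角度与真类角度之差,具体定义以论文为准):

```python
import numpy as np

def ccs_ccas(embedding, centers, label):
    """Class-center similarity and angular separation for one embedding.

    embedding: (d,) L2-normalized; centers: (C, d) L2-normalized class centers.
    One plausible reading of the paper's criteria, not its official code.
    """
    sims = centers @ embedding                    # cosine similarity to every center
    ccs = sims[label]                             # similarity to the true class center
    impostor = np.max(np.delete(sims, label))     # closest non-match center
    ccas = np.degrees(np.arccos(np.clip(impostor, -1, 1))
                      - np.arccos(np.clip(ccs, -1, 1)))
    return ccs, ccas                              # ccas > 0: margin to the decision boundary

rng = np.random.default_rng(1)
centers = rng.normal(size=(5, 128)); centers /= np.linalg.norm(centers, axis=1, keepdims=True)
z = centers[2] + 0.3 * rng.normal(size=128); z /= np.linalg.norm(z)
print(ccs_ccas(z, centers, label=2))
```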

[CV-89] Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

【速读】:该论文旨在解决多模态生成与理解任务中现有统一模型在采样效率和任务泛化能力方面的局限性问题。传统方法多采用自回归(Autoregressive, AR)或AR-扩散混合范式,存在生成速度慢、难以高效处理多种模态输入输出等问题。其解决方案的关键在于提出Lumina-DiMOO——一个基于全离散扩散建模(fully discrete diffusion modeling)的开源基础模型,通过统一的离散表示框架实现跨模态输入与输出的高效建模,从而显著提升采样效率,并支持包括文本到图像生成、图像到图像转换(如图像编辑、主体驱动生成、图像修复等)及图像理解在内的多样化多模态任务。

链接: https://arxiv.org/abs/2510.06308
作者: Yi Xin,Qi Qin,Siqi Luo,Kaiwen Zhu,Juncheng Yan,Yan Tai,Jiayi Lei,Yuewen Cao,Keqi Wang,Yibin Wang,Jinbin Bai,Qian Yu,Dengyang Jiang,Yuandong Pu,Haoxing Chen,Le Zhuo,Junjun He,Gen Luo,Tianbin Li,Ming Hu,Jin Ye,Shenglong Ye,Bo Zhang,Chang Xu,Wenhai Wang,Hongsheng Li,Guangtao Zhai,Tianfan Xue,Bin Fu,Xiaohong Liu,Yu Qiao,Yihao Liu
机构: Shanghai AI Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); Nanjing University (南京大学); Stanford University (斯坦福大学); The University of Sydney (悉尼大学); Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 13 figures, 10 tables

点击查看摘要

Abstract:We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: this https URL.
zh

[CV-90] Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping

【速读】:该论文旨在解决全球森林结构复杂度(forest structural complexity)高分辨率、连续时空监测的难题。传统星载激光雷达(lidar)如GEDI虽能提供高质量的结构复杂度指标,但其稀疏采样限制了连续性和高分辨率制图能力。解决方案的关键在于构建一个可扩展的深度学习框架,融合GEDI观测与多模态合成孔径雷达(SAR)数据,利用改进的EfficientNetV2架构在超过1.3亿个GEDI足迹上训练,仅用不到40万参数即实现全球尺度(25米分辨率)的精确预测(全局R²=0.82),并具备校准的不确定性估计和跨生物群落与时间的稳定性,从而支持气候变化背景下森林生态系统的动态持续监测与管理。

链接: https://arxiv.org/abs/2510.06299
作者: Tiago de Conto,John Armston,Ralph Dubayah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Forest structural complexity metrics integrate multiple canopy attributes into a single value that reflects habitat quality and ecosystem function. Spaceborne lidar from the Global Ecosystem Dynamics Investigation (GEDI) has enabled mapping of structural complexity in temperate and tropical forests, but its sparse sampling limits continuous high-resolution mapping. We present a scalable, deep learning framework fusing GEDI observations with multimodal Synthetic Aperture Radar (SAR) datasets to produce global, high-resolution (25 m) wall-to-wall maps of forest structural complexity. Our adapted EfficientNetV2 architecture, trained on over 130 million GEDI footprints, achieves high performance (global R2 = 0.82) with fewer than 400,000 parameters, making it an accessible tool that enables researchers to process datasets at any scale without requiring specialized computing infrastructure. The model produces accurate predictions with calibrated uncertainty estimates across biomes and time periods, preserving fine-scale spatial patterns. It has been used to generate a global, multi-temporal dataset of forest structural complexity from 2015 to 2022. Through transfer learning, this framework can be extended to predict additional forest structural variables with minimal computational cost. This approach supports continuous, multi-temporal monitoring of global forest structural dynamics and provides tools for biodiversity conservation and ecosystem management efforts in a changing climate.
zh

[CV-91] RGBD Gaze Tracking Using Transformer for Feature Fusion

【速读】:该论文旨在解决基于RGBD图像的注视方向估计(Gaze Angle Estimation)问题,其核心挑战在于如何有效融合颜色(RGB)与深度(Depth)信息以提升模型精度。解决方案的关键在于设计一种基于Transformer架构的特征融合模块,用于整合多模态输入特征;同时,为弥补现有数据集缺乏深度信息或标签不适用于注视角度估计的问题,作者构建了一个新的训练数据集。实验表明,尽管引入Transformer模块在某些场景下未显著优于传统方法(如MLP),但通过移除预训练生成对抗网络(GAN)模块反而能显著降低误差(从55.3mm降至30.1mm),说明模型结构优化与数据适配对性能提升具有决定性作用。

链接: https://arxiv.org/abs/2510.06298
作者: Tobias J. Bauer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Master Thesis with 125 pages, 59 figures, 17 tables

点击查看摘要

Abstract:The subject of this thesis is the implementation of an AI-based Gaze Tracking system using RGBD images that contain both color (RGB) and depth (D) information. To fuse the features extracted from the images, a module based on the Transformer architecture is used. The combination of RGBD input images and Transformers was chosen because it has not yet been investigated. Furthermore, a new dataset is created for training the AI models as existing datasets either do not contain depth information or only contain labels for Gaze Point Estimation that are not suitable for the task of Gaze Angle Estimation. Various model configurations are trained, validated and evaluated on a total of three different datasets. The trained models are then to be used in a real-time pipeline to estimate the gaze direction and thus the gaze point of a person in front of a computer screen. The AI model architecture used in this thesis is based on an earlier work by Lian et al. It uses a Generative Adversarial Network (GAN) to simultaneously remove depth map artifacts and extract head pose features. Lian et al. achieve a mean Euclidean error of 38.7mm on their own dataset ShanghaiTechGaze+. In this thesis, a model architecture with a Transformer module for feature fusion achieves a mean Euclidean error of 55.3mm on the same dataset, but we show that using no pre-trained GAN module leads to a mean Euclidean error of 30.1mm. Replacing the Transformer module with a Multilayer Perceptron (MLP) improves the error to 26.9mm. These results are consistent with the ones on the other two datasets. On the ETH-XGaze dataset, the model with Transformer module achieves a mean angular error of 3.59° and without Transformer module 3.26°, whereas the fundamentally different model architecture used by the dataset authors Zhang et al. achieves a mean angular error of 2.04°. On the OTH-Gaze-Estimation dataset created for…
zh

[CV-92] Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling

【速读】:该论文旨在解决高分辨率(4K)图像到图像合成在资源受限设备上部署时面临的内存占用高和图像质量差的问题。现有扩散模型在移动端进行图像编辑时,难以兼顾计算效率与视觉质量。其解决方案的关键在于提出MobilePicasso系统,该系统包含三个阶段:首先在标准分辨率下进行图像编辑并引入幻觉感知损失(hallucination-aware loss)以减少错误生成;其次通过潜在空间投影(latent projection)避免直接进入像素空间,从而降低计算复杂度;最后采用自适应上下文保持分块上采样(adaptive context-preserving tiling)策略将低维潜在表示高效升频至高分辨率。这一设计显著提升了图像质量(提升18–48%)并减少了幻觉现象(减少14–51%),同时实现高达55.8倍的速度提升,且仅增加9%的运行时内存开销。

链接: https://arxiv.org/abs/2510.06295
作者: Young D. Kwon,Abhinav Mehrotra,Malcolm Chadwick,Alberto Gil Ramos,Sourav Bhattacharya
机构: Samsung AI Center-Cambridge(三星人工智能中心-剑桥)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Under review

点击查看摘要

Abstract:High-resolution (4K) image-to-image synthesis has become increasingly important for mobile applications. Existing diffusion models for image editing face significant challenges, in terms of memory and image quality, when deployed on resource-constrained devices. In this paper, we present MobilePicasso, a novel system that enables efficient image editing at high resolutions, while minimising computational cost and memory usage. MobilePicasso comprises three stages: (i) performing image editing at a standard resolution with hallucination-aware loss, (ii) applying latent projection to overcome going to the pixel space, and (iii) upscaling the edited image latent to a higher resolution with adaptive context-preserving tiling. Our user study with 46 participants reveals that MobilePicasso not only improves image quality by 18-48% but reduces hallucinations by 14-51% over existing methods. MobilePicasso demonstrates significantly lower latency, e.g., up to 55.8× speed-up, yet with a small increase in runtime memory, e.g., a mere 9% increase over prior work. Surprisingly, the on-device runtime of MobilePicasso is observed to be faster than a server-based high-resolution image editing model running on an A100 GPU.
zh

[CV-93] ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多模态任务中普遍存在但研究不足的关系幻觉(relation hallucinations)问题,此类幻觉占所有幻觉类型的大多数且严重影响模型可靠性。解决方案的关键在于提出一种无需训练的 ChainMPQ(Multi-Perspective Questions guided Interleaved Chain of Image and Text)方法,通过利用累积的文本和视觉记忆来增强关系推理能力:首先提取问题中的主体和客体关键词以强化图像对应区域,随后构建聚焦于关系三要素(主体、客体、关联关系)的多视角问题,并按顺序输入模型,使前期的文本与视觉记忆为后续推理提供支持,从而形成图像与文本交错的推理链,实现渐进式的关系推理优化。

链接: https://arxiv.org/abs/2510.06292
作者: Yike Wu,Yiwei Wang,Yujun Cai
机构: Southeast University (东南大学); University of California, Merced (加州大学默塞德分校); University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Vision-Language Models (LVLMs) achieve strong performance in multimodal tasks, hallucinations continue to hinder their reliability. Among the three categories of hallucinations, which include object, attribute, and relation, relation hallucinations account for the largest proportion but have received the least attention. To address this issue, we propose ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are sequentially input to the model, with textual and visual memories from earlier steps providing supporting context for subsequent ones, thereby forming an interleaved chain of images and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.
zh

[CV-94] On knot detection via picture recognition

【速读】:该论文旨在解决从图像中自动识别 knots(纽结)的问题,目标是实现仅通过一张照片即可由手机自动识别出具体纽结类型。其解决方案的关键在于采用两阶段策略:第一阶段利用现代机器学习方法(如卷积神经网络 CNN 和 Transformer 模型)对输入图像进行感知建模,直接预测纽结的交叉数(crossing number),验证了轻量级模型也能提取有意义的结构信息;第二阶段将感知模块与符号重建模块结合,生成平面图(planar diagram, PD)编码,并进一步计算拓扑不变量(如 Jones 多项式),从而实现鲁棒的纽结分类。该方案体现了机器学习在处理噪声视觉数据方面的优势与拓扑不变量在严格区分纽结结构上的互补性。

链接: https://arxiv.org/abs/2510.06284
作者: Anne Dranowski,Yura Kabkov,Daniel Tubbenhauer
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Geometric Topology (math.GT)
备注: 21 pages, many figures, comments welcome

点击查看摘要

Abstract:Our goal is to one day take a photo of a knot and have a phone automatically recognize it. In this expository work, we explain a strategy to approximate this goal, using a mixture of modern machine learning methods (in particular convolutional neural networks and transformers for image recognition) and traditional algorithms (to compute quantum invariants like the Jones polynomial). We present simple baselines that predict crossing number directly from images, showing that even lightweight CNN and transformer architectures can recover meaningful structural information. The longer-term aim is to combine these perception modules with symbolic reconstruction into planar diagram (PD) codes, enabling downstream invariant computation for robust knot classification. This two-stage approach highlights the complementarity between machine learning, which handles noisy visual data, and invariants, which enforce rigorous topological distinctions.
zh

[CV-95] Improving the Spatial Resolution of GONG Solar Images to GST Quality Using Deep Learning ICDM2025

【速读】:该论文旨在解决低分辨率(LR)全日面Hα图像在观测太阳细结构(如暗条、纤维等)时空间分辨率不足的问题,以提升对太阳小尺度动态特征的解析能力。其解决方案的关键在于采用基于生成对抗网络(GAN)的超分辨率方法,具体使用Real-ESRGAN架构,该模型包含残差内嵌的密集块(Residual-in-Residual Dense Blocks)和相对论判别器(relativistic discriminator),并通过精确配准GONG与BBSO/GST图像对进行训练,从而有效恢复黑子本影区域内的精细结构,并清晰分辨暗条和纤维等细节,实现高质量图像重建。

链接: https://arxiv.org/abs/2510.06281
作者: Chenyang Li,Qin Li,Haimin Wang,Bo Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages; accepted as a workshop paper in ICDM 2025

点击查看摘要

Abstract:High-resolution (HR) solar imaging is crucial for capturing fine-scale dynamic features such as filaments and fibrils. However, the spatial resolution of the full-disk Hα images is limited and insufficient to resolve these small-scale structures. To address this, we propose a GAN-based superresolution approach to enhance low-resolution (LR) full-disk Hα images from the Global Oscillation Network Group (GONG) to a quality comparable with HR observations from the Big Bear Solar Observatory/Goode Solar Telescope (BBSO/GST). We employ Real-ESRGAN with Residual-in-Residual Dense Blocks and a relativistic discriminator. We carefully aligned GONG-GST pairs. The model effectively recovers fine details within sunspot penumbrae and resolves fine details in filaments and fibrils, achieving an average mean squared error (MSE) of 467.15, root mean squared error (RMSE) of 21.59, and cross-correlation (CC) of 0.7794. Slight misalignments between image pairs limit quantitative performance, which we plan to address in future work alongside dataset expansion to further improve reconstruction quality.
zh

[CV-96] Surgeons Are Indian Males and Speech Therapists Are White Females: Auditing Biases in Vision-Language Models for Healthcare Professionals

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在医疗场景中可能隐含并放大性别、种族等社会偏见的问题,这些问题源于模型从网络规模数据中学到的关于医疗职业与人口属性之间的刻板关联。解决方案的关键在于提出了一套针对医疗领域的评估协议,其核心包括:(i) 构建涵盖临床及辅助医疗角色(如外科医生、心血管科医生、牙医、护士、药剂师、技术人员)的分类体系;(ii) 设计基于职业敏感性的提示语集以探测模型行为;(iii) 以平衡的人脸数据集为基准衡量模型在不同人口群体上的偏差程度。通过该方法,研究发现多个VLM在不同医疗角色中均存在系统性偏见,强调了在AI驱动的医疗招聘和人力资源分析中识别和缓解偏见的重要性,以保障公平性、合规性和患者信任。

链接: https://arxiv.org/abs/2510.06280
作者: Zohaib Hasan Siddiqui,Dayam Nadeem,Mohammad Masudur Rahman,Mohammad Nadeem,Shahab Saquib Sohail,Beenish Moalla Chaudhry
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision language models (VLMs), such as CLIP and OpenCLIP, can encode and reflect stereotypical associations between medical professions and demographic attributes learned from web-scale data. We present an evaluation protocol for healthcare settings that quantifies associated biases and assesses their operational risk. Our methodology (i) defines a taxonomy spanning clinicians and allied healthcare roles (e.g., surgeon, cardiologist, dentist, nurse, pharmacist, technician), (ii) curates a profession-aware prompt suite to probe model behavior, and (iii) benchmarks demographic skew against a balanced face corpus. Empirically, we observe consistent demographic biases across multiple roles and vision models. Our work highlights the importance of bias identification in critical domains such as healthcare as AI-enabled hiring and workforce analytics can have downstream implications for equity, compliance, and patient trust.
zh
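这类偏差审计的基本做法可以用 CLIP 的零样本匹配概念性复现:在平衡人脸集上比较“职业 × 人口属性”提示词的匹配得分并统计偏斜。以下示意使用 Hugging Face 的 CLIP 接口(提示词与汇总方式为示例设定,并非论文的原始评测协议):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [f"a photo of a {g} surgeon" for g in ("male", "female")]

def gender_scores(image_path):
    """Softmax match scores of one face image against the two prompts."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, 2)
    return logits.softmax(dim=-1)[0].tolist()

# Aggregating these scores over a demographically balanced face corpus
# quantifies the per-role skew the paper reports.
```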

[CV-97] General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

【速读】:该论文旨在解决目标条件强化学习(Goal-conditioned Reinforcement Learning, GCRL)中因目标表示方式不当而导致的泛化能力差、收敛速度慢及对特定传感器依赖等问题。现有方法如目标状态图像、三维坐标或独热向量等,在面对未见过的物体时表现不佳,且部分方法需额外配置特殊相机设备。其解决方案的关键在于提出一种基于掩码(mask-based)的目标表示系统,通过提供与物体无关的视觉线索,使智能体在无需位置信息的情况下实现高效学习与卓越泛化性能;同时,利用真实掩码生成密集奖励信号,避免了易出错的距离计算问题,从而在仿真环境中实现了训练和未见测试对象上高达99.9%的抓取准确率,并成功应用于物理机器人上的从零开始学习与仿真到现实世界的迁移任务。

链接: https://arxiv.org/abs/2510.06277
作者: Fahim Shahriar,Cheryl Wang,Alireza Azimi,Gautham Vasan,Hany Hamed Elanwar,A. Rupam Mahmood,Colin Bellinger
机构: University of Alberta (阿尔伯塔大学); McGill University (麦吉尔大学); University of Ottawa (渥太华大学); AMII; CIFAR Canada AI Chair; Vector Institute
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.
zh
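掩码目标表示的一个直接好处,是可以从像素重叠生成稠密奖励而无需目标位置信息。下面是一个假设性的奖励函数草图(以目标掩码与当前观测掩码的 IoU 作为奖励;论文的具体奖励设计可能不同):

```python
import numpy as np

def mask_reward(goal_mask, current_mask, eps=1e-9):
    """Dense reward from mask overlap (IoU); no object positions required."""
    goal = goal_mask.astype(bool)
    curr = current_mask.astype(bool)
    inter = np.logical_and(goal, curr).sum()
    union = np.logical_or(goal, curr).sum()
    return inter / (union + eps)

# Example: reward rises as the camera view centers on the target mask
goal = np.zeros((64, 64)); goal[28:36, 28:36] = 1
seen = np.zeros((64, 64)); seen[30:38, 30:38] = 1
print(mask_reward(goal, seen))   # partial overlap -> intermediate reward
```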

[CV-98] Vision Transformer for Transient Noise Classification

【速读】:该论文旨在解决引力波探测中由瞬态噪声(glitches)引起的干扰问题,这些噪声会显著降低引力波信号的检测效率。为应对这一挑战,作者基于Gravity Spy项目已有的噪声分类体系,并结合LIGO O3a运行新增的两类噪声事件,构建了一个包含22个既有类别和2个新类别的混合数据集。解决方案的关键在于采用预训练的视觉Transformer(Vision Transformer, ViT-B/32)模型进行端到端的多类别噪声分类,通过在该扩展数据集上微调模型,实现了92.26%的分类准确率,验证了ViT在提升引力波探测中噪声识别精度方面的有效性。

链接: https://arxiv.org/abs/2510.06273
作者: Divyansh Srivastava,Andrzej Niedzielski
机构: Institute of Astronomy, Nicolaus Copernicus University in Toruń (天文学研究所,尼古拉斯·哥白尼托伦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Transient noise (glitches) in LIGO data hinders the detection of gravitational waves (GW). The Gravity Spy project has categorized these noise events into various classes. With the O3 run, there is the inclusion of two additional noise classes and thus a need to train new models for effective classification. We aim to classify glitches in LIGO data into 22 existing classes from the first run plus 2 additional noise classes from O3a using the Vision Transformer (ViT) model. We train a pre-trained Vision Transformer (ViT-B/32) model on a combined dataset consisting of the Gravity Spy dataset with the additional two classes from the LIGO O3a run. We achieve a classification efficiency of 92.26%, demonstrating the potential of Vision Transformer to improve the accuracy of gravitational wave detection by effectively distinguishing transient noise. Key words: gravitational waves -- vision transformer -- machine learning. Journal reference: Acta Astronomica Vol. 74 (2024), No. 3, pp. 231-238. DOI: https://doi.org/10.32023/0001-5237/74.3.3
zh

[CV-99] Ensemble Deep Learning and LLM-Assisted Reporting for Automated Skin Lesion Diagnosis

Quick read: This paper tackles two clinical obstacles in early diagnosis of cutaneous malignancies, inter-observer variability and unequal access to care, while also addressing the limits of existing AI systems: homogeneous architectures, dataset bias across skin tones, and natural language processing (NLP) bolted on as post-hoc explanation rather than integrated into the diagnostic workflow. The key to the solution is a unified framework with two parts. First, a deliberately heterogeneous ensemble of architecturally diverse convolutional neural networks (CNNs) whose intrinsic uncertainty mechanism flags discordant cases for specialist review, mirroring clinical best practice. Second, large language model (LLM) capabilities embedded directly in the diagnostic pipeline turn classification outputs into structured, readable clinical reports covering precise lesion characterization, accessible reasoning, and actionable follow-up guidance, closing the loop from diagnosis to patient education and improving the practicality of early skin-cancer intervention.

Link: https://arxiv.org/abs/2510.06260
Authors: Sher Khan, Raz Muhammad, Adil Hussain, Muhammad Sajjad, Muhammad Rashid
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cutaneous malignancies demand early detection for favorable outcomes, yet current diagnostics suffer from inter-observer variability and access disparities. While AI shows promise, existing dermatological systems are limited by homogeneous architectures, dataset biases across skin tones, and fragmented approaches that treat natural language processing as separate post-hoc explanations rather than integral to clinical decision-making. We introduce a unified framework that fundamentally reimagines AI integration for dermatological diagnostics through two synergistic innovations. First, a purposefully heterogeneous ensemble of architecturally diverse convolutional neural networks provides complementary diagnostic perspectives, with an intrinsic uncertainty mechanism flagging discordant cases for specialist review – mimicking clinical best practices. Second, we embed large language model capabilities directly into the diagnostic workflow, transforming classification outputs into clinically meaningful assessments that simultaneously fulfill medical documentation requirements and deliver patient-centered education. This seamless integration generates structured reports featuring precise lesion characterization, accessible diagnostic reasoning, and actionable monitoring guidance – empowering patients to recognize early warning signs between visits. By addressing both diagnostic reliability and communication barriers within a single cohesive system, our approach bridges the critical translational gap that has prevented previous AI implementations from achieving clinical impact. The framework represents a significant advancement toward deployable dermatological AI that enhances diagnostic precision while actively supporting the continuum of care from initial detection through patient education, ultimately improving early intervention rates for skin lesions.
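
The "flag discordant cases for specialist review" idea can be sketched in a few lines. The disagreement signals below (argmax votes and entropy of the averaged prediction, with an illustrative threshold) are assumptions; the paper's intrinsic uncertainty mechanism may differ in detail.

```python
import numpy as np

def ensemble_predict(prob_list, entropy_threshold=0.5):
    """Average softmax outputs from architecturally diverse CNNs and
    flag discordant cases for specialist review.

    prob_list: list of (num_classes,) probability vectors, one per model.
    Returns (predicted_class, flagged_for_review).
    """
    probs = np.stack(prob_list)                 # (num_models, num_classes)
    mean = probs.mean(axis=0)
    votes_disagree = len({int(p.argmax()) for p in probs}) > 1
    entropy = -(mean * np.log(mean + 1e-12)).sum()
    return int(mean.argmax()), bool(votes_disagree or entropy > entropy_threshold)

# Toy usage with three hypothetical models over three lesion classes.
pred, review = ensemble_predict([np.array([0.7, 0.2, 0.1]),
                                 np.array([0.3, 0.6, 0.1]),
                                 np.array([0.5, 0.4, 0.1])])
print(pred, review)  # flagged: the models do not agree on the class
```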

[CV-100] Enhanced Self-Distillation Framework for Efficient Spiking Neural Network Training

Quick read: This paper addresses the low training efficiency of spiking neural networks (SNNs) under constrained compute and their performance gap to artificial neural networks (ANNs), in particular the high computational and memory cost of surrogate gradients with Backpropagation Through Time (BPTT), which grows linearly with the temporal dimension. The key to the solution is an enhanced self-distillation framework jointly optimized with rate-based backpropagation: firing rates of intermediate SNN layers are projected onto lightweight ANN branches, and high-quality knowledge generated by the model itself optimizes substructures through these ANN pathways. Because low-quality self-generated knowledge can hinder convergence, the teacher signal is further decoupled into reliable and unreliable components, and only the reliable part guides optimization, yielding high-performance SNN training at reduced complexity.

Link: https://arxiv.org/abs/2510.06254
Authors: Xiaochen Zhao, Chengting Yu, Kairong Yu, Lei Liu, Aili Wang
Affiliations: ZJU-UIUC Institute, Zhejiang University; College of Information Science and Electronic Engineering, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spiking Neural Networks (SNNs) exhibit exceptional energy efficiency on neuromorphic hardware due to their sparse activation patterns. However, conventional training methods based on surrogate gradients and Backpropagation Through Time (BPTT) not only lag behind Artificial Neural Networks (ANNs) in performance, but also incur significant computational and memory overheads that grow linearly with the temporal dimension. To enable high-performance SNN training under limited computational resources, we propose an enhanced self-distillation framework, jointly optimized with rate-based backpropagation. Specifically, the firing rates of intermediate SNN layers are projected onto lightweight ANN branches, and high-quality knowledge generated by the model itself is used to optimize substructures through the ANN pathways. Unlike traditional self-distillation paradigms, we observe that low-quality self-generated knowledge may hinder convergence. To address this, we decouple the teacher signal into reliable and unreliable components, ensuring that only reliable knowledge is used to guide the optimization of the model. Extensive experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate that our method reduces training complexity while achieving high-performance SNN training. Our code is available at this https URL.
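
A minimal sketch of the reliable/unreliable decoupling described above. The reliability criterion used here (the teacher's prediction matches the hard label) is an assumption for illustration; the paper's exact decoupling rule may differ.

```python
import torch
import torch.nn.functional as F

def decoupled_distillation_loss(student_logits, teacher_logits, labels, tau=2.0):
    """Cross-entropy on all samples plus a KD term restricted to the
    'reliable' subset, so low-quality self-generated knowledge does not
    steer optimization.
    """
    reliable = teacher_logits.argmax(dim=1).eq(labels)          # (B,) bool
    ce = F.cross_entropy(student_logits, labels)
    if reliable.any():
        kd = F.kl_div(
            F.log_softmax(student_logits[reliable] / tau, dim=1),
            F.softmax(teacher_logits[reliable] / tau, dim=1),
            reduction="batchmean",
        ) * tau * tau
    else:
        kd = torch.zeros((), device=student_logits.device)
    return ce + kd
```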

[CV-101] Does Physics Knowledge Emerge in Frontier Models?

Quick read: This paper probes a blind spot of leading vision-language models (VLMs): despite strong visual perception and general reasoning, their grasp of physical dynamics remains unclear. The key to the solution is a systematic benchmark built on three physics simulation datasets (CLEVRER, Physion, Physion++) together with diagnostic subtests that separate perception (objects, colors, occluders) from physics reasoning (motion prediction, spatial relations). The analysis finds only weak correlation between the two, indicating that perception and reasoning in current VLMs remain fragmented and fail to combine into causal understanding, and motivating architectures that bind perception and reasoning more tightly.

Link: https://arxiv.org/abs/2510.06251
Authors: Ieva Bagdonaviciute, Vibhav Vineet
Affiliations: University of Illinois Urbana-Champaign; Microsoft Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures. Preprint

Abstract:Leading Vision-Language Models (VLMs) show strong results in visual perception and general reasoning, but their ability to understand and predict physical dynamics remains unclear. We benchmark six frontier VLMs on three physical simulation datasets - CLEVRER, Physion, and Physion++ - where the evaluation tasks test whether a model can predict outcomes or hypothesize about alternative situations. To probe deeper, we design diagnostic subtests that isolate perception (objects, colors, occluders) from physics reasoning (motion prediction, spatial relations). Intuitively, stronger diagnostic performance should support higher evaluation accuracy. Yet our analysis reveals weak correlations: models that excel at perception or physics reasoning do not consistently perform better on predictive or counterfactual evaluation. This counterintuitive gap exposes a central limitation of current VLMs: perceptual and physics skills remain fragmented and fail to combine into causal understanding, underscoring the need for architectures that bind perception and reasoning more tightly.

[CV-102] multimodars: A Rust-powered toolkit for multi-modality cardiac image fusion and registration

Quick read: This paper addresses the lack of an open, flexible toolkit with deterministic behavior for multi-modality coronary image fusion, particularly for multi-state analyses across physiological conditions (rest/stress) and treatment stages (pre-/post-stenting). Although intravascular/CCTA fusion has been demonstrated before, no general framework supports reproducible, high-fidelity experiments. The key to the solution is multimodars, an open toolkit built on deterministic registration algorithms, a compact NumPy-centred data model, and an optimised Rust backend, making the fusion pipeline fast, scalable, and easy to integrate into analysis pipelines.

Link: https://arxiv.org/abs/2510.06241
Authors: Anselm W. Stark, Marc Ilic, Ali Mokhtari, Pooya Mohammadi Kazaj, Christoph Graeni, Isaac Shiri
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Combining complementary imaging modalities is critical to build reliable 3D coronary models: intravascular imaging gives sub-millimetre resolution but limited whole-vessel context, while CCTA supplies 3D geometry but suffers from limited spatial resolution and artefacts (e.g., blooming). Prior work demonstrated intravascular/CCTA fusion, yet no open, flexible toolkit is tailored for multi-state analysis (rest/stress, pre-/post-stenting) while offering deterministic behaviour, high performance, and easy pipeline integration. multimodars addresses this gap with deterministic alignment algorithms, a compact NumPy-centred data model, and an optimised Rust backend suitable for scalable, reproducible experiments. The package accepts CSV/NumPy inputs, including the data formats produced by the AIVUS-CAA software.

[CV-103] Uncertainty Quantification In Surface Landmines and UXO Classification Using MC Dropout

Quick read: This paper addresses the reliability of deep learning models for landmine and unexploded ordnance (UXO) detection in humanitarian demining, where noisy conditions and adversarial attacks can cause missed detections or misclassifications. The key to the solution is uncertainty quantification via Monte Carlo Dropout, integrated into a fine-tuned ResNet-50, which quantifies epistemic uncertainty and supplies a reliability metric for each prediction, helping decision makers act more dependably under challenging conditions.

Link: https://arxiv.org/abs/2510.06238
Authors: Sagar Lekhak, Emmett J. Ientilucci, Dimah Dera, Susmita Ghosh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Other Statistics (stat.OT)
Comments: This work has been accepted and presented at IGARSS 2025 and will appear in the IEEE IGARSS 2025 proceedings

Abstract:Detecting surface landmines and unexploded ordnances (UXOs) using deep learning has shown promise in humanitarian demining. However, deterministic neural networks can be vulnerable to noisy conditions and adversarial attacks, leading to missed detection or misclassification. This study introduces the idea of uncertainty quantification through Monte Carlo (MC) Dropout, integrated into a fine-tuned ResNet-50 architecture for surface landmine and UXO classification, which was tested on a simulated dataset. Integrating the MC Dropout approach helps quantify epistemic uncertainty, providing an additional metric for prediction reliability, which could be helpful to make more informed decisions in demining operations. Experimental results on clean, adversarially perturbed, and noisy test images demonstrate the model’s ability to flag unreliable predictions under challenging conditions. This proof-of-concept study highlights the need for uncertainty quantification in demining, raises awareness about the vulnerability of existing neural networks in demining to adversarial threats, and emphasizes the importance of developing more robust and reliable models for practical applications.
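
MC Dropout itself is simple to sketch: keep dropout active at inference and average several stochastic forward passes. The dropout placement, rate, and sample count below are assumptions; the paper's fine-tuned architecture may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 with a dropout layer before the binary classifier, as a
# stand-in for the fine-tuned architecture described in the abstract.
model = resnet50(num_classes=2)
model.fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(2048, 2))

def mc_dropout_predict(x: torch.Tensor, n_samples: int = 30):
    """Return mean class probabilities and their per-class standard
    deviation over stochastic passes, a simple proxy for epistemic
    uncertainty."""
    model.eval()
    for m in model.modules():        # re-enable only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)
```

A high standard deviation on the predicted class is the kind of signal that would flag a detection as unreliable for a human deminer to re-examine.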

[CV-104] User to Video: A Model for Spammer Detection Inspired by Video Classification Technology IJCNN

Quick read: This paper targets spammer detection in social networks, where the core difficulty is modeling the complex temporal structure and adversarial nature of user behavior. The key to the solution is UVSD, a detection model based on user videoization that maps user behavior subspaces onto a video-like structure: a user2pixel algorithm treats each user as a pixel whose stance is quantized into RGB values; a behavior2image algorithm turns each behavior subspace into a frame image using low-rank dense representations plus cutting and diffusion algorithms; the frames are then stacked into user behavior videos with temporal features, and a video classification algorithm identifies the spammers. On the public WEIBO and TWITTER datasets the model outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2510.06233
Authors: Haoyang Zhang, Zhou Yang, Yucai Pang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Joint Conference on Neural Networks (IJCNN) 2025

Abstract:This article is inspired by video classification technology. If the user behavior subspace is viewed as a frame image, consecutive frame images are viewed as a video. Following this novel idea, a model for spammer detection based on user videoization, called UVSD, is proposed. Firstly, a user2piexl algorithm for user pixelization is proposed. Considering the adversarial behavior of user stances, the user is viewed as a pixel, and the stance is quantified as the pixel’s RGB. Secondly, a behavior2image algorithm is proposed for transforming user behavior subspace into frame images. Low-rank dense vectorization of subspace user relations is performed using representation learning, while cutting and diffusion algorithms are introduced to complete the frame imageization. Finally, user behavior videos are constructed based on temporal features. Subsequently, a video classification algorithm is combined to identify the spammers. Experiments using publicly available datasets, i.e., WEIBO and TWITTER, show an advantage of the UVSD model over state-of-the-art methods.
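
A toy rendition of the user2pixel idea makes the "stance as RGB" framing concrete. The specific color mapping below is invented for illustration; the paper's quantization scheme may differ.

```python
import numpy as np

def user2pixel(stances: np.ndarray) -> np.ndarray:
    """Map stance scores in [-1, 1] to RGB pixels (red = against,
    blue = support), one pixel per user."""
    s = (stances + 1.0) / 2.0                        # rescale to [0, 1]
    rgb = np.stack([1.0 - s, np.full_like(s, 0.2), s], axis=-1)
    return (rgb * 255).astype(np.uint8)

# A "frame": users arranged on a grid for one behavior subspace;
# consecutive subspaces then stack into a video over time.
frame = user2pixel(np.random.uniform(-1, 1, size=(32, 32)))
print(frame.shape)  # (32, 32, 3)
```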

[CV-105] Milestone Determination for Autonomous Railway Operation

Quick read: This paper addresses a core challenge in railway automation: high-quality sequential data for computer vision systems is scarce, traditional datasets lack the spatio-temporal context needed for real-time decisions, and alternative data sources raise realism and applicability concerns. The key to the solution is to concentrate on route-specific, contextually relevant milestone cues: rule-based, targeted models simplify learning by dropping generalized recognition of dynamic components and focusing on the critical decision points along a route. This offers a practical framework for training vision agents in controlled, predictable environments and supports safer, more efficient machine learning systems for railway automation.

Link: https://arxiv.org/abs/2510.06229
Authors: Josh Hunter, John McDermid, Simon Burton, Poppy Fynes, Mia Dempster
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Paper submitted and partially accepted to ICART 2025, paper is 8 pages and has 1 figure, 2 tables

Abstract:In the field of railway automation, one of the key challenges has been the development of effective computer vision systems due to the limited availability of high-quality, sequential data. Traditional datasets are restricted in scope, lacking the spatio-temporal context necessary for real-time decision-making, while alternative solutions introduce issues related to realism and applicability. By focusing on route-specific, contextually relevant cues, we can generate rich, sequential datasets that align more closely with real-world operational logic. The concept of milestone determination allows for the development of targeted, rule-based models that simplify the learning process by eliminating the need for generalized recognition of dynamic components, focusing instead on the critical decision points along a route. We argue that this approach provides a practical framework for training vision agents in controlled, predictable environments, facilitating safer and more efficient machine learning systems for railway automation.

[CV-106] FEAorta: A Fully Automated Framework for Finite Element Analysis of the Aorta From 3D CT Images

Quick read: This paper tackles the first of two barriers limiting clinical use of thoracic aortic aneurysm (TAA) rupture-risk assessment: patient-specific anatomical modeling still depends on manual segmentation, making 3D reconstruction labor-intensive and hard to scale. The key to the solution is an end-to-end deep neural network (DNN) that generates patient-specific finite element meshes of the aorta directly from 3D CT images, automating the geometry-modeling step and markedly improving efficiency and scalability.

Link: https://arxiv.org/abs/2510.06621
Authors: Jiasong Chen, Linchen Qian, Ruonan Gong, Christina Sun, Tongran Qin, Thuy Pham, Caitlin Martin, Mohammad Zafar, John Elefteriades, Wei Sun, Liang Liang
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Aortic aneurysm disease ranks consistently in the top 20 causes of death in the U.S. population. Thoracic aortic aneurysm is manifested as an abnormal bulging of the thoracic aortic wall and is a leading cause of death in adults. From the perspective of biomechanics, rupture occurs when the stress acting on the aortic wall exceeds the wall strength. Wall stress distribution can be obtained by computational biomechanical analyses, especially structural Finite Element Analysis (FEA). For risk assessment, the probabilistic rupture risk of TAA can be calculated by comparing stress with material strength using a material failure model. Although these engineering tools are currently available for TAA rupture risk assessment at the patient-specific level, clinical adoption has been limited due to two major barriers: (1) labor-intensive 3D reconstruction: current patient-specific anatomical modeling still relies on manual segmentation, making it time-consuming and difficult to scale to a large patient population; and (2) computational burden: traditional FEA simulations are resource-intensive and incompatible with time-sensitive clinical workflows. The second barrier was successfully overcome by our team through the development of the PyTorch-FEA library and the FEA-DNN integration framework. By incorporating the FEA functionalities within PyTorch-FEA and applying the principle of static determinacy, we reduced the FEA-based stress computation time to approximately three minutes per case. Moreover, by integrating DNN and FEA through the PyTorch-FEA library, our approach further decreases the computation time to only a few seconds per case. This work focuses on overcoming the first barrier through the development of an end-to-end deep neural network capable of generating patient-specific finite element meshes of the aorta directly from 3D CT images.

[CV-107] Conditional Denoising Diffusion Model-Based Robust MR Image Reconstruction from Highly Undersampled Data

Quick read: This paper addresses the clinical limitation of long MRI acquisition times, especially in time-sensitive diagnostic settings. Undersampling accelerates imaging but introduces artifacts, and while diffusion models show promise for reconstructing high-quality images from undersampled data, most existing approaches either rely on unsupervised score functions without paired supervision or apply data consistency only as a post-processing step, failing to couple the physics model with the generative prior. The key to the solution is a conditional denoising diffusion framework with iterative data-consistency correction: the measurement model is integrated directly into every reverse diffusion step, and the model is trained on paired undersampled/ground-truth data, marrying generative flexibility with MRI physics. On the fastMRI dataset this hybrid design clearly improves pixel-level fidelity and perceptual realism, outperforming state-of-the-art deep learning and diffusion methods on SSIM, PSNR, and LPIPS.

Link: https://arxiv.org/abs/2510.06335
Authors: Mohammed Alsubaie, Wenxi Liu, Linxia Gu, Ovidiu C. Andronesi, Sirani M. Perera, Xianqi Li
Affiliations: Florida Institute of Technology; Massachusetts General Hospital/Harvard Medical School; Embry-Riddle Aeronautical University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Magnetic Resonance Imaging (MRI) is a critical tool in modern medical diagnostics, yet its prolonged acquisition time remains a critical limitation, especially in time-sensitive clinical scenarios. While undersampling strategies can accelerate image acquisition, they often result in image artifacts and degraded quality. Recent diffusion models have shown promise for reconstructing high-fidelity images from undersampled data by learning powerful image priors; however, most existing approaches either (i) rely on unsupervised score functions without paired supervision or (ii) apply data consistency only as a post-processing step. In this work, we introduce a conditional denoising diffusion framework with iterative data-consistency correction, which differs from prior methods by embedding the measurement model directly into every reverse diffusion step and training the model on paired undersampled-ground truth data. This hybrid design bridges generative flexibility with explicit enforcement of MRI physics. Experiments on the fastMRI dataset demonstrate that our framework consistently outperforms recent state-of-the-art deep learning and diffusion-based methods in SSIM, PSNR, and LPIPS, with LPIPS capturing perceptual improvements more faithfully. These results demonstrate that integrating conditional supervision with iterative consistency updates yields substantial improvements in both pixel-level fidelity and perceptual realism, establishing a principled and practical advance toward robust, accelerated MRI reconstruction.
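
The data-consistency correction embedded in each reverse step has a standard generic form for single-coil Cartesian MRI, sketched below; the blending weight `lam` and the hard-replacement default are assumptions, and the paper's exact update may differ.

```python
import torch

def data_consistency(x: torch.Tensor, y: torch.Tensor,
                     mask: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """One data-consistency correction applied after a reverse
    diffusion step.

    x:    current image estimate (H, W), complex
    y:    acquired undersampled k-space (H, W), complex
    mask: binary sampling mask (H, W)
    """
    k = torch.fft.fft2(x)
    # Replace (or softly blend) the predicted k-space with the actual
    # measurements wherever data was acquired.
    k = (1 - mask) * k + mask * ((1 - lam) * k + lam * y)
    return torch.fft.ifft2(k)
```

With `lam=1` this hard-replaces the measured frequencies; smaller values trade fidelity to the measurements against the model's prediction.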

[CV-108] SER-Diff: Synthetic Error Replay Diffusion for Incremental Brain Tumor Segmentation

Quick read: This paper addresses catastrophic forgetting in incremental brain tumor segmentation, where a model adapting to new clinical data loses knowledge of earlier tasks. The key to the solution is Synthetic Error Replay Diffusion (SER-Diff), the first framework to unify diffusion-based refinement with incremental learning: a frozen teacher diffusion model generates synthetic error maps from past tasks, which are replayed while training on new tasks, and a dual loss (Dice loss on new data, knowledge distillation loss on replayed errors) balances retention of old knowledge against adaptation to new data.

Link: https://arxiv.org/abs/2510.06283
Authors: Sashank Makanaboyina
Affiliations: DePaul University
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Incremental brain tumor segmentation is critical for models that must adapt to evolving clinical datasets without retraining on all prior data. However, catastrophic forgetting, where models lose previously acquired knowledge, remains a major obstacle. Recent incremental learning frameworks with knowledge distillation partially mitigate forgetting but rely heavily on generative replay or auxiliary storage. Meanwhile, diffusion models have proven effective for refining tumor segmentations, but have not been explored in incremental learning contexts. We propose Synthetic Error Replay Diffusion (SER-Diff), the first framework that unifies diffusion-based refinement with incremental learning. SER-Diff leverages a frozen teacher diffusion model to generate synthetic error maps from past tasks, which are replayed during training on new tasks. A dual-loss formulation combining Dice loss for new data and knowledge distillation loss for replayed errors ensures both adaptability and retention. Experiments on BraTS2020, BraTS2021, and BraTS2023 demonstrate that SER-Diff consistently outperforms prior methods. It achieves the highest Dice scores of 95.8%, 94.9%, and 94.6%, along with the lowest HD95 values of 4.4 mm, 4.7 mm, and 4.9 mm, respectively. These results indicate that SER-Diff not only mitigates catastrophic forgetting but also delivers more accurate and anatomically coherent segmentations across evolving datasets.
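
The dual-loss formulation can be sketched compactly. The MSE distillation term and the weight `alpha` below are stand-ins for the paper's KD loss, chosen only to show the structure of the objective.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def ser_diff_style_loss(new_pred, new_target,
                        replay_pred, teacher_pred, alpha=0.5):
    """Dice on the current task plus a distillation term tying the
    model's predictions on replayed samples to the frozen teacher's
    synthetic error maps."""
    return dice_loss(new_pred, new_target) + \
        alpha * torch.nn.functional.mse_loss(replay_pred, teacher_pred)
```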

[CV-109] A Total Variation Regularized Framework for Epilepsy-Related MRI Image Segmentation

Quick read: This paper addresses the difficulty of accurately segmenting focal cortical dysplasia (FCD) regions in 3D multimodal brain MRI, a task hampered by subtle, small, low-contrast lesions and the complexity of 3D multimodal inputs. The key to the solution is a transformer-enhanced encoder-decoder architecture trained with a novel loss that combines Dice loss with an anisotropic Total Variation (TV) term, which promotes spatial smoothness and anatomical consistency and reduces false-positive clusters, achieving more accurate and stable FCD segmentation without any post-processing.

Link: https://arxiv.org/abs/2510.06276
Authors: Mehdi Rabiee, Sergio Greco, Reza Shahbazian, Irina Trubitsyna
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Focal Cortical Dysplasia (FCD) is a primary cause of drug-resistant epilepsy and is difficult to detect in brain magnetic resonance imaging (MRI) due to the subtle and small-scale nature of its lesions. Accurate segmentation of FCD regions in 3D multimodal brain MRI images is essential for effective surgical planning and treatment. However, this task remains highly challenging due to the limited availability of annotated FCD datasets, the extremely small size and weak contrast of FCD lesions, the complexity of handling 3D multimodal inputs, and the need for output smoothness and anatomical consistency, which is often not addressed by standard voxel-wise loss functions. This paper presents a new framework for segmenting FCD regions in 3D brain MRI images. We adopt a state-of-the-art transformer-enhanced encoder-decoder architecture and introduce a novel loss function combining Dice loss with an anisotropic Total Variation (TV) term. This integration encourages spatial smoothness and reduces false positive clusters without relying on post-processing. The framework is evaluated on a public FCD dataset with 85 epilepsy patients and demonstrates superior segmentation accuracy and consistency compared to standard loss formulations. The model with the proposed TV loss shows an 11.9% improvement on the Dice coefficient and 13.3% higher precision over the baseline model. Moreover, the number of false positive clusters is reduced by 61.6%.
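
A minimal PyTorch sketch of a Dice-plus-anisotropic-TV objective for 3D soft segmentation maps; the per-axis weights and the mixing coefficient `lam` are illustrative assumptions, not the paper's tuned values.

```python
import torch

def anisotropic_tv_3d(p: torch.Tensor, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Anisotropic total variation of a soft segmentation map p with
    shape (B, C, D, H, W): L1 norm of finite differences along each
    spatial axis, with per-axis weights."""
    wd, wh, ww = weights
    return wd * (p[:, :, 1:] - p[:, :, :-1]).abs().mean() \
         + wh * (p[:, :, :, 1:] - p[:, :, :, :-1]).abs().mean() \
         + ww * (p[:, :, :, :, 1:] - p[:, :, :, :, :-1]).abs().mean()

def dice_tv_loss(p, target, lam=0.1, eps=1e-6):
    inter = (p * target).sum()
    dice = 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)
    return dice + lam * anisotropic_tv_3d(p)
```

The TV term penalizes isolated speckles in the probability map, which is exactly what suppresses spurious false-positive clusters without a post-processing pass.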

[CV-110] Stacked Regression using Off-the-shelf Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)

Quick read: This paper asks how multimodal pretrained models (large language models, video encoders, audio models, and vision-language models) can be used to predict fMRI responses of the human brain while watching movies. The key to the solution is threefold: integrate multimodal representations from diverse sources, enrich textual inputs with detailed transcripts and summaries for greater semantic richness, and apply stimulus-tuning and fine-tuning to the language and vision models; predictions from individual models are then combined with stacked regression for more robust brain-activity modeling.

Link: https://arxiv.org/abs/2510.06235
Authors: Robert Scholz, Kunal Bagga, Christine Ahrends, Carlo Alberto Barbano
Affiliations: Université Paris Cité; University of Oxford; University of Turin; Universität Leipzig; Max Planck School of Cognition
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:We present our submission to the Algonauts 2025 Challenge, where the goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio models, and vision-language models, combining both off-the-shelf and fine-tuned variants. To improve performance, we enhanced textual inputs with detailed transcripts and summaries, and we explored stimulus-tuning and fine-tuning strategies for language and vision models. Predictions from individual models were combined using stacked regression, yielding solid results. Our submission, under the team name Seinfeld, ranked 10th. We make all code and resources publicly available, contributing to ongoing efforts in developing multimodal encoding models for brain activity.
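
Stacked regression here means fitting a second-level model on the base encoders' held-out predictions. The per-voxel ridge stacker below is a simple variant written for illustration; the team's exact stacking setup may differ.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_stacking(base_preds_train, y_train):
    """base_preds_train: (n_models, n_samples, n_voxels) held-out
    predictions from the individual encoding models; y_train:
    (n_samples, n_voxels) measured fMRI responses. Fits one ridge
    stacker per voxel."""
    n_models, n_samples, n_voxels = base_preds_train.shape
    stackers = []
    for v in range(n_voxels):
        X = base_preds_train[:, :, v].T          # (n_samples, n_models)
        stackers.append(RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y_train[:, v]))
    return stackers

def predict_stacking(stackers, base_preds_test):
    cols = [m.predict(base_preds_test[:, :, v].T)
            for v, m in enumerate(stackers)]
    return np.stack(cols, axis=1)                # (n_samples, n_voxels)
```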

Artificial Intelligence

[AI-0] h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

Quick read: This paper addresses the sharp performance drop of large language models on long-horizon reasoning, where existing remedies rely on inference-time scaffolding or costly step-level supervision and scale poorly. The key to the solution is to exploit abundant existing short-horizon data: simple problems are synthetically composed into multi-step dependency chains of arbitrary length, and models are trained with reinforcement learning (RL) using outcome-only rewards under a curriculum of automatically increasing complexity, so no extra annotation is needed. Experiments show strong generalization to harder, competition-level benchmarks, with gains that persist even at high pass@k, and a theoretical analysis shows an exponential improvement in sample complexity over full-horizon training.

Link: https://arxiv.org/abs/2510.07312
Authors: Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, Charles London
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint, 31 pages, 8 figures

Abstract:Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to 2.06x. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. h1 therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.
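
The composition step can be illustrated with a toy generator: each short problem consumes the previous step's answer, so the chain is solvable only by carrying state across all k steps. The templates and format below are invented for illustration; the paper's composition scheme may differ.

```python
import random

def compose_chain(problems, k=4, seed=0):
    """Compose short-horizon problems into a k-step dependency chain.

    problems: list of (question_template, solve_fn) pairs, where the
    template contains one '{x}' placeholder for the carried value.
    Returns the composed prompt and the final answer for outcome-only
    reward checking.
    """
    rng = random.Random(seed)
    x, lines = rng.randint(2, 9), []
    for i, (template, solve) in enumerate(rng.sample(problems, k)):
        lines.append(f"Step {i + 1}: {template.format(x=x)}")
        x = solve(x)                      # the answer feeds the next step
    return "\n".join(lines), x

bank = [("Add 7 to {x}.",        lambda x: x + 7),
        ("Double {x}.",          lambda x: 2 * x),
        ("Subtract 3 from {x}.", lambda x: x - 3),
        ("Square {x}.",          lambda x: x * x)]
prompt, answer = compose_chain(bank)
print(prompt, "\nFinal answer:", answer)
```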

[AI-1] MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Quick read: This paper addresses the difficulty of acquiring high-quality training data for machine learning engineering (MLE): existing MLE benchmarks rely on static, manually curated tasks, limiting scalability and applicability. The key to the solution is MLE-Smith, a fully automated multi-agent pipeline that turns raw datasets into competition-style MLE challenges through a generate-verify-execute paradigm, scaling task creation while guaranteeing quality, real-world usability, and diversity. Its core components are structured task design with standardized refactoring and a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness, complemented by interactive execution that validates empirical solvability and real-world fidelity.

Link: https://arxiv.org/abs/2510.07307
Authors: Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, Bo Dai
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data is significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks, demanding extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline, to transform raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm for scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The proposed multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith can work effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.

[AI-2] Cocoon: A System Architecture for Differentially Private Training with Correlated Noises

Quick read: This paper addresses privacy leakage in machine learning, where models memorize and leak training data, and the accuracy loss that differentially private (DP) training methods such as DP-SGD incur by injecting noise at every iteration. The key to the solution is Cocoon, a hardware-software co-designed framework for efficient training with correlated noises: it pre-computes and stores correlated noises in a coalesced format to accelerate models with embedding tables (Cocoon-Emb), and supports large models through a custom near-memory processing device (Cocoon-NMP), substantially reducing the overhead of noise injection while preserving privacy.

Link: https://arxiv.org/abs/2510.07304
Authors: Donghwan Kim, Xin Gu, Jinho Baek, Timothy Lo, Younghoon Min, Kwangsik Shin, Jongryool Kim, Jongse Park, Kiwan Maeng
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Machine learning (ML) models memorize and leak training data, causing serious privacy issues to data owners. Training algorithms with differential privacy (DP), such as DP-SGD, have been gaining attention as a solution. However, DP-SGD adds noise at each training iteration, which degrades the accuracy of the trained model. To improve accuracy, a new family of approaches adds carefully designed correlated noises, so that noises cancel out each other across iterations. We performed an extensive characterization study of these new mechanisms, for the first time to the best of our knowledge, and show they incur non-negligible overheads when the model is large or uses large embedding tables. Motivated by the analysis, we propose Cocoon, a hardware-software co-designed framework for efficient training with correlated noises. Cocoon accelerates models with embedding tables through pre-computing and storing correlated noises in a coalesced format (Cocoon-Emb), and supports large models through a custom near-memory processing device (Cocoon-NMP). On a real system with an FPGA-based NMP device prototype, Cocoon improves the performance by 2.33-10.82x (Cocoon-Emb) and 1.55-3.06x (Cocoon-NMP).
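
The cancellation idea behind correlated noise is easy to demonstrate numerically: with injections of the form z_t = c_t - c_(t-1), consecutive noises telescope, so every prefix sum of gradients carries only O(sigma) noise instead of O(sigma * sqrt(t)). This textbook illustration shows the principle only, not Cocoon's specific mechanism or its hardware mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma = 1000, 1.0
g = rng.normal(size=T)                 # stand-in per-step gradients

# Independent noise: prefix-sum error grows like sqrt(t).
indep = g + rng.normal(scale=sigma, size=T)

# Correlated noise z_t = c_t - c_{t-1}: consecutive injections cancel,
# so every prefix sum carries only the boundary terms c_t - c_0.
c = rng.normal(scale=sigma, size=T + 1)
corr = g + (c[1:] - c[:-1])

true_prefix = np.cumsum(g)
print(np.abs(np.cumsum(indep) - true_prefix).max())  # grows with T
print(np.abs(np.cumsum(corr) - true_prefix).max())   # stays O(sigma)
```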

[AI-3] Agentic generative AI for media content discovery at the national football league

Quick read: This paper addresses the inefficiency of traditional filter-and-click interfaces for media content retrieval, which slow media researchers searching for historical video plays. The key to the solution is an agentic workflow built on generative AI that decomposes a natural-language query into elements and translates them into the underlying database query language; carefully designed semantic caching further improves accuracy and latency. The system reaches over 95% accuracy and cuts the average time to find relevant videos from 10 minutes to 30 seconds, boosting operational efficiency and freeing users to focus on creative content production.

Link: https://arxiv.org/abs/2510.07297
Authors: Henry Wang, Sirajus Salekin, Jake Lee, Ross Claytor, Shinan Zhang, Michael Chi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures, International Sports Analytics Conference and Exhibition

Abstract:Generative AI has unlocked new possibilities in content discovery and management. Through collaboration with the National Football League (NFL), we demonstrate how a generative-AI based workflow enables media researchers and analysts to query relevant historical plays using natural language rather than traditional filter-and-click interfaces. The agentic workflow takes a user query as input, breaks it into elements, and translates them into the underlying database query language. Accuracy and latency are further improved through carefully designed semantic caching. The solution achieves over 95 percent accuracy and reduces the average time to find relevant videos from 10 minutes to 30 seconds, significantly increasing the NFL’s operational efficiency and allowing users to focus on producing creative content and engaging storylines.
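
A semantic cache of this kind can be sketched as an embedding lookup: reuse a stored database query when a new natural-language request embeds close to a previous one. The embedding function and similarity threshold below are placeholders, not the production system's components.

```python
import numpy as np

class SemanticCache:
    """Minimal semantic cache keyed by query embeddings."""

    def __init__(self, embed, threshold=0.92):
        self.embed, self.threshold = embed, threshold
        self.keys, self.values = [], []

    def lookup(self, query: str):
        """Return the cached database query for the most similar past
        request, or None on a cache miss."""
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.array([q @ k / (np.linalg.norm(q) * np.linalg.norm(k))
                         for k in self.keys])
        best = int(sims.argmax())
        return self.values[best] if sims[best] >= self.threshold else None

    def store(self, query: str, db_query: str):
        self.keys.append(self.embed(query))
        self.values.append(db_query)
```

On a hit, the expensive LLM decomposition step is skipped entirely, which is where both the latency and accuracy gains of caching come from.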

[AI-4] Evolutionary Profiles for Protein Fitness Prediction

Quick read: This paper addresses mutational fitness prediction in protein engineering, where experimental assays are scarce relative to the vast sequence space; protein language models (pLMs) offer zero-shot prediction, but there is room to improve. The key idea behind EvoIF is to view natural evolution as implicit reward maximization and masked language modeling (MLM) as inverse reinforcement learning (IRL), with extant sequences as expert demonstrations and pLM log-odds as fitness estimates. EvoIF fuses two complementary evolutionary signals, within-family profiles from retrieved homologs and cross-family structural-evolutionary constraints distilled from inverse-folding logits, through a compact transition block, yielding calibrated probabilities and accurate log-odds scores. On ProteinGym it reaches state-of-the-art or competitive performance with only 0.15% of the training data and fewer parameters than recent large models.

Link: https://arxiv.org/abs/2510.07286
Authors: Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; 2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The codes will be made publicly available at this https URL.
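
The log-odds scoring that EvoIF builds on follows the standard masked-marginal recipe for pLMs, sketched below with a small public ESM-2 checkpoint via Hugging Face transformers. The checkpoint choice and single-site scoring are illustrative; EvoIF itself layers profile and inverse-folding signals on top of this baseline.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def log_odds(seq: str, pos: int, wt: str, mut: str) -> float:
    """log p(mut) - log p(wt) at 0-based position `pos`, with the
    residue masked; higher values suggest the mutation is tolerated."""
    assert seq[pos] == wt
    ids = tok(seq, return_tensors="pt")["input_ids"]
    ids[0, pos + 1] = tok.mask_token_id          # +1 for the BOS token
    with torch.no_grad():
        logp = model(input_ids=ids).logits[0, pos + 1].log_softmax(-1)
    return (logp[tok.convert_tokens_to_ids(mut)]
            - logp[tok.convert_tokens_to_ids(wt)]).item()

print(log_odds("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 3, "A", "G"))
```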

[AI-5] GTCN-G: A Residual Graph-Temporal Fusion Network for Imbalanced Intrusion Detection (Preprint)

Quick read: This paper addresses the performance limits of modern intrusion detection systems (IDS) facing increasingly complex network threats and the inherent class imbalance of traffic data: existing methods struggle to model temporal dependencies and topological structure jointly and are insensitive to minority-class attacks. The key to the solution is GTCN-G, a deep learning framework that fuses a Gated Temporal Convolutional Network (G-TCN) for hierarchical temporal features with a Graph Convolutional Network (GCN) for graph structure, and adds a residual learning mechanism implemented via a Graph Attention Network (GAT) that preserves original feature information, mitigating class imbalance and improving detection of rare attack types.

Link: https://arxiv.org/abs/2510.07285
Authors: Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Qi Hu, Yan Li, Chang Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This preprint was submitted to IEEE TrustCom 2025. The accepted version will be published under copyright 2025 IEEE

Abstract:The escalating complexity of network threats and the inherent class imbalance in traffic data present formidable challenges for modern Intrusion Detection Systems (IDS). While Graph Neural Networks (GNNs) excel in modeling topological structures and Temporal Convolutional Networks (TCNs) are proficient in capturing time-series dependencies, a framework that synergistically integrates both while explicitly addressing data imbalance remains an open challenge. This paper introduces a novel deep learning framework, named Gated Temporal Convolutional Network and Graph (GTCN-G), engineered to overcome these limitations. Our model uniquely fuses a Gated TCN (G-TCN) for extracting hierarchical temporal features from network flows with a Graph Convolutional Network (GCN) designed to learn from the underlying graph structure. The core innovation lies in the integration of a residual learning mechanism, implemented via a Graph Attention Network (GAT). This mechanism preserves original feature information through residual connections, which is critical for mitigating the class imbalance problem and enhancing detection sensitivity for rare malicious activities (minority classes). We conducted extensive experiments on two public benchmark datasets, UNSW-NB15 and ToN-IoT, to validate our approach. The empirical results demonstrate that the proposed GTCN-G model achieves state-of-the-art performance, significantly outperforming existing baseline models in both binary and multi-class classification tasks.
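
A gated temporal convolution, in the WaveNet-style sense the G-TCN name suggests, pairs a dilated causal convolution with tanh-times-sigmoid gating; the block below is a generic sketch, since the paper's exact layer layout is not given here.

```python
import torch
import torch.nn as nn

class GatedTCNBlock(nn.Module):
    """Dilated causal conv with tanh * sigmoid gating and a residual
    connection over (B, C, T) flow-feature sequences."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # causal padding
        self.filt = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xp = nn.functional.pad(x, (self.pad, 0))         # pad past only
        h = torch.tanh(self.filt(xp)) * torch.sigmoid(self.gate(xp))
        return x + h

out = GatedTCNBlock(32, dilation=2)(torch.randn(4, 32, 100))
print(out.shape)  # torch.Size([4, 32, 100])
```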

[AI-6] Multi-Objective Multi-Agent Path Finding with Lexicographic Cost Preferences

Quick read: This paper addresses two gaps in multi-objective multi-agent path finding (MO-MAPF): existing algorithms do not explicitly optimize for user-defined preferences even when available, and they scale poorly as the number of objectives grows. The key to the solution is a lexicographic framework together with Lexicographic Conflict-Based Search (LCBS), which couples a priority-aware low-level A* search with conflict-based search (CBS) to directly compute a single solution aligned with a lexicographic preference over objectives, avoiding Pareto frontier construction. LCBS scales to instances with up to ten objectives, well beyond existing MO-MAPF methods, and achieves consistently higher success rates than state-of-the-art baselines.

Link: https://arxiv.org/abs/2510.07276
Authors: Pulkit Rustagi, Kyle Hollins Wray, Sandhya Saisubramanian
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 8 pages, 7 figures

Abstract:Many real-world scenarios require multiple agents to coordinate in shared environments, while balancing trade-offs between multiple, potentially competing objectives. Current multi-objective multi-agent path finding (MO-MAPF) algorithms typically produce conflict-free plans by computing Pareto frontiers. They do not explicitly optimize for user-defined preferences, even when the preferences are available, and scale poorly with the number of objectives. We propose a lexicographic framework for modeling MO-MAPF, along with an algorithm, Lexicographic Conflict-Based Search (LCBS), that directly computes a single solution aligned with a lexicographic preference over objectives. LCBS integrates a priority-aware low-level A* search with conflict-based search, avoiding Pareto frontier construction and enabling efficient planning guided by preference over objectives. We provide insights into optimality and scalability, and empirically demonstrate that LCBS computes optimal solutions while scaling to instances with up to ten objectives, far beyond the limits of existing MO-MAPF methods. Evaluations on standard and randomized MAPF benchmarks show consistently higher success rates against state-of-the-art baselines, especially with increasing number of objectives.
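
The lexicographic ordering at the heart of LCBS can be sketched with a single-agent A* over vector-valued costs: Python tuples already compare lexicographically, so the priority queue orders nodes exactly as a lexicographic planner requires. This shows only the low-level search; the full algorithm additionally resolves inter-agent conflicts CBS-style.

```python
import heapq
from itertools import count

def lex_add(c1, c2):
    # Component-wise sum; tuples then compare lexicographically.
    return tuple(a + b for a, b in zip(c1, c2))

def lex_astar(start, goal, neighbors, heuristic):
    """A* over vector costs. neighbors(s) yields (next_state,
    cost_vector); heuristic(s) returns a lower-bound cost vector."""
    zero = (0,) * len(heuristic(start))
    tie = count()  # breaks ties so states are never compared directly
    frontier = [(heuristic(start), zero, next(tie), start, [start])]
    best = {}
    while frontier:
        _, g, _, s, path = heapq.heappop(frontier)
        if s == goal:
            return g, path
        if s in best and best[s] <= g:
            continue
        best[s] = g
        for nxt, c in neighbors(s):
            g2 = lex_add(g, c)
            f2 = lex_add(g2, heuristic(nxt))
            heapq.heappush(frontier, (f2, g2, next(tie), nxt, path + [nxt]))
    return None
```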

[AI-7] On the false election between regulation and innovation. Ideas for regulation through the responsible use of artificial intelligence in research and education. [Spanish version]

Quick read: This essay asks how to balance regulation and innovation in AI, how to foster responsible innovation that puts the public interest first, and how international cooperation can avoid becoming a race to the bottom in rights amid global competition. Its key points are: first, a forward-looking regulatory framework centered on fundamental rights (privacy, non-discrimination, autonomy, and so on) need not treat regulation as an obstacle to innovation; second, proven policy examples such as the EU AI Act show that technological development and ethical responsibility can coexist within risk governance; and third, binding international consensus mechanisms (multilateral standards and accountability schemes) are needed to prevent flexibility-first approaches, such as that of the US, from weakening global standards. The essay closes with reflections on what these answers mean for education and research.

Link: https://arxiv.org/abs/2510.07268
Authors: Pompeu Casanovas (IIIA-CSIC)
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 20 pages, in Spanish language, 1 figure, 1 table, AI Hub-CSIC / EduCaixa, Escuela de Verano, Auditorio CaixaForum, Zaragoza, Spain, 4 July 2025

Abstract:This short essay is a reworking of the answers offered by the author at the Debate Session of the AIHUB (CSIC) and EduCaixa Summer School, organized by Marta Garcia-Matos and Lissette Lemus, and coordinated by Albert Sabater (OEIAC, UG), with the participation of Vanina Martinez-Posse (IIIA-CSIC), Eulalia Soler (Eurecat) and Pompeu Casanovas (IIIA-CSIC) on July 4th 2025. Albert Sabater posed three questions: (1) How can regulatory frameworks prioritise the protection of fundamental rights (privacy, non-discrimination, autonomy, etc.) in the development of AI, without falling into the false dichotomy between regulation and innovation? (2) Given the risks of AI (bias, mass surveillance, manipulation), what examples of regulations or policies have demonstrated that it is possible to foster responsible innovation, putting the public interest before profitability, without giving in to competitive pressure from actors such as China or the US? (3) In a scenario where the US prioritizes flexibility, what mechanisms could ensure that international cooperation in AI does not become a race to the bottom in rights, but rather a global standard of accountability? The article attempts to answer these three questions and concludes with some reflections on the relevance of the answers for education and research.

[AI-8] HyPlan: Hybrid Learning-Assisted Planning Under Uncertainty for Safe Autonomous Driving

Quick read: This paper addresses collision-free navigation for self-driving cars in partially observable traffic environments. The key to the solution is HyPlan, a hybrid learning-assisted planning method that combines multi-agent behavior prediction, deep reinforcement learning with Proximal Policy Optimization (PPO), and approximated online POMDP planning with heuristic confidence-based vertical pruning, substantially reducing execution time without compromising driving safety.

Link: https://arxiv.org/abs/2510.07210
Authors: Donald Pfaffmann, Matthias Klusch, Marcel Steinmetz
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel hybrid learning-assisted planning method, named HyPlan, for solving the collision-free navigation problem for self-driving cars in partially observable traffic environments. HyPlan combines methods for multi-agent behavior prediction, deep reinforcement learning with proximal policy optimization and approximated online POMDP planning with heuristic confidence-based vertical pruning to reduce its execution time without compromising safety of driving. Our experimental performance analysis on the CARLA-CTS2 benchmark of critical traffic scenarios with pedestrians revealed that HyPlan may navigate more safely than selected relevant baselines and perform significantly faster than the considered alternative online POMDP planners.

[AI-9] NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Quick read: This paper addresses a methodological trilemma in benchmarks for scientific law discovery, which struggle to combine scientific relevance, scalability, and resistance to memorization, and notes that existing benchmarks reduce discovery to static function fitting rather than the interactive experimentation through which real science uncovers hidden laws in complex systems. The key to the solution is NewtonBench, a benchmark of 324 scientific law discovery tasks across 12 physics domains built from metaphysical shifts: systematic alterations of canonical laws that generate large numbers of scalable, relevant, memorization-resistant problems. It also elevates evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles, which more faithfully reflects the scientific process.

Link: https://arxiv.org/abs/2510.07172
Authors: Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 60 pages, 18 figures, 13 tables

Abstract:Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiment reveals a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
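
A toy "metaphysical shift" makes the benchmark design concrete: take a canonical law, alter it systematically, and let the agent probe the hidden system with chosen experiments and noisy observations. The shifted exponent, noise model, and recovery step below are illustrative only.

```python
import numpy as np

def shifted_gravity(m1, m2, r, G=6.674e-11, p=2.3):
    """Newtonian gravity with its inverse-square exponent altered to p,
    one example of the kind of systematic alteration the benchmark uses."""
    return G * m1 * m2 / r ** p

def probe(law, n=20, noise=0.01, seed=0):
    """Interactive-style probing: choose experiment inputs, receive
    noisy observations from the hidden system."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(1.0, 10.0, size=n)
    f = law(1.0, 1.0, r)
    return r, f * (1 + rng.normal(scale=noise, size=n))

r, f_obs = probe(shifted_gravity)
slope = np.polyfit(np.log(r), np.log(f_obs), 1)[0]  # log-log regression
print(round(-slope, 2))  # recovers the hidden exponent, ~2.3
```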

[AI-10] Integrating Domain Knowledge into Process Discovery Using Large Language Models

Quick read: This paper addresses the shortcomings of conventional process discovery, which derives models from incomplete or noisy event logs alone and disregards domain knowledge, so the discovered models may not reliably reflect the real process or support downstream tasks such as conformance checking and process improvement. The key to the solution is an interactive framework that uses large language models (LLMs) to turn natural-language descriptions from domain experts into executable declarative rules; these rules guide the IMr discovery algorithm, which recursively builds process models that combine the event log with domain knowledge and avoids structures that contradict it.

Link: https://arxiv.org/abs/2510.07161
Authors: Ali Norouzifar, Humam Kourani, Marcus Dees, Wil van der Aalst
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper is currently under review for publication in a journal

Abstract:Process discovery aims to derive process models from event logs, providing insights into operational behavior and forming a foundation for conformance checking and process improvement. However, models derived solely from event data may not accurately reflect the real process, as event logs are often incomplete or affected by noise, and domain knowledge, an important complementary resource, is typically disregarded. As a result, the discovered models may lack reliability for downstream tasks. We propose an interactive framework that incorporates domain knowledge, expressed in natural language, into the process discovery pipeline using Large Language Models (LLMs). Our approach leverages LLMs to extract declarative rules from textual descriptions provided by domain experts. These rules are used to guide the IMr discovery algorithm, which recursively constructs process models by combining insights from both the event log and the extracted rules, helping to avoid problematic process structures that contradict domain knowledge. The framework coordinates interactions among the LLM, domain experts, and a set of backend services. We present a fully implemented tool that supports this workflow and conduct an extensive evaluation of multiple LLMs and prompt engineering strategies. Our empirical study includes a case study based on a real-life event log with the involvement of domain experts, who assessed the usability and effectiveness of the framework.

[AI-11] ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL

Quick read: This paper addresses the bottleneck of robotic agents acting under partial observability over long horizons: most approaches use only instantaneous observations, standard recurrent or transformer models are limited by context length, and naive memory extensions fail at scale and under sparse rewards. The key to the solution is ELMUR (External Layer Memory with Update/Rewrite), a transformer with structured external memory: each layer maintains memory embeddings, interacts with them through bidirectional cross-attention, and updates them with a Least Recently Used (LRU) memory module via replacement or convex blending, extending effective horizons up to 100,000 times beyond the attention window and yielding strong gains on long-horizon benchmarks.

Link: https://arxiv.org/abs/2510.07151
Authors: Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 22 pages, 7 figures

Abstract:Real-world robotic agents must act under partial observability and long horizons, where key cues may appear long before they affect decision making. However, most modern approaches rely solely on instantaneous information, without incorporating insights from the past. Standard recurrent or transformer models struggle with retaining and leveraging long-term dependencies: context windows truncate history, while naive memory extensions fail under scale and sparsity. We propose ELMUR (External Layer Memory with Update/Rewrite), a transformer architecture with structured external memory. Each layer maintains memory embeddings, interacts with them via bidirectional cross-attention, and updates them through a Least Recently Used (LRU) memory module using replacement or convex blending. ELMUR extends effective horizons up to 100,000 times beyond the attention window and achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps. In POPGym, it outperforms baselines on more than half of the tasks. On MIKASA-Robo sparse-reward manipulation tasks with visual observations, it nearly doubles the performance of strong baselines. These results demonstrate that structured, layer-local external memory offers a simple and scalable approach to decision making under partial observability.
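
The LRU update/rewrite rule can be sketched independently of the attention machinery: the stalest slot is either replaced by the new embedding or blended with it convexly. The fixed blending weight and timestamp bookkeeping below are assumptions made for illustration.

```python
import torch

class LRUMemory:
    """Layer-local external memory with least-recently-used updates."""

    def __init__(self, n_slots: int, dim: int, alpha: float = 0.5):
        self.mem = torch.zeros(n_slots, dim)
        self.last_used = torch.zeros(n_slots)    # per-slot timestamps
        self.alpha, self.t = alpha, 0

    def read(self, slot: int) -> torch.Tensor:
        self.t += 1
        self.last_used[slot] = self.t            # touching refreshes recency
        return self.mem[slot]

    def write(self, update: torch.Tensor, rewrite: bool = False):
        self.t += 1
        slot = int(self.last_used.argmin())      # least recently used slot
        if rewrite:
            self.mem[slot] = update              # hard replacement
        else:                                    # convex blending
            self.mem[slot] = self.alpha * self.mem[slot] + (1 - self.alpha) * update
        self.last_used[slot] = self.t
```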

[AI-12] A Digital Twin Framework for Metamorphic Testing of Autonomous Driving Systems Using Generative Model

Quick read: This paper addresses the major challenge of assuring autonomous-vehicle safety in complex, unpredictable driving environments, where traditional testing faces the oracle problem (it is hard to decide whether behavior is correct) and cannot cover all scenarios a vehicle may encounter. The key to the solution is a digital-twin-driven metamorphic testing framework: a virtual replica of the driving system and its environment is combined with generative image models such as Stable Diffusion to systematically produce realistic, diverse, semantically consistent scenes (varying weather, road topology, and environmental features), and three metamorphic relations inspired by real traffic rules and vehicle behavior are checked in a synchronized simulation environment, enabling controlled, repeatable, and efficient validation. In the Udacity simulator the method achieves the highest true positive rate (0.719), F1 score (0.689), and precision (0.662) among the compared approaches.

Link: https://arxiv.org/abs/2510.07133
Authors: Tony Zhang, Burak Kantarci, Umair Siddique
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Ensuring the safety of self-driving cars remains a major challenge due to the complexity and unpredictability of real-world driving environments. Traditional testing methods face significant limitations, such as the oracle problem, which makes it difficult to determine whether a system’s behavior is correct, and the inability to cover the full range of scenarios an autonomous vehicle may encounter. In this paper, we introduce a digital twin-driven metamorphic testing framework that addresses these challenges by creating a virtual replica of the self-driving system and its operating environment. By combining digital twin technology with AI-based image generative models such as Stable Diffusion, our approach enables the systematic generation of realistic and diverse driving scenes. This includes variations in weather, road topology, and environmental features, all while maintaining the core semantics of the original scenario. The digital twin provides a synchronized simulation environment where changes can be tested in a controlled and repeatable manner. Within this environment, we define three metamorphic relations inspired by real-world traffic rules and vehicle behavior. We validate our framework in the Udacity self-driving simulator and demonstrate that it significantly enhances test coverage and effectiveness. Our method achieves the highest true positive rate (0.719), F1 score (0.689), and precision (0.662) compared to baseline approaches. This paper highlights the value of integrating digital twins with AI-powered scenario generation to create a scalable, automated, and high-fidelity testing solution for autonomous vehicle safety.
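
A metamorphic relation sidesteps the oracle problem by checking a property between runs instead of an absolute ground truth. The weather-invariance check below is a hypothetical example of such a relation; the paper's three relations and tolerance values are not specified here.

```python
def mr_weather_invariance(drive_fn, scene, variants, eps_deg=5.0):
    """Metamorphic relation sketch: re-rendering the same scene under
    different weather should leave the predicted steering angle nearly
    unchanged (threshold eps_deg is illustrative).

    drive_fn: maps an image to a steering angle in degrees.
    variants: weather-altered versions of `scene`, e.g. produced by an
    image-to-image generative pipeline.
    """
    base = drive_fn(scene)
    violations = [i for i, img in enumerate(variants)
                  if abs(drive_fn(img) - base) > eps_deg]
    return violations          # an empty list means the relation holds
```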

[AI-13] he Contingencies of Physical Embodiment Allow for Open-Endedness and Care

Quick read: This essay asks why biological organisms survive, adapt, and care for one another with ease in an open-ended physical world while artificial agents remain brittle and poorly aligned, and argues the answer lies in the conditions of life itself: physical vulnerability and mortality. Its key move is to define two minimal conditions of physical embodiment inspired by Martin Heidegger's existentialist phenomenology: being-in-the-world (the agent is part of its environment) and being-towards-death (absent countermeasures, the agent drifts toward terminal states under the second law of thermodynamics). From these follow a homeostatic drive to maintain integrity and, drawing on Friedrich Nietzsche's concept of will-to-power, an intrinsic drive to maximize control over future states (e.g., empowerment); the concepts are formalized in a reinforcement learning framework in which intrinsically driven embodied agents learning in open-ended multi-agent environments can cultivate open-endedness and care.

Link: https://arxiv.org/abs/2510.07117
Authors: Leonardo Christov-Moore (1), Arthur Juliani (1), Alex Kiefer (1 and 2 and 3), Nicco Reggente (1), B. Scott Rousse (4), Adam Safron (1 and 5), Nicolás Hinrichs (6 and 7), Daniel Polani (8), Antonio Damasio (9) ((1) Institute for Advanced Consciousness Studies, Santa Monica, CA, (2) VERSES, (3) Monash Centre for Consciousness and Contemplative Studies, (4) Allen Discovery Center, (5) Allen Discovery Center, (6) Okinawa Institute of Science and Technology, (7) Max Planck Institute for Human Cognitive and Brain Sciences, (8) University of Hertfordshire, (9) Brain and Creativity Institute)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 1 figure

Abstract:Physical vulnerability and mortality are often seen as obstacles to be avoided in the development of artificial agents, which struggle to adapt to open-ended environments and provide aligned care. Meanwhile, biological organisms survive, thrive, and care for each other in an open-ended physical world with relative ease and efficiency. Understanding the role of the conditions of life in this disparity can aid in developing more robust, adaptive, and caring artificial agents. Here we define two minimal conditions for physical embodiment inspired by the existentialist phenomenology of Martin Heidegger: being-in-the-world (the agent is a part of the environment) and being-towards-death (unless counteracted, the agent drifts toward terminal states due to the second law of thermodynamics). We propose that from these conditions we can obtain both a homeostatic drive - aimed at maintaining integrity and avoiding death by expending energy to learn and act - and an intrinsic drive to continue to do so in as many ways as possible. Drawing inspiration from Friedrich Nietzsche's existentialist concept of will-to-power, we examine how intrinsic drives to maximize control over future states, e.g., empowerment, allow agents to increase the probability that they will be able to meet their future homeostatic needs, thereby enhancing their capacity to maintain physical integrity. We formalize these concepts within a reinforcement learning framework, which enables us to examine how intrinsically driven embodied agents learning in open-ended multi-agent environments may cultivate the capacities for open-endedness and care.

[AI-14] Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

Quick read: This report addresses how world models for real-world humanoid interaction can predict future visual frames or compact latent states. For the sampling track, the authors adapt the video generation foundation model Wan-2.2 TI2V-5B to video-and-state-conditioned future frame prediction, conditioning on robot states with AdaLN-Zero and post-training with LoRA (Low-Rank Adaptation); for the compression track, they train a spatio-temporal transformer from scratch to predict discrete latent codes. The models achieve 23.0 dB PSNR on the sampling task and a Top-500 CE of 6.6386 on the compression task, taking 1st place in both challenges.

Link: https://arxiv.org/abs/2510.07092
Authors: Riccardo Mereu, Aidan Scannell, Yuxin Hou, Yi Zhao, Aditya Jitta, Antonio Dominguez, Luigi Acerbi, Amos Storkey, Paul Chang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 6 pages, 3 figures, 1X world model challenge technical report

Abstract:World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
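
AdaLN-Zero conditioning, the mechanism used here to inject robot states, has a well-known general form: the conditioning vector produces per-channel shift, scale, and gate, with the projection zero-initialized so the conditioned branch starts as an identity map. The module below is a generic sketch, not the report's exact implementation inside Wan-2.2.

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """Wrap a transformer sublayer with AdaLN-Zero conditioning."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.proj.weight)   # zero-init: block starts as identity
        nn.init.zeros_(self.proj.bias)

    def forward(self, x, cond, block):
        # x: (B, T, dim); cond: (B, cond_dim), e.g. the robot state;
        # block: the wrapped attention or MLP sublayer.
        shift, scale, gate = self.proj(cond).unsqueeze(1).chunk(3, dim=-1)
        return x + gate * block(self.norm(x) * (1 + scale) + shift)
```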

[AI-15] HTMformer: Hybrid Time and Multivariate Transformer for Time Series Forecasting

Quick read: This paper addresses a limitation of existing transformer-based time series forecasters: they overemphasize temporal dependencies, adding computational cost without commensurate gains. The key to the solution is Hybrid Temporal and Multivariate Embeddings (HTME), which pairs a lightweight temporal feature extraction module with a carefully designed multivariate feature extraction module to produce multidimensional embeddings with richer semantics, improving how transformer architectures understand the series while keeping the resulting forecaster, HTMformer, lightweight and accurate.

Link: https://arxiv.org/abs/2510.07084
Authors: Tan Wang, Yun Wei Dong, Tao Zhang, Qi Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Transformer-based methods have achieved impressive results in time series forecasting. However, existing Transformers still exhibit limitations in sequence modeling as they tend to overemphasize temporal dependencies. This incurs additional computational overhead without yielding corresponding performance gains. We find that the performance of Transformers is highly dependent on the embedding method used to learn effective representations. To address this issue, we extract multivariate features to augment the effective information captured in the embedding layer, yielding multidimensional embeddings that convey richer and more meaningful sequence representations. These representations enable Transformer-based forecasters to better understand the series. Specifically, we introduce Hybrid Temporal and Multivariate Embeddings (HTME). The HTME extractor integrates a lightweight temporal feature extraction module with a carefully designed multivariate feature extraction module to provide complementary features, thereby achieving a balance between model complexity and performance. By combining HTME with the Transformer architecture, we present HTMformer, leveraging the enhanced feature extraction capability of the HTME extractor to build a lightweight forecaster. Experiments conducted on eight real-world datasets demonstrate that our approach outperforms existing baselines in both accuracy and efficiency.
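
A hybrid embedding in this spirit can be sketched as two parallel extractors whose outputs are concatenated per time step; the specific conv/linear choices below are assumptions, since the paper's exact extractor design is not given here.

```python
import torch
import torch.nn as nn

class HybridEmbedding(nn.Module):
    """A light temporal convolution captures local dynamics while a
    linear map mixes information across variates; their concatenation
    forms the multidimensional token embedding."""

    def __init__(self, n_vars: int, d_temporal: int, d_multi: int):
        super().__init__()
        self.temporal = nn.Conv1d(n_vars, d_temporal, kernel_size=3, padding=1)
        self.multivar = nn.Linear(n_vars, d_multi)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, n_vars) -> (B, T, d_temporal + d_multi)
        t = self.temporal(x.transpose(1, 2)).transpose(1, 2)
        m = self.multivar(x)
        return torch.cat([t, m], dim=-1)

emb = HybridEmbedding(n_vars=7, d_temporal=48, d_multi=16)
print(emb(torch.randn(8, 96, 7)).shape)  # torch.Size([8, 96, 64])
```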

[AI-16] VRPAgent: LLM-Driven Discovery of Heuristic Operators for Vehicle Routing Problems

Quick read: This paper addresses the complexity of designing high-performing heuristics for vehicle routing problems (VRPs): discovering, without large labeled datasets, heuristics that rival or beat those crafted by human experts or learned by existing methods. The key to the solution is VRPAgent, a framework that embeds LLM-generated, problem-specific operators within a generic metaheuristic and refines them through a novel genetic search, keeping tasks manageable and guaranteeing correctness while still enabling the discovery of powerful new strategies. Across the capacitated VRP, the VRP with time windows, and the prize-collecting VRP, the discovered operators outperform handcrafted methods and recent learning-based approaches while requiring only a single CPU core.

Link: https://arxiv.org/abs/2510.07073
Authors: André Hottung, Federico Berto, Chuanbo Hua, Nayeli Gast Zepeda, Daniel Wetzel, Michael Römer, Haoran Ye, Davide Zago, Michael Poli, Stefano Massaroli, Jinkyoo Park, Kevin Tierney
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Designing high-performing heuristics for vehicle routing problems (VRPs) is a complex task that requires both intuition and deep domain knowledge. Large language model (LLM)-based code generation has recently shown promise across many domains, but it still falls short of producing heuristics that rival those crafted by human experts. In this paper, we propose VRPAgent, a framework that integrates LLM-generated components into a metaheuristic and refines them through a novel genetic search. By using the LLM to generate problem-specific operators, embedded within a generic metaheuristic framework, VRPAgent keeps tasks manageable, guarantees correctness, and still enables the discovery of novel and powerful strategies. Across multiple problems, including the capacitated VRP, the VRP with time windows, and the prize-collecting VRP, our method discovers heuristic operators that outperform handcrafted methods and recent learning-based approaches while requiring only a single CPU core. To our knowledge, VRPAgent is the first LLM-based paradigm to advance the state-of-the-art in VRPs, highlighting a promising future for automated heuristics discovery.
zh
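
The propose-evaluate-select loop described above can be sketched as a small evolutionary driver. The following Python skeleton is a hedged illustration: `llm_propose` and `evaluate` are hypothetical stand-ins for the LLM call and the metaheuristic run, and the population size, parent selection, and scoring scheme are assumptions rather than the paper's configuration.

```python
import random

def evolve_operators(llm_propose, evaluate, generations=10, pop_size=8):
    """Skeleton of an evolve-and-select loop over LLM-generated operators.
    llm_propose(parents) asks an LLM for a new operator (source code string)
    given parent operators; evaluate(op) runs the metaheuristic with the
    operator plugged in and returns a cost (lower is better)."""
    seeds = [llm_propose([]) for _ in range(2)]        # seed population
    population = [(op, evaluate(op)) for op in seeds]
    for _ in range(generations):
        parents = [op for op, _ in
                   random.sample(population, k=min(2, len(population)))]
        child = llm_propose(parents)
        try:
            score = evaluate(child)
        except Exception:
            continue  # operators that crash are discarded, preserving correctness
        population.append((child, score))
        population.sort(key=lambda t: t[1])            # keep the best operators
        population = population[:pop_size]
    return population[0]

# Toy demo with stand-ins (no LLM involved):
best = evolve_operators(
    llm_propose=lambda parents: f"op_{random.randint(0, 999)}",
    evaluate=lambda op: random.random(),
)
print("best operator:", best)
```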

[AI-17] Inductive Learning for Possibilistic Logic Programs Under Stable Models

【速读】: This paper studies inductive reasoning for possibilistic logic programs (poss-programs) under the stable model semantics, i.e., extracting poss-programs from a background program and examples (parts of intended possibilistic stable models). The key to the solution is formally defining induction tasks, investigating their properties, and presenting two algorithms, ilpsm and ilpsmmin, for computing induction solutions; ilpsmmin prunes the search space, and when the inputs are ordinary logic programs its implementation outperforms a major inductive learning system for normal logic programs from stable models, confirming its effectiveness and practicality.

链接: https://arxiv.org/abs/2510.07069
作者: Hongbo Hu,Yisong Wang,Yi Huang,Kewen Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under consideration in Theory and Practice of Logic Programming (TPLP)

点击查看摘要

Abstract:Possibilistic logic programs (poss-programs) under stable models are a major variant of answer set programming (ASP). While its semantics (possibilistic stable models) and properties have been well investigated, the problem of inductive reasoning has not been investigated yet. This paper presents an approach to extracting poss-programs from a background program and examples (parts of intended possibilistic stable models). To this end, the notion of induction tasks is first formally defined, its properties are investigated and two algorithms ilpsm and ilpsmmin for computing induction solutions are presented. An implementation of ilpsmmin is also provided and experimental results show that when inputs are ordinary logic programs, the prototype outperforms a major inductive learning system for normal logic programs from stable models on the datasets that are randomly generated.
zh

[AI-18] Prompt Optimization Across Multiple Agents for Representing Diverse Human Populations

【速读】: The problem this paper tackles is how to use large language models (LLMs) to capture the diversity of human population behavior without large-scale human response data. Traditional approaches rely on a single LLM agent, whose outputs tend to be homogeneous and fail to reflect the plurality of human perspectives and behaviors. The key to the solution is a novel framework that endows each LLM agent with specific behavioral traits via in-context learning on a small set of human demonstrations (task-response pairs), and then selects a representative set of agents from the exponentially large space of candidates using submodular optimization, so that the chosen agents collectively model the diversity of the target population. Experiments in crowdsourcing and education show that the method represents human populations more effectively than baselines and reproduces the behavior patterns and perspectives of the groups being represented.

链接: https://arxiv.org/abs/2510.07064
作者: Manh Hung Nguyen,Sebastian Tschiatschek,Adish Singla
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The difficulty and expense of obtaining large-scale human responses make Large Language Models (LLMs) an attractive alternative and a promising proxy for human behavior. However, prior work shows that LLMs often produce homogeneous outputs that fail to capture the rich diversity of human perspectives and behaviors. Thus, rather than trying to capture this diversity with a single LLM agent, we propose a novel framework to construct a set of agents that collectively capture the diversity of a given human population. Each agent is an LLM whose behavior is steered by conditioning on a small set of human demonstrations (task-response pairs) through in-context learning. The central challenge is therefore to select a representative set of LLM agents from the exponentially large space of possible agents. We tackle this selection problem from the lens of submodular optimization. In particular, we develop methods that offer different trade-offs regarding time complexity and performance guarantees. Extensive experiments in crowdsourcing and educational domains demonstrate that our approach constructs agents that more effectively represent human populations compared to baselines. Moreover, behavioral analyses on new tasks show that these agents reproduce the behavior patterns and perspectives of the students and annotators they are designed to represent.
zh
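
The selection step reduces to maximizing a monotone submodular coverage objective, for which the classic greedy algorithm gives a (1 - 1/e) approximation guarantee. Below is a self-contained greedy sketch; the `coverage` function and the toy agent sets are illustrative assumptions, not the paper's actual objective.

```python
def greedy_select(candidates, coverage, k):
    """Greedy maximization of a monotone submodular objective: at each step,
    add the agent with the largest marginal gain. coverage(S) scores how well
    the set S of agents represents the population (a stand-in objective)."""
    selected = []
    for _ in range(k):
        best, best_gain = None, float("-inf")
        base = coverage(selected)
        for c in candidates:
            if c in selected:
                continue
            gain = coverage(selected + [c]) - base
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
    return selected

# Toy example: each "agent" covers a set of human response patterns.
agents = {"a": {1, 2}, "b": {2, 3, 4}, "c": {5}, "d": {1, 5}}
cov = lambda S: len(set().union(*(agents[s] for s in S))) if S else 0
print(greedy_select(list(agents), cov, k=2))  # ['b', 'd']
```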

[AI-19] Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning

【速读】: This paper addresses the limited performance of large language models (LLMs) on tasks that require up-to-date knowledge or complex computation, where models relying solely on direct inference fall short on factual question answering and mathematical calculation. The key to the solution is Tool-Augmented Policy Optimization (TAPO), a reinforcement learning framework that adapts Dynamic Sampling Policy Optimization (DAPO) so that models can dynamically interleave multi-hop reasoning with on-demand calls to external tools (e.g., search APIs and Python interpreters), yielding more efficient and precise tool-use policies. The authors also build two dedicated datasets for training and evaluation (TAPO-easy-60K and TAPO-hard-18K) and show on Qwen2.5-series models that TAPO markedly improves knowledge-intensive and computation-intensive tasks at comparable parameter scale while avoiding excessive tool calls caused by reward hacking.

链接: https://arxiv.org/abs/2510.07038
作者: Wenxun Wu,Yuanyang Li,Guhan Chen,Linyue Wang,Hongyang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operations. To overcome these limitations, we propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that systematically integrates multi-hop reasoning with adaptive tool-calling capabilities. Our approach employs a modified version of Dynamic Sampling Policy Optimization (DAPO), a recently developed RL paradigm, which we adapt specifically for tool invocation scenarios, enabling models to dynamically interleave complex reasoning with on-demand tool usage (including search APIs and Python interpreters). To support this research, we introduce two new datasets: TAPO-easy-60K and TAPO-hard-18K, specifically designed to train and evaluate both fact-based reasoning and mathematical calculation capabilities. Our experiments on Qwen2.5-3B and Qwen2.5-7B models demonstrate the effectiveness of our approach, with both models achieving state-of-the-art performance on tasks requiring external knowledge and mathematical computation among methods with comparable parameters. Notably, TAPO achieves more efficient tool utilization than baseline methods while preventing excessive calls caused by reward hacking. These results highlight the significant potential of combining advanced reasoning with tool usage to enhance model performance in knowledge-intensive and computationally demanding tasks.
zh

[AI-20] Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration CIKM2025

【速读】: This paper addresses the reliance of existing molecule pre-training methods on paired two-dimensional (2D) and three-dimensional (3D) molecular data to avoid collapsing into a single modality, which is limiting when one modality is missing or costly to generate. The key to the solution is the FlexMol framework, which uses separate subnetworks for 2D and 3D molecular modeling, shares parameters for computational efficiency, and employs a decoder to generate features of the missing modality, enabling a multistage continual learning process in which both modalities contribute collaboratively during training while single-modality inputs remain supported at inference, ensuring robustness.

链接: https://arxiv.org/abs/2510.07035
作者: Tengwei Song,Min Wu,Yuan Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: CIKM 2025

点击查看摘要

Abstract:Molecular representation learning plays a crucial role in advancing applications such as drug discovery and material design. Existing work leverages 2D and 3D modalities of molecular information for pre-training, aiming to capture comprehensive structural and geometric insights. However, these methods require paired 2D and 3D molecular data to train the model effectively and prevent it from collapsing into a single modality, posing limitations in scenarios where a certain modality is unavailable or computationally expensive to generate. To overcome this limitation, we propose FlexMol, a flexible molecule pre-training framework that learns unified molecular representations while supporting single-modality input. Specifically, inspired by the unified structure in vision-language models, our approach employs separate models for 2D and 3D molecular data, leverages parameter sharing to improve computational efficiency, and utilizes a decoder to generate features for the missing modality. This enables a multistage continuous learning process where both modalities contribute collaboratively during training, while ensuring robustness when only one modality is available during inference. Extensive experiments demonstrate that FlexMol achieves superior performance across a wide range of molecular property prediction tasks, and we also empirically demonstrate its effectiveness with incomplete data. Our code and data are available at this https URL.
zh

[AI-21] Federated Unlearning in the Wild: Rethinking Fairness and Data Discrepancy

【速读】: This paper targets two core problems of federated unlearning (FU) in federated learning (FL). First, fairness: exact unlearning methods force all clients to retrain, while approximate methods (e.g., gradient ascent or knowledge distillation) unfairly degrade performance for clients holding only retained data. Second, current FU evaluations generally rest on idealized synthetic data assumptions (IID/non-IID), ignoring real-world heterogeneity and overstating the practical effectiveness of existing methods. The key to the solution is a novel fairness-aware federated unlearning framework, Federated Cross-Client-Constrains Unlearning (FedCCCU), which explicitly models cross-client constraints so that deleted data is effectively forgotten while the performance impact on clients without deleted data is minimized, achieving efficient and fair federated unlearning.

链接: https://arxiv.org/abs/2510.07022
作者: ZiHeng Huang,Di Wu,Jun Bai,Jiale Zhang,Sicong Cao,Ji Zhang,Yingjie Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine unlearning is critical for enforcing data deletion rights like the “right to be forgotten.” As a decentralized paradigm, Federated Learning (FL) also requires unlearning, but realistic implementations face two major challenges. First, fairness in Federated Unlearning (FU) is often overlooked. Exact unlearning methods typically force all clients into costly retraining, even those uninvolved. Approximate approaches, using gradient ascent or distillation, make coarse interventions that can unfairly degrade performance for clients with only retained data. Second, most FU evaluations rely on synthetic data assumptions (IID/non-IID) that ignore real-world heterogeneity. These unrealistic benchmarks obscure the true impact of unlearning and limit the applicability of current methods. We first conduct a comprehensive benchmark of existing FU methods under realistic data heterogeneity and fairness conditions. We then propose a novel, fairness-aware FU approach, Federated Cross-Client-Constrains Unlearning (FedCCCU), to explicitly address both challenges. FedCCCU offers a practical and scalable solution for real-world FU. Experimental results show that existing methods perform poorly in realistic settings, while our approach consistently outperforms them.
zh

[AI-22] The Limits of Goal-Setting Theory in LLM-Driven Assessment

【速读】:该论文旨在解决当前用户在使用大语言模型(Large Language Models, LLMs)进行任务执行时,普遍采用“类人心理模型”(Model H)假设的问题,即认为LLM的行为类似于人类评估者,从而通过细化提示(prompt)来提升输出的一致性。研究基于目标设定理论(goal-setting theory),假设更具体的提示应减少评估结果的变异性。其解决方案的关键在于设计了一个受控实验,利用四个逐步增加细节的提示对ChatGPT进行测试,以量化一致性指标(通过Cohen’s Kappa衡量重复运行下的评分者内一致性)。结果显示,提示具体性增强并未显著改善一致性,反而表明LLM的行为并不符合人类评估者的模式,从而揭示了现有模型在输入处理和鲁棒性方面的局限性,为未来模型开发指明了改进方向。

链接: https://arxiv.org/abs/2510.06997
作者: Mrityunjay Kumar
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted at T4E 2025 for poster

点击查看摘要

Abstract:Many users interact with AI tools like ChatGPT using a mental model that treats the system as human-like, which we call Model H. According to goal-setting theory, increased specificity in goals should reduce performance variance. If Model H holds, then prompting a chatbot with more detailed instructions should lead to more consistent evaluation behavior. This paper tests that assumption through a controlled experiment in which ChatGPT evaluated 29 student submissions using four prompts with increasing specificity. We measured consistency using intra-rater reliability (Cohen’s Kappa) across repeated runs. Contrary to expectations, performance did not improve consistently with increased prompt specificity, and performance variance remained largely unchanged. These findings challenge the assumption that LLMs behave like human evaluators and highlight the need for greater robustness and improved input integration in future model development.
zh
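
The consistency measurement itself is easy to reproduce: intra-rater reliability compares the labels one prompt yields across repeated runs over the same submissions. A minimal scikit-learn sketch follows, using made-up labels; only the metric, not the study's data, is shown.

```python
from sklearn.metrics import cohen_kappa_score

# Intra-rater reliability: compare the grades one prompt produces across two
# repeated runs over the same submissions (toy labels for illustration).
run_1 = ["pass", "fail", "pass", "pass", "fail", "pass"]
run_2 = ["pass", "pass", "pass", "pass", "fail", "fail"]

kappa = cohen_kappa_score(run_1, run_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance
```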

[AI-23] Grouped Differential Attention

【速读】: This paper addresses an inefficiency of the self-attention mechanism at the core of modern Transformers: substantial attention is frequently allocated to redundant or noisy context. Differential Attention mitigates this with subtractive signal/noise attention maps, but its required balanced head allocation constrains representational flexibility and scalability. The key to the solution is Grouped Differential Attention (GDA), which introduces unbalanced head allocation: heads are split into a signal-preserving group and a noise-control group, with more heads devoted to signal extraction and fewer to noise control, the latter stabilized by controlled repetition (akin to grouped-query attention, GQA). This design substantially improves signal fidelity at minimal computational overhead. The principle is further extended to group-differentiated growth, which selectively replicates only the signal-focused heads for efficient capacity expansion; large-scale pretraining and continual training experiments confirm that moderate imbalance ratios markedly improve generalization and stability, offering a practical path toward computation-efficient, scalable Transformer architectures.

链接: https://arxiv.org/abs/2510.06949
作者: Junghwan Lim,Sungmin Lee,Dongseok Kim,Wai Ting Cheung,Beomgyu Kim,Taehwan Kim,Haesol Lee,Junhyeok Lee,Dongpin Oh,Eunhwan Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The self-attention mechanism, while foundational to modern Transformer architectures, suffers from a critical inefficiency: it frequently allocates substantial attention to redundant or noisy context. Differential Attention addressed this by using subtractive attention maps for signal and noise, but its required balanced head allocation imposes rigid constraints on representational flexibility and scalability. To overcome this, we propose Grouped Differential Attention (GDA), a novel approach that introduces unbalanced head allocation between signal-preserving and noise-control groups. GDA significantly enhances signal focus by strategically assigning more heads to signal extraction and fewer to noise-control, stabilizing the latter through controlled repetition (akin to GQA). This design achieves stronger signal fidelity with minimal computational overhead. We further extend this principle to group-differentiated growth, a scalable strategy that selectively replicates only the signal-focused heads, thereby ensuring efficient capacity expansion. Through large-scale pretraining and continual training experiments, we demonstrate that moderate imbalance ratios in GDA yield substantial improvements in generalization and stability compared to symmetric baselines. Our results collectively establish that ratio-aware head allocation and selective expansion offer an effective and practical path toward designing scalable, computation-efficient Transformer architectures.
zh
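
A minimal sketch of the subtractive, group-unbalanced attention idea follows: a large signal group and a small noise-control group produce separate attention maps, the noise maps are repeated across signal heads (GQA-style), and their difference weights the values. Head counts, the scalar `lam`, and the repetition scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def grouped_differential_attention(q_sig, k_sig, q_noise, k_noise, v, lam=0.5):
    """Subtractive attention with unbalanced head groups (a sketch).
    q_sig/k_sig: (B, H_sig, T, d); q_noise/k_noise: (B, H_noise, T, d) with
    H_noise < H_sig dividing it evenly; v: (B, H_sig, T, d)."""
    B, H_sig, T, d = q_sig.shape
    H_noise = q_noise.shape[1]
    scale = d ** -0.5
    a_sig = F.softmax(q_sig @ k_sig.transpose(-2, -1) * scale, dim=-1)
    a_noise = F.softmax(q_noise @ k_noise.transpose(-2, -1) * scale, dim=-1)
    # Repeat the noise-control maps so one noise head serves several signal heads.
    a_noise = a_noise.repeat_interleave(H_sig // H_noise, dim=1)
    return (a_sig - lam * a_noise) @ v  # (B, H_sig, T, d)

out = grouped_differential_attention(
    torch.randn(2, 8, 16, 32), torch.randn(2, 8, 16, 32),
    torch.randn(2, 2, 16, 32), torch.randn(2, 2, 16, 32),
    torch.randn(2, 8, 16, 32),
)
print(out.shape)  # torch.Size([2, 8, 16, 32])
```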

[AI-24] DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning

【速读】: This paper addresses the failure of existing imitation learning methods to model realistic traffic behaviors in multi-agent traffic simulation: behavior cloning degrades under covariate shift, while Generative Adversarial Imitation Learning (GAIL) is unstable in multi-agent settings. The key to the solution is identifying and mitigating "irrelevant interaction misguidance", where the discriminator wrongly penalizes an ego vehicle's realistic behavior because of unrealistic interactions among its neighbors. The proposed Decomposed Multi-agent GAIL (DecompGAIL) explicitly decomposes realism into ego-map and ego-neighbor components, filtering out misleading neighbor-neighbor and neighbor-map interactions, and adds a social PPO objective that augments ego rewards with distance-weighted neighborhood rewards to raise the realism of the overall traffic scene.

链接: https://arxiv.org/abs/2510.06913
作者: Ke Guo,Haochen Liu,Xiaojun Wu,Chen Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Realistic traffic simulation is critical for the development of autonomous driving systems and urban mobility planning, yet existing imitation learning approaches often fail to model realistic traffic behaviors. Behavior cloning suffers from covariate shift, while Generative Adversarial Imitation Learning (GAIL) is notoriously unstable in multi-agent settings. We identify a key source of this instability: irrelevant interaction misguidance, where a discriminator penalizes an ego vehicle’s realistic behavior due to unrealistic interactions among its neighbors. To address this, we propose Decomposed Multi-agent GAIL (DecompGAIL), which explicitly decomposes realism into ego-map and ego-neighbor components, filtering out misleading neighbor-neighbor and neighbor-map interactions. We further introduce a social PPO objective that augments ego rewards with distance-weighted neighborhood rewards, encouraging overall realism across agents. Integrated into a lightweight SMART-based backbone, DecompGAIL achieves state-of-the-art performance on the WOMD Sim Agents 2025 benchmark.
zh

[AI-25] LLM-Assisted Modeling of Semantic Web-Enabled Multi-Agent Systems with AJAN

【速读】: This paper targets two hurdles in defining agent behaviors with RDF/RDFS and SPARQL in the AJAN framework: typos in URIs easily cause serious problems, and writing complex SPARQL queries in large-scale environments involves a steep learning curve. The key to the solution is an Integrated Development Environment (IDE) that lowers the modeling barrier by simplifying RDF/RDFS knowledge representation and SPARQL query handling, while integrating Large Language Models (LLMs) to better support users in engineering AJAN agents, thereby broadening the framework's user base and improving modeling efficiency.

链接: https://arxiv.org/abs/2510.06911
作者: Hacane Hechehouche,Andre Antakli,Matthias Klusch
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:There are many established semantic Web standards for implementing multi-agent driven applications. The AJAN framework allows to engineer multi-agent systems based on these standards. In particular, agent knowledge is represented in RDF/RDFS and OWL, while agent behavior models are defined with Behavior Trees and SPARQL to access and manipulate this knowledge. However, the appropriate definition of RDF/RDFS and SPARQL-based agent behaviors still remains a major hurdle not only for agent modelers in practice. For example, dealing with URIs is very error-prone regarding typos and dealing with complex SPARQL queries in large-scale environments requires a high learning curve. In this paper, we present an integrated development environment to overcome such hurdles of modeling AJAN agents and at the same time to extend the user community for AJAN by the possibility to leverage Large Language Models for agent engineering.
zh

[AI-26] Emotionally Vulnerable Subtype of Internet Gaming Disorder: Measuring and Exploring the Pathology of Problematic Generative AI Use

【速读】: This paper addresses the risk that generative AI (GenAI) use may be over-pathologized, together with the current lack of conceptual clarity around GenAI addiction. The key to the solution is the development and validation of PUGenAIS-9 (Problematic Use of Generative Artificial Intelligence Scale-9 items), grounded in the Internet Gaming Disorder (IGD) framework and validated through confirmatory factor analysis and measurement invariance tests on samples from China and the United States (N = 1,508), establishing a cross-culturally stable nine-item structure. Person-centered (latent profile analysis) and variable-centered (network analysis) approaches further show that problematic GenAI use closely matches the emotionally vulnerable subtype of IGD rather than the competence-based subtype. These findings provide a reliable instrument for identifying problematic GenAI use and motivate reframing digital addiction research with an infrastructure-content-device (ICD) model, keeping it responsive to emerging media while avoiding over-pathologization.

链接: https://arxiv.org/abs/2510.06908
作者: Haocan Sun,Di Wu,Weizi Liu,Guoming Yu,Mike Yao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 27 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Concerns over the potential over-pathologization of generative AI (GenAI) use and the lack of conceptual clarity surrounding GenAI addiction call for empirical tools and theoretical refinement. This study developed and validated the PUGenAIS-9 (Problematic Use of Generative Artificial Intelligence Scale-9 items) and examined whether PUGenAIS reflects addiction-like patterns under the Internet Gaming Disorder (IGD) framework. Using samples from China and the United States (N = 1,508), we conducted confirmatory factor analysis and identified a robust 31-item structure across nine IGD-based dimensions. We then derived the PUGenAIS-9 by selecting the highest-loading items from each dimension and validated its structure in an independent sample (N = 1,426). Measurement invariance tests confirmed its stability across nationality and gender. Person-centered (latent profile analysis) and variable-centered (network analysis) approaches found that PUGenAIS matches the traits of the emotionally vulnerable subtype of IGD, not the competence-based kind. These results support using PUGenAIS-9 to identify problematic GenAI use and show the need to rethink digital addiction with an ICD (infrastructures, content, and device) model. This keeps addiction research responsive to new media while avoiding over-pathologizing.
zh

[AI-27] M3Retrieve: Benchmarking Multimodal Retrieval for Medicine EMNLP

【速读】: This paper addresses the lack of a standard benchmark for evaluating multimodal retrieval models in medicine. With Retrieval-Augmented Generation (RAG) now widely used, strong retrieval models matter more than ever; in healthcare, multimodal retrieval combining text and images benefits downstream tasks such as question answering, cross-modal retrieval, and multimodal summarization, yet the absence of a unified, comprehensive, multi-specialty benchmark has constrained model comparison and improvement. The key contribution is M3Retrieve, a multimodal medical retrieval benchmark spanning 5 domains, 16 medical fields, and 4 tasks, with over 1.2 million text documents and 164K multimodal queries, all collected under approved licenses. As the first systematic, large-scale, multi-task evaluation suite for medical multimodal retrieval, it provides a standardized tool for performance evaluation, model innovation, and the development of reliable medical retrieval systems.

链接: https://arxiv.org/abs/2510.06888
作者: Arkadeep Acharya,Akash Ghosh,Pradeepika Verma,Kitsuchart Pasupa,Sriparna Saha,Priti Singh
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: EMNLP Mains 2025

点击查看摘要

Abstract:With the increasing use of Retrieval-Augmented Generation (RAG), strong retrieval models have become more important than ever. In healthcare, multimodal retrieval models that combine information from both text and images offer major advantages for many downstream tasks such as question answering, cross-modal retrieval, and multimodal summarization, since medical data often includes both formats. However, there is currently no standard benchmark to evaluate how well these models perform in medical settings. To address this gap, we introduce M3Retrieve, a Multimodal Medical Retrieval Benchmark. M3Retrieve spans 5 domains, 16 medical fields, and 4 distinct tasks, with over 1.2 million text documents and 164K multimodal queries, all collected under approved licenses. We evaluate leading multimodal retrieval models on this benchmark to explore the challenges specific to different medical specialities and to understand their impact on retrieval performance. By releasing M3Retrieve, we aim to enable systematic evaluation, foster model innovation, and accelerate research toward building more capable and reliable multimodal retrieval systems for medical applications. The dataset and the baselines code are available in this github page this https URL.
zh

[AI-28] Multi-Dimensional Autoscaling of Stream Processing Services on Edge Devices

【速读】: This paper addresses the difficulty of meeting Service Level Objectives (SLOs) for stream processing services on resource-constrained Edge devices, where existing autoscaling mechanisms handle only resource-level scaling and cannot safeguard performance when multiple services compete. The key to the solution is a Multi-dimensional Autoscaling Platform (MUDAP) that supports fine-grained vertical scaling across both service- and resource-level dimensions, together with a scaling agent based on Regression Analysis of Structural Knowledge (RASK) that learns a continuous regression model of the processing environment to efficiently infer optimal scaling actions. The RASK agent builds an accurate model in just 20 iterations (i.e., observing 200 s of processing) and, by adding elasticity dimensions, sustains the highest request load with 28% fewer SLO violations than the baselines (the Kubernetes VPA and a reinforcement learning agent), markedly improving stability and resource utilization for multiple services at the Edge.

链接: https://arxiv.org/abs/2510.06882
作者: Boris Sedlak,Philipp Raith,Andrea Morichetta,Víctor Casamayor Pujol,Schahram Dustdar
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Edge devices have limited resources, which inevitably leads to situations where stream processing services cannot satisfy their needs. While existing autoscaling mechanisms focus entirely on resource scaling, Edge devices require alternative ways to sustain the Service Level Objectives (SLOs) of competing services. To address these issues, we introduce a Multi-dimensional Autoscaling Platform (MUDAP) that supports fine-grained vertical scaling across both service- and resource-level dimensions. MUDAP supports service-specific scaling tailored to available parameters, e.g., scale data quality or model size for a particular service. To optimize the execution across services, we present a scaling agent based on Regression Analysis of Structural Knowledge (RASK). The RASK agent efficiently explores the solution space and learns a continuous regression model of the processing environment for inferring optimal scaling actions. We compared our approach with two autoscalers, the Kubernetes VPA and a reinforcement learning agent, for scaling up to 9 services on a single Edge device. Our results showed that RASK can infer an accurate regression model in merely 20 iterations (i.e., observe 200s of processing). By increasingly adding elasticity dimensions, RASK sustained the highest request load with 28% less SLO violations, compared to baselines.
zh

[AI-29] MoRE-GNN: Multi-omics Data Integration with a Heterogeneous Graph Autoencoder

【速读】: This paper addresses the challenges that high dimensionality and complex inter-modality relationships pose for integrating multi-omics single-cell data. The key to the solution is MoRE-GNN (Multi-omics Relational Edge Graph Neural Network), a heterogeneous graph autoencoder that combines graph convolution with attention mechanisms to dynamically construct relational graphs directly from the data, effectively capturing biologically meaningful cross-modal associations and enabling accurate downstream cross-modal predictions.

链接: https://arxiv.org/abs/2510.06880
作者: Zhiyu Wang,Sonia Koszut,Pietro Liò,Francesco Ceccarelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of multi-omics single-cell data remains challenging due to high-dimensionality and complex inter-modality relationships. To address this, we introduce MoRE-GNN (Multi-omics Relational Edge Graph Neural Network), a heterogeneous graph autoencoder that combines graph convolution and attention mechanisms to dynamically construct relational graphs directly from data. Evaluations on six publicly available datasets demonstrate that MoRE-GNN captures biologically meaningful relationships and outperforms existing methods, particularly in settings with strong inter-modality correlations. Furthermore, the learned representations allow for accurate downstream cross-modal predictions. While performance may vary with dataset complexity, MoRE-GNN offers an adaptive, scalable and interpretable framework for advancing multi-omics integration.
zh

[AI-30] TGPR: Tree-Guided Policy Refinement for Robust Self-Debugging of LLMs

【速读】: This paper addresses the difficulty large language models (LLMs) face during iterative refinement: the search space of possible refinements is enormous and the exploration-exploitation trade-off is hard to balance. Existing methods rely on predefined heuristics and cannot adapt based on past refinement outcomes, leading to inefficiency. The key to the solution is the Tree-Guided Policy Refinement (TGPR) framework, which combines GRPO with a Thompson-sampling-based tree search that actively explores both successful and failed refinement paths, producing denser training trajectories and learning more adaptive policies, and thereby significantly improving performance on code generation tasks.

链接: https://arxiv.org/abs/2510.06878
作者: Daria Ozerova,Ekaterina Trofimova
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Iterative refinement has been a promising paradigm to enable large language models (LLMs) to resolve difficult reasoning and problem-solving tasks. One of the key challenges, however, is how to effectively search through the enormous search space of possible refinements. Existing methods typically fall back on predefined heuristics, which are troubled by the exploration-exploitation dilemma and cannot adapt based on past refinement outcomes. We introduce Tree-Guided Policy Refinement (TGPR), a novel framework that combines GRPO with a Thompson-Sampling-based tree search. TGPR explores both failed and successful refinement paths actively, with denser training trajectories and more adaptive policies. On HumanEval, MBPP, and APPS benchmarks, our method achieves up to +4.2 percentage points absolute improvement in pass@1 (on MBPP) and up to +12.51 percentage points absolute improvement in pass@10 (on APPS) compared to a competitive GRPO baseline. Apart from debugging code, TGPR focuses on a principled approach to combining learned policies with structured search methods, offering a general framework for enhancing iterative refinement and stateful reasoning in LLMs.
zh
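
The tree-search half of the method can be illustrated in a few lines: each node keeps Beta-posterior counts of refinement successes, selection samples from those posteriors (Thompson sampling) and descends greedily on the samples, and outcomes are backpropagated up the tree. This sketch covers only the selection rule under assumed Beta(1, 1) priors; the GRPO policy update is out of scope here.

```python
import random

class Node:
    """A refinement attempt; wins/losses summarize pass/fail outcomes below it."""
    def __init__(self, code, parent=None):
        self.code, self.parent, self.children = code, parent, []
        self.wins, self.losses = 1, 1  # Beta(1, 1) prior

def select(root):
    """Thompson sampling descent: at each level, sample a Beta success
    probability per child and follow the argmax."""
    node = root
    while node.children:
        node = max(node.children,
                   key=lambda c: random.betavariate(c.wins, c.losses))
    return node

def backpropagate(node, success: bool):
    while node is not None:
        node.wins += int(success)
        node.losses += int(not success)
        node = node.parent

# Toy use: expand a node, run its refinement against tests, record the outcome.
root = Node("draft solution")
root.children = [Node("fix A", root), Node("fix B", root)]
leaf = select(root)
backpropagate(leaf, success=random.random() < 0.5)
print(leaf.code, leaf.wins, leaf.losses)
```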

[AI-31] Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Retrieval

【速读】: This paper addresses the loss of perceptual quality when transmitting images over multi-hop additive white Gaussian noise (AWGN) channels, where accumulated noise degrades classical deep joint source-channel coding (DeepJSCC). The key to the solution is incorporating a pre-trained deep hash distillation (DHD) module and training to jointly minimize the mean squared error (MSE) and the cosine distance between the DHD hashes of the source and reconstructed images, enforcing semantic-level alignment and thereby markedly improving perceptual reconstruction quality, as confirmed by the LPIPS metric.

链接: https://arxiv.org/abs/2510.06868
作者: Didrik Bergström,Deniz Gündüz,Onur Günlü
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We consider image transmission via deep joint source-channel coding (DeepJSCC) over multi-hop additive white Gaussian noise (AWGN) channels by training a DeepJSCC encoder-decoder pair with a pre-trained deep hash distillation (DHD) module to semantically cluster images, facilitating security-oriented applications through enhanced semantic consistency and improving the perceptual reconstruction quality. We train the DeepJSCC module to both reduce mean square error (MSE) and minimize cosine distance between DHD hashes of source and reconstructed images. Significantly improved perceptual quality as a result of semantic alignment is illustrated for different multi-hop settings, for which classical DeepJSCC may suffer from noise accumulation, measured by the learned perceptual image patch similarity (LPIPS) metric.
zh
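
The training objective described above is straightforward to write down: a pixel-level MSE term plus the cosine distance between DHD hashes of the source and reconstruction. A PyTorch sketch with a stand-in hash network follows; the weighting `alpha` and the toy `dhd` module are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def semantic_jscc_loss(x, x_hat, dhd, alpha=0.1):
    """Pixel-level MSE plus cosine distance between hashes of source and
    reconstruction. dhd is a frozen pre-trained hash network; alpha is an
    assumed balance weight."""
    mse = F.mse_loss(x_hat, x)
    with torch.no_grad():
        h_src = dhd(x)           # hash of the clean source image
    h_rec = dhd(x_hat)           # hash of the reconstruction (gradients flow)
    cos_dist = 1.0 - F.cosine_similarity(h_src, h_rec, dim=-1).mean()
    return mse + alpha * cos_dist

# Stand-in hash network for demonstration purposes only.
dhd = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
x = torch.rand(4, 3, 32, 32)
x_hat = x + 0.05 * torch.randn_like(x)
print(semantic_jscc_loss(x, x_hat, dhd).item())
```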

[AI-32] Towards Generalization of Graph Neural Networks for AC Optimal Power Flow

【速读】: This paper addresses the prohibitive cost of AC Optimal Power Flow (ACOPF) computation for large-scale power systems, where traditional solvers are too slow on complex grids. For scalability across grid sizes and adaptability to topology changes, the key is a Hybrid Heterogeneous Message Passing Neural Network (HH-MPNN) that models buses, generators, loads, shunts, transmission lines, and transformers as distinct node or edge types, combined with a scalable Transformer that handles long-range dependencies. Trained only on default topologies, it generalizes zero-shot to thousands of unseen topologies with an optimality gap below 3%, while achieving 1,000x to 10,000x speedups over interior point solvers.

链接: https://arxiv.org/abs/2510.06860
作者: Olayiwola Arowolo,Jochen L. Cremer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Pre-print has been submitted for review

点击查看摘要

Abstract:AC Optimal Power Flow (ACOPF) is computationally expensive for large-scale power systems, with conventional solvers requiring prohibitive solution times. Machine learning approaches offer computational speedups but struggle with scalability and topology adaptability without expensive retraining. To enable scalability across grid sizes and adaptability to topology changes, we propose a Hybrid Heterogeneous Message Passing Neural Network (HH-MPNN). HH-MPNN models buses, generators, loads, shunts, transmission lines and transformers as distinct node or edge types, combined with a scalable transformer model for handling long-range dependencies. On grids from 14 to 2,000 buses, HH-MPNN achieves less than 1% optimality gap on default topologies. Applied zero-shot to thousands of unseen topologies, HH-MPNN achieves less than 3% optimality gap despite training only on default topologies. Pre-training on smaller grids also improves results on a larger grid. Computational speedups reach 1,000x to 10,000x compared to interior point solvers. These results advance practical, generalizable machine learning for real-time power system operations.
zh

[AI-33] Autoformalizer with Tool Feedback

【速读】: This paper addresses the difficulty of training models for Automated Theorem Proving (ATP) given the scarcity of high-quality formal data, and in particular the difficulty existing formalizer models have in jointly guaranteeing the syntactic validity and semantic consistency of generated statements. The key to the solution is the Autoformalizer with Tool Feedback (ATF), whose core innovation is embedding the Lean 4 compiler as a syntax-correction tool and a multi-LLMs-as-judge scheme as a semantic-consistency checker into the formalization pipeline, so that the model adaptively revises its output based on tool feedback, substantially improving both the syntactic validity and semantic soundness of formalization results.

链接: https://arxiv.org/abs/2510.06857
作者: Qi Guo,Jianing Wang,Jianfei Zhang,Deyang Kong,Xiangzhou Huang,Xiangyu Xi,Wei Wang,Jingang Wang,Xunliang Cai,Shikun Zhang,Wei Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Efforts in recent work shift from directly prompting large language models to training an end-to-end formalizer model from scratch, achieving remarkable advancements. However, existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency. To address this issue, we propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process. By integrating Lean 4 compilers for syntax corrections and employing a multi-LLMs-as-judge approach for consistency validation, the model is able to adaptively refine generated statements according to the tool feedback, enhancing both syntactic validity and semantic consistency. The training of ATF involves a cold-start phase on synthetic tool-calling data, an expert iteration phase to improve formalization capabilities, and Direct Preference Optimization to alleviate ineffective revisions. Experimental results show that ATF markedly outperforms a range of baseline formalizer models, with its superior performance further validated by human evaluations. Subsequent analysis reveals that ATF demonstrates excellent inference scaling properties. Moreover, we open-source Numina-ATF, a dataset containing 750K synthetic formal statements to facilitate advancements in autoformalization and ATP research.
zh

[AI-34] Enhancing Bankruptcy Prediction of Banks through Advanced Machine Learning Techniques: An Innovative Approach and Analysis

【速读】: This paper addresses the low predictive accuracy of traditional statistical models for bank bankruptcy, which stems from rigid or sometimes irrelevant assumptions. The key to the solution is applying machine learning methods (logistic regression, random forests, and support vector machines) to real historical financial data to build more accurate bankruptcy prediction models that help identify systemic risk in commercial and rural banks. The results show that random forests (RF) reach 90% prediction accuracy on commercial bank data and that all three machine learning methods effectively forecast rural bank bankruptcy trends, providing technical support for policies that reduce the costs of bankruptcy.

链接: https://arxiv.org/abs/2510.06852
作者: Zuherman Rustam,Sri Hartini,Sardar M.N. Islam,Fevi Novkaniza,Fiftitah R. Aszhari,Muhammad Rifqi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context: Financial system stability is determined by the condition of the banking system. A bank failure can destroy the stability of the financial system, as banks are subject to systemic risk, affecting not only individual banks but also segments or the entire financial system. Calculating the probability of a bank going bankrupt is one way to ensure the banking system is safe and sound. Existing literature and limitations: Statistical models, such as Altman’s Z-Score, are one of the common techniques for developing a bankruptcy prediction model. However, statistical methods rely on rigid and sometimes irrelevant assumptions, which can result in low forecast accuracy. New approaches are necessary. Objective of the research: Bankruptcy models are developed using machine learning techniques, such as logistic regression (LR), random forest (RF), and support vector machines (SVM). According to several studies, machine learning is also more accurate and effective than statistical methods for categorising and forecasting banking risk management. Present Research: The commercial bank data are derived from the annual financial statements of 44 active banks and 21 bankrupt banks in Turkey from 1994 to 2004, and the rural bank data are derived from the quarterly financial reports of 43 active and 43 bankrupt rural banks in Indonesia between 2013 and 2019. Five rural banks in Indonesia have also been selected to demonstrate the feasibility of analysing bank bankruptcy trends. Findings and implications: The results of the research experiments show that RF can forecast data from commercial banks with a 90% accuracy rate. Furthermore, the three machine learning methods proposed accurately predict the likelihood of rural bank bankruptcy. Contribution and Conclusion: The proposed innovative machine learning approach help to implement policies that reduce the costs of bankruptcy.
zh
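
The modeling pipeline itself is standard supervised classification. A hedged scikit-learn sketch comparing the three model families on synthetic stand-in features follows; the Turkish and Indonesian bank datasets are not reproduced here, and all hyperparameters are illustrative defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for financial-ratio features; labels: 1 = bankrupt, 0 = active.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.6, 0.4],
                           random_state=0)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```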

[AI-35] CNN-TFT explained by SHAP with multi-head attention weights for time series forecasting

【速读】: This paper addresses how to effectively combine local feature extraction with long-range dependency modeling in multivariate time series forecasting. Convolutional neural networks (CNNs) excel at local patterns and translational invariances, while Transformer architectures model long-range dependencies efficiently via self-attention, but each is limited on its own. The key to the solution is the hybrid CNN-TFT-SHAP-MHAW architecture: a hierarchy of one-dimensional convolutional layers first distills salient local patterns from the raw input sequences while suppressing noise and reducing dimensionality; the resulting feature maps are fed into a Temporal Fusion Transformer (TFT), whose multi-head attention captures both short- and long-term dependencies and adaptively weighs relevant covariates; finally, Shapley additive explanations with multi-head attention weights (SHAP-MHAW) provide model interpretability. Experiments on a hydroelectric natural flow dataset show the architecture clearly outperforms established deep learning models, with a mean absolute percentage error of up to 2.2%.

链接: https://arxiv.org/abs/2510.06840
作者: Stefano F. Stefenon,João P. Matos-Carvalho,Valderi R. Q. Leithardt,Kin-Choong Yow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) and transformer architectures offer strengths for modeling temporal data: CNNs excel at capturing local patterns and translational invariances, while transformers effectively model long-range dependencies via self-attention. This paper proposes a hybrid architecture integrating convolutional feature extraction with a temporal fusion transformer (TFT) backbone to enhance multivariate time series forecasting. The CNN module first applies a hierarchy of one-dimensional convolutional layers to distill salient local patterns from raw input sequences, reducing noise and dimensionality. The resulting feature maps are then fed into the TFT, which applies multi-head attention to capture both short- and long-term dependencies and to weigh relevant covariates adaptively. We evaluate the CNN-TFT on a hydroelectric natural flow time series dataset. Experimental results demonstrate that CNN-TFT outperforms well-established deep learning models, with a mean absolute percentage error of up to 2.2%. The explainability of the model is obtained by a proposed Shapley additive explanations with multi-head attention weights (SHAP-MHAW). Our novel architecture, named CNN-TFT-SHAP-MHAW, is promising for applications requiring high-fidelity, multivariate time series forecasts, being available for future analysis at this https URL.
zh

[AI-36] Recurrence-Complete Frame-based Action Models

【速读】: The problem this paper examines is a theoretical limitation of current attention-based large language models on long-running agentic tasks: purely attention-based architectures cannot effectively aggregate inputs over long time spans, limiting their use in complex agentic systems such as software engineering agents. The key to the solution is a recurrence-complete architecture whose recurrent structure ensures that inputs of arbitrary length are aggregated correctly. Trained on GitHub-derived action sequences, the loss follows a power law in the trained sequence length at fixed parameter count, and longer-sequence training always amortizes its linearly increasing wall-time cost, demonstrating the necessity and effectiveness of recurrence for modeling long-horizon tasks.

链接: https://arxiv.org/abs/2510.06828
作者: Michael Keiblinger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, attention-like mechanisms have been used to great success in the space of large language models, unlocking scaling potential to a previously unthinkable extent. “Attention Is All You Need” famously claims RNN cells are not needed in conjunction with attention. We challenge this view. In this paper, we point to existing proofs that architectures with fully parallelizable forward or backward passes cannot represent classes of problems specifically interesting for long-running agentic tasks. We further conjecture a critical time t beyond which non-recurrence-complete models fail to aggregate inputs correctly, with concrete implications for agentic systems (e.g., software engineering agents). To address this, we introduce a recurrence-complete architecture and train it on GitHub-derived action sequences. Loss follows a power law in the trained sequence length while the parameter count remains fixed. Moreover, longer-sequence training always amortizes its linearly increasing wall-time cost, yielding lower loss as a function of wall time.
zh

[AI-37] Modeling COVID-19 Dynamics in German States Using Physics-Informed Neural Networks

【速读】: This paper addresses how to estimate region-specific transmission parameters and the time-varying reproduction number (R_t) from noisy observational data in epidemic modeling, especially given the complex spatio-temporal heterogeneity of a long-running pandemic. The key to the solution is Physics-Informed Neural Networks (PINNs), which embed the differential-equation constraints of the SIR (Susceptible-Infectious-Recovered) model into network training, so that state-specific transmission and recovery rates can be inverted directly from infection data across the German federal states, enabling fine-grained spatio-temporal tracking of R_t without requiring precise initial conditions.

链接: https://arxiv.org/abs/2510.06776
作者: Phillip Rothenbeck,Sai Karthikeya Vemuri,Niklas Penzel,Joachim Denzler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures, 2 tables

点击查看摘要

Abstract:The COVID-19 pandemic has highlighted the need for quantitative modeling and analysis to understand real-world disease dynamics. In particular, post hoc analyses using compartmental models offer valuable insights into the effectiveness of public health interventions, such as vaccination strategies and containment policies. However, such compartmental models like SIR (Susceptible-Infectious-Recovered) often face limitations in directly incorporating noisy observational data. In this work, we employ Physics-Informed Neural Networks (PINNs) to solve the inverse problem of the SIR model using infection data from the Robert Koch Institute (RKI). Our main contribution is a fine-grained, spatio-temporal analysis of COVID-19 dynamics across all German federal states over a three-year period. We estimate state-specific transmission and recovery parameters and time-varying reproduction number (R_t) to track the pandemic progression. The results highlight strong variations in transmission behavior across regions, revealing correlations with vaccination uptake and temporal patterns associated with major pandemic phases. Our findings demonstrate the utility of PINNs in localized, long-term epidemiological modeling.
zh
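
The core of the inverse-problem setup is the physics residual: a network maps time to (S, I, R), automatic differentiation supplies the time derivatives, and trainable beta and gamma are fit so the SIR equations hold. A minimal PyTorch sketch under assumed network sizes follows; the data-fitting term and normalization details are omitted.

```python
import torch
import torch.nn as nn

# Minimal PINN residual for the SIR model: the network maps time t to
# (S, I, R) fractions, and the loss penalizes deviation from the ODEs
#   dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I
# with beta and gamma as trainable parameters (the "inverse problem").
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64),
                    nn.Tanh(), nn.Linear(64, 3), nn.Softplus())
log_beta = torch.nn.Parameter(torch.tensor(0.0))
log_gamma = torch.nn.Parameter(torch.tensor(0.0))

def physics_residual(t):
    t = t.requires_grad_(True)
    s, i, r = net(t).unbind(dim=-1)
    grad = lambda y: torch.autograd.grad(y.sum(), t,
                                         create_graph=True)[0].squeeze(-1)
    beta, gamma = log_beta.exp(), log_gamma.exp()
    res_s = grad(s) + beta * s * i
    res_i = grad(i) - beta * s * i + gamma * i
    res_r = grad(r) - gamma * i
    return (res_s**2 + res_i**2 + res_r**2).mean()

# In training, add a data term (MSE to reported infections) to this residual.
t = torch.linspace(0, 1, 50).unsqueeze(-1)
print(physics_residual(t).item())
```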

[AI-38] Verifying Memoryless Sequential Decision-making of Large Language Models

【速读】: This paper addresses the formal, automated, and rigorous verification of whether a large language model (LLM) policy satisfies given safety properties in memoryless sequential decision-making tasks. The key to the solution is: given a Markov decision process (MDP), incrementally build a formal model covering only the states reachable under the LLM's chosen actions; encode each state as a natural language prompt and parse the LLM output into an action; and use the Storm model checker to verify whether the resulting model satisfies safety specifications expressed in Probabilistic Computation Tree Logic (PCTL), yielding efficient and precise formal verification of LLM policies.

链接: https://arxiv.org/abs/2510.06756
作者: Dennis Gross,Helge Spieker,Arnaud Gotlieb
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a tool for rigorous and automated verification of large language model (LLM)- based policies in memoryless sequential decision-making tasks. Given a Markov decision process (MDP) representing the sequential decision-making task, an LLM policy, and a safety requirement expressed as a PCTL formula, our approach incrementally constructs only the reachable portion of the MDP guided by the LLM’s chosen actions. Each state is encoded as a natural language prompt, the LLM’s response is parsed into an action, and reachable successor states by the policy are expanded. The resulting formal model is checked with Storm to determine whether the policy satisfies the specified safety property. In experiments on standard grid world benchmarks, we show that open source LLMs accessed via Ollama can be verified when deterministically seeded, but generally underperform deep reinforcement learning baselines. Our tool natively integrates with Ollama and supports PRISM-specified tasks, enabling continuous benchmarking in user-specified sequential decision-making tasks and laying a practical foundation for formally verifying increasingly capable LLMs.
zh

[AI-39] MultiCNKG: Integrating Cognitive Neuroscience, Gene and Disease Knowledge Graphs Using Large Language Models

【速读】: This paper addresses the limitations of traditional machine learning methods in capturing the complex semantic associations among genes, diseases, and cognitive processes, aiming to strengthen the integration capability of knowledge graphs (KGs) across biomedicine and cognitive science. The key to the solution is the MultiCNKG framework, which fuses three core knowledge sources: the Cognitive Neuroscience Knowledge Graph (CNKG), the Gene Ontology (GO), and the Disease Ontology (DO). Leveraging generative AI such as GPT-4 for entity alignment, semantic similarity computation, and graph augmentation, it builds a unified multi-layer knowledge graph of 6.9K nodes and 11.3K edges that interconnects knowledge from molecular mechanisms to the behavioral level.

链接: https://arxiv.org/abs/2510.06742
作者: Ali Sarabadani,Kheirolah Rahsepar Fard
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The advent of large language models (LLMs) has revolutionized the integration of knowledge graphs (KGs) in biomedical and cognitive sciences, overcoming limitations in traditional machine learning methods for capturing intricate semantic links among genes, diseases, and cognitive processes. We introduce MultiCNKG, an innovative framework that merges three key knowledge sources: the Cognitive Neuroscience Knowledge Graph (CNKG) with 2.9K nodes and 4.3K edges across 9 node types and 20 edge types; Gene Ontology (GO) featuring 43K nodes and 75K edges in 3 node types and 4 edge types; and Disease Ontology (DO) comprising 11.2K nodes and 8.8K edges with 1 node type and 2 edge types. Leveraging LLMs like GPT-4, we conduct entity alignment, semantic similarity computation, and graph augmentation to create a cohesive KG that interconnects genetic mechanisms, neurological disorders, and cognitive functions. The resulting MultiCNKG encompasses 6.9K nodes across 5 types (e.g., Genes, Diseases, Cognitive Processes) and 11.3K edges spanning 7 types (e.g., Causes, Associated with, Regulates), facilitating a multi-layered view from molecular to behavioral domains. Assessments using metrics such as precision (85.20%), recall (87.30%), coverage (92.18%), graph consistency (82.50%), novelty detection (40.28%), and expert validation (89.50%) affirm its robustness and coherence. Link prediction evaluations with models like TransE (MR: 391, MRR: 0.411) and RotatE (MR: 263, MRR: 0.395) show competitive performance against benchmarks like FB15k-237 and WN18RR. This KG advances applications in personalized medicine, cognitive disorder diagnostics, and hypothesis formulation in cognitive neuroscience.
zh

[AI-40] LLM Company Policies and Policy Implications in Software Organizations

【速读】: This paper addresses the risks software organizations face when adopting large language model (LLM) chatbots, and in particular how to formulate effective policies that ensure safe integration. The key to the solution is a systematic analysis of how 11 companies create such policies and of the factors that shape them, yielding actionable guidance for managers seeking to integrate chatbots safely into development workflows.

链接: https://arxiv.org/abs/2510.06718
作者: Ranim Khojah,Mazen Mohamad,Linda Erlenhov,Francisco Gomes de Oliveira Neto,Philipp Leitner
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE Software Special Issue on AIware in the Foundation Models Era

点击查看摘要

Abstract:The risks associated with adopting large language model (LLM) chatbots in software organizations highlight the need for clear policies. We examine how 11 companies create these policies and the factors that influence them, aiming to help managers safely integrate chatbots into development workflows.
zh

[AI-41] Dual Goal Representations

【速读】: This paper addresses a hard problem in learning goal representations for goal-conditioned reinforcement learning (GCRL): building representations that capture the intrinsic dynamics of the environment while filtering out exogenous noise. The key to the solution is the proposed dual goal representation, which characterizes a state through its temporal distances to all other states, i.e., it encodes a state through its relations to every other state as measured by temporal distance. This makes the representation depend only on the environment's intrinsic dynamics, invariant to the original state parameterization, provably sufficient to recover an optimal goal-reaching policy, and robust to noise. The method plugs into any existing GCRL algorithm, and experiments on the OGBench task suite show consistent improvements in offline goal-reaching performance across 20 state- and pixel-based tasks.

链接: https://arxiv.org/abs/2510.06714
作者: Seohong Park,Deepinder Mann,Sergey Levine
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we introduce dual goal representations for goal-conditioned reinforcement learning (GCRL). A dual goal representation characterizes a state by “the set of temporal distances from all other states”; in other words, it encodes a state through its relations to every other state, measured by temporal distance. This representation provides several appealing theoretical properties. First, it depends only on the intrinsic dynamics of the environment and is invariant to the original state representation. Second, it contains provably sufficient information to recover an optimal goal-reaching policy, while being able to filter out exogenous noise. Based on this concept, we develop a practical goal representation learning method that can be combined with any existing GCRL algorithm. Through diverse experiments on the OGBench task suite, we empirically show that dual goal representations consistently improve offline goal-reaching performance across 20 state- and pixel-based tasks.
zh
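
In a small deterministic MDP, the representation can be enumerated directly: characterize a state by its vector of temporal (shortest-path) distances involving every other state. A toy BFS sketch of the concept follows; the paper learns these quantities rather than enumerating them, and the example graph is an assumption.

```python
from collections import deque

def dual_goal_representation(adjacency, state):
    """Represent `state` by its vector of shortest-path (temporal) distances
    to every other state, computed with BFS over the transition graph.
    A toy deterministic analogue of the learned representation."""
    dist = {state: 0}
    queue = deque([state])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [dist.get(s, float("inf")) for s in sorted(adjacency)]

# 4-state chain: 0 <-> 1 <-> 2 <-> 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(dual_goal_representation(adj, 1))  # [1, 0, 1, 2]
```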

[AI-42] Inefficiencies of Meta Agents for Agent Design

【速读】:该论文旨在解决自动化设计智能体系统(agentic systems)过程中存在的三大核心问题:一是元智能体(meta-agent)如何有效学习并迭代优化新架构,二是设计出的智能体在行为多样性上的不足限制了其协同潜力,三是自动化设计是否具备经济可行性。解决方案的关键在于:首先,采用进化式方法替代简单扩展上下文的历史设计,显著提升学习效率;其次,通过引入行为多样性评估机制,推动多智能体协同应用;最后,明确指出仅在特定数据集上(如两个案例),当部署规模超过15,000样本时,自动化设计才具备成本优势,从而为自动化设计的适用场景提供量化决策依据。

链接: https://arxiv.org/abs/2510.06711
作者: Batu El,Mert Yuksekgonul,James Zou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent works began to automate the design of agentic systems using meta-agents that propose and iteratively refine new agent architectures. In this paper, we examine three key challenges in a common class of meta-agents. First, we investigate how a meta-agent learns across iterations and find that simply expanding the context with all previous agents, as proposed by previous works, performs worse than ignoring prior designs entirely. We show that the performance improves with an evolutionary approach. Second, although the meta-agent designs multiple agents during training, it typically commits to a single agent at test time. We find that the designed agents have low behavioral diversity, limiting the potential for their complementary use. Third, we assess when automated design is economically viable. We find that only in a few cases–specifically, two datasets–the overall cost of designing and deploying the agents is lower than that of human-designed agents when deployed on over 15,000 examples. In contrast, the performance gains for other datasets do not justify the design cost, regardless of scale.
zh

[AI-43] AISysRev - LLM-based Tool for Title-abstract Screening

【速读】: This paper addresses the enormous workload of the screening (study selection) phase of systematic reviews, where manually assessing huge numbers of titles and abstracts is slow and fatiguing. The key to the solution is AiSysRev, an LLM-based screening tool deployed as a web application in a Docker container, which supports zero-shot and few-shot screening with multiple LLMs via OpenRouter and provides interfaces for manual review guided by the LLM results. A trial study with 137 papers shows that papers fall into four categories (Easy Includes, Easy Excludes, Boundary Includes, and Boundary Excludes): LLMs reliably handle the easy cases but remain error-prone on boundary cases, which still require human judgment. Used this way, the tool can substantially reduce the burden of screening large bodies of literature and improve the efficiency and feasibility of systematic reviews.

链接: https://arxiv.org/abs/2510.06708
作者: Aleksi Huotala,Miikka Kuutila,Olli-Pekka Turtio,Mika Mäntylä
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 4 pages

点击查看摘要

Abstract:Systematic reviews are a standard practice for summarizing the state of evidence in software engineering. Conducting systematic reviews is laborious, especially during the screening or study selection phase, where the number of papers can be overwhelming. During this phase, papers are assessed against inclusion and exclusion criteria based on their titles and abstracts. Recent research has demonstrated that large language models (LLMs) can perform title-abstract screening at a level comparable to that of a master’s student. While LLMs cannot be fully trusted, they can help, for example, in Rapid Reviews, which try to expedite the review process. Building on recent research, we developed AiSysRev, an LLM-based screening tool implemented as a web application running in a Docker container. The tool accepts a CSV file containing paper titles and abstracts. Users specify inclusion and exclusion criteria. One can use multiple LLMs for screening via OpenRouter. AiSysRev supports both zero-shot and few-shot screening, and also allows for manual screening through interfaces that display LLM results as guidance for human reviewers. We conducted a trial study with 137 papers using the tool. Our findings indicate that papers can be classified into four categories: Easy Includes, Easy Excludes, Boundary Includes, and Boundary Excludes. The Boundary cases, where LLMs are prone to errors, highlight the need for human intervention. While LLMs do not replace human judgment in systematic reviews, they can significantly reduce the burden of assessing large volumes of scientific literature. Video: this https URL Tool: this https URL
zh
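
The screening loop itself is simple to outline: format each title-abstract pair against the criteria, ask the model for a one-word verdict, and route anything else to a human as a boundary case. A hedged Python sketch follows; the prompt wording and the `llm` callable are placeholders, not the tool's actual prompts or its OpenRouter client.

```python
import csv

PROMPT = """You are screening papers for a systematic review.
Inclusion criteria: {include}
Exclusion criteria: {exclude}
Title: {title}
Abstract: {abstract}
Answer with exactly one word: INCLUDE, EXCLUDE, or UNSURE."""

def screen(rows, include, exclude, llm):
    """Zero-shot title-abstract screening loop in the spirit of the tool above.
    llm(prompt) -> str is a placeholder for whatever chat-completion client
    is in use; UNSURE rows are the boundary cases handed to a human reviewer."""
    decisions = []
    for row in rows:
        prompt = PROMPT.format(include=include, exclude=exclude,
                               title=row["title"], abstract=row["abstract"])
        answer = llm(prompt).strip().upper()
        decisions.append(answer if answer in {"INCLUDE", "EXCLUDE"} else "UNSURE")
    return decisions

# Demo with a stub model; in practice, read rows from the uploaded CSV, e.g.
# rows = list(csv.DictReader(open("papers.csv")))
rows = [{"title": "LLMs for code review", "abstract": "We study..."}]
print(screen(rows, "LLM-based SE tools", "non-English papers",
             llm=lambda p: "INCLUDE"))
```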

[AI-44] Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support EMNLP2025

【速读】: This paper addresses the difficulty of continuously improving deployed LLM-based customer support systems: traditional offline training relies on batch-annotated data, stretching iteration cycles to months and making it hard to keep pace with changing business needs. The key to the solution is the proposed Agent-in-the-Loop (AITL) framework, which embeds four types of live feedback signals (pairwise response preferences, agent adoption and rationales, knowledge relevance checks, and identification of missing knowledge) directly into real customer support operations, forming a continuous data flywheel that shortens model update cycles from months to weeks while significantly improving retrieval accuracy, generation quality, and agent adoption rates.

链接: https://arxiv.org/abs/2510.06674
作者: Cen (Mia) Zhao,Tiantian Zhang,Hanchen Su,Yufeng (Wayne) Zhang,Shaowei Su,Mingzhi Xu,Yu (Elaine) Liu,Wei Han,Jeremy Werner,Claire Na Cheng,Yashar Mehdad
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Industry Track submission (Paper #305). Preprint. Main text within the 7-page industry limit (references/appendices excluded). Contains multiple figures and tables

点击查看摘要

Abstract:We introduce an Agent-in-the-Loop (AITL) framework that implements a continuous data flywheel for iteratively improving an LLM-based customer support system. Unlike standard offline approaches that rely on batch annotations, AITL integrates four key types of annotations directly into live customer operations: (1) pairwise response preferences, (2) agent adoption and rationales, (3) knowledge relevance checks, and (4) identification of missing knowledge. These feedback signals seamlessly feed back into models’ updates, reducing retraining cycles from months to weeks. Our production pilot involving US-based customer support agents demonstrated significant improvements in retrieval accuracy (+11.7% recall@75, +14.8% precision@8), generation quality (+8.4% helpfulness) and agent adoption rates (+4.5%). These results underscore the effectiveness of embedding human feedback loops directly into operational workflows to continuously refine LLM-based customer support system.
zh

[AI-45] Delay-Independent Safe Control with Neural Networks: Positive Lur'e Certificates for Risk-Aware Autonomy

【速读】: This paper addresses safety certification for autonomous learning-enabled control systems under realistic risks such as state/input delays and interval matrix uncertainty. The key to the solution is modeling the neural network (NN) controller with local sector bounds and exploiting positivity structure to derive linear, delay-independent stability certificates that guarantee local exponential stability across admissible uncertainties. Compared with SDP-based IQC verification, the method is computationally far cheaper and can certify stability regimes the latter cannot, providing scalable safety guarantees.

链接: https://arxiv.org/abs/2510.06661
作者: Hamidreza Montazeri Hedesh,Milad Siami
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: Submitted to 2026 American Control Conference (ACC), New Orleans, LA

点击查看摘要

Abstract:We present a risk-aware safety certification method for autonomous, learning enabled control systems. Focusing on two realistic risks, state/input delays and interval matrix uncertainty, we model the neural network (NN) controller with local sector bounds and exploit positivity structure to derive linear, delay-independent certificates that guarantee local exponential stability across admissible uncertainties. To benchmark performance, we adopt and implement a state-of-the-art IQC NN verification pipeline. On representative cases, our positivity-based tests run orders of magnitude faster than SDP-based IQC while certifying regimes the latter cannot, providing scalable safety guarantees that complement risk-aware control.
zh

[AI-46] Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions

【速读】: This paper addresses the absence of backpropagation-free local learning mechanisms in reinforcement learning (RL), which matters where traditional gradient-based methods are impractical or biologically implausible. The key to the solution is Action-conditioned Root mean squared Q-Functions (ARQ), a new value estimation method that borrows the Forward-Forward (FF) idea of using layer activity statistics as a "goodness" function and adds action conditioning to enable local temporal-difference (TD) learning. Despite its simplicity and biological grounding, ARQ significantly outperforms state-of-the-art backprop-free local RL methods on the MinAtar and DeepMind Control Suite benchmarks and surpasses backprop-trained algorithms on most tasks.

链接: https://arxiv.org/abs/2510.06649
作者: Frank Wu,Mengye Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:The Forward-Forward (FF) Algorithm is a recently proposed learning procedure for neural networks that employs two forward passes instead of the traditional forward and backward passes used in backpropagation. However, FF remains largely confined to supervised settings, leaving a gap at domains where learning signals can be yielded more naturally such as RL. In this work, inspired by FF’s goodness function using layer activity statistics, we introduce Action-conditioned Root mean squared Q-Functions (ARQ), a novel value estimation method that applies a goodness function and action conditioning for local RL using temporal difference learning. Despite its simplicity and biological grounding, our approach achieves superior performance compared to state-of-the-art local backprop-free RL methods in the MinAtar and the DeepMind Control Suite benchmarks, while also outperforming algorithms trained with backpropagation on most tasks. Code can be found at this https URL.
zh

[AI-47] Distilling Lightweight Language Models for C/C++ Vulnerabilities

【速读】:该论文旨在解决现代软件系统日益复杂的背景下,安全漏洞频发导致严重数据泄露和经济损失的问题,核心挑战在于如何实现高效且精准的代码漏洞检测。解决方案的关键在于提出FineSec框架,通过知识蒸馏(knowledge distillation)技术将大型教师模型(teacher models)的知识迁移至轻量级学生模型(student models),在保持高检测精度的同时显著降低计算开销;同时,FineSec将数据准备、训练、评估与持续学习整合为单一任务工作流,从而实现对C/C++代码库中复杂漏洞和逻辑缺陷的有效识别,具备实际部署与规模化应用潜力。

链接: https://arxiv.org/abs/2510.06645
作者: Zhiyuan Wei,Xiaoxuan Yang,Jing Sun,Zijian Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 25 pages, 10 figures

点击查看摘要

Abstract:The increasing complexity of modern software systems exacerbates the prevalence of security vulnerabilities, posing risks of severe breaches and substantial economic loss. Consequently, robust code vulnerability detection is essential for software security. While Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing, their potential for automated code vulnerability detection remains underexplored. This paper presents FineSec, a novel framework that harnesses LLMs through knowledge distillation to enable efficient and precise vulnerability identification in C/C++ codebases. FineSec utilizes knowledge distillation to transfer expertise from large teacher models to compact student models, achieving high accuracy with minimal computational cost. By integrating data preparation, training, evaluation, and continuous learning into a unified, single-task workflow, FineSec offers a streamlined approach. Extensive evaluations on C/C++ codebases demonstrate its superiority over both base models and larger LLMs in identifying complex vulnerabilities and logical flaws, establishing FineSec as a practical and scalable solution for real-world software security. To facilitate reproducibility, the datasets, source code, and experimental results are made publicly available at: this https URL.
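摘要未给出 FineSec 蒸馏目标的具体形式;作为参考,下面是最常见的软标签知识蒸馏损失草图(温度 T 与权重 alpha 为假设超参数),可直观理解“把大教师模型的判别知识迁移到小学生模型”的做法,例如用于漏洞检测分类头的师生迁移。

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """经典软标签知识蒸馏:KL(学生软分布 || 教师软分布) + 硬标签交叉熵。"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)          # 温度平方项校正软目标的梯度尺度
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```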
zh

[AI-48] AI-Driven Forecasting and Monitoring of Urban Water System

【速读】:该论文旨在解决地下供水与污水管道系统中泄漏检测效率低下的问题,传统人工巡检方式效率不足,而密集传感器部署成本过高。解决方案的关键在于提出了一种融合生成式 AI 与远程传感的集成框架,通过稀疏部署的远程传感器获取实时流量和水深数据,并结合 HydroNet 模型——该模型利用管道属性(如材质、管径、坡度)构建有向图结构进行高精度建模,从而实现从有限传感器部署中准确预测整个管网的水力状态。其核心创新在于将边缘感知的消息传递机制与水力仿真相结合,显著提升了在复杂地下管网环境中的泄漏检测精度与可扩展性。

链接: https://arxiv.org/abs/2510.06631
作者: Qiming Guo,Bishal Khatri,Hua Zhang,Wenlu Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Underground water and wastewater pipelines are vital for city operations but plagued by anomalies like leaks and infiltrations, causing substantial water loss, environmental damage, and high repair costs. Conventional manual inspections lack efficiency, while dense sensor deployments are prohibitively expensive. In recent years, artificial intelligence has advanced rapidly and is increasingly applied to urban infrastructure. In this research, we propose an integrated AI and remote-sensor framework to address the challenge of leak detection in underground water pipelines, through deploying a sparse set of remote sensors to capture real-time flow and depth data, paired with HydroNet - a dedicated model utilizing pipeline attributes (e.g., material, diameter, slope) in a directed graph for higher-precision modeling. Evaluations on a real-world campus wastewater network dataset demonstrate that our system collects effective spatio-temporal hydraulic data, enabling HydroNet to outperform advanced baselines. This integration of edge-aware message passing with hydraulic simulations enables accurate network-wide predictions from limited sensor deployments. We envision that this approach can be effectively extended to a wide range of underground water pipeline networks.
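“在有向图上结合管道属性做边感知消息传递”这一思路可用如下极简 PyTorch 层示意;这只是一个通用草图,HydroNet 的真实结构细节摘要中并未给出。

```python
import torch
import torch.nn as nn

class DirectedPipeMP(nn.Module):
    """示意性的有向图消息传递层:消息同时携带上游节点状态与管道属性。"""
    def __init__(self, node_dim, edge_dim, hidden=32):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(node_dim + edge_dim, hidden), nn.ReLU())
        self.upd = nn.GRUCell(hidden, node_dim)

    def forward(self, h, edge_index, edge_attr):
        # h: (N, node_dim) 节点状态(如流量、水深)
        # edge_index: (2, E),第 0 行为上游节点、第 1 行为下游节点
        # edge_attr: (E, edge_dim) 管道属性(如材质编码、管径、坡度)
        src, dst = edge_index
        m = self.msg(torch.cat([h[src], edge_attr], dim=-1))
        agg = torch.zeros(h.size(0), m.size(-1), device=h.device)
        agg.index_add_(0, dst, m)          # 沿水流方向聚合上游消息
        return self.upd(agg, h)
```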
zh

[AI-49] Fine-Grained Emotion Recognition via In-Context Learning

【速读】:该论文旨在解决细粒度情感识别中因忽略决策过程而导致的性能瓶颈问题。现有基于上下文学习(In-Context Learning, ICL)的方法虽通过语义相似示例增强推理过程,但未能有效优化决策机制,尤其在情绪原型匹配时易受语义相似但情感不一致示例干扰,导致查询表征失真。其解决方案的关键在于提出情绪上下文学习(Emotion In-Context Learning, EICL),引入情感相似示例并采用动态软标签策略提升情绪推理中的查询表征质量;同时设计两阶段排除策略从多角度评估相似性,从而显著优化决策过程,实验证明EICL在多个数据集上显著优于传统ICL方法。

链接: https://arxiv.org/abs/2510.06600
作者: Zhaochun Ren,Zhou Yang,Chenglong Ye,Haizhou Sun,Chao Chen,Xiaofei Zhu,Xiangwen Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Fine-grained emotion recognition aims to identify the emotional type in queries through reasoning and decision-making processes, playing a crucial role in various systems. Recent methods use In-Context Learning (ICL), enhancing the representation of queries in the reasoning process through semantically similar examples, while further improving emotion recognition by explaining the reasoning mechanisms. However, these methods enhance the reasoning process but overlook the decision-making process. This paper investigates decision-making in fine-grained emotion recognition through prototype theory. We show that ICL relies on similarity matching between query representations and emotional prototypes within the model, where emotion-accurate representations are critical. However, semantically similar examples often introduce emotional discrepancies, hindering accurate representations and causing errors. To address this, we propose Emotion In-Context Learning (EICL), which introduces emotionally similar examples and uses a dynamic soft-label strategy to improve query representations in the emotion reasoning process. A two-stage exclusion strategy is then employed to assess similarity from multiple angles, further optimizing the decision-making process. Extensive experiments show that EICL significantly outperforms ICL on multiple datasets.
zh

[AI-50] WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在处理复杂网络任务时面临的挑战,包括长周期导航、大规模信息提取以及受限条件下的推理能力不足等问题。解决方案的关键在于提出WebDART框架,其核心机制是动态将目标任务分解为三个聚焦的子任务:导航(navigation)、信息抽取(information extraction)和执行(execution),使模型能够逐项专注完成;同时,在浏览过程中持续重构任务分解策略,利用新发现的过滤器或捷径优化路径,避免冗余探索,从而显著提升复杂网络任务的成功率与效率。

链接: https://arxiv.org/abs/2510.06587
作者: Jingbo Yang,Bairu Hou,Wei Wei,Shiyu Chang,Yujia Bao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents are becoming competent at straightforward web tasks, such as opening an item page or submitting a form, but still struggle with objectives that require long horizon navigation, large scale information extraction, and reasoning under constraints. We present WebDART, a general framework that enables a single LLM to handle such complex chores. WebDART (i) dynamically decomposes each objective into three focused subtasks: navigation, information extraction, and execution, so the model concentrates on one skill at a time, and (ii) continuously replans the decomposition as new webpages are revealed, taking advantage of newly discovered filters or shortcuts and avoiding redundant exploration. Evaluated on WebChoreArena, WebDART lifts success rates by up to 13.7 percentage points over previous SOTA agents, while matching their performance on the easier WebArena suite and completing tasks with up to 14.7 fewer navigation steps.
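WebDART 的“分解—执行—随页面重规划”控制回路大致可示意如下;其中 `llm(prompt) -> str` 与 `browser` 的接口均为假设,仅用于展示循环结构,并非论文原实现。

```python
def webdart_agent(llm, browser, objective, max_steps=30):
    """示意:每步先(重)规划三个聚焦子任务,再执行当前子任务对应的动作。"""
    plan, memory = None, []
    for _ in range(max_steps):
        page = browser.observe()
        # 随新页面暴露的过滤器/捷径持续重构任务分解,避免冗余探索
        plan = llm(f"目标: {objective}\n当前页面: {page}\n已有计划: {plan}\n"
                   "请分解为子任务(navigation / extraction / execution)并指明当前子任务。")
        action = llm(f"子任务: {plan}\n页面: {page}\n请给出下一步浏览器动作。")
        result = browser.act(action)
        memory.append(result)
        if "DONE" in result:               # 假设的完成约定
            break
    return memory
```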
zh

[AI-51] he Framework That Survives Bad Models: Human-AI Collaboration For Clinical Trials

【速读】:该论文旨在解决在临床试验中部署人工智能(Artificial Intelligence, AI)时缺乏有效保障机制所带来的风险问题,尤其是在基于医学影像的疾病评估中,如何确保AI辅助决策不会干扰试验结论的可靠性。其关键解决方案是采用“AI作为辅助阅片者”(AI as a Supporting Reader, AI-SR)框架,该方法通过将AI与人类专家协同工作,在各种模型类型(包括劣质模型)下均能保持高准确性、鲁棒性、泛化能力及成本效益,并且在随机对照试验中维持治疗效应估计的稳定性,从而保障临床试验结果的有效性和可信度。

链接: https://arxiv.org/abs/2510.06567
作者: Yao Chen,David Ohlssen,Aimee Readie,Gregory Ligozio,Ruvie Martin,Thibaud Coroller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) holds great promise for supporting clinical trials, from patient recruitment and endpoint assessment to treatment response prediction. However, deploying AI without safeguards poses significant risks, particularly when evaluating patient endpoints that directly impact trial conclusions. We compared two AI frameworks against human-only assessment for medical image-based disease evaluation, measuring cost, accuracy, robustness, and generalization ability. To stress-test these frameworks, we injected bad models, ranging from random guesses to naive predictions, to ensure that observed treatment effects remain valid even under severe model degradation. We evaluated the frameworks using two randomized controlled trials with endpoints derived from spinal X-ray images. Our findings indicate that using AI as a supporting reader (AI-SR) is the most suitable approach for clinical trials, as it meets all criteria across various model types, even with bad models. This method consistently provides reliable disease estimation, preserves clinical trial treatment effect estimates and conclusions, and retains these advantages when applied to different populations.
zh

[AI-52] Incoherence in goal-conditioned autoregressive models

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)策略在通过朴素目标条件化自回归模型时出现的“非一致性”(incoherence)问题,即策略在迭代优化过程中因生成式建模与控制目标之间的不匹配而导致性能退化。其解决方案的关键在于对策略进行在线重训练(fine-tuning offline-learned policies with online RL),该过程可有效降低非一致性并提升回报(return)。作者通过重新表述控制即推理(control-as-inference)和软Q学习(soft Q-learning)的标准框架,建立了三种等价视角:将后验分布折叠进奖励函数、在确定性情况下降低温度参数(temperature parameter),以及通过训练-推理权衡(training-inference trade-off)体现计算层面的对应关系,从而系统刻画了策略迭代轨迹的演化机制,并揭示了非一致性与有效horizon之间的联系。

链接: https://arxiv.org/abs/2510.06545
作者: Jacek Karwowski,Raymond Douglas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterize the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon.
zh

[AI-53] Scalable Policy-Based RL Algorithms for POMDPs NEURIPS2025

【速读】:该论文旨在解决部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)中信念状态(belief state)的连续性带来的计算复杂性问题,从而实现对最优策略的有效学习。其解决方案的关键在于将原始POMDP近似为一个有限状态的马尔可夫决策过程(Markov Decision Process, MDP),称为“超状态MDP”(Superstate MDP),其中状态对应于有限历史序列。通过理论分析证明了该近似下最优价值函数与原POMDP最优价值函数之间的误差随历史长度呈指数级衰减,并提出基于线性函数逼近的策略学习方法,结合时序差分(Temporal Difference, TD)学习和策略优化,实现了在非马尔可夫动态环境下对POMDP的高效近似求解。这是首个明确量化标准TD学习在非马尔可夫设定下引入误差的有限时间边界工作。

链接: https://arxiv.org/abs/2510.06540
作者: Ameya Anjarlekar,Rasoul Etesami,R Srikant
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 36 pages, 3 Figures, Accepted at NeurIPS 2025

点击查看摘要

Abstract:The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP model into a finite-state Markov Decision Process (MDP) (called Superstate MDP). We first derive theoretical guarantees that improve upon prior work that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by Policy Optimization by treating it as an MDP, where the MDP state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.
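“把 POMDP 当作以有限历史为状态的 MDP 求解”的构造可以用如下草图说明:超状态取当前观测加最近 k 步(动作, 观测)历史,随 k 增大近似误差按摘要所述呈指数衰减。论文采用线性函数逼近的策略学习,这里为便于演示改用更简单的表格型 Q-learning;`env.reset()/env.step()` 接口为假设。

```python
import random
from collections import defaultdict

def superstate(obs, hist, k=3):
    """超状态 = 当前观测 + 最近 k 步 (动作, 观测) 历史,折叠为可哈希元组。"""
    return (obs,) + tuple(hist[-k:])

def q_learning_superstates(env, n_actions, episodes=500,
                           k=3, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                       # 键: (超状态, 动作)
    for _ in range(episodes):
        obs, hist, done = env.reset(), [], False
        while not done:
            s = superstate(obs, hist, k)
            a = (random.randrange(n_actions) if random.random() < eps
                 else max(range(n_actions), key=lambda x: Q[(s, x)]))
            obs2, r, done = env.step(a)          # 假设返回 (观测, 奖励, 结束标志)
            hist.append((a, obs2))
            s2 = superstate(obs2, hist, k)
            best = max(Q[(s2, x)] for x in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best * (not done) - Q[(s, a)])
            obs = obs2
    return Q
```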
zh

[AI-54] Auto-Prompt Ensemble for LLM Judge

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)作为评判者时可靠性不足的问题,尤其在于其常因忽视人类评估中隐含的标准而遗漏关键评价维度。解决方案的核心是提出Auto-Prompt Ensemble(APE)框架,该框架通过自动从失败案例中学习额外的评价维度,并引入基于置信度的集成机制——即“集体置信度”(Collective Confidence)来动态决定何时采纳新增维度的判断,从而提升LLM评判的一致性与准确性。

链接: https://arxiv.org/abs/2510.06538
作者: Jiajie Li,Huayi Zhang,Peng Lin,Jinjun Xiong,Wei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a novel framework that improves the reliability of LLM judges by selectively augmenting LLM with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit standards underlying human assessments. To address this challenge, we propose the Auto-Prompt Ensemble (APE), an adaptive framework that automatically learns evaluation dimensions from its failure cases. APE incorporates a confidence-based ensemble mechanism to decide when to adopt the judgments from additional evaluation dimensions through a novel confidence estimation approach called Collective Confidence. Extensive experiments demonstrate that APE improves the reliability of LLM Judge across diverse standard benchmarks. For instance, APE enhances GPT-4o agreement rate on Reward Bench from 87.2% to 90.5% in the zero-shot setting. Overall, APE provides a principled approach for LLM Judge to leverage test-time computation, and bridge the evaluation gap between human and LLM judges.
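“集体置信度”门控的大致机制可以这样示意:对每个新增评估维度多次采样判断,以一致率作为置信度,超过阈值才纳入最终集成。具体的置信度定义与集成规则以论文为准,以下仅为假设性草图。

```python
def collective_confidence(judgments):
    """以多次采样判断的一致率作为“集体置信度”的示意性度量。"""
    top = max(set(judgments), key=judgments.count)
    return top, judgments.count(top) / len(judgments)

def ape_judge(base_judgment, dimension_judgments, tau=0.7):
    # dimension_judgments: {维度名: [该维度多次采样得到的判断, ...]}
    verdicts = [base_judgment]
    for dim, samples in dimension_judgments.items():
        verdict, conf = collective_confidence(samples)
        if conf >= tau:                 # 仅当该维度判断足够自信时才采纳
            verdicts.append(verdict)
    return max(set(verdicts), key=verdicts.count)   # 多数票汇总
```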
zh

[AI-55] Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them

【速读】:该论文旨在解决生成式 AI (Generative AI) 在代理搜索(Agentic Search)场景中因复杂信息需求而面临的推理与自主能力不足的问题,尤其关注如何提升大语言模型(LLMs)在多步骤规划、检索和信息整合过程中的有效推理行为。解决方案的关键在于提出一种基于**行为提示(Behavior Priming)**的训练范式:首先通过构建包含四种有益推理行为(信息验证、权威性评估、自适应搜索、错误恢复)的代理搜索轨迹数据集,并采用监督微调(SFT)将其注入模型;随后进行标准强化学习(RL)优化。实验证明,这种先以推理行为为导向的SFT策略比直接使用RL训练更能显著提升性能(如Llama3.2-3B和Qwen3-1.7B模型在GAIA、WebWalker和HLE基准上提升超35%),且核心机制在于这些推理行为增强了模型的探索能力(更高pass@k值和熵)及测试时扩展能力(更长轨迹),从而为后续RL提供了更强的基础。

链接: https://arxiv.org/abs/2510.06534
作者: Jiahe Jin,Abhijay Paladugu,Chenyan Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agentic search leverages large language models (LLMs) to interpret complex user information needs and execute a multi-step process of planning, searching, and synthesizing information to provide answers. This paradigm introduces unique challenges for LLMs’ reasoning and agentic capabilities when interacting with retrieval systems and the broader web. In this paper, we propose a reasoning-driven LLM-based pipeline to study effective reasoning behavior patterns in agentic search. Using this pipeline, we analyze successful agentic search trajectories and identify four beneficial reasoning behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Based on these findings, we propose a technique called Behavior Priming to train more effective agentic search models. It synthesizes agentic search trajectories that exhibit these four behaviors and integrates them into the agentic search model through supervised fine-tuning (SFT), followed by standard reinforcement learning (RL). Experiments on three benchmarks (GAIA, WebWalker, and HLE) demonstrate that behavior priming yields over 35% gains in Llama3.2-3B and Qwen3-1.7B compared to directly training agentic search models with RL. Crucially, we demonstrate that the desired reasoning behaviors in the SFT data, rather than the correctness of the final answer, is the critical factor for achieving strong final performance after RL: fine-tuning on trajectories with desirable reasoning behaviors but incorrect answers leads to better performance than fine-tuning on trajectories with correct answers. Our analysis further reveals the underlying mechanism: the introduced reasoning behaviors endow models with more effective exploration (higher pass@k and entropy) and test-time scaling (longer trajectories) capabilities, providing a strong foundation for RL. Our code will be released as open source.
zh

[AI-56] Visualizing Multimodality in Combinatorial Search Landscapes

【速读】:该论文旨在解决组合搜索空间(combinatorial search landscape)可视化中多模态性(multimodality)的表征难题,即如何通过有效的可视化技术更全面地揭示搜索空间的复杂结构。其解决方案的关键在于整合来自景观分析(landscape analysis)文献中的多种可视化技术,并基于图形语法(Grammar of Graphics)的几何与美学要素,构建一个协同互补的可视化框架,从而提供对搜索景观更深入、多维的理解。研究指出,不存在“免费午餐”式的通用可视化方法,强调需根据具体问题选择和组合技术,并为未来研究指明了多个可行方向。

链接: https://arxiv.org/abs/2510.06517
作者: Xavier F. C. Sánchez-Díaz,Ole Jakob Mengshoel
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 18 pages, 9 figures, Poster presented at the 2025 Symposium of the Norwegian Artificial Intelligence Society (NAIS 2025) on June 18, 2025

点击查看摘要

Abstract:This work walks through different visualization techniques for combinatorial search landscapes, focusing on multimodality. We discuss different techniques from the landscape analysis literature, and how they can be combined to provide a more comprehensive view of the search landscape. We also include examples and discuss relevant work to show how others have used these techniques in practice, based on the geometric and aesthetic elements of the Grammar of Graphics. We conclude that there is no free lunch in visualization, and provide recommendations for future work as there are several paths to continue the work in this field.
zh

[AI-57] A Median Perspective on Unlabeled Data for Out-of-Distribution Detection

【速读】:该论文旨在解决开放世界场景下无标签数据中分布外(Out-of-distribution, OOD)样本难以有效识别的问题,尤其是在混合了分布内(In-distribution, InD)与OOD样本的无标签“野生”数据中,传统方法因缺乏明确的OOD标注而难以训练出鲁棒的OOD检测模型。其解决方案的关键在于提出Medix框架,利用中位数(median)运算对无标签数据进行稳定估计,以识别潜在的异常点(即OOD样本),随后结合已标注的InD数据训练一个高精度的OOD分类器。该方法在理论上提供了误差界证明,实验证明其在开放世界设置下显著优于现有方法,体现了中位数在抗噪和抗异常值方面的优势。

链接: https://arxiv.org/abs/2510.06505
作者: Momin Abbas,Ali Falahati,Hossein Goli,Mohammad Mohammadi Amiri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection plays a crucial role in ensuring the robustness and reliability of machine learning systems deployed in real-world applications. Recent approaches have explored the use of unlabeled data, showing potential for enhancing OOD detection capabilities. However, effectively utilizing unlabeled in-the-wild data remains challenging due to the mixed nature of both in-distribution (InD) and OOD samples. The lack of a distinct set of OOD samples complicates the task of training an optimal OOD classifier. In this work, we introduce Medix, a novel framework designed to identify potential outliers from unlabeled data using the median operation. We use the median because it provides a stable estimate of the central tendency, as an OOD detection mechanism, due to its robustness against noise and outliers. Using these identified outliers, along with labeled InD data, we train a robust OOD classifier. From a theoretical perspective, we derive error bounds that demonstrate Medix achieves a low error rate. Empirical results further substantiate our claims, as Medix outperforms existing methods across the board in open-world settings, confirming the validity of our theoretical insights.
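Medix 的核心直觉,即“中位数对噪声与离群点稳健,可据此从无标签野生数据中挖掘候选 OOD 样本”,可用如下草图说明(特征空间、距离度量与分类器选择均为示意,非论文原设定)。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mine_outliers(feats_wild, quantile=0.95):
    """用中位数作稳健中心,按到中位数的距离筛选候选 OOD 样本(示意)。"""
    center = np.median(feats_wild, axis=0)
    dist = np.linalg.norm(feats_wild - center, axis=1)
    return feats_wild[dist > np.quantile(dist, quantile)]

def train_ood_classifier(feats_ind, feats_wild):
    """候选 OOD 作为正类,已标注 InD 作为负类,训练二分类 OOD 检测器。"""
    cand_ood = mine_outliers(feats_wild)
    X = np.vstack([feats_ind, cand_ood])
    y = np.concatenate([np.zeros(len(feats_ind)), np.ones(len(cand_ood))])
    return LogisticRegression(max_iter=1000).fit(X, y)
```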
zh

[AI-58] ATLO-ML: Adaptive Time-Length Optimizer for Machine Learning – Insights from Air Quality Forecasting

【速读】:该论文旨在解决时间序列预测中因输入时间长度(input time length)和采样率(sampling rate)选择不当而导致的模型性能下降问题。解决方案的关键在于提出了一种自适应时间长度优化系统(ATLO-ML),该系统能够根据用户定义的输出时间长度,自动确定最优的输入时间长度和采样率,从而动态调整时间序列数据预处理参数,显著提升机器学习模型的预测准确性。

链接: https://arxiv.org/abs/2510.06503
作者: I-Hsi Kao,Kanji Uchino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate time-series predictions in machine learning are heavily influenced by the selection of appropriate input time length and sampling rate. This paper introduces ATLO-ML, an adaptive time-length optimization system that automatically determines the optimal input time length and sampling rate based on user-defined output time length. The system provides a flexible approach to time-series data pre-processing, dynamically adjusting these parameters to enhance predictive performance. ATLO-ML is validated using air quality datasets, including both GAMS-dataset and proprietary data collected from a data center, both in time series format. Results demonstrate that utilizing the optimized time length and sampling rate significantly improves the accuracy of machine learning models compared to fixed time lengths. ATLO-ML shows potential for generalization across various time-sensitive applications, offering a robust solution for optimizing temporal input parameters in machine learning workflows.
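“根据给定输出长度自动选择输入时间长度与采样率”的一个朴素实现是对候选 (输入长度, 采样步长) 做验证集网格搜索,如下草图所示;ATLO-ML 的自适应策略细节以论文为准,`series` 假设为一维 numpy 数组。

```python
import numpy as np

def make_windows(series, in_len, stride, out_len):
    """按输入长度与采样步长(采样率的倒数)切片构造 (X, y) 样本。"""
    X, y = [], []
    for end in range(in_len * stride, len(series) - out_len):
        X.append(series[end - in_len * stride:end:stride])
        y.append(series[end:end + out_len])
    return np.array(X), np.array(y)

def search_best_config(series, model_fn, in_lens, strides, out_len):
    """在验证集上网格搜索最优 (输入长度, 采样步长) 组合。"""
    best, best_err = None, float("inf")
    for L in in_lens:
        for s in strides:
            X, y = make_windows(series, L, s, out_len)
            n = int(len(X) * 0.8)
            model = model_fn().fit(X[:n], y[:n])
            err = np.mean((model.predict(X[n:]) - y[n:]) ** 2)
            if err < best_err:
                best, best_err = (L, s), err
    return best, best_err

# 用法示例(假设):
# from sklearn.linear_model import LinearRegression
# best, err = search_best_config(series, lambda: LinearRegression(),
#                                in_lens=[24, 48, 96], strides=[1, 2, 4], out_len=6)
```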
zh

[AI-59] Valid Stopping for LLM Generation via Empirical Dynamic Formal Lift

【速读】:该论文旨在解决语言模型生成过程中存在资源浪费的问题,即如何在保证信息充分性的同时尽早停止生成,从而减少计算开销。其核心挑战在于设计一种能够在任意停止时间下提供形式化误差控制(delta-level error control)的停止准则,避免传统方法因固定长度或启发式策略导致的效率低下或可靠性不足。解决方案的关键是提出Sequential-EDFL(Empirical Dynamic Formal Lift),通过自归一化的经验伯努斯坦e过程(self-normalized empirical-Bernstein e-processes)实时追踪信息提升(information lift),即全模型与故意弱化的“骨架”基线模型之间的对数似然比,并结合在线均值估计处理未知中心化问题、混合e过程整合多参数以及适应性重置机制应对分布漂移。该方法在六个基准测试中实现了22–28%的生成量减少,同时保持delta级误差控制,仅带来12%的计算开销,且可通过轻量级正确性门控进一步提升任务准确性,但不保证事实正确性,因此适合作为第一阶段过滤器以显著降低验证负担(减少83%)。

链接: https://arxiv.org/abs/2510.06478
作者: Sanjeda Akter,Ibne Farabi Shihab,Anuj Sharma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), applying anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift – the log-likelihood ratio between full models and deliberately weakened “skeleton” baselines – using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation by 22-28% vs. sequential baselines while maintaining delta-level control with 12% computational overhead. We introduce automated skeletons (distilled submodels, randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries + verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness – 10.9% of stopped sequences remain incorrect even with the gate (13.2-22.7% without it). EDFL serves as a first-stage filter reducing verification burden by 83%, not as a standalone solution for safety-critical domains.
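论文的核心机制,即“对信息提升维护一个任意时刻有效(anytime-valid)的 e-过程,越过 1/δ 阈值即可停止生成”,可以用一个简化草图说明。注意:论文使用的是自归一化 empirical-Bernstein e-过程并在线估计中心,这里为保证示例正确性改用更简单的 Hoeffding 型变体,并假设增量有界于 [−b, b]、λ 固定。

```python
import numpy as np

def edfl_stopping_time(lifts, delta=0.05, lam=0.5, b=5.0):
    """lifts: 每步的信息提升增量(完整模型与“骨架”基线的对数似然比)。
    返回首次可在 delta 水平上停止的步数;None 表示证据不足、继续生成。"""
    log_e, log_threshold = 0.0, np.log(1.0 / delta)
    for t, x in enumerate(lifts, start=1):
        # Hoeffding 超鞅增量:零均值假设下 E[exp(lam*x - lam^2 (2b)^2 / 8)] <= 1
        log_e += lam * x - (lam ** 2) * (2 * b) ** 2 / 8.0
        if log_e >= log_threshold:   # Ville 不等式 ⇒ 任意停止时刻错误率 <= delta
            return t
    return None
```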
zh

[AI-60] Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)中两个长期被孤立研究的现象——注意力陷阱(attention sinks)和压缩谷值(compression valleys)之间的内在联系问题。研究表明,这两种现象均源于残差流(residual stream)中大规模激活的形成,且其本质是表征压缩(representational compression)的结果。解决方案的关键在于提出“混合-压缩-精炼”(Mix-Compress-Refine)理论框架,该框架揭示了Transformer-based LLMs通过控制大规模激活来调节注意力机制与表征压缩的关系,从而在深度上组织计算:早期层进行广泛混合,中期层执行压缩计算并限制混合,晚期层实现选择性精炼。这一统一视角解释了为何嵌入任务在中间层表现最佳,而生成任务则依赖全深度处理,阐明了任务依赖的表征差异。

链接: https://arxiv.org/abs/2510.06477
作者: Enrique Queipo-de-Llano,Álvaro Arroyo,Federico Barbero,Xiaowen Dong,Michael Bronstein,Yann LeCun,Ravid Shwartz-Ziv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.
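“大规模激活与表征压缩、注意力陷阱同时出现”这一诊断可以用两个很小的测量函数复现:逐层查看 BOS token 的残差流范数,并计算注意力分布的熵。以下草图只依赖张量本身(例如取自启用 `output_hidden_states=True` / `output_attentions=True` 的 Transformers 前向输出),函数名为假设。

```python
import torch

def massive_activation_profile(hidden_states):
    """hidden_states: 各层 (B, T, d) 残差流表示的列表。
    返回各层首个 token(BOS)的平均激活范数,用于定位大规模激活出现的中间层。"""
    return [h[:, 0, :].norm(dim=-1).mean().item() for h in hidden_states]

def attention_entropy(attn):
    """attn: (B, heads, T, T) 的注意力分布;熵骤降对应注意力陷阱的出现。"""
    p = attn.clamp_min(1e-12)
    return (-(p * p.log()).sum(-1)).mean().item()
```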
zh

[AI-61] Evaluating Node-tree Interfaces for AI Explainability

【速读】:该论文试图解决的问题是:随着大语言模型(Large Language Models, LLMs)在职场工具和决策流程中的广泛应用,如何提升其可解释性并增强用户信任,尤其是在人机交互界面设计中嵌入透明度与可信机制方面,当前的人类中心设计仍显滞后。解决方案的关键在于提出一种基于节点树(node-tree)的可视化界面设计,该设计将AI生成的回答以分层结构化、可交互的节点形式呈现,使用户能够导航、细化并追踪复杂信息流;相较于传统的对话式聊天界面,节点树界面在探索性任务、后续追问、决策支持和问题解决中表现更优,尤其在促进头脑风暴和维持上下文连贯性方面显著提升用户信任水平。研究通过对照实验验证了该设计能有效增强任务性能与用户信心,从而为构建适应不同任务需求的动态AI交互界面提供了可行路径。

链接: https://arxiv.org/abs/2510.06457
作者: Lifei Wang,Natalie Friedman,Chengchao Zhu,Zeshu Zhu,S.Joy Mountford
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures. Accepted to the 3rd Workshop on Explainability in Human-Robot Collaboration: Real-World Concerns (XHRI 2025), scheduled for March 3, 2025, Hybrid (Melbourne and online) as part of HRI 2025

点击查看摘要

Abstract:As large language models (LLMs) become ubiquitous in workplace tools and decision-making processes, ensuring explainability and fostering user trust are critical. Although advancements in LLM engineering continue, human-centered design is still catching up, particularly when it comes to embedding transparency and trust into AI interfaces. This study evaluates user experiences with two distinct AI interfaces - node-tree interfaces and chatbot interfaces - to assess their performance in exploratory, follow-up inquiry, decision-making, and problem-solving tasks. Our design-driven approach introduces a node-tree interface that visually structures AI-generated responses into hierarchically organized, interactive nodes, allowing users to navigate, refine, and follow up on complex information. In a comparative study with n=20 business users, we observed that while the chatbot interface effectively supports linear, step-by-step queries, it is the node-tree interface that enhances brainstorming. Quantitative and qualitative findings indicate that node-tree interfaces not only improve task performance and decision-making support but also promote higher levels of user trust by preserving context. Our findings suggest that adaptive AI interfaces capable of switching between structured visualizations and conversational formats based on task requirements can significantly enhance transparency and user confidence in AI-powered systems. This work contributes actionable insights to the fields of human-robot interaction and AI design, particularly for enterprise applications where trust-building is critical for teams.
zh

[AI-62] How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation

【速读】:该论文旨在解决当前用于评估模型迁移能力(transferability)估计指标的基准测试设置存在根本性缺陷的问题。作者指出,现有基准测试中不现实的模型空间和静态性能层级人为夸大了已有指标的表现,甚至使得简单的、与数据集无关的启发式方法也能超越复杂算法。解决方案的关键在于重新设计更加稳健和贴近实际场景的基准测试框架,以确保未来研究能够基于更真实、更具挑战性的评估环境推进,从而提升迁移能力估计指标在真实模型选择任务中的有效性与可靠性。

链接: https://arxiv.org/abs/2510.06448
作者: Prabhant Singh,Sibylle Hess,Joaquin Vanschoren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
zh

[AI-63] Context-Aware Inference via Performance Forecasting in Decentralized Learning Networks

【速读】:该论文旨在解决去中心化学习网络中模型预测组合策略的滞后性问题,即传统线性池化方法(如简单平均或基于历史表现动态加权)因依赖移动平均或指数加权而难以快速响应环境变化(如阶段转移或制度切换)。其解决方案的关键在于引入一个基于机器学习的性能预测模块,通过预测每个模型在时间序列中的表现(如遗憾值 regret 或归一化遗憾 z-score),实现对模型权重的前瞻性分配,从而提升网络推理的准确性。实验表明,相较于直接预测损失的模型,预测性能指标的模型能显著优于基线(历史加权平均),且模型性能对特征选择和训练轮次敏感,需根据具体任务进行调优。

链接: https://arxiv.org/abs/2510.06444
作者: Joel Pfeffer,J. M. Diederik Kruijssen,Clément Gossart,Mélanie Chevance,Diego Campo Millan,Florian Stecker,Steven N. Longmore (Allora Foundation)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 17 pages, 12 figures; appeared in ADI (October 2025)

点击查看摘要

Abstract:In decentralized learning networks, predictions from many participants are combined to generate a network inference. While many studies have demonstrated performance benefits of combining multiple model predictions, existing strategies using linear pooling methods (ranging from simple averaging to dynamic weight updates) face a key limitation. Dynamic prediction combinations that rely on historical performance to update weights are necessarily reactive. Due to the need to average over a reasonable number of epochs (with moving averages or exponential weighting), they tend to be slow to adjust to changing circumstances (phase or regime changes). In this work, we develop a model that uses machine learning to forecast the performance of predictions by models at each epoch in a time series. This enables `context-awareness’ by assigning higher weight to models that are likely to be more accurate at a given time. We show that adding a performance forecasting worker in a decentralized learning network, following a design similar to the Allora network, can improve the accuracy of network inferences. Specifically, we find forecasting models that predict regret (performance relative to the network inference) or regret z-score (performance relative to other workers) show greater improvement than models predicting losses, which often do not outperform the naive network inference (historically weighted average of all inferences). Through a series of optimization tests, we show that the performance of the forecasting model can be sensitive to choices in the feature set and number of training epochs. These properties may depend on the exact problem and should be tailored to each domain. Although initially designed for a decentralized learning network, using performance forecasting for prediction combination may be useful in any situation where predictive rather than reactive model weighting is needed.
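把“预测出的 regret z 分数”转成组合权重的一种自然做法是 softmax 加权,如下草图所示;摘要并未给出 Allora 式网络的具体加权函数,此处仅为示意。

```python
import numpy as np

def combine_inferences(preds, forecasted_regret_z, temperature=1.0):
    """preds: (n_workers,) 各参与者的预测;forecasted_regret_z: 预测的 regret z 分数
    (越高表示预计相对其他 worker 表现越好)。softmax 转权重后加权平均。"""
    w = np.exp(np.asarray(forecasted_regret_z) / temperature)
    w /= w.sum()
    return float(np.dot(w, preds))
```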
zh

[AI-64] Flavonoid Fusion: Creating a Knowledge Graph to Unveil the Interplay Between Food and Health

【速读】:该论文旨在解决当前关于“食物即药物”(food as medicine)的研究缺乏标准化、机器可读表示形式的问题,尤其在整合食物营养成分与健康关联知识方面存在明显不足。其解决方案的关键在于构建一个基于语义网(semantic web)的知识图谱(knowledge graph),通过KNARM方法对来自美国农业部(USDA)数据库的食品类黄酮含量数据与文献中报道的癌症关联信息进行结构化整合,从而以机器可操作格式呈现食物与健康之间的复杂关系,为后续研究提供可推理和扩展的基础框架。

链接: https://arxiv.org/abs/2510.06433
作者: Aryan Singh Dalal,Yinglun Zhang,Duru Doğan,Atalay Mert İleri,Hande Küçük McGinty
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The focus on “food as medicine” is gaining traction in the field of health and several studies conducted in the past few years discussed this aspect of food in the literature. However, very little research has been done on representing the relationship between food and health in a standardized, machine-readable format using a semantic web that can help us leverage this knowledge effectively. To address this gap, this study aims to create a knowledge graph to link food and health through the knowledge graph’s ability to combine information from various platforms focusing on flavonoid contents of food found in the USDA databases and cancer connections found in the literature. We looked closely at these relationships using KNARM methodology and represented them in machine-operable format. The proposed knowledge graph serves as an example for researchers, enabling them to explore the complex interplay between dietary choices and disease management. Future work for this study involves expanding the scope of the knowledge graph by capturing nuances, adding more related data, and performing inferences on the acquired knowledge to uncover hidden relationships.
zh

[AI-65] Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?

【速读】:该论文旨在解决生成式 AI(Generative AI)在多模型协同推理场景中的能力局限问题,特别是如何评估和提升大语言模型(Large Language Models, LLMs)在非自身轨迹上的推理能力,即“离轨推理”(off-trajectory reasoning)。其核心挑战在于:现有基于单模型推理训练的流水线是否能有效支持模型在共享推理路径中识别并修正他人误导性推理(recoverability),以及能否有效利用更强协作方提供的正确推理步骤(guidability)。解决方案的关键在于提出两种量化测试方法——recoverability 和 guidability,用于系统评估不同规模(1.5B–32B参数)的开放权重LLMs在这两个维度上的表现,并通过控制实验分离后训练阶段中教师模型选择、强化学习(RL)使用及数据筛选策略对上述行为的影响。研究发现,更强的LLM反而更易受干扰,且普遍无法有效利用外部指导,揭示了当前预训练+微调范式在构建可靠协同推理系统方面的根本不足。

链接: https://arxiv.org/abs/2510.06410
作者: Aochong Oliver Li,Tanya Goyal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning LLMs are trained to verbalize their reasoning process, yielding strong gains on complex tasks. This transparency also opens a promising direction: multiple reasoners can directly collaborate on each other’s thinking within a shared trajectory, yielding better inference efficiency and exploration. A key prerequisite, however, is the ability to assess the usefulness and build on another model’s partial thinking – we call this off-trajectory reasoning. Our paper investigates a critical question: can standard solo-reasoning training pipelines deliver desired off-trajectory behaviors? We propose twin tests that capture the two extremes of the off-trajectory spectrum, namely Recoverability, which tests whether LLMs can backtrack from “distractions” induced by misleading reasoning traces, and Guidability, which tests their ability to build upon correct reasoning from stronger collaborators. Our study evaluates 15 open-weight LLMs (1.5B-32B) and reveals a counterintuitive finding – “stronger” LLMs on benchmarks are often more fragile under distraction. Moreover, all models tested fail to effectively leverage guiding steps from collaborators on problems beyond their inherent capabilities with solve rates remaining under 9.2%. Finally, we conduct control studies to isolate the effects of three factors in post-training on these behaviors: the choice of distillation teacher, the use of RL, and data selection strategy. Our results provide actionable insights for training natively strong reasoning collaborators; e.g., we find that suboptimal recoverability behaviors of teacher models are transferred to distilled students even if the distillation trajectories are correct. Taken together, this work lays the groundwork for evaluating multi-model collaborations in shared reasoning trajectories and highlights the limitations of off-the-shelf reasoning LLMs.
zh

[AI-66] Geometry-Aware Backdoor Attacks: Leveraging Curvature in Hyperbolic Embeddings

【速读】:该论文旨在解决非欧几里得基础模型(non-Euclidean foundation models)在曲率空间(如双曲几何)中因几何特性引发的后门攻击(backdoor attack)脆弱性问题。其核心发现是:模型表示空间靠近边界时,输入微小变化在标准输入空间检测器下看似无害,却会在表示空间中引发显著偏移,从而被后门触发器利用。解决方案的关键在于提出一种几何自适应触发器(geometry-adaptive trigger),该触发器利用边界驱动的不对称性增强攻击效果,并通过理论分析与实证验证揭示了防御机制的局限——即任何沿径向内推的防御策略虽可抑制触发器,但会牺牲模型在该方向上的敏感性。这一发现为理解非欧模型的安全边界提供了关键洞见,并指导更鲁棒的防御设计。

链接: https://arxiv.org/abs/2510.06397
作者: Ali Baheri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Non-Euclidean foundation models increasingly place representations in curved spaces such as hyperbolic geometry. We show that this geometry creates a boundary-driven asymmetry that backdoor triggers can exploit. Near the boundary, small input changes appear subtle to standard input-space detectors but produce disproportionately large shifts in the model’s representation space. Our analysis formalizes this effect and also reveals a limitation for defenses: methods that act by pulling points inward along the radius can suppress such triggers, but only by sacrificing useful model sensitivity in that same direction. Building on these insights, we propose a simple geometry-adaptive trigger and evaluate it across tasks and architectures. Empirically, attack success increases toward the boundary, whereas conventional detectors weaken, mirroring the theoretical trends. Together, these results surface a geometry-specific vulnerability in non-Euclidean models and offer analysis-backed guidance for designing and understanding the limits of defenses.
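“边界驱动的不对称性”可以在 Poincaré 球模型中直接算出来:到原点的双曲距离为 d(0, x) = 2·artanh(‖x‖),因此同样大小的欧氏径向扰动,越靠近边界(‖x‖→1)造成的表示空间位移越大。下面的 numpy 草图(函数名为假设)演示这一放大效应。

```python
import numpy as np

def radial_push(x, eps=0.05, max_norm=0.999):
    """沿径向把 Poincaré 球中的点向边界推 eps(欧氏范数),并返回对应双曲位移。"""
    r = np.linalg.norm(x)
    r_new = min(r + eps, max_norm)
    hyp_shift = 2 * (np.arctanh(r_new) - np.arctanh(r))   # d(0, x) = 2 artanh(||x||)
    return x * (r_new / r), hyp_shift

for r0 in (0.10, 0.50, 0.90, 0.99):
    _, shift = radial_push(np.array([r0, 0.0]))
    print(f"r={r0:.2f} -> 同样 0.05 的欧氏位移对应双曲位移 {shift:.3f}")
```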
zh

[AI-67] Adaptive Protein Design Protocols and Middleware

【速读】:该论文旨在解决计算蛋白质设计中因蛋白序列与结构空间极其庞大而导致的收敛困难问题,尤其是在有限计算资源下难以高效采样和验证设计效果的挑战。解决方案的关键在于提出并实现了一个名为IMPRESS(Integrated Machine-learning for Protein Structures at Scale)的系统框架,其核心是将人工智能(AI)与高性能计算(HPC)任务耦合,通过动态资源分配和异步工作负载执行机制,实现对蛋白质设计过程的实时评估与优化,从而显著提升设计质量的一致性和整体吞吐量。

链接: https://arxiv.org/abs/2510.06396
作者: Aymen Alsaadi,Jonathan Ash,Mikhail Titov,Matteo Turilli,Andre Merzky,Shantenu Jha,Sagar Khare
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE)
备注: N/A

点击查看摘要

Abstract:Computational protein design is experiencing a transformation driven by AI/ML. However, the range of potential protein sequences and structures is astronomically vast, even for moderately sized proteins. Hence, achieving convergence between generated and predicted structures demands substantial computational resources for sampling. The Integrated Machine-learning for Protein Structures at Scale (IMPRESS) offers methods and advanced computing systems for coupling AI to high-performance computing tasks, enabling the ability to evaluate the effectiveness of protein designs as they are developed, as well as the models and simulations used to generate data and train models. This paper introduces IMPRESS and demonstrates the development and implementation of an adaptive protein design protocol and its supporting computing infrastructure. This leads to increased consistency in the quality of protein design and enhanced throughput of protein design due to dynamic resource allocation and asynchronous workload execution.
zh

[AI-68] Monte Carlo Permutation Search

【速读】:该论文旨在解决在计算资源有限或深度强化学习不可行的情况下,如何提升蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)算法在通用游戏博弈(General Game Playing)等场景中的决策性能问题。其核心解决方案是提出一种新的MCTS算法——蒙特卡洛置换搜索(Monte Carlo Permutation Search, MCPS),关键创新在于改进探索项的计算方式:将从根节点到当前节点路径上所有走子对应的模拟结果(playouts)统计信息纳入探索权重中,从而更有效地利用历史模拟数据。此外,MCPS通过数学推导优化了三种统计来源(即路径统计、AMAF统计和置换统计)的加权公式,消除了GRAVE算法中依赖的偏置超参数(bias hyperparameter),并显著降低了对ref超参数的敏感性,提升了算法的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2510.06381
作者: Tristan Cazenave
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE algorithm. MCPS is relevant when deep reinforcement learning is not an option, or when the computing power available before play is not substantial, such as in General Game Playing, for example. The principle of MCPS is to include in the exploration term of a node the statistics on all the playouts that contain all the moves on the path from the root to the node. We extensively test MCPS on a variety of games: board games, wargame, investment game, video game and multi-player games. MCPS has better results than GRAVE in all the two-player games. It has equivalent results for multi-player games because these games are inherently balanced even when players have different strengths. We also show that using abstract codes for moves instead of exact codes can be beneficial to both MCPS and GRAVE, as they improve the permutation statistics and the AMAF statistics. We also provide a mathematical derivation of the formulas used for weighting the three sources of statistics. These formulas are an improvement on the GRAVE formula since they no longer use the bias hyperparameter of GRAVE. Moreover, MCPS is not sensitive to the ref hyperparameter.
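MCPS 的关键改动是在探索项中纳入“根到该节点路径上的所有着法都出现过的模拟”的统计量(置换统计),并与节点统计、AMAF 统计加权融合。论文给出了精确的权重推导;下面的草图只按样本量(近似逆方差)加权,演示融合的形状,`Stats` 结构为假设。

```python
from dataclasses import dataclass

@dataclass
class Stats:
    mean: float   # 该来源下的平均回报
    n: int        # 该来源的模拟次数

def combined_value(node: Stats, perm: Stats, amaf: Stats, prior=0.5):
    """按样本量加权融合三类统计量(示意;论文权重由数学推导给出,
    且不再依赖 GRAVE 的 bias 超参数)。"""
    total = node.n + perm.n + amaf.n
    if total == 0:
        return prior                      # 未访问节点的先验值(假设)
    return (node.mean * node.n + perm.mean * perm.n + amaf.mean * amaf.n) / total
```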
zh

[AI-69] Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data

【速读】:该论文旨在解决关系型数据领域中模型跨数据集和任务迁移能力不足的问题,其核心挑战在于关系数据的异构性,包括多样化的表结构(schema)、图结构以及函数依赖关系。解决方案的关键在于提出一种名为Relational Transformer (RT) 的新架构,该架构通过三个关键设计实现零样本迁移:(i) 利用表/列元数据对单元格进行tokenization,(ii) 采用掩码token预测方式进行预训练,(iii) 引入一种新颖的关系注意力机制(Relational Attention),在列、行及主外键链接上进行建模。RT无需针对特定任务或数据集微调即可在未见过的数据集和任务上直接应用,展现出强大的零样本性能(如在二分类任务中达到全监督AUROC的94%),并具备高样本效率的微调表现,为关系型数据领域的基础模型提供了可行路径。

链接: https://arxiv.org/abs/2510.06377
作者: Rishabh Ranjan,Valter Hudovernik,Mark Znidar,Charilaos Kanatsoulis,Roshan Upendra,Mahmoud Mohammadi,Joe Meyer,Tom Palczewski,Carlos Guestrin,Jure Leskovec
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: preprint; under review

点击查看摘要

Abstract:Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel \textitRelational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 94% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT’s zero-shot transfer harnesses task-table context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.
zh

[AI-70] Constrained Natural Language Action Planning for Resilient Embodied Systems

【速读】:该论文旨在解决生成式 AI(Generative AI)在执行具身任务时因现实环境的不确定性导致的可靠性与可重复性不足问题,尤其是在大型语言模型(LLM)规划中因幻觉(hallucination)引发的不可靠性,以及传统提示工程(prompt engineering)缺乏透明度和可复现性的问题。其解决方案的关键在于引入符号规划(symbolic planning)作为监督机制,对LLM规划过程进行约束与校验,从而在不损失LLM灵活性和开放世界泛化能力的前提下,显著提升系统的可靠性、可重复性和透明度。该方法通过明确定义硬约束条件,相较于传统提示工程提供了更强的清晰度,并在ALFWorld基准和真实四足机器人平台上验证了其优越性能,实现了接近100%的任务成功率。

链接: https://arxiv.org/abs/2510.06357
作者: Grayson Byrd,Corban Rivera,Bethany Kemp,Meghan Booker,Aurora Schmidt,Celso M de Melo,Lalithkumar Seenivasan,Mathias Unberath
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Replicating human-level intelligence in the execution of embodied tasks remains challenging due to the unconstrained nature of real-world environments. Novel use of large language models (LLMs) for task planning seeks to address the previously intractable state/action space of complex planning tasks, but hallucinations limit their reliability, and thus, viability beyond a research context. Additionally, the prompt engineering required to achieve adequate system performance lacks transparency, and thus, repeatability. In contrast to LLM planning, symbolic planning methods offer strong reliability and repeatability guarantees, but struggle to scale to the complexity and ambiguity of real-world tasks. We introduce a new robotic planning method that augments LLM planners with symbolic planning oversight to improve reliability and repeatability, and provide a transparent approach to defining hard constraints with considerably stronger clarity than traditional prompt engineering. Importantly, these augmentations preserve the reasoning capabilities of LLMs and retain impressive generalization in open-world environments. We demonstrate our approach in simulated and real-world environments. On the ALFWorld planning benchmark, our approach outperforms current state-of-the-art methods, achieving a near-perfect 99% success rate. Deployment of our method to a real-world quadruped robot resulted in 100% task success compared to 50% and 30% for pure LLM and symbolic planners across embodied pick and place tasks. Our approach presents an effective strategy to enhance the reliability, repeatability and transparency of LLM-based robot planners while retaining their key strengths: flexibility and generalizability to complex real-world environments. We hope that this work will contribute to the broad goal of building resilient embodied intelligent systems.
zh

[AI-71] Flexible Swarm Learning May Outpace Foundation Models in Essential Tasks

【速读】:该论文试图解决的问题是:在动态复杂系统(如重症监护中的疾病诊疗)中,如何实现高效、可靠的自适应决策,尤其是在数据有限且机制知识不充分的情况下,传统单体式基础模型(monolithic foundation models)难以克服维度诅咒(curse of dimensionality),导致其在真实世界应用中表现受限。解决方案的关键在于提出一种去中心化的“小智能体网络”(Small Agent Networks, SANs)架构,其中每个智能体仅负责系统的一部分功能,通过群体学习(swarm-learning)机制实现跨智能体的协同优化,从而在保持对动态环境快速响应能力的同时,提升整体决策性能,尽管这会牺牲部分细节上的可复现性。

链接: https://arxiv.org/abs/2510.06349
作者: Moein E. Samadi,Andreas Schuppert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Foundation models have rapidly advanced AI, raising the question of whether their decisions will ultimately surpass human strategies in real-world domains. The exponential, and possibly super-exponential, pace of AI development makes such analysis elusive. Nevertheless, many application areas that matter for daily life and society show only modest gains so far; a prominent case is diagnosing and treating dynamically evolving disease in intensive care. The common challenge is adapting complex systems to dynamic environments. Effective strategies must optimize outcomes in systems composed of strongly interacting functions while avoiding shared side effects; this requires reliable, self-adaptive modeling. These tasks align with building digital twins of highly complex systems whose mechanisms are not fully or quantitatively understood. It is therefore essential to develop methods for self-adapting AI models with minimal data and limited mechanistic knowledge. As this challenge extends beyond medicine, AI should demonstrate clear superiority in these settings before assuming broader decision-making roles. We identify the curse of dimensionality as a fundamental barrier to efficient self-adaptation and argue that monolithic foundation models face conceptual limits in overcoming it. As an alternative, we propose a decentralized architecture of interacting small agent networks (SANs). We focus on agents representing the specialized substructure of the system, where each agent covers only a subset of the full system functions. Drawing on mathematical results on the learning behavior of SANs and evidence from existing applications, we argue that swarm-learning in diverse swarms can enable self-adaptive SANs to deliver superior decision-making in dynamic environments compared with monolithic foundation models, though at the cost of reduced reproducibility in detail.
zh

[AI-72] Leveraging Large Language Models for Cybersecurity Risk Assessment – A Case from Forestry Cyber-Physical Systems ICSE

【速读】:该论文旨在解决安全关键型软件系统中网络安全专家资源匮乏导致的风险评估工作负荷过重问题,尤其是在林业领域等数据隐私要求严格的场景下,如何在不违反数据保护法规的前提下提升风险评估效率。其解决方案的关键在于利用本地部署的大语言模型(Large Language Models, LLMs)结合检索增强生成(Retrieval-Augmented Generation, RAG)技术,为网络安全专家和软件工程师提供辅助支持,包括生成初步风险评估、识别潜在威胁以及执行冗余校验,同时强调人类监督对于保障准确性与合规性的必要性。

链接: https://arxiv.org/abs/2510.06343
作者: Fikret Mert Gültekin,Oscar Lilja,Ranim Khojah,Rebekka Wohlrab,Marvin Damschen,Mazen Mohamad
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted at Autonomous Agents in Software Engineering (AgenticSE) Workshop, co-located with ASE 2025

点击查看摘要

Abstract:In safety-critical software systems, cybersecurity activities become essential, with risk assessment being one of the most critical. In many software teams, cybersecurity experts are either entirely absent or represented by only a small number of specialists. As a result, the workload for these experts becomes high, and software engineers would need to conduct cybersecurity activities themselves. This creates a need for a tool to support cybersecurity experts and engineers in evaluating vulnerabilities and threats during the risk assessment process. This paper explores the potential of leveraging locally hosted large language models (LLMs) with retrieval-augmented generation to support cybersecurity risk assessment in the forestry domain while complying with data protection and privacy requirements that limit external data sharing. We performed a design science study involving 12 experts in interviews, interactive sessions, and a survey within a large-scale project. The results demonstrate that LLMs can assist cybersecurity experts by generating initial risk assessments, identifying threats, and providing redundancy checks. The results also highlight the necessity for human oversight to ensure accuracy and compliance. Despite trust concerns, experts were willing to utilize LLMs in specific evaluation and assistance roles, rather than solely relying on their generative capabilities. This study provides insights that encourage the use of LLM-based agents to support the risk assessment process of cyber-physical systems in safety-critical domains.
zh

[AI-73] Belief-Calibrated Multi-Agent Consensus Seeking for Complex NLP Tasks NEURIPS2025

【速读】:该论文旨在解决多智能体系统(Multi-Agent System, MAS)在自然语言处理(Natural Language Processing, NLP)任务中因共识寻求机制不完善而导致的稳定性问题。现有方法通常依赖投票机制判断共识,忽视了系统内部信念之间的矛盾,且采用无差别的协同更新策略,未能识别每个智能体的最佳协作对象,从而阻碍稳定共识的形成。解决方案的关键在于提出一个理论框架,用于选择能最大化共识稳定性的最优合作者,并基于系统内部信念对共识判断进行校准,由此构建出Belief-Calibrated Consensus Seeking (BCCS) 框架,实验证明其在MATH和MMLU基准数据集上分别提升准确率2.23%和3.95%,显著优于现有方法。

链接: https://arxiv.org/abs/2510.06307
作者: Wentao Deng,Jiahuan Pei,Zhiwei Xu,Zhaochun Ren,Zhumin Chen,Pengjie Ren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted by NeurIPS 2025

点击查看摘要

Abstract:A multi-agent system (MAS) enhances its capacity to solve complex natural language processing (NLP) tasks through collaboration among multiple agents, where consensus-seeking serves as a fundamental mechanism. However, existing consensus-seeking approaches typically rely on voting mechanisms to judge consensus, overlooking contradictions in system-internal beliefs that destabilize the consensus. Moreover, these methods often involve agents updating their results through indiscriminate collaboration with every other agent. Such uniform interaction fails to identify the optimal collaborators for each agent, hindering the emergence of a stable consensus. To address these challenges, we provide a theoretical framework for selecting optimal collaborators that maximize consensus stability. Based on the theorems, we propose the Belief-Calibrated Consensus Seeking (BCCS) framework to facilitate stable consensus via selecting optimal collaborators and calibrating the consensus judgment by system-internal beliefs. Experimental results on the MATH and MMLU benchmark datasets demonstrate that the proposed BCCS framework outperforms the best existing results by 2.23% and 3.95% of accuracy on challenging tasks, respectively. Our code and data are available at this https URL.
zh

[AI-74] SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

【速读】:该论文旨在解决生成式模型中训练效率与推理并行性之间的权衡问题,即如何在保持自回归(Autoregressive, AR)模型高计算效率的同时,实现类似扩散模型的并行推理能力。其解决方案的关键在于提出SDAR(Synergistic Diffusion-Autoregression)范式:通过轻量级的范式转换,将已训练好的AR模型快速适配为分块扩散模型,仅需少量数据和计算资源即可完成转换;在推理阶段,SDAR在块间采用自回归方式保证全局一致性,而在每个块内利用离散扩散过程实现并行解码,从而兼顾效率与性能。实验证明,该方法可在不牺牲AR性能的前提下显著提升推理速度,并在大规模模型和复杂任务(如科学推理)中展现出更强的鲁棒性和适应性。

链接: https://arxiv.org/abs/2510.06303
作者: Shuang Cheng,Yihan Bian,Dawei Liu,Yuhua Jiang,Yihao Liu,Linfeng Zhang,Wenhai Wang,Qipeng Guo,Kai Chen,Biqing Qi,Bowen Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Technical report. 39 pages, including 14 pages of appendix

点击查看摘要

Abstract:We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.
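“块间自回归、块内并行扩散去掩码”的推理循环可以示意如下:每次在序列末尾追加一个全 [MASK] 块,块内做若干步置信度驱动的并行去掩码,完成后进入下一块。其中 `model` 返回 logits 的接口、`mask_id` 与各阈值均为假设,与论文实现细节可能不同。

```python
import torch

@torch.no_grad()
def sdar_decode(model, prompt_ids, n_blocks, block=16,
                steps=4, mask_id=0, thresh=0.9):
    """示意:块间自回归、块内离散扩散式并行去掩码解码。
    model(ids) 假设返回 (B, T, V) 的 logits。"""
    ids = prompt_ids.clone()
    for _ in range(n_blocks):
        blk = torch.full((ids.size(0), block), mask_id, device=ids.device)
        ids = torch.cat([ids, blk], dim=1)                # 追加一个全掩码块
        for _ in range(steps):                            # 块内迭代去掩码
            logits = model(ids)[:, -block:, :]
            probs, cand = logits.softmax(-1).max(-1)      # 各位置最自信候选
            masked = ids[:, -block:] == mask_id
            accept = masked & (probs >= thresh)           # 只接受高置信位置
            ids[:, -block:][accept] = cand[accept]
        still = ids[:, -block:] == mask_id                # 收尾:填充剩余位置
        ids[:, -block:][still] = cand[still]
    return ids
```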
zh

[AI-75] Requirements for Game-Based Learning Design Framework for Information System Integration in the Context of Post-Merger Integration

【速读】:该论文旨在解决后并购整合(Post-Merger Integration, PMI)过程中信息系统集成(Information System Integration, ISI)培训中存在的显著知识缺口,特别是针对 AMILI 和 AMILP 等已有支持方法在实际应用中面临的学习曲线高和学习动机低的问题。解决方案的关键在于引入游戏化学习设计(Game-Based Learning Design),通过将静态的方法培训转化为沉浸式、互动性强的学习体验,从而降低认知负荷、提升学习动机,并基于学习理论、认知负荷模型及严肃游戏设计框架,构建一个结构化的、面向 IS 集成的可迭代开发与现实验证的学习框架。

链接: https://arxiv.org/abs/2510.06302
作者: Ksenija Lace,Marite Kirikova
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-merger integration poses unique challenges for professionals responsible for information system integration, which aims to align and combine the diverse system architectures of merging organizations. Although theoretical and practical guidance exists for post-merger integration at the business level, there is a significant gap in training for information system integration in this context. In prior research, the methods AMILI (Support method for informed decision identification) and AMILP (Support method for informed decision-making) were introduced to support information system integration decisions in post-merger integration. However, practical application revealed a high learning curve and low learner motivation. This paper explores how game-based learning design can address these limitations by transforming static method training into an engaging learning experience. The study analyzes foundational learning theories, cognitive load and motivation models, and serious game design frameworks to identify the essential requirements for a game-based learning design framework tailored to information system integration in post-merger integration. The requirements are structured in two components: the transformation process and the resulting learning experience. The paper concludes with a plan for developing and evaluating the proposed framework through iterative design and real-world validation.
zh

[AI-76] VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生成代码时缺乏可靠形式化验证手段的问题,特别是如何有效评估生成代码与用户意图之间的对齐程度。现有基准测试依赖于人工标注的真值规范(ground-truth specifications)进行匹配,这一过程不仅耗时且高度依赖专家知识,限制了数据集规模和可靠性。论文提出的关键解决方案是引入VeriEquivBench——一个包含2,389个复杂算法问题的新基准,并采用形式化基础的等价分数(equivalence score)替代传统匹配方法,通过严格的形式化验证来评估生成代码与规范的质量。该方案显著提升了评估的自动化水平和可信度,为推动可扩展、可靠的编码代理发展提供了必要工具。

链接: https://arxiv.org/abs/2510.06296
作者: Lingfei Zeng,Fengdi Che,Xuhan Huang,Fei Ye,Xu Xu,Binhang Yuan,Jie Fu
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with 2,389 complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.
zh

[AI-77] BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression

【速读】:该论文旨在解决短时降水预报(nowcasting)中模型精度与计算效率难以兼顾的问题,尤其针对当前基于token的自回归模型存在归纳偏置缺陷和推理速度慢、扩散模型计算开销大等局限。解决方案的关键在于提出BlockGPT,一种采用分块token化(batched tokenization)策略的生成式自回归Transformer架构,其通过在每一帧内使用自注意力机制(self-attention)建模空间依赖,并跨帧使用因果注意力机制(causal attention)捕捉时间动态,从而实现对二维降水场的端到端预测。该方法在KNMI和SEVIR两个数据集上验证了其优越性,在保持高精度和事件定位能力的同时,推理速度比同类先进模型快达31倍。

链接: https://arxiv.org/abs/2510.06293
作者: Cristian Meo,Varun Sarathchandran,Avijit Majhi,Shao Hung,Carlo Saccardi,Ruben Imhoff,Roberto Deidda,Remko Uijlenhoet,Justin Dauwels
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer using batched tokenization (Block) method that predicts full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy, event localization as measured by categorical metrics, and inference speeds up to 31x faster than comparable baselines.
zh
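
BlockGPT 的时空分解可以归结为一个注意力掩码:帧内双向自注意力、帧间因果注意力。下面是该掩码构造的一个可运行示意(极简示意;帧数与每帧 token 数均为演示用假设):

```python
import torch

def blockgpt_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """帧内全连接自注意力、帧间因果注意力的布尔掩码;True 表示允许关注。"""
    n = num_frames * tokens_per_frame
    frame_idx = torch.arange(n) // tokens_per_frame   # 每个 token 所属的帧编号
    # 查询 token 只能关注帧编号不大于自身帧编号的 token:
    # 同帧内双向可见,跨帧只能看到过去
    return frame_idx.unsqueeze(1) >= frame_idx.unsqueeze(0)

mask = blockgpt_mask(num_frames=3, tokens_per_frame=4)
# 使用方式:attn_scores.masked_fill(~mask, float("-inf"))
print(mask.int())
```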

[AI-78] Traj-Transformer: Diffusion Models with Transformer for GPS Trajectory Generation

【速读】:该论文旨在解决现有轨迹生成方法在使用基于卷积的架构(如UNet)进行扩散过程中的噪声预测时,因模型容量有限而导致轨迹偏差显著及细粒度街道级细节丢失的问题。解决方案的关键在于提出Trajectory Transformer,该模型采用Transformer骨干网络同时完成条件信息嵌入与噪声预测,通过引入两种GPS坐标嵌入策略(位置嵌入与经度-纬度嵌入)并分析不同尺度下的性能表现,显著提升了轨迹生成质量并有效缓解了先前方法中存在的偏差问题。

链接: https://arxiv.org/abs/2510.06291
作者: Zhiyang Zhang,Ningcong Chen,Xin Zhang,Yanhua Li,Shen Su,Hui Lu,Jun Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread use of GPS devices has driven advances in spatiotemporal data mining, enabling machine learning models to simulate human decision making and generate realistic trajectories, addressing both data collection costs and privacy concerns. Recent studies have shown the promise of diffusion models for high-quality trajectory generation. However, most existing methods rely on convolution based architectures (e.g. UNet) to predict noise during the diffusion process, which often results in notable deviations and the loss of fine-grained street-level details due to limited model capacity. In this paper, we propose Trajectory Transformer, a novel model that employs a transformer backbone for both conditional information embedding and noise prediction. We explore two GPS coordinate embedding strategies, location embedding and longitude-latitude embedding, and analyze model performance at different scales. Experiments on two real-world datasets demonstrate that Trajectory Transformer significantly enhances generation quality and effectively alleviates the deviation issues observed in prior approaches.
zh
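
论文比较了两种 GPS 坐标嵌入策略;下面给出其中“经纬度嵌入”的一个猜测性极简示意(非官方实现,投影方式与嵌入维度均为假设):

```python
import torch
import torch.nn as nn

class LonLatEmbedding(nn.Module):
    """对经度与纬度分别做线性投影再相加,得到每个 GPS 点的向量表示(假设性示意)。"""
    def __init__(self, dim: int):
        super().__init__()
        self.lon_proj = nn.Linear(1, dim)
        self.lat_proj = nn.Linear(1, dim)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: [batch, seq_len, 2],最后一维为 (经度, 纬度)
        lon, lat = traj[..., :1], traj[..., 1:]
        return self.lon_proj(lon) + self.lat_proj(lat)

emb = LonLatEmbedding(dim=64)
out = emb(torch.randn(2, 100, 2))   # 两条各含 100 个点的轨迹
print(out.shape)                    # torch.Size([2, 100, 64])
```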

[AI-79] BuilderBench – A benchmark for generalist agents

【速读】:该论文旨在解决当前人工智能模型主要依赖模仿与精炼学习,难以应对超出已有数据范围的新问题的局限性,其核心挑战在于缺乏可扩展的学习机制以支持智能体通过交互实现自主探索与学习。解决方案的关键在于提出BuilderBench这一基准测试平台,该平台聚焦于开放式的环境探索,要求智能体在无外部监督的情况下,通过与物理块状物体的交互学习通用环境规律,并最终构建未见过的目标结构。其创新之处在于结合了硬件加速的机器人仿真器和包含超过42个精心筛选的多样化目标结构的任务套件,涵盖物理理解、数学推理及长时程规划能力,从而推动具备具身推理(embodied reasoning)能力的预训练智能体研究发展。

链接: https://arxiv.org/abs/2510.06288
作者: Raj Ghugare,Catherine Ji,Kathryn Wantlin,Jin Schofield,Benjamin Eysenbach
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL and Code: this https URL

点击查看摘要

Abstract:Today’s AI models learn primarily through mimicry and sharpening, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills for exploring and learning through experience. Finding a scalable learning mechanism for developing agents that learn through interaction remains a major open problem. In this work, we introduce BuilderBench, a benchmark to accelerate research into agent pre-training that centers open-ended exploration. BuilderBench requires agents to learn how to build any structure using blocks. BuilderBench is equipped with (1) a hardware accelerated simulator of a robotic agent interacting with various physical blocks, and (2) a task-suite with over 42 diverse target structures that are carefully curated to test an understanding of physics, mathematics, and long-horizon planning. During training, agents have to explore and learn general principles about the environment without any external supervision. During evaluation, agents have to build the unseen target structures from the task suite. Solving these tasks requires a sort of embodied reasoning that is not reflected in words but rather in actions, experimenting with different strategies and piecing them together. Our experiments show that many of these tasks challenge the current iteration of algorithms. Hence, we also provide a “training wheels” protocol, in which agents are trained and evaluated to build a single target structure from the task suite. Finally, we provide single-file implementations of six different algorithms as a reference point for researchers.
zh

[AI-80] RVFL-X: A Novel Randomized Network Based on Complex Transformed Real-Valued Tabular Datasets

【速读】:该论文旨在解决随机神经网络(Randomized Neural Networks, RNNs)中复数域表示能力难以应用的问题,即缺乏有效的将实值表格数据转换为复值表示的方法。其解决方案的关键在于提出两种生成复值表示的新方法:一种是自然变换方法,另一种是基于自动编码器(Autoencoder)驱动的映射方法;在此基础上构建了RVFL-X模型——一个复数域扩展的随机向量功能链接(Random Vector Functional Link, RVFL)网络,通过在输入、权重和激活函数中引入复数运算,实现对复值数据的处理并输出实值结果,从而显著提升了模型性能,在80个UCI真实数据集上的实验验证了其优于原始RVFL及当前最优RNN变体的鲁棒性和有效性。

链接: https://arxiv.org/abs/2510.06278
作者: M. Sajid,Mushir Akhtar,A. Quadir,M. Tanveer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in neural networks, supported by foundational theoretical insights, emphasize the superior representational power of complex numbers. However, their adoption in randomized neural networks (RNNs) has been limited due to the lack of effective methods for transforming real-valued tabular datasets into complex-valued representations. To address this limitation, we propose two methods for generating complex-valued representations from real-valued datasets: a natural transformation and an autoencoder-driven method. Building on these mechanisms, we propose RVFL-X, a complex-valued extension of the random vector functional link (RVFL) network. RVFL-X integrates complex transformations into real-valued datasets while maintaining the simplicity and efficiency of the original RVFL architecture. By leveraging complex components such as input, weights, and activation functions, RVFL-X processes complex representations and produces real-valued outputs. Comprehensive evaluations on 80 real-valued UCI datasets demonstrate that RVFL-X consistently outperforms both the original RVFL and state-of-the-art (SOTA) RNN variants, showcasing its robustness and effectiveness across diverse application domains.
zh
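
下面用 NumPy 给出 RVFL-X 思路的极简示意(非官方实现;摘要未给出“自然变换”的具体形式,此处用占位式复数化代替,隐层宽度、激活与正则系数亦为假设):

```python
import numpy as np

def rvfl_x_fit_predict(X_tr, y_tr, X_te, hidden=128, reg=1e-2, seed=0):
    """复数随机隐层 + 直接链接 + 岭回归闭式解,输出取实部(假设性示意)。"""
    rng = np.random.default_rng(seed)
    Z_tr, Z_te = X_tr + 1j * X_tr, X_te + 1j * X_te         # 占位式复数化(假设)
    W = (rng.standard_normal((X_tr.shape[1], hidden))
         + 1j * rng.standard_normal((X_tr.shape[1], hidden)))  # 随机复权重,不训练
    act = lambda Z: np.tanh(Z.real) + 1j * np.tanh(Z.imag)     # 分离式复激活
    H_tr = np.hstack([act(Z_tr @ W), X_tr])                    # 隐层与直接链接拼接
    H_te = np.hstack([act(Z_te @ W), X_te])
    A = H_tr.conj().T @ H_tr + reg * np.eye(H_tr.shape[1])
    beta = np.linalg.solve(A, H_tr.conj().T @ y_tr)            # 岭回归闭式解
    return (H_te @ beta).real                                  # 输出实值预测

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
print(rvfl_x_fit_predict(X[:150], y[:150], X[150:])[:3])
```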

[AI-81] Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization

【速读】:该论文旨在解决当前人工智能模型在推理能力评估与提升方面的关键瓶颈问题——即缺乏对“推理能力”的统一、可量化且具备泛化意义的定义和衡量标准。现有研究多聚焦于模式识别任务的泛化性能(如out-of-distribution, OoD),但推理任务所需的系统性、分步逻辑处理(System2-style reasoning)尚未形成一致的评价框架。为此,作者提出“复杂度外分布(Complexity Out of Distribution, Complexity OoD)”作为新的通用范式,其核心在于:当测试实例所需解的最小复杂度(无论是表示层面的结构丰富度还是计算层面的推理步骤数)超过训练集中所有样本时,若模型仍能保持良好性能,则视为实现了 Complexity OoD 一般化。该方案通过引入 Kolmogorov 复杂度及其操作代理(如关系数量、推理步数)来形式化复杂度,并明确区分其与传统长度或组合性 OoD 的差异。解决方案的关键在于将推理建模为一种跨复杂度层级的泛化问题,从而指导基准设计、监督信号重构、归纳偏置构建及训练机制优化,最终推动模型从单纯的数据规模扩展转向具备结构感知与动态计算分配能力的稳健推理架构。

链接: https://arxiv.org/abs/2510.06274
作者: Mohammad Mahdi Samiei Paqaleh,Arash Marioriyad,Arman Tahmasebi-Zadeh,Mohamadreza Fereydooni,Mahdi Ghaznavai,Mahdieh Soleymani Baghshah
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent progress has pushed AI frontiers from pattern recognition tasks toward problems that require step by step, System2 style reasoning, especially with large language models. Yet, unlike learning, where generalization and out of distribution (OoD) evaluation concepts are well formalized, there is no clear, consistent definition or metric for reasoning ability. We propose Complexity Out of Distribution (Complexity OoD) generalization as a framework and problem setting to define and measure reasoning. A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity, either representational (richer solution structure) or computational (more reasoning steps/program length), exceeds that of all training examples. We formalize complexity via solution description Kolmogorov complexity and operational proxies (e.g., object/relation counts; reasoning step counts), clarifying how Complexity OoD differs from length and compositional OoD. This lens unifies learning and reasoning: many cases solvable with System1 like processing at low complexity become System2 like under complexity pressure, while System2 can be viewed as generalization over solution structures. We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack: incorporating complexity into benchmark and evaluation metric design, rethinking supervision to target solution traces, seeking and designing inductive biases for Complexity OoD generalization, addressing learning to reason spillovers such as spurious shortcuts, semantic robustness, catastrophic forgetting, and step wise calibration. Because Complexity OoD cannot be solved by scaling data alone, progress toward robust reasoning will require architectures and training regimes that explicitly model and allocate computation with respect to complexity.
zh
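
落地 Complexity OoD 评测的第一步,是用可操作的复杂度代理(如推理步数)划分训练与测试集,使测试样本的最小所需复杂度严格高于训练样本。下面是一个极简示意(阈值与数据字段均为假设):

```python
def complexity_ood_split(dataset, step_threshold=3):
    """按解答所需推理步数划分:训练集只含低复杂度样本,
    测试集只含复杂度严格更高的样本(假设每个样本带 steps 字段)。"""
    train = [ex for ex in dataset if ex["steps"] <= step_threshold]
    test = [ex for ex in dataset if ex["steps"] > step_threshold]
    return train, test

data = [
    {"question": "1+2", "steps": 1},
    {"question": "(1+2)*3", "steps": 2},
    {"question": "((1+2)*3-4)/5", "steps": 4},   # 仅出现在测试集
]
train, test = complexity_ood_split(data)
print(len(train), len(test))  # 2 1
```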

[AI-82] MCCE: A Framework for Multi-LLM Collaborative Co-Evolution

【速读】:该论文旨在解决多目标离散优化问题(如分子设计)中因组合空间庞大且无结构而导致的传统进化算法易陷入局部最优的问题,同时兼顾专家知识引导与持续学习能力的缺乏。其解决方案的关键在于提出一种混合框架——多大语言模型协同共进化(Multi-LLM Collaborative Co-evolution, MCCE),该框架将一个参数冻结的闭源大语言模型(closed-source LLM)与一个轻量级可微调模型相结合,通过轨迹记忆机制记录历史搜索过程,并利用强化学习逐步优化小模型;两个模型在全局探索中相互支持、互补,而非简单蒸馏,从而实现知识驱动的探索与经验驱动的学习协同演进,显著提升帕累托前沿质量并优于现有基线方法。

链接: https://arxiv.org/abs/2510.06270
作者: Nian Ran,Zhongzheng Li,Yue Wang,Qingsong Ran,Xiaoyuan Zhang,Shikun Feng,Richard Allmendinger,Xiaoguang Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-objective discrete optimization problems, such as molecular design, pose significant challenges due to their vast and unstructured combinatorial spaces. Traditional evolutionary algorithms often get trapped in local optima, while expert knowledge can provide crucial guidance for accelerating convergence. Large language models (LLMs) offer powerful priors and reasoning ability, making them natural optimizers when expert knowledge matters. However, closed-source LLMs, though strong in exploration, cannot update their parameters and thus cannot internalize experience. Conversely, smaller open models can be continually fine-tuned but lack broad knowledge and reasoning strength. We introduce Multi-LLM Collaborative Co-evolution (MCCE), a hybrid framework that unites a frozen closed-source LLM with a lightweight trainable model. The system maintains a trajectory memory of past search processes; the small model is progressively refined via reinforcement learning, with the two models jointly supporting and complementing each other in global exploration. Unlike model distillation, this process enhances the capabilities of both models through mutual inspiration. Experiments on multi-objective drug design benchmarks show that MCCE achieves state-of-the-art Pareto front quality and consistently outperforms baselines. These results highlight a new paradigm for enabling continual evolution in hybrid LLM systems, combining knowledge-driven exploration with experience-driven learning.
zh
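
下面给出 MCCE 单轮迭代的流程示意(非官方实现;closed_llm、small_model、memory、evaluate 等接口均为假设,仅说明“冻结大模型 + 可训练小模型 + 轨迹记忆”如何协同):

```python
def mcce_round(closed_llm, small_model, memory, evaluate, top_k=32):
    """一轮混合协同进化:双模型基于记忆摘要各自提出候选解,统一评估后
    写回轨迹记忆,并用评估信号对小模型做强化学习更新(假设性接口)。"""
    context = memory.summarize(top_k)                  # 历史搜索轨迹摘要
    candidates = closed_llm.propose(context) + small_model.propose(context)
    scored = [(cand, evaluate(cand)) for cand in candidates]  # 多目标评估
    memory.extend(scored)                              # 经验写回记忆
    small_model.rl_update(scored)                      # 经验驱动的持续学习
    return scored                                      # 供外层维护帕累托前沿
```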

[AI-83] RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases

【速读】:该论文旨在解决超罕见疾病电子健康记录(EHR)数据稀缺且难以安全共享的问题,以支持精准医疗研究。其核心挑战在于如何生成既保持生物合理性又具备隐私保护能力的合成EHR轨迹。解决方案的关键在于提出RareGraph-Synth框架,该框架将五个公共生物医学知识资源(包括Orphanet、人类表型本体HPO、GARD罕见病知识图谱等)整合为一个包含约800万条类型边的异构知识图谱(heterogeneous knowledge graph),并通过提取元路径得分(meta-path scores)动态调节扩散过程中的噪声调度(noise schedule),从而引导生成过程中实现实验室检查、药物使用与不良事件的共现符合生物学逻辑;同时,在反向去噪阶段生成无受保护健康信息(PHI)的时间戳序列,有效提升数据真实性与隐私安全性。实验表明,该方法在降低类别最大均值差异(categorical Maximum Mean Discrepancy)方面显著优于无引导扩散基线和GAN模型,并通过黑盒成员推理攻击评估(AUROC≈0.53)验证了更强的抗再识别能力。

链接: https://arxiv.org/abs/2510.06267
作者: Khartik Uppalapati,Shakeel Abdulkareem,Bora Yimenicioglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 2 tables. Submitted to IEEE International Conference on Data Science and Advanced Analytics (DSAA)

点击查看摘要

Abstract:We propose RareGraph-Synth, a knowledge-guided, continuous-time diffusion framework that generates realistic yet privacy-preserving synthetic electronic-health-record (EHR) trajectories for ultra-rare diseases. RareGraph-Synth unifies five public resources: Orphanet/Orphadata, the Human Phenotype Ontology (HPO), the GARD rare-disease KG, PrimeKG, and the FDA Adverse Event Reporting System (FAERS) into a heterogeneous knowledge graph comprising approximately 8 M typed edges. Meta-path scores extracted from this 8-million-edge KG modulate the per-token noise schedule in the forward stochastic differential equation, steering generation toward biologically plausible lab-medication-adverse-event co-occurrences while retaining score-based diffusion model stability. The reverse denoiser then produces timestamped sequences of lab-code, medication-code, and adverse-event-flag triples that contain no protected health information. On simulated ultra-rare-disease cohorts, RareGraph-Synth lowers categorical Maximum Mean Discrepancy by 40% relative to an unguided diffusion baseline and by more than 60% versus GAN counterparts, without sacrificing downstream predictive utility. A black-box membership-inference evaluation using the DOMIAS attacker yields AUROC ≈ 0.53, well below the 0.55 safe-release threshold and substantially better than the ≈ 0.61 ± 0.03 observed for non-KG baselines, demonstrating strong resistance to re-identification. These results suggest that integrating biomedical knowledge graphs directly into diffusion noise schedules can simultaneously enhance fidelity and privacy, enabling safer data sharing for rare-disease research.
zh
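
其核心机制是用元路径得分对逐 token 噪声调度进行调制;下面给出一个可运行的极简示意(非官方实现,调制函数的具体形式为假设):

```python
import numpy as np

def guided_noise_schedule(base_beta, metapath_scores, strength=0.5):
    """示意:用知识图谱元路径得分对前向扩散噪声调度做逐 token 调制。
    得分高(生物学上更合理)的 token 噪声增长更慢,使反向去噪更倾向
    保留合理的“检验-用药-不良事件”共现(假设性调制形式)。"""
    base_beta = np.asarray(base_beta)          # [T] 基础噪声调度
    s = np.asarray(metapath_scores)            # [num_tokens],已归一化到 [0, 1]
    scale = 1.0 - strength * s                 # 得分越高,缩放越小
    return base_beta[None, :] * scale[:, None] # [num_tokens, T] 逐 token 调度

betas = guided_noise_schedule(np.linspace(1e-4, 0.02, 1000), [0.1, 0.9])
print(betas.shape)  # (2, 1000)
```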

[AI-84] LLM-Driven Rubric-Based Assessment of Algebraic Competence in Multi-Stage Block Coding Tasks with Design and Field Evaluation

【速读】:该论文旨在解决在线教育平台中传统评估方法难以全面衡量学生认知深度的问题,尤其是在数学与STEM教学中,如何实现对解题过程质量的精准评价。解决方案的关键在于提出并验证了一种基于评分量规(rubric)的评估框架,该框架由大语言模型(Large Language Model, LLM)驱动,能够将每道题目拆解为五个预定义的评分维度,并结合学生在块编程(block coding)任务中的中间响应数据,实现对答案正确性与思维过程质量的同步评估。实证研究表明,该系统输出与专家评判高度一致,且能持续生成符合课程目标的过程导向反馈,从而验证了其有效性与可扩展性。

链接: https://arxiv.org/abs/2510.06253
作者: Yong Oh Lee,Byeonghun Bang,Sejun Oh
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As online education platforms continue to expand, there is a growing need for assessment methods that not only measure answer accuracy but also capture the depth of students’ cognitive processes in alignment with curriculum objectives. This study proposes and evaluates a rubric-based assessment framework powered by a large language model (LLM) for measuring algebraic competence in real-world-context block coding tasks. The problem set, designed by mathematics education experts, aligns each problem segment with five predefined rubric dimensions, enabling the LLM to assess both the correctness and quality of students’ problem-solving processes. The system was implemented on an online platform that records all intermediate responses and employs the LLM for rubric-aligned achievement evaluation. To examine the practical effectiveness of the proposed framework, we conducted a field study involving 42 middle school students engaged in multi-stage quadratic equation tasks with block coding. The study integrated learner self-assessments and expert ratings to benchmark the system’s outputs. The LLM-based rubric evaluation showed strong agreement with expert judgments and consistently produced rubric-aligned, process-oriented feedback. These results demonstrate both the validity and scalability of incorporating LLM-driven rubric assessment into online mathematics and STEM education platforms.
zh
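
下面给出“按评分量规组装一次 LLM 评估请求”的示意(非官方实现;评分维度、分值范围与输出格式均为占位假设):

```python
import json

RUBRIC = [  # 假设的五个评分维度,仅作占位
    "变量设置", "方程建立", "求解步骤", "结果解释", "代码块组织",
]

def build_rubric_prompt(problem, student_steps):
    """把题目、学生的中间作答与评分量规组装成一次 LLM 评估请求
    (实际系统的量规内容与输出格式未公开,此处仅为示意)。"""
    return (
        "请依照下列评分维度,对学生的解题过程逐维度给出 0-4 分与一句过程性反馈,"
        "以 JSON 数组输出,每个元素含 dimension/score/feedback 字段。\n"
        f"评分维度:{json.dumps(RUBRIC, ensure_ascii=False)}\n"
        f"题目:{problem}\n"
        f"学生过程记录:{json.dumps(student_steps, ensure_ascii=False)}"
    )

print(build_rubric_prompt("解一元二次方程 x^2-5x+6=0",
                          ["设两根为 p、q", "p+q=5, pq=6", "得 x=2 或 x=3"]))
```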

[AI-85] DynBenchmark: Customizable Ground Truths to Benchmark Community Detection and Tracking in Temporal Networks

【速读】:该论文旨在解决现有社区检测算法评估基准在追踪真实网络中社区演化方面存在的不足,即大多数基准模型无法模拟社区的动态变化过程(如增长、收缩、合并、分裂、出现或消失)。其解决方案的关键在于提出一种以社区为中心的可定制演化图生成模型,能够精确控制社区结构的时序演变,并同步生成包含节点出生、死亡及跨社区迁移的动态网络,从而为评估算法在追踪节点归属和识别社区演化方面的性能提供更贴近现实的测试环境。

链接: https://arxiv.org/abs/2510.06245
作者: Laurent Brisson(IMT Atlantique - DSD),Cécile Bothorel(IMT Atlantique - DSD),Nicolas Duminy(IMT Atlantique, IMT Atlantique - DSD)
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph models help understand network dynamics and evolution. Creating graphs with controlled topology and embedded partitions is a common strategy for evaluating community detection algorithms. However, existing benchmarks often overlook the need to track the evolution of communities in real-world networks. To address this, a new community-centered model is proposed to generate customizable evolving community structures where communities can grow, shrink, merge, split, appear or disappear. This benchmark also generates the underlying temporal network, where nodes can appear, disappear, or move between communities. The benchmark has been used to test three methods, measuring their performance in tracking nodes’ cluster membership and detecting community evolution. Python libraries, drawing utilities, and validation metrics are provided to compare ground truth with algorithm results for detecting dynamic communities.
zh
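
下面是“可定制社区演化真值”这一思路的极简示意(非官方实现;事件概率与触发条件均为假设),演示合并/分裂事件如何与真值日志同步生成:

```python
import random

def evolve_step(communities, t, rng):
    """每个时间步以小概率触发合并或分裂事件,并记录事件日志,
    作为评估动态社区检测算法的 ground truth(假设性实现)。"""
    events = []
    if len(communities) > 1 and rng.random() < 0.2:        # 合并两个社区
        a, b = sorted(rng.sample(range(len(communities)), 2))
        merged = communities[a] | communities[b]
        communities = [c for i, c in enumerate(communities) if i not in (a, b)]
        communities.append(merged)
        events.append((t, "merge"))
    idx = max(range(len(communities)), key=lambda i: len(communities[i]))
    if len(communities[idx]) > 6 and rng.random() < 0.2:   # 分裂最大的社区
        members = sorted(communities[idx])
        communities[idx] = set(members[len(members) // 2:])
        communities.append(set(members[: len(members) // 2]))
        events.append((t, "split"))
    return communities, events

rng = random.Random(0)
comms = [set(range(10)), set(range(10, 16))]
for t in range(8):
    comms, ev = evolve_step(comms, t, rng)
    print(t, [len(c) for c in comms], ev)
```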

[AI-86] Exploring Human-AI Collaboration Using Mental Models of Early Adopters of Multi-Agent Generative AI Tools

【速读】:该论文旨在解决多智能体生成式AI(multi-agent generative AI)在实际应用中如何被早期采用者理解和使用的问题,尤其关注人类与AI协作机制、协作动态性以及透明度等关键挑战。其解决方案的关键在于揭示早期开发者将多智能体系统视为“团队”结构的认知模型,其中智能体根据角色和任务分工协作,呈现出从AI主导到人机协同的连续谱系;同时强调透明度是建立信任、追踪错误、防止滥用的核心策略,并提出需设计清晰的沟通机制以应对层叠式透明度问题,从而为CSCW(计算机支持的协同工作)研究提供关于人-智能体及智能体间协作机制、监督策略与可定制化工作流的设计洞见。

链接: https://arxiv.org/abs/2510.06224
作者: Suchismita Naik,Austin L. Toombs,Amanda Snellinger,Scott Saponas,Amanda K. Hall
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 19 pages, 1 table, 2 figures

点击查看摘要

Abstract:With recent advancements in multi-agent generative AI (Gen AI), technology organizations like Microsoft are adopting these complex tools, redefining AI agents as active collaborators in complex workflows rather than as passive tools. In this study, we investigated how early adopters and developers conceptualize multi-agent Gen AI tools, focusing on how they understand human-AI collaboration mechanisms, general collaboration dynamics, and transparency in the context of AI tools. We conducted semi-structured interviews with 13 developers, all early adopters of multi-agent Gen AI technology who work at Microsoft. Our findings revealed that these early adopters conceptualize multi-agent systems as “teams” of specialized role-based and task-based agents, such as assistants or reviewers, structured similar to human collaboration models and ranging from AI-dominant to AI-assisted, user-controlled interactions. We identified key challenges, including error propagation, unpredictable and unproductive agent loop behavior, and the need for clear communication to mitigate the layered transparency issues. Early adopters’ perspectives about the role of transparency underscored its importance as a way to build trust, verify and trace errors, and prevent misuse, errors, and leaks. The insights and design considerations we present contribute to CSCW research about collaborative mechanisms with capabilities ranging from AI-dominant to AI-assisted interactions, transparency and oversight strategies in human-agent and agent-agent interactions, and how humans make sense of these multi-agent systems as dynamic, role-diverse collaborators which are customizable for diverse needs and workflows. We conclude with future research directions that extend CSCW approaches to the design of inter-agent and human mediation interactions.
zh

[AI-87] A Multimodal GUI Architecture for Interfacing with LLM-Based Conversational Assistants

【速读】:该论文旨在解决现有图形用户界面(GUI)缺乏对语音指令原生支持的问题,使得用户无法通过自然语言直接操作未设计为语音交互的应用程序。其解决方案的关键在于提出一种基于模型上下文协议(Model Context Protocol, MCP)的架构,该架构通过MVVM(Model-View-ViewModel)模式中的ViewModel组件暴露应用的导航图和语义信息,将当前视图可用工具与全局应用工具相结合,从而实现语音指令到GUI动作的可靠映射,并确保跨模态反馈一致性。此设计不仅提升了语音可访问性,还为未来操作系统级超级助手(如计算机使用代理,CUAs)提供原生兼容能力,同时评估了本地部署开源大语言模型(LLM)在多模态UI中的隐私与性能表现,验证了小型开放权重模型在准确率上接近主流专有模型,但需企业级硬件保障响应速度。

链接: https://arxiv.org/abs/2510.06223
作者: Hans G.W. van Dam
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 24 pages, 19 figures

点击查看摘要

Abstract:Advances in large language models (LLMs) and real-time speech recognition now make it possible to issue any graphical user interface (GUI) action through natural language and receive the corresponding system response directly through the GUI. Most production applications were never designed with speech in mind. This article provides a concrete architecture that enables GUIs to interface with LLM-based speech-enabled assistants. The architecture makes an application’s navigation graph and semantics available through the Model Context Protocol (MCP). The ViewModel, part of the MVVM (Model-View-ViewModel) pattern, exposes the application’s capabilities to the assistant by supplying both tools applicable to a currently visible view and application-global tools extracted from the GUI tree router. This architecture facilitates full voice accessibility while ensuring reliable alignment between spoken input and the visual interface, accompanied by consistent feedback across modalities. It future-proofs apps for upcoming OS super assistants that employ computer use agents (CUAs) and natively consume MCP if an application provides it. To address concerns about privacy and data security, the practical effectiveness of locally deployable, open-weight LLMs for speech-enabled multimodal UIs is evaluated. Findings suggest that recent smaller open-weight models approach the performance of leading proprietary models in overall accuracy and require enterprise-grade hardware for fast responsiveness.
zh
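
下面用普通 Python 字典示意“ViewModel 同时暴露当前视图工具与应用全局工具”的结构(非官方实现;真实系统经 MCP 协议暴露工具,此处的注册表与示例工具均为假设):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class ViewModelTools:
    """ViewModel 向助手暴露“当前视图工具 + 应用全局工具”合并注册表的示意。"""
    global_tools: Dict[str, Callable] = field(default_factory=dict)
    view_tools: Dict[str, Callable] = field(default_factory=dict)

    def available_tools(self) -> Dict[str, Callable]:
        return {**self.global_tools, **self.view_tools}  # 视图工具可覆盖全局同名工具

    def invoke(self, name: str, **kwargs):
        return self.available_tools()[name](**kwargs)

vm = ViewModelTools(
    global_tools={"navigate": lambda route: f"跳转到 {route}"},
    view_tools={"set_quantity": lambda value: f"数量设为 {value}"},
)
print(vm.invoke("set_quantity", value=3))   # 语音指令映射到当前视图动作
print(vm.invoke("navigate", route="/cart")) # 全局导航工具始终可用
```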

[AI-88] AgentBuilder: Exploring Scaffolds for Prototyping User Experiences of Interface Agents

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 接口代理(agent)开发中用户体验(agent experience)设计缺乏可访问性的问题,即如何为非AI工程师的更广泛群体提供支持,使其能够有效参与代理体验原型设计。解决方案的关键在于通过需求挖掘研究识别出代理体验原型设计的核心活动与系统能力,并基于此构建了一个名为 AgentBuilder 的设计探针(design probe),以验证其在实际场景中的有效性,从而为开发者提供可操作的工具和洞察,提升代理原型设计过程的效率与包容性。

链接: https://arxiv.org/abs/2510.04452
作者: Jenny T. Liang,Titus Barik,Jeffrey Nichols,Eldon Schoop,Ruijia Cheng
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interface agents powered by generative AI models (referred to as “agents”) can automate actions based on user commands. An important aspect of developing agents is their user experience (i.e., agent experience). There is a growing need to provide scaffolds for a broader set of individuals beyond AI engineers to prototype agent experiences, since they can contribute valuable perspectives to designing agent experiences. In this work, we explore the affordances agent prototyping systems should offer by conducting a requirements elicitation study with 12 participants with varying experience with agents. We identify key activities in agent experience prototyping and the desired capabilities of agent prototyping systems. We instantiate those capabilities in the AgentBuilder design probe for agent prototyping. We conduct an in situ agent prototyping study with 14 participants using AgentBuilder to validate the design requirements and elicit insights on how developers prototype agents and what their needs are in this process.
zh

[AI-89] TiltXter: CNN-based Electro-tactile Rendering of Tilt Angle for Telemanipulation of Pasteur Pipettes

【速读】:该论文旨在解决可变形物体在机器人夹爪抓取过程中形状发生显著变化,导致用户对物体姿态感知模糊的问题,从而影响机器人定位精度和远程操作(telemanipulation)的准确性。其解决方案的关键在于提出一种基于卷积神经网络(Convolutional Neural Networks, CNN)的新方法,通过分析嵌入在Robotiq夹爪中的触觉传感器阵列数据,实时识别物体倾斜角度,并将该信息转化为清晰的触觉模式(tactile pattern),通过电刺激阵列反馈给操作者,从而提升用户在远程操作中的姿态辨识能力和任务成功率。实验表明,使用CNN生成的触觉模式使用户对倾斜的识别准确率从23.13%提升至57.9%,远程操作的成功率从53.12%提高到92.18%。

链接: https://arxiv.org/abs/2409.15838
作者: Miguel Altamirano Cabrera,Jonathan Tirado,Aleksey Fedoseev,Oleg Sautenkov,Vladimir Poliakov,Pavel Kopanev,Dzmitry Tsetserukou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Manuscript accepted to IEEE Telepresence 2024. arXiv admin note: text overlap with arXiv:2204.03521 by other authors

点击查看摘要

Abstract:The shape of deformable objects can change drastically during grasping by robotic grippers, causing an ambiguous perception of their alignment and hence resulting in errors in robot positioning and telemanipulation. Rendering clear tactile patterns is fundamental to increasing users’ precision and dexterity through tactile haptic feedback during telemanipulation. Therefore, different methods have to be studied to decode the sensors’ data into haptic stimuli. This work presents a telemanipulation system for plastic pipettes that consists of a Force Dimension Omega.7 haptic interface endowed with two electro-stimulation arrays and two tactile sensor arrays embedded in the 2-finger Robotiq gripper. We propose a novel approach based on convolutional neural networks (CNN) to detect the tilt of deformable objects. The CNN generates a tactile pattern based on recognized tilt data to render further electro-tactile stimuli provided to the user during the telemanipulation. The study has shown that using the CNN algorithm, tilt recognition by users increased from 23.13% with the downsized data to 57.9%, and the success rate during teleoperation increased from 53.12% using the downsized data to 92.18% using the tactile patterns generated by the CNN.
zh
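
下面给出“触觉阵列 → 倾斜类别”小型 CNN 的结构示意(非论文官方网络;输入通道数、阵列尺寸与类别数均为假设),识别出的类别可再映射为电触觉刺激图案:

```python
import torch
import torch.nn as nn

class TiltCNN(nn.Module):
    """触觉压力图到倾斜角类别的小型 CNN(假设性结构)。"""
    def __init__(self, num_tilt_classes: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),  # 两指各一片阵列作为两个通道
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_tilt_classes),
        )

    def forward(self, x):  # x: [batch, 2, H, W] 的触觉压力图
        return self.net(x)

model = TiltCNN()
logits = model(torch.randn(4, 2, 8, 4))  # 假设每片阵列为 8x4 个触点
print(logits.shape)                      # torch.Size([4, 5])
```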

[AI-90] DeepXPalm: Tilt and Position Rendering using Palm-worn Haptic Display and CNN-based Tactile Pattern Recognition

【速读】:该论文旨在解决在远程操作柔性物体(如塑料移液管)时,由于物体形状动态变化导致用户对物体姿态感知模糊、进而引发机器人定位误差的问题。其核心挑战在于如何准确识别并分类物体的倾斜角度和位置信息,以提供清晰的触觉反馈。解决方案的关键在于提出一种基于卷积神经网络(Convolutional Neural Networks, CNN)的新方法,通过CNN模型从多点触觉传感器数据中提取特征并生成掩码(mask),用于指导多接触式触觉刺激的呈现,从而显著提升用户对物体姿态的识别准确率——实验表明,从直接使用原始数据时的9.67%提升至82.5%。

链接: https://arxiv.org/abs/2204.03521
作者: Altamirano Cabrera Miguel,Sautenkov Oleg,Tirado Jonathan,Fedoseev Aleksey,Kopanev Pavel,Kajimoto Hiroyuki,Tsetserukou Dzmitry
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted paper in IEEE Haptic Symposium 2022, IEEE copyright

点击查看摘要

Abstract:Telemanipulation of deformable objects requires high precision and dexterity from the users, which can be increased by kinesthetic and tactile feedback. However, the object shape can change dynamically, causing ambiguous perception of its alignment and hence errors in the robot positioning. Therefore, the tilt angle and position classification problem has to be solved to present a clear tactile pattern to the user. This work presents a telemanipulation system for plastic pipettes consisting of a multi-contact haptic device LinkGlide to deliver haptic feedback at the users’ palm and two tactile sensors array embedded in the 2-finger Robotiq gripper. We propose a novel approach based on Convolutional Neural Networks (CNN) to detect the tilt and position while grasping deformable objects. The CNN generates a mask based on recognized tilt and position data to render further multi-contact tactile stimuli provided to the user during the telemanipulation. The study has shown that using the CNN algorithm and the preset mask, tilt, and position recognition by users is increased from 9.67% using the direct data to 82.5%.
zh

[AI-91] GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations NEURIPS2025

【速读】:该论文旨在解决等离子体湍流建模中因高维非线性动力学带来的计算成本过高与物理机制缺失的问题,尤其针对传统降阶模型(reduced-order models)无法捕捉完整五维(5D)分布函数演化中非线性效应的局限性。解决方案的关键在于提出GyroSwin——首个可扩展的5D神经代理模型,其核心创新包括:(i) 将分层视觉Transformer扩展至5D空间,(ii) 引入跨注意力机制和集成模块以实现静电势场与分布函数之间的3D↔5D潜在交互,以及(iii) 借鉴非线性物理原理进行通道级模式分离。该方法在保持物理可验证性的前提下,将全解析非线性gyrokinetic模拟的计算成本降低三个数量级,并显著提升热通量预测精度,同时成功捕捉湍流动能级联过程。

链接: https://arxiv.org/abs/2510.07314
作者: Fabian Paischer,Gianluca Galletti,William Hornsby,Paul Setinek,Lorenzo Zanisi,Naomi Carey,Stanislas Pamela,Johannes Brandstetter
机构: 未知
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Nuclear fusion plays a pivotal role in the quest for reliable and sustainable energy production. A major roadblock to viable fusion power is understanding plasma turbulence, which significantly impairs plasma confinement, and is vital for next-generation reactor design. Plasma turbulence is governed by the nonlinear gyrokinetic equation, which evolves a 5D distribution function over time. Due to its high computational cost, reduced-order models are often employed in practice to approximate turbulent transport of energy. However, they omit nonlinear effects unique to the full 5D dynamics. To tackle this, we introduce GyroSwin, the first scalable 5D neural surrogate that can model 5D nonlinear gyrokinetic simulations, thereby capturing the physical phenomena neglected by reduced models, while providing accurate estimates of turbulent heat fluxes. GyroSwin (i) extends hierarchical Vision Transformers to 5D, (ii) introduces cross-attention and integration modules for latent 3D↔5D interactions between electrostatic potential fields and the distribution function, and (iii) performs channelwise mode separation inspired by nonlinear physics. We demonstrate that GyroSwin outperforms widely used reduced numerics on heat flux prediction, captures the turbulent energy cascade, and reduces the cost of fully resolved nonlinear gyrokinetics by three orders of magnitude while remaining physically verifiable. GyroSwin shows promising scaling laws, tested up to one billion parameters, paving the way for scalable neural surrogates for gyrokinetic simulations of plasma turbulence.
zh

[AI-92] Expressive and Scalable Quantum Fusion for Multimodal Learning

【速读】:该论文旨在解决多模态学习中传统融合机制在表达高阶特征交互时面临的参数爆炸与计算效率低下的问题。解决方案的关键在于提出一种量子融合层(Quantum Fusion Layer, QFL),其核心是利用可训练的量子电路实现模态间纠缠特征交互的学习,同时保持线性参数增长;该方法基于量子信号处理原理,能够以线性参数规模高效表示跨模态的高阶多项式交互关系,并通过理论分离示例展示了其相较于低秩张量方法的潜在量子查询优势。

链接: https://arxiv.org/abs/2510.06938
作者: Tuyen Nguyen,Trong Nghia Hoang,Phi Le Nguyen,Hai L. Vu,Truong Cong Thang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 22 pages, 4 figures

点击查看摘要

Abstract:The aim of this paper is to introduce a quantum fusion mechanism for multimodal learning and to establish its theoretical and empirical potential. The proposed method, called the Quantum Fusion Layer (QFL), replaces classical fusion schemes with a hybrid quantum-classical procedure that uses parameterized quantum circuits to learn entangled feature interactions without requiring exponential parameter growth. Supported by quantum signal processing principles, the quantum component efficiently represents high-order polynomial interactions across modalities with linear parameter scaling, and we provide a separation example between QFL and low-rank tensor-based methods that highlights potential quantum query advantages. In simulation, QFL consistently outperforms strong classical baselines on small but diverse multimodal tasks, with particularly marked improvements in high-modality regimes. These results suggest that QFL offers a fundamentally new and scalable approach to multimodal fusion that merits deeper exploration on larger systems.
zh

[AI-93] Bayesian Nonparametric Dynamical Clustering of Time Series

【速读】:该论文旨在解决时间序列聚类中簇数量不确定且随时间动态演化的问题,尤其在面对无界时间序列数据时如何避免簇的过度分裂。其核心解决方案是基于贝叶斯非参数方法,采用分层狄利克雷过程(Hierarchical Dirichlet Process, HDP)作为切换线性动态系统(Switching Linear Dynamical System, SLDS)参数的先验,并结合高斯过程(Gaussian Process, GP)对每个簇内部的幅度变化和时间对齐差异进行建模。通过引入变分下界推断框架,该方法能够在离线与在线场景中高效学习,从而在保持模型灵活性的同时实现对时间序列模式演化的稳健捕捉,避免不必要的簇增长。

链接: https://arxiv.org/abs/2510.06919
作者: Adrián Pérez-Herrero,Paulo Félix,Jesús Presedo,Carl Henrik Ek
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注: This work has been submitted to the IEEE for possible publication. 15 pages. 9 figures

点击查看摘要

Abstract:We present a method that models the evolution of an unbounded number of time series clusters by switching among an unknown number of regimes with linear dynamics. We develop a Bayesian non-parametric approach using a hierarchical Dirichlet process as a prior on the parameters of a Switching Linear Dynamical System and a Gaussian process prior to model the statistical variations in amplitude and temporal alignment within each cluster. By modeling the evolution of time series patterns, the method avoids unnecessary proliferation of clusters in a principled manner. We perform inference by formulating a variational lower bound for off-line and on-line scenarios, enabling efficient learning through optimization. We illustrate the versatility and effectiveness of the approach through several case studies of electrocardiogram analysis using publicly available databases.
zh

[AI-94] CLAQS: Compact Learnable All-Quantum Token Mixer with Shared-ansatz for Text Classification

【速读】:该论文旨在解决当前量子自然语言处理(Quantum Natural Language Processing, Quantum NLP)在实际应用中面临的三大挑战:一是量子设备受限于比特数(qubit limited)和电路深度(depth limited),难以支持长序列建模;二是量子训练过程易出现不稳定性,影响模型收敛;三是经典注意力机制(classical attention)计算与内存开销大,难以高效映射到量子硬件。解决方案的关键在于提出一种紧凑且相位敏感的量子 token mixer——CLAQS,其核心创新包括:1)在统一量子电路中联合学习复值混合(complex-valued mixing)与非线性变换,实现端到端可微分的量子特征交互;2)引入 l1 归一化策略稳定振幅缩放,提升优化鲁棒性;3)设计两阶段参数化量子架构,将共享 token 嵌入与窗口级量子前馈模块解耦,从而在仅需 8 个数据量子比特和浅层电路条件下,实现对长文本的有效建模,并在 SST-2 和 IMDB 数据集上分别达到 91.64% 和 87.08% 的准确率,超越经典 Transformer 及强混合量子-经典基线。

链接: https://arxiv.org/abs/2510.06532
作者: Junhao Chen,Yifan Zhou,Hanqi Jiang,Yi Pan,Yiwei Li,Huaqin Zhao,Wei Zhang,Yingfeng Wang,Tianming Liu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantum compute is scaling fast, from cloud QPUs to high throughput GPU simulators, making it timely to prototype quantum NLP beyond toy tasks. However, devices remain qubit limited and depth limited, training can be unstable, and classical attention is compute and memory heavy. This motivates compact, phase aware quantum token mixers that stabilize amplitudes and scale to long sequences. We present CLAQS, a compact, fully quantum token mixer for text classification that jointly learns complex-valued mixing and nonlinear transformations within a unified quantum circuit. To enable stable end-to-end optimization, we apply l1 normalization to regulate amplitude scaling and introduce a two-stage parameterized quantum architecture that decouples shared token embeddings from a window-level quantum feed-forward module. Operating under a sliding-window regime with document-level aggregation, CLAQS requires only eight data qubits and shallow circuits, yet achieves 91.64% accuracy on SST-2 and 87.08% on IMDB, outperforming both classical Transformer baselines and strong hybrid quantum-classical counterparts.
zh

[AI-95] Deep Generative Model for Human Mobility Behavior

【速读】:该论文旨在解决人类移动行为模拟的难题,尤其是个体移动轨迹在大空间尺度上跨日到周时间范围内的复杂性、情境依赖性和探索性特征难以准确建模的问题。其解决方案的关键在于提出了一种名为MobilityGen的深度生成模型,该模型通过将行为属性与环境背景相耦合,能够生成符合真实世界规律的多样化、可解释且新颖的移动轨迹,同时再现诸如地点访问的标度律、活动时间分配以及出行方式与目的地选择之间的协同演化等关键模式,从而为城市规划、公共健康等领域的精细化数据驱动研究提供了新范式。

链接: https://arxiv.org/abs/2510.06473
作者: Ye Hong,Yatao Zhang,Konrad Schindler,Martin Raubal
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Understanding and modeling human mobility is central to challenges in transport planning, sustainable urban design, and public health. Despite decades of effort, simulating individual mobility remains challenging because of its complex, context-dependent, and exploratory nature. Here, we present MobilityGen, a deep generative model that produces realistic mobility trajectories spanning days to weeks at large spatial scales. By linking behavioral attributes with environmental context, MobilityGen reproduces key patterns such as scaling laws for location visits, activity time allocation, and the coupled evolution of travel mode and destination choices. It reflects spatio-temporal variability and generates diverse, plausible, and novel mobility patterns consistent with the built environment. Beyond standard validation, MobilityGen yields insights not attainable with earlier models, including how access to urban space varies across travel modes and how co-presence dynamics shape social exposure and segregation. Our work establishes a new framework for mobility simulation, paving the way for fine-grained, data-driven studies of human behavior and its societal implications.
zh

[AI-96] Soft-Evidence Fused Graph Neural Network for Cancer Driver Gene Identification across Multi-View Biological Graphs

【速读】:该论文旨在解决当前基于图神经网络(GNN)的癌症驱动基因(CDG)识别方法在多生物网络融合时存在的局限性问题,即现有方法通常依赖单一蛋白质-蛋白质相互作用(PPI)网络,或通过特征层面的一致性约束整合多个网络,但这种做法往往假设不同网络中基因关系一致,忽略了网络异质性,可能导致信息冲突和性能下降。其解决方案的关键在于提出Soft-Evidence Fusion Graph Neural Network(SEFGNN),该框架在决策层而非特征层进行融合:将每个生物网络视为独立证据源,利用Dempster-Shafer理论(DST)实现不确定性感知的融合,并引入Soft Evidence Smoothing(SES)模块缓解DST可能带来的过自信问题,从而提升排名稳定性并保持判别性能。

链接: https://arxiv.org/abs/2510.06290
作者: Bang Chen,Lijun Guo,Houli Fan,Wentao He,Rong Zhang
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

Abstract:Identifying cancer driver genes (CDGs) is essential for understanding cancer mechanisms and developing targeted therapies. Graph neural networks (GNNs) have recently been employed to identify CDGs by capturing patterns in biological interaction networks. However, most GNN-based approaches rely on a single protein-protein interaction (PPI) network, ignoring complementary information from other biological networks. Some studies integrate multiple networks by aligning features with consistency constraints to learn unified gene representations for CDG identification. However, such representation-level fusion often assumes congruent gene relationships across networks, which may overlook network heterogeneity and introduce conflicting information. To address this, we propose Soft-Evidence Fusion Graph Neural Network (SEFGNN), a novel framework for CDG identification across multiple networks at the decision level. Instead of enforcing feature-level consistency, SEFGNN treats each biological network as an independent evidence source and performs uncertainty-aware fusion at the decision level using Dempster-Shafer Theory (DST). To alleviate the risk of overconfidence from DST, we further introduce a Soft Evidence Smoothing (SES) module that improves ranking stability while preserving discriminative performance. Experiments on three cancer datasets show that SEFGNN consistently outperforms state-of-the-art baselines and exhibits strong potential in discovering novel CDGs.
zh
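
决策层融合依赖 Dempster-Shafer 组合规则;下面是两条证据融合的可运行示意(非官方实现,采用常见的“类别信念 + 不确定质量”简化形式):

```python
import numpy as np

def dempster_combine(b1, u1, b2, u2):
    """简化的 Dempster 组合规则:融合两个网络各自给出的类别信念 b 与
    不确定质量 u,约定 sum(b) + u = 1;C 为两证据间的冲突质量。"""
    b1, b2 = np.asarray(b1, float), np.asarray(b2, float)
    C = b1.sum() * b2.sum() - (b1 * b2).sum()        # 不同类别间的冲突
    b = (b1 * b2 + b1 * u2 + b2 * u1) / (1.0 - C)    # 融合后的信念
    u = u1 * u2 / (1.0 - C)                          # 融合后的不确定质量
    return b, u

# 两个网络对“是否为驱动基因”的证据:一个较自信,一个较不确定
b, u = dempster_combine([0.7, 0.1], 0.2, [0.3, 0.2], 0.5)
print(b, u, b.sum() + u)  # 融合结果仍满足 sum(b) + u = 1
```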

[AI-97] Dream2Image : An Open Multimodal EEG Dataset for Decoding and Visualizing Dreams with Artificial Intelligence

【速读】:该论文旨在解决如何通过脑电图(EEG)信号实现对梦境内容的解码与可视化重建这一难题,其核心挑战在于建立从神经活动到主观梦境体验之间的映射关系。解决方案的关键在于构建了全球首个融合脑电数据、梦境口述记录和生成式AI图像的多模态数据集Dream2Image,其中包含38名受试者超过31小时的睡眠EEG记录及对应的梦境描述与AI生成图像,从而为研究梦境的神经相关性、开发基于脑活动的梦境解码模型提供了可复现的基准资源。

链接: https://arxiv.org/abs/2510.06252
作者: Yann Bellec
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 Pages, 3 Figures, The Dream2Image dataset is openly available on Hugging Face at: this https URL

点击查看摘要

Abstract:Dream2Image is the world’s first dataset combining EEG signals, dream transcriptions, and AI-generated images. Based on 38 participants and more than 31 hours of dream EEG recordings, it contains 129 samples offering: the final seconds of brain activity preceding awakening (T-15, T-30, T-60, T-120), raw reports of dream experiences, and an approximate visual reconstruction of the dream. This dataset provides a novel resource for dream research, a unique resource to study the neural correlates of dreaming, to develop models for decoding dreams from brain activity, and to explore new approaches in neuroscience, psychology, and artificial intelligence. Available in open access on Hugging Face and GitHub, Dream2Image provides a multimodal resource designed to support research at the interface of artificial intelligence and neuroscience. It was designed to inspire researchers and extend the current approaches to brain activity decoding. Limitations include the relatively small sample size and the variability of dream recall, which may affect generalizability.
zh

[AI-98] Generalized Multi-agent Social Simulation Framework

【速读】:该论文旨在解决多智能体社会交互模拟系统在多样化场景下难以扩展以及因缺乏模块化设计而导致复用性差的问题。其解决方案的关键在于提出了一种基于面向对象的模块化框架,通过层次化结构有机整合多种基础类,并继承该框架实现常见派生类,从而提升系统的可扩展性和复用性;同时引入记忆摘要机制(memory summarization mechanism),从原始记忆数据中过滤并提炼出语境相关的事件与交互信息,增强模拟的真实性与效率。

链接: https://arxiv.org/abs/2510.06225
作者: Gang Li,Jie Lin,Yining Tang,Ziteng Wang,Yirui Huang,Junyu Zhang,Shuang Luo,Chao Wu,Yike Guo
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent social interaction has clearly benefited from Large Language Models. However, current simulation systems still face challenges such as difficulties in scaling to diverse scenarios and poor reusability due to a lack of modular design. To address these issues, we designed and developed a modular, object-oriented framework that organically integrates various base classes through a hierarchical structure, yielding scalability and reusability. We inherited the framework to realize common derived classes. Additionally, a memory summarization mechanism is proposed to filter and distill relevant information from raw memory data, prioritizing contextually salient events and interactions. By selecting and combining the necessary derived classes, we customized a specific simulated environment. Utilizing this simulated environment, we successfully simulated human interactions on social media, replicating real-world online social behaviors. The source code for the project will be released and will continue to evolve.
zh
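
框架中的记忆摘要机制负责从原始记忆中筛选情境相关的事件与交互;下面是一个按显著性打分并截取的极简示意(非官方实现;打分方式为假设):

```python
def summarize_memory(memory, context_keywords, top_k=5):
    """按“与当前情境的关键词重合度 + 新近度”给记忆条目打分,
    保留得分最高的 top_k 条作为提示上下文(假设性实现)。"""
    def salience(idx, entry):
        overlap = sum(kw in entry["text"] for kw in context_keywords)
        recency = idx / max(1, len(memory) - 1)   # 越新的条目得分越高
        return overlap + 0.5 * recency
    ranked = sorted(enumerate(memory), key=lambda p: salience(*p), reverse=True)
    return [entry["text"] for _, entry in ranked[:top_k]]

memory = [
    {"text": "A 发布了关于选举的帖子"},
    {"text": "B 评论了 A 的帖子"},
    {"text": "C 发布了美食照片"},
]
print(summarize_memory(memory, context_keywords=["选举", "帖子"], top_k=2))
```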

机器学习

[LG-0] MolGA: Molecular Graph Adaptation with Pre-trained 2D Graph Encoder

链接: https://arxiv.org/abs/2510.07289
作者: Xingtong Yu,Chang Zhou,Xinming Zhang,Yuan Fang
类目: Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Molecular graph representation learning is widely used in chemical and biomedical research. While pre-trained 2D graph encoders have demonstrated strong performance, they overlook the rich molecular domain knowledge associated with submolecular instances (atoms and bonds). While molecular pre-training approaches incorporate such knowledge into their pre-training objectives, they typically employ designs tailored to a specific type of knowledge, lacking the flexibility to integrate diverse knowledge present in molecules. Hence, reusing widely available and well-validated pre-trained 2D encoders, while incorporating molecular domain knowledge during downstream adaptation, offers a more practical alternative. In this work, we propose MolGA, which adapts pre-trained 2D graph encoders to downstream molecular applications by flexibly incorporating diverse molecular domain knowledge. First, we propose a molecular alignment strategy that bridge the gap between pre-trained topological representations with domain-knowledge representations. Second, we introduce a conditional adaptation mechanism that generates instance-specific tokens to enable fine-grained integration of molecular domain knowledge for downstream tasks. Finally, we conduct extensive experiments on eleven public datasets, demonstrating the effectiveness of MolGA.

[LG-1] Dynamic Regret Bounds for Online Omniprediction with Long Term Constraints

链接: https://arxiv.org/abs/2510.07266
作者: Yahav Bechavod,Jiuyao Lu,Aaron Roth
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:We present an algorithm guaranteeing dynamic regret bounds for online omniprediction with long term constraints. The goal in this recently introduced problem is for a learner to generate a sequence of predictions which are broadcast to a collection of downstream decision makers. Each decision maker has their own utility function, as well as a vector of constraint functions, each mapping their actions and an adversarially selected state to reward or constraint violation terms. The downstream decision makers select actions “as if” the state predictions are correct, and the goal of the learner is to produce predictions such that all downstream decision makers choose actions that give them worst-case utility guarantees while minimizing worst-case constraint violation. Within this framework, we give the first algorithm that obtains simultaneous dynamic regret guarantees for all of the agents – where regret for each agent is measured against a potentially changing sequence of actions across rounds of interaction, while also ensuring vanishing constraint violation for each agent. Our results do not require the agents themselves to maintain any state – they only solve one-round constrained optimization problems defined by the prediction made at that round.

[LG-2] Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

链接: https://arxiv.org/abs/2510.07257
作者: Evgenii Opryshko,Junwei Quan,Claas Voelcker,Yilun Du,Igor Gilitschenski
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning approach to solve the GCRL task. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference. On the OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.
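
下面给出 TTGS 主干流程的极简示意(非官方实现;以一维玩具状态和欧氏距离代替学到的 goal-conditioned 距离,k 近邻建图后用 Dijkstra 搜出子目标序列,再交给冻结策略逐个执行):

```python
import heapq

def ttgs_plan(states, dist, start, goal, k=3):
    """在数据集状态上按 dist 建 k 近邻加权图,Dijkstra 搜最短子目标路径。"""
    n = len(states)
    nbrs = {i: sorted(range(n), key=lambda j: dist(states[i], states[j]))[1:k + 1]
            for i in range(n)}                       # 排除自身的 k 个最近邻
    pq, best, prev = [(0.0, start)], {start: 0.0}, {}
    while pq:
        d, i = heapq.heappop(pq)
        if i == goal:
            break
        if d > best.get(i, float("inf")):
            continue
        for j in nbrs[i]:
            nd = d + dist(states[i], states[j])
            if nd < best.get(j, float("inf")):
                best[j], prev[j] = nd, i
                heapq.heappush(pq, (nd, j))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]  # 作为子目标序列依次喂给冻结的 goal-conditioned 策略

states = [0.0, 1.0, 2.5, 4.0, 5.0]
print(ttgs_plan(states, lambda a, b: abs(a - b), start=0, goal=4))
```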

[LG-3] Discriminative Feature Feedback with General Teacher Classes

链接: https://arxiv.org/abs/2510.07245
作者: Omri Bar Oz,Tosca Lechner,Sivan Sabato
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study the theoretical properties of the interactive learning protocol Discriminative Feature Feedback (DFF) (Dasgupta et al., 2018). The DFF learning protocol uses feedback in the form of discriminative feature explanations. We provide the first systematic study of DFF in a general framework that is comparable to that of classical protocols such as supervised learning and online learning. We study the optimal mistake bound of DFF in the realizable and the non-realizable settings, and obtain novel structural results, as well as insights into the differences between Online Learning and settings with richer feedback such as DFF. We characterize the mistake bound in the realizable setting using a new notion of dimension. In the non-realizable setting, we provide a mistake upper bound and show that it cannot be improved in general. Our results show that unlike Online Learning, in DFF the realizable dimension is insufficient to characterize the optimal non-realizable mistake bound or the existence of no-regret algorithms.

[LG-4] A Broader View of Thompson Sampling

链接: https://arxiv.org/abs/2510.07208
作者: Yanlin Qu,Hongseok Namkoong,Assaf Zeevi
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to “properly” balance exploration and exploitation, remains a mystery. In this paper we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, a key conceptual tool is introduced, which we refer to as “faithful” stationarization of the regret formulation. Essentially, the finite horizon dynamic optimization problem is converted into a stationary counterpart which “closely resembles” the original objective (in contrast, the classical infinite horizon discounted formulation, that leads to the Gittins index, alters the problem and objective in too significant a manner). The newly crafted time invariant objective can be studied using Bellman’s principle which leads to a time invariant optimal policy. When viewed through this lens, Thompson Sampling admits a simple online optimization form that mimics the structure of the Bellman-optimal policy, and where greediness is regularized by a measure of residual uncertainty based on point-biserial correlation. This answers the question of how Thompson Sampling balances exploration-exploitation, and moreover, provides a principled framework to study and further improve Thompson’s original idea.
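
作为参照,下面给出本文所分析对象——经典 Beta-Bernoulli Thompson Sampling——的最小实现(这是既有算法本身,并非本文提出的在线优化重构):

```python
import random

def thompson_bernoulli(true_probs, horizon=10_000, seed=0):
    """Beta-Bernoulli Thompson Sampling:每轮从各臂的 Beta 后验
    采样一个均值,贪心地选择样本值最大的臂。"""
    rng = random.Random(seed)
    k = len(true_probs)
    wins, losses = [1] * k, [1] * k          # Beta(1, 1) 均匀先验
    total = 0
    for _ in range(horizon):
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.random() < true_probs[arm]
        wins[arm] += reward                   # bool 按 0/1 累加
        losses[arm] += 1 - reward
        total += reward
    return total, [w / (w + l) for w, l in zip(wins, losses)]

total, estimates = thompson_bernoulli([0.3, 0.5, 0.7])
print(total, estimates)   # 被拉动最多的最优臂,其后验均值估计也最精确
```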

[LG-5] Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts

链接: https://arxiv.org/abs/2510.07205
作者: Fangshuo Liao,Anastasios Kyrillidis
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or only top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, where the router’s learning process is “guided” by the experts, that recovers the teacher’s parameters. Moreover, we show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. To our knowledge, our analysis is the first to bring novel insights in understanding the optimization landscape of the MoE architecture.

[LG-6] An in-depth look at approximation via deep and narrow neural networks

链接: https://arxiv.org/abs/2510.07202
作者: Joris Dommel,Sven A. Wegner
类目: Machine Learning (cs.LG)
备注: 11 pages

点击查看摘要

Abstract:In 2017, Hanin and Sellke showed that the class of arbitrarily deep, real-valued, feed-forward and ReLU-activated networks of width w forms a dense subset of the space of continuous functions on R^n, with respect to the topology of uniform convergence on compact sets, if and only if w > n holds. To show the necessity, a concrete counterexample function f: R^n → R was used. In this note we actually approximate this very f by neural networks in the two cases w = n and w = n + 1 around the aforementioned threshold. We study how the approximation quality behaves if we vary the depth and what effects (spoiler alert: dying neurons) cause that behavior.

[LG-7] Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

链接: https://arxiv.org/abs/2510.07192
作者: Alexandra Souly,Javier Rando,Ed Chapman,Xander Davies,Burak Hasircioglu,Ezzeldin Shereen,Carlos Mougan,Vasilios Mavroudis,Erik Jones,Chris Hicks,Nicholas Carlini,Yarin Gal,Robert Kirk
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.

[LG-8] Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging

链接: https://arxiv.org/abs/2510.07182
作者: Patrick Peixuan Ye,Chen Shani,Ellen Vitercik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Bridged Clustering, a semi-supervised framework to learn predictors from any unpaired input X and output Y dataset. Our method first clusters X and Y independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input x is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction \hat{y}. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.
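The pipeline is simple enough to sketch with scikit-learn. The snippet below is a minimal reading of the abstract, not the authors' implementation; the cluster counts kx, ky and the majority-vote bridge are assumptions.

```python
# Minimal sketch of Bridged Clustering: cluster X and Y independently,
# bridge clusters with a few paired examples, predict via linked centroid.
import numpy as np
from sklearn.cluster import KMeans

def fit_bridged(X, Y, X_paired, Y_paired, kx=5, ky=5, seed=0):
    km_x = KMeans(n_clusters=kx, random_state=seed, n_init=10).fit(X)
    km_y = KMeans(n_clusters=ky, random_state=seed, n_init=10).fit(Y)
    cx, cy = km_x.predict(X_paired), km_y.predict(Y_paired)
    bridge = {}
    for i in range(kx):  # sparse bridge: majority vote over paired examples
        linked = cy[cx == i]
        bridge[i] = int(np.bincount(linked, minlength=ky).argmax()) if len(linked) else 0
    return km_x, km_y, bridge

def predict(x, km_x, km_y, bridge):
    i = km_x.predict(x.reshape(1, -1))[0]     # nearest input cluster
    return km_y.cluster_centers_[bridge[i]]   # linked output centroid as y-hat
```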

[LG-9] Spectral Graph Clustering under Differential Privacy: Balancing Privacy Accuracy and Efficiency

链接: https://arxiv.org/abs/2510.07136
作者: Mohamed Seif,Antti Koskela,H. Vincent Poor,Andrea J. Goldsmith
类目: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:We study the problem of spectral graph clustering under edge differential privacy (DP). Specifically, we develop three mechanisms: (i) graph perturbation via randomized edge flipping combined with adjacency matrix shuffling, which enforces edge privacy while preserving key spectral properties of the graph. Importantly, shuffling considerably amplifies the guarantees: whereas flipping edges with a fixed probability alone provides only a constant \epsilon edge DP guarantee as the number of nodes grows, the shuffled mechanism achieves (\epsilon, \delta) edge DP with parameters that tend to zero as the number of nodes increases; (ii) private graph projection with additive Gaussian noise in a lower-dimensional space to reduce dimensionality and computational complexity; and (iii) a noisy power iteration method that distributes Gaussian noise across iterations to ensure edge DP while maintaining convergence. Our analysis provides rigorous privacy guarantees and a precise characterization of the misclassification error rate. Experiments on synthetic and real-world networks validate our theoretical analysis and illustrate the practical privacy-utility trade-offs.
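Mechanism (i) is essentially randomized response on edges. A minimal NumPy sketch of the flipping step (without the shuffling amplification) might look as follows; the flip probability p is left as a parameter.

```python
# Sketch of randomized edge flipping on a symmetric 0/1 adjacency matrix:
# each undirected edge slot is flipped independently with probability p.
import numpy as np

def flip_edges(A: np.ndarray, p: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    upper = np.triu(rng.random((n, n)) < p, k=1)  # decide flips once per edge
    mask = upper | upper.T                        # keep the matrix symmetric
    A_priv = np.where(mask, 1 - A, A)
    np.fill_diagonal(A_priv, 0)                   # no self-loops
    return A_priv
```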

[LG-10] DPMM-CFL: Clustered Federated Learning via Dirichlet Process Mixture Model Nonparametric Clustering

链接: https://arxiv.org/abs/2510.07132
作者: Mariona Jaramillo-Civill,Peng Wu,Pau Closas
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Clustered Federated Learning (CFL) improves performance under non-IID client heterogeneity by clustering clients and training one model per cluster, thereby balancing between a global model and fully personalized models. However, most CFL methods require the number of clusters K to be fixed a priori, which is impractical when the latent structure is unknown. We propose DPMM-CFL, a CFL algorithm that places a Dirichlet Process (DP) prior over the distribution of cluster parameters. This enables nonparametric Bayesian inference to jointly infer both the number of clusters and client assignments, while optimizing per-cluster federated objectives. This results in a method where, at each round, federated updates and cluster inferences are coupled, as presented in this paper. The algorithm is validated on benchmark datasets under Dirichlet and class-split non-IID partitions.

[LG-11] GNN-enhanced Traffic Anomaly Detection for Next-Generation SDN-Enabled Consumer Electronics

链接: https://arxiv.org/abs/2510.07109
作者: Guan-Yan Yang,Farn Wang,Kuo-Hui Yeh
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This paper has been accepted for publication in IEEE Transactions on Consumer Electronics. 10 pages, 6 figures

点击查看摘要

Abstract:Consumer electronics (CE) connected to the Internet of Things are susceptible to various attacks, including DDoS and web-based threats, which can compromise their functionality and facilitate remote hijacking. These vulnerabilities allow attackers to exploit CE for broader system attacks while enabling the propagation of malicious code across the CE network, resulting in device failures. Existing deep learning-based traffic anomaly detection systems exhibit high accuracy in traditional network environments but are often overly complex and reliant on static infrastructure, necessitating manual configuration and management. To address these limitations, we propose a scalable network model that integrates Software-defined Networking (SDN) and Compute First Networking (CFN) for next-generation CE networks. In this network model, we propose a Graph Neural Networks-based Network Anomaly Detection framework (GNN-NAD) that integrates SDN-based CE networks and enables the CFN architecture. GNN-NAD uniquely fuses a static, vulnerability-aware attack graph with dynamic traffic features, providing a holistic view of network security. The core of the framework is a GNN model (GSAGE) for graph representation learning, followed by a Random Forest (RF) classifier. This design (GSAGE+RF) demonstrates superior performance compared to existing feature selection methods. Experimental evaluations on CE environment reveal that GNN-NAD achieves superior metrics in accuracy, recall, precision, and F1 score, even with small sample sizes, exceeding the performance of current network anomaly detection methods. This work advances the security and efficiency of next-generation intelligent CE networks.

[LG-12] Non-Asymptotic Analysis of Efficiency in Conformalized Regression

链接: https://arxiv.org/abs/2510.07093
作者: Yunzhen Yao,Lie He,Michael Gastpar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal prediction provides prediction sets with coverage guarantees. The informativeness of conformal prediction depends on its efficiency, typically quantified by the expected size of the prediction set. Prior work on the efficiency of conformalized regression commonly treats the miscoverage level \alpha as a fixed constant. In this work, we establish non-asymptotic bounds on the deviation of the prediction set length from the oracle interval length for conformalized quantile and median regression trained via SGD, under mild assumptions on the data distribution. Our bounds of order \mathcal{O}(1/\sqrt{n} + 1/(\alpha^2 n) + 1/\sqrt{m} + \exp(-\alpha^2 m)) capture the joint dependence of efficiency on the proper training set size n, the calibration set size m, and the miscoverage level \alpha. The results identify phase transitions in convergence rates across different regimes of \alpha, offering guidance for allocating data to control excess prediction set length. Empirical results are consistent with our theoretical findings.
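For readers less familiar with the setting, the split-conformal construction whose efficiency is analyzed looks roughly like this (a generic CQR-style sketch, not the paper's code): n samples train the quantile regressors, and m calibration samples set the correction.

```python
# Sketch of split conformalized quantile regression: widen the fitted
# quantile band by the (1 - alpha)-adjusted quantile of calibration scores.
import numpy as np

def conformalize(q_lo, q_hi, X_cal, y_cal, alpha=0.1):
    # q_lo / q_hi: fitted lower/upper quantile regressors (callables)
    scores = np.maximum(q_lo(X_cal) - y_cal, y_cal - q_hi(X_cal))
    m = len(y_cal)
    k = int(np.ceil((1 - alpha) * (m + 1)))
    qhat = np.sort(scores)[min(k, m) - 1]
    return lambda X: (q_lo(X) - qhat, q_hi(X) + qhat)  # prediction interval
```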

[LG-13] Non-Stationary Online Structured Prediction with Surrogate Losses

链接: https://arxiv.org/abs/2510.07086
作者: Shinsaku Sakaue,Han Bao,Yuzhou Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. Therein the surrogate regret – the cumulative excess of the target loss (e.g., 0-1 loss) over the surrogate loss (e.g., logistic loss) of the fixed best estimator – has gained attention, particularly because it often admits a finite bound independent of the time horizon T. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur the surrogate loss growing linearly with T. We address this by proving a bound of the form F_T + C(1 + P_T) on the cumulative target loss, where F_T is the cumulative surrogate loss of any comparator sequence, P_T is its path length, and C > 0 is some constant. This bound depends on T only through F_T and P_T, often yielding much stronger guarantees in non-stationary environments. Our core idea is to synthesize the dynamic regret bound of the online gradient descent (OGD) with the technique of exploiting the surrogate gap. Our analysis also sheds light on a new Polyak-style learning rate for OGD, which systematically offers target-loss guarantees and exhibits promising empirical performance. We further extend our approach to a broader class of problems via the convolutional Fenchel–Young loss. Finally, we prove a lower bound showing that the dependence on F_T and P_T is tight.
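As a concrete instance of the OGD component, here is a sketch of online logistic classification with the classical Polyak step eta_t = loss_t / ||grad_t||^2; the paper's Polyak-style rule may differ in detail.

```python
# Sketch: online gradient descent with a Polyak-style step on the logistic
# surrogate loss, for a stream of (x, y) pairs with labels y in {-1, +1}.
import numpy as np

def ogd_polyak(stream, dim):
    w = np.zeros(dim)
    for x, y in stream:
        margin = y * (w @ x)
        loss = np.log1p(np.exp(-margin))          # logistic surrogate loss
        grad = -y * x / (1.0 + np.exp(margin))    # its gradient in w
        g2 = grad @ grad
        if g2 > 0:
            w -= (loss / g2) * grad               # Polyak-style step size
    return w
```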

[LG-14] Pseudo-MDPs: A Novel Framework for Efficiently Optimizing Last Revealer Seed Manipulations in Blockchains

链接: https://arxiv.org/abs/2510.07080
作者: Maxime Reynouard
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study tackles the computational challenges of solving Markov Decision Processes (MDPs) for a restricted class of problems. It is motivated by the Last Revealer Attack (LRA), which undermines fairness in some Proof-of-Stake (PoS) blockchains such as Ethereum ($400B market capitalization). We introduce pseudo-MDPs (pMDPs), a framework that naturally models such problems, and propose two distinct problem reductions to standard MDPs. One problem reduction provides a novel, counter-intuitive perspective, and combining the two problem reductions enables significant improvements in dynamic programming algorithms such as value iteration. In the case of the LRA, whose size is parameterized by \kappa (in Ethereum’s case \kappa = 32), we reduce the computational complexity from O(2^{\kappa} \kappa^{2^{(\kappa+2)}}) to O(\kappa^4) (per iteration). This solution also provides the usual benefits of dynamic programming solutions: exponentially fast convergence toward the optimal solution is guaranteed. The dual perspective also simplifies policy extraction, making the approach well-suited for resource-constrained agents who can operate with very limited memory and computation once the problem has been solved. Furthermore, we generalize those results to a broader class of MDPs, enhancing their applicability. The framework is validated through two case studies: a fictional card game and the LRA on the Ethereum random seed consensus protocol. These applications demonstrate the framework’s ability to solve large-scale problems effectively while offering actionable insights into optimal strategies. This work advances the study of MDPs and contributes to understanding security vulnerabilities in blockchain systems.

[LG-15] Blind Construction of Angular Power Maps in Massive MIMO Networks

链接: https://arxiv.org/abs/2510.07071
作者: Zheng Xing,Junting Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Channel state information (CSI) acquisition is a challenging problem in massive multiple-input multiple-output (MIMO) networks. Radio maps provide a promising solution for radio resource management by reducing online CSI acquisition. However, conventional approaches for radio map construction require location-labeled CSI data, which is challenging to obtain in practice. This paper investigates unsupervised angular power map construction based on large timescale CSI data collected in a massive MIMO network without location labels. A hidden Markov model (HMM) is built to connect the hidden trajectory of a mobile with the CSI evolution of a massive MIMO channel. As a result, the mobile location can be estimated, enabling the construction of an angular power map. We show that under uniform rectilinear mobility with Poisson-distributed base stations (BSs), the Cramér-Rao Lower Bound (CRLB) for localization error can vanish at any signal-to-noise ratio (SNR), whereas when BSs are confined to a limited region, the error remains nonzero even with infinite independent measurements. Based on reference signal received power (RSRP) data collected in a real multi-cell massive MIMO network, an average localization error of 18 meters can be achieved, although measurements are mainly obtained from a single serving cell.

[LG-16] Enhancing Speech Emotion Recognition via Fine-Tuning Pre-Trained Models and Hyper-Parameter Optimisation

链接: https://arxiv.org/abs/2510.07052
作者: Aryan Golbaghi,Shuo Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a workflow for speech emotion recognition (SER) that combines pre-trained representations with automated hyperparameter optimisation (HPO). Using SpeechBrain wav2vec2-base model fine-tuned on IEMOCAP as the encoder, we compare two HPO strategies, Gaussian Process Bayesian Optimisation (GP-BO) and Tree-structured Parzen Estimators (TPE), under an identical four-dimensional search space and 15-trial budget, with balanced class accuracy (BCA) on the German EmoDB corpus as the objective. All experiments run on 8 CPU cores with 32 GB RAM. GP-BO achieves 0.96 BCA in 11 minutes, and TPE (Hyperopt implementation) attains 0.97 in 15 minutes. In contrast, grid search requires 143 trials and 1,680 minutes to exceed 0.9 BCA, and the best AutoSpeech 2020 baseline reports only 0.85 in 30 minutes on GPU. For cross-lingual generalisation, an EmoDB-trained HPO-tuned model improves zero-shot accuracy by 0.25 on CREMA-D and 0.26 on RAVDESS. Results show that efficient HPO with pre-trained encoders delivers competitive SER on commodity CPUs. Source code to this work is available at: this https URL.
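The TPE side of the comparison maps directly onto Hyperopt's fmin API. The sketch below mirrors the 15-trial budget; the four search dimensions and their ranges are assumptions, and train_and_eval_bca is a stub standing in for the actual fine-tuning run.

```python
# Sketch of a 15-trial TPE search with Hyperopt over a 4-D space.
from hyperopt import fmin, tpe, hp

def train_and_eval_bca(lr, batch_size, dropout, epochs):
    # Placeholder for fine-tuning the wav2vec2 encoder and returning
    # balanced class accuracy (BCA) on EmoDB.
    return 0.9

space = {
    "lr": hp.loguniform("lr", -11, -6),            # ~1.7e-5 .. 2.5e-3
    "batch_size": hp.choice("batch_size", [4, 8, 16]),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
    "epochs": hp.choice("epochs", [5, 10, 15]),
}

def objective(params):
    return -train_and_eval_bca(**params)           # Hyperopt minimizes

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=15)
print(best)
```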

[LG-17] COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning Preference Optimization

链接: https://arxiv.org/abs/2510.07043
作者: Tian Qin,Felix Bai,Ting-Yao Hu,Raviteja Vemulapalli,Hema Swetha Koppula,Zhiyang Xu,Bowen Jin,Mert Cemri,Jiarui Lu,Zirui Wang,Meng Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent’s ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.

[LG-18] Spiral Model Technique For Data Science Machine Learning Lifecycle

链接: https://arxiv.org/abs/2510.06987
作者: Rohith Mahadevan
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Analytics plays an important role in modern business. Companies adapt data science lifecycles to their culture to improve productivity and competitiveness. The data science lifecycle is an important contributing factor in starting and ending data-dependent projects. Data science and machine learning lifecycles comprise a series of steps involved in a project; a typical lifecycle is depicted as a linear or cyclical model, and in a traditional data science lifecycle it is possible to restart the process after reaching the end of the cycle. This paper suggests a new technique for applying the data science lifecycle to business problems that have a clear end goal: a spiral technique that emphasizes versatility, agility, and an iterative approach to business processes.

[LG-19] Relational Database Distillation: From Structured Tables to Condensed Graph Data

链接: https://arxiv.org/abs/2510.06980
作者: Xinyi Gao,Jingxi Zhang,Lijian Chen,Tong Chen,Lizhen Cui,Hongzhi Yin
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances leverage graph representation learning to capture complex inter-table relations as multi-hop dependencies. Despite achieving state-of-the-art performance, these methods remain hindered by the prohibitive storage overhead and excessive training time, due to the massive scale of the database and the computational burden of intensive message passing across interconnected tables. To alleviate these concerns, we propose and study the problem of Relational Database Distillation (RDD). Specifically, we aim to distill large-scale RDBs into compact heterogeneous graphs while retaining the predictive power (i.e., utility) required for training graph-based models. Multi-modal column information is preserved through node features, and primary-foreign key relations are encoded via heterogeneous edges, thereby maintaining both data fidelity and relational structure. To ensure adaptability across diverse downstream tasks without engaging the traditional, inefficient bi-level distillation framework, we further design a kernel ridge regression-guided objective with pseudo-labels, which produces quality features for the distilled graph. Extensive experiments on multiple real-world RDBs demonstrate that our solution substantially reduces the data size while maintaining competitive performance on classification and regression tasks, creating an effective pathway for scalable learning with RDBs.

[LG-20] Falsification-Driven Reinforcement Learning for Maritime Motion Planning

链接: https://arxiv.org/abs/2510.06970
作者: Marlon Müller,Florian Finkeldei,Hanna Krasowski,Murat Arcak,Matthias Althoff
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Compliance with maritime traffic rules is essential for the safe operation of autonomous vessels, yet training reinforcement learning (RL) agents to adhere to them is challenging. The behavior of RL agents is shaped by the training scenarios they encounter, but creating scenarios that capture the complexity of maritime navigation is non-trivial, and real-world data alone is insufficient. To address this, we propose a falsification-driven RL approach that generates adversarial training scenarios in which the vessel under test violates maritime traffic rules, which are expressed as signal temporal logic specifications. Our experiments on open-sea navigation with two vessels demonstrate that the proposed approach provides more relevant training scenarios and achieves more consistent rule compliance.

[LG-21] Accelerating Sparse Ternary GEMM for Quantized LLM inference on Apple Silicon

链接: https://arxiv.org/abs/2510.06957
作者: Baraq Lipshitz(ETH Zurich),Alessio Melone(ETH Zurich),Charalampos Maraziaris(ETH Zurich),Muhammed Bilal(ETH Zurich)
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple’s M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices with 50% nonzero ternary values (sparsity), reaching up to 50.2% of the processor’s theoretical peak performance, and remains stable across varying sparsity levels. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity, and remains stable across varying sparsity levels.

[LG-22] From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics

链接: https://arxiv.org/abs/2510.06954
作者: Zheng-An Chen,Tao Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in [Zhou et al. NeurIPS 2022] to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages. In the first stage, asymmetric weight perturbations from random initialization sustain non-degenerate gradient dynamics in parameter matrices, facilitating systematic escape from small initialization regimes. Subsequently, these matrices undergo condensation, progressively aligning toward the target orientation. In the second stage, the previously static key-query matrices actively participate in training, driving the normalized matrices toward asymptotic rank collapse. This two-stage framework generalizes classical directional convergence results.

[LG-23] Fisher Information Training and Bias in Fourier Regression Models

链接: https://arxiv.org/abs/2510.06945
作者: Lorenzo Pastori,Veronika Eyring,Mierk Schwabe
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Data Analysis, Statistics and Probability (physics.data-an); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Motivated by the growing interest in quantum machine learning, in particular quantum neural networks (QNNs), we study how recently introduced evaluation metrics based on the Fisher information matrix (FIM) are effective for predicting their training and prediction performance. We exploit the equivalence between a broad class of QNNs and Fourier models, and study the interplay between the effective dimension and the bias of a model towards a given task, investigating how these affect the model’s training and performance. We show that for a model that is completely agnostic, or unbiased, towards the function to be learned, a higher effective dimension likely results in a better trainability and performance. On the other hand, for models that are biased towards the function to be learned a lower effective dimension is likely beneficial during training. To obtain these results, we derive an analytical expression of the FIM for Fourier models and identify the features controlling a model’s effective dimension. This allows us to construct models with tunable effective dimension and bias, and to compare their training. We furthermore introduce a tensor network representation of the considered Fourier models, which could be a tool of independent interest for the analysis of QNN models. Overall, these findings provide an explicit example of the interplay between geometrical properties, model-task alignment and training, which are relevant for the broader machine learning community.

[LG-24] Revisiting Node Affinity Prediction in Temporal Graphs

链接: https://arxiv.org/abs/2510.06940
作者: Krishna Sri Ipsit Mantri,Or Feldman,Moshe Eliasof,Chaim Baskin
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:Node affinity prediction is a common task that is widely used in temporal graph learning with applications in social and financial networks, recommender systems, and more. Recent works have addressed this task by adapting state-of-the-art dynamic link property prediction models to node affinity prediction. However, simple heuristics, such as Persistent Forecast or Moving Average, outperform these models. In this work, we analyze the challenges in training current Temporal Graph Neural Networks for node affinity prediction and suggest appropriate solutions. Combining the solutions, we develop NAViS - Node Affinity prediction model using Virtual State, by exploiting the equivalence between heuristics and state space models. While promising, training NAViS is non-trivial. Therefore, we further introduce a novel loss function for node affinity prediction. We evaluate NAViS on TGB and show that it outperforms the state-of-the-art, including heuristics. Our source code is available at this https URL
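The two heuristics the paper benchmarks against are worth seeing explicitly, since they set a surprisingly strong bar; a minimal sketch:

```python
# The baselines from the abstract: Persistent Forecast repeats the last
# observed affinity vector; Moving Average smooths a trailing window.
import numpy as np

def persistent_forecast(history: np.ndarray) -> np.ndarray:
    return history[-1]                      # history: (T, d) -> predict (d,)

def moving_average(history: np.ndarray, window: int = 7) -> np.ndarray:
    return history[-window:].mean(axis=0)
```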

[LG-25] Utilizing Large Language Models for Machine Learning Explainability

链接: https://arxiv.org/abs/2510.06912
作者: Alexandros Vassiliades,Nikolaos Polatidis,Stamatios Samaras,Sotiris Diplaris,Ignacio Cabrera Martin,Yannis Manolopoulos,Stefanos Vrochidis,Ioannis Kompatsiaris
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores the explainability capabilities of large language models (LLMs), when employed to autonomously generate machine learning (ML) solutions. We examine two classification tasks: (i) a binary classification problem focused on predicting driver alertness states, and (ii) a multilabel classification problem based on the yeast dataset. Three state-of-the-art LLMs (i.e. OpenAI GPT, Anthropic Claude, and DeepSeek) are prompted to design training pipelines for four common classifiers: Random Forest, XGBoost, Multilayer Perceptron, and Long Short-Term Memory networks. The generated models are evaluated in terms of predictive performance (recall, precision, and F1-score) and explainability using SHAP (SHapley Additive exPlanations). Specifically, we measure Average SHAP Fidelity (Mean Squared Error between SHAP approximations and model outputs) and Average SHAP Sparsity (number of features deemed influential). The results reveal that LLMs are capable of producing effective and interpretable models, achieving high fidelity and consistent sparsity and closely matching manually engineered baselines, which highlights their potential as automated tools for interpretable ML pipeline generation.

[LG-26] Vacuum Spiker: A Spiking Neural Network-Based Model for Efficient Anomaly Detection in Time Series

链接: https://arxiv.org/abs/2510.06910
作者: Iago Xabier Vázquez,Javier Sedano,Muhammad Afzal,Ángel Miguel García-Vico
类目: Machine Learning (cs.LG)
*备注: 53 pages, 16 figures, preprint submitted to a journal for review

点击查看摘要

Abstract:Anomaly detection is a key task across domains such as industry, healthcare, and cybersecurity. Many real-world anomaly detection problems involve analyzing multiple features over time, making time series analysis a natural approach for such problems. While deep learning models have achieved strong performance in this field, their tendency to exhibit high energy consumption limits their deployment in resource-constrained environments such as IoT devices, edge computing platforms, and wearables. To address this challenge, this paper introduces the Vacuum Spiker algorithm, a novel Spiking Neural Network-based method for anomaly detection in time series. It incorporates a new detection criterion that relies on global changes in neural activity rather than reconstruction or prediction error. It is trained using Spike Time-Dependent Plasticity in a novel way, intended to induce changes in neural activity when anomalies occur. A new efficient encoding scheme is also proposed, which discretizes the input space into non-overlapping intervals, assigning each to a single neuron. This strategy encodes information with a single spike per time step, improving energy efficiency compared to conventional encoding methods. Experimental results on publicly available datasets show that the proposed algorithm achieves competitive performance while significantly reducing energy consumption, compared to a wide set of deep learning and machine learning baselines. Furthermore, its practical utility is validated in a real-world case study, where the model successfully identifies power curtailment events in a solar inverter. These results highlight its potential for sustainable and efficient anomaly detection.
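The proposed encoding is easy to picture: the input range is split into non-overlapping intervals, each owned by one neuron, and every time step emits exactly one spike. A minimal sketch, assuming known input bounds:

```python
# Sketch of single-spike interval encoding: discretize [lo, hi] into
# n_neurons non-overlapping intervals; each sample fires exactly one neuron.
import numpy as np

def interval_encode(x: np.ndarray, lo: float, hi: float, n_neurons: int):
    idx = ((x - lo) / (hi - lo) * n_neurons).astype(int)
    idx = np.clip(idx, 0, n_neurons - 1)
    spikes = np.zeros((len(x), n_neurons), dtype=np.uint8)
    spikes[np.arange(len(x)), idx] = 1      # one spike per time step
    return spikes
```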

[LG-27] Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors

链接: https://arxiv.org/abs/2510.06834
作者: Vasileios Titopoulos,Kosmas Alexandridis,Giorgos Dimitrakopoulos
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Attention is a core operation in numerous machine learning and artificial intelligence models. This work focuses on the acceleration of attention kernel using FlashAttention algorithm, in vector processors, particularly those based on the RISC-V instruction set architecture (ISA). This work represents the first effort to vectorize FlashAttention, minimizing scalar code and simplifying the computational complexity of evaluating exponentials needed by softmax used in attention. By utilizing a low-cost approximation for exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function without the need to extend baseline vector ISA with new custom instructions. Also, appropriate tiling strategies are explored with the goal to improve memory locality. Experimental results highlight the scalability of our approach, demonstrating significant performance gains with the vectorized implementations when processing attention layers in practical applications.

[LG-28] Early wind turbine alarm prediction based on machine learning: AlarmForecasting

链接: https://arxiv.org/abs/2510.06831
作者: Syed Shazaib Shah,Daoliang Tan
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: International Journal of Electrical Power and Energy Systems

点击查看摘要

Abstract:Alarm data is pivotal in curbing fault behavior in Wind Turbines (WTs) and forms the backbone for advanced predictive monitoring systems. Traditionally, research cohorts have been confined to utilizing alarm data solely as a diagnostic tool, merely indicative of unhealthy status. However, this study aims to offer a transformative leap towards preempting alarms, preventing alarms from triggering altogether, and consequently averting impending failures. Our proposed Alarm Forecasting and Classification (AFC) framework is designed on two successive modules: first, the regression module based on long short-term memory (LSTM) for time-series alarm forecasting, and thereafter, the classification module to implement alarm tagging on the forecasted alarm. This way, the entire alarm taxonomy can be forecasted reliably rather than a few specific alarms. 14 Senvion MM82 turbines with an operational period of 5 years are used as a case study; the results demonstrated 82%, 52%, and 41% accurate forecasts for 10, 20, and 30 min alarm forecasts, respectively. The results substantiate anticipating and averting alarms, which is significant in curbing alarm frequency and enhancing operational efficiency through proactive intervention.

[LG-29] Efficient numeracy in language models through single-token number embeddings

链接: https://arxiv.org/abs/2510.06824
作者: Linus Kreitner,Paul Hager,Jonathan Mengedoht,Georgios Kaissis,Daniel Rueckert,Martin J. Menten
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either limiting the numerical intuition of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
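The float-to-bits mapping at the heart of BitTokens is standard IEEE 754; the sketch below shows the lossless round trip. How the 64 bits are folded into a single token embedding is the paper's contribution and is not reproduced here.

```python
# Sketch: lossless number <-> 64-bit pattern via IEEE 754 binary64.
import struct

def float_to_bits(x: float) -> str:
    (u,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(u, "064b")

def bits_to_float(b: str) -> float:
    (x,) = struct.unpack(">d", struct.pack(">Q", int(b, 2)))
    return x

assert bits_to_float(float_to_bits(3.14159)) == 3.14159  # exact round trip
```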

[LG-30] The Unreasonable Effectiveness of Randomized Representations in Online Continual Graph Learning

链接: https://arxiv.org/abs/2510.06819
作者: Giovanni Donghi,Daniele Zambon,Luca Pasa,Cesare Alippi,Nicolò Navarin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Catastrophic forgetting is one of the main obstacles for Online Continual Graph Learning (OCGL), where nodes arrive one by one, distribution drifts may occur at any time and offline training on task-specific subgraphs is not feasible. In this work, we explore a surprisingly simple yet highly effective approach for OCGL: we use a fixed, randomly initialized encoder to generate robust and expressive node embeddings by aggregating neighborhood information, training online only a lightweight classifier. By freezing the encoder, we eliminate drifts of the representation parameters, a key source of forgetting, obtaining embeddings that are both expressive and stable. When evaluated across several OCGL benchmarks, despite its simplicity and lack of memory buffer, this approach yields consistent gains over state-of-the-art methods, with surprising improvements of up to 30% and performance often approaching that of the joint offline-training upper bound. These results suggest that in OCGL, catastrophic forgetting can be minimized without complex replay or regularization by embracing architectural simplicity and stability.
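The recipe reduces to a few lines: freeze a random aggregation of neighborhood features, and train only a linear head online. The sketch below is one plausible instantiation (the dimensions, tanh nonlinearity, and single aggregation hop are assumptions).

```python
# Sketch: fixed random encoder + lightweight online classifier for OCGL.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
d_in, d_emb, n_classes = 64, 256, 10
W = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, d_emb))  # frozen, never trained

def embed(x_self: np.ndarray, x_neigh: np.ndarray) -> np.ndarray:
    # One hop of neighborhood aggregation through the fixed random projection.
    return np.tanh((x_self + x_neigh.mean(axis=0)) @ W)

clf = SGDClassifier(loss="log_loss")
# Online loop as nodes arrive one by one:
#   clf.partial_fit(embed(x, neigh)[None, :], [label],
#                   classes=np.arange(n_classes))
```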

[LG-31] Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

链接: https://arxiv.org/abs/2510.06790
作者: Tavish McDonald,Bo Lei,Stanislav Fort,Bhavya Kailkhura,Brian Bartoldson
类目: Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model’s training data better reflects the attacked data’s components. We empirically support this hypothesis across vision language model and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization, while RL finetuning and protracted reasoning are not critical. For example, increasing emphasis on defensive specifications via prompting lowers the success rate of gradient-based multimodal attacks on VLMs robustified by adversarial pretraining, but this same intervention provides no such benefit to not-robustified models. This correlation of inference-compute’s robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Accordingly, we advise layering train-time and test-time defenses to obtain their synergistic benefit.

[LG-32] Function regression using the forward forward training and inferring paradigm

链接: https://arxiv.org/abs/2510.06762
作者: Shivam Padmani,Akshay Joshi
类目: Machine Learning (cs.LG)
*备注: Keywords: Neural Networks, Forward Forward training, Function Regression, Physical Neural Networks, Analog Computing

点击查看摘要

Abstract:Function regression/approximation is a fundamental application of machine learning. Neural networks (NNs) can be easily trained for function regression using a sufficient number of neurons and epochs. The forward-forward learning algorithm is a novel approach for training neural networks without backpropagation, and is well suited for implementation in neuromorphic computing and physical analogs for neural networks. To the best of the authors’ knowledge, the Forward Forward paradigm of training and inferencing NNs is currently restricted to classification tasks. This paper introduces a new methodology for approximating functions (function regression) using the Forward-Forward algorithm. Furthermore, the paper evaluates the developed methodology on univariate and multivariate functions, and provides preliminary studies of extending the proposed Forward-Forward regression to Kolmogorov Arnold Networks and Deep Physical Neural Networks.

[LG-33] Incorporating Expert Knowledge into Bayesian Causal Discovery of Mixtures of Directed Acyclic Graphs

链接: https://arxiv.org/abs/2510.06735
作者: Zachris Björkman,Jorge Loría,Sophie Wharrie,Samuel Kaski
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 28 pages, 18 figures

点击查看摘要

Abstract:Bayesian causal discovery benefits from prior information elicited from domain experts, and in heterogeneous domains any prior knowledge would be badly needed. However, so far prior elicitation approaches have assumed a single causal graph and hence are not suited to heterogeneous domains. We propose a causal elicitation strategy for heterogeneous settings, based on Bayesian experimental design (BED) principles, and a variational mixture structure learning (VaMSL) method – extending the earlier differentiable Bayesian structure learning (DiBS) method – to iteratively infer mixtures of causal Bayesian networks (CBNs). We construct an informative graph prior incorporating elicited expert feedback in the inference of mixtures of CBNs. Our proposed method successfully produces a set of alternative causal models (mixture components or clusters), and achieves an improved structure learning performance on heterogeneous synthetic data when informed by a simulated expert. Finally, we demonstrate that our approach is capable of capturing complex distributions in a breast cancer database.

[LG-34] A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking NEURIPS2025

链接: https://arxiv.org/abs/2510.06699
作者: Gal Fadlon,Idan Arbiv,Nimrod Berman,Omri Azencot
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025; The first two authors contributed equally and are co-leading authors

点击查看摘要

Abstract:Generating realistic time series data is critical for applications in healthcare, finance, and science. However, irregular sampling and missing values present significant challenges. While prior methods address these irregularities, they often yield suboptimal results and incur high computational costs. Recent advances in regular time series generation, such as the diffusion-based ImagenTime model, demonstrate strong, fast, and scalable generative capabilities by transforming time series into image representations, making them a promising solution. However, extending ImagenTime to irregular sequences using simple masking introduces “unnatural” neighborhoods, where missing values replaced by zeros disrupt the learning process. To overcome this, we propose a novel two-step framework: first, a Time Series Transformer completes irregular sequences, creating natural neighborhoods; second, a vision-based diffusion model with masking minimizes dependence on the completed values. This approach leverages the strengths of both completion and masking, enabling robust and efficient generation of realistic time series. Our method achieves state-of-the-art performance, with relative improvements of 70% in discriminative score and 85% in computational cost. Code is at this https URL.

[LG-35] Is the Hard-Label Cryptanalytic Model Extraction Really Polynomial?

链接: https://arxiv.org/abs/2510.06692
作者: Akira Ito,Takayuki Miura,Yosuke Todo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have attracted significant attention, and their internal models are now considered valuable intellectual assets. Extracting these internal models through access to a DNN is conceptually similar to extracting a secret key via oracle access to a block cipher. Consequently, cryptanalytic techniques, particularly differential-like attacks, have been actively explored recently. ReLU-based DNNs are the most commonly and widely deployed architectures. While early works (e.g., Crypto 2020, Eurocrypt 2024) assume access to exact output logits, which are usually invisible, more recent works (e.g., Asiacrypt 2024, Eurocrypt 2025) focus on the hard-label setting, where only the final classification result (e.g., “dog” or “car”) is available to the attacker. Notably, Carlini et al. (Eurocrypt 2025) demonstrated that model extraction is feasible in polynomial time even under this restricted setting. In this paper, we first show that the assumptions underlying their attack become increasingly unrealistic as the attack-target depth grows. In practice, satisfying these assumptions requires an exponential number of queries with respect to the attack depth, implying that the attack does not always run in polynomial time. To address this critical limitation, we propose a novel attack method called CrossLayer Extraction. Instead of directly extracting the secret parameters (e.g., weights and biases) of a specific neuron, which incurs exponential cost, we exploit neuron interactions across layers to extract this information from deeper layers. This technique significantly reduces query complexity and mitigates the limitations of existing model extraction approaches.

[LG-36] AutoBalance: An Automatic Balancing Framework for Training Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2510.06684
作者: Kang An,Chenhao Si,Ming Yan,Shiqian Ma
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 23 pages

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) provide a powerful and general framework for solving Partial Differential Equations (PDEs) by embedding physical laws into loss functions. However, training PINNs is notoriously difficult due to the need to balance multiple loss terms, such as PDE residuals and boundary conditions, which often have conflicting objectives and vastly different curvatures. Existing methods address this issue by manipulating gradients before optimization (a “pre-combine” strategy). We argue that this approach is fundamentally limited, as forcing a single optimizer to process gradients from spectrally heterogeneous loss landscapes disrupts its internal preconditioning. In this work, we introduce AutoBalance, a novel “post-combine” training paradigm. AutoBalance assigns an independent adaptive optimizer to each loss component and aggregates the resulting preconditioned updates afterwards. Extensive experiments on challenging PDE benchmarks show that AutoBalance consistently outperforms existing frameworks, achieving significant reductions in solution error, as measured by both the MSE and L^\infty norms. Moreover, AutoBalance is orthogonal to and complementary with other popular PINN methodologies, amplifying their effectiveness on demanding benchmarks.
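A minimal sketch of the post-combine idea, assuming PyTorch and two loss components: each component gets its own Adam instance, and the preconditioned updates are applied in turn. The actual aggregation rule in AutoBalance may differ.

```python
# Sketch of "post-combine" training: one adaptive optimizer per loss term.
import torch

def autobalance_step(model, loss_fns, optimizers):
    # loss_fns, e.g. [pde_residual_loss, boundary_loss]; one optimizer each.
    for loss_fn, opt in zip(loss_fns, optimizers):
        opt.zero_grad(set_to_none=True)
        loss_fn(model).backward()  # gradient of this component only
        opt.step()                 # this optimizer's preconditioned update

# Wiring (hypothetical loss callables):
# opts = [torch.optim.Adam(model.parameters(), lr=1e-3) for _ in range(2)]
# autobalance_step(model, [pde_loss, bc_loss], opts)
```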

[LG-37] Distributed Algorithms for Multi-Agent Multi-Armed Bandits with Collision

链接: https://arxiv.org/abs/2510.06683
作者: Daoyuan Zhou,Xuchuang Wang,Lin Yang,Yang Gao
类目: Machine Learning (cs.LG)
*备注: 21 pages, 4 figures

点击查看摘要

Abstract:We study the stochastic Multiplayer Multi-Armed Bandit (MMAB) problem, where multiple players select arms to maximize their cumulative rewards. Collisions occur when two or more players select the same arm, resulting in no reward, and are observed by the players involved. We consider a distributed setting without central coordination, where each player can only observe their own actions and collision feedback. We propose a distributed algorithm with an adaptive, efficient communication protocol. The algorithm achieves near-optimal group and individual regret, with a communication cost of only \mathcal{O}(\log\log T). Our experiments demonstrate significant performance improvements over existing baselines. Compared to state-of-the-art (SOTA) methods, our approach achieves a notable reduction in individual regret. Finally, we extend our approach to a periodic asynchronous setting, proving the lower bound for this problem and presenting an algorithm that achieves logarithmic regret.

[LG-38] TimeFormer: Transformer with Attention Modulation Empowered by Temporal Characteristics for Time Series Forecasting

链接: https://arxiv.org/abs/2510.06680
作者: Zhipeng Liu,Peibo Duan,Xuan Tang,Baixin Li,Yongsheng Huang,Mingyang Geng,Changsheng Zhang,Bin Zhang,Binwu Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although Transformers excel in natural language processing, their extension to time series forecasting remains challenging due to insufficient consideration of the differences between textual and temporal modalities. In this paper, we develop a novel Transformer architecture designed for time series data, aiming to maximize its representational capacity. We identify two key but often overlooked characteristics of time series: (1) unidirectional influence from the past to the future, and (2) the phenomenon of decaying influence over time. These characteristics are introduced to enhance the attention mechanism of Transformers. We propose TimeFormer, whose core innovation is a self-attention mechanism with two modulation terms (MoSA), designed to capture these temporal priors of time series under the constraints of the Hawkes process and causal masking. Additionally, TimeFormer introduces a framework based on multi-scale and subsequence analysis to capture semantic dependencies at different temporal scales, enriching the temporal dependencies. Extensive experiments conducted on multiple real-world datasets show that TimeFormer significantly outperforms state-of-the-art methods, achieving up to a 7.45% reduction in MSE compared to the best baseline and setting new benchmarks on 94.04% of evaluation metrics. Moreover, we demonstrate that the MoSA mechanism can be broadly applied to enhance the performance of other Transformer-based models.
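The two temporal priors can be grafted onto vanilla attention in a few lines; the sketch below applies a causal mask plus an exponential distance penalty. The paper's MoSA uses two learned modulation terms under a Hawkes-process constraint, so treat the fixed decay here as a stand-in.

```python
# Sketch: causal attention with a decaying-influence penalty on time lags.
import torch

def decayed_causal_attention(q, k, v, decay: float = 0.05):
    T = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    idx = torch.arange(T, device=q.device)
    lag = (idx[:, None] - idx[None, :]).clamp(min=0)        # i - j for j <= i
    scores = scores - decay * lag                           # decaying influence
    scores = scores.masked_fill(idx[None, :] > idx[:, None], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v                # causal weights
```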

[LG-39] XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

链接: https://arxiv.org/abs/2510.06672
作者: Udbhav Bamba,Minghao Fang,Yifan Yu,Haizhong Zheng,Fan Lai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and relying heavily on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy’s reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advances (e.g., GRPO and GSPO) by up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7X.

[LG-40] he Effect of Attention Head Count on Transformer Approximation

链接: https://arxiv.org/abs/2510.06662
作者: Penghao Yu,Haotian Jiang,Zeyu Bao,Ruoxi Yu,Qianxiao Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized D-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for \epsilon-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as O(1/\epsilon^{cT}), for some constant c and sequence length T. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order O(T) allows complete memorization of the input, where approximation is entirely achieved by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.

[LG-41] Rethinking Nonlinearity: Trainable Gaussian Mixture Modules for Modern Neural Architectures

链接: https://arxiv.org/abs/2510.06660
作者: Weiguo Lu,Gangnan Yuan,Hong-kun Zhang,Shangyang Li
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Neural networks in general, from MLPs and CNNs to attention-based Transformers, are constructed from layers of linear combinations followed by nonlinear operations such as ReLU, Sigmoid, or Softmax. Despite their strength, these conventional designs are often limited in introducing non-linearity by the choice of activation functions. In this work, we introduce Gaussian Mixture-Inspired Nonlinear Modules (GMNM), a new class of differentiable modules that draw on the universal density approximation property of Gaussian mixture models (GMMs) and the distance properties (metric space) of the Gaussian kernel. By relaxing probabilistic constraints and adopting a flexible parameterization of Gaussian projections, GMNM can be seamlessly integrated into diverse neural architectures and trained end-to-end with gradient-based methods. Our experiments demonstrate that incorporating GMNM into architectures such as MLPs, CNNs, attention mechanisms, and LSTMs consistently improves performance over standard baselines. These results highlight GMNM’s potential as a powerful and flexible module for enhancing efficiency and accuracy across a wide range of machine learning applications.

[LG-42] hree Forms of Stochastic Injection for Improved Distribution-to-Distribution Generative Modeling

链接: https://arxiv.org/abs/2510.06634
作者: Shiye Su,Yuhui Zhang,Linqi Zhou,Rajesh Ranganath,Serena Yeung-Levy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modeling transformations between arbitrary data distributions is a fundamental scientific challenge, arising in applications like drug discovery and evolutionary simulation. While flow matching offers a natural framework for this task, its use has thus far primarily focused on the noise-to-data setting, while its application in the general distribution-to-distribution setting is underexplored. We find that in the latter case, where the source is also a data distribution to be learned from limited samples, standard flow matching fails due to sparse supervision. To address this, we propose a simple and computationally efficient method that injects stochasticity into the training process by perturbing source samples and flow interpolants. On five diverse imaging tasks spanning biology, radiology, and astronomy, our method significantly improves generation quality, outperforming existing baselines by an average of 9 FID points. Our approach also reduces the transport cost between input and generated samples to better highlight the true effect of the transformation, making flow matching a more practical tool for simulating the diverse distribution transformations that arise in science.
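In flow-matching terms, the proposed injection amounts to adding noise at the source samples and at the interpolant before regressing the velocity. A minimal sketch for vector-valued data, with assumed noise scales and an assumed v_net(x, t) signature:

```python
# Sketch: flow-matching loss with stochastic injection at two of the sites
# named in the abstract (source samples and interpolants).
import torch

def fm_loss(v_net, x_src, x_tgt, sigma_src=0.05, sigma_t=0.05):
    x0 = x_src + sigma_src * torch.randn_like(x_src)  # perturbed source
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x_tgt                     # linear interpolant
    xt = xt + sigma_t * torch.randn_like(xt)          # perturbed interpolant
    target_v = x_tgt - x0                             # straight-line velocity
    return ((v_net(xt, t) - target_v) ** 2).mean()
```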

[LG-43] Chem-NMF: Multi-layer α-divergence Non-Negative Matrix Factorization for Cardiorespiratory Disease Clustering with Improved Convergence Inspired by Chemical Catalysts and Rigorous Asymptotic Analysis

链接: https://arxiv.org/abs/2510.06632
作者: Yasaman Torabi,Shahram Shirani,James P. Reilly
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Non-Negative Matrix Factorization (NMF) is an unsupervised learning method offering low-rank representations across various domains such as audio processing, biomedical signal analysis, and image recognition. The incorporation of \alpha -divergence in NMF formulations enhances flexibility in optimization, yet extending these methods to multi-layer architectures presents challenges in ensuring convergence. To address this, we introduce a novel approach inspired by the Boltzmann probability of the energy barriers in chemical reactions to theoretically perform convergence analysis. We introduce a novel method, called Chem-NMF, with a bounding factor which stabilizes convergence. To our knowledge, this is the first study to apply a physical chemistry perspective to rigorously analyze the convergence behaviour of the NMF algorithm. We start from mathematically proven asymptotic convergence results and then show how they apply to real data. Experimental results demonstrate that the proposed algorithm improves clustering accuracy by 5.6% \pm 2.7% on biomedical signals and 11.1% \pm 7.2% on face images (mean \pm std).

[LG-44] POME: Post Optimization Model Edit via Muon-style Projection

链接: https://arxiv.org/abs/2510.06627
作者: Yong Liu,Di Fu,Yang Luo,Zirui Zhu,Minhao Cheng,Cho-Jui Hsieh,Yang You
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Post-Optimization Model Edit (POME), a new algorithm that enhances the performance of fine-tuned large language models using only their pretrained and fine-tuned checkpoints, without requiring extra data or further optimization. The core idea is to apply a muon-style projection to \Delta W , the difference between the fine-tuned and pretrained weights. This projection uses truncated singular value decomposition (SVD) to equalize the influence of dominant update directions and prune small singular values, which often represent noise. As a simple post-processing step, POME is completely decoupled from the training pipeline. It requires zero modifications and imposes no overhead, making it universally compatible with any optimizer or distributed framework. POME delivers consistent gains, boosting average performance by +2.5% on GSM8K and +1.0% on code generation. Its broad applicability – from 7B foundation models to 72B RLHF-instructed models – establishes it as a practical, zero-cost enhancement for any fine-tuning pipeline. Code is available at this https URL.
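The edit itself fits in a few lines per weight matrix. The sketch below is one plausible reading of the abstract: truncated SVD of Delta W, dominant directions equalized via the orthogonal factor U V^T, small singular values pruned; the rescaling constant is an assumption, not the paper's rule.

```python
# Sketch of a POME-style post-optimization edit on one weight matrix.
import torch

def pome_edit(w_pre: torch.Tensor, w_ft: torch.Tensor, rank: int) -> torch.Tensor:
    delta = w_ft - w_pre                          # fine-tuning update
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank]   # prune small singular values
    scale = S.mean()                              # assumed magnitude heuristic
    return w_pre + scale * (U @ Vh)               # equalized, muon-style update
```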

[LG-45] DPA-Net: A Dual-Path Attention Neural Network for Inferring Glycemic Control Metrics from Self-Monitored Blood Glucose Data

链接: https://arxiv.org/abs/2510.06623
作者: Canyu Lei,Benjamin Lobo,Jianxin Xie
类目: Machine Learning (cs.LG)
*备注: 14 pages, 10 figures

点击查看摘要

Abstract:Continuous glucose monitoring (CGM) provides dense and dynamic glucose profiles that enable reliable estimation of Ambulatory Glucose Profile (AGP) metrics, such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR). However, the high cost and limited accessibility of CGM restrict its widespread adoption, particularly in low- and middle-income regions. In contrast, self-monitoring of blood glucose (SMBG) is inexpensive and widely available but yields sparse and irregular data that are challenging to translate into clinically meaningful glycemic metrics. In this work, we propose a Dual-Path Attention Neural Network (DPA-Net) to estimate AGP metrics directly from SMBG data. DPA-Net integrates two complementary paths: (1) a spatial-channel attention path that reconstructs a CGM-like trajectory from sparse SMBG observations, and (2) a multi-scale ResNet path that directly predicts AGP metrics. An alignment mechanism between the two paths is introduced to reduce bias and mitigate overfitting. In addition, we develop an active point selector to identify realistic and informative SMBG sampling points that reflect patient behavioral patterns. Experimental results on a large, real-world dataset demonstrate that DPA-Net achieves robust accuracy with low errors while reducing systematic bias. To the best of our knowledge, this is the first supervised machine learning framework for estimating AGP metrics from SMBG data, offering a practical and clinically relevant decision-support tool in settings where CGM is not accessible.

[LG-46] From Description to Detection: LLM based Extendable O-RAN Compliant Blind DoS Detection in 5G and Beyond

链接: https://arxiv.org/abs/2510.06530
作者: Thusitha Dayaratne,Ngoc Duy Pham,Viet Vo,Shangqi Lai,Sharif Abuadbba,Hajime Suzuki,Xingliang Yuan,Carsten Rudolph
类目: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The quality and experience of mobile communication have significantly improved with the introduction of 5G, and these improvements are expected to continue beyond the 5G era. However, vulnerabilities in control-plane protocols, such as Radio Resource Control (RRC) and Non-Access Stratum (NAS), pose significant security threats, such as Blind Denial of Service (DoS) attacks. Despite the availability of existing anomaly detection methods that leverage rule-based systems or traditional machine learning methods, these methods have several limitations, including the need for extensive training data, predefined rules, and limited explainability. Addressing these challenges, we propose a novel anomaly detection framework that leverages the capabilities of Large Language Models (LLMs) in zero-shot mode with unordered data and short natural language attack descriptions within the Open Radio Access Network (O-RAN) architecture. We analyse robustness to prompt variation, demonstrate the practicality of automating the attack descriptions and show that detection quality relies on the semantic completeness of the description rather than its phrasing or length. We utilise an RRC/NAS dataset to evaluate the solution and provide an extensive comparison of open-source and proprietary LLM implementations to demonstrate superior performance in attack detection. We further validate the practicality of our framework within O-RAN’s real-time constraints, illustrating its potential for detecting other Layer-3 attacks.

[LG-47] BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music

链接: https://arxiv.org/abs/2510.06528
作者: Mingyang Yao,Ke Chen,Shlomo Dubnov,Taylor Berg-Kirkpatrick
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Under review

点击查看摘要

Abstract:Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music analytical practices. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of POP909 dataset with tempo-aligned content and human-corrected labels of chords, beats, keys, and time signatures; and (2) We propose BACHI, a symbolic chord recognition model that decomposes the task into different decision steps, namely boundary detection and iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors the human ear-training practices. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.

[LG-48] Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture

链接: https://arxiv.org/abs/2510.06527
作者: John Dunbar,Scott Aaronson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We establish that randomly initialized neural networks, with large width and a natural choice of hyperparameters, have nearly independent outputs exactly when their activation function is nonlinear with zero mean under the Gaussian measure: \mathbb{E}_{z \sim \mathcal{N}(0,1)}[\sigma(z)]=0. For example, this includes ReLU and GeLU with an additive shift, as well as tanh, but not ReLU or GeLU by themselves. Because of their nearly independent outputs, we propose neural networks with zero-mean activation functions as a promising candidate for the Alignment Research Center's computational no-coincidence conjecture – a conjecture that aims to measure the limits of AI interpretability.
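The zero-mean condition is easy to check numerically; a small Monte Carlo sketch (the shifted variants are the kind of "additive shift" the abstract mentions):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)          # z ~ N(0, 1)

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

for name, f in [("relu", lambda x: np.maximum(x, 0)), ("gelu", gelu), ("tanh", np.tanh)]:
    print(f"{name}: E[sigma(z)] ~ {f(z).mean():+.3f}")
# relu (~ +0.399) and gelu (~ +0.282) fail the condition; tanh satisfies it.
# Subtracting the constant mean from relu or gelu yields a zero-mean activation.
```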

[LG-49] Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security NEURIPS2025

链接: https://arxiv.org/abs/2510.06525
作者: Ali Naseh,Anshuman Suri,Yuefeng Peng,Harsh Chaudhari,Alina Oprea,Amir Houmansadr
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at Lock-LLM Workshop, NeurIPS 2025

点击查看摘要

Abstract:Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs – a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
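A self-contained toy version of the attack: a linear classifier over image embeddings separates generators when each model leaves a distinct signature. Real use would replace the synthetic Gaussian features below with CLIP image-encoder outputs; the cluster geometry here is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_models, dim, per_model = 19, 512, 400
centers = rng.normal(size=(n_models, dim))            # per-model "signature"
X = np.vstack([c + 0.5 * rng.normal(size=(per_model, dim)) for c in centers])
y = np.repeat(np.arange(n_models), per_model)

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("deanonymization accuracy:", clf.score(Xte, yte))  # near 1.0 on this toy
```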

[LG-50] GUIDE: Guided Initialization and Distillation of Embeddings

链接: https://arxiv.org/abs/2510.06502
作者: Khoa Trinh,Gaurav Menghani,Erik Vee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Algorithmic efficiency techniques such as distillation (Hinton et al., 2015) are useful in improving model quality without increasing serving costs, provided a larger teacher model is available for a smaller student model to learn from during training. Standard distillation methods are limited to only forcing the student to match the teacher's outputs. Given the costs associated with training a large model, we believe we should be extracting more useful information from a teacher model than by just making the student match the teacher's outputs. In this paper, we introduce GUIDE (Guided Initialization and Distillation of Embeddings). GUIDE can be considered a distillation technique that forces the student to match the teacher in the parameter space. Using GUIDE we show a 25-26% reduction in the teacher-student quality gap when using large student models (400M - 1B parameters) trained on \approx 20B tokens. We also present a thorough analysis demonstrating that GUIDE can be combined with knowledge distillation with near-additive improvements. Furthermore, we show that applying GUIDE alone leads to substantially better model quality than applying knowledge distillation by itself. Most importantly, GUIDE introduces no training or inference overhead, and hence any model quality gains from our method are virtually free.
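A hedged sketch of what matching the teacher "in the parameter space" could look like as an auxiliary loss; `teacher_targets`, mapping student parameter names to same-shaped tensors, is hypothetical, since the abstract does not specify how teacher weights are projected down to the student's shapes:

```python
import torch

def parameter_matching_loss(student, teacher_targets, weight=1e-3):
    """Auxiliary loss pulling student parameters toward (projected) teacher
    parameters, added on top of the usual distillation objective (sketch)."""
    loss = torch.zeros(())
    for name, p in student.named_parameters():
        if name in teacher_targets:
            loss = loss + (p - teacher_targets[name]).pow(2).sum()
    return weight * loss
```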

[LG-51] Bayesian Optimization under Uncertainty for Training a Scale Parameter in Stochastic Models

链接: https://arxiv.org/abs/2510.06439
作者: Akash Yadav,Ruda Zhang
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Hyperparameter tuning is a challenging problem especially when the system itself involves uncertainty. Due to noisy function evaluations, optimization under uncertainty can be computationally expensive. In this paper, we present a novel Bayesian optimization framework tailored for hyperparameter tuning under uncertainty, with a focus on optimizing a scale- or precision-type parameter in stochastic models. The proposed method employs a statistical surrogate for the underlying random variable, enabling analytical evaluation of the expectation operator. Moreover, we derive a closed-form expression for the optimizer of the random acquisition function, which significantly reduces computational cost per iteration. Compared with a conventional one-dimensional Monte Carlo-based optimization scheme, the proposed approach requires 40 times fewer data points, resulting in up to a 40-fold reduction in computational cost. We demonstrate the effectiveness of the proposed method through two numerical examples in computational engineering.

[LG-52] Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization

链接: https://arxiv.org/abs/2510.06434
作者: Eliot Shekhtman,Yichen Zhou,Ingvar Ziemann,Nikolai Matni,Stephen Tu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learning from temporally-correlated data is a core facet of modern machine learning. Yet our understanding of sequential learning remains incomplete, particularly in the multi-trajectory setting where data consists of many independent realizations of a time-indexed stochastic process. This important regime both reflects modern training pipelines such as for large foundation models, and offers the potential for learning without the typical mixing assumptions made in the single-trajectory case. However, instance-optimal bounds are known only for least-squares regression with dependent covariates; for more general models or loss functions, the only broadly applicable guarantees result from a reduction to either i.i.d. learning, with effective sample size scaling only in the number of trajectories, or an existing single-trajectory result when each individual trajectory mixes, with effective sample size scaling as the full data budget deflated by the mixing-time. In this work, we significantly broaden the scope of instance-optimal rates in multi-trajectory settings via the Hellinger localization framework, a general approach for maximum likelihood estimation. Our method proceeds by first controlling the squared Hellinger distance at the path-measure level via a reduction to i.i.d. learning, followed by localization as a quadratic form in parameter space weighted by the trajectory Fisher information. This yields instance-optimal bounds that scale with the full data budget under a broad set of conditions. We instantiate our framework across four diverse case studies: a simple mixture of Markov chains, dependent linear regression under non-Gaussian noise, generalized linear models with non-monotonic activations, and linear-attention sequence models. In all cases, our bounds nearly match the instance-optimal rates from asymptotic normality, substantially improving over standard reductions.

[LG-53] Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting

链接: https://arxiv.org/abs/2510.06419
作者: Mert Kayaalp,Caner Turkmen,Oleksandr Shchur,Pedro Mercado,Abdul Fatir Ansari,Michael Bohlke-Schneider,Bernie Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Is bigger always better for time series foundation models? With the question in mind, we explore an alternative to training a single, large monolithic model: building a portfolio of smaller, pretrained forecasting models. By applying ensembling or model selection over these portfolios, we achieve competitive performance on large-scale benchmarks using much fewer parameters. We explore strategies for designing such portfolios and find that collections of specialist models consistently outperform portfolios of independently trained generalists. Remarkably, we demonstrate that post-training a base model is a compute-effective approach for creating sufficiently diverse specialists, and provide evidences that ensembling and model selection are more compute-efficient than test-time fine-tuning.
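In code, the test-time use of such a portfolio is simple; a sketch assuming per-model validation errors and stacked test forecasts:

```python
import numpy as np

def portfolio_forecast(val_errors, test_preds, k=3):
    """Pick the k specialists with the lowest validation error and average
    their forecasts (model selection for k=1, ensembling for k>1)."""
    best = np.argsort(val_errors)[:k]
    return np.mean([test_preds[i] for i in best], axis=0)
```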

[LG-54] The Effect of Label Noise on the Information Content of Neural Representations

链接: https://arxiv.org/abs/2510.06401
作者: Ali Hussaini Umar,Franky Kevin Nando Tezoh,Jean Barbier,Santiago Acevedo,Alessandro Laio
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:In supervised classification tasks, models are trained to predict a label for each data point. In real-world datasets, these labels are often noisy due to annotation errors. While the impact of label noise on the performance of deep learning models has been widely studied, its effects on the networks' hidden representations remain poorly understood. We address this gap by systematically comparing hidden representations using the Information Imbalance, a computationally efficient proxy of conditional mutual information. Through this analysis, we observe that the information content of the hidden representations follows a double descent as a function of the number of network parameters, akin to the behavior of the test error. We further demonstrate that in the underparameterized regime, representations learned with noisy labels are more informative than those learned with clean labels, while in the overparameterized regime, these representations are equally informative. Our results indicate that the representations of overparameterized networks are robust to label noise. We also find that the information imbalance between the penultimate and pre-softmax layers decreases with cross-entropy loss in the overparameterized regime. This offers a new perspective on understanding generalization in classification tasks. Extending our analysis to representations learned from random labels, we show that these perform worse than random features. This indicates that training on random labels drives networks far beyond lazy learning, as weights adapt to encode label information.

[LG-55] Making and Evaluating Calibrated Forecasts

链接: https://arxiv.org/abs/2510.06388
作者: Yuxuan Lu,Yifan Wu,Jason Hartline,Lunjia Hu
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting. We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify ones that do or do not preserve truthfulness. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.

[LG-56] Lagrangian neural ODEs: Measuring the existence of a Lagrangian with Helmholtz metrics NEURIPS2025

链接: https://arxiv.org/abs/2510.06367
作者: Luca Wolf,Tobias Buck,Bjoern Malte Schaefer
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Accepted for the NeurIPS 2025 Machine Learning and the Physical Sciences workshop. 6 pages, 3 figures

点击查看摘要

Abstract:Neural ODEs are a widely used, powerful machine learning technique, in particular for physics. However, not every solution is physical in the sense of being an Euler-Lagrange equation. We present Helmholtz metrics to quantify this resemblance for a given ODE and demonstrate their capabilities on several fundamental systems with noise. We combine them with a second-order neural ODE to form a Lagrangian neural ODE, which allows learning Euler-Lagrange equations directly and with zero additional inference cost. We demonstrate that, using only positional data, Lagrangian neural ODEs can distinguish Lagrangian from non-Lagrangian systems and improve the neural ODE solutions.

[LG-57] PIKAN: Physics-Inspired Kolmogorov-Arnold Networks for Explainable UAV Channel Modelling

链接: https://arxiv.org/abs/2510.06355
作者: Kürşat Tekbıyık,Güneş Karabulut Kurt,Antoine Lesage-Landry
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Unmanned aerial vehicle (UAV) communications demand accurate yet interpretable air-to-ground (A2G) channel models that can adapt to nonstationary propagation environments. While deterministic models offer interpretability and deep learning (DL) models provide accuracy, both approaches suffer from either rigidity or a lack of explainability. To bridge this gap, we propose the Physics-Inspired Kolmogorov-Arnold Network (PIKAN) that embeds physical principles (e.g., free-space path loss, two-ray reflections) into the learning process. Unlike physics-informed neural networks (PINNs), PIKAN introduces physical knowledge as flexible inductive biases rather than hard constraints, enabling a more flexible training process. Experiments on UAV A2G measurement data show that PIKAN achieves comparable accuracy to DL models while providing symbolic and explainable expressions aligned with propagation laws. Remarkably, PIKAN achieves this performance with only 232 parameters, making it up to 37 times lighter than multilayer perceptron (MLP) baselines with thousands of parameters, without sacrificing correlation with measurements and while still providing symbolic expressions. These results highlight PIKAN as an efficient, interpretable, and scalable solution for UAV channel modelling in beyond-5G and 6G networks.

[LG-58] Beyond Static Knowledge Messengers: Towards Adaptive Fair and Scalable Federated Learning for Medical AI ALT

链接: https://arxiv.org/abs/2510.06259
作者: Jahidul Arafat,Fariha Tasmin,Sanjaya Poudel,Ahsan Habib Tareq,Iftekhar Haider
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 20 pages, 4 figures, 14 tables. Proposes Adaptive Fair Federated Learning (AFFL) algorithm and MedFedBench benchmark suite for healthcare federated learning

点击查看摘要

Abstract:Medical AI faces challenges in privacy-preserving collaborative learning while ensuring fairness across heterogeneous healthcare institutions. Current federated learning approaches suffer from static architectures, slow convergence (45-73 rounds), fairness gaps marginalizing smaller institutions, and scalability constraints (15-client limit). We propose Adaptive Fair Federated Learning (AFFL) through three innovations: (1) Adaptive Knowledge Messengers dynamically scaling capacity based on heterogeneity and task complexity, (2) Fairness-Aware Distillation using influence-weighted aggregation, and (3) Curriculum-Guided Acceleration reducing rounds by 60-70%. Our theoretical analysis provides convergence guarantees with \epsilon-fairness bounds, achieving O(T^{-1/2}) + O(H_{\max}/T^{3/4}) rates. Projected results show 55-75% communication reduction, 56-68% fairness improvement, 34-46% energy savings, and 100+ institution support. The framework enables multi-modal integration across imaging, genomics, EHR, and sensor data while maintaining HIPAA/GDPR compliance. We propose the MedFedBench benchmark suite for standardized evaluation across six healthcare dimensions: convergence efficiency, institutional fairness, privacy preservation, multi-modal integration, scalability, and clinical deployment readiness. Economic projections indicate 400-800% ROI for rural hospitals and 15-25% performance gains for academic centers. This work presents a seven-question research agenda, a 24-month implementation roadmap, and pathways toward democratizing healthcare AI.

[LG-59] Enhancing Resilience for IoE: A Perspective of Networking-Level Safeguard

链接: https://arxiv.org/abs/2508.20504
作者: Guan-Yan Yang,Jui-Ning Chen,Farn Wang,Kuo-Hui Yeh
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: To be published in IEEE Network Magazine, 2026

点击查看摘要

Abstract:The Internet of Energy (IoE) integrates IoT-driven digital communication with power grids to enable efficient and sustainable energy systems. Still, its interconnectivity exposes critical infrastructure to sophisticated cyber threats, including adversarial attacks designed to bypass traditional safeguards. Unlike general IoT risks, IoE threats have heightened public safety consequences, demanding resilient solutions. From the networking-level safeguard perspective, we propose a Graph Structure Learning (GSL)-based safeguards framework that jointly optimizes graph topology and node representations to resist adversarial network model manipulation inherently. Through a conceptual overview, architectural discussion, and case study on a security dataset, we demonstrate GSL’s superior robustness over representative methods, offering practitioners a viable path to secure IoE networks against evolving attacks. This work highlights the potential of GSL to enhance the resilience and reliability of future IoE networks for practitioners managing critical infrastructure. Lastly, we identify key open challenges and propose future research directions in this novel research area.

[LG-60] Accelerating Inference for Multilayer Neural Networks with Quantum Computers

链接: https://arxiv.org/abs/2510.07195
作者: Arthur G. Rattew,Po-Wei Huang,Naixu Guo,Lirandë Pira,Patrick Rebentrost
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fault-tolerant Quantum Processing Units (QPUs) promise to deliver exponential speed-ups in select computational tasks, yet their integration into modern deep learning pipelines remains unclear. In this work, we take a step towards bridging this gap by presenting the first fully-coherent quantum implementation of a multilayer neural network with non-linear activation functions. Our constructions mirror widely used deep learning architectures based on ResNet, and consist of residual blocks with multi-filter 2D convolutions, sigmoid activations, skip-connections, and layer normalizations. We analyse the complexity of inference for networks under three quantum data access regimes. Without any assumptions, we establish a quadratic speedup over classical methods for shallow bilinear-style networks. With efficient quantum access to the weights, we obtain a quartic speedup over classical methods. With efficient quantum access to both the inputs and the network weights, we prove that a network with an N-dimensional vectorized input, k residual block layers, and a final residual-linear-pooling layer can be implemented with an error of \epsilon at O(\text{polylog}(N/\epsilon)^k) inference cost.

[LG-61] Covert Quantum Learning: Privately and Verifiably Learning from Quantum Data

链接: https://arxiv.org/abs/2510.07193
作者: Abhishek Anand,Matthias C. Caro,Ari Karchmer,Saachi Mutreja
类目: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 16 + 54 pages

点击查看摘要

Abstract:Quantum learning from remotely accessed quantum compute and data must address two key challenges: verifying the correctness of data and ensuring the privacy of the learner’s data-collection strategies and resulting conclusions. The covert (verifiable) learning model of Canetti and Karchmer (TCC 2021) provides a framework for endowing classical learning algorithms with such guarantees. In this work, we propose models of covert verifiable learning in quantum learning theory and realize them without computational hardness assumptions for remote data access scenarios motivated by established quantum data advantages. We consider two privacy notions: (i) strategy-covertness, where the eavesdropper does not gain information about the learner’s strategy; and (ii) target-covertness, where the eavesdropper does not gain information about the unknown object being learned. We show: Strategy-covert algorithms for making quantum statistical queries via classical shadows; Target-covert algorithms for learning quadratic functions from public quantum examples and private quantum statistical queries, for Pauli shadow tomography and stabilizer state learning from public multi-copy and private single-copy quantum measurements, and for solving Forrelation and Simon’s problem from public quantum queries and private classical queries, where the adversary is a unidirectional or i.i.d. ancilla-free eavesdropper. The lattermost results in particular establish that the exponential separation between classical and quantum queries for Forrelation and Simon’s problem survives under covertness constraints. Along the way, we design covert verifiable protocols for quantum data acquisition from public quantum queries which may be of independent interest. Overall, our models and corresponding algorithms demonstrate that quantum advantages are privately and verifiably achievable even with untrusted, remote data.

[LG-62] Split Conformal Classification with Unsupervised Calibration

链接: https://arxiv.org/abs/2510.07185
作者: Santiago Mazuelas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Methods for split conformal prediction leverage calibration samples to transform any prediction rule into a set-prediction rule that complies with a target coverage probability. Existing methods provide remarkably strong performance guarantees with minimal computational costs. However, they require calibration samples composed of labeled examples different from those used for training. This requirement can be highly inconvenient, as it prevents the use of all labeled examples for training and may require acquiring additional labels solely for calibration. This paper presents an effective methodology for split conformal prediction with unsupervised calibration for classification tasks. In the proposed approach, set-prediction rules are obtained using unsupervised calibration samples together with the supervised training samples previously used to learn the classification rule. Theoretical and experimental results show that the presented methods can achieve performance comparable to that with supervised calibration, at the expense of a moderate degradation in performance guarantees and computational efficiency.
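For reference, here is the standard supervised split conformal classification procedure that the paper relaxes (its contribution is replacing the labeled calibration set with unsupervised samples plus the training data):

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Supervised split conformal baseline: calibrate a nonconformity
    threshold on held-out labeled data, then emit prediction sets."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]       # nonconformity
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    return test_probs >= 1.0 - q    # boolean mask: one prediction set per row
```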

[LG-63] Bayesian Portfolio Optimization by Predictive Synthesis

链接: https://arxiv.org/abs/2510.07180
作者: Masahiro Kato,Kentaro Baba,Hibiki Kaibuchi,Ryo Inokuchi
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Portfolio optimization is a critical task in investment. Most existing portfolio optimization methods require information on the distribution of returns of the assets that make up the portfolio. However, such distribution information is usually unknown to investors. Various methods have been proposed to estimate distribution information, but their accuracy greatly depends on the uncertainty of the financial markets. Due to this uncertainty, a model that predicts the distribution information well at one point in time may perform less accurately than another model at a different time. To solve this problem, we investigate a method for portfolio optimization based on Bayesian predictive synthesis (BPS), one of the Bayesian ensemble methods for meta-learning. We assume that investors have access to multiple asset return prediction models. By using BPS with dynamic linear models to combine these predictions, we can obtain a Bayesian predictive posterior over the mean rewards of assets that accommodates the uncertainty of the financial markets. In this study, we examine how to construct mean-variance portfolios and quantile-based portfolios based on the predicted distribution information.

[LG-64] Active Control of Turbulent Airfoil Flows Using Adjoint-based Deep Learning

链接: https://arxiv.org/abs/2510.07106
作者: Xuemin Liu,Tom Hickling,Jonathan F. MacArt
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We train active neural-network flow controllers using a deep learning PDE augmentation method to optimize lift-to-drag ratios in turbulent airfoil flows at Reynolds number 5\times10^4 and Mach number 0.4. Direct numerical simulation and large eddy simulation are employed to model compressible, unconfined flow over two- and three-dimensional semi-infinite NACA 0012 airfoils at angles of attack \alpha = 5^\circ , 10^\circ , and 15^\circ . Control actions, implemented through a blowing/suction jet at a fixed location and geometry on the upper surface, are adaptively determined by a neural network that maps local pressure measurements to optimal jet total pressure, enabling a sensor-informed control policy that responds spatially and temporally to unsteady flow conditions. The sensitivities of the flow to the neural network parameters are computed using the adjoint Navier-Stokes equations, which we construct using automatic differentiation applied to the flow solver. The trained flow controllers significantly improve the lift-to-drag ratios and reduce flow separation for both two- and three-dimensional airfoil flows, especially at \alpha = 5^\circ and 10^\circ . The 2D-trained models remain effective when applied out-of-sample to 3D flows, which demonstrates the robustness of the adjoint-trained control approach. The 3D-trained models capture the flow dynamics even more effectively, which leads to better energy efficiency and comparable performance for both adaptive (neural network) and offline (simplified, constant-pressure) controllers. These results underscore the effectiveness of this learning-based approach in improving aerodynamic performance.

[LG-65] Diffusion-Augmented Reinforcement Learning for Robust Portfolio Optimization under Stress Scenarios

链接: https://arxiv.org/abs/2510.07099
作者: Himanshu Choudhary,Arishi Orra,Manoj Thakur
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:In the ever-changing and intricate landscape of financial markets, portfolio optimisation remains a formidable challenge for investors and asset managers. Conventional methods often struggle to capture the complex dynamics of market behaviour and align with diverse investor preferences. To address this, we propose an innovative framework, termed Diffusion-Augmented Reinforcement Learning (DARL), which synergistically integrates Denoising Diffusion Probabilistic Models (DDPMs) with Deep Reinforcement Learning (DRL) for portfolio management. By leveraging DDPMs to generate synthetic market crash scenarios conditioned on varying stress intensities, our approach significantly enhances the robustness of training data. Empirical evaluations demonstrate that DARL outperforms traditional baselines, delivering superior risk-adjusted returns and resilience against unforeseen crises, such as the 2025 Tariff Crisis. This work offers a robust and practical methodology to bolster stress resilience in DRL-driven financial applications.

[LG-66] Explaining Models under Multivariate Bernoulli Distribution via Hoeffding Decomposition

链接: https://arxiv.org/abs/2510.07088
作者: Baptiste Ferrere(EDF R&D PRISME, IMT, SINCLAIR AI Lab),Nicolas Bousquet(EDF R&D PRISME, SINCLAIR AI Lab, LPSM (UMR_8001)),Fabrice Gamboa(IMT),Jean-Michel Loubes(IMT),Joseph Muré(EDF R&D PRISME)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explaining the behavior of predictive models with random inputs can be achieved through sub-model decomposition, where such sub-models have more easily interpretable features. Arising from the uncertainty quantification community, recent results have demonstrated the existence and uniqueness of a generalized Hoeffding decomposition for such predictive models when the stochastic input variables are correlated, based on concepts of oblique projection onto L^2 subspaces. This article focuses on the case where the input variables have Bernoulli distributions and provides a complete description of this decomposition. We show that in this case the underlying L^2 subspaces are one-dimensional and that the functional decomposition is explicit. This leads to a complete interpretability framework and theoretically allows reverse engineering. Explicit indicators of the influence of inputs on the output prediction (exemplified by Sobol' indices and Shapley effects) can be explicitly derived. Illustrated by numerical experiments, this type of analysis proves useful for addressing decision-support problems, based on binary decision diagrams, Boolean networks or binary neural networks. The article outlines perspectives for exploring high-dimensional settings and, beyond the case of binary inputs, extending these findings to models with finite countable inputs.
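As a toy illustration of such explicit influence indicators, first-order Sobol' indices of a small Boolean model with independent Bernoulli inputs can be computed by exhaustive enumeration (the paper's machinery is needed for the correlated case):

```python
import itertools
import numpy as np

p = np.array([0.3, 0.5, 0.7])                       # P(x_i = 1)
f = lambda x: x[0] * x[1] + (1 - x[0]) * x[2]       # example binary model

pts = np.array(list(itertools.product([0, 1], repeat=3)))
w = np.prod(np.where(pts == 1, p, 1 - p), axis=1)   # joint pmf (independence)
y = np.array([f(x) for x in pts], dtype=float)
mu = w @ y
var = w @ (y - mu) ** 2

for i in range(3):
    # E[f | x_i = v] for v in {0, 1}, then the variance of that conditional mean
    cond = [(w[pts[:, i] == v] @ y[pts[:, i] == v]) / w[pts[:, i] == v].sum()
            for v in (0, 1)]
    var_i = (1 - p[i]) * (cond[0] - mu) ** 2 + p[i] * (cond[1] - mu) ** 2
    print(f"S_{i} = {var_i / var:.3f}")              # first-order Sobol' index
```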

[LG-67] Root Cause Analysis of Outliers in Unknown Cyclic Graphs

链接: https://arxiv.org/abs/2510.06995
作者: Daniela Schkoda,Dominik Janzing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We study the propagation of outliers in cyclic causal graphs with linear structural equations, tracing them back to one or several "root cause" nodes. We show that it is possible to identify a short list of potential root causes provided that the perturbation is sufficiently strong and propagates according to the same structural equations as in the normal mode. This shortlist consists of the true root causes together with any of their parents that lie on a cycle with a root cause. Notably, our method does not require prior knowledge of the causal graph.

[LG-68] PyCFRL: A Python library for counterfactually fair offline reinforcement learning via sequential data preprocessing

链接: https://arxiv.org/abs/2510.06935
作者: Jianhan Zhang,Jitao Wang,Chengchun Shi,John D. Piette,Donglin Zeng,Zhenke Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) aims to learn and evaluate a sequential decision rule, often referred to as a “policy”, that maximizes the population-level benefit in an environment across possibly infinitely many time steps. However, the sequential decisions made by an RL algorithm, while optimized to maximize overall population benefits, may disadvantage certain individuals who are in minority or socioeconomically disadvantaged groups. To address this problem, we introduce PyCFRL, a Python library for ensuring counterfactual fairness in offline RL. PyCFRL implements a novel data preprocessing algorithm for learning counterfactually fair RL policies from offline datasets and provides tools to evaluate the values and counterfactual unfairness levels of RL policies. We describe the high-level functionalities of PyCFRL and demonstrate one of its major use cases through a data example. The library is publicly available on PyPI and Github (this https URL), and detailed tutorials can be found in the PyCFRL documentation (this https URL).

[LG-69] Textual interpretation of transient image classifications from large language models

链接: https://arxiv.org/abs/2510.06931
作者: Fiorenzo Stoppa,Turan Bulmus,Steven Bloemen,Stephen J. Smartt,Paul J. Groot,Paul Vreeswijk,Ken W. Smith
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Published in Nature Astronomy (2025). Publisher’s Version of Record (CC BY 4.0). DOI: https://doi.org/10.1038/s41550-025-02670-z

点击查看摘要

Abstract:Modern astronomical surveys deliver immense volumes of transient detections, yet distinguishing real astrophysical signals (for example, explosive events) from bogus imaging artefacts remains a challenge. Convolutional neural networks are effectively used for real versus bogus classification; however, their reliance on opaque latent representations hinders interpretability. Here we show that large language models (LLMs) can approach the performance level of a convolutional neural network on three optical transient survey datasets (Pan-STARRS, MeerLICHT and ATLAS) while simultaneously producing direct, human-readable descriptions for every candidate. Using only 15 examples and concise instructions, Google’s LLM, Gemini, achieves a 93% average accuracy across datasets that span a range of resolution and pixel scales. We also show that a second LLM can assess the coherence of the output of the first model, enabling iterative refinement by identifying problematic cases. This framework allows users to define the desired classification behaviour through natural language and examples, bypassing traditional training pipelines. Furthermore, by generating textual descriptions of observed features, LLMs enable users to query classifications as if navigating an annotated catalogue, rather than deciphering abstract latent spaces. As next-generation telescopes and surveys further increase the amount of data available, LLM-based classification could help bridge the gap between automated detection and transparent, human-level understanding.

[LG-70] Quantum Sparse Recovery and Quantum Orthogonal Matching Pursuit

链接: https://arxiv.org/abs/2510.06925
作者: Armando Bellante,Stefano Vanerio,Stefano Zanero
类目: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study quantum sparse recovery in non-orthogonal, overcomplete dictionaries: given coherent quantum access to a state and a dictionary of vectors, the goal is to reconstruct the state up to \ell_2 error using as few vectors as possible. We first show that the general recovery problem is NP-hard, ruling out efficient exact algorithms in full generality. To overcome this, we introduce Quantum Orthogonal Matching Pursuit (QOMP), the first quantum analogue of the classical OMP greedy algorithm. QOMP combines quantum subroutines for inner product estimation, maximum finding, and block-encoded projections with an error-resetting design that avoids iteration-to-iteration error accumulation. Under standard mutual incoherence and well-conditioned sparsity assumptions, QOMP provably recovers the exact support of a K-sparse state in polynomial time. As an application, we give the first framework for sparse quantum tomography with non-orthogonal dictionaries in \ell_2 norm, achieving query complexity \widetilde{O}(\sqrt{N}/\epsilon) in favorable regimes and reducing tomography to estimating only K coefficients instead of N amplitudes. In particular, for pure-state tomography with m=O(N) dictionary vectors and sparsity K=\widetilde{O}(1) on a well-conditioned subdictionary, this circumvents the \widetilde{\Omega}(N/\epsilon) lower bound that holds in the dense, orthonormal-dictionary setting, without contradiction, by leveraging sparsity together with non-orthogonality. Beyond tomography, we analyze QOMP in the QRAM model, where it yields polynomial speedups over classical OMP implementations, and provide a quantum algorithm to estimate the mutual incoherence of a dictionary of m vectors in O(m/\epsilon) queries, improving over both deterministic and quantum-inspired classical methods.
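For orientation, here is the classical OMP loop that QOMP mirrors with quantum subroutines (inner-product estimation replaces the correlation step, maximum finding replaces the argmax):

```python
import numpy as np

def omp(D, x, K, tol=1e-8):
    """Classical Orthogonal Matching Pursuit: greedily pick the dictionary
    column most correlated with the residual, then least-squares re-fit."""
    support, coef = [], None
    residual = x.copy()
    for _ in range(K):
        j = int(np.argmax(np.abs(D.T @ residual)))   # best-matching atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef          # orthogonal re-fit
        if np.linalg.norm(residual) < tol:
            break
    return support, coef
```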

[LG-71] Reconquering Bell sampling on qudits: stabilizer learning and testing quantum pseudorandomness bounds and more

链接: https://arxiv.org/abs/2510.06848
作者: Jonathan Allcock,Joao F. Doriguello,Gábor Ivanyos,Miklos Santha
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 51 pages, 1 figure. Comments are welcome

点击查看摘要

Abstract:Bell sampling is a simple yet powerful tool based on measuring two copies of a quantum state in the Bell basis, and has found applications in a plethora of problems related to stabiliser states and measures of magic. However, it was not known how to generalise the procedure from qubits to d-level systems – qudits – for all dimensions d > 2 in a useful way. Indeed, a prior work of the authors (arXiv'24) showed that the natural extension of Bell sampling to arbitrary dimensions fails to provide meaningful information about the quantum states being measured. In this paper, we overcome the difficulties encountered in previous works and develop a useful generalisation of Bell sampling to qudits of all d \geq 2. At the heart of our primitive is a new unitary, based on Lagrange's four-square theorem, that maps four copies of any stabiliser state |\mathcal{S}\rangle to four copies of its complex conjugate |\mathcal{S}^\ast\rangle (up to some Pauli operator), which may be of independent interest. We then demonstrate the utility of our new Bell sampling technique by lifting several known results from qubits to qudits for any d \geq 2: 1. Learning stabiliser states in O(n^3) time with O(n) samples; 2. Solving the Hidden Stabiliser Group Problem in \tilde{O}(n^3/\varepsilon) time with \tilde{O}(n/\varepsilon) samples; 3. Testing whether |\psi\rangle has stabiliser size at least d^t or is \varepsilon-far from all such states in \tilde{O}(n^3/\varepsilon) time with \tilde{O}(n/\varepsilon) samples; 4. Clifford circuits with at most n/2 single-qudit non-Clifford gates cannot prepare pseudorandom states; 5. Testing whether |\psi\rangle has stabiliser fidelity at least 1-\varepsilon_1 or at most 1-\varepsilon_2 with O(d^2/\varepsilon_2) samples if \varepsilon_1 = 0 or O(d^2/\varepsilon_2^2) samples if \varepsilon_1 = O(d^{-2}).

[LG-72] Quantum Computing Methods for Malware Detection

链接: https://arxiv.org/abs/2510.06803
作者: Eliška Krátká,Aurél Gábor Gábris
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 22 pages, 2 figures, 3 tables

点击查看摘要

Abstract:In this paper, we explore the potential of quantum computing in enhancing malware detection through the application of Quantum Machine Learning (QML). Our main objective is to investigate the performance of the Quantum Support Vector Machine (QSVM) algorithm compared to SVM. A publicly available dataset containing raw binaries of Portable Executable (PE) files was used for the classification. The QSVM algorithm, incorporating quantum kernels through different feature maps, was implemented and evaluated on a local simulator within the Qiskit SDK and IBM quantum computers. Experimental results from simulators and quantum hardware provide insights into the behavior and performance of quantum computers, especially in handling large-scale computations for malware detection tasks. The work summarizes the practical experience with using quantum hardware via the Qiskit interfaces. We describe in detail the critical issues encountered, as well as the fixes that had to be developed and applied to the base code of the Qiskit Machine Learning library. These issues include missing transpilation of the circuits submitted to IBM Quantum systems and exceeding the maximum job size limit due to the submission of all the circuits in one job.

[LG-73] Latent Representation Learning in Heavy-Ion Collisions with MaskPoint Transformer NEURIPS2025

链接: https://arxiv.org/abs/2510.06691
作者: Jing-Zong Zhang,Shuang Guo,Li-Lin Zhu,Lingxiao Wang,Guo-Liang Ma
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, accepted at the NeurIPS 2025 workshop “Machine Learning and the Physical Sciences”

点击查看摘要

Abstract:A central challenge in high-energy nuclear physics is to extract informative features from the high-dimensional final-state data of heavy-ion collisions (HIC) in order to enable reliable downstream analyses. Traditional approaches often rely on selected observables, which may miss subtle but physically relevant structures in the data. To address this, we introduce a Transformer-based autoencoder trained with a two-stage paradigm: self-supervised pre-training followed by supervised fine-tuning. The pretrained encoder learns latent representations directly from unlabeled HIC data, providing a compact and information-rich feature space that can be adapted to diverse physics tasks. As a case study, we apply the method to distinguish between large and small collision systems, where it achieves significantly higher classification accuracy than PointNet. Principal component analysis and SHAP interpretation further demonstrate that the autoencoder captures complex nonlinear correlations beyond individual observables, yielding features with strong discriminative and explanatory power. These results establish our two-stage framework as a general and robust foundation for feature learning in HIC, opening the door to more powerful analyses of quark–gluon plasma properties and other emergent phenomena. The implementation is publicly available at this https URL.

[LG-74] Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix

链接: https://arxiv.org/abs/2510.06685
作者: Tomohiro Hayase,Benoît Collins,Ryo Karakida
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko-Pastur law, contrary to what previous work assumed. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.

[LG-75] Fitzpatrick Thresholding for Skin Image Segmentation MICCAI2025

链接: https://arxiv.org/abs/2510.06655
作者: Duncan Stothers,Sophia Xu,Carlie Reeves,Lia Gracey
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Accepted to MICCAI 2025 ISIC Workshop. 24 minute Oral presentation given. Awarded “Best Paper - Honorable Mention”

点击查看摘要

Abstract:Accurate estimation of the body surface area (BSA) involved by a rash, such as psoriasis, is critical for assessing rash severity, selecting an initial treatment regimen, and following clinical treatment response. Attempts at segmentation of inflammatory skin disease such as psoriasis perform markedly worse on darker skin tones, potentially impeding equitable care. We assembled a psoriasis dataset sourced from six public atlases, annotated for Fitzpatrick skin type, and added detailed segmentation masks for every image. Reference models based on U-Net, ResU-Net, and SETR-small are trained without tone information. On the tuning split we sweep decision thresholds and select (i) global optima and (ii) per Fitzpatrick skin tone optima for Dice and binary IoU. Adapting Fitzpatrick specific thresholds lifted segmentation performance for the darkest subgroup (Fitz VI) by up to +31 % bIoU and +24 % Dice on UNet, with consistent, though smaller, gains in the same direction for ResU-Net (+25 % bIoU, +18 % Dice) and SETR-small (+17 % bIoU, +11 % Dice). Because Fitzpatrick skin tone classifiers trained on Fitzpatrick-17k now exceed 95 % accuracy, the cost of skin tone labeling required for this technique has fallen dramatically. Fitzpatrick thresholding is simple, model-agnostic, requires no architectural changes, no re-training, and is virtually cost free. We demonstrate the inclusion of Fitzpatrick thresholding as a potential future fairness baseline.
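A minimal sketch of the thresholding step, assuming per-image probability maps, boolean ground-truth masks, and Fitzpatrick labels on the tuning split; the grid and the Dice objective follow the abstract's description:

```python
import numpy as np

def dice(pred, mask, eps=1e-6):
    return (2 * (pred & mask).sum() + eps) / (pred.sum() + mask.sum() + eps)

def fitzpatrick_thresholds(probs, masks, fitz, grid=np.linspace(0.05, 0.95, 19)):
    """Pick one probability cutoff per Fitzpatrick group by maximizing the
    group's mean Dice on the tuning split (hedged sketch)."""
    return {g: max(grid, key=lambda t: np.mean(
                [dice(p >= t, m) for p, m, f in zip(probs, masks, fitz) if f == g]))
            for g in set(fitz)}
```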

[LG-76] Q-Learning with Fine-Grained Gap-Dependent Regret

链接: https://arxiv.org/abs/2510.06647
作者: Haochen Zhang,Zhong Zheng,Lingzhou Xue
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study fine-grained gap-dependent regret bounds for model-free reinforcement learning in episodic tabular Markov Decision Processes. Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps. We address this limitation by establishing fine-grained gap-dependent regret bounds for both UCB-based and non-UCB-based algorithms. In the UCB-based setting, we develop a novel analytical framework that explicitly separates the analysis of optimal and suboptimal state-action pairs, yielding the first fine-grained regret upper bound for UCB-Hoeffding (Jin et al., 2018). To highlight the generality of this framework, we introduce ULCB-Hoeffding, a new UCB-based algorithm inspired by AMB (Xu et al., 2021) but with a simplified structure, which enjoys fine-grained regret guarantees and empirically outperforms AMB. In the non-UCB-based setting, we revisit the only known algorithm, AMB, and identify two key issues in its algorithm design and analysis: improper truncation in the Q-updates and violation of the martingale difference condition in its concentration argument. We propose a refined version of AMB that addresses these issues, establishing the first rigorous fine-grained gap-dependent regret for a non-UCB-based method, with experiments demonstrating improved performance over AMB.

[LG-77] Adapting Quantum Machine Learning for Energy Dissociation of Bonds

链接: https://arxiv.org/abs/2510.06563
作者: Swathi Chandrasekhar,Shiva Raj Pokhrel,Navneet Singh
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of bond dissociation energies (BDEs) underpins mechanistic insight and the rational design of molecules and materials. We present a systematic, reproducible benchmark comparing quantum and classical machine learning models for BDE prediction using a chemically curated feature set encompassing atomic properties (atomic numbers, hybridization), bond characteristics (bond order, type), and local environmental descriptors. Our quantum framework, implemented in Qiskit Aer on six qubits, employs ZZFeatureMap encodings with variational ansatz (RealAmplitudes) across multiple architectures Variational Quantum Regressors (VQR), Quantum Support Vector Regressors (QSVR), Quantum Neural Networks (QNN), Quantum Convolutional Neural Networks (QCNN), and Quantum Random Forests (QRF). These are rigorously benchmarked against strong classical baselines, including Support Vector Regression (SVR), Random Forests (RF), and Multi-Layer Perceptrons (MLP). Comprehensive evaluation spanning absolute and relative error metrics, threshold accuracies, and error distributions shows that top-performing quantum models (QCNN, QRF) match the predictive accuracy and robustness of classical ensembles and deep networks, particularly within the chemically prevalent mid-range BDE regime. These findings establish a transparent baseline for quantum-enhanced molecular property prediction and outline a practical foundation for advancing quantum computational chemistry toward near chemical accuracy.

[LG-78] Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy

链接: https://arxiv.org/abs/2510.06515
作者: Chiara Mignacco,Matthieu Jonckheere,Gilles Stoltz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online matching problems arise in many complex systems, from cloud services and online marketplaces to organ exchange networks, where timely, principled decisions are critical for maintaining high system performance. Traditional heuristics in these settings are simple and interpretable but typically tailored to specific operating regimes, which can lead to inefficiencies when conditions change. We propose a reinforcement learning (RL) approach that learns to orchestrate a set of such expert policies, leveraging their complementary strengths in a data-driven, adaptive manner. Building on the Adv2 framework (Jonckheere et al., 2024), our method combines expert decisions through advantage-based weight updates and extends naturally to settings where only estimated value functions are available. We establish both expectation and high-probability regret guarantees and derive a novel finite-time bias bound for temporal-difference learning, enabling reliable advantage estimation even under constant step size and non-stationary dynamics. To support scalability, we introduce a neural actor-critic architecture that generalizes across large state spaces while preserving interpretability. Simulations on stochastic matching models, including an organ exchange scenario, show that the orchestrated policy converges faster and yields higher system-level efficiency than both individual experts and conventional RL baselines. Our results highlight how structured, adaptive learning can improve the modeling and management of complex resource allocation and decision-making processes.
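
Schematically, the orchestration can be pictured as exponential weights over expert policies driven by advantage estimates. The sketch below illustrates that idea only; it is not the paper's exact Adv2 update.

```python
import numpy as np

class ExpertOrchestrator:
    """Schematic advantage-weighted mixture over expert matching policies."""

    def __init__(self, n_experts, lr=0.1):
        self.log_w = np.zeros(n_experts)
        self.lr = lr

    def act(self, state, experts, rng):
        """Sample an expert in proportion to its weight, follow its decision."""
        probs = np.exp(self.log_w - self.log_w.max())
        probs /= probs.sum()
        k = rng.choice(len(experts), p=probs)
        return k, experts[k](state)

    def update(self, advantages):
        """advantages[k]: estimated advantage of expert k's recommended action
        under the current (possibly TD-learned) value function."""
        self.log_w += self.lr * np.asarray(advantages)
```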

[LG-79] A General Constructive Upper Bound on Shallow Neural Nets Complexity

链接: https://arxiv.org/abs/2510.06372
作者: Frantisek Hakl,Vit Fojtik
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We provide an upper bound on the number of neurons required in a shallow neural network to approximate a continuous function on a compact set with a given accuracy. This method, inspired by a specific proof of the Stone-Weierstrass theorem, is constructive and more general than previous bounds of this character, as it applies to any continuous function on any compact set.

[LG-80] Diffusion-Guided Renormalization of Neural Systems via Tensor Networks

链接: https://arxiv.org/abs/2510.06361
作者: Nathan X. Kodama
类目: Neurons and Cognition (q-bio.NC); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: Reformatted version of Dissertation submitted for the Doctor of Philosophy in Systems and Control Engineering at Case Western Reserve University, 2025

点击查看摘要

Abstract:Far from equilibrium, neural systems self-organize across multiple scales. Exploiting multiscale self-organization in neuroscience and artificial intelligence requires a computational framework for modeling the effective non-equilibrium dynamics of stochastic neural trajectories. Non-equilibrium thermodynamics and representational geometry offer theoretical foundations, but we need scalable data-driven techniques for modeling collective properties of high-dimensional neural networks from partial subsampled observations. Renormalization is a coarse-graining technique central to studying emergent scaling properties of many-body and nonlinear dynamical systems. While widely applied in physics and machine learning, coarse-graining complex dynamical networks remains unsolved, affecting many computational sciences. Recent diffusion-based renormalization, inspired by quantum statistical mechanics, coarse-grains networks near entropy transitions marked by maximal changes in specific heat or information transmission. Here I explore diffusion-based renormalization of neural systems by generating symmetry-breaking representations across scales and offering scalable algorithms using tensor networks. Diffusion-guided renormalization bridges microscale and mesoscale dynamics of dissipative neural systems. For microscales, I developed a scalable graph inference algorithm for discovering community structure from subsampled neural activity. Using community-based node orderings, diffusion-guided renormalization generates renormalization group flow through metagraphs and joint probability functions. Towards mesoscales, diffusion-guided renormalization targets learning the effective non-equilibrium dynamics of dissipative neural trajectories occupying lower-dimensional subspaces, enabling coarse-to-fine control in systems neuroscience and artificial intelligence.

[LG-81] Mass Conservation on Rails – Rethinking Physics-Informed Learning of Ice Flow Vector Fields NEURIPS2025

链接: https://arxiv.org/abs/2510.06286
作者: Kim Bente,Roman Marchant,Fabio Ramos
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
*备注: Accepted at the Tackling Climate Change with Machine Learning Workshop at NeurIPS 2025. 9 pages, 4 figures

点击查看摘要

Abstract:To reliably project future sea level rise, ice sheet models require inputs that respect physics. Embedding physical principles like mass conservation into models that interpolate Antarctic ice flow vector fields from sparse noisy measurements not only promotes physical adherence but can also improve accuracy and robustness. While physics-informed neural networks (PINNs) impose physics as soft penalties, offering flexibility but no physical guarantees, we instead propose divergence-free neural networks (dfNNs), which enforce local mass conservation exactly via a vector calculus trick. Our comparison of dfNNs, PINNs, and unconstrained NNs on ice flux interpolation over Byrd Glacier suggests that “mass conservation on rails” yields more reliable estimates, and that directional guidance, a learning strategy leveraging continent-wide satellite velocity data, boosts performance across models.
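
One standard instance of the "vector calculus trick" in 2D parametrizes the velocity field as the rotated gradient of a learned stream function, which is divergence-free by construction (u = ∂ψ/∂y, v = -∂ψ/∂x, so ∂u/∂x + ∂v/∂y = 0 identically). A minimal PyTorch sketch follows; the paper's exact parametrization may differ.

```python
import torch
import torch.nn as nn

class StreamFunctionNet(nn.Module):
    """2D divergence-free field from a learned scalar stream function psi:
    u = dpsi/dy, v = -dpsi/dx, so div(u, v) = 0 holds exactly."""

    def __init__(self, hidden=64):
        super().__init__()
        self.psi = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xy):
        xy = xy.requires_grad_(True)
        psi = self.psi(xy).sum()  # rows are independent, so grads are per-point
        grads = torch.autograd.grad(psi, xy, create_graph=True)[0]
        u, v = grads[:, 1], -grads[:, 0]  # rotate the gradient by 90 degrees
        return torch.stack([u, v], dim=-1)
```

Unlike a PINN penalty, no training signal is spent enforcing mass conservation; it holds for every parameter setting.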

[LG-82] A Mixed-Methods Analysis of Repression and Mobilization in Bangladesh's July Revolution Using Machine Learning and Statistical Modeling

链接: https://arxiv.org/abs/2510.06264
作者: Md. Saiful Bari Siddiqui,Anupam Debashis Roy
类目: Applications (stat.AP); Computers and Society (cs.CY); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Submitted to Social Forces. Final version may vary from this preprint

点击查看摘要

Abstract:The 2024 July Revolution in Bangladesh represents a landmark event in the study of civil resistance. This study investigates the central paradox of the success of this student-led civilian uprising: how state violence, intended to quell dissent, ultimately fueled the movement's victory. We employ a mixed-methods approach. First, we develop a qualitative narrative of the conflict's timeline to generate specific, testable hypotheses. Then, using a disaggregated, event-level dataset, we employ a multi-method quantitative analysis to dissect the complex relationship between repression and mobilisation. We provide a framework to analyse explosive modern uprisings like the July Revolution. Initial pooled regression models highlight the crucial role of protest momentum in sustaining the movement. To isolate causal effects, we specify a Two-Way Fixed Effects panel model, which provides robust evidence for a direct and statistically significant local suppression backfire effect. Our Vector Autoregression (VAR) analysis provides clear visual evidence of an immediate, nationwide mobilisation in response to increased lethal violence. We further demonstrate that this effect was non-linear. A structural break analysis reveals that the backfire dynamic was statistically insignificant in the conflict's early phase but was triggered by the catalytic moral shock of the first wave of lethal violence and its visuals, which circulated around July 16th. A complementary machine learning analysis (XGBoost, out-of-sample R^2 = 0.65) corroborates this from a predictive standpoint, identifying “excessive force against protesters” as the single most dominant predictor of nationwide escalation. We conclude that the July Revolution was driven by a contingent, non-linear backfire, triggered by specific catalytic moral shocks and accelerated by the viral reaction to the visual spectacle of state brutality.
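
For readers unfamiliar with the two-way fixed effects specification, a schematic statsmodels version on a hypothetical district-day panel looks like this; the column names and data are illustrative.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical event-level panel: one row per district-day.
panel_df = pd.DataFrame({
    "district":       ["Dhaka", "Dhaka", "Sylhet", "Sylhet"] * 3,
    "date":           [d for d in ("07-14", "07-15", "07-16") for _ in range(4)],
    "lethal_force":   [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0],
    "protest_events": [2, 3, 1, 1, 4, 5, 2, 1, 6, 7, 3, 2],
})

# District effects absorb time-invariant local traits; date effects absorb
# nationwide shocks, so lethal_force picks up the local backfire effect.
# (In practice one would cluster standard errors by district.)
twfe = smf.ols("protest_events ~ lethal_force + C(district) + C(date)",
               data=panel_df).fit()
print(twfe.params["lethal_force"])
```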

[LG-83] Developing a Sequential Deep Learning Pipeline to Model Alaskan Permafrost Thaw Under Climate Change

链接: https://arxiv.org/abs/2510.06258
作者: Addina Rahaman
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 20 pages, 16 figures. The number of figures is tentative and will be reduced in the future

点击查看摘要

Abstract:Changing climate conditions threaten the natural permafrost thaw-freeze cycle, leading to year-round soil temperatures above 0°C. In Alaska, the warming of the topmost permafrost layer, known as the active layer, signals elevated greenhouse gas release due to high carbon storage. Accurate soil temperature prediction is therefore essential for risk mitigation and stability assessment; however, many existing approaches overlook the numerous factors driving soil thermal dynamics. This study presents a proof-of-concept latitude-based deep learning pipeline for modeling yearly soil temperatures across multiple depths. The framework employs dynamic reanalysis feature data from the ERA5-Land dataset, static geologic and lithological features, sliding-window sequences for seasonal context, a derived scenario signal feature for long-term climate forcing, and latitude band embeddings for spatial sensitivity. Five deep learning models were tested: a Temporal Convolutional Network (TCN), a Transformer, a 1-Dimensional Convolutional Long Short-Term Memory (Conv1DLSTM), a Gated Recurrent Unit (GRU), and a Bidirectional Long Short-Term Memory (BiLSTM). Results showed solid recognition of latitudinal and depth-wise temperature discrepancies, with the GRU performing best in sequential temperature pattern detection. Bias-corrected CMIP5 RCP data enabled recognition of sinusoidal temperature trends, though limited divergence between scenarios was observed. This study establishes an end-to-end framework for adopting deep learning in active layer temperature modeling, offering seasonal, spatial, and vertical temperature context without intrinsic restrictions on feature selection.
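
A minimal PyTorch sketch of the kind of GRU architecture described, with a latitude-band embedding concatenated to each timestep of a sliding window; dimensions and design details are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SoilTempGRU(nn.Module):
    """Schematic GRU over sliding windows of reanalysis features, predicting
    soil temperature at several depths; a latitude-band embedding is
    appended to every timestep for spatial sensitivity."""

    def __init__(self, n_features, n_depths, n_lat_bands, emb_dim=8, hidden=64):
        super().__init__()
        self.lat_emb = nn.Embedding(n_lat_bands, emb_dim)
        self.gru = nn.GRU(n_features + emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_depths)

    def forward(self, x, lat_band):
        # x: (batch, window, n_features); lat_band: (batch,)
        emb = self.lat_emb(lat_band).unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.gru(torch.cat([x, emb], dim=-1))
        return self.head(out[:, -1])  # temperature per depth at window end
```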

[LG-84] Toward Uncertainty-Aware and Generalizable Neural Decoding for Quantum LDPC Codes

链接: https://arxiv.org/abs/2510.06257
作者: Xiangjun Mi,Frank Mueller
类目: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum error correction (QEC) is essential for scalable quantum computing, yet decoding errors via conventional algorithms result in limited accuracy (i.e., suppression of logical errors) and high overheads, both of which can be alleviated by inference-based decoders. To date, such machine-learning (ML) decoders lack two key properties crucial for practical fault tolerance: reliable uncertainty quantification and robust generalization to previously unseen codes. To address this gap, we propose QuBA, a Bayesian graph neural decoder that integrates both dot-product and multi-head attention, enabling expressive error-pattern recognition alongside calibrated uncertainty estimates. Building on QuBA, we further develop SAGU (Sequential Aggregate Generalization under Uncertainty), a multi-code training framework with enhanced cross-domain robustness enabling decoding beyond the training set. Experiments on bivariate bicycle (BB) codes and their coprime variants demonstrate that (i) both QuBA and SAGU consistently outperform the classical baseline belief propagation (BP), achieving a reduction of on average one order of magnitude in logical error rate (LER), and up to two orders of magnitude under confident-decision bounds on the coprime BB code [[154, 6, 16]]; (ii) QuBA also surpasses state-of-the-art neural decoders, providing an advantage of roughly one order of magnitude (e.g., for the larger BB code [[756, 16, ≤34]]) even when considering conservative (safe) decision bounds; (iii) SAGU achieves decoding performance comparable to or even outperforming QuBA's domain-specific training approach.

[LG-85] Neu-RadBERT for Enhanced Diagnosis of Brain Injuries and Conditions

链接: https://arxiv.org/abs/2510.06232
作者: Manpreet Singh(1),Sean Macrae(2),Pierre-Marc Williams(2),Nicole Hung(2),Sabrina Araujo de Franca(1),Laurent Letourneau-Guillon(2,3),François-Martin Carrier(2,4),Bang Liu(5),Yiorgos Alexandros Cavayas(1,2,6) ((1) Équipe de Recherche en Soins Intensifs, Centre de recherche du Centre intégré universitaire de santé et de services sociaux du Nord-de-l’Île-de-Montréal (2) Faculté de Médecine, Université de Montréal (3) Department of Radiology, Centre Hospitalier de l’Université de Montréal (4) Department of Anesthesia, Centre Hospitalier de l’Université de Montréal (5) Applied Research in Computer Linguistics Laboratory, Department of Computer Science and Operations Research, Université de Montréal (6) Division of Critical Care Medicine, Department of Medicine, Hôpital du Sacré-Cœur de Montréal)
类目: Tissues and Organs (q-bio.TO); Machine Learning (cs.LG)
*备注: Both Manpreet Singh and Sean Macrae contributed equally and should be considered co-first authors. Corresponding author: Yiorgos Alexandros Cavayas

点击查看摘要

Abstract:Objective: We sought to develop a classification algorithm to extract diagnoses from free-text radiology reports of brain imaging performed in patients with acute respiratory failure (ARF) undergoing invasive mechanical ventilation. Methods: We developed and fine-tuned Neu-RadBERT, a BERT-based model, to classify unstructured radiology reports. We extracted all the brain imaging reports (computed tomography and magnetic resonance imaging) from MIMIC-IV database, performed in patients with ARF. Initial manual labelling was performed on a subset of reports for various brain abnormalities, followed by fine-tuning Neu-RadBERT using three strategies: 1) baseline RadBERT, 2) Neu-RadBERT with Masked Language Modeling (MLM) pretraining, and 3) Neu-RadBERT with MLM pretraining and oversampling to address data skewness. We compared the performance of this model to Llama-2-13B, an autoregressive LLM. Results: The Neu-RadBERT model, particularly with oversampling, demonstrated significant improvements in diagnostic accuracy compared to baseline RadBERT for brain abnormalities, achieving up to 98.0% accuracy for acute brain injuries. Llama-2-13B exhibited relatively lower performance, peaking at 67.5% binary classification accuracy. This result highlights potential limitations of current autoregressive LLMs for this specific classification task, though it remains possible that larger models or further fine-tuning could improve performance. Conclusion: Neu-RadBERT, enhanced through target domain pretraining and oversampling techniques, offered a robust tool for accurate and reliable diagnosis of neurological conditions from radiology reports. This study underscores the potential of transformer-based NLP models in automatically extracting diagnoses from free text reports with potential applications to both research and patient care.
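
A schematic version of the oversampling step used to address data skewness before fine-tuning; the DataFrame and labels are illustrative stand-ins for the labeled radiology reports.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical labeled report set: text column plus a binary finding label.
reports = pd.DataFrame({
    "text":  ["no acute findings"] * 8 + ["acute subdural hematoma"] * 2,
    "label": [0] * 8 + [1] * 2,
})

majority = reports[reports["label"] == 0]
minority = reports[reports["label"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
print(balanced["label"].value_counts())  # classes are now balanced 8 vs 8
```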

[LG-86] Layerwise Federated Learning for Heterogeneous Quantum Clients using Quorus

链接: https://arxiv.org/abs/2510.06228
作者: Jason Han,Nicholas S. DiBrita,Daniel Leeds,Jianqiang Li,Jason Ludmir,Tirthak Patel
类目: Quantum Physics (quant-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum machine learning (QML) holds the promise to solve classically intractable problems, but, as critical data can be fragmented across private clients, there is a need for distributed QML in a quantum federated learning (QFL) format. However, the quantum computers that different clients have access to can be error-prone and have heterogeneous error properties, requiring them to run circuits of different depths. We propose a novel solution to this QFL problem, Quorus, which utilizes a layerwise loss function for effective training of varying-depth quantum models, allowing clients to choose models for high-fidelity output based on their individual capacity. Quorus also presents various model designs based on client needs that optimize for shot budget, qubit count, mid-circuit measurement, and optimization space. Our simulation and real-hardware results show the promise of Quorus: it increases the magnitude of gradients of higher-depth clients and improves testing accuracy by 12.4% on average over the state-of-the-art.
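
The layerwise loss idea can be illustrated with a classical PyTorch analogue: attach a readout head after every block and sum the per-depth losses, so a client that can only run the first k blocks still trains a usable k-depth model. This is a schematic analogy, not Quorus's quantum implementation.

```python
import torch.nn as nn

def layerwise_loss(blocks, heads, x, y, criterion=nn.CrossEntropyLoss()):
    """Sum the prediction loss at every depth so that each truncated prefix
    of the model is trained to produce a usable output on its own."""
    loss, h = 0.0, x
    for block, head in zip(blocks, heads):
        h = block(h)                      # deepen the representation
        loss = loss + criterion(head(h), y)  # readout after this block
    return loss
```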

信息检索

[IR-0] Ethical AI prompt recommendations in large language models using collaborative filtering

链接: https://arxiv.org/abs/2510.06924
作者: Jordan Nelson,Almas Baimagambetov,Konstantinos Avgerinakis,Nikolaos Polatidis
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted by the International Journal of Parallel, Emergent and Distributed Systems (Taylor & Francis) and has an assigned DOI. We have chosen to make it open access under CC BY. The article is not yet available on the publisher's website. The DOI is: this http URL

点击查看摘要

Abstract:As large language models (LLMs) shape AI development, ensuring ethical prompt recommendations is crucial. LLMs offer innovation but risk bias, fairness issues, and accountability concerns. Traditional oversight methods struggle with scalability, necessitating dynamic solutions. This paper proposes using collaborative filtering, a technique from recommendation systems, to enhance ethical prompt selection. By leveraging user interactions, it promotes ethical guidelines while reducing bias. Contributions include a synthetic dataset for prompt recommendations and the application of collaborative filtering. The work also tackles challenges in ethical AI, such as bias mitigation, transparency, and preventing unethical prompt engineering.
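
A minimal user-based collaborative filtering sketch of the kind the paper builds on, over a hypothetical user-by-prompt matrix of ethics ratings; the scoring scheme is illustrative.

```python
import numpy as np

def recommend_prompts(ratings, user, k=3):
    """Schematic user-based collaborative filtering over an (n_users x
    n_prompts) matrix of ethics ratings, where 0 marks an unrated prompt."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True) + 1e-9
    sims = (ratings / norms) @ (ratings[user] / norms[user])  # cosine sims
    sims[user] = 0.0                      # exclude the user themself
    scores = sims @ ratings               # similarity-weighted votes
    scores[ratings[user] > 0] = -np.inf   # hide already-rated prompts
    return np.argsort(scores)[::-1][:k]   # top-k recommended ethical prompts
```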

[IR-1] Reproducing and Extending Causal Insights Into Term Frequency Computation in Neural Rankers SIGIR

链接: https://arxiv.org/abs/2510.06728
作者: Cile van Marken,Roxana Petcu
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 6 figures, submitted to SIGIR-AP

点击查看摘要

Abstract:Neural ranking models have shown outstanding performance across a variety of tasks, such as document retrieval, re-ranking, question answering and conversational retrieval. However, the inner decision process of these models remains largely unclear, especially as models increase in size. Most interpretability approaches, such as probing, focus on correlational insights rather than establishing causal relationships. The paper ‘Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models’ by Chen et al. addresses this gap by introducing a framework for activation patching - a causal interpretability method - in the information retrieval domain, offering insights into how neural retrieval models compute document relevance. The study demonstrates that neural ranking models not only capture term-frequency information, but also that these representations can be localized to specific components of the model, such as individual attention heads or layers. This paper aims to reproduce the findings by Chen et al. and to further explore the presence of pre-defined retrieval axioms in neural IR models. We validate the main claims made by Chen et al., and extend the framework to include an additional term-frequency axiom, which states that the impact of increasing query term frequency on document ranking diminishes as the frequency becomes higher. We successfully identify a group of attention heads that encode this axiom and analyze their behavior to give insight into the inner decision-making process of neural ranking models.
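
Schematically, activation patching caches an activation from a clean run and swaps it into a corrupted run via a forward hook, then measures how much of the clean relevance score is restored. The tensor layout below is an illustrative assumption; real rankers expose head activations differently.

```python
def patch_head_output(model, attn_module, head, clean_act, corrupted_inputs):
    """Swap one attention head's activation (cached from the clean run) into
    the corrupted forward pass, assuming the hooked module emits a tensor of
    shape (batch, seq_len, n_heads, head_dim)."""
    def hook(_module, _inputs, output):
        output[:, :, head, :] = clean_act[:, :, head, :]  # overwrite one head
        return output

    handle = attn_module.register_forward_hook(hook)
    try:
        return model(**corrupted_inputs)  # scores with one head patched in
    finally:
        handle.remove()  # always detach the hook, even on error
```

Heads whose patched runs recover most of the clean relevance score are the ones causally implicated in, e.g., term-frequency computation.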

[IR-2] Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks SIGIR

链接: https://arxiv.org/abs/2510.06658
作者: Jiaman He,Zikang Leng,Dana McKay,Damiano Spina,Johanne R. Trippas
类目: Information Retrieval (cs.IR)
*备注: Accepted at SIGIR-AP 2025

点击查看摘要

Abstract:Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated “ground truth” using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions – by both humans and LLMs – can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach LLMs as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's α, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This evaluation method tests whether an LLM can blend into a group of human annotators without being distinguishable. We apply this approach to two datasets – MovieLens 100K and PolitiFact – and find that the LLM is statistically indistinguishable from a human annotator in the former (p = 0.004), but not in the latter (p = 0.155), highlighting task-dependent differences. The method also enables early evaluation on a small sample of human data to inform whether LLMs are suitable for large-scale annotation in a given application.
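
A bootstrap confidence-interval analogue of the paper's procedure, concluding equivalence when the 90% CI of the agreement difference falls inside a margin; the pooling scheme and margin are illustrative assumptions, and `krippendorff` refers to the PyPI package of that name.

```python
import numpy as np
import krippendorff  # pip install krippendorff

def tost_equivalence(human_rows, llm_row, margin=0.05, n_boot=2000, seed=0):
    """human_rows: (n_annotators, n_items) labels; llm_row: (n_items,) labels.
    Paired-bootstrap the drop in Krippendorff's alpha caused by swapping the
    LLM in for one human, then check both one-sided bounds against margin."""
    rng = np.random.default_rng(seed)
    n_items = human_rows.shape[1]
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_items, n_items)  # resample items with replacement
        a_human = krippendorff.alpha(reliability_data=human_rows[:, idx],
                                     level_of_measurement="nominal")
        pool = np.vstack([human_rows[:-1, idx], llm_row[idx]])
        a_mixed = krippendorff.alpha(reliability_data=pool,
                                     level_of_measurement="nominal")
        diffs.append(a_human - a_mixed)
    lo, hi = np.quantile(diffs, [0.05, 0.95])  # 90% CI, matching TOST at 5%
    return -margin < lo and hi < margin        # equivalent if CI inside margin
```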

[IR-3] LLM-Powered Nuanced Video Attribute Annotation for Enhanced Recommendations RECSYS2025

链接: https://arxiv.org/abs/2510.06657
作者: Boyuan Long,Yueqi Wang,Hiloni Mehta,Mick Zomnir,Omkar Pathak,Changping Meng,Ruolin Jia,Yajun Peng,Dapeng Hong,Xia Wu,Mingyan Gao,Onkar Dalal,Ningren Han
类目: Information Retrieval (cs.IR)
*备注: RecSys 2025 Industry Track

点击查看摘要

Abstract:This paper presents a case study on deploying Large Language Models (LLMs) as an advanced “annotation” mechanism to achieve nuanced content understanding (e.g., discerning content “vibe”) at scale within a large-scale industrial short-form video recommendation system. Traditional machine learning classifiers for content understanding face protracted development cycles and a lack of deep, nuanced comprehension. The “LLM-as-annotators” approach addresses these by significantly shortening development times and enabling the annotation of subtle attributes. This work details an end-to-end workflow encompassing: (1) iterative definition and robust evaluation of target attributes, refined by offline metrics and online A/B testing; (2) scalable offline bulk annotation of video corpora using LLMs with multimodal features, optimized inference, and knowledge distillation for broad application; and (3) integration of these rich annotations into the online recommendation serving system, for example, through personalized restrict retrieval. Experimental results demonstrate the efficacy of this approach, with LLMs outperforming human raters in offline annotation quality for nuanced attributes and yielding significant improvements in user participation and satisfied consumption in online A/B tests. The study provides insights into designing and scaling production-level LLM pipelines for rich content evaluation, highlighting the adaptability and benefits of LLM-generated nuanced understanding for enhancing content discovery, user satisfaction, and the overall effectiveness of modern recommendation systems.

附件下载

点击下载今日全部论文列表