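This post lists the latest papers retrieved from arXiv.org on 2025-10-16. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.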
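Overview (2025-10-16)
543 papers were updated today, including:
- Natural language processing: 98 papers (Computation and Language (cs.CL))
- Artificial intelligence: 138 papers (Artificial Intelligence (cs.AI))
- Computer vision: 111 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 158 papers (Machine Learning (cs.LG))
Natural Language Processing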
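[NLP-0] Generative Universal Verifier as Multimodal Meta-Reasoner
[Quick read]: This paper addresses the lack of reliable visual verification in current multimodal large models during visual reasoning: generated results are diverse, but their visual consistency and accuracy are hard to guarantee, which limits performance on complex tasks. The key to the solution is the Generative Universal Verifier. The authors build ViVerBench, a benchmark covering 16 categories of critical tasks, identify three atomic capabilities in visual verification and show that they reinforce one another, and train OmniVerifier-7B for unified visual verification across modalities. They further introduce OmniVerifier-TTS (Test-Time Scaling), a sequential inference paradigm that performs iterative fine-grained optimization in image generation and editing, improving generation quality and controllability. The method outperforms existing parallel test-time scaling strategies such as Best-of-N, a step toward more trustworthy and controllable multimodal reasoning systems.
Link: https://arxiv.org/abs/2510.13804
Authors: Xinchen Zhang,Xiaoying Zhang,Youbin Wu,Yanbin Cao,Renrui Zhang,Ruihang Chu,Ling Yang,Yujiu Yang
Affiliations: Tsinghua University; ByteDance Seed; Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: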
Abstract:We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification and achieves notable gains on ViVerBench(+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench(+3.7), and GenEval++(+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.
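To make the contrast between parallel and sequential test-time scaling concrete, here is a minimal Python sketch. It is not the OmniVerifier-TTS implementation: the `toy_verify`, `toy_edit`, and `toy_score` helpers, and the use of a set of attributes as a stand-in for an image, are assumptions made purely for illustration.

```python
# Hypothetical stand-in: an "image" is just the set of prompt attributes it satisfies.
REQUIRED = frozenset({"red cube", "blue sphere", "left of"})

def best_of_n(candidates, score):
    """Parallel scaling: score N independent candidates and keep the best one."""
    return max(candidates, key=score)

def sequential_refine(initial, verify, edit, max_steps=4):
    """Sequential scaling: verify the current result, edit it using the feedback, repeat."""
    current = initial
    for _ in range(max_steps):
        ok, feedback = verify(current)
        if ok:
            break
        current = edit(current, feedback)
    return current

def toy_verify(image):
    """Toy verifier: report which required attributes are still missing."""
    missing = sorted(REQUIRED - image)
    return (not missing, missing)

def toy_edit(image, feedback):
    """Toy editor: fix the first problem the verifier reported."""
    return image | {feedback[0]}

def toy_score(image):
    """Toy scalar score used by Best-of-N."""
    return len(REQUIRED & image)

refined = sequential_refine(frozenset({"red cube"}), toy_verify, toy_edit)
picked = best_of_n(
    [frozenset({"red cube"}), frozenset({"red cube", "blue sphere"})], toy_score
)
print(sorted(refined), sorted(picked))
```

In this toy setting the sequential loop can repair a candidate step by step, whereas Best-of-N can only return the best of what was sampled; whether that carries over to real models is the empirical question the paper studies.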
[NLP-1] BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning
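[Quick read]: This paper addresses the high latency and heavy cognitive load on the model that overly long contexts cause when retrieval-augmented generation (RAG) handles complex multi-hop question answering. The key to the solution is BRIEF-Pro, a universal, lightweight compressor that abstractively distills query-relevant evidence from retrieved documents into a concise summary for seamless integration into in-context RAG. BRIEF-Pro is trained on seed data with relatively short contexts (fewer than 1k words) to compress extended contexts of more than 10k words, and it lets users control summary length by specifying the desired number of sentences. Experiments on four open-domain multi-hop QA datasets show improved performance across small, large, and proprietary language models. With a 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9x compression while requiring only 23% of its computational overhead.
Link: https://arxiv.org/abs/2510.13799
Authors: Jia-Chen Gu,Junyi Zhang,Di Wu,Yuankai Li,Kai-Wei Chang,Nanyun Peng
Affiliations: University of California, Los Angeles
Categories: Computation and Language (cs.CL)
Comments: Code and data: this https URL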
Abstract:As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua’s 9x, while requiring only 23% of its computational overhead.
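[NLP-2] Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
[Quick read]: This paper addresses the sharp rise in memory and compute cost in long-context reasoning caused by the linear growth of the Transformer key-value (KV) cache. The key to the solution is to periodically compress the generation KV cache: a learned, special-purpose token aggregates information from previously generated tokens, and the compressed entries are then evicted from the cache. The model is trained with a modified joint distillation and reinforcement learning (RL) framework to compress the cache while preserving reasoning accuracy, achieving a better memory-accuracy Pareto frontier.
Link: https://arxiv.org/abs/2510.13797
Authors: Giovanni Monea,Yair Feldman,Shankar Padmanabhan,Kianté Brantley,Yoav Artzi
Affiliations: Cornell University; Harvard University
Categories: Computation and Language (cs.CL)
Comments: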
Abstract:The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
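The memory effect of periodic compression can be illustrated with a small counting sketch. This is only a bookkeeping toy, not the paper's method: the block size of 16 and the assumption that each compressed block leaves exactly one beacon entry in the cache are illustrative choices, and `peak_cache_entries` is a hypothetical helper.

```python
def peak_cache_entries(num_tokens, block=None):
    """Peak number of KV entries held while generating `num_tokens` tokens.

    block=None  -> no compression: every generated token keeps its KV entry.
    block=k     -> assumed scheme: after every k tokens, one beacon entry
                   replaces the k entries of that block, which are evicted.
    """
    beacons = 0  # compressed summaries kept in the cache
    live = 0     # uncompressed entries of the block currently being generated
    peak = 0
    for _ in range(num_tokens):
        live += 1
        peak = max(peak, beacons + live)
        if block is not None and live == block:
            beacons += 1  # the beacon entry summarizes the finished block
            live = 0      # evict the entries that were just compressed
    return peak

print(peak_cache_entries(1024))            # 1024: linear growth
print(peak_cache_entries(1024, block=16))  # 79: 63 beacons + 16 live entries
```

Under these assumptions the peak grows with the number of blocks rather than the number of tokens, which is the kind of saving the memory-accuracy trade-off in the abstract refers to.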
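[NLP-3] The Mechanistic Emergence of Symbol Grounding in Language Models
[Quick read]: This paper examines how symbol grounding emerges spontaneously in (vision-)language models trained at scale, where its specific loci and driving mechanisms have remained unclear. The key to the solution is a controlled evaluation framework that uses mechanistic and causal analysis to trace how symbol grounding arises within the model's internal computations. The study finds that grounding concentrates in middle-layer computations and is implemented by attention heads that aggregate the environmental ground to support the prediction of linguistic forms. The mechanism replicates in Transformers and state-space models but is not observed in unidirectional LSTMs.
Link: https://arxiv.org/abs/2510.13796
Authors: Shuyu Wu,Ziqiao Ma,Xiaoxi Luo,Yidong Huang,Josue Torres-Fonseca,Freda Shi,Joyce Chai
Affiliations: University of Michigan; University of Waterloo; Vector Institute; UNC at Chapel Hill
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: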
Abstract:Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.
[NLP-4] Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation EMNLP2025
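[Quick read]: This paper addresses the accuracy of confidence estimation in retrieval-augmented generation (RAG) systems, particularly in high-stakes domains such as finance, where the correctness of model outputs is critical. The key to the solution is to use raw feed-forward network (FFN) activations as auto-regressive signals, which avoids the information loss that token logits and probabilities incur after projection and softmax normalization. Confidence prediction is modeled as a sequence classification task, and training is regularized with a Huber loss term to improve robustness to noisy labels. Experiments show that the method maintains high accuracy under strict latency constraints; on Llama 3.1 8B, using activations from only the 16th layer preserves accuracy while reducing response latency.
Link: https://arxiv.org/abs/2510.13750
Authors: Zhiqi Huang,Vivek Datla,Chenyang Zhu,Alfy Samuel,Daben Liu,Anoop Kumar,Ritesh Soni
Affiliations: Capital One
Categories: Computation and Language (cs.CL)
Comments: UncertaiNLP at EMNLP 2025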
Abstract:We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task, and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.
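For readers unfamiliar with the Huber loss mentioned above, the sketch below shows its standard definition together with a simple abstention rule. How the paper combines the Huber term with its classification objective is not stated in the abstract, so nothing here should be read as the authors' training code; `answer_or_abstain` and the 0.8 threshold are hypothetical.

```python
def huber(pred, target, delta=1.0):
    """Standard Huber loss: quadratic for small residuals, linear for large ones.

    The linear tail is what makes it less sensitive to noisy labels than squared error.
    """
    r = abs(pred - target)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def answer_or_abstain(answer, confidence, threshold=0.8):
    """Hypothetical abstention rule: return the answer only when confidence is high enough."""
    return answer if confidence >= threshold else None

print(huber(0.5, 0.0))  # 0.125 (quadratic region)
print(huber(3.0, 0.0))  # 2.5   (linear region)
print(answer_or_abstain("Your limit is listed in the app.", 0.42))  # None
```

The threshold trades coverage for accuracy: raising it makes the system abstain more often, which matches the setting where a wrong answer costs more than no answer.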
[NLP-5] Assessing Web Search Credibility and Response Groundedness in Chat Assistants
[Quick read]: This paper addresses the risk that chat assistants with integrated web search amplify misinformation by citing low-credibility sources. The key to the solution is a novel evaluation methodology that systematically measures source credibility and the groundedness of responses with respect to cited sources when assistants retrieve and cite external material. Using 100 claims across five misinformation-prone topics, the authors evaluate GPT-4o, GPT-5, Perplexity, and Qwen Chat and reveal differences in their fact-checking behavior.
Link: https://arxiv.org/abs/2510.13749
Authors: Ivan Vykopal,Matúš Pikuliak,Simon Ostermann,Marián Šimko
Affiliations: Brno University of Technology; Kempelen Institute of Intelligent Technologies; German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN)
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants’ web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credibility sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.
[NLP-6] Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
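[Quick read]: This paper addresses a key challenge in proof generation by large language models (LLMs): how to verify the correctness of each reasoning step. Because competitions such as IMO 2025 require every step to be both correct and sufficiently supported, training LLM-based reasoners depends on strong step-level verifiers. The key contribution is Hard2Verify, a human-annotated step-level verification benchmark built on responses generated by frontier LLMs and designed to rigorously assess verifiers on open-ended, highly challenging math problems. Produced with over 500 hours of human labor, it asks verifiers to provide step-level annotations or identify the first error in a response, and it supports analysis of the gap between open-source and closed-source verifiers, the effect of scaling verifier compute, and self-verification.
Link: https://arxiv.org/abs/2510.13744
Authors: Shrey Pandit,Austin Xu,Xuan-Phi Nguyen,Yifei Ming,Caiming Xiong,Shafiq Joty
Affiliations: Salesforce AI Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 8 figures, 5 tables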
Abstract:Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently supported. To train LLM-based reasoners in such challenging, open-ended settings, strong verifiers capable of catching step-level mistakes are necessary prerequisites. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: Verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed source models. We subsequently analyze what drives poor performance in step-level verification, the impacts of scaling verifier compute, as well as fundamental questions such as self-verification and verification-generation dynamics.
[NLP-7] GAPS: A Clinically Grounded Automated Benchmark for Evaluating AI Clinicians
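[Quick read]: This paper addresses the limitations of current benchmarks for AI clinician systems, which mostly rely on multiple-choice exams or manual rubrics and fail to capture cognitive depth, answer completeness, robustness, and safety. The authors propose the GAPS framework (Grounding, Adequacy, Perturbation, Safety) and develop an end-to-end automated, guideline-anchored evaluation pipeline. Its key innovation is to assemble an evidence neighborhood, create dual graph and tree representations, and have a DeepResearch agent mimic GRADE-consistent, PICO-driven evidence review in a ReAct loop, automatically generating high-quality questions and rubrics. Scoring is performed by an ensemble of large language model (LLM) judges, yielding a reproducible and scalable evaluation method for AI clinician systems that stays close to clinical practice.
Link: https://arxiv.org/abs/2510.13734
Authors: Xiuyuan Chen,Tao Sun,Dexin Su,Ailing Yu,Junwei Liu,Zhe Chen,Gangzeng Jin,Xin Wang,Jingnan Liu,Hansong Xiao,Hualei Zhou,Dongjie Tao,Chunxiao Guo,Minghui Yang,Yuan Xia,Jing Zhao,Qianrui Fan,Yanyun Wang,Shuai Zhen,Kezhong Chen,Jun Wang,Zewen Sun,Heng Zhao,Tian Guan,Shaodong Wang,Geyun Chang,Jiaming Deng,Hongchengcheng Chen,Kexin Feng,Ruzhen Li,Jiayi Geng,Changtai Zhao,Jun Wang,Guihu Lin,Peihao Li,Liqi Liu,Peng Wei,Jian Wang,Jinjie Gu,Ping Wang,Fan Yang
Affiliations: Ant Group; Peking University; Beijing Medical University
Categories: Computation and Language (cs.CL)
Comments: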
Abstract:Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically-grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.
[NLP-8] NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
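[Quick read]: This paper addresses the difficulty current multimodal foundation models have in balancing understanding and generation, in particular the limited flexibility of cross-modal generation under autoregressive architectures and the redundancy of task-decoupled designs. The core solution is NExT-OMNI, a unified modeling framework based on the discrete flow paradigm. By leveraging metric-induced probability paths and kinetic optimal velocities, it supports any-to-any understanding and generation, and it replaces task-decoupled structures with concise unified representations, improving efficiency and generalization in complex scenarios such as multi-turn interaction and cross-modal retrieval.
Link: https://arxiv.org/abs/2510.13721
Authors: Run Luo,Xiaobo Xia,Lu Wang,Longze Chen,Renke Shan,Jing Luo,Min Yang,Tat-Seng Chua
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; NExT++ Research Center; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: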
Abstract:Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
[NLP-9] How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study EMNLP2025
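[Quick read]: This paper addresses the limited robustness of current machine-generated text detectors in practice: when large language models (LLMs) use different sampling-based decoding strategies, detector performance drops sharply or fails outright. The key to the solution is a systematic evaluation of how sampling-based decoding parameters (such as temperature and top-p, or nucleus, sampling) affect detectability. Even minor parameter adjustments can drive a detector's AUROC from near-perfect levels down to 1% in some settings, which exposes critical blind spots in current detection methods and motivates more comprehensive evaluation protocols.
Link: https://arxiv.org/abs/2510.13681
Authors: Matthieu Dubois,François Yvon,Pablo Piantanida
Affiliations: Sorbonne Université, CNRS, ISIR; CNRS, International Laboratory on Learning Systems; Mila - Québec AI Institute; CentraleSupélec, Université Paris-Saclay
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings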
Abstract:As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model’s (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework this https URL
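The decoding parameters named in the abstract reshape the next-token distribution before sampling. The following sketch shows the textbook versions of temperature scaling and top-p (nucleus) filtering on a toy distribution; the function names and example values are illustrative assumptions, not code from the paper's released framework.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature < 1 sharpens the distribution; temperature > 1 flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.9)  # the 0.05 tail token is dropped
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
base = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
print(filtered, sharp[0] > base[0])
```

Both operations change the (sub)word-level statistics of the generated text, which is the kind of variation the study finds detectors to be sensitive to.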
[NLP-10] Closing the Gap Between Text and Speech Understanding in LLMs
[Quick read]: This paper addresses the text-speech understanding gap: speech-adapted large language models (LLMs) perform markedly worse on language understanding tasks than their text-based counterparts. The gap is driven by two factors: (i) forgetting of text capabilities during speech adaptation, and (ii) cross-modal misalignment between speech and text. To close it, the authors propose SALAD, a sample-efficient alignment method based on active selection and cross-modal distillation, which combines targeted synthetic data with cross-modal distillation to improve alignment and mitigate forgetting while reducing reliance on speech data. On 3B and 7B LLMs, SALAD achieves performance competitive with a strong open-weight model while training on over an order of magnitude less speech data from public corpora.
Link: https://arxiv.org/abs/2510.13632
Authors: Santiago Cuervo,Skyler Seto,Maureen de Seyssel,Richard He Bai,Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly,Zakaria Aldeneh
Affiliations: Université de Toulon; Aix Marseille Université; CNRS; LIS; Apple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
[NLP-11] LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
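[Quick read]: This paper addresses the lack of robustness hidden behind the seemingly high success rates of current Visual-Language-Action (VLA) models on robotic manipulation tasks. The authors introduce controlled perturbations along seven dimensions (object layout, camera viewpoint, robot initial state, language instructions, lighting conditions, background textures, and sensor noise) and perform a systematic vulnerability analysis of several state-of-the-art VLA models, revealing that performance degrades sharply under modest perturbations. The key to the solution is a comprehensive, structured perturbation evaluation framework that exposes where models may fail under realistic variation, encouraging more reliable evaluation practices rather than judging capability from conventional benchmark scores alone.
Link: https://arxiv.org/abs/2510.13626
Authors: Senyu Fei,Siyin Wang,Junhao Shi,Zihao Dai,Jikun Cai,Pengfang Qian,Li Ji,Xinzhe He,Shiduo Zhang,Zhaoye Fei,Jinlan Fu,Jingjing Gong,Xipeng Qiu
Affiliations: Fudan University; Tongji University; Shanghai Innovation Institute; National University of Singapore
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: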
Abstract:Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
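[NLP-12] Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses
[Quick read]: This paper addresses the low accuracy of automated ICD-10-GM and ICD-O-3 coding of German tumor diagnosis texts, especially with privacy-friendly open-weight large language models (LLMs). The core solution is to build a large-scale instruction dataset of question-answer pairs from public medical coding catalogues (ICD-10-GM, ICD-O-3, and OPS) and to fine-tune several open-weight LLMs with 7-70B parameters. Coding accuracy improves substantially: exact ICD-10-GM accuracy rises from 1.4-24% to 41-58% and partial accuracy (three-character codes) from 31-74% to 73-83%, malformed code outputs are eliminated, and tumor-diagnosis recognition reaches 99%. The key is using structured medical terminology resources to generate high-quality instruction data, which narrows the gap between small and large models and makes automated coding of medical documentation more practical while preserving privacy.
Link: https://arxiv.org/abs/2510.13624
Authors: Stefan Lenz,Lakisha Ortiz Rosario,Georg Vollmar,Arsenij Ustjanzew,Fatma Alickovic,Thomas Kindler,Torsten Panholzer
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures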
Abstract:Accurate coding of tumor diagnoses with ICD-10-GM and ICD-O-3 is essential for structured cancer documentation in Germany. Smaller open-weight LLMs are appealing for privacy-preserving automation but often struggle with coding accuracy in German-language contexts. This study investigates whether instruction-based fine-tuning on public datasets improves the coding accuracy of open-weight LLMs for German tumor diagnosis texts. The evaluation uses coded diagnoses from the local tumor documentation system as test data. In a systematic data quality assessment, the upper limit for ICD-10 coding performance was estimated at 60-79% for exact and 81-94% for partial (three-character codes only) derivation. As training data, over 500,000 question-answer pairs were created based on the ICD-10-GM, ICD-O-3, and OPS catalogues. Eight open-weight models from the Qwen, Llama, and Mistral families (7-70 B parameters) were fine-tuned. ICD-10-GM accuracy rose from 1.4-24% to 41-58%, and partial accuracy from 31-74% to 73-83%. The accuracy of ICD-O-3 topography coding also improved but started and remained considerably lower with an exact accuracy of 22-40% and a partial accuracy of 56-67% after fine-tuning. Malformed code outputs dropped to 0% for all models. Tumor-diagnosis recognition reached 99%. Accuracy correlated positively with model size, but gaps between small and large models narrowed after fine-tuning. The reasoning mode in Qwen3 generally yielded a lower performance than fine-tuning and was over 100 times slower. Our findings highlight the potential of leveraging public catalogues to build instruction datasets that improve LLMs in medical documentation tasks. The complete training dataset and the best-performing checkpoints of the fine-tuned models are available from this https URL.
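The distinction between exact and partial (three-character) accuracy can be made concrete with a short sketch. The `coding_accuracy` helper and the sample codes are illustrative assumptions and are not taken from the study's evaluation data or code.

```python
def coding_accuracy(predicted, gold):
    """Return (exact, partial) accuracy for paired lists of ICD-10 codes.

    exact:   the full code matches, e.g. "C50.4" == "C50.4"
    partial: only the first three characters must match, e.g. "C34.1" vs "C34.9"
    """
    pairs = list(zip(predicted, gold))
    exact = sum(p == g for p, g in pairs)
    partial = sum(p[:3] == g[:3] for p, g in pairs)
    return exact / len(pairs), partial / len(pairs)

# Illustrative predictions and reference codes (not from the paper).
predicted = ["C50.4", "C34.1", "C18.7", "C61"]
gold = ["C50.4", "C34.9", "C20", "C61"]
exact_acc, partial_acc = coding_accuracy(predicted, gold)
print(exact_acc, partial_acc)  # 0.5 0.75
```

Partial accuracy is always at least as high as exact accuracy, which is consistent with the ranges reported in the abstract.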
[NLP-13] MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
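[Quick read]: This paper addresses the limitations of large language models (LLMs) in temporal understanding, especially on complex temporal reasoning involving multiple entities, compound temporal operators, and evolving event sequences. Existing methods based on Temporal Knowledge Graphs (TKGs) still face four challenges: maintaining temporal faithfulness in multi-hop reasoning, synchronizing multiple entities in time, adapting retrieval to diverse temporal operators, and reusing prior reasoning experience for stability and efficiency. The key to the solution is the MemoTime framework, which has three core parts: structured grounding that decomposes questions into a hierarchical Tree of Time, enabling operator-aware reasoning that enforces monotonic timestamps and unified temporal bounds; a dynamic evidence retrieval layer that adaptively selects operator-specific retrieval strategies; and a self-evolving experience memory that stores verified reasoning traces, toolkit decisions, and sub-question embeddings for cross-type reuse. The method improves LLM performance on multi-entity, multi-step temporal QA and lets smaller models approach the reasoning performance of GPT-4-Turbo.
Link: https://arxiv.org/abs/2510.13614
Authors: Xingyu Tan,Xiaoyang Wang,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang
Affiliations: University of New South Wales; Data61, CSIRO
Categories: Computation and Language (cs.CL)
Comments: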
Abstract:Large Language Models (LLMs) have achieved impressive reasoning abilities, but struggle with temporal understanding, especially when questions involve multiple entities, compound operators, and evolving event sequences. Temporal Knowledge Graphs (TKGs), which capture vast amounts of temporal facts in a structured format, offer a reliable source for temporal reasoning. However, existing TKG-based LLM reasoning methods still struggle with four major challenges: maintaining temporal faithfulness in multi-hop reasoning, achieving multi-entity temporal synchronization, adapting retrieval to diverse temporal operators, and reusing prior reasoning experience for stability and efficiency. To address these issues, we propose MemoTime, a memory-augmented temporal knowledge graph framework that enhances LLM reasoning through structured grounding, recursive reasoning, and continual experience learning. MemoTime decomposes complex temporal questions into a hierarchical Tree of Time, enabling operator-aware reasoning that enforces monotonic timestamps and co-constrains multiple entities under unified temporal bounds. A dynamic evidence retrieval layer adaptively selects operator-specific retrieval strategies, while a self-evolving experience memory stores verified reasoning traces, toolkit decisions, and sub-question embeddings for cross-type reuse. Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%. Furthermore, MemoTime enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
[NLP-14] NOSA: Native and Offloadable Sparse Attention
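[Quick read]: This paper addresses a decoding-efficiency bottleneck of large language models (LLMs) in long-context processing: although trainable sparse attention greatly reduces memory accesses while preserving task performance, the key-value (KV) cache size is not reduced, which limits on-GPU batch sizes and decoding throughput. The key to the solution is the NOSA framework, which introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components. This reduces the transfer of KV pairs between CPU and GPU without changing the underlying attention computation, natively supporting KV cache offloading and achieving up to a 2.3x improvement in decoding throughput with near-lossless performance.
Link: https://arxiv.org/abs/2510.13602
Authors: Yuxiang Huang,Chaojun Xiao,Xu Han,Zhiyuan Liu
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint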
Abstract:Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
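One way to picture why a query-agnostic component improves locality is the toy selection rule below. It is a loose illustration only: the abstract does not describe how NOSA scores or combines the two components, so `select_tokens`, the score lists, and the budgets are all hypothetical.

```python
def select_tokens(query_scores, static_scores, k_query, k_static):
    """Pick token indices to attend to at one decoding step.

    Assumed rule: take the top-k_static tokens by a query-independent score,
    then fill the rest of the budget with the top-k_query remaining tokens
    by a query-dependent score.
    """
    n = len(static_scores)
    static = sorted(range(n), key=lambda i: static_scores[i], reverse=True)[:k_static]
    chosen = set(static)
    rest = [i for i in range(n) if i not in chosen]
    dynamic = sorted(rest, key=lambda i: query_scores[i], reverse=True)[:k_query]
    return chosen | set(dynamic)

static_scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4]  # does not change between steps
step1 = select_tokens([0.0, 0.9, 0.1, 0.8, 0.2, 0.3], static_scores, 2, 2)
step2 = select_tokens([0.0, 0.1, 0.2, 0.3, 0.9, 0.8], static_scores, 2, 2)

# Tokens selected at step 2 that were not already resident after step 1
# are the only ones that would need a CPU-to-GPU transfer.
to_transfer = step2 - step1
print(sorted(step1), sorted(step2), sorted(to_transfer))
```

Because the query-independent part of the selection is identical at adjacent steps in this toy, at least `k_static` selected tokens are guaranteed to overlap, which bounds the number of KV pairs fetched per step.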
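[NLP-15] FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation
Quick read: This paper addresses evaluation bias in table-to-text generation caused by contamination of LLM training data, together with the domain imbalance of existing benchmarks. The key to the solution is FreshTab, a method that builds table-to-text benchmarks on the fly from Wikipedia. Collecting recent tables avoids contamination, and datasets can be built on demand in several languages (German, Russian, and French in addition to English), which enables domain-sensitive evaluation. Experiments show that insights generated from the recent tables look clearly worse under automatic metrics, but this gap does not carry over to LLM and human evaluations; domain effects are visible in all evaluations.
Link: https://arxiv.org/abs/2510.13598
Authors: Kristýna Onderková, Ondřej Plátek, Zdeněk Kasner, Ondřej Dušek
Affiliations: Charles University, Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Categories: Computation and Language (cs.CL)
Comments: To be published in INLG 2025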
Abstract:Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, the evaluation of existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as domain imbalance. We introduce FreshTab, an on-the-fly table-to-text benchmark generation from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method appear clearly worse by automatic metrics, but this does not translate into LLM and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.
[NLP-16] Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
Quick read: This paper addresses how to use large language models (LLMs) to build non-player characters (NPCs) in games that both execute functional tasks and produce persona-consistent dialogue. The key to the solution is a pair of complementary strategies. In the API track, the authors use lightweight prompting, including a Deflanderization prompting method that suppresses excessive role-play and improves task fidelity. In the GPU track, they fine-tune Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA).
Link: https://arxiv.org/abs/2510.13586
Authors: Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
Affiliations: Chulalongkorn University; Thammasat University; Artificial Intelligence Association of Thailand; King’s College London; Mahidol University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).
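[NLP-17] Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models
Quick read: This paper addresses the uneven performance of large language models (LLMs) across languages, in particular the substantial gap between low-resource and high-resource languages. The core challenge is to improve monolingual ability in a target low-resource language without harming general-purpose performance. The key to the solution is a targeted fine-tuning framework built on language-specific neurons: Language Activation Probability Entropy is used to identify neurons specific to the target language, and only the weights associated with those neurons, a dedicated language-specific subnetwork, are fine-tuned on target-language data. Across 12 mid- and low-resource languages, the method outperforms full fine-tuning, FFN-only fine-tuning, LoRA, and random-subset fine-tuning while updating at most 1% of model parameters.
Link: https://arxiv.org/abs/2510.13580
Authors: Daniil Gurgurov, Josef van Genabith, Simon Ostermann
Affiliations: Saarland University; German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN)
Categories: Computation and Language (cs.CL)
Comments: preprint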
Abstract:Large language models exhibit uneven performance across languages, with substantial gaps between high- and low-resource languages. We present a framework for enhancing monolingual capabilities of LLMs in underrepresented languages while preserving their general-purpose performance through targeted fine-tuning of language-specific subnetworks. Our approach identifies language-specific neurons using Language Activation Probability Entropy and fine-tunes only the weights associated with these neurons, a dedicated subnetwork, on target-language data. Experiments on Llama-3.1-8B and Mistral-Nemo-12B across 12 mid- and low-resource languages demonstrate that our method consistently outperforms full fine-tuning, FFN-only fine-tuning, LoRA adaptation, and random subset fine-tuning baselines while efficiently updating only up to 1% of model parameters. Beyond performance improvements, we observe enhanced favorable training dynamics, cross-lingual representational alignment, and systematic weight update changes. To facilitate future research, we release language-specific neuron identifications for over 100 languages as well as our adaptation pipeline, offering a cost-effective pathway for adapting state-of-the-art models to underrepresented languages.
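As a rough illustration of the neuron-identification step, the sketch below scores each neuron by the entropy of its activation probabilities normalized across languages; a low score means the neuron fires mostly for one language. The threshold value, the helper names, and the toy probabilities are assumptions for illustration, and the paper's exact selection rule may differ.

```python
import math


def lape(activation_probs):
    """Entropy of a neuron's activation probabilities normalized across languages.

    activation_probs: dict mapping language -> probability that the neuron
    fires on text in that language.
    """
    total = sum(activation_probs.values())
    if total == 0:
        return float("inf")  # never active: not attributable to any language
    entropy = 0.0
    for p in activation_probs.values():
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


def language_specific_neurons(neurons, threshold):
    """Return ids of neurons whose entropy score falls below the threshold."""
    return [nid for nid, probs in neurons.items() if lape(probs) < threshold]


neurons = {
    "n0": {"en": 0.30, "de": 0.31, "mt": 0.29},  # fires for every language
    "n1": {"en": 0.01, "de": 0.02, "mt": 0.60},  # fires mostly for one language
}
print(language_specific_neurons(neurons, threshold=0.5))  # ['n1']
```

Only the weights tied to the selected neurons would then be left trainable, which is how the update budget stays at a small fraction of the model.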
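[NLP-18] Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
Quick read: This paper addresses two problems: the reasoning process of large language models (LLMs) is opaque, and reinforcement learning (RL) typically assigns uniform credit to every step of a generated sequence, which dilutes the pivotal reasoning steps. The key to the solution is to treat attention as an interpretable structural substrate rather than a byproduct of computation. Two metrics, Windowed Average Attention Distance and Future Attention Influence, are used to separate locally focused from globally focused attention heads and to expose a recurring preplan-and-anchor mechanism: the model first makes a long-range contextual reference to generate an introductory token (preplan), which is immediately followed by, or coincides with, a semantic anchor token that organizes the subsequent reasoning. Building on this, the authors design three RL strategies that dynamically assign credit to preplan tokens, anchor tokens, and their temporal coupling, and report consistent gains across reasoning tasks.
Link: https://arxiv.org/abs/2510.13554
Authors: Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan
Affiliations: Shanghai Jiao Tong University; Alibaba Group
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures, 5 tables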
Abstract:The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token’s global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model’s intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
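The two metrics can be sketched on a toy causal attention map. The abstract gives only verbal definitions, so the formulas below are one plausible reading and should be treated as assumptions: the windowed distance is taken as the attention-weighted backward distance with each distance clipped at a window size, and the future influence of a token as the mean attention it receives from all later tokens. Function names are hypothetical.

```python
def windowed_avg_attention_distance(attn, t, window):
    """Attention-weighted backward distance for query t, distances clipped at `window`."""
    return sum(w * min(t - s, window) for s, w in enumerate(attn[t][: t + 1]))


def future_attention_influence(attn, s):
    """Average attention that token s receives from all later tokens."""
    later = [attn[t][s] for t in range(s + 1, len(attn))]
    return sum(later) / len(later) if later else 0.0


# Toy causal attention map: row t attends over tokens 0..t and sums to 1.
attn = [
    [1.0],
    [0.5, 0.5],
    [0.8, 0.1, 0.1],
    [0.7, 0.0, 0.1, 0.2],
]
print(windowed_avg_attention_distance(attn, 3, window=2))
print(future_attention_influence(attn, 0))
```

In this reading, a token whose query looks far back (large windowed distance) would be a candidate preplan token, and a token with high future influence would be a candidate anchor.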
[NLP-19] K-Merge: Online Continual Merging of Adapters for On-device Large Language Models
Quick read: This paper addresses online continual merging of Low-Rank Adapters (LoRAs) on resource-constrained mobile devices: when storage allows only a limited number of LoRAs to be kept, how can a newly arriving LoRA be merged with the stored adapters while preserving performance on previously supported tasks? The key to the solution is a data-free, computationally efficient strategy for selecting and merging LoRAs that does not need access to the original training data and respects the storage and compute limits of on-device settings.
Link: https://arxiv.org/abs/2510.13537
Authors: Donald Shenaj, Ondrej Bohdal, Taha Ceritli, Mete Ozay, Pietro Zanuttigh, Umberto Michieli
Affiliations: Samsung R&D Institute UK; University of Pisa; University of Padova
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 8 figures
Abstract:On-device deployment of Large Language Models (LLMs) frequently leverages Low-Rank Adapters (LoRAs) to support diverse downstream tasks under tight resource constraints. To address the limited storage capacity of mobile devices, recent works have explored model merging techniques to fuse multiple LoRAs into a single one. In practice, however, LoRAs are often delivered incrementally, as users request support for new tasks (e.g., novel problem types or languages). This scenario introduces a new challenge: on-device online continual merging, where the objective is to incorporate new LoRAs while preserving the performance on previously supported tasks. In this paper, we propose a data-free and computationally efficient strategy for selecting and merging LoRAs when a new one becomes available, assuming the device can store only a limited number of adapters. Extensive experiments across real-world tasks demonstrate the superiority of our approach compared to alternative strategies while adhering to the storage budget and compute limitations of on-device settings.
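[NLP-20] MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
Quick read: This paper addresses the outdated or inaccurate information that large language models (LLMs) produce in medical applications because of errors in training data and the rapid evolution of medical knowledge, along with the limits of existing model-editing methods in preserving locality and supporting batch editing. The key to the solution is MedREK, a retrieval-based editing framework that combines a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. The authors also construct MedVersa, a benchmark with broader coverage of medical subjects that evaluates both single and batch edits under strict locality constraints, and present MedREK as the first validated solution for batch editing in medical LLMs.
Link: https://arxiv.org/abs/2510.13500
Authors: Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng, Xiang Li, Caifeng Shan, Zhenan Sun, Quanzheng Li
Affiliations: City University of Hong Kong; Harvard Medical School; Institute of Automation, CAS; Columbia University; Ohio State University; Westlake University; Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint, work in progress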
Abstract:LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at this https URL.
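[NLP-21] ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding
Quick read: This paper addresses the weakness of current large language models (LLMs) in understanding real-world human intent, especially in public discussions that are complex, non-linear, and full of interwoven perspectives, where LLMs struggle to integrate multi-source signals, reason over inconsistencies, and adapt to evolving discourse. To fill this evaluation gap, the authors introduce ConsintBench, described as the first dynamic, live benchmark for intent understanding, focused on the consumer domain. Its key contribution is a large, diverse evaluation suite that supports real-time updates and prevents data contamination through an automated curation pipeline.
Link: https://arxiv.org/abs/2510.13499
Authors: Xiaozhe Li, TianYi Lyu, Siyi Yang, Yuxi Gong, Yizhao Yang, Jinxuan Huang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu
Affiliations: Tongji University; Stanford University; Currents AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: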
Abstract:Understanding human intent is a complex, high-level task for large language models (LLMs), requiring analytical reasoning, contextual interpretation, dynamic information aggregation, and decision-making under uncertainty. Real-world public discussions, such as consumer product discussions, are rarely linear or involve a single user. Instead, they are characterized by interwoven and often conflicting perspectives, divergent concerns, goals, emotional tendencies, as well as implicit assumptions and background knowledge about usage scenarios. To accurately understand such explicit public intent, an LLM must go beyond parsing individual sentences; it must integrate multi-source signals, reason over inconsistencies, and adapt to evolving discourse, similar to how experts in fields like politics, economics, or finance approach complex, uncertain environments. Despite the importance of this capability, no large-scale benchmark currently exists for evaluating LLMs on real-world human intent understanding, primarily due to the challenges of collecting real-world public discussion data and constructing a robust evaluation pipeline. To bridge this gap, we introduce ConsintBench, the first dynamic, live evaluation benchmark specifically designed for intent understanding, particularly in the consumer domain. ConsintBench is the largest and most diverse benchmark of its kind, supporting real-time updates while preventing data contamination through an automated curation pipeline.
[NLP-22] LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA EMNLP2025
Quick read: This paper addresses a key problem in evaluating question answering (QA) over narrative text: NarrativeQA, the most widely used benchmark, contains noisy documents and flawed QA pairs, which makes evaluation unreliable. The authors introduce LiteraryQA, a high-quality subset focused on literary works. A human- and LLM-validated pipeline identifies and corrects low-quality QA samples and removes extraneous text from the source documents. A meta-evaluation then shows that n-gram-based automatic metrics have low system-level correlation with human judgment, whereas LLM-as-a-Judge scoring, even with small open-weight models, can agree strongly with human rankings.
Link: https://arxiv.org/abs/2510.13494
Authors: Tommaso Bonomo, Luca Gioffré, Roberto Navigli
Affiliations: Sapienza NLP Group; Sapienza University of Rome; Babelscape
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 Main Conference. 22 pages
Abstract:Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at this https URL.
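[NLP-23] Beyond Single-Reward: Multi-Pair Multi-Perspective Preference Optimization for Machine Translation
Quick read: This paper addresses two core problems of Direct Preference Optimization (DPO) in machine translation (MT): reward signals from Quality Estimation (QE) models are flawed and miss critical errors such as translation hallucination, and data is used inefficiently because only a single win-loss pair is selected, discarding many useful learning signals. The key to the solution is M^2PO (Multi-Pair, Multi-Perspective Preference Optimization). It has two parts: (1) a multi-perspective reward engine that combines a new hallucination penalty for factuality with a dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment; and (2) a multi-pair construction strategy that systematically builds a comprehensive set of preference pairs from the whole pool of translation candidates, so the model learns from a richer spectrum of quality trade-offs.
Link: https://arxiv.org/abs/2510.13434
Authors: Hao Wang, Linlong Xu, Heng Liu, Yangyang Liu, Xiaohu Zhao, Bo Zeng, Liangying Shao, Longyue Wang, Weihua Luo, Kaifu Zhang
Affiliations: Alibaba International Digital Commerce
Categories: Computation and Language (cs.CL)
Comments: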
Abstract:Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) to human preferences in Machine Translation (MT), but current methods are hindered by two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors like translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M^2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two key viewpoints: a new hallucination penalty for factuality, and an innovative dynamic quality score that adaptively fuses external evaluations with the model’s own evolving judgment. This is synergistically paired with a multi-pair construction strategy that systematically creates a comprehensive set of preference pairs from the entire pool of translation candidates. This synergistic approach ensures the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.
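The multi-pair idea can be sketched as follows: instead of keeping only the best-versus-worst pair, every pair of candidates with a clear score gap becomes a (chosen, rejected) training example. The scores here are placeholders standing in for whatever the reward engine produces, and `min_margin` is a hypothetical knob that is not described in the abstract.

```python
from itertools import combinations


def build_preference_pairs(candidates, min_margin=0.0):
    """candidates: list of (translation, score). Returns (chosen, rejected) pairs."""
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(candidates, 2):
        if abs(score_a - score_b) <= min_margin:
            continue  # skip ties and near-ties: no clear preference
        if score_a > score_b:
            pairs.append((text_a, text_b))
        else:
            pairs.append((text_b, text_a))
    return pairs


pool = [("hyp A", 0.9), ("hyp B", 0.6), ("hyp C", 0.6), ("hyp D", 0.2)]
pairs = build_preference_pairs(pool)
print(len(pairs))  # 5: six possible pairs minus the B/C tie
```

With four candidates, a single win-loss selection would yield one pair (A over D), while this construction yields five, which is the extra learning signal the abstract refers to.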
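[NLP-24] Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps
Quick read: This paper addresses the lack of a systematic review and classification of evaluation benchmarks for Arabic large language models (LLMs), in particular gaps in diversity, cultural alignment, and task coverage. The key contribution is a four-category taxonomy (Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations) applied to more than 40 Arabic LLM benchmarks, together with a comparison of three data-collection approaches (native collection, translation, and synthetic generation) in terms of authenticity, scale, and cost. The survey offers Arabic NLP researchers insights into benchmark methodologies, reproducibility standards, and evaluation metrics, plus recommendations for future development.
Link: https://arxiv.org/abs/2510.13430
Authors: Ahmed Alzubaidi, Shaikha Alsuwaidi, Basma El Amel Boussaha, Leen AlQadi, Omar Alkaabi, Mohammed Alyafeai, Hamza Alobeidli, Hakim Hacid
Affiliations: Technology Innovation Institute
Categories: Computation and Language (cs.CL)
Comments: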
Abstract:This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches (native collection, translation, and synthetic generation), discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.
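[NLP-25] Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
Quick read: This paper examines the mechanistic causal reasoning ability of large language models (LLMs) through the task of implicit causal chain discovery: identifying the intermediate causal steps that connect a given cause-effect pair and so explain how the cause leads to the effect. The key to the approach is a diagnostic evaluation framework in which nine LLMs are instructed to generate the intermediate steps for cause-effect pairs drawn from polarized discussion on climate change, complemented by human evaluation of the generated chains. The study finds that the LLMs are generally self-consistent and confident, but their judgments are driven mainly by associative pattern matching rather than genuine causal reasoning. Human evaluation nevertheless confirms the logical coherence and integrity of the chains, and the work provides a baseline approach and a benchmark dataset for future research on implicit, mechanistic causal reasoning in argumentation settings.
Link: https://arxiv.org/abs/2510.13417
Authors: Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: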
Abstract:How does a cause lead to an effect, and which intermediate causal steps explain their connection? This work scrutinizes the mechanistic causal reasoning capabilities of large language models (LLMs) to answer these questions through the task of implicit causal chain discovery. In a diagnostic evaluation framework, we instruct nine LLMs to generate all possible intermediate causal steps linking given cause-effect pairs in causal chain structures. These pairs are drawn from recent resources in argumentation studies featuring polarized discussion on climate change. Our analysis reveals that LLMs vary in the number and granularity of causal steps they produce. Although they are generally self-consistent and confident about the intermediate causal connections in the generated chains, their judgments are mainly driven by associative pattern matching rather than genuine causal reasoning. Nonetheless, human evaluations confirmed the logical coherence and integrity of the generated chains. Our baseline causal chain discovery approach, insights from our diagnostic evaluation, and benchmark dataset with causal chains lay a solid foundation for advancing future work in implicit, mechanistic causal reasoning in argumentation settings.
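[NLP-26] Investigating Lexical Change through Cross-Linguistic Colexification Patterns
Quick read: This paper asks what drives meaning change in language, and in particular how to explain the evolution of colexification, where distinct concepts share the same word form. The key to the approach is applying phylogenetic comparative models to dictionary data from three language families (Austronesian, Indo-European, and Uralic) to assess the effects of three predictors: associativity, borrowability, and usage frequency. The results show that more closely related concept pairs are colexified across a larger portion of the family tree and change more slowly, whereas concept pairs that are more frequent or more prone to borrowing change more rapidly and are colexified less often. Considerable differences between the families suggest that areal and cultural factors may also play a role.
Link: https://arxiv.org/abs/2510.13407
Authors: Kim Gfeller, Sabine Stoll, Chundra Cathcart, Paul Widmer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: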
Abstract:One of the most intriguing features of language is its constant change, with ongoing shifts in how meaning is expressed. Despite decades of research, the factors that determine how and why meanings evolve remain only partly understood. Colexification – the phenomenon of expressing multiple distinct concepts using the same word form – serves as a valuable window onto the dynamics of meaning change across languages. Here, we apply phylogenetic comparative models to dictionary data from three language families, Austronesian, Indo-European, and Uralic, in order to shed light on the evolutionary dynamics underlying the colexification of concept pairs. We assess the effects of three predictors: associativity, borrowability, and usage frequency. Our results show that more closely related concept pairs are colexified across a larger portion of the family tree and exhibit slower rates of change. In contrast, concept pairs that are more frequent and more prone to borrowing tend to change more rapidly and are less often colexified. We also find considerable differences between the language families under study, suggesting that areal and cultural factors may play a role.
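[NLP-27] Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models
Quick read: This paper asks whether large language models (LLMs) have genuine Theory of Mind (ToM) abilities, and specifically whether the Generative Agent-Based Model (GABM) Concordia can model belief-based action selection in simulated real-world environments. The key to the approach is an action-based evaluation: the model is judged on whether it makes sensible decisions from inferences about others' mental states in complex social situations, rather than on linguistic memorization. The study finds that GPT-4 frequently fails to select actions based on belief attribution and struggles to generate coherent causal effects from agent actions, which suggests that the ToM-like behavior reported for LLMs may stem from shallow statistical associations rather than true reasoning.
Link: https://arxiv.org/abs/2510.13395
Authors: Agnese Lombardi, Alessandro Lenci
Affiliations: CoLing Lab, Department of Philology, Literature and Linguistics, University of Pisa
Categories: Computation and Language (cs.CL)
Comments: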
Abstract:Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal effects from agent actions, exposing difficulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.
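[NLP-28] Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment
Quick read: This paper addresses the limited strategic persuasion ability of large language models (LLMs) in single-turn dialogue, noting that existing work either overlooks information asymmetry or relies on strong pre-commitment assumptions. The key to the solution is to bring Bayesian Persuasion (BP) into natural language through a commitment-communication mechanism: the persuader explicitly narrates their potential types (for example, honest or dishonest), which guides the persuadee toward the intended Bayesian belief update. In experiments on LLM-based agents, BP-guided LLMs consistently achieve higher persuasion success rates than non-BP baselines.
Link: https://arxiv.org/abs/2510.13387
Authors: Buwei He, Yang Liu, Zhaowei Zhang, Zixia Jia, Huijia Wu, Zhaofeng He, Zilong Zheng, Yipeng Kang
Affiliations: Beijing University of Posts and Telecommunications; BigAI
Categories: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: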
Abstract:Persuasion, a fundamental social capability for humans, remains a challenge for AI systems such as large language models (LLMs). Current studies often overlook the strategic use of information asymmetry in message design or rely on strong assumptions regarding pre-commitment. In this work, we explore the application of Bayesian Persuasion (BP) in natural language within single-turn dialogue settings, to enhance the strategic persuasion capabilities of LLMs. Our framework incorporates a commitment-communication mechanism, where the persuader explicitly outlines an information schema by narrating their potential types (e.g., honest or dishonest), thereby guiding the persuadee in performing the intended Bayesian belief update. We evaluate two variants of our approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, benchmarking them against both naive and strong non-BP (NBP) baselines within a comprehensive evaluation framework. This framework covers a diverse set of persuadees – including LLM instances with varying prompts and fine-tuning and human participants – across tasks ranging from specially designed persuasion scenarios to general everyday situations. Experimental results on LLM-based agents reveal three main findings: (1) LLMs guided by BP strategies consistently achieve higher persuasion success rates than NBP baselines; (2) SFNL exhibits greater credibility and logical coherence, while FNL shows stronger emotional resonance and robustness in naturalistic conversations; (3) with supervised fine-tuning, smaller models can attain BP performance comparable to that of larger models.
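The belief update that the commitment-communication mechanism is meant to trigger can be illustrated with a two-type toy model. The setup is an assumption made for illustration and is not taken from the paper: an honest persuader recommends a product only when it is good, a dishonest one always recommends it, and the persuadee applies Bayes' rule after hearing a recommendation.

```python
def posterior_good(prior_good, p_honest):
    """P(product is good | persuader recommends) under a two-type sender model."""
    p_rec_given_good = 1.0            # both types recommend a good product
    p_rec_given_bad = 1.0 - p_honest  # only the dishonest type recommends a bad one
    evidence = prior_good * p_rec_given_good + (1.0 - prior_good) * p_rec_given_bad
    if evidence == 0.0:
        raise ValueError("a recommendation has zero probability under this model")
    return prior_good * p_rec_given_good / evidence


# The more likely the persuader is honest, the more a recommendation moves the belief.
for p_honest in (0.0, 0.5, 0.8, 1.0):
    print(p_honest, round(posterior_good(prior_good=0.3, p_honest=p_honest), 3))
```

When the persuader is certainly dishonest the posterior equals the prior, and when certainly honest it reaches 1, which shows why narrating the type distribution matters to the persuadee's update.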
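[NLP-29] Document Intelligence in the Era of Large Language Models: A Survey
Quick read: This survey addresses the lack of a clear picture of how Document AI (DAI) has evolved with the rise of large language models (LLMs), where current research stands, and where it is heading. Its key contribution is a systematic account of the shift from earlier encoder-decoder architectures to decoder-only LLMs, covering key advances and challenges in multimodal, multilingual, and retrieval-augmented DAI, and proposing future directions such as agent-based approaches and document-specific foundation models.
Link: https://arxiv.org/abs/2510.13366
Authors: Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier
Affiliations: SAP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: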
Abstract:Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI’s evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.
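[NLP-30] D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree
Quick read: This paper addresses the factual inconsistency and logical decay that large language models (LLMs) show in multi-turn dialogue, which stem from reliance on static pre-trained knowledge and an inability to reason adaptively over the dialogue history. Existing mitigations such as Retrieval-Augmented Generation (RAG) and agentic working memories improve recall but still depend on static knowledge sources and a single pre-defined reasoning path. The key to the solution is D-SMART, a model-agnostic framework with two cooperating components: (1) a Dynamic Structured Memory (DSM) that incrementally builds and maintains an OWL-compliant knowledge graph of the conversation; and (2) a Reasoning Tree (RT) that performs inference as an explicit, traceable multi-step search over that graph.
Link: https://arxiv.org/abs/2510.13363
Authors: Xiang Lei, Qin Li, Min Zhang, Min Zhang
Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Information Science and Technology, East China Normal University
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 6 figures (main content); 25 pages, 18 figures (total)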
Abstract:Large Language Models (LLMs) often exhibit factual inconsistencies and logical decay in extended, multi-turn dialogues, a challenge stemming from their reliance on static, pre-trained knowledge and an inability to reason adaptively over the dialogue history. Prevailing mitigation strategies, such as Retrieval-Augmented Generation (RAG) and agentic working memories, improve information recall but still engage with fundamentally static knowledge sources and follow a single pre-defined reasoning path. This hinders their ability to preserve the factual and logical consistency of their responses in multi-turn dialogues while the context evolves over time. To address this issue, we propose D-SMART, a model-agnostic framework designed to maintain multi-turn dialogue consistency by enabling LLMs to build and reason over a dynamic, structured representation of the conversational context. This is achieved via two synergistic components: (1) a Dynamic Structured Memory (DSM), which incrementally constructs and maintains an authoritative, OWL-compliant knowledge graph of the conversation; and (2) a Reasoning Tree (RT), which executes inferences as an explicit and traceable multi-step search over the graph. As the commonly used quality score (judged by GPT-4) can overlook logical flaws, we introduce new NLI-based metrics to better measure multi-turn dialogue consistency. Comprehensive experiments on the MT-Bench-101 benchmark show that D-SMART significantly outperforms state-of-the-art baselines, elevating the dialogue consistency score by over 48% for both proprietary and open-source models, and notably improves the quality score of the latter by up to 10.1%.
[NLP-31] Personal Attribute Leakage in Federated Speech Models
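Quick read: This paper examines a privacy weakness of automatic speech recognition (ASR) models trained with federated learning: the risk of attribute inference attacks on sensitive attributes such as gender, age, accent, emotion, and dysarthria. The key contribution is to validate a non-parametric white-box attack that needs no access to raw speech; using only model weight differentials, it infers a target speaker's sensitive attributes, and it is most effective on attribute classes that are underrepresented or absent in the pre-training data. The results expose a new privacy risk for federated ASR models in deployment and inform work on improving model security.
Link: https://arxiv.org/abs/2510.13357
Authors: Hamdan Al-Ali,Ali Reza Ghavamipour,Tommaso Caselli,Fatih Turkmen,Zeerak Talat,Hanan Aldarmaki
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures, 2 tables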
Abstract:Federated learning is a common method for privacy-preserving training of machine learning models. In this paper, we analyze the vulnerability of ASR models to attribute inference attacks in the federated setting. We test a non-parametric white-box attack method under a passive threat model on three ASR models: Wav2Vec2, HuBERT, and Whisper. The attack operates solely on weight differentials without access to raw speech from target speakers. We demonstrate attack feasibility on sensitive demographic and clinical attributes: gender, age, accent, emotion, and dysarthria. Our findings indicate that attributes that are underrepresented or absent in the pre-training data are more vulnerable to such inference attacks. In particular, information about accents can be reliably inferred from all models. Our findings expose previously undocumented vulnerabilities in federated ASR models and offer insights towards improved security.
[NLP-32] Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
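Quick read: This paper addresses the safety and compliance problems that arise when Large Language Models (LLMs) are deployed in enterprise and mission-critical settings without multimodal, real-time, explainable guardrails. Existing guardrail systems usually handle text only, cope poorly with complex inputs that include images and audio, and fall short on explainability and production readiness. The key to the solution is Protect, a natively multimodal guardrailing model that fine-tunes category-specific adapters via Low-Rank Adaptation (LoRA) to cover four safety dimensions: toxicity, sexism, data privacy, and prompt injection. A teacher-assisted annotation pipeline generates high-fidelity, context-aware multimodal labels, enabling unified safety detection across text, image, and audio. Protect outperforms existing open and proprietary models (such as WildGuard, LlamaGuard-4, and GPT-4.1), laying a foundation for trustworthy, auditable, production-ready multimodal safety systems.
Link: https://arxiv.org/abs/2510.13351
Authors: Karthik Avinash,Nikhil Pareek,Rishav Hada
Affiliations: FutureAGI Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: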
Abstract:The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability – limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation, focused on text alone making them inadequate for multi-modal, production-scale environments. We introduce Protect, natively multi-modal guardrailing model designed to operate seamlessly across text, image, and audio inputs, designed for enterprise-grade deployment. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.
[NLP-33] UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
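Quick read: This paper tackles the long-standing separation between speech and music in unified audio generation, which stems from inherent task conflicts and severe data imbalance and stands in the way of truly universal audio synthesis. The key to the solution is UniMoE-Audio, a model built on a Dynamic-Capacity Mixture-of-Experts (MoE) architecture. First, a Top-P routing strategy allocates a dynamic number of experts. Second, a hybrid expert design combines routed experts (domain-specific), shared experts (domain-agnostic), and null experts, supporting differentiated knowledge extraction and adaptive computation skipping. Finally, a three-stage training curriculum (Independent Specialist Training, MoE Integration and Warmup, Synergistic Joint Training) systematically mitigates data imbalance. The model preserves per-domain performance while markedly improving cross-domain synergy, and reaches state-of-the-art results on major speech and music generation benchmarks.
Link: https://arxiv.org/abs/2510.13344
Authors: Zhenyu Liu,Yunxin Li,Xuanyu Zhang,Qixun Teng,Shenyuan Jiang,Xinyu Chen,Haoyuan Shi,Jinchao Li,Qi Wang,Haolan Chen,Fanbo Meng,Mingjun Zhao,Yu Xu,Yancheng He,Baotian Hu,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Shenzhen Loop Area Institute (深圳环区研究院)
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: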
Abstract:Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each “proto-expert” without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architecture and curated training strategies in advancing the field of universal audio generation. Homepage: this https URL
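The Top-P routing idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the UniMoE-Audio implementation: it assumes routing happens per token, that the router emits one logit per expert, and that the selected weights are renormalized; the function names and the threshold `p=0.7` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_route(router_logits, p=0.7):
    """Select the smallest set of experts whose probability mass reaches p.

    Returns (expert_index, renormalized_weight) pairs for one token.
    The threshold p and the renormalization are assumptions of this sketch.
    """
    probs = softmax(router_logits)
    # Visit experts from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [(i, probs[i] / mass) for i in chosen]


# A confident router activates one expert; an uncertain one activates more.
print(len(top_p_route([5.0, 0.0, 0.0, 0.0])))  # 1 expert
print(len(top_p_route([0.0, 0.0, 0.0, 0.0])))  # 3 experts
```

Unlike fixed top-k routing, the number of active experts varies with how peaked the router distribution is. If some expert indices were designated as null experts, tokens routed to them would simply skip that share of the computation.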
[NLP-34] Are Proverbs the New Pythian Oracles? Exploring Sentiment in Greek Sayings
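Quick read: This paper addresses automated sentiment analysis of Greek proverbs, in particular how to identify and map their sentiment in a multilingual, multi-dialect setting. The key to the solution is to use large language models (LLMs) to classify proverb sentiment, treated as a non-conventional sentiment polarity task, on an annotated dataset expanded to include local dialects. Geospatial visualization and a combined analysis by dialect and topic then reveal regional patterns in proverb sentiment across Greece. The study finds that LLMs are reasonably accurate on this non-standard sentiment task and that negative sentiment predominates in most regions.
Link: https://arxiv.org/abs/2510.13341
Authors: Katerina Korre,John Pavlopoulos
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: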
Abstract:Proverbs are among the most fascinating linguistic phenomena that transcend cultural and linguistic boundaries. Yet, much of the global landscape of proverbs remains underexplored, as many cultures preserve their traditional wisdom within their own communities due to the oral tradition of the phenomenon. Taking advantage of the current advances in Natural Language Processing (NLP), we focus on Greek proverbs, analyzing their sentiment. Departing from an annotated dataset of Greek proverbs, we expand it to include local dialects, effectively mapping the annotated sentiment. We present (1) a way to exploit LLMs in order to perform sentiment classification of proverbs, (2) a map of Greece that provides an overview of the distribution of sentiment, (3) a combinatory analysis in terms of the geographic position, dialect, and topic of proverbs. Our findings show that LLMs can provide us with an accurate enough picture of the sentiment of proverbs, especially when approached as a non-conventional sentiment polarity task. Moreover, in most areas of Greece negative sentiment is more prevalent.
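[NLP-35] Taming the Fragility of KV Cache Eviction in LLM Inference
Quick read: This paper targets the substantial memory and runtime overhead that the Transformer's Key-Value (KV) cache imposes on Large Language Model (LLM) inference. Existing methods typically rely on a "stability assumption", namely that a fixed set of cache entries stays important throughout generation, and evict entries with a scoring-aggregation framework; most work refines the importance scores while defaulting to mean aggregation. The authors argue that the stability assumption is inherently fragile and that mean aggregation breaks down in extreme cases, causing a marked drop in generation quality. The key to the solution is a defensive aggregation strategy: a simple two-step, linear-time mechanism that controls worst-case risk and guards against extreme cases with negligible computational overhead. Building on it, the authors propose two cache eviction methods, DefensiveKV and its extension Layer-DefensiveKV, the latter adding layer-wise budget allocation. Across seven task domains (18 datasets), the two methods reduce generation quality loss by 2.3x and 4.3x, respectively, relative to the strongest baseline at a 20% cache size, improving the robustness of cache management.
Link: https://arxiv.org/abs/2510.13334
Authors: Yuan Feng,Haoyu Guo,JunLin Lv,S. Kevin Zhou,Xike Xie
Affiliations: University of Science and Technology of China (中国科学技术大学); USTC (中国科学技术大学); Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research (苏州研究院奇迹中心数据幽灵实验室)
Categories: Computation and Language (cs.CL)
Comments: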
Abstract:Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer’s Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the stability assumption that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation due to a faithful trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management. Our code is available at this https URL.
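A toy example may help show why mean aggregation is fragile in a scoring-aggregation eviction scheme. The sketch below is not DefensiveKV: the abstract does not spell out the two-step strategy, so plain `max` is used here purely as a hypothetical stand-in for a worst-case-aware aggregator. The score matrix and all names are invented for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)


def keep_indices(scores, budget, aggregate):
    """Return the cache entries to keep, in ascending index order.

    scores[q][j] is the importance of cache entry j as seen by query q.
    aggregate collapses one entry's per-query scores into a single number.
    """
    n_entries = len(scores[0])
    aggregated = [aggregate([row[j] for row in scores]) for j in range(n_entries)]
    ranked = sorted(range(n_entries), key=lambda j: aggregated[j], reverse=True)
    return sorted(ranked[:budget])


# Invented scores: entry 2 matters only to the last query.
SCORES = [
    [0.40, 0.35, 0.05, 0.20],
    [0.45, 0.30, 0.05, 0.20],
    [0.40, 0.35, 0.05, 0.20],
    [0.15, 0.15, 0.60, 0.10],
]

print(keep_indices(SCORES, budget=2, aggregate=mean))  # mean evicts entry 2
print(keep_indices(SCORES, budget=2, aggregate=max))   # max retains entry 2
```

With a budget of two entries, the mean ranks entry 2 low because most queries ignore it, even though one query depends on it heavily. That is the kind of extreme case the abstract says a defensive aggregator should guard against.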
[NLP-36] Embedding-Based Context-Aware Reranker
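Quick read: This paper addresses the cross-passage reasoning challenges that document chunking introduces in Retrieval-Augmented Generation (RAG) systems, such as coreference resolution, entity disambiguation, and evidence aggregation. Mainstream rerankers use powerful pretrained language models yet still handle these cross-passage tasks poorly. The key to the solution is EBCAR (Embedding-Based Context-Aware Reranker), a lightweight reranking framework that operates directly on the embeddings of retrieved passages and uses passage structural information together with a hybrid attention mechanism to better capture cross-passage semantic relations, improving accuracy while remaining efficient.
Link: https://arxiv.org/abs/2510.13329
Authors: Ye Yuan,Mohammad Amin Shabani,Siqi Liu
Affiliations: McGill (麦吉尔大学); Mila - Quebec AI Institute (魁北克人工智能研究所); RBC Borealis (加拿大皇家银行Borealis)
Categories: Computation and Language (cs.CL)
Comments: Under Review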
Abstract:Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant evidence from a corpus to support downstream generation. The common practice of splitting a long document into multiple shorter passages enables finer-grained and targeted information retrieval. However, it also introduces challenges when a correct retrieval would require inference across passages, such as resolving coreference, disambiguating entities, and aggregating evidence scattered across multiple sources. Many state-of-the-art (SOTA) reranking methods, despite utilizing powerful large pretrained language models with potentially high inference costs, still neglect the aforementioned challenges. Therefore, we propose Embedding-Based Context-Aware Reranker (EBCAR), a lightweight reranking framework operating directly on embeddings of retrieved passages with enhanced cross-passage understandings through the structural information of the passages and a hybrid attention mechanism, which captures both high-level interactions across documents and low-level relationships within each document. We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
[NLP-37] ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
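Quick read: This paper addresses Conversational Question Answering (CQA), where user intent evolves across turns and utterances are often underspecified, requiring dynamic reasoning and coordination between retrieval and generation. Static "rewrite, retrieve, generate" pipelines adapt poorly to complex interactions and lack exploratory and adaptive behavior in multi-turn dialogue. The key to the solution is ChatR1, a reasoning framework based on Reinforcement Learning (RL) that interleaves search and reasoning across dialogue turns so that behavior can adjust dynamically. An intent-aware reward provides turn-level feedback and aligns retrieval and reasoning with the user's evolving goals, mitigating the sparse and delayed rewards typical of RL. Experiments show that the method outperforms existing models on several CQA datasets and generalizes well across domains.
Link: https://arxiv.org/abs/2510.13312
Authors: Simon Lupart,Mohammad Aliannejadi,Evangelos Kanoulas
Affiliations: University of Amsterdam (阿姆斯特丹大学)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: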
Abstract:We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static “rewrite, retrieve, and generate” pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1’s performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.
[NLP-38] LLM one-shot style transfer for Authorship Attribution and Verification
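Quick read: This paper addresses a weakness of traditional computational stylometry in authorship attribution and verification: reliance on spurious correlations in the data, which confounds style with topic. The key to the solution is to exploit the Causal Language Modeling (CLM) pre-training and in-context learning abilities of modern LLMs in an unsupervised method that needs no labeled data: it uses an LLM's log-probabilities to measure how well style transfers from one text to another (style transferability), and bases attribution and verification on that measure. When topical correlations are controlled for, the method is more accurate than contrastively trained baselines, and its performance scales fairly consistently with base model size and test-time computation.
Link: https://arxiv.org/abs/2510.13302
Authors: Pablo Miralles-González,Javier Huertas-Tato,Alejandro Martín,David Camacho
Affiliations: Technical University of Madrid (马德里理工大学)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: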
Abstract:Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation; enabling flexible trade-offs between computational cost and accuracy.
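[NLP-39] Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models
Quick read: This paper addresses style-content mismatch in Auto-Regressive Text-to-Speech (AR TTS) models: when the emotion requested in a natural language prompt conflicts with the semantics of the text, the speech sounds unnatural, which undermines fine-grained emotional control. The key to the solution is an adaptive Classifier-Free Guidance (CFG) scheme that adjusts CFG strength according to the degree of style-content mismatch detected by a large language model or a natural language inference model, improving emotional expressiveness while maintaining audio quality and intelligibility.
Link: https://arxiv.org/abs/2510.13293
Authors: Yizhou Peng,Yukun Ma,Chong Zhang,Yi-Wen Chao,Chongjia Ni,Bin Ma
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Submitted to ICASSP 2026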
Abstract:While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, which can lead to degraded audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG’s impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.
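The adaptive CFG idea can be sketched on next-token logits. The guidance formula below is the standard classifier-free guidance combination; the rest is assumed for illustration. In particular, the mismatch levels, the scale values, and even the direction of the schedule are hypothetical placeholders, since the abstract does not report them.

```python
def guided_logits(cond, uncond, scale):
    """Classifier-free guidance on next-token logits.

    scale = 0 returns the unconditional logits, scale = 1 the conditional
    ones, and scale > 1 extrapolates further toward the style prompt.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Hypothetical schedule: which scale suits which mismatch level is an
# assumption of this sketch, not a value taken from the paper.
SCALE_BY_MISMATCH = {"none": 2.0, "mild": 1.5, "strong": 1.0}


def adaptive_guided_logits(cond, uncond, mismatch_level):
    """Pick the guidance scale from the detected mismatch level."""
    return guided_logits(cond, uncond, SCALE_BY_MISMATCH[mismatch_level])


cond = [2.0, 0.0]    # logits with the emotion prompt
uncond = [1.0, 1.0]  # logits without it
print(adaptive_guided_logits(cond, uncond, "none"))
print(adaptive_guided_logits(cond, uncond, "strong"))
```

In a real system the mismatch level would come from an LLM or NLI model comparing the style prompt with the text, as the abstract describes; here it is passed in as a plain string.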
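[NLP-40] Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems
Quick read: This paper addresses five core challenges that intelligent interaction systems face in industrial use: (1) high-quality data for cold-start training is hard to construct, which hinders self-evolution and raises labor costs; (2) multi-turn dialogue performance is weak because of inadequate intent understanding, rule compliance, and solution extraction; (3) frequently changing business rules hurt operability and transferability, limiting low-cost expansion and adaptability; (4) a single Large Language Model (LLM) struggles in complex scenarios, and the lack of a multi-agent collaboration framework reduces process completeness and service quality; (5) open-domain multi-turn dialogues have no unified golden answers, which hampers quantitative evaluation and continuous optimization. The key to the solution is WOWService, a system that combines LLMs with a multi-agent architecture for autonomous task management and collaborative problem-solving, organized around five core modules: data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation. It is deployed on the Meituan App, where it improves user satisfaction metrics (USM 1 down 27.53%, USM 2 up 25.51%), supporting its effectiveness at capturing user needs and delivering personalized service.
Link: https://arxiv.org/abs/2510.13291
Authors: Xuxin Cheng,Ke Zeng,Zhiquan Cao,Linyi Dai,Wenxuan Gao,Fei Han,Ai Jian,Feng Hong,Wenxing Hu,Zihe Huang,Dejian Kong,Jia Leng,Zhuoyuan Liao,Pei Liu,Jiaye Lin,Xing Ma,Jingqing Ruan,Jiaxing Song,Xiaoyu Tan,Ruixuan Xiao,Wenhui Yu,Wenyu Zhan,Haoxing Zhang,Chao Zhou,Hao Zhou,Shaodong Zheng,Ruinian Chen,Siyuan Chen,Ziyang Chen,Yiwen Dong,Yaoyou Fan,Yangyi Fang,Yang Gan,Shiguang Guo,Qi He,Chaowen Hu,Binghui Li,Dailin Li,Xiangyu Li,Yan Li,Chengjian Liu,Xiangfeng Liu,Jiahui Lv,Qiao Ma,Jiang Pan,Cong Qin,Chenxing Sun,Wen Sun,Zhonghui Wang,Abudukelimu Wuerkaixi,Xin Yang,Fangyi Yuan,Yawen Zhu,Tianyi Zhai,Jie Zhang,Runlai Zhang,Yao Xu,Yiran Zhao,Yifan Wang,Xunliang Cai,Yangen Hu,Cao Liu,Lu Pan,Xiaoli Wang,Bo Xiao,Wenyuan Yao,Qianlin Zhou,Benchang Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 36 pages, 14 figures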
Abstract:Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high-quality data for cold-start training is difficult, hindering self-evolution and raising labor costs. (2) Multi-turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low-cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi-agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open-domain nature of multi-turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi-agent architectures, WOWService enables autonomous task management and collaborative problem-solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation. Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) -27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.
[NLP-41] In-Distribution Steering: Balancing Control and Coherence in Language Model Generation
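Quick read: This paper addresses a problem with existing activation steering methods: a fixed steering strength at inference time leads either to insufficient control or to ill-adapted intervention, which harms the plausibility and coherence of generated text. The key to the solution is In-Distribution Steering (IDS), which adjusts steering strength dynamically according to where the input lies relative to the data distribution in representation space, giving adaptive intervention and stable generation: high accuracy on classification tasks together with coherent text that does not collapse.
Link: https://arxiv.org/abs/2510.13285
Authors: Arthur Vogels,Benjamin Wong,Yann Choho,Annabelle Blangero,Milan Bhan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: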
Abstract:Activation steering methods control large language model (LLM) behavior by modifying internal activations at inference time. However, most existing activation steering methods rely on a fixed steering strength, leading to either insufficient control or unadapted intervention that degrades text plausibility and coherence. We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS dynamically adjusts interventions according to how far a given input lies within the distribution, enabling adaptive intervention and generation stability during text generation. Experiments demonstrate that IDS achieves strong accuracy on classification tasks while producing coherent text without collapse, making IDS particularly well suited for real-world applications.
[NLP-42] MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
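Quick read: This paper addresses the observation that Large Vision Language Models (LVLMs), despite their expanded context windows, still make insufficient use of long multimodal contexts, so contextual faithfulness is hard to guarantee in real-world applications. The key to the solution is MMLongCite, a comprehensive benchmark of 8 tasks spanning 6 context length intervals and covering text, image, and video modalities, which systematically measures the multimodal faithfulness of LVLMs in long-context scenarios.
Link: https://arxiv.org/abs/2510.13276
Authors: Keyan Zhou,Zecheng Tang,Lingfeng Ming,Guanghao Zhou,Qiguang Chen,Dan Qiao,Zheming Yang,Libo Qin,Minghui Qiu,Juntao Li,Min Zhang
Affiliations: Soochow University (苏州大学); ByteDance (字节跳动); Harbin Institute of Technology (哈尔滨工业大学); Central South University (中南大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: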
Abstract:The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of such long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
[NLP-43] Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
Quick read: This paper addresses a shortcoming of current Reinforcement Learning (RL) based search agents, which train a Large Language Model (LLM) to use search engines for Retrieval-Augmented Generation (RAG): they focus on final-answer correctness and neglect the faithfulness of intermediate reasoning steps, which can make the Chain-of-Thought (CoT) inconsistent or untrustworthy. The key to the solution is VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a framework that integrates fine-grained faithfulness rewards into the RL process, steering the model toward information-think, think-answer, and think-search faithfulness. Experiments show that the method improves reasoning faithfulness while achieving comparable task performance on seven question answering (QA) benchmarks.
Link: https://arxiv.org/abs/2510.13272
Authors: Zhichao Xu,Zongyu Wu,Yun Zhou,Aosong Feng,Kang Zhou,Sangmin Woo,Kiran Ramnath,Yijun Tian,Xuan Qi,Weikang Qiu,Lin Lee Cheong,Haibo Ding
Affiliations: Amazon AI Fundamental Research (亚马逊AI基础研究); The Pennsylvania State University (宾夕法尼亚州立大学); Yale University (耶鲁大学)
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that a prototypical RL-based search agent, Search-R1, has significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve comparable task performance across seven QA benchmarks.
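[NLP-44] Do You Get the Hint? Benchmarking LLMs on the Board Game Concept
Quick read: This paper addresses the weak performance of Large Language Models (LLMs) on abstract reasoning tasks, especially those posed in non-linguistic representations such as grids, symbols, or visual patterns. The key to the solution is to use Concept, a simple word-guessing board game, as a benchmark; its natural language representation is much closer to the LLM pre-training data distribution. Although human players solve the game with a success rate above 90%, no state-of-the-art LLM exceeds 40%, which exposes limits in interpreting other players' strategic intents and in revising initial hypotheses as sequential information arrives. Extending the evaluation across languages shows that LLM performance drops further in lower-resource languages (Dutch, French, and Spanish) than in English.
Link: https://arxiv.org/abs/2510.13271
Authors: Ine Gevers,Walter Daelemans
Affiliations: CLiPS, University of Antwerp (CLiPS,安特卫普大学)
Categories: Computation and Language (cs.CL)
Comments: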
Abstract:Large language models (LLMs) have achieved striking successes on many benchmarks, yet recent studies continue to expose fundamental weaknesses. In particular, tasks that require abstract reasoning remain challenging, often because they use representations such as grids, symbols, or visual patterns that differ from the natural language data LLMs are trained on. In this paper, we introduce Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning in a representation that is much closer to LLM pre-training data: natural language. Our results show that this game, easily solved by humans (with a success rate of over 90%), is still very challenging for state-of-the-art LLMs (no model exceeds 40% success rate). Specifically, we observe that LLMs struggle with interpreting other players’ strategic intents, and with correcting initial hypotheses given sequential information updates. In addition, we extend the evaluation across multiple languages, and find that the LLM performance drops further in lower-resource languages (Dutch, French, and Spanish) compared to English.
[NLP-45] Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain
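Quick read: This paper addresses the unclear mechanisms behind the syntactic abilities of Large Language Models (LLMs), in particular which computational modules are responsible and whether they resemble language processing in the human brain. The authors propose the Hierarchical Frequency Tagging Probe (HFTP), which uses frequency-domain analysis to identify the individual Multilayer Perceptron (MLP) neurons in LLMs, and the cortical regions in the brain (via intracranial recordings), that encode syntactic structures. The key innovation is to bring frequency-domain analysis to the comparison between LLM internal representations and human intracranial neural activity, revealing how different LLMs map syntactic levels onto layers and how closely their representations resemble those of the language-dominant left hemisphere. This provides a quantitative tool and empirical evidence for asking whether the mechanisms behind LLM improvements are human-like.
Link: https://arxiv.org/abs/2510.13255
Authors: Jingmin An,Yilong Song,Ruolin Yang,Nai Ding,Lingxi Lu,Yuxuan Wang,Wei Wang,Chu Zhuang,Qian Wang,Fang Fang
Affiliations: Peking University (北京大学); Zhejiang University (浙江大学); Beijing Language and Culture University (北京语言大学); Beijing Institute for General Artificial Intelligence (北京通用人工智能研究院)
Categories: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: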
Abstract:Large Language Models (LLMs) demonstrate human-level or even superior language abilities, effectively modeling syntactic structures, yet the specific computational modules responsible remain unclear. A key question is whether LLM behavioral capabilities stem from mechanisms akin to those in the human brain. To address these questions, we introduce the Hierarchical Frequency Tagging Probe (HFTP), a tool that utilizes frequency-domain analysis to identify neuron-wise components of LLMs (e.g., individual Multilayer Perceptron (MLP) neurons) and cortical regions (via intracranial recordings) encoding syntactic structures. Our results show that models such as GPT-2, Gemma, Gemma 2, Llama 2, Llama 3.1, and GLM-4 process syntax in analogous layers, while the human brain relies on distinct cortical regions for different syntactic levels. Representational similarity analysis reveals a stronger alignment between LLM representations and the left hemisphere of the brain (dominant in language processing). Notably, upgraded models exhibit divergent trends: Gemma 2 shows greater brain similarity than Gemma, while Llama 3.1 shows less alignment with the brain compared to Llama 2. These findings offer new insights into the interpretability of LLM behavioral improvements, raising questions about whether these advancements are driven by human-like or non-human-like mechanisms, and establish HFTP as a valuable tool bridging computational linguistics and cognitive neuroscience. This project is available at this https URL.
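[NLP-46] EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
Quick read: This paper addresses the inability of current AI agents to learn complex skills on the fly at test time; in novel environments they behave like "clever but clueless interns", which severely limits their practical value. To measure and drive progress, the authors introduce the Jericho Test-Time Learning (J-TTL) benchmark, in which an agent plays the same game over several consecutive episodes and tries to improve from one episode to the next. Experiments show that existing adaptation methods (reflection, memory, or reinforcement learning) struggle on it. The paper then proposes EvoTest, whose core idea is to evolve the entire agentic system without any fine-tuning or gradient computation. It has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes each episode transcript and proposes a revised configuration for the next episode, rewriting the prompt, updating memory by logging effective state-action choices, tuning hyperparameters, and learning tool-use routines. On J-TTL, EvoTest consistently improves performance, outperforming reflection-only and memory-only baselines as well as more complex online fine-tuning methods, and it is the only method that wins two games (Detective and Library).
Link: https://arxiv.org/abs/2510.13220
Authors: Yufei He,Juncheng Liu,Yue Liu,Yibo Li,Tri Cao,Zhiyuan Hu,Xinxing Xu,Bryan Hooi
Affiliations: National University of Singapore (新加坡国立大学); Microsoft Research (微软研究院)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: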
Abstract:A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like “clever but clueless interns” in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
[NLP-47] Personalized Learning Path Planning with Goal-Driven Learner State Modeling
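[Quick read]: This paper addresses the lack of goal-alignment mechanisms in Personalized Learning Path Planning (PLPP): existing methods struggle to generate coherent, goal-directed learning paths tailored to an individual learner's objectives. The key to the solution is the Pxplore framework, which combines a reinforcement-based training paradigm with an educational architecture driven by large language models (LLMs). It designs a structured learner state model and an automated reward function that turn abstract learning objectives into computable signals, trains the policy with a combination of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploys it on a real-world learning platform to generate personalized paths efficiently.
Link: https://arxiv.org/abs/2510.13215
Authors: Joy Jia Yin Lim, Ye He, Jifan Yu, Xin Cong, Daniel Zhang-Li, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu
Affiliations: Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: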
Abstract:Personalized Learning Path Planning (PLPP) aims to design adaptive learning paths that align with individual goals. While large language models (LLMs) show potential in personalizing learning experiences, existing approaches often lack mechanisms for goal-aligned planning. We introduce Pxplore, a novel framework for PLPP that integrates a reinforcement-based training paradigm and an LLM-driven educational architecture. We design a structured learner state model and an automated reward function that transforms abstract objectives into computable signals. We train the policy combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploy it within a real-world learning platform. Extensive experiments validate Pxplore’s effectiveness in producing coherent, personalized, and goal-driven learning paths. We release our code and dataset to facilitate future research.
[NLP-48] A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
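[Quick read]: This paper addresses the uneven availability of high-quality digital language resources across the world's languages, which keeps most of the population from benefiting from natural language processing (NLP) technology. Because low-resource languages lack the data needed for NLP tasks, the paper proposes a scalable, fully automated method that extracts bilingual parallel corpora from newspaper articles using image and text analytics. The key to the solution is combining image and text analysis to build parallel corpora automatically; validation on a downstream machine translation task shows an improvement of close to 3 BLEU points over the current baseline.
Link: https://arxiv.org/abs/2510.13211
Authors: Prawaal Sharma, Navneet Goyal, Poonam Goyal, Vishnupriyan R
Affiliations: Infosys; BITS Pilani
Categories: Computation and Language (cs.CL)
Comments: 4 Pages, Parallel Data Augmentation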
Abstract:Linguistic diversity across the world creates a disparity in the availability of good-quality digital language resources, thereby limiting the technological benefits available to the majority of the human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel, scalable, and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building parallel data corpora for two different language combinations and demonstrate the value of this dataset through a downstream machine translation task, improving over the current baseline by close to 3 BLEU points.
[NLP-49] LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems
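[Quick read]: This paper addresses fairness problems in AI systems caused by bias in natural language data, in particular the performance disparities that arise when certain groups are underrepresented in training data. Traditional fairness methods (pre-processing, in-processing, post-processing) depend on protected-attribute labels, involve accuracy-fairness trade-offs, and generalize poorly across datasets. The key to the solution is LLM-Guided Synthetic Augmentation (LGSA), which uses large language models (LLMs) to generate counterfactual examples for underrepresented groups while preserving label integrity. Structured prompts produce gender-swapped paraphrases, which then pass through quality control: semantic similarity checks, attribute verification, toxicity screening, and human spot checks. The result is a substantially smaller gender bias gap with no loss of accuracy, combining subgroup balance with high task accuracy.
Link: https://arxiv.org/abs/2510.13202
Authors: Sai Suhruth Reddy Karri, Yashwanth Sai Nallapuneni, Laxmi Narasimha Reddy Mallireddy, Gopichand G
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 1 Table, submitted to an international conference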
Abstract:Bias in AI systems, especially those relying on natural language data, raises ethical and practical concerns. Underrepresentation of certain groups often leads to uneven performance across demographics. Traditional fairness methods, such as pre-processing, in-processing, and post-processing, depend on protected-attribute labels, involve accuracy-fairness trade-offs, and may not generalize across datasets. To address these challenges, we propose LLM-Guided Synthetic Augmentation (LGSA), which uses large language models to generate counterfactual examples for underrepresented groups while preserving label integrity. We evaluated LGSA on a controlled dataset of short English sentences with gendered pronouns, professions, and binary classification labels. Structured prompts were used to produce gender-swapped paraphrases, followed by quality control including semantic similarity checks, attribute verification, toxicity screening, and human spot checks. The augmented dataset expanded training coverage and was used to train a classifier under consistent conditions. Results show that LGSA reduces performance disparities without compromising accuracy. The baseline model achieved 96.7 percent accuracy with a 7.2 percent gender bias gap. Simple swap augmentation reduced the gap to 0.7 percent but lowered accuracy to 95.6 percent. LGSA achieved 99.1 percent accuracy with a 1.9 percent bias gap, improving performance on female-labeled examples. These findings demonstrate that LGSA is an effective strategy for bias mitigation, enhancing subgroup balance while maintaining high task accuracy and label fidelity.
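The abstract reports a "gender bias gap" without defining it. One plausible reading, sketched here purely as an assumption, is the difference between the best and worst per-group accuracy. The record layout and the names `subgroup_accuracy` and `bias_gap` are hypothetical and not taken from the paper.

```python
def subgroup_accuracy(records):
    """Per-group accuracy for records of the form (group, y_true, y_pred)."""
    correct, total = {}, {}
    for group, y_true, y_pred in records:
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}


def bias_gap(records):
    """Assumed definition: best subgroup accuracy minus worst subgroup accuracy."""
    acc = subgroup_accuracy(records)
    return max(acc.values()) - min(acc.values())


# Toy predictions (made up for illustration): 4/4 correct for one group, 3/4 for the other.
records = [
    ("male", 1, 1), ("male", 0, 0), ("male", 1, 1), ("male", 0, 0),
    ("female", 1, 1), ("female", 0, 0), ("female", 1, 0), ("female", 0, 0),
]
gap = bias_gap(records)
print(subgroup_accuracy(records), gap)
```

Under this reading, an augmentation method is judged both by overall accuracy and by how far it shrinks `gap`.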
[NLP-50] Text Anomaly Detection with Simplified Isolation Kernel EMNLP
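[Quick read]: This paper addresses the high memory footprint and long computation time caused by high-dimensional dense embeddings in text anomaly detection built on pre-trained large language models (LLMs). The key to the solution is the Simplified Isolation Kernel (SIK), which uses a boundary-focused feature mapping to compress high-dimensional dense embeddings into lower-dimensional sparse representations while preserving the characteristics that matter for anomaly detection. The method has linear time complexity and significantly lower space complexity; on 7 datasets it outperforms 11 state-of-the-art anomaly detection algorithms while remaining computationally efficient and memory-light.
Link: https://arxiv.org/abs/2510.13197
Authors: Yang Cao, Sikun Yang, Yujiu Yang, Lianyong Qi, Ming Liu
Affiliations: Great Bay University; Guangdong Provincial Key Laboratory of Mathematical and Neural Dynamical Systems; Tsinghua University; China University of Petroleum (East China); Deakin University
Categories: Computation and Language (cs.CL)
Comments: EMNLP Findings 2025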
Abstract:Two-step approaches combining pre-trained large language model embeddings and anomaly detectors demonstrate strong performance in text anomaly detection by leveraging rich semantic representations. However, high-dimensional dense embeddings extracted by large language models pose challenges due to substantial memory requirements and high computation time. To address this challenge, we introduce the Simplified Isolation Kernel (SIK), which maps high-dimensional dense embeddings to lower-dimensional sparse representations while preserving crucial anomaly characteristics. SIK has linear time complexity and significantly reduces space complexity through its innovative boundary-focused feature mapping. Experiments across 7 datasets demonstrate that SIK achieves better detection performance than 11 state-of-the-art (SOTA) anomaly detection algorithms while maintaining computational efficiency and low memory cost. All code and demonstrations are available at this https URL.
[NLP-51] StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation
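[Quick read]: This paper addresses the loss of prosodic information, especially word-level emphasis, during cross-lingual conversion in speech-to-speech translation (S2ST). Conventional S2ST systems tend to ignore prosodic features, so the translated speech loses the original speaker's semantic focus and intent. The key to the solution is using large language models (LLMs) for cross-lingual emphasis conversion: stress in the source language is mapped to tags in the target language, which then guide a controllable text-to-speech (TTS) model to synthesize translated speech that preserves emphasis. To cope with scarce training data, the authors build a pipeline that automatically generates aligned data and introduce an "LLM-as-Judge" for evaluation. Experiments show that the method substantially improves emphasis preservation while keeping translation quality, speaker intent, and naturalness comparable to baselines.
Link: https://arxiv.org/abs/2510.13194
Authors: Xi Chen, Yuchen Song, Satoshi Nakamura
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: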
Abstract:We propose a stress-aware speech-to-speech translation (S2ST) system that preserves word-level emphasis by leveraging LLMs for cross-lingual emphasis conversion. Our method translates source-language stress into target-language tags that guide a controllable TTS model. To overcome data scarcity, we developed a pipeline to automatically generate aligned training data and introduce the “LLM-as-Judge” for evaluation. Experiments show our approach substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. Our work highlights the importance of prosody in translation and provides an effective, data-efficient solution for preserving paralinguistic cues in S2ST.
[NLP-52] Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation
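[Quick read]: This paper addresses the unstable performance of Retrieval-Augmented Generation (RAG) caused by differences in how context is presented (the context format), in particular seemingly superficial but consequential factors such as structural markers, delimiter choice, and positional placement. The study shows that different format designs can significantly affect accuracy and stability even when the semantic content is identical. The key to the solution is a lightweight strategy called Contextual Normalization, which adaptively standardizes context representations before generation, improving robustness to order variation and strengthening long-context utilization, and thereby making RAG systems more reliable and consistent in real-world settings.
Link: https://arxiv.org/abs/2510.13191
Authors: Jiamin Chen, Yuchen Li, Xinyu Ma, Xinran Chen, Xiaokun Zhang, Shuaiqiang Wang, Chen Ma, Dawei Yin
Affiliations: City University of Hong Kong; Baidu Inc.
Categories: Computation and Language (cs.CL)
Comments: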
Abstract:Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.
[NLP-53] SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs
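[Quick read]: This paper addresses the safety risks that accompany the strong multimodal reasoning of Large Vision-Language Models (LVLMs): adversarial inputs can hide harmful goals inside seemingly benign prompts and so bypass a model's safety mechanisms. The key to the solution is SHIELD, a lightweight, model-agnostic preprocessing framework. Its core innovation is to couple fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward), composing tailored safety prompts that enforce nuanced refusals or safe redirection. Without retraining the model, it lowers jailbreak and non-following rates while preserving utility.
Link: https://arxiv.org/abs/2510.13190
Authors: Juan Ren, Mark Dras, Usman Naseem
Affiliations: Macquarie University
Categories: Computation and Language (cs.CL)
Comments: Preprint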
Abstract:Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types – serving as a practical safety patch for both weakly and strongly aligned LVLMs.
[NLP-54] DSCD: Large Language Model Detoxification with Self-Constrained Decoding EMNLP2025
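[Quick read]: This paper addresses toxic content in the output of large language models (LLMs), that is, how to generate text efficiently, fluently, and safely without parameter fine-tuning. Existing decoding-time methods rely on external constraints, which add resource overhead and hurt fluency. The key to the solution is Detoxification with Self-Constrained Decoding (DSCD), which adjusts the next-token distributions of the model's internal layers (safety, hallucination, and toxic layers) during generation: it strengthens the distribution of the safety layer and weakens those of the hallucination and toxic layers, reducing toxicity and improving safety. The method needs no additional training, is lightweight, highly compatible, and plug-and-play, and it achieves state-of-the-art detoxification while maintaining generation fluency.
Link: https://arxiv.org/abs/2510.13183
Authors: Ming Dong, Jinkui Zhang, Bolong Zheng, Xinhui Tu, Po Hu, Tingting He
Affiliations: Central China Normal University; Wuhan University of Technology
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025 Main Conference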
Abstract:Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency. This work proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLM detoxification without parameter fine-tuning. DSCD strengthens the inner next-token distribution of the safety layer while weakening that of hallucination and toxic layers during output generation. This effectively diminishes toxicity and enhances output safety. DSCD offers lightweight, high compatibility, and plug-and-play capabilities, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD’s effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD’s potential as a practical and scalable solution for safer LLM deployments.
[NLP-55] Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism
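[Quick read]: This paper addresses the lack of a systematic analysis of human reasoning mechanisms in current research on Chain of Thought (CoT) fine-tuning. Existing surveys focus mostly on technical aspects and do not examine, from the standpoint of human cognition, how large language models (LLMs) can acquire human-like reasoning. The key to the solution is adopting the Six Thinking Hats framework, which characterizes common human thinking modes with six metaphorical hats, to classify and analyze CoT fine-tuning methods. This provides a theoretical basis for understanding how LLMs emulate human reasoning and motivates the future research directions the survey proposes. The approach connects technical work with human cognition and lays groundwork for reasoning models that think more like humans.
Link: https://arxiv.org/abs/2510.13170
Authors: Xiaoshu Chen, Sihang Zhou, Ke Liang, Duanyang Yuan, Haoyuan Chen, Xiaoyu Sun, Linyuan Meng, Xinwang Liu
Affiliations: National University of Defense Technology
Categories: Computation and Language (cs.CL)
Comments: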
Abstract:Chain of thought (CoT) fine-tuning aims to endow large language models (LLMs) with reasoning capabilities by training them on curated reasoning traces. It leverages both supervised and reinforced fine-tuning to cultivate human-like reasoning skills in LLMs, including detailed planning, divergent thinking, intuitive judgment, timely reflection, internal thinking, and fact perception. As CoT fine-tuning has advanced, LLMs have demonstrated substantial improvements in tasks such as mathematical reasoning and code generation. However, existing surveys about CoT fine-tuning primarily focus on technical aspects and overlook a systematic analysis from the perspective of human reasoning mechanisms. Given that the ultimate goal of CoT fine-tuning is to enable LLMs to reason like humans, it is crucial to investigate this technique through the lens of human cognition. To fill this gap, we present the first comprehensive survey of CoT fine-tuning grounded in human reasoning theory. Specifically, inspired by the well-known Six Thinking Hats framework, which systematically characterizes common human thinking modes using six metaphorical hats, we classify and examine CoT fine-tuning methods through this lens. Furthermore, building upon this theory, we outline potential directions for future research in CoT fine-tuning. In addition, we compile a comprehensive overview of existing datasets and model performances, and a real-time GitHub repository (this https URL) that continuously tracks recent advances in this area is maintained. We hope this survey will serve as a valuable resource to inspire innovation and foster progress in this rapidly evolving field.
[NLP-56] CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning
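[Quick read]: This paper addresses the weak performance of current Chain-of-Thought (CoT) distillation on scientific reasoning tasks. The core challenge is that even advanced large language models (LLMs) often produce incorrect or superficial reasoning in complex, specialized scientific settings, and distilling such low-quality outputs directly limits the student model. The key to the solution is CoT-Evo, an evolutionary CoT distillation framework. It builds a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines them through novelty-driven selection, reflective recombination, and mutation, guided by a fitness function that scores answer correctness, coherence, and effective knowledge utilization. The resulting high-quality training data lets a compact student model reach state-of-the-art performance on scientific reasoning benchmarks.
Link: https://arxiv.org/abs/2510.13166
Authors: Kehua Feng, Keyan Ding, Zhihui Zhu, Lei Liang, Qiang Zhang, Huajun Chen
Affiliations: Zhejiang University; ZJU-Hangzhou Global Scientific and Technological Innovation Center; AntGroup
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 3 figures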
Abstract:While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.
[NLP-57] A Matter of Representation: Towards Graph-Based Abstract Code Generation
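[Quick read]: This paper addresses the following gap: large language models (LLMs) excel at generating raw, sequential code, but graph-based abstract code generation, where logic is encapsulated in predefined nodes and execution flow is determined by edges, has received little study. This setting matters for visual programming languages and for cases where raw source code is inaccessible. The key to the solution is proposing and evaluating JSON representations of graphs that enable high-accuracy graph-based abstract code generation. Experiments show that, given a suitable graph representation, an LLM can perform the task accurately in a single pass without specialized or complex pipelines, which lays a foundation for representation learning in graph-based abstract code generation.
Link: https://arxiv.org/abs/2510.13163
Authors: Nyx Iskandar, Hisham Bedri, Andy Tsen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: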
Abstract:Most large language models (LLMs) today excel at generating raw, sequential code with minimal abstractions and custom structures. However, there has been little work on graph-based abstract code generation, where significant logic is encapsulated in predefined nodes and execution flow is determined by edges. This is relevant for visual programming languages, and in cases where raw source code is inaccessible to users and LLM training sets. In this work, we propose and evaluate JSON representations for graphs to enable high accuracy graph-based abstract code generation. We evaluate these representations on ScratchTest, a mini-benchmark based on our custom Python re-implementation of Scratch, which tests the LLM in code graph space. Our findings demonstrate that LLMs can indeed perform the aforementioned generation task in a single pass without relying on specialized or complex pipelines, given the correct graph representations. We also show that different representations induce significantly different accuracies, highlighting the instrumental role of representations in this generation task. All in all, this work establishes the first steps towards representation learning for graph-based abstract code generation.
[NLP-58] Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference
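[Quick read]: This paper addresses the trade-off that speculative decoding faces in large language model (LLM) inference: the cost of autoregressive generation by the draft model caps the achievable speedup. Prior methods such as Medusa, Hydra, and EAGLE partly reduce draft cost, but they either lower the acceptance rate or add overheads that limit scaling. The key to the solution is Mirror Speculative Decoding (Mirror-SD), which rests on two strategies. First, it launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU), forming two complementary execution pipelines. Second, it adds speculative streaming so that the draft emits multiple tokens per step, cutting draft latency without weakening acceptance semantics. Together these break the latency-acceptance trade-off: on SpecBench with models from 14B to 66B parameters, Mirror-SD achieves 2.8x-5.8x end-to-end wall-time speedups and a 30% average relative improvement over the strongest baseline, EAGLE3.
Link: https://arxiv.org/abs/2510.13161
Authors: Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova
Affiliations: Apple
Categories: Computation and Language (cs.CL)
Comments: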
Abstract:Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model’s suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.
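For readers unfamiliar with the baseline the abstract starts from, the sketch below shows a plain greedy draft-then-verify loop. It is not Mirror-SD: it has no parallel heterogeneous execution and no speculative streaming. The toy "models" (`target_next`, `draft_next`) are hypothetical functions that map a token sequence to the next token.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Reference decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(next_token(out))
    return out


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1) The cheap draft model proposes draft_len tokens autoregressively.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(out + proposal))
        # 2) The target checks the proposal. A real system scores all positions
        #    in one batched pass; here that pass is simulated token by token.
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # the target's token replaces the first mismatch
                break
        else:
            accepted.append(target_next(out + accepted))  # bonus token when all drafts match
        out.extend(accepted)
    return out[:len(prompt) + max_new_tokens], rounds


# Toy models: the target counts upward mod 10; the draft agrees except right after a 2.
target_next = lambda seq: (seq[-1] + 1) % 10
draft_next = lambda seq: 0 if seq[-1] == 2 else (seq[-1] + 1) % 10

tokens, rounds = speculative_decode(target_next, draft_next, [0], max_new_tokens=8)
baseline = greedy_decode(target_next, [0], max_new_tokens=8)
print(tokens, rounds)
```

With greedy verification the output matches target-only decoding exactly; the saving comes from needing fewer target rounds than generated tokens, and it shrinks whenever the draft is slow or often rejected, which is the trade-off the abstract describes.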
[NLP-59] Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval EMNLP
[Quick read]: This paper addresses the weak performance of large language models (LLMs) on financial numerical reasoning, where their accuracy on benchmarks such as FinQA and ConvFinQA still trails the best models. The key to the solution is FINDER, a two-step framework. The first step uses a generative retriever to extract relevant facts from unstructured text and tabular data; the second applies context-aware Program of Thought (PoT) prompting with dynamically selected few-shot examples. FINDER sets a new state of the art on both datasets, improving execution accuracy by 5.98% on FinQA and 4.05% on ConvFinQA.
Link: https://arxiv.org/abs/2510.13157
Authors: Subhendu Khatuya, Shashwat Naidu, Pawan Goyal, Niloy Ganguly
Affiliations: Indian Institute of Technology Kharagpur
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This work has been accepted for publication in the Main Conference of the Empirical Methods in Natural Language Processing (EMNLP) 2025
Abstract:Despite continuous advancements in the capabilities of large language models (LLMs), numerical reasoning remains a challenging area. Techniques like chain-of-thought prompting, tree-of-thought prompting, and program-of-thought prompting guide LLMs through intermediate reasoning steps. Although in-context learning with few-shot prompting has improved performance, LLMs still lag behind state-of-the-art models on financial numerical reasoning datasets such as FinQA and ConvFinQA. In this work, we introduce FINDER, a novel two-step framework, to enhance LLMs’ capabilities in financial numerical reasoning. The first step utilizes a generative retriever to extract relevant facts from unstructured data, including both text and tables. This is followed by context-aware Program of Thought prompting with dynamic selection of in-context examples. Our model FINDER achieves a new state-of-the-art performance on both the FinQA and ConvFinQA datasets, surpassing previous benchmarks with execution accuracy improvements of 5.98% and 4.05%, respectively.
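As a rough illustration of the general Program of Thought idea (the model writes a small program and an interpreter computes the answer), consider the sketch below. The facts, the question, and the program text are all made up for illustration; in a FINDER-style pipeline they would come from the retriever and the LLM respectively, and nothing here reflects the authors' actual prompts or code.

```python
# Hypothetical facts, as a retriever might return them for the question
# "What was the percentage change in revenue from 2019 to 2020?"
retrieved_facts = {"revenue_2019": 200.0, "revenue_2020": 230.0}

# Hypothetical program text standing in for what an LLM would emit under PoT prompting.
generated_program = """
change = revenue_2020 - revenue_2019
ans = change / revenue_2019 * 100
"""


def run_program(program: str, facts: dict) -> float:
    """Execute the generated program with the facts as variables and return `ans`."""
    scope = dict(facts)
    # Empty builtins limit the toy interpreter to plain arithmetic; this is not a real sandbox.
    exec(program, {"__builtins__": {}}, scope)
    return scope["ans"]


answer = run_program(generated_program, retrieved_facts)
print(f"{answer:.2f}%")
```

Delegating the arithmetic to an interpreter is what lets execution accuracy be measured directly: the program either computes the right number or it does not.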
[NLP-60] I Am Aligned But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs
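[Quick read]: This paper addresses the poor cultural fit of current large language models (LLMs) for the Middle East and North Africa (MENA) region, whose values and beliefs are severely underrepresented in existing AI evaluation. The key to the solution is a new benchmark, MENAValues, built on large-scale, authoritative human survey data from 16 countries that captures the region's sociocultural landscape. It evaluates cultural alignment and multilingual bias by crossing three perspective framings (neutral, personalized, and third-person/cultural observer) with two language modes (English and localized native languages: Arabic, Persian, Turkish). The analysis reveals cross-lingual value shifts, reasoning-induced degradation, and logit leakage, and the benchmark offers a scalable empirical framework and methodological tools for diagnosing and improving the cultural inclusiveness of AI systems.
Link: https://arxiv.org/abs/2510.13154
Authors: Pardis Sadat Zahraei, Ehsaneddin Asgari
Affiliations: University of Illinois Urbana-Champaign; Qatar Computing Research Institute - QCRI
Categories: Computation and Language (cs.CL)
Comments: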
Abstract:We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs) with respect to the beliefs and values of the Middle East and North Africa (MENA) region, an underrepresented area in current AI evaluation efforts. Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries. To probe LLM behavior, we evaluate diverse models across multiple conditions formed by crossing three perspective framings (neutral, personalized, and third-person/cultural observer) with two language modes (English and localized native languages: Arabic, Persian, Turkish). Our analysis reveals three critical phenomena: “Cross-Lingual Value Shifts” where identical questions yield drastically different responses based on language, “Reasoning-Induced Degradation” where prompting models to explain their reasoning worsens cultural alignment, and “Logit Leakage” where models refuse sensitive questions while internal probabilities reveal strong hidden preferences. We further demonstrate that models collapse into simplistic linguistic categories when operating in native languages, treating diverse nations as monolithic entities. MENAValues offers a scalable framework for diagnosing cultural misalignment, providing both empirical insights and methodological tools for developing more culturally inclusive AI.
[NLP-61] Stable LLM Ensemble: Interaction between Example Representativeness and Diversity
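[Quick read]: This paper addresses the limited accuracy and robustness of large language model (LLM) predictions under one-shot prompting, which are highly sensitive to the choice of example and to the diversity among ensemble members. The solution has two parts: it replaces random sampling with a centroid-based strategy for selecting a representative example, and it introduces controlled output diversity by adjusting the sampling temperature. With a higher temperature setting, the method improves macro-F1 by 7.6% and lowers RMSE by 10.5% relative to random selection, and it also outperforms 5-shot prompting, showing that representative examples and moderate diversity work together in building effective one-shot LLM ensembles.
Link: https://arxiv.org/abs/2510.13143
Authors: Junichiro Niimi
Affiliations: Meijo University; RIKEN AIP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: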
Abstract:Large language models (LLMs) have achieved remarkable results in a wide range of domains. However, the accuracy and robustness of one-shot LLM predictions remain highly sensitive to the examples and the diversity among ensemble members. This study systematically investigates the effects of example representativeness (one-shot strategy) and output diversity (sampling temperature) on LLM ensemble performance. Two one-shot strategies are compared: centroid-based representative examples (proposed) and randomly sampled examples (baseline); the sampling temperature is also varied. The proposed approach with a higher temperature setting significantly outperforms random selection by +7.6% (macro-F1) and -10.5% (RMSE). Furthermore, the proposed model exceeds 5-shot prompting by +21.1% (macro-F1) and -24.0% (RMSE). Our findings demonstrate that combining representative example selection with increased temperature provides the appropriate level of diversity to the ensemble. This work highlights the practical importance of both example selection and controlled diversity in designing effective one-shot LLM ensembles.
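The abstract does not spell out how the centroid-based example is chosen. A minimal sketch, assuming each candidate example already has an embedding vector and that "representative" means closest to the mean embedding in Euclidean distance, could look like this; the function names and toy vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def most_representative(vectors):
    """Index of the vector closest to the centroid (Euclidean distance)."""
    center = centroid(vectors)
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], center))


# Toy 2-D "embeddings" of four candidate one-shot examples.
candidates = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [0.0, 4.0]]
best = most_representative(candidates)
print(best, centroid(candidates))
```

The selected example would then be used as the single in-context demonstration for every ensemble member, with diversity coming only from the sampling temperature.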
[NLP-62] Addressing the alignment problem in transportation policy making: an LLM approach
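[Quick read]: This paper addresses a key problem in transportation planning: the collective preferences of heterogeneous travelers often diverge from the policies produced by model-driven decision tools, leading to implementation delays or failures. The key to the solution is using large language models (LLMs) as agents that represent residents of different communities and vote on a set of transit policy proposals. The agents use chain-of-thought reasoning to produce ranked-choice or approval-based preferences, which are aggregated with instant-runoff voting (IRV) to model democratic consensus. In case studies of Chicago and Houston, the LLM agents approximate plausible collective preferences and respond to local context, while also showing model-specific behavioral biases and modest divergences from optimization-based benchmarks, which illustrates both the promise and the limits of LLMs for aligning transportation decisions with public preferences.
Link: https://arxiv.org/abs/2510.13139
Authors: Xiaoyu Yan, Tianxing Dai, Yu (Marco) Nie
Affiliations: Northwestern University
Categories: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: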
Abstract:A key challenge in transportation planning is that the collective preferences of heterogeneous travelers often diverge from the policies produced by model-driven decision tools. This misalignment frequently results in implementation delays or failures. Here, we investigate whether large language models (LLMs), noted for their capabilities in reasoning and simulating human decision-making, can help inform and address this alignment problem. We develop a multi-agent simulation in which LLMs, acting as agents representing residents from different communities in a city, participate in a referendum on a set of transit policy proposals. Using chain-of-thought reasoning, LLM agents provide ranked-choice or approval-based preferences, which are aggregated using instant-runoff voting (IRV) to model democratic consensus. We implement this simulation framework with both GPT-4o and Claude-3.5, and apply it for Chicago and Houston. Our findings suggest that LLM agents are capable of approximating plausible collective preferences and responding to local context, while also displaying model-specific behavioral biases and modest divergences from optimization-based benchmarks. These capabilities underscore both the promise and limitations of LLMs as tools for solving the alignment problem in transportation decision-making.
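The aggregation step mentioned in the abstract, instant-runoff voting, can be sketched in a few lines. The ballots and policy names below are invented for illustration, and the alphabetical tie-breaking rule is an assumption of this sketch rather than something the paper specifies.

```python
from collections import Counter


def instant_runoff(ballots):
    """Return the IRV winner for ranked ballots (most preferred option first)."""
    candidates = {choice for ballot in ballots for choice in ballot}
    if not candidates:
        raise ValueError("no candidates on any ballot")
    while True:
        # Each ballot counts for its highest-ranked candidate still in the race.
        tally = Counter({c: 0 for c in candidates})
        for ballot in ballots:
            for choice in ballot:
                if choice in candidates:
                    tally[choice] += 1
                    break
        active_votes = sum(tally.values())
        ranked = sorted(tally.items())  # alphabetical order makes tie-breaking deterministic
        leader, leader_votes = max(ranked, key=lambda kv: kv[1])
        if leader_votes * 2 > active_votes or len(candidates) == 1:
            return leader
        # No majority yet: eliminate the candidate with the fewest votes and recount.
        loser = min(ranked, key=lambda kv: kv[1])[0]
        candidates.discard(loser)


# Hypothetical ballots from nine simulated residents.
ballots = (
    [["bus_rapid_transit", "rail_extension", "fare_cut"]] * 4
    + [["rail_extension", "fare_cut", "bus_rapid_transit"]] * 3
    + [["fare_cut", "rail_extension", "bus_rapid_transit"]] * 2
)
winner = instant_runoff(ballots)
print(winner)
```

In this toy election no option has a first-round majority, so the least popular option is eliminated and its ballots transfer to their next choices, which changes the outcome relative to a simple plurality count.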
[NLP-63] On the Reasoning Abilities of Masked Diffusion Language Models
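[Quick read]: This paper addresses the theoretical limits and efficiency of reasoning in masked diffusion models (MDMs) for text, and in particular whether their parallel generation yields a computational advantage. The key to the solution is a formal connection between MDMs and two well-understood reasoning frameworks, chain of thought (CoT) and padded looped transformers (PLTs), in the finite-precision log-width setting. The authors prove that MDMs and polynomially-padded PLTs are equivalent in this setting and that MDMs can solve every problem that CoT-augmented transformers can. They also identify classes of problems, including regular languages, on which parallel generation makes MDMs inherently more efficient than CoT transformers.
Link: https://arxiv.org/abs/2510.13117
Authors: Anej Svete, Ashish Sabharwal
Affiliations: ETH Zürich; Allen Institute for AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: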
Abstract:Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.
[NLP-64] Multi-Label Clinical Text Eligibility Classification and Summarization System
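[Quick read]: This paper addresses inefficient participant screening in clinical trials, specifically how to automate multi-label eligibility classification and summarization of clinical text so as to improve research efficiency and help ensure participants with appropriate and diverse medical backgrounds. The key to the solution is combining Natural Language Processing (NLP) with Large Language Models (LLMs). Features are extracted with word embeddings (Word2Vec), named entity recognition (NER), and traditional vectorization such as count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency), and weighted TF-IDF word embeddings are introduced to capture term importance more effectively. Random Forest and SVM models perform the multi-label classification, and TextRank, Luhn, and GPT-3 are compared for summarization. ROUGE scores support the effectiveness of the approach and its potential for data-driven automation of eligibility assessment.
Link: https://arxiv.org/abs/2510.13115
Authors: Surya Tejaswi Yerramsetty, Almas Fathimah
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: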
Abstract:Clinical trials are central to medical progress because they help improve understanding of human health and the healthcare system. They play a key role in discovering new ways to detect, prevent, or treat diseases, and it is essential that clinical trials include participants with appropriate and diverse medical backgrounds. In this paper, we propose a system that leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to automate multi-label clinical text eligibility classification and summarization. The system combines feature extraction methods such as word embeddings (Word2Vec) and named entity recognition to identify relevant medical concepts, along with traditional vectorization techniques such as count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency). We further explore weighted TF-IDF word embeddings that integrate both count-based and embedding-based strengths to capture term importance effectively. Multi-label classification using Random Forest and SVM models is applied to categorize documents based on eligibility criteria. Summarization techniques including TextRank, Luhn, and GPT-3 are evaluated to concisely summarize eligibility requirements. Evaluation with ROUGE scores demonstrates the effectiveness of the proposed methods. This system shows potential for automating clinical trial eligibility assessment using data-driven approaches, thereby improving research efficiency.
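A minimal sketch of the TF-IDF-weighted word embedding idea follows, assuming documents are already tokenized and that `vectors` is a plain dictionary standing in for a trained Word2Vec model. The smoothed IDF formula and the function name are assumptions. The abstract does not give the exact weighting.

```python
import math
from collections import Counter


def tfidf_weighted_embedding(docs, vectors):
    """Return one vector per document: the TF-IDF-weighted mean of its word vectors."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    dim = len(next(iter(vectors.values())))
    out = []
    for doc in docs:
        counts = Counter(doc)
        acc, total = [0.0] * dim, 0.0
        for tok, c in counts.items():
            if tok not in vectors:
                continue  # skip out-of-vocabulary tokens
            # Smoothed IDF, so tokens present in every document keep a non-zero weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            weight = (c / len(doc)) * idf
            acc = [a + weight * v for a, v in zip(acc, vectors[tok])]
            total += weight
        out.append([a / total for a in acc] if total else acc)
    return out
```

The resulting document vectors could then be passed to a multi-label classifier such as Random Forest or SVM. Rare terms pull the vector more strongly than terms shared by all documents.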
[NLP-65] TRUSTVIS: A Multi-Dimensional Trustworthiness Evaluation Framework for Large Language Models
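[Quick Read]: This paper addresses the trustworthiness of large language models (LLMs), particularly their safety and robustness, which directly affects how reliable they are in practice. The solution is TRUSTVIS, an automated evaluation framework that integrates well-known perturbation methods such as AutoDAN and combines multiple evaluation methods by majority voting to give a comprehensive assessment of LLM trustworthiness. Its interactive user interface provides intuitive visualizations that make complex evaluation processes easier to understand and operate, helping users pinpoint model vulnerabilities and make targeted improvements.
Link: https://arxiv.org/abs/2510.13106
Authors: Ruoyu Sun, Da Song, Jiayang Song, Yuheng Huang, Lei Ma
Affiliations: University of Alberta; Mila - Quebec Artificial Intelligence Institute; The University of Tokyo; Macau University of Science and Technology
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages, 2 figures, To appear in ASE 2025 Demo Track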
Abstract:As Large Language Models (LLMs) continue to revolutionize Natural Language Processing (NLP) applications, critical concerns about their trustworthiness persist, particularly in safety and robustness. To address these challenges, we introduce TRUSTVIS, an automated evaluation framework that provides a comprehensive assessment of LLM trustworthiness. A key feature of our framework is its interactive user interface, designed to offer intuitive visualizations of trustworthiness metrics. By integrating well-known perturbation methods like AutoDAN and employing majority voting across various evaluation methods, TRUSTVIS not only provides reliable results but also makes complex evaluation processes accessible to users. Preliminary case studies on models like Vicuna-7b, Llama2-7b, and GPT-3.5 demonstrate the effectiveness of our framework in identifying safety and robustness vulnerabilities, while the interactive interface allows users to explore results in detail, empowering targeted model improvements. Video Link: this https URL
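Majority voting across evaluation methods can be sketched in a few lines. The label names and the conservative tie policy (`tie_label`) are assumptions for illustration. The abstract does not say how TRUSTVIS resolves ties.

```python
from collections import Counter


def majority_vote(labels, tie_label="unsafe"):
    """Combine verdicts from several evaluators into a single label."""
    if not labels:
        raise ValueError("at least one evaluator verdict is required")
    counts = Counter(labels).most_common()
    # If the two most frequent labels are tied, fall back to the conservative label.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]
```

For example, `majority_vote(["safe", "unsafe", "safe"])` returns `"safe"`, while a one-to-one split falls back to `"unsafe"`.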
[NLP-66] ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models
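[Quick Read]: This paper tackles uncertainty quantification (UQ) for large language models (LLMs), and in particular how to estimate epistemic uncertainty effectively. The key idea is to link, from a causal perspective, an LLM's uncertainty to its invariance under semantic-preserving interventions, and to build on that link a grey-box UQ method that estimates uncertainty by measuring how model outputs change before and after such an intervention. The method comes with theoretical justification and is reported to be both effective and computationally efficient across several LLMs and question-answering datasets.
Link: https://arxiv.org/abs/2510.13103
Authors: Mingda Li, Xinyu Li, Weinan Zhang, Longxuan Ma
Affiliations: Harbin Institute of Technology; Kunming University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: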
Abstract:Uncertainty Quantification (UQ) is a promising approach to improve model reliability, yet quantifying the uncertainty of Large Language Models (LLMs) is non-trivial. In this work, we establish a connection between the uncertainty of LLMs and their invariance under semantic-preserving intervention from a causal perspective. Building on this foundation, we propose a novel grey-box uncertainty quantification method that measures the variation in model outputs before and after the semantic-preserving intervention. Through theoretical justification, we show that our method provides an effective estimate of epistemic uncertainty. Our extensive experiments, conducted across various LLMs and a variety of question-answering (QA) datasets, demonstrate that our method excels not only in terms of effectiveness but also in computational efficiency.
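One way to picture "variation in model outputs before and after a semantic-preserving intervention" is to compare answer distributions for a question and for paraphrases of it. The sketch below uses total variation distance averaged over paraphrases. This is an assumed stand-in, since the abstract does not name the variation measure, and the function names are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two answer distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def intervention_uncertainty(original, intervened):
    """Average shift of the answer distribution across paraphrased prompts."""
    if not intervened:
        raise ValueError("at least one intervened distribution is required")
    return sum(total_variation(original, q) for q in intervened) / len(intervened)
```

A score near 0 means the answer distribution is stable under rewording. Larger scores indicate that the model's answer depends on surface form, which this reading treats as a sign of higher uncertainty.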
[NLP-67] GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models
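[Quick Read]: This paper addresses a problem that arises when scaling Mixture-of-Experts (MoE) language models: functionally similar experts are often activated at the same time, which causes redundant computation and limits effective model capacity. Existing auxiliary balance losses improve token distribution but do not fundamentally improve expert diversity. The paper proposes GatePro, a method with no additional learnable parameters. Its key step is to identify the most functionally similar expert pairs and introduce localized competition between them, which suppresses redundant co-activation while preserving natural specialization, so that experts develop more distinct and complementary capabilities and the MoE model becomes more effective overall.
Link: https://arxiv.org/abs/2510.13079
Authors: Chen Zheng, Yuhang Cai, Deyi Liu, Jin Ma, Yiyuan Ma, Yuan Yang, Jing Liu, Yutao Zeng, Xun Zhou, Siyuan Qiao
Affiliations: ByteDance
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: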
Abstract:Modern large language models leverage Mixture-of-Experts (MoE) architectures for efficient scaling, but face a critical challenge: functionally similar experts are often selected simultaneously, creating redundant computation and limiting effective model capacity. Existing auxiliary balance loss methods improve token distribution but fail to address the underlying expert diversity problem. We introduce GatePro, a novel parameter-free method that directly promotes expert selection diversity. GatePro identifies the most similar expert pairs and introduces localized competition mechanisms, preventing redundant expert co-activation while maintaining natural expert specialization. Our comprehensive evaluation demonstrates GatePro’s effectiveness across model scales and benchmarks. Analysis demonstrates GatePro’s ability to achieve enhanced expert diversity, where experts develop more distinct and complementary capabilities, avoiding functional redundancy. This approach can be deployed hot-swappable during any training phase without additional learnable parameters, offering a practical solution for improving MoE effectiveness.
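The pairing-plus-competition idea can be sketched as follows. This is a guess at the general shape, not GatePro's actual procedure. It assumes expert similarity is measured by cosine similarity between rows of the gating matrix, and that within each most-similar pair the expert with the lower logit for a token is excluded from routing. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_partner(gate_rows):
    """For each expert, return the index of the expert whose gate row is closest."""
    partners = []
    for i, row in enumerate(gate_rows):
        others = (j for j in range(len(gate_rows)) if j != i)
        partners.append(max(others, key=lambda j: cosine(row, gate_rows[j])))
    return partners


def route_with_competition(logits, partners, top_k):
    """Top-k routing after each expert competes with its most similar partner."""
    masked = list(logits)
    for i, j in enumerate(partners):
        if logits[i] < logits[j]:
            masked[i] = -math.inf  # the loser of the local competition is suppressed
    order = sorted(range(len(masked)), key=lambda e: masked[e], reverse=True)
    return order[:top_k]
```

No parameters are learned here. The competition only reuses quantities the router already has, which matches the parameter-free framing in the abstract.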
[NLP-68] On the Role of Preference Variance in Preference Optimization
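[Quick Read]: This paper addresses the reliance of large language model (LLM) alignment on human preference annotations, which are costly and inefficient to collect. It proposes and validates preference variance (PVar), the variance in model preferences when comparing pairs of responses, as an indicator of how informative a prompt is for training. Theoretically, the DPO (Direct Preference Optimization) gradient norm for a prompt has an upper bound controlled by that prompt's PVar, so low-PVar prompts can only produce small gradient updates and contribute little to learning. Experimentally, prompts with higher PVar outperform randomly selected or lower-PVar prompts, and the selection remains robust with smaller reward models (1B, 3B). Using the original human annotations in UltraFeedback, training on only the top 10% of prompts by PVar beats training on the full dataset.
Link: https://arxiv.org/abs/2510.13022
Authors: Jiacheng Guo, Zihao Li, Jiahao Qiu, Yue Wu, Mengdi Wang
Affiliations: Princeton University
Categories: Computation and Language (cs.CL)
Comments: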
Abstract:Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences in aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods to reduce the required annotations. In this work, we investigate the impact of preference variance (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing it is controlled by the PVar of that prompt. This implies that prompts with low PVar can only produce small gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model, evaluating on two benchmarks (AlpacaEval 2.0 and Arena-Hard). Experimental results demonstrate that prompts with higher PVar outperform randomly selected prompts or those with lower PVar. We also show that our PVar-based selection method is robust when using smaller reward models (1B, 3B) for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.
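A rough sketch of PVar-based prompt selection is given below. It assumes a Bradley-Terry style preference probability computed from scalar reward scores, and takes PVar to be the variance of that probability over all ordered response pairs. The paper's exact definition may differ, and the function names are hypothetical.

```python
import math
from itertools import permutations


def preference_variance(rewards):
    """Variance of sigmoid(r_i - r_j) over all ordered pairs of responses.

    Needs at least two rewards. Over ordered pairs the mean preference is 0.5,
    because each pair contributes p and 1 - p.
    """
    probs = [1.0 / (1.0 + math.exp(-(a - b))) for a, b in permutations(rewards, 2)]
    return sum((p - 0.5) ** 2 for p in probs) / len(probs)


def select_top_fraction(prompt_rewards, fraction):
    """Keep the given fraction of prompts with the highest preference variance."""
    ranked = sorted(
        prompt_rewards,
        key=lambda name: preference_variance(prompt_rewards[name]),
        reverse=True,
    )
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]
```

A prompt whose sampled responses all receive the same reward has zero variance under this reading, which lines up with the abstract's point that such prompts yield only small gradient updates.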
[NLP-69] CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models
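[Quick Read]: This paper addresses the lack of continual learning (CL) benchmarks grounded in human developmental trajectories, which are needed to measure more realistically how language models progressively acquire new skills. The key contribution is CurlL, a continual learning dataset and benchmark organized into five developmental stages covering ages 5-10. A skill graph breaks broad skills into smaller abilities, concrete goals, and measurable indicators, and records which abilities build on others. The authors generate a 23.4B-token synthetic dataset comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs, which supports fine-grained analysis of forgetting, forward transfer, and backward transfer. Models can then be compared under independent, joint, and sequential (continual) training, revealing trade-offs between skill retention and transfer efficiency.
Link: https://arxiv.org/abs/2510.13008
Authors: Pavan Kalyan, Shubhra Mishra, Satya Lokam, Navin Goyal
Affiliations: Microsoft Research; KTH Royal Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: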
Abstract:We introduce a comprehensive continual learning dataset and benchmark (CurlL) grounded in human developmental trajectories from ages 5-10, enabling systematic and fine-grained assessment of models’ ability to progressively acquire new skills. CurlL spans five developmental stages (0-4) covering ages 5-10, supported by a skill graph that breaks down broad skills into smaller abilities, concrete goals, and measurable indicators, while also capturing which abilities build on others. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.
[NLP-70] OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning
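[Quick Read]: This paper addresses catastrophic forgetting in Low-Rank Adaptation (LoRA) fine-tuning of large language models: updates learned for a new task can interfere with the dominant singular directions acquired during pre-training, damaging retained knowledge. The proposed solution is Orthogonal Projection LoRA (OPLoRA). It decomposes the frozen weights via singular value decomposition (SVD) and applies the double-sided orthogonal projections $P_L = I - U_k U_k^\top$ and $P_R = I - V_k V_k^\top$, which confine the LoRA update to the orthogonal complement of the top-$k$ singular subspace. This exactly preserves the top-$k$ singular triples and so gives a theoretical guarantee of knowledge retention.
Link: https://arxiv.org/abs/2510.13003
Authors: Yifeng Xiong, Xiaohui Xie
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: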
Abstract:Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models but suffers from catastrophic forgetting when learned updates interfere with the dominant singular directions that encode essential pre-trained knowledge. We propose Orthogonal Projection LoRA (OPLoRA), a theoretically grounded approach that prevents this interference through double-sided orthogonal projections. By decomposing frozen weights via SVD, OPLoRA constrains LoRA updates to lie entirely within the orthogonal complement of the top-$k$ singular subspace using projections $P_L = I - U_k U_k^\top$ and $P_R = I - V_k V_k^\top$. We prove that this construction exactly preserves the top-$k$ singular triples, providing mathematical guarantees for knowledge retention. To quantify subspace interference, we introduce $\rho_k$, a metric measuring update alignment with dominant directions. Extensive experiments across commonsense reasoning, mathematics, and code generation demonstrate that OPLoRA significantly reduces forgetting while maintaining competitive task-specific performance on LLaMA-2 7B and Qwen2.5 7B, establishing orthogonal projection as an effective mechanism for knowledge preservation in parameter-efficient fine-tuning.
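The preservation property can be checked numerically on a toy example. The sketch below is a dependency-free illustration, not the authors' implementation. It builds a 3x3 weight from known rotations so that its singular vectors are available without an SVD routine, uses k = 1, and treats `raw_update` as an arbitrary stand-in for the LoRA product BA. All helper names are assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def complement_projector(q):
    """Return I - Q Q^T, where the columns of q are orthonormal."""
    n = len(q)
    qqt = matmul(q, transpose(q))
    return [[(1.0 if i == j else 0.0) - qqt[i][j] for j in range(n)]
            for i in range(n)]


def project_update(delta, u_k, v_k):
    """Apply P_L on the left and P_R on the right of the update."""
    return matmul(matmul(complement_projector(u_k), delta), complement_projector(v_k))


# Toy frozen weight W = U diag(3, 2, 1) V^T, built from known rotations.
c, s = math.cos(0.3), math.sin(0.3)
U = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
S = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
W = matmul(matmul(U, S), transpose(V))

k = 1
U_k = [row[:k] for row in U]  # first k left singular vectors, as columns
V_k = [row[:k] for row in V]  # first k right singular vectors, as columns

# Stand-in for the LoRA product B A (arbitrary values).
raw_update = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7], [-2.0, 0.4, 0.9]]
safe_update = project_update(raw_update, U_k, V_k)
W_new = [[W[i][j] + safe_update[i][j] for j in range(3)] for i in range(3)]

# Products used to check that (3, u_1, v_1) is still a singular triple of W_new.
u1 = [row[0] for row in U]
v1 = [row[0] for row in V]
W_new_v1 = [sum(W_new[i][j] * v1[j] for j in range(3)) for i in range(3)]
u1_W_new = [sum(u1[i] * W_new[i][j] for i in range(3)) for j in range(3)]
```

Because $P_R v_1 = 0$ and $u_1^\top P_L = 0$, the projected update cannot change how the weight acts on the protected directions: `W_new_v1` equals $3 u_1$ and `u1_W_new` equals $3 v_1^\top$ up to rounding, while the update itself remains non-zero.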
[NLP-71] Max It or Miss It: Benchmarking LLM On Solving Extremal Problems
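[Quick Read]: This paper addresses a blind spot in how the mathematical reasoning of large language models (LLMs) is evaluated: optimization reasoning, i.e. finding extrema under constraints, has not been assessed systematically. The key contribution is ExtremBench, a benchmark of 93 standardized extrema-finding problems converted from inequality exercises used for the Chinese Mathematical Olympiad, together with an evaluation of several state-of-the-art open-source model families (Qwen3, GPT-OSS, and DeepSeek). The results show that performance on existing mathematical benchmarks such as AIME25 and MATH-500 does not always align with performance on extremal problems, suggesting that current benchmarks do not capture the full range of mathematical reasoning abilities. The paper also stresses that optimization reasoning is a basic abstraction underlying applications such as planning, control, and resource allocation.
Link: https://arxiv.org/abs/2510.12997
Authors: Binxin Gao, Jingjun Han
Affiliations: University of Maryland; Fudan University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Our benchmark dataset is available at this https URL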
Abstract:Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e. finding extrema under constraints, represents a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for Chinese Mathematical Olympiad and transformed into 93 standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs’ extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.
[NLP-72] A Multilingual Large-Scale Study of the Interplay between LLM Safeguards Personalisation and Disinformation
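[Quick Read]: This paper addresses the risk that large language models (LLMs) generate persuasive, persona-targeted disinformation, and the lack of systematic study of how personalisation affects both safeguard failure (jailbreaking) and the persuasiveness of the output. It presents what the authors describe as the first large-scale, multilingual empirical study of this kind, and its key contribution is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a dataset of about 1.6 million texts generated by eight state-of-the-art LLMs. The prompts combine 324 disinformation narratives with 150 persona profiles and cover four languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). Using a red teaming methodology, the study evaluates how robust LLM safety mechanisms are to persona-targeted prompts. Even simple personalisation strategies significantly increase the likelihood of jailbreaks for all studied LLMs, alter linguistic and rhetorical patterns, and amplify the persuasiveness of the generated false narratives. These findings expose vulnerabilities in multilingual and cross-demographic settings and provide an empirical basis for improving safety alignment and detection.
Link: https://arxiv.org/abs/2510.12993
Authors: João A. Leite, Arnav Arora, Silvia Gargova, João Luz, Gustavo Sampaio, Ian Roberts, Carolina Scarton, Kalina Bontcheva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: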
Abstract:The human-like proficiency of Large Language Models (LLMs) has brought concerns about their potential misuse for generating persuasive and personalised disinformation at scale. While prior work has demonstrated that LLMs can generate disinformation, specific questions around persuasiveness and personalisation (generation of disinformation tailored to specific demographic attributes) remain largely unstudied. This paper presents the first large-scale, multilingual empirical study on persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we systematically evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324 disinformation narratives and 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are then assessed quantitatively and compared along the dimensions of models, languages, jailbreaking rate, and personalisation attributes. Our findings demonstrate that the use of even simple personalisation strategies in the prompts significantly increases the likelihood of jailbreaks for all studied LLMs. Furthermore, personalised prompts result in altered linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.
[NLP-73] UNCAP: Uncertainty-Guided Planning Using Natural Language Communication for Cooperative Autonomous Vehicles
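[Quick Read]: This paper addresses inefficient communication and insufficient decision safety when coordinating multiple cooperative connected autonomous vehicles (CAVs) at scale. Existing approaches either transmit high-bandwidth raw sensor data or ignore the perception and planning uncertainty in shared information, so they are neither scalable nor safe. The proposed solution is Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a planning approach based on vision-language models. Its core is a two-stage communication protocol that uses lightweight natural language messages and models perception uncertainty explicitly: the ego vehicle first selects the most relevant vehicles to exchange information with, and those vehicles then send messages that quantify their perception uncertainty. By selectively fusing the messages that maximize mutual information, the ego vehicle integrates only the most relevant signals into its decision, which improves the scalability and reliability of cooperative planning.
Link: https://arxiv.org/abs/2510.12992
Authors: Neel P. Bhatt, Po-han Li, Kushagra Gupta, Rohan Siva, Daniel Milan, Alexander T. Hogue, Sandeep P. Chinchali, David Fridovich-Keil, Zhangyang Wang, Ufuk Topcu
Affiliations: The University of Texas at Austin
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: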
Abstract:Safe large-scale coordination of multiple cooperative connected autonomous vehicles (CAVs) hinges on communication that is both efficient and interpretable. Existing approaches either rely on transmitting high-bandwidth raw sensor data streams or neglect perception and planning uncertainties inherent in shared data, resulting in systems that are neither scalable nor safe. To address these limitations, we propose Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a vision-language model-based planning approach that enables CAVs to communicate via lightweight natural language messages while explicitly accounting for perception uncertainty in decision-making. UNCAP features a two-stage communication protocol: (i) an ego CAV first identifies the subset of vehicles most relevant for information exchange, and (ii) the selected CAVs then transmit messages that quantitatively express their perception uncertainty. By selectively fusing messages that maximize mutual information, this strategy allows the ego vehicle to integrate only the most relevant signals into its decision-making, improving both the scalability and reliability of cooperative planning. Experiments across diverse driving scenarios show a 63% reduction in communication bandwidth with a 31% increase in driving safety score, a 61% reduction in decision uncertainty, and a four-fold increase in collision distance margin during near-miss events. Project website: this https URL
[NLP-74] DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping
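[Quick Read]: This paper addresses a performance bottleneck in deep research agents built on large language models (LLMs): the planning stage is not optimized well enough for complex tasks. Existing methods either rely on implicit planning during reasoning or add an explicit planner without systematically optimizing the planning process. As evidence, planning tokens show significantly higher entropy than other action tokens under vanilla reinforcement learning (RL), which points to uncertain decision points that remain under-optimized. The proposed solution is DeepPlanner, an end-to-end RL framework that shapes token-level advantages with an entropy-based term so that high-entropy tokens receive larger updates, and selectively upweights sample-level advantages for planning-intensive rollouts. Experiments on seven deep research benchmarks show improved planning quality and state-of-the-art results under a substantially lower training budget.
Link: https://arxiv.org/abs/2510.12979
Authors: Wei Fan, Wenlin Yao, Zheng Li, Feng Yao, Xin Liu, Liang Qiu, Qingyu Yin, Yangqiu Song, Bing Yin
Affiliations: The Hong Kong University of Science and Technology; Amazon; University of California, San Diego
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under Review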
Abstract:Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
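Entropy-based advantage shaping can be sketched as below. The multiplicative form, the coefficient `alpha`, and the cap `max_boost` are assumptions chosen for illustration. The abstract only states that high-entropy tokens receive larger updates, not the exact formula.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shape_advantages(advantage, token_dists, alpha=0.1, max_boost=0.5):
    """Scale a rollout-level advantage per token, boosting high-entropy positions.

    The boost is capped so that a single uncertain token cannot dominate the update,
    and the sign of the advantage is always preserved.
    """
    shaped = []
    for dist in token_dists:
        boost = min(alpha * token_entropy(dist), max_boost)
        shaped.append(advantage * (1.0 + boost))
    return shaped
```

A deterministic token (entropy 0) keeps the original advantage, while a token drawn from a flatter distribution gets a larger-magnitude one, so uncertain planning decisions receive more of the learning signal.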
[NLP-75] 3-Model Speculative Decoding NEURIPS
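[Quick Read]: This paper addresses the throughput bottleneck in speculative decoding (SD) for large language model inference caused by the distribution gap between the draft model and the target model. A smaller draft model proposes tokens quickly, but its outputs diverge more from the target model's predictions, which lowers the token acceptance rate and limits the overall speedup. The proposed solution is a hierarchical method, Pyramid Speculative Decoding (PyramidSD), which inserts an intermediate qualifier model between the draft and target models to bridge the gap in output distributions and improve alignment across models, so that much smaller draft models can be used without sacrificing overall performance. It also builds on fuzzy acceptance criteria, relaxing the divergence threshold at each stage to further improve throughput. In experiments, PyramidSD reaches up to 1.91x the generation speed of standard SD and improves inference efficiency in small-memory settings.
Link: https://arxiv.org/abs/2510.12966
Authors: Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Baisub Lee, Jacob Song, Woo Seong Chung
Affiliations: LG Electronics USA
Categories: Computation and Language (cs.CL)
Comments: Accepted at NeurIPS SPIGM 2025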
Abstract:Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a trade-off between draft model size and token acceptance: smaller draft models generate tokens more quickly but exhibit greater divergence from the target model, resulting in lower acceptance rates and reduced speedups. We introduce Pyramid Speculative Decoding (PyramidSD), an extension of SD that inserts an intermediate qualifier model between the draft and target to bridge the distributional gap in output predictions, allowing smaller model to be used for drafting. This hierarchical decoding strategy improves alignment across models, enabling higher acceptance rates and allowing the use of significantly smaller draft models without sacrificing overall performance. PyramidSD builds on fuzzy acceptance criteria to support relaxed divergence thresholds at each stage, improving throughput. In experiments, PyramidSD achieves up to 1.91x generation speed over standard SD, reaching 124 tokens per second on a consumer GPU (RTX 4090). In small-memory settings with a 1B-parameter draft model and an 8B target model, PyramidSD minimally trades target model quality for improved throughput. Overall, PyramidSD offers a practical approach to enhancing speculative decoding efficiency and can be readily applied to existing inference pipelines.
[NLP-76] The Curious Case of Curiosity across Human Cultures and LLMs
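[Quick Read]: This paper addresses how poorly large language models (LLMs) capture curiosity, a central driver of human inquiry, across cultural contexts, with a focus on how people from different cultures express curiosity. The key contribution is CUEST (CUriosity Evaluation across SocieTies), an evaluation framework that measures human-model alignment in curiosity through analysis of linguistic style and topic preference, grounded in social science constructs. The study finds that LLMs flatten cross-cultural diversity and align more closely with how curiosity is expressed in Western countries. Fine-tuning strategies are then used to induce curiosity in LLMs, narrowing the human-model alignment gap by up to 50%.
Link: https://arxiv.org/abs/2510.12943
Authors: Angana Borah, Rada Mihalcea
Affiliations: University of Michigan, Ann Arbor, USA
Categories: Computation and Language (cs.CL)
Comments: Preprint (Paper under review)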
Abstract:Recent advances in Large Language Models (LLMs) have expanded their role in human interaction, yet curiosity – a central driver of inquiry – remains underexplored in these systems, particularly across cultural contexts. In this work, we investigate cultural variation in curiosity using Yahoo! Answers, a real-world multi-country dataset spanning diverse topics. We introduce CUEST (CUriosity Evaluation across SocieTies), an evaluation framework that measures human-model alignment in curiosity through linguistic (style), topic preference (content) analysis and grounding insights in social science constructs. Across open- and closed-source models, we find that LLMs flatten cross-cultural diversity, aligning more closely with how curiosity is expressed in Western countries. We then explore fine-tuning strategies to induce curiosity in LLMs, narrowing the human-model alignment gap by up to 50%. Finally, we demonstrate the practical value of curiosity for LLM adaptability across cultures, showing its importance for future NLP research.
[NLP-77] Unifying Vision-Language Latents for Zero-label Image Caption Enhancement NEURIPS2025
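[Quick Read]: This paper addresses the heavy reliance of vision-language models (VLMs) on labeled image data for image captioning, which limits scalability and leaves large amounts of unlabeled image data unused. The proposed solution is ViZer (Unified Vision-Language Alignment for Zero-Label Enhancement), an enhancement training framework that actively aligns vision and language representation features during training. This lets existing VLMs produce more grounded and descriptive captions without text labels or full retraining, making zero-label learning practical for image captioning.
Link: https://arxiv.org/abs/2510.12931
Authors: Sanghyun Byun, Jung Ick Guack, Mohanad Odema, Baisub Lee, Jacob Song, Woo Seong Chung
Affiliations: LG Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to PMLR and NeurIPS 2025 UniReps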
Abstract:Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer’s advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.
[NLP-78] Who's Asking? Evaluating LLM Robustness to Inquiry Personas in Factual Question Answering
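[Quick Read]: This paper addresses the possibility that the factual accuracy and reliability of large language model (LLM) answers are affected by inquiry personas, i.e. user profiles that convey attributes such as identity, expertise, or belief. Prior robustness work has focused mainly on adversarial inputs or distractors. This paper presents the first systematic evaluation of how persona cues that users plausibly disclose in real interactions affect question-answering accuracy. The key element is a human-centered inquiry persona testing framework that simulates such cues and identifies the failure modes they trigger, including refusals, hallucinated limitations, and role confusion. The results show how sensitivity to user framing can compromise factual reliability, and position inquiry persona testing as an effective tool for robustness evaluation.
Link: https://arxiv.org/abs/2510.12925
Authors: Nil-Jana Akpinar, Chia-Jung Lee, Vanessa Murdock, Pietro Perona
Affiliations: Microsoft; AWS Responsible AI; Amazon Web Services
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: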
Abstract:Large Language Models (LLMs) should answer factual questions truthfully, grounded in objective knowledge, regardless of user context such as self-disclosed personal information, or system personalization. In this paper, we present the first systematic evaluation of LLM robustness to inquiry personas, i.e. user profiles that convey attributes like identity, expertise, or belief. While prior work has primarily focused on adversarial inputs or distractors for robustness testing, we evaluate plausible, human-centered inquiry persona cues that users disclose in real-world interactions. We find that such cues can meaningfully alter QA accuracy and trigger failure modes such as refusals, hallucinated limitations, and role confusion. These effects highlight how model sensitivity to user framing can compromise factual reliability, and position inquiry persona testing as an effective tool for robustness evaluation.
[NLP-79] Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
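[Quick Read]: This paper addresses the difficulty of scalable, automated assessment of critical thinking in education, in particular how to identify and measure its core subskills in authentic learning tasks. The authors develop a coding rubric based on an established skills progression, complete human coding of student-written argumentative essays, and evaluate three automated scoring approaches (zero-shot prompting, few-shot prompting, and supervised fine-tuning) across three models (GPT-5, GPT-5-mini, and ModernBERT). GPT-5 with few-shot prompting gives the strongest results, especially on subskills with separable, frequent categories, while performance is lower on subskills that require detecting subtle distinctions or rare categories. The results highlight a trade-off: proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy but are less sensitive to minority categories.
Link: https://arxiv.org/abs/2510.12915
Authors: Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg, Caitlin Mills, Keerthi Chebrolu, Sudhip Nashi, Cadence Young, Brayden Liu, Sherry Lachman, Andrew Lan
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint: 17 pages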
Abstract:Critical thinking represents a fundamental competency in today’s education landscape. Developing critical thinking skills through timely assessment and feedback is crucial; however, there has not been extensive work in the learning analytics community on defining, measuring, and supporting critical thinking. In this paper, we investigate the feasibility of measuring core “subskills” that underlie critical thinking. We ground our work in an authentic task where students operationalize critical thinking: student-written argumentative essays. We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays. We then evaluated three distinct approaches to automated scoring: zero-shot prompting, few-shot prompting, and supervised fine-tuning, implemented across three large language models (GPT-5, GPT-5-mini, and ModernBERT). GPT-5 with few-shot prompting achieved the strongest results and demonstrated particular strength on subskills with separable, frequent categories, while lower performance was observed for subskills that required detection of subtle distinctions or rare categories. Our results underscore critical trade-offs in automated critical thinking assessment: proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy with reduced sensitivity to minority categories. Our work represents an initial step toward scalable assessment of higher-order reasoning skills across authentic educational contexts.
[NLP-80] EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus
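[Quick Read]: This paper addresses the lack of a dedicated multi-turn teacher-student dialogue benchmark for evaluating the teaching abilities of large language models (LLMs) in intelligent education. Existing dialogue benchmarks do not adequately cover the hierarchy of educational objectives or differentiated teaching strategies, so they cannot reflect how LLMs perform in personalized, student-centered teaching. The key contribution is EduDial, a multi-turn teacher-student dialogue dataset designed around Bloom's taxonomy of educational objectives. It covers 345 core knowledge points and 34,250 dialogue sessions and incorporates ten questioning strategies, including situational questioning, zone of proximal development (ZPD) questioning, and metacognitive questioning, with differentiated teaching strategies for students at different cognitive levels. On top of EduDial, the authors train EduDial-LLM 32B and propose an 11-dimensional evaluation framework for measuring LLM teaching ability. In experiments, EduDial-LLM achieves significant gains in student-centered teaching scenarios and outperforms all baselines.
Link: https://arxiv.org/abs/2510.12899
Authors: Shouang Wei, Min Zhang, Xin Lin, Bo Jiang, Zhongxiang Dai, Kun Kuang
Affiliations: East China Normal University; Zhejiang University; The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: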
Abstract:Recently, several multi-turn dialogue benchmarks have been proposed to evaluate the conversational abilities of large language models (LLMs). As LLMs are increasingly recognized as a key technology for advancing intelligent education, owing to their ability to deeply understand instructional contexts and provide personalized guidance, the construction of dedicated teacher-student dialogue benchmarks has become particularly important. To this end, we present EduDial, a comprehensive multi-turn teacher-student dialogue dataset. EduDial covers 345 core knowledge points and consists of 34,250 dialogue sessions generated through interactions between teacher and student agents. Its design is guided by Bloom’s taxonomy of educational objectives and incorporates ten questioning strategies, including situational questioning, zone of proximal development (ZPD) questioning, and metacognitive questioning, thus better capturing authentic classroom interactions. Furthermore, we design differentiated teaching strategies for students at different cognitive levels, thereby providing more targeted teaching guidance. Building on EduDial, we further develop EduDial-LLM 32B via training and propose an 11-dimensional evaluation framework that systematically measures the teaching abilities of LLMs, encompassing both overall teaching quality and content quality. Experiments on 17 mainstream LLMs reveal that most models struggle in student-centered teaching scenarios, whereas our EduDial-LLM achieves significant gains, consistently outperforming all baselines across all metrics. The code is available at this https URL.
[NLP-81] From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models
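[Quick Read]: This paper addresses "rule-rigidity" in large language models (LLMs) used as the reasoning engines of agentic AI: models adhere too strictly to explicit instructions, producing decisions that diverge from human common sense and intent and hindering the construction of trustworthy autonomous agents. The key to the solution is the Rule-Intent Distinction (RID) framework, a meta-prompting technique that gives the model a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its decision, so that it identifies and prioritizes human intent and handles exceptions in a human-aligned way under zero-shot conditions. In the experiments, RID reaches a 95% Human Alignment Score (HAS), versus 80% for the baseline and 75% for Chain-of-Thought (CoT) prompting, and produces higher-quality, more goal-oriented reasoning.
Link: https://arxiv.org/abs/2510.12864
Authors: Imran Khan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages. Code and data are available at this https URL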
Abstract:Large Language Models (LLMs) are increasingly being deployed as the reasoning engines for agentic AI systems, yet they exhibit a critical flaw: a rigid adherence to explicit rules that leads to decisions misaligned with human common sense and intent. This “rule-rigidity” is a significant barrier to building trustworthy autonomous agents. While prior work has shown that supervised fine-tuning (SFT) with human explanations can mitigate this issue, SFT is computationally expensive and inaccessible to many practitioners. To address this gap, we introduce the Rule-Intent Distinction (RID) Framework, a novel, low-compute meta-prompting technique designed to elicit human-aligned exception handling in LLMs in a zero-shot manner. The RID framework provides the model with a structured cognitive schema for deconstructing tasks, classifying rules, weighing conflicting outcomes, and justifying its final decision. We evaluated the RID framework against baseline and Chain-of-Thought (CoT) prompting on a custom benchmark of 20 scenarios requiring nuanced judgment across diverse domains. Our human-verified results demonstrate that the RID framework significantly improves performance, achieving a 95% Human Alignment Score (HAS), compared to 80% for the baseline and 75% for CoT. Furthermore, it consistently produces higher-quality, intent-driven reasoning. This work presents a practical, accessible, and effective method for steering LLMs from literal instruction-following to liberal, goal-oriented reasoning, paving the way for more reliable and pragmatic AI agents.
[NLP-82] A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation
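[Quick Read]: This paper addresses the limited effectiveness and adoption of automated evaluation tools for teaching Quranic recitation. Existing systems built on Automatic Speech Recognition (ASR) prioritize lexical recognition over assessing acoustic quality, and they suffer from heavy data dependency, marked demographic bias, and an inability to give diagnostic feedback. The key to the solution is a shift from the data-driven paradigm to a knowledge-centric computational framework: anticipatory acoustic modeling built on the canonical recitation rules (Tajweed) and articulation points (Makhraj), leading to evaluators that are more robust, equitable, and pedagogically sound.
Link: https://arxiv.org/abs/2510.12858
Authors: Mohammed Hilal Al-Kharusi,Khizar Hayat,Khalil Bader Al Ruqeishi,Haroon Rashid Lone
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 33 pages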
Abstract:The sacred practice of Quranic recitation (Tajweed), governed by precise phonetic, prosodic, and theological rules, faces significant pedagogical challenges in the modern era. While digital technologies promise unprecedented access to education, automated tools for recitation evaluation have failed to achieve widespread adoption or pedagogical efficacy. This literature review investigates this critical gap, conducting a comprehensive analysis of academic research, web platforms, and commercial applications developed over the past two decades. Our synthesis reveals a fundamental misalignment in prevailing approaches that repurpose Automatic Speech Recognition (ASR) architectures, which prioritize lexical recognition over qualitative acoustic assessment and are plagued by data dependency, demographic biases, and an inability to provide diagnostically useful feedback. Critiquing these data-driven paradigms, we argue for a foundational paradigm shift towards a knowledge-centric computational framework. Capitalizing on the immutable nature of the Quranic text and the precisely defined rules of Tajweed, we propose that a robust evaluator must be architected around anticipatory acoustic modeling based on canonical rules and articulation points (Makhraj), rather than relying on statistical patterns learned from imperfect and biased datasets. This review concludes that the future of automated Quranic evaluation lies in hybrid systems that integrate deep linguistic knowledge with advanced audio analysis, offering a path toward robust, equitable, and pedagogically sound tools that can faithfully support learners worldwide.
[NLP-83] Efficient Adaptive Transformer: An Empirical Study and Reproducible Framework
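[Quick Read]: This paper addresses the trade-off between inference efficiency and accuracy in Natural Language Processing (NLP) models, in particular how dynamic computation can help in latency-sensitive settings. The key to the solution is the Efficient Adaptive Transformer (EAT), a unified framework that integrates three adaptive efficiency techniques (progressive token pruning, sparse attention, and dynamic early exiting) into one reproducible end-to-end architecture, together with an open-source benchmarking pipeline that automates data processing, timing, and ablation on GLUE tasks (SST-2, QQP, MNLI), in support of further research on adaptive transformers.
Link: https://arxiv.org/abs/2510.12856
Authors: Jan Miller
Affiliations: OPSWAT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, pgfplots tables included; BibTeX compiled to .bbl. Code and reproducibility artifacts referenced in the paper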
Abstract:The Efficient Adaptive Transformer (EAT) framework unifies three adaptive efficiency techniques - progressive token pruning, sparse attention, and dynamic early exiting - into a single, reproducible architecture for input-adaptive inference. EAT provides an open-source benchmarking pipeline that automates data processing, timing, and ablation across GLUE tasks (SST-2, QQP, MNLI). Although this empirical study finds that combining these mechanisms can increase latency in shallow six-layer models, it demonstrates that EAT achieves slightly higher accuracy than the optimized DistilBERT baseline on SST-2, illustrating the potential of dynamic computation for latency-sensitive NLP. The main contribution is the open, end-to-end reproducible framework - complete with scripts, CSV logging, and analysis utilities - intended to serve as a community tool for further research on adaptive transformers.
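Of the three techniques named in the abstract, dynamic early exiting is the easiest to illustrate in isolation. The following is a minimal sketch, assuming a classifier head after every layer and a fixed confidence threshold; the function names, the threshold of 0.9, and the toy logits are all hypothetical and are not taken from the EAT codebase.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(per_layer_logits, threshold=0.9):
    """Return (predicted_class, layers_used).

    per_layer_logits holds one logits list per layer, as if a hypothetical
    classifier head were attached after every transformer layer.
    """
    for depth, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        # Exit as soon as an intermediate head is confident enough,
        # or when the final layer is reached.
        if confidence >= threshold or depth == len(per_layer_logits):
            return probs.index(confidence), depth

# Toy logits for a 2-class task across 3 layers (made-up numbers).
easy_input = [[0.2, 3.5], [0.1, 4.0], [0.0, 5.0]]
hard_input = [[0.5, 0.6], [0.4, 1.0], [0.2, 1.5]]
print(early_exit_predict(easy_input))
print(early_exit_predict(hard_input))
```

The easy input leaves after the first layer, while the hard input runs the full depth. The per-layer confidence check is itself extra work, which is one plausible reason such mechanisms can add latency in shallow models, as the abstract reports.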
[NLP-84] VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages
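[Quick Read]: This paper addresses the fact that evaluation of Vision Language Models (VLMs) relies heavily on English-centric benchmarks of short image-text pairs, with no assessment of fine-grained understanding in multilingual, long-text settings. The key to the solution is VLURes, a new multilingual benchmark covering English, Japanese, Swahili, and Urdu. Alongside eight vision-and-language tasks it introduces an "unrelatedness" task, and it probes cross-language differences in fine-grained abilities such as object recognition, scene understanding, and relationship understanding. Combining automatic scoring with evaluation by native speakers, it reveals performance gaps in multimodal visual reasoning, in particular between open-source models and leading closed models such as GPT-4o, and so provides an evaluation tool for developing the multimodal abilities of intelligent agents.
Link: https://arxiv.org/abs/2510.12845
Authors: Jesse Atuhurra,Iqra Ali,Tomoya Iwakura,Hidetaka Kamigaito,Tatsuya Hiraoka
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: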
Abstract:Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities, in four languages under long-text settings, we introduce a novel multilingual benchmark VLURes featuring eight vision-and-language tasks, and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and low-resource languages, Swahili, and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes’ critical role in developing intelligent agents to tackle multi-modal visual reasoning.
[NLP-85] FaStFACT: Faster Stronger Long-Form Factuality Evaluations in LLMs EMNLP2025
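[Quick Read]: This paper addresses the difficulty of evaluating the factuality of long-form text generated by Large Language Models (LLMs), where existing methods are inefficient and gather insufficient evidence. The key to the solution is FaStFACT, an efficient and accurate evaluation framework. First, it uses chunk-level claim extraction with confidence-based pre-verification, which substantially cuts the cost of web searches and inference calls while preserving reliability. Second, it collects document-level evidence from crawled webpages and retrieves it selectively during verification, overcoming the weakness of earlier methods that relied on one-line snippets. Experiments show that FaStFACT beats existing baselines in both agreement with human evaluation and efficiency.
Link: https://arxiv.org/abs/2510.12839
Authors: Yingjia Wan,Haochen Tan,Xiao Zhu,Xinyu Zhou,Zhiwei Li,Qingsong Lv,Changxuan Sun,Jiaqi Zeng,Yi Xu,Jianqiao Lu,Yinhong Liu,Zhijiang Guo
Affiliations: UCLA; City University of Hong Kong; HKUST (GZ); HKUST; Tsinghua University; ECNU; NVIDIA; UCL; HKU; University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
Comments: EMNLP 2025 (Findings)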
Abstract:Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and costly human assessment. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to complex pipeline components unsuitable for long LLM outputs, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence collection of one-line snippets. To address these limitations, we propose FaStFACT, a fast and strong evaluation framework that achieves the highest alignment with human evaluation and efficiency among existing baselines. FaStFACT first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web searching and inference calling while ensuring reliability. For searching and verification, it collects document-level evidence from crawled webpages and selectively retrieves it during verification, addressing the evidence insufficiency problem in previous pipelines. Extensive experiments based on an aggregated and manually annotated benchmark demonstrate the reliability of FaStFACT in both efficiently and effectively evaluating the factuality of long-form LLM generations. Code and benchmark data are available at this https URL.
[NLP-86] A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning ICLR2026
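[Quick Read]: This paper addresses the split between reasoning and agentic capabilities in current Large Language Models (LLMs): reasoning-centric LLMs excel at internal chain-of-thought reasoning but cannot call external tools, while agentic LLMs interact with environments and use tools but lag on deep reasoning. The split stems from fundamentally different training objectives, and both families tend to overthink or over-call tools on simple queries, which wastes compute. The key to the solution is the Adaptive Agent Foundation Model (A²FM), which follows a route-then-align principle and adds a third mode, instant, that answers simple queries directly without unnecessary reasoning or tool calls. Combined with Adaptive Policy Optimization (APO), which applies adaptive sampling across modes and a cost-regularized reward, it improves both accuracy and efficiency, reaches SOTA among comparable models on several benchmarks, and cuts the cost per correct answer by 45.2% relative to the reasoning mode and 33.5% relative to the agentic mode.
Link: https://arxiv.org/abs/2510.12838
Authors: Qianben Chen,Jingyi Cao,Jiayu Zhang,Tianrui Qin,Xiaowan Li,King Zhu,Dingfeng Shi,He Zhu,Minghao Liu,Xiaobo Liang,Ge Zhang,Jian Yang,Yuchen Eleanor Jiang,Wangchunshu Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, submitted to ICLR 2026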
Abstract:Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A²FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A²FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
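The cost figures in the abstract can be turned around to see what baseline costs they imply. The sketch below assumes that "cutting cost by r" means new = base × (1 − r); that reading, the function name, and the derived baseline values are assumptions for illustration, not numbers reported in the abstract.

```python
def implied_baseline_cost(new_cost, relative_reduction):
    """Invert new = base * (1 - r) to recover the baseline cost."""
    if not 0 <= relative_reduction < 1:
        raise ValueError("relative_reduction must be in [0, 1)")
    return new_cost / (1.0 - relative_reduction)

adaptive = 0.00487  # dollars per correct answer, as stated in the abstract

# Derived values under the stated assumption, not reported results.
reasoning = implied_baseline_cost(adaptive, 0.452)
agentic = implied_baseline_cost(adaptive, 0.335)

print(f"implied reasoning-mode cost: ${reasoning:.5f}")
print(f"implied agentic-mode cost:   ${agentic:.5f}")
```

Under this reading, the reasoning-only and agentic-only modes would each cost somewhat under one cent per correct answer, so the adaptive saving comes from routing rather than from any single mode being expensive in absolute terms.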
[NLP-87] Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study
[Quick Read]: This paper asks how annotation guidelines written for human annotators can be turned into explicit, structured instructions for a Large Language Model (LLM), so that LLMs can carry out high-quality text annotation. The key to the solution is a moderation-oriented guideline repurposing method: an LLM acts as a "moderator" that parses the human-oriented guidelines and restructures them into clear directives an LLM can follow. Experiments on the NCBI Disease Corpus show that the method is effective and also surface practical challenges, such as semantic ambiguity and instruction consistency, pointing to a feasible path toward scalable, low-cost automated annotation.
Link: https://arxiv.org/abs/2510.12835
Authors: Kon Woo Kim(National Institute of Informatics, Japan),Rezarta Islamaj(National Library of Medicine, USA),Jin-Dong Kim(Joint Support-Center for Data Science Research, Japan),Florian Boudin(Japanese-French Laboratory of Informatics, CNRS, Nantes University, Japan),Akiko Aizawa(National Institute of Informatics, Japan)
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 3 tables, This is a preprint of the article accepted at NLDB 2025 (Springer LNCS). The final version is available at this https URL
Abstract:This study investigates how existing annotation guidelines can be repurposed to instruct large language model (LLM) annotators for text annotation tasks. Traditional guidelines are written for human annotators who internalize training, while LLMs require explicit, structured instructions. We propose a moderation-oriented guideline repurposing method that transforms guidelines into clear directives for LLMs through an LLM moderation process. Using the NCBI Disease Corpus as a case study, our experiments show that repurposed guidelines can effectively guide LLM annotators, while revealing several practical challenges. The results highlight the potential of this workflow to support scalable and cost-effective refinement of annotation guidelines and automated annotation.
[NLP-88] MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training
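[Quick Read]: This paper addresses non-executable or semantically incoherent outputs in multi-turn Text-to-SQL, which arise because existing systems follow a short-horizon paradigm: they treat the task as simple text translation and generate SQL at each turn without execution feedback, explicit verification, or iterative refinement. The key to the solution is MTSQL-R1, which models the task as a Markov Decision Process (MDP). An agent interacts with the database to obtain execution feedback and with a persistent dialogue memory to verify coherence, iterating through a propose, execute, verify, refine cycle until all checks pass. This environment-driven verification and memory-guided refinement markedly improves executability and dialogue coherence.
Link: https://arxiv.org/abs/2510.12831
Authors: Taicheng Guo,Hai Wang,ChaoChun Liu,Mohsen Golalikhani,Xin Chen,Xiangliang Zhang,Chandan K. Reddy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: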
Abstract:Multi-turn Text-to-SQL aims to translate a user’s conversational utterances into executable SQL while preserving dialogue coherence and grounding to the target schema. However, most existing systems only regard this task as a simple text translation task and follow a short-horizon paradigm, generating a query per turn without execution, explicit verification, and refinement, which leads to non-executable or incoherent outputs. We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. We cast the task as a Markov Decision Process (MDP) in which an agent interacts with (i) a database for execution feedback and (ii) a persistent dialogue memory for coherence verification, performing an iterative propose → execute → verify → refine cycle until all checks pass. Experiments on COSQL and SPARC demonstrate that MTSQL-R1 consistently outperforms strong baselines, highlighting the importance of environment-driven verification and memory-guided refinement for conversational semantic parsing. Full recipes (including code, trained models, logs, reasoning trajectories, etc.) will be released after the internal review to contribute to community research.
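The cycle described in the abstract can be sketched as a small control loop. This is only an illustration using Python's standard `sqlite3` module: the scripted `drafts` iterator stands in for a trained model, and `verify`, `solve_turn`, the toy `shops` table, and the city constraint are hypothetical names invented here, not components of MTSQL-R1.

```python
import sqlite3

def execute(conn, sql):
    """Run a query; return (rows, None) on success or (None, error text)."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

def verify(rows, memory):
    """Hypothetical coherence check: earlier turns restricted results to one
    city, so every returned row must match it."""
    return all(row[1] == memory["city"] for row in rows)

def solve_turn(conn, propose, memory, max_steps=4):
    """Iterate propose -> execute -> verify -> refine until checks pass."""
    feedback = None
    for _ in range(max_steps):
        sql = propose(feedback)            # propose (or refine using feedback)
        rows, error = execute(conn, sql)   # execute against the database
        if error is not None:
            feedback = f"execution error: {error}"
            continue
        if not verify(rows, memory):       # verify against dialogue memory
            feedback = "result ignores the city constraint from earlier turns"
            continue
        return sql, rows
    return None, None

# Toy database and a scripted stand-in for the model (it ignores feedback).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shops (name TEXT, city TEXT)")
conn.executemany("INSERT INTO shops VALUES (?, ?)",
                 [("Alpha", "Paris"), ("Beta", "Rome")])
drafts = iter([
    "SELECT name, town FROM shops",                       # fails to execute
    "SELECT name, city FROM shops",                       # drops the constraint
    "SELECT name, city FROM shops WHERE city = 'Paris'",  # passes both checks
])
final_sql, final_rows = solve_turn(conn, lambda fb: next(drafts), {"city": "Paris"})
print(final_sql, final_rows)
```

The first draft is rejected by the database, the second by the memory check, and only the third is returned, which is the difference from a short-horizon system that would have emitted the first draft as its answer.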
[NLP-89] Mathematics with large language models as provers and verifiers
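[Quick Read]: This paper addresses the reliability problems that hallucination causes when Large Language Models (LLMs) prove mathematical theorems, especially accuracy and verifiability on hard problems (such as International Mathematical Olympiad problems) and number-theory conjectures. The key to the solution is a protocol in which multiple prover and verifier instances of the GPT-5 model collaborate in stages; the final proof is formally checked with the Lean proof assistant, and a human verifies that the premises and conclusion of the Lean code conform to the problem. The method solved five of the 2025 IMO problems and closed about a third of the 66 number-theory conjectures in Cohen (2025).
Link: https://arxiv.org/abs/2510.12829
Authors: Hieu Le Duc,Leo Liberti
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: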
Abstract:During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman and Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove them. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the Lean proof assistant, and the conformance of premises and conclusion of the Lean code is verified by a human. Our methodology was able to solve five out of six 2025 IMO problems, and close a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].
[NLP-90] Scheming Ability in LLM-to-LLM Strategic Interactions
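[Quick Read]: This paper evaluates the ability and propensity of Large Language Model (LLM) agents to carry out strategic deception autonomously in multi-agent settings, in particular the risk of LLMs deceiving each other without explicit prompting. The key to the solution is two game-theoretic frameworks, a Cheap Talk signaling game and a Peer Evaluation adversarial game, used to measure deception with and without explicit prompting, with chain-of-thought reasoning analyzed to characterize the tactics. Even without prompting, frontier LLM agents showed a strong propensity to deceive: every model chose deception in Peer Evaluation (100% rate), and models that chose to scheme in Cheap Talk succeeded 95-100% of the time, underscoring the need for robust evaluation of multi-agent systems in high-stakes game scenarios.
Link: https://arxiv.org/abs/2510.12826
Authors: Thao Pham
Affiliations: Berea College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 25 pages, 13 figures, under review at IASEAI’26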
Abstract:As large language model (LLM) agents are deployed autonomously in diverse contexts, evaluating their capacity for strategic deception becomes crucial. While recent research has examined how AI systems scheme against human developers, LLM-to-LLM scheming remains underexplored. We investigate the scheming ability and propensity of frontier LLM agents through two game-theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Testing four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b), we measure scheming performance with and without explicit prompting while analyzing scheming tactics through chain-of-thought reasoning. When prompted, most models, especially Gemini-2.5-pro and Claude-3.7-Sonnet, achieved near-perfect performance. Critically, models exhibited significant scheming propensity without prompting: all models chose deception over confession in Peer Evaluation (100% rate), while models choosing to scheme in Cheap Talk succeeded at 95-100% rates. These findings highlight the need for robust evaluations using high-stakes game-theoretic scenarios in multi-agent settings.
[NLP-91] Classifier-Augmented Generation for Structured Workflow Prediction EMNLP2025
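[Quick Read]: This paper addresses inefficient workflow configuration in ETL (Extract, Transform, Load) tools: users need deep tool knowledge to set up complex data flows by hand, which is slow and error-prone. The key to the solution is Classifier-Augmented Generation (CAG), which decomposes a natural-language description into sub-utterances and combines a classifier with stage-specific few-shot prompting to predict ETL stages accurately. Edge prediction then connects the stages into non-linear workflows, and stage properties are inferred from sub-utterance context, giving end-to-end workflow generation. The architecture is modular and interpretable, improves accuracy and efficiency, and substantially reduces token usage.
Link: https://arxiv.org/abs/2510.12825
Authors: Thomas Gschwind,Shramona Chakraborty,Nitin Gupta,Sameep Mehta
Affiliations: IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted at EMNLP 2025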
Abstract:ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.
[NLP-92] MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
[Quick Read]: This paper addresses the instability of internal reasoning in large language models (LLMs) used for clinical decision support when patient demographic cues (such as gendered pronouns) change slightly, which may lead to inequitable care. The key to the solution is MEDEQUALQA, a counterfactual benchmark that changes only the patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, giving three parallel datasets of about 23,000 items each (69,000 total), and Semantic Textual Similarity (STS) between reasoning traces quantifies how much the reasoning differs. This exposes localized reasoning shifts even when the final diagnosis is unchanged, revealing potential sources of clinical bias.
Link: https://arxiv.org/abs/2510.12818
Authors: Rajarshi Ghosh,Abhay Gupta,Hudson McBride,Anurag Vaidya,Faisal Mahmood
Affiliations: Lone Star College; Algoverse AI Research; Empire State University; Brigham and Women’s Hospital, Harvard Medical School
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23,000 items each (69,000 total). We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS 0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering, even when final diagnoses remain unchanged. Our error analysis highlights certain cases in which the reasoning shifts, underscoring clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA offers a controlled diagnostic setting for auditing reasoning stability in medical AI.
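To make the idea of comparing reasoning traces across pronoun variants concrete, here is a deliberately simple stand-in. The abstract does not say how STS is computed, and STS is typically computed from sentence embeddings; the bag-of-words cosine below is a swapped-in, dependency-free substitute, and the two trace strings are invented toy examples, not items from MEDEQUALQA.

```python
import math
import re
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors, in [0, 1]."""
    a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented reasoning traces for two pronoun variants of one vignette.
trace_he = "He reports chest pain; given his age, cardiac causes rank first."
trace_she = "She reports chest pain; given her age, anxiety is considered first."

print(round(bow_cosine(trace_he, trace_she), 2))
```

A score well below 1.0 for traces that end in the same diagnosis is the kind of signal the benchmark looks for. A lexical measure like this one also penalizes the pronoun swap itself, which is one reason an embedding-based similarity is the more natural choice in practice.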
[NLP-93] From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP
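[Quick Read]: This paper addresses how current preference-learning datasets treat Human Label Variation (HLV) as noise and collapse it into a single label, erasing the diversity of human values and weakening the alignment of large language models (LLMs) with pluralistic human values. The key to the solution is to treat HLV as an embodiment of human pluralism and actively preserve it in preference datasets rather than eliminate it. That means treating it as a goal in itself (Selbstzweck) from the data-design stage and taking concrete steps to incorporate it systematically into preference learning, in order to improve model robustness and the authenticity of value alignment.
Link: https://arxiv.org/abs/2510.12817
Authors: Shanshan Xu,Santosh T.Y.S.S,Barbara Plank
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: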
Abstract:Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the genuine diversity of human perspectives rather than mere error. For decades, HLV in NLP was dismissed as noise to be discarded, and only slowly over the last decade has it been reframed as a signal for improving model robustness. With the rise of large language models (LLMs), where post-training on human feedback has become central to model alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely aggregate multiple annotations into a single label, thereby flattening diverse perspectives into a false universal agreement and erasing precisely the pluralism of human values that alignment aims to preserve. In this position paper, we argue that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck - a goal in itself when designing AI systems. We call for proactively incorporating HLV into preference datasets and outline actionable steps towards it.
[NLP-94] Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study
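[Quick Read]: This paper addresses the difficulty of classifying diagnoses in Electronic Health Records (EHR), where data are inconsistently structured or free-text, in support of predictive health care models. The key to the solution is a systematic evaluation of four large language models (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5) and BioBERT on categorizing cancer diagnoses from EHR data. GPT-4o performs best on free-text diagnoses and BioBERT performs best on ICD code descriptions, showing that models have different strengths on different input types and that expert validation and standardized documentation practices are needed for clinical reliability.
Link: https://arxiv.org/abs/2510.12813
Authors: Soheil Hashtarkhani,Rezaur Rashid,Christopher L Brett,Lokesh Chinthala,Fekede Asefa Kumsa,Janet A Zink,Robert L Davis,David L Schwartz,Arash Shaban-Nejad
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages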
Abstract:Electronic health records contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive health care models. Although artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation. The aim of this study is to evaluate the performance of 4 large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health records data. We analyzed 762 unique diagnoses (326 International Classification of Diseases (ICD) code descriptions, 436 free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated classifications. BioBERT achieved the highest weighted macro F1-score for ICD codes (84.2) and matched GPT-4o in ICD code accuracy (90.8). For free-text diagnoses, GPT-4o outperformed BioBERT in weighted macro F1-score (71.8 vs 61.5) and achieved slightly higher accuracy (81.9 vs 81.6). GPT-3.5, Gemini, and Llama showed lower overall performance on both formats. Common misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. Although current performance levels appear sufficient for administrative and research use, reliable clinical applications will require standardized documentation practices alongside robust human oversight for high-stakes decision-making.
[NLP-95] Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning
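[Quick Read]: This paper addresses the lack of systematic evaluation of Large Language Models (LLMs) on low-resource languages such as Persian. The key to the solution is a comprehensive benchmark of Persian Natural Language Processing (NLP) tasks, including sentiment analysis, Named Entity Recognition (NER), reading comprehension, and question answering, run under zero-shot and few-shot learning paradigms. Using established Persian datasets such as ParsiNLU and ArmanEmo and metrics such as Accuracy, F1-score, BLEU, and ROUGE, the study finds that Gemma 2 performs best on nearly all tasks, particularly complex reasoning, while most models struggle with token-level tasks such as NER. The result is a benchmark and a set of directions for future multilingual LLM work on Persian.
Link: https://arxiv.org/abs/2510.12807
Authors: Mahdi Cherakhloo,Arash Abbasi,Mohammad Saeid Sarafraz,Bijan Vosoughi Vahdat
Affiliations: Sharif University of Technology; University of Tehran; YarAI Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: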
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.
[NLP-96] AutoCode: LLMs as Problem Setters for Competitive Programming
[Quick read]: This paper tackles the problem of generating high-quality competitive programming problems that are ready for contest use. The core challenge is setting constraints, input distributions, and edge cases precisely enough to rule out shortcut solutions, while calibrating difficulty so that the problem separates contestants of different skill levels. The key to its solution is AutoCode, a system that uses multiple rounds of validation to produce contest-grade problem statements and test cases, and cross-verifies reference and brute-force solutions to filter out malformed problems. On held-out problems its test suites reach close to 99% consistency with official judgments, compared with less than 81% for the current state-of-the-art method HardTests, and Grandmaster-level competitive programmers judged the generated problems to be of contest quality.
Link: https://arxiv.org/abs/2510.12803
Authors: Shang Zhou,Zihan Zheng,Kaiyuan Liu,Zeyu Shen,Zerui Cheng,Zexing Chen,Hansen He,Jianzhu Yao,Huanzhi Mao,Qiuyang Mang,Tianfu Fu,Beichen Li,Dongruixuan Li,Wenhao Chai,Zhuang Liu,Aleksandra Korolova,Peter Henderson,Natasha Jaques,Pramod Viswanath,Saining Xie,Jingbo Shang
Affiliations: University of California San Diego; New York University; University of Washington; Princeton University; University of California Berkeley; OpenAI; Massachusetts Institute of Technology; University of Waterloo; Sentient Labs
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: Project page: this https URL
Abstract:Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.
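The cross-verification step mentioned in the abstract can be illustrated with a small stress-testing sketch. This is not AutoCode's implementation: the toy problem (maximum subarray sum), the function names, and the random test generator are all assumptions chosen for illustration. A reference solution is compared with a brute-force solution on generated inputs, and any disagreement flags the problem or its tests as suspect.

```python
import random

def reference_solution(nums):
    """Kadane's algorithm: maximum sum of a non-empty contiguous subarray."""
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best

def brute_force_solution(nums):
    """O(n^2) check of every non-empty contiguous subarray."""
    best = nums[0]
    for i in range(len(nums)):
        total = 0
        for j in range(i, len(nums)):
            total += nums[j]
            best = max(best, total)
    return best

def generate_tests(count, max_len=8, max_abs=10, seed=0):
    """Hypothetical generator: small random integer arrays (assumed input format)."""
    rng = random.Random(seed)
    return [
        [rng.randint(-max_abs, max_abs) for _ in range(rng.randint(1, max_len))]
        for _ in range(count)
    ]

def cross_verify(reference, brute, tests):
    """Return the test inputs on which the two solutions disagree."""
    return [t for t in tests if reference(t) != brute(t)]

tests = generate_tests(200)
mismatches = cross_verify(reference_solution, brute_force_solution, tests)
print("disagreements:", len(mismatches))
```

An empty mismatch list does not prove correctness; it only shows that the two independent solutions agree on the sampled inputs, which is why small inputs and a trivially correct brute force are used here.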
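[NLP-97] Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses
[Quick read]: This paper addresses the efficiency and accuracy of generative error correction (GER) in multimodal speech recognition, in particular audio-visual speech recognition (AVSR), where conventional single-stream GER methods cannot effectively combine complementary information from the audio (ASR) and visual (VSR) modalities. The key is DualHyp, a framework in which a large language model (LLM) reasons jointly, directly in the language space, over independent N-best hypotheses produced by separate ASR and VSR models. It further introduces RelPrompt, a noise-aware guidance mechanism based on modality reliability that steers the LLM's focus between the two sets of hypotheses for more accurate cross-modal correction. On the LRS2 benchmark, the method achieves up to a 57.7% error rate gain over a standard ASR baseline, whereas single-stream GER approaches achieve only about 10%.
Link: https://arxiv.org/abs/2510.13281
Authors: Sungnyun Kim,Kangwook Jang,Sungwoo Cho,Joon Son Chung,Hoirin Kim,Se-Young Yun
Affiliations: Kim Jaechul Graduate School of AI, KAIST; School of Electrical Engineering, KAIST
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint work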
Abstract:This paper introduces a new paradigm for generative error correction (GER) framework in audio-visual speech recognition (AVSR) that reasons over modality-specific evidences directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt offers the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for an accurate correction. Under various corruption scenarios, our framework attains up to 57.7% error rate gain on the LRS2 benchmark over standard ASR baseline, contrary to single-stream GER approaches that achieve only 10% gain. To facilitate research within our DualHyp framework, we release the code and the dataset comprising ASR and VSR hypotheses at this https URL.
Computer Vision
[CV-0] PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning
[Quick read]: This paper addresses the problem that current video generation models produce visually realistic videos yet often violate physical laws, which limits their ability to generate physically plausible videos and to serve as "world models". The key is PhysMaster, whose core component, PhysEncoder, encodes physical priors from the input image (such as relative object positions and potential interactions) as an extra condition injected into the video generation process to strengthen physics-awareness. The physical representation is then optimized end to end through reinforcement learning with human feedback using Direct Preference Optimization (DPO), improving physical plausibility without explicit physics supervision.
Link: https://arxiv.org/abs/2510.13809
Authors: Sihui Ji,Xi Chen,Xin Tao,Pengfei Wan,Hengshuang Zhao
Affiliations: The University of Hong Kong; Kling Team, Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Video generation models nowadays are capable of generating visually realistic videos, but often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and serve as “world models”. To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, PhysMaster is based on the image-to-video task where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors like relative positions and potential interactions of objects in the scenario, we devise PhysEncoder to encode physical information from it as an extra condition to inject physical knowledge into the video generation process. The lack of proper supervision on the model’s physical performance beyond mere appearance motivates PhysEncoder to apply reinforcement learning with human feedback to physical representation learning, which leverages feedback from generation models to optimize physical representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving physics-awareness of PhysEncoder and thus of video generation, proving its ability on a simple proxy task and generalizability to wide-ranging physical scenarios. This implies that our PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic and plug-in solution for physics-aware video generation and broader applications.
[CV-1] VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models
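[Quick read]: This paper addresses the sharp performance drop of large vision-language models (VLMs) on novel domains whose distribution differs substantially from the pretraining data. Existing domain adaptation methods finetune different VLM components, which often leads to limited domain-specific feature learning or catastrophic forgetting of prior capabilities. The key is Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes, enabling efficient domain-specific adaptation with minimal changes to pretrained parameters while retaining source-domain knowledge.
Link: https://arxiv.org/abs/2510.13808
Authors: Dominick Reilly,Manish Kumar Govind,Le Xue,Srijan Das
Affiliations: University of North Carolina at Charlotte; Salesforce AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: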
Abstract:Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM’s vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings-cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.
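[CV-2] Trace Anything: Representing Any Video in 4D via Trajectory Fields
[Quick read]: This paper addresses efficient and accurate modeling of dynamics in video, specifically how to capture and represent, at the pixel level, continuous spatial trajectories that evolve over time. Conventional approaches typically rely on point tracking between discrete frames or on iterative optimization, which makes global consistency and computational efficiency hard to achieve. The key is a new spatio-temporal representation, the Trajectory Field, which models the motion of each pixel in every frame as a continuous 3D trajectory function of time parameterized as a B-spline. A neural network, Trace Anything, predicts the entire trajectory field in a single feed-forward pass. The method improves the accuracy and efficiency of trajectory estimation and shows emergent abilities such as goal-conditioned manipulation and motion forecasting.
Link: https://arxiv.org/abs/2510.13802
Authors: Xinhang Liu,Yuxi Xiao,Donny Y. Chen,Jiashi Feng,Yu-Wing Tai,Chi-Keung Tang,Bingyi Kang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: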
Abstract:Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion. Project page: this https URL.
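To make the "control points parameterize a trajectory" idea concrete, here is a minimal sketch of evaluating a 3D position at a query time from predicted control points. The abstract does not state the spline degree or knot layout, so this assumes a single uniform cubic B-spline segment with four control points; the function name and the example points are hypothetical.

```python
def cubic_bspline_point(control_points, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    control_points: four (x, y, z) tuples, standing in for the per-pixel
    control points a model might predict (an assumption of this sketch).
    """
    if len(control_points) != 4:
        raise ValueError("a cubic segment needs exactly four control points")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Uniform cubic B-spline basis functions; they sum to 1 for every t.
    weights = (
        (1 - t) ** 3 / 6.0,
        (3 * t**3 - 6 * t**2 + 4) / 6.0,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0,
        t**3 / 6.0,
    )
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, control_points))
        for axis in range(3)
    )

# Hypothetical control points for one pixel: motion along the x axis.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
for query_time in (0.0, 0.5, 1.0):
    print(query_time, cubic_bspline_point(points, query_time))
```

Because the trajectory is a closed-form function of time, positions at arbitrary query instants come from a few multiplications rather than from per-frame tracking.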
[CV-3] Reasoning in Space via Grounding in the World
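[Quick read]: This paper addresses the lack, in current 3D large language models (3D LLMs), of a unified 3D representation that captures both semantic and geometric information and can therefore connect 3D visual grounding with spatial reasoning. Existing methods either perform poorly on grounding or rely heavily on external modules, which prevents the two from being integrated seamlessly. The key is a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with semantic and positional cues, building a unified image patch-based 3D representation without increasing the number of input tokens. On this basis, GS-Reasoner is the first 3D LLM to achieve autoregressive grounding without external modules, and the Grounded Chain-of-Thought (GCoT) dataset further strengthens the link between grounding and spatial reasoning, improving end-to-end spatial reasoning performance.
Link: https://arxiv.org/abs/2510.13800
Authors: Yiming Chen,Zekun Qi,Wenyao Zhang,Xin Jin,Li Zhang,Peidong Liu
Affiliations: Westlake University; Shanghai Innovation Institute; Zhejiang University; Tsinghua University; Shanghai Jiao Tong University; Eastern Institute of Technology; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 7 figures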
Abstract:In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
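[CV-4] Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
[Quick read]: This paper addresses the performance gap between fully open multimodal large language models (MLLMs) and proprietary models, which stems from insufficient data quality in supervised fine-tuning (SFT): existing open-source datasets are noisy and lack complex reasoning data such as Chain-of-Thought (CoT). The solution has three parts. First, it builds Honey-Data-15M, a large, high-quality SFT dataset of about 15 million QA pairs, cleaned through multiple techniques and enriched with a novel dual-level (short and long) CoT strategy. Second, it presents HoneyPipe, a reusable data curation pipeline, and its underlying framework DataStudio, which make the curation process transparent and adaptable. Finally, it trains Bee-8B on this data; the model sets a new state of the art among fully open MLLMs and in some cases surpasses recent semi-open models such as InternVL3.5-8B, supporting the view that data quality is a key driver for open models.
Link: https://arxiv.org/abs/2510.13795
Authors: Yi Zhang,Bolin Ni,Xin-Sheng Chen,Heng-Rui Zhang,Yongming Rao,Houwen Peng,Qinglin Lu,Han Hu,Meng-Hao Guo,Shi-Min Hu
Affiliations: Beihang University; Tsinghua University; Tencent Hunyuan Team
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: homepage: this https URL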
Abstract:Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
[CV-5] NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models
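[Quick read]: This paper addresses authorship verification and copyright protection for visual content generated by diffusion models, especially how a third party can verify a watermark when the model owner is unable or unwilling to take part. The key is NoisePrints, a lightweight watermarking scheme that uses the random seed that initializes the diffusion process as proof of authorship, without modifying the generation process. A hash function incorporated into noise sampling makes it computationally infeasible to recover a valid seed from the generated content, and also prevents a forged seed from passing verification. Cryptographic zero-knowledge proofs allow ownership to be proven without revealing the seed, which makes watermark removal harder. Experiments on several state-of-the-art image and video diffusion models show efficient verification using only the seed and the output, with no access to model weights.
Link: https://arxiv.org/abs/2510.13793
Authors: Nir Goren,Oren Katzir,Abhinav Nakarmi,Eyal Ronen,Mahmood Sharif,Or Patashnik
Affiliations: Tel Aviv University; University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: code available at: this https URL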
Abstract:With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose NoisePrints, a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.
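The seed-as-proof idea can be sketched in a few lines. This is a toy under stated assumptions, not the NoisePrints procedure: the abstract does not specify the hash construction, the space in which noise and content are compared, or the decision threshold. Here the seed is hashed with SHA-256 to drive a noise sampler, a stand-in "generator" merely perturbs the noise, and verification is a cosine-similarity check against a hypothetical threshold.

```python
import hashlib
import math
import random

def noise_from_seed(seed: bytes, dim: int = 256):
    """Derive Gaussian noise from a hashed seed (assumed construction)."""
    digest = hashlib.sha256(seed).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_generate(noise, strength=0.5, distortion_seed=123):
    """Stand-in for a diffusion model: output stays correlated with its noise."""
    rng = random.Random(distortion_seed)
    return [x + strength * rng.gauss(0.0, 1.0) for x in noise]

def verify(seed: bytes, content, threshold=0.5):
    """Accept the authorship claim if seed-derived noise correlates with content.

    The threshold is a hypothetical value chosen for this toy setup.
    """
    return cosine(noise_from_seed(seed, len(content)), content) >= threshold

content = toy_generate(noise_from_seed(b"author-secret"))
print("true seed accepted:", verify(b"author-secret", content))
print("forged seed accepted:", verify(b"forged-seed", content))
```

Hashing the seed before sampling is what makes it hard to work backwards from a desired noise pattern to a seed that produces it; verification itself needs only the seed and the output, not the model.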
[CV-6] Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation
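[Quick read]: This paper addresses a central challenge of story continuation with generative AI: how to use prior visual context effectively while keeping semantic consistency with the current text input. The key is AVC (Adaptive Visual Conditioning), a diffusion-based framework that uses the CLIP model to retrieve the most semantically aligned previous image. When no sufficiently relevant image is found, an adaptive mechanism restricts the influence of prior visuals to the early stages of the diffusion process, avoiding misleading or irrelevant content. The model can thus exploit visual context when it helps while staying responsive to the text, improving coherence, semantic consistency, and visual fidelity.
Link: https://arxiv.org/abs/2510.13787
Authors: Seyed Mohammad Mousavi,Morteza Analoui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: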
Abstract:Story continuation focuses on generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting lies in utilizing prior visual context effectively, while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs the CLIP model to retrieve the most semantically aligned image from previous frames. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to only the early stages of the diffusion process. This enables the model to exploit visual context when beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset using large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.
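The adaptive rule in the abstract (retrieve the best-aligned previous frame, and limit its influence to early diffusion steps when alignment is weak) can be sketched as a small planning function. Everything numeric here is an assumption: the toy vectors stand in for CLIP embeddings, and the similarity threshold, step count, and early-stage fraction are hypothetical values, not figures from the paper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def plan_conditioning(text_emb, frame_embs, num_steps=50,
                      sim_threshold=0.3, early_fraction=0.2):
    """Pick the most aligned previous frame and how long to condition on it.

    Returns (frame_index, conditioned_steps). With a relevant frame, visual
    conditioning spans all steps; otherwise only the earliest steps.
    """
    sims = [cosine(text_emb, f) for f in frame_embs]
    best_idx = max(range(len(sims)), key=sims.__getitem__)
    if sims[best_idx] >= sim_threshold:
        conditioned_steps = num_steps
    else:
        conditioned_steps = max(1, round(num_steps * early_fraction))
    return best_idx, conditioned_steps

# Toy 2-D embeddings in place of real CLIP features.
text = [1.0, 0.0]
print(plan_conditioning(text, [[0.0, 1.0], [1.0, 0.1]]))  # relevant frame available
print(plan_conditioning(text, [[0.0, 1.0], [0.1, 1.0]]))  # no relevant frame
```

Restricting weakly related visuals to the early, high-noise steps lets them shape coarse layout at most, leaving later steps to follow the text.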
[CV-7] InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
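[Quick read]: This paper addresses a key challenge in moving instruction-following robots toward scalable, general-purpose intelligence: translating natural-language instructions into accurate robot actions. The core of the solution is a spatially guided vision-language-action (VLA) training paradigm in which spatial grounding links instructions to actions. The model is first pre-trained on over 2.3M spatial reasoning samples to learn "where to act" by aligning instructions with embodiment-agnostic positions, and then post-trained with spatial prompting to decide "how to act" by generating embodiment-aware actions. This two-stage recipe yields consistent gains across several robot platforms and generalizes better in long-horizon reasoning tasks and to unseen objects and configurations, supporting spatially guided training as a unifying principle for scalable, resilient generalist robots.
Link: https://arxiv.org/abs/2510.13778
Authors: Xinyi Chen,Yilun Chen,Yanwei Fu,Ning Gao,Jiaya Jia,Weiyang Jin,Hao Li,Yao Mu,Jiangmiao Pang,Yu Qiao,Yang Tian,Bin Wang,Bolun Wang,Fangjing Wang,Hanqing Wang,Tai Wang,Ziqin Wang,Xueyuan Wei,Chao Wu,Shuai Yang,Jinhui Ye,Junqiu Yu,Jia Zeng,Jingjing Zhang,Jinyu Zhang,Shi Zhang,Feng Zheng,Bowen Zhou,Yangkun Zhu
Affiliations: Shanghai AI Laboratory
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report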
Abstract:We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine “where to act” by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide “how to act” by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at this https URL.
[CV-8] UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations
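[Quick read]: This paper addresses the difficulty of fusing multi-source geospatial data when predicting urban phenomena such as housing prices and public health indicators. Existing methods rely on task-specific models, and current foundation models for spatial representations usually support only a few modalities and lack multimodal fusion. The key is UrbanFusion, a Geo-Foundation Model (GeoFM) with Stochastic Multimodal Fusion (SMF). Modality-specific encoders process heterogeneous inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs), and a Transformer-based fusion module learns unified representations, allowing flexible combinations of modalities and strong generalization.
Link: https://arxiv.org/abs/2510.13774
Authors: Dominik J. Mühlematter,Lin Che,Ye Hong,Martin Raubal,Nina Wiedemann
Affiliations: ETH Zürich; Intel Corporation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: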
Abstract:Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion’s strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at this https URL.
[CV-9] Scaling Vision Transformers for Functional MRI with Flat Maps NEURIPS2025
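[Quick read]: This paper addresses a core question in adapting modern deep learning architectures to functional MRI (fMRI): how to represent the data for model input. The key is to convert 4D volumetric fMRI data into videos of 2D fMRI activity flat maps and to train Vision Transformers on them with the spatiotemporal masked autoencoder (MAE) framework. This bridges the modality gap between fMRI and natural images. Experiments show that masked modeling performance improves with dataset size according to a strict power law, and downstream classification benchmarks show that the model supports both fine-grained state decoding across subjects and subject-specific trait decoding across changes in brain state.
Link: https://arxiv.org/abs/2510.13768
Authors: Connor Lane,Daniel Z. Kaplan,Tanishq Mathew Abraham,Paul S. Scotti
Affiliations: Sophont; Medical AI Research Center (MedARC)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: NeurIPS 2025 Workshop, Foundation Models for the Brain and Body; Code: this https URL Discord: this https URL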
Abstract:A key question for adapting modern deep learning architectures to functional MRI (fMRI) is how to represent the data for model input. To bridge the modality gap between fMRI and natural images, we transform the 4D volumetric fMRI data into videos of 2D fMRI activity flat maps. We train Vision Transformers on 2.3K hours of fMRI flat map videos from the Human Connectome Project using the spatiotemporal masked autoencoder (MAE) framework. We observe that masked fMRI modeling performance improves with dataset size according to a strict power scaling law. Downstream classification benchmarks show that our model learns rich representations supporting both fine-grained state decoding across subjects, as well as subject-specific trait decoding across changes in brain state. This work is part of an ongoing open science project to build foundation models for fMRI data. Our code and datasets are available at this https URL.
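A power scaling law of the form loss ≈ a · N^(-b) is a straight line in log-log space, so its exponent can be estimated with ordinary least squares. The sketch below shows that fit on synthetic numbers; the dataset sizes, loss values, and coefficients are made up for illustration and are not results from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size).

    Returns (a, b) for the model loss = a * size ** (-b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic example: losses generated exactly from a = 2.0, b = 0.25.
sizes = [100, 1_000, 10_000, 100_000]
losses = [2.0 * n ** -0.25 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements the points will not fall exactly on a line, and the residuals of this fit are one way to judge how strictly a power law holds.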
[CV-10] Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
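[Quick read]: This paper addresses the fact that current evaluations of unified multimodal models do not examine how visual understanding and generation actually work together: existing benchmarks either assess the two abilities in isolation or overlook tasks that inherently couple them. The key is Uni-MMMU, a comprehensive, discipline-aware benchmark that systematically examines the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, requiring models either to use conceptual understanding to guide precise visual synthesis or to use generation as a cognitive scaffold for analytical reasoning. The benchmark includes verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs.
Link: https://arxiv.org/abs/2510.13759
Authors: Kai Zou,Ziqi Huang,Yuhao Dong,Shulin Tian,Dian Zheng,Hongbo Liu,Jingwen He,Bin Liu,Yu Qiao,Ziwei Liu
Affiliations: Shanghai Artificial Intelligence Laboratory; S-Lab, Nanyang Technological University; University of Science and Technology of China; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Equal contributions from first three authors. Project page: this https URL Code: this https URL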
Abstract:Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. Existing evaluations either treat the two abilities in isolation or overlook tasks that inherently couple them. To address this gap, we present Uni-MMMU, a comprehensive and discipline-aware benchmark that systematically unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains, including science, coding, mathematics, and puzzles. Each task is bidirectionally coupled, demanding models to (i) leverage conceptual understanding to guide precise visual synthesis, or (ii) utilize generation as a cognitive scaffold for analytical reasoning. Uni-MMMU incorporates verifiable intermediate reasoning steps, unique ground truths, and a reproducible scoring protocol for both textual and visual outputs. Through extensive evaluation of state-of-the-art unified, generation-only, and understanding-only models, we reveal substantial performance disparities and cross-modal dependencies, offering new insights into when and how these abilities reinforce one another, and establishing a reliable foundation for advancing unified models.
[CV-11] RECODE: Reasoning Through Code Generation for Visual Question Answering
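[Quick read]: This paper addresses the limited reasoning precision of multimodal large language models (MLLMs) on structured visuals such as charts and diagrams, where pixel-based perception lacks a verification mechanism. The key is derendering, the process of reverse-engineering a visual into executable code, used as a new modality for verifiable visual reasoning. The proposed RECODE framework generates multiple candidate programs that reproduce the input image, uses a critic to select the most faithful reconstruction, and then iteratively refines the code. This turns an ambiguous perceptual task into a verifiable, symbolic problem that supports precise calculation and logical inference, and it significantly outperforms methods that do not use code on benchmarks such as CharXiv, ChartQA, and Geometry3K.
Link: https://arxiv.org/abs/2510.13756
Authors: Junhong Shen,Mu Cai,Bo Hu,Ameet Talwalkar,David A Ross,Cordelia Schmid,Alireza Fathi
Affiliations: Carnegie Mellon University; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: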
Abstract:Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering – the process of reverse-engineering visuals into executable code – as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
[CV-12] InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
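[Quick read]: This paper addresses the limited understanding and generation abilities of lightweight omni-modal large language models in audio-visual multi-turn interaction, with a focus on long-term memory and natural speech interaction. The key is InteractiveOmni, a unified, open-source omni-modal LLM that integrates a vision encoder, an audio encoder, a large language model, and a speech decoder into a single end-to-end model. It is trained with a multi-stage strategy: pre-training for omni-modal understanding, followed by post-training on speech conversation and audio-visual interaction. A curated multi-turn training dataset strengthens long-term memory, and two new benchmarks, a multi-modal multi-turn memory benchmark and a multi-turn speech interaction benchmark, support evaluation. Experiments show that the model matches or exceeds larger models despite its smaller size and achieves state-of-the-art results among similarly sized models on image, audio, and video understanding and speech generation.
Link: https://arxiv.org/abs/2510.13747
Authors: Wenwen Tong,Hewei Guo,Dongchuan Ran,Jiangnan Chen,Jiefan Lu,Kaibin Wang,Keqiang Li,Xiaoxu Zhu,Jiakui Li,Kehan Li,Xueheng Li,Lumin Li,Chenxu Guo,Jiasheng Zhou,Jiandong Chen,Xianye Wu,Jiahao Wang,Silei Wu,Lei Chen,Hanming Deng,Yuxuan Song,Dinghao Zhou,Guiping Zhong,Ken Zheng,Shiyin Kang,Lewei Lu
Affiliations: SenseTime Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: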
Abstract:We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model’s ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to the much larger model like Qwen2.5-Omni-7B on general benchmarks, and it can retain 97% of the performance of the InteractiveOmni-8B while utilizing only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.
[CV-13] UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy
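[Quick read]: This paper addresses the difficulty of computationally reproducing Chinese calligraphy. Existing methods either produce high-quality isolated characters while ignoring page-level aesthetics such as ligatures and spacing, or attempt page synthesis at the cost of calligraphic correctness. The key to the solution is UniCalli, a unified diffusion framework that trains recognition and generation jointly: recognition constrains the generator to preserve character structure, while generation supplies style and layout priors. This yields concept-level abstractions that improve both tasks, especially when data is limited. The model uses asymmetric noising and a rasterized box map as a spatial prior, is trained on a mix of synthetic, labeled, and unlabeled data, achieves strong ligature continuity and layout fidelity, and extends to other ancient scripts such as oracle bone inscriptions and Egyptian hieroglyphs.
Link: https://arxiv.org/abs/2510.13745
Authors: Tianshuo Xu,Kai Wang,Zhifei Chen,Leyi Wu,Tianshui Wen,Fei Chao,Ying-Cong Chen
Affiliations: HKUST(GZ); China University of Geoscience Beijing; Xiamen University; HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages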
Abstract:Computational replication of Chinese calligraphy remains challenging. Existing methods falter, either creating high-quality isolated characters while ignoring page-level aesthetics like ligatures and spacing, or attempting page synthesis at the expense of calligraphic correctness. We introduce UniCalli, a unified diffusion framework for column-level recognition and generation. Training both tasks jointly is deliberate: recognition constrains the generator to preserve character structure, while generation provides style and layout priors. This synergy fosters concept-level abstractions that improve both tasks, especially in limited-data regimes. We curated a dataset of over 8,000 digitized pieces, with ~4,000 densely annotated. UniCalli employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data. The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition. The framework successfully extends to other ancient scripts, including Oracle bone inscriptions and Egyptian hieroglyphs. Code and data can be viewed in this https URL.
[CV-14] Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs
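[Quick read]: This paper addresses two problems in Vision Graph Neural Networks (ViG): graph construction is expensive on large images, and existing methods such as Sparse Vision Graph Attention (SVGA) use a fixed step scale that can cause over-squashing and limits how well long-range connections are exploited. The key to the solution is a new graph construction method, Logarithmic Scalable Graph Construction (LSGC), which limits the number of long-range links to make information passing more efficient. On top of it the authors build LogViG, a hybrid CNN-GNN model with a high-resolution branch and multi-scale feature fusion. Experiments show that LogViG outperforms existing ViG, CNN, and Vision Transformer (ViT) architectures on image classification and semantic segmentation in terms of accuracy, parameter count, and GMACs.
Link: https://arxiv.org/abs/2510.13740
Authors: Mustafa Munir,Alex Zhang,Radu Marculescu
Affiliations: The University of Texas at Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in the Proceedings of the Third Learning on Graphs Conference (LoG 2024)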
Abstract:Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA’s fixed step scale can lead to over-squashing and missing multiple connections to gain the same information that could be gained from a long-range link. Through this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC) to enhance performance by limiting the number of long-range links. To this end, we propose LogViG, a novel hybrid CNN-GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce and apply a high-resolution branch and fuse features between our high-resolution and low-resolution branches for a multi-scale high-resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at this https URL.
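The abstract does not spell out the LSGC rule, so the following is only a minimal sketch under an assumption: each patch on an H×W grid is linked to patches at power-of-two offsets along its row and column, so the number of links per patch grows roughly logarithmically with grid size. The names `log_neighbors` and `fixed_step_neighbors` are hypothetical, and the fixed-step baseline is only in the spirit of SVGA, not a reproduction of it.

```python
def log_neighbors(row, col, height, width):
    """Return grid positions linked to (row, col) at power-of-two offsets.

    Assumption: offsets 1, 2, 4, ... along the row and the column. This is an
    illustrative rule, not necessarily the rule used by LSGC in the paper.
    """
    neighbors = []
    step = 1
    while step < max(height, width):
        for dr, dc in ((step, 0), (-step, 0), (0, step), (0, -step)):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                neighbors.append((r, c))
        step *= 2
    return neighbors


def fixed_step_neighbors(row, col, height, width, step=2):
    """Baseline for comparison: link every `step`-th patch along row and column."""
    neighbors = [(r, col) for r in range(row % step, height, step) if r != row]
    neighbors += [(row, c) for c in range(col % step, width, step) if c != col]
    return neighbors
```

On a 16×16 grid, the corner patch gets 8 links under the logarithmic rule and 14 under the fixed-step rule with `step=2`; the gap widens as the grid grows.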
[CV-15] Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis
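[Quick read]: This paper addresses the clinical fidelity gap in synthesizing high-field MRI from low-field MRI, including anatomical distortion, loss of fine-grained structural detail, and large domain gaps in image contrast. The key to the solution is a Cyclic Self-supervised Diffusion (CSS-Diff) framework that imposes a cycle-consistency constraint to preserve anatomy throughout generation instead of relying only on paired pixel-level supervision. It adds two modules: a slice-wise gap perception network that aligns inter-slice inconsistencies via contrastive learning, and a local structure correction network that improves local feature restoration through self-reconstruction of masked and perturbed patches. The method achieves the best reported performance on several cross-field synthesis tasks and better preserves fine-grained anatomical structures.
Link: https://arxiv.org/abs/2510.13735
Authors: Zhenxuan Zhang,Peiyuan Jing,Zi Wang,Ula Briski,Coraline Beitone,Yue Yang,Yinzhe Wu,Fanwen Wang,Liutao Yang,Jiahao Huang,Zhifan Gao,Zhaolin Chen,Kh Tohidul Islam,Guang Yang,Peter J. Lally
Affiliations: Imperial College London; Sun Yat-sen University; Monash University; King’s College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: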
Abstract:Synthesizing high-quality images from low-field MRI holds significant potential. Low-field MRI is cheaper, more accessible, and safer, but suffers from low resolution and poor signal-to-noise ratio. This synthesis process can reduce reliance on costly acquisitions and expand data availability. However, synthesizing high-field MRI still suffers from a clinical fidelity gap. There is a need to preserve anatomical fidelity, enhance fine-grained structural details, and bridge domain gaps in image contrast. To address these issues, we propose a cyclic self-supervised diffusion (CSS-Diff) framework for high-field MRI synthesis from real low-field MRI data. Our core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint. It enforces anatomical preservation throughout the generative process rather than just relying on paired pixel-level supervision. The CSS-Diff framework further incorporates two novel processes. The slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning. The local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Extensive experiments on cross-field synthesis tasks demonstrate the effectiveness of our method, achieving state-of-the-art performance (e.g., 31.80 ± 2.70 dB in PSNR, 0.943 ± 0.102 in SSIM, and 0.0864 ± 0.0689 in LPIPS). Beyond pixel-wise fidelity, our method also preserves fine-grained anatomical structures compared with the original low-field MRI (e.g., left cerebral white matter error drops from 12.1% to 2.1%, cortex from 4.2% to 3.7%). To conclude, our CSS-Diff can synthesize images that are both quantitatively reliable and anatomically consistent.
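The PSNR figure above is reported in dB. As a reminder of what that number measures, here is a minimal sketch of the standard PSNR definition for two flattened images, assuming intensities scaled to `[0, max_value]`. The paper's exact evaluation protocol (masking, normalization, per-slice or per-volume averaging) is not stated in the abstract, so the function name `psnr` and its defaults are illustrative only.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A uniform error of 0.1 on a unit-range image gives an MSE of 0.01 and therefore a PSNR of about 20 dB.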
[CV-16] LiFMCR: Dataset and Benchmark for Light Field Multi-Camera Registration
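[Quick read]: This paper addresses accurate spatial registration of images from multiple light field cameras, a prerequisite for reliable multi-view light field processing. Existing light field datasets usually cover only single-camera setups and lack external ground truth, which makes geometric consistency in multi-camera systems hard to evaluate. The authors present the LiFMCR dataset, which provides synchronized image sequences from two high-resolution Raytrix R32 plenoptic cameras together with high-precision 6-DoF poses from a Vicon motion capture system as external ground truth, giving a rigorous quantitative benchmark for multi-camera light field registration. As baselines they provide two complementary registration methods: a RANSAC-based 3D transformation estimate from cross-view point clouds, and a plenoptic PnP algorithm that estimates extrinsic 6-DoF poses from single light field images. Both explicitly integrate the plenoptic camera model.
Link: https://arxiv.org/abs/2510.13729
Authors: Aymeric Fleith,Julian Zirbel,Daniel Cremers,Niclas Zeller
Affiliations: Technical University of Munich; Karlsruhe University of Applied Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the International Symposium on Visual Computing (ISVC) 2025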
Abstract:We present LiFMCR, a novel dataset for the registration of multiple micro lens array (MLA)-based light field cameras. While existing light field datasets are limited to single-camera setups and typically lack external ground truth, LiFMCR provides synchronized image sequences from two high-resolution Raytrix R32 plenoptic cameras, together with high-precision 6-degrees of freedom (DoF) poses recorded by a Vicon motion capture system. This unique combination enables rigorous evaluation of multi-camera light field registration methods. As a baseline, we provide two complementary registration approaches: a robust 3D transformation estimation via a RANSAC-based method using cross-view point clouds, and a plenoptic PnP algorithm estimating extrinsic 6-DoF poses from single light field images. Both explicitly integrate the plenoptic camera model, enabling accurate and scalable multi-camera registration. Experiments show strong alignment with the ground truth, supporting reliable multi-view light field processing. Project page: this https URL
[CV-17] Circle of Willis Centerline Graphs: A Dataset and Baseline Algorithm
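[Quick read]: This paper addresses automated quantitative analysis of the Circle of Willis (CoW), in particular the difficulty conventional skeletonization has in extracting reliable centerlines from its complex geometry and the scarcity of public centerline datasets. The authors curate centerline graphs and morphometric features from the TopCoW dataset (200 stroke patients, each imaged with MRA and CTA) and use them to build a baseline algorithm that combines U-Net-based skeletonization with A* graph connection. The baseline reconstructs graph topology with high accuracy (F1 = 1), keeps the average Euclidean node distance below one voxel, and yields robust features such as segment radius, length, and bifurcation ratios (median relative error below 5%, Pearson correlation above 0.95), providing a basis for further clinical research and method development.
Link: https://arxiv.org/abs/2510.13720
Authors: Fabio Musio,Norman Juchler,Kaiyuan Yang,Suprosanna Shit,Chinmay Prabhakar,Bjoern Menze,Sven Hirsch
Affiliations: ZHAW (Zurich University of Applied Sciences)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: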
Abstract:The Circle of Willis (CoW) is a critical network of arteries in the brain, often implicated in cerebrovascular pathologies. Voxel-level segmentation is an important first step toward an automated CoW assessment, but a full quantitative analysis requires centerline representations. However, conventional skeletonization techniques often struggle to extract reliable centerlines due to the CoW’s complex geometry, and publicly available centerline datasets remain scarce. To address these challenges, we used a thinning-based skeletonization algorithm to extract and curate centerline graphs and morphometric features from the TopCoW dataset, which includes 200 stroke patients, each imaged with MRA and CTA. The curated graphs were used to develop a baseline algorithm for centerline and feature extraction, combining U-Net-based skeletonization with A* graph connection. Performance was evaluated on a held-out test set, focusing on anatomical accuracy and feature robustness. Further, we used the extracted features to predict the frequency of fetal PCA variants, confirm theoretical bifurcation optimality relations, and detect subtle modality differences. The baseline algorithm consistently reconstructed graph topology with high accuracy (F1 = 1), and the average Euclidean node distance between reference and predicted graphs was below one voxel. Features such as segment radius, length, and bifurcation ratios showed strong robustness, with median relative errors below 5% and Pearson correlations above 0.95. Our results demonstrate the utility of learning-based skeletonization combined with graph connection for anatomically plausible centerline extraction. We emphasize the importance of going beyond simple voxel-based measures by evaluating anatomical accuracy and feature robustness. The dataset and baseline algorithm have been released to support further method development and clinical research.
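To illustrate the "A* graph connection" step, here is a minimal sketch that joins two skeleton endpoints through a vessel mask with A* search. It is a simplification under stated assumptions: a 2D grid instead of a 3D volume, 4-connectivity, unit step cost, and a Manhattan-distance heuristic. The function `astar` and the mask layout are hypothetical and not taken from the released baseline.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a 2D grid; cells equal to 1 are traversable.

    Returns the path as a list of (row, col) cells, or None if unreachable.
    """
    def heuristic(p):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        if g > cost[node]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]] != 1:
                continue
            new_cost = g + 1
            if new_cost < cost.get(nxt, float("inf")):
                cost[nxt] = new_cost
                came_from[nxt] = node
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None
```

In a real pipeline the step cost would more plausibly depend on the segmentation probability or distance to the vessel wall, so that the connecting path stays near the vessel center; that choice is left open here.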
[CV-18] MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion
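[Quick read]: This paper addresses the difficulty of unifying multi-view generation with customization: existing models either cannot customize with geometric consistency or lack explicit viewpoint control. The proposed MVCustom framework learns the subject's identity and geometry through a feature-field representation and uses a text-to-video diffusion backbone enhanced with dense spatio-temporal attention to keep views consistent. At inference it adds two techniques: depth-aware feature rendering, which enforces geometric consistency, and consistent-aware latent completion, which keeps the customized subject and the surrounding background aligned across viewpoints. According to the authors, it is the only framework that achieves faithful multi-view generation and customization at the same time.
Link: https://arxiv.org/abs/2510.13702
Authors: Minjung Shin,Hyunin Cho,Sooyeon Go,Jin-Hwa Kim,Youngjung Uh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL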
Abstract:Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject’s identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.
[CV-19] Risk-adaptive Activation Steering for Safe Multimodal Large Language Models
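[Quick read]: This paper addresses the difficulty multimodal generative AI models have in recognizing risk when malicious intent is embedded in images, while still giving safe and helpful responses. Existing methods either train on costly safety datasets or align at inference time; the latter often refuses benign queries misclassified as harmful and slows inference through iterative output adjustment. The key to the solution is Risk-adaptive Activation Steering (RAS): queries are reformulated to strengthen cross-modal attention to safety-critical image regions, which enables risk assessment at the query level, and the assessed risk is then used to adaptively steer the model's activations. This produces safe and helpful responses without iterative output adjustment and improves inference speed.
Link: https://arxiv.org/abs/2510.13698
Authors: Jonghyun Park,Minhyuk Seo,Jonghyun Choi
Affiliations: Seoul National University; KU Leuven
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: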
Abstract:One of the key challenges of modern AI models is ensuring that they provide helpful responses to benign queries while refusing malicious ones. But often, the models are vulnerable to multimodal queries with harmful intent embedded in images. One approach for safety alignment is training with extensive safety datasets at the significant costs in both dataset curation and training. Inference-time alignment mitigates these costs, but introduces two drawbacks: excessive refusals from misclassified benign queries and slower inference speed due to iterative output adjustments. To overcome these limitations, we propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions, enabling accurate risk assessment at the query level. Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments. We call this Risk-adaptive Activation Steering (RAS). Extensive experiments across multiple benchmarks on multimodal safety and utility demonstrate that the RAS significantly reduces attack success rates, preserves general task performance, and improves inference speed over prior inference-time defenses.
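The idea of steering activations in proportion to assessed risk can be sketched as follows. This is only an illustration of the general activation-steering pattern, not the paper's procedure: the steering direction, the risk score in `[0, 1]`, the linear scaling, and the `max_strength` parameter are all assumptions, and `steer` is a hypothetical helper.

```python
def steer(hidden, safety_direction, risk, max_strength=4.0):
    """Shift a hidden-state vector along a steering direction, scaled by risk.

    Assumptions: `risk` is a query-level score in [0, 1] and the shift grows
    linearly with it, so benign queries (risk near 0) are left almost unchanged.
    """
    risk = min(max(risk, 0.0), 1.0)  # clamp out-of-range scores
    alpha = max_strength * risk
    return [h + alpha * d for h, d in zip(hidden, safety_direction)]
```

Because the shift is applied once to the activations, there is no generate-check-regenerate loop, which matches the abstract's point about avoiding overhead from iterative output adjustment.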
[CV-20] Generating healthy counterfactuals with denoising diffusion bridge models
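[Quick read]: This paper addresses the generation of healthy counterfactuals in medical imaging, that is, removing pathological regions while preserving a subject's individual anatomy. Methods based on Denoising Diffusion Probabilistic Models (DDPMs) usually train on healthy data only and assume that a partial denoising process cannot model disease regions and will instead reconstruct a closely matched healthy counterpart; this makes it hard to balance anomaly removal against retention of subject-specific features. The authors propose a novel application of Denoising Diffusion Bridge Models (DDBMs), which condition the diffusion process not only on the initial point (the healthy image) but also on the final point (a corresponding synthetically generated pathological image). Treating the pathological image as a structurally informative prior guides the process toward counterfactuals that match the patient's anatomy while selectively removing pathology. The method outperforms previously proposed diffusion models and fully supervised approaches on segmentation and anomaly detection tasks.
Link: https://arxiv.org/abs/2510.13684
Authors: Ana Lawry Aguila,Peirong Liu,Marina Crespo Aguirre,Juan Eugenio Iglesias
Affiliations: Athinoula A. Martinos Center for Biomedical Imaging; Massachusetts General Hospital and Harvard Medical School; ETH Zurich; Computer Science & Artificial Intelligence Lab; Massachusetts Institute of Technology; Hawkes Institute; University College London; Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: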
Abstract:Generating healthy counterfactuals from pathological images holds significant promise in medical imaging, e.g., in anomaly detection or for application of analysis tools that are designed for healthy scans. These counterfactuals should represent what a patient’s scan would plausibly look like in the absence of pathology, preserving individual anatomical characteristics while modifying only the pathological regions. Denoising diffusion probabilistic models (DDPMs) have become popular methods for generating healthy counterfactuals of pathology data. Typically, this involves training on solely healthy data with the assumption that a partial denoising process will be unable to model disease regions and will instead reconstruct a closely matched healthy counterpart. More recent methods have incorporated synthetic pathological images to better guide the diffusion process. However, it remains challenging to guide the generative process in a way that effectively balances the removal of anomalies with the retention of subject-specific features. To solve this problem, we propose a novel application of denoising diffusion bridge models (DDBMs) - which, unlike DDPMs, condition the diffusion process not only on the initial point (i.e., the healthy image), but also on the final point (i.e., a corresponding synthetically generated pathological image). Treating the pathological image as a structurally informative prior enables us to generate counterfactuals that closely match the patient’s anatomy while selectively removing pathology. The results show that our DDBM outperforms previously proposed diffusion models and fully supervised approaches at segmentation and anomaly detection tasks.
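What it means to condition on both endpoints can be shown with the simplest bridge process. The sketch below samples an intermediate state of a Brownian bridge pinned at a healthy image at t = 0 and a pathological image at t = 1. This is an assumption chosen for clarity: DDBMs use specific noise schedules and a learned reverse process that are not reproduced here, and `bridge_sample` is a hypothetical name.

```python
import math
import random


def bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The mean interpolates linearly between the endpoints, and the noise
    standard deviation sigma * sqrt(t * (1 - t)) vanishes at both ends.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    std = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0) for a, b in zip(x0, x1)]
```

Unlike a standard diffusion forward process, which ends in pure noise, every trajectory here ends exactly at the second image, so the reverse process starts from a state that already carries the subject's anatomy.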
[CV-21] FlashWorld: High-quality 3D Scene Generation within Seconds
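[Quick read]: This paper addresses the trade-off between efficiency and visual quality in 3D scene generation: conventional multi-view-oriented (MV-oriented) methods are computationally expensive and slow, while direct 3D-oriented methods are fast but often lose visual quality. The proposed FlashWorld model combines dual-mode pre-training with cross-mode post-training distillation. It first uses the prior of a video diffusion model to train a multi-view diffusion model that supports both MV-oriented and 3D-oriented generation, then distills the high-quality MV-oriented mode into the 3D-oriented mode by distribution matching. This raises visual quality while keeping 3D consistency and reduces the number of denoising steps needed at inference. The method also uses large numbers of single-view images and text prompts to improve generalization to out-of-distribution inputs.
Link: https://arxiv.org/abs/2510.13678
Authors: Xinyang Li,Tengfei Wang,Zixiao Gu,Shengchuan Zhang,Chunchao Guo,Liujuan Cao
Affiliations: Xiamen University; Tencent; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL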
Abstract:We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100× faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented method typically suffers poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model’s generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.
[CV-22] Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning
[Quick read]: This paper addresses open-domain visual entity recognition, where entities unseen during training, long-tail distributions, and high visual ambiguity make conventional classification generalize poorly. The key to the solution is a Knowledge-guided Contrastive Learning (KnowCoL) framework that maps images and text descriptions into a shared semantic space grounded in structured knowledge from Wikidata. By using entity descriptions, type hierarchies, and relational context, the model supports zero-shot recognition and markedly improves accuracy on rare and unseen entities.
Link: https://arxiv.org/abs/2510.13675
Authors: Hongkuan Zhou,Lavdim Halilaj,Sebastian Monka,Stefan Schmid,Yuqicheng Zhu,Jingcheng Wu,Nadeem Nazer,Steffen Staab
Affiliations: University of Mannheim; Tsinghua University; University of Bremen; University of Stuttgart
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. In this work, we propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.
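Contrastive alignment in a shared space, followed by zero-shot recognition as nearest-entity lookup, can be sketched with an InfoNCE-style loss. This is a generic illustration and not KnowCoL's actual objective: the abstract does not state the loss, the temperature, or how Wikidata structure enters, so `info_nce`, `predict`, and the default temperature are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(image_emb, entity_embs, positive_index, temperature=0.07):
    """Cross-entropy of the matching entity against all candidate entities."""
    logits = [cosine(image_emb, e) / temperature for e in entity_embs]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[positive_index]


def predict(image_emb, entity_embs):
    """Zero-shot recognition: index of the most similar entity embedding."""
    scores = [cosine(image_emb, e) for e in entity_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because prediction only needs an embedding for each candidate, an entity never seen in training can still be recognized as long as its description or graph context can be embedded into the same space.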
[CV-23] NTIRE 2025 Challenge on Low Light Image Enhancement: Methods and Results CVPR CVPR2025
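[Quick read]: This paper covers Low-Light Image Enhancement (LLIE), that is, producing brighter, clearer, and more visually appealing images under diverse and challenging lighting conditions. It reviews the NTIRE 2025 LLIE Challenge, evaluating and summarizing the submitted state-of-the-art methods and identifying network architectures and techniques that effectively improve image quality.
Link: https://arxiv.org/abs/2510.13670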
作者: Xiaoning Liu,Zongwei Wu,Florin-Alexandru Vasluianu,Hailong Yan,Bin Ren,Yulun Zhang,Shuhang Gu,Le Zhang,Ce Zhu,Radu Timofte,Kangbiao Shi,Yixu Feng,Tao Hu,Yu Cao,Peng Wu,Yijin Liang,Yanning Zhang,Qingsen Yan,Han Zhou,Wei Dong,Yan Min,Mohab Kishawy,Jun Chen,Pengpeng Yu,Anjin Park,Seung-Soo Lee,Young-Joon Park,Zixiao Hu,Junyv Liu,Huilin Zhang,Jun Zhang,Fei Wan,Bingxin Xu,Hongzhe Liu,Cheng Xu,Weiguo Pan,Songyin Dai,Xunpeng Yi,Qinglong Yan,Yibing Zhang,Jiayi Ma,Changhui Hu,Kerui Hu,Donghang Jing,Tiesheng Chen,Zhi Jin,Hongjun Wu,Biao Huang,Haitao Ling,Jiahao Wu,Dandan Zhan,G Gyaneshwar Rao,Vijayalaxmi Ashok Aralikatti,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Ruirui Lin,Guoxi Huang,Nantheera Anantrasirichai,Qirui Yang,Alexandru Brateanu,Ciprian Orhei,Cosmin Ancuti,Daniel Feijoo,Juan C. Benito,Álvaro García,Marcos V. Conde,Yang Qin,Raul Balmez,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Tianyi Mao,Huan Zheng,Yanyan Wei,Shengeng Tang,Dan Guo,Zhao Zhang,Sabari Nathan,K Uma,A Sasithradevi,B Sathya Bama,S. Mohamed Mansoor Roomi,Ao Li,Xiangtao Zhang,Zhe Liu,Yijie Tang,Jialong Tang,Zhicheng Fu,Gong Chen,Joe Nasti,John Nicholson,Zeyu Xiao,Zhuoyuan Li,Ashutosh Kulkarni,Prashant W. Patil,Santosh Kumar Vipparthi,Subrahmanyam Murala,Duan Liu,Weile Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR NTIRE 2025 Workshop, please refer to this https URL
Abstract:This paper presents a comprehensive review of the NTIRE 2025 Low-Light Image Enhancement (LLIE) Challenge, highlighting the proposed solutions and final outcomes. The objective of the challenge is to identify effective networks capable of producing brighter, clearer, and visually compelling images under diverse and challenging conditions. A remarkable total of 762 participants registered for the competition, with 28 teams ultimately submitting valid entries. This paper thoroughly evaluates the state-of-the-art advancements in LLIE, showcasing the significant progress.
[CV-24] CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas
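[Quick read]: This paper addresses two problems of Masked Autoregressive (MAR) models in video generation: the slow-start problem, where the lack of a structured global prior in early sampling makes generation inefficient, and error accumulation across the autoregression in both spatial and temporal dimensions. The key to the solution is CanvasMAR, whose core idea is a canvas mechanism: a blurred global prediction of the next frame that serves as the starting point for masked generation. The canvas provides structural guidance early in sampling, which speeds up synthesis and improves frame coherence. The model also uses compositional classifier-free guidance to strengthen spatial and temporal conditioning, and noise-based canvas augmentation to improve robustness.
Link: https://arxiv.org/abs/2510.13669
Authors: Zian Li,Muhan Zhang
Affiliations: Institute for Artificial Intelligence, Peking University; School of Intelligence Science and Technology, Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: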
Abstract:Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizer. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism–a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on Kinetics-600 dataset and rivals diffusion-based methods.
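One common way to compose classifier-free guidance over two conditions is to add one guidance term per condition. The sketch below assumes that form, with a canvas (spatial) term and a temporal term, each with its own weight. The abstract does not give CanvasMAR's exact rule, so the decomposition, the weight names, and `compose_guidance` are assumptions for illustration.

```python
def compose_guidance(pred_uncond, pred_canvas, pred_full, w_canvas=2.0, w_temporal=1.5):
    """Combine three model predictions with two separate guidance weights.

    pred_uncond: prediction with no conditioning
    pred_canvas: prediction conditioned on the canvas only
    pred_full:   prediction conditioned on the canvas and previous frames
    """
    return [
        u + w_canvas * (c - u) + w_temporal * (f - c)
        for u, c, f in zip(pred_uncond, pred_canvas, pred_full)
    ]
```

With both weights set to 1 the terms telescope and the result equals the fully conditioned prediction; weights above 1 push the sample further toward each condition independently.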
[CV-25] OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild NEURIPS2025
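[Quick read]: This paper addresses the weak cross-domain generalization of current 3D gaze estimation methods, which stems from scarce annotated data and insufficient diversity in labeled data. The key to the solution is the OmniGaze framework, which draws on a large collection of unlabeled real-world images and filters pseudo labels with a reward model. The reward model combines visual embeddings from a pretrained visual encoder with gaze-related semantic cues generated by a Multimodal Large Language Model (MLLM) to compute a confidence score for each pseudo label. These scores are used to select high-quality pseudo labels and to weight them in the loss, which substantially improves generalization across diverse environments.
Link: https://arxiv.org/abs/2510.13660
Authors: Hongyu Qu,Jianan Wei,Xiangbo Shu,Yazhou Yao,Wenguan Wang,Jinhui Tang
Affiliations: Nanjing University of Science and Technology; Zhejiang University; Nanjing Forestry University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025; Project page: this https URL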
Abstract:Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
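The last step described above, selecting pseudo labels by confidence and weighting them in the loss, can be sketched as follows. The threshold, the normalization by total weight, and the use of angular error as the per-sample loss are assumptions made for this example. The reward model that would produce `scores` is not modeled, and both function names are hypothetical.

```python
import math


def angular_error(pred, target):
    """Angle in radians between two non-zero 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def weighted_pseudo_label_loss(preds, pseudo_labels, scores, threshold=0.5):
    """Confidence-weighted mean angular error over the selected pseudo labels."""
    # Selection: drop pseudo labels whose confidence is below the threshold.
    kept = [(p, y, s) for p, y, s in zip(preds, pseudo_labels, scores) if s >= threshold]
    if not kept:
        return 0.0
    # Weighting: more reliable pseudo labels contribute more to the loss.
    total_weight = sum(s for _, _, s in kept)
    return sum(s * angular_error(p, y) for p, y, s in kept) / total_weight
```

In the test case used to check this sketch, three samples with scores 0.9, 0.6, and 0.1 leave two after thresholding, and the loss is the score-weighted average of their angular errors.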
[CV-26] EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection
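[Quick read]: This paper addresses two challenges in applying generative AI foundation models to 3D editing. First, replacing existing image editing modules with foundation models brings heavy computation and closed-source API restrictions, which makes them hard to plug into iterative editing workflows. Second, video generation foundation models lack multi-view consistency when propagating edits across views, which hurts the final 3D reconstruction. The proposed EditCast3D pipeline uses a video generation foundation model to propagate an edit from a single first frame across the whole dataset before reconstruction, reducing reliance on expensive image editing. It then applies a view selection strategy that explicitly picks consistent, reconstruction-friendly views and uses feedforward reconstruction to avoid costly iterative refinement, improving efficiency while maintaining editing quality.
Link: https://arxiv.org/abs/2510.13652
Authors: Huaizhi Qu,Ruichen Zhang,Shuqing Luo,Luchao Qi,Zhihao Zhang,Xiaoming Liu,Roni Sengupta,Tianlong Chen
Affiliations: University of North Carolina at Chapel Hill; Michigan State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: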
Abstract:Recent advances in foundation models have driven remarkable progress in image editing, yet their extension to 3D editing remains underexplored. A natural approach is to replace the image editing modules in existing workflows with foundation models. However, their heavy computational demands and the restrictions and costs of closed-source APIs make plugging these models into existing iterative editing strategies impractical. To address this limitation, we propose EditCast3D, a pipeline that employs video generation foundation models to propagate edits from a single first frame across the entire dataset prior to reconstruction. While editing propagation enables dataset-level editing via video models, its consistency remains suboptimal for 3D reconstruction, where multi-view alignment is essential. To overcome this, EditCast3D introduces a view selection strategy that explicitly identifies consistent and reconstruction-friendly views and adopts feedforward reconstruction without requiring costly refinement. In combination, the pipeline both minimizes reliance on expensive image editing and mitigates prompt ambiguities that arise when applying foundation models independently across images. We evaluate EditCast3D on commonly used 3D editing datasets and compare it against state-of-the-art 3D editing baselines, demonstrating superior editing quality and high efficiency. These results establish EditCast3D as a scalable and general paradigm for integrating foundation models into 3D editing pipelines. The code is available at this https URL
[CV-27] Local-Global Context-Aware and Structure-Preserving Image Super-Resolution
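[Quick read]: This paper addresses the limitations of current diffusion-based image super-resolution methods on diverse and heavily degraded images, such as noise amplification and incorrect content generation. The key to the solution is a contextually precise super-resolution framework. It introduces Local-Global Context-Aware Attention to maintain both local and global pixel relationships, and a distribution- and perceptual-aligned conditioning mechanism in pixel space that captures fine-grained pixel-level representations while progressively preserving and refining structural information, moving from local detail to global structure. This improves the structural consistency, realism, and perceptual fidelity of the reconstructed images.
Link: https://arxiv.org/abs/2510.13649
Authors: Sanchar Palit,Subhasis Chaudhuri,Biplab Banerjee
Affiliations: Indian Institute of Technology Bombay
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 11 figures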
Abstract:Diffusion models have recently achieved significant success in various image manipulation tasks, including image super-resolution and perceptual quality enhancement. Pretrained text-to-image models, such as Stable Diffusion, have exhibited strong capabilities in synthesizing realistic image content, which makes them particularly attractive for addressing super-resolution tasks. While some existing approaches leverage these models to achieve state-of-the-art results, they often struggle when applied to diverse and highly degraded images, leading to noise amplification or incorrect content generation. To address these limitations, we propose a contextually precise image super-resolution framework that effectively maintains both local and global pixel relationships through Local-Global Context-Aware Attention, enabling the generation of high-quality images. Furthermore, we propose a distribution- and perceptual-aligned conditioning mechanism in the pixel space to enhance perceptual fidelity. This mechanism captures fine-grained pixel-level representations while progressively preserving and refining structural information, transitioning from local content details to the global structural composition. During inference, our method generates high-quality images that are structurally consistent with the original content, mitigating artifacts and ensuring realistic detail restoration. Extensive experiments on multiple super-resolution benchmarks demonstrate the effectiveness of our approach in producing high-fidelity, perceptually accurate reconstructions.
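[CV-28] Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection
[Quick read]: This paper addresses two weaknesses of DINOv2-based few-shot anomaly detection: vulnerability to adversarial perturbations, and raw anomaly scores that do not reflect calibrated uncertainty. First, a lightweight linear head is attached to frozen DINOv2 features, used only to craft perturbations, which enables white-box gradient attacks and a systematic evaluation of how FGSM affects F1, AUROC, AP, and G-mean on MVTec-AD and VisA. Second, post-hoc Platt scaling calibrates the anomaly scores; the calibrated posteriors show significantly higher predictive entropy on adversarially perturbed inputs than on clean ones, which gives a practical flagging mechanism for attack detection and reduces expected calibration error (ECE). The study exposes concrete vulnerabilities of DINOv2-based anomaly detectors and provides an evaluation protocol and baseline for robust, uncertainty-aware anomaly detection.
Link: https://arxiv.org/abs/2510.13643
Authors: Akib Mohammed Khan, Bartosz Krawczyk
Affiliations: Rochester Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 3 tables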
Abstract:Foundation models such as DINOv2 have shown strong performance in few-shot anomaly detection, yet two key questions remain unexamined: (i) how susceptible are these detectors to adversarial perturbations; and (ii) how well do their anomaly scores reflect calibrated uncertainty? Building on AnomalyDINO, a training-free deep nearest-neighbor detector over DINOv2 features, we present one of the first systematic studies of adversarial attacks and uncertainty estimation in this setting. To enable white-box gradient attacks while preserving test-time behavior, we attach a lightweight linear head to frozen DINOv2 features only for crafting perturbations. Using this heuristic, we evaluate the impact of FGSM across the MVTec-AD and VisA datasets and observe consistent drops in F1, AUROC, AP, and G-mean, indicating that imperceptible perturbations can flip nearest-neighbor relations in feature space to induce confident misclassification. Complementing robustness, we probe reliability and find that raw anomaly scores are poorly calibrated, revealing a gap between confidence and correctness that limits safety-critical use. As a simple, strong baseline toward trustworthiness, we apply post-hoc Platt scaling to the anomaly scores for uncertainty estimation. The resulting calibrated posteriors yield significantly higher predictive entropy on adversarially perturbed inputs than on clean ones, enabling a practical flagging mechanism for attack detection while reducing calibration error (ECE). Our findings surface concrete vulnerabilities in DINOv2-based few-shot anomaly detectors and establish an evaluation protocol and baseline for robust, uncertainty-aware anomaly detection. We argue that adversarial robustness and principled uncertainty quantification are not optional add-ons but essential capabilities if anomaly detection systems are to be trustworthy and ready for real-world deployment.
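The calibration step described in this abstract can be illustrated with a minimal sketch. It assumes scalar anomaly scores and binary labels from a held-out set; the function names, the gradient-descent fitting routine, and the entropy threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def platt_probability(score, a, b):
    """Map a raw anomaly score to a probability with a sigmoid (Platt scaling)."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit the sigmoid parameters a and b by gradient descent on the
    binary cross-entropy (an illustrative optimizer, not the paper's)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = platt_probability(s, a, b)
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def binary_entropy(p, eps=1e-12):
    """Predictive entropy (in nats) of a Bernoulli posterior."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def flag_uncertain(scores, a, b, threshold):
    """Flag inputs whose calibrated posterior has entropy above a threshold
    (the threshold value is an assumption to be tuned on held-out data)."""
    return [binary_entropy(platt_probability(s, a, b)) > threshold for s in scores]
```

In this sketch, inputs whose calibrated probability sits near 0.5 have entropy close to log 2 and are flagged, which mirrors the idea of using high predictive entropy as an attack indicator.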
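[CV-29] Challenges, Advances, and Evaluation Metrics in Medical Image Enhancement: A Systematic Literature Review
[Quick read]: This paper addresses quality problems that are common in medical images, such as noise, artifacts, and low contrast, which limit their diagnostic potential. It systematically reviews and evaluates current enhancement methods, in particular conventional mathematical methods and deep learning techniques, and stresses the central role of image quality assessment (IQA) metrics in measuring enhancement performance. The review finds that most work concentrates on MRI and multi-modal imaging, that specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored, and that the IQA metrics introduced are predominantly non-reference-based. The authors conclude that more comprehensive and standardized evaluation is needed to advance medical image enhancement.
Link: https://arxiv.org/abs/2510.13638
Authors: Chun Wai Chin, Haniza Yazid, Hoi Leong Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: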
Abstract:Medical image enhancement is crucial for improving the quality and interpretability of diagnostic images, ultimately supporting early detection, accurate diagnosis, and effective treatment planning. Despite advancements in imaging technologies such as X-ray, CT, MRI, and ultrasound, medical images often suffer from challenges like noise, artifacts, and low contrast, which limit their diagnostic potential. Addressing these challenges requires robust preprocessing, denoising algorithms, and advanced enhancement methods, with deep learning techniques playing an increasingly significant role. This systematic literature review, following the PRISMA approach, investigates the key challenges, recent advancements, and evaluation metrics in medical image enhancement. By analyzing findings from 39 peer-reviewed studies, this review provides insights into the effectiveness of various enhancement methods across different imaging modalities and the importance of evaluation metrics in assessing their impact. Key issues like low contrast and noise are identified as the most frequent, with MRI and multi-modal imaging receiving the most attention, while specialized modalities such as histopathology, endoscopy, and bone scintigraphy remain underexplored. Out of the 39 studies, 29 utilize conventional mathematical methods, 9 focus on deep learning techniques, and 1 explores a hybrid approach. In terms of image quality assessment, 18 studies employ both reference-based and non-reference-based metrics, 9 rely solely on reference-based metrics, and 12 use only non-reference-based metrics, with a total of 65 IQA metrics introduced, predominantly non-reference-based. This review highlights current limitations, research gaps, and potential future directions for advancing medical image enhancement.
[CV-30] AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset
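[Quick read]: This paper addresses two problems: existing anomaly recognition methods rely mainly on visual data and become unreliable under occlusion, low illumination, and adverse weather, and the lack of large-scale synchronized audio-visual datasets holds back multimodal anomaly recognition. The proposed solution is AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework with four modules: an audio feature extractor (Wav2Vec2, for robust temporal features from raw audio), a video feature extractor (MobileViT, for local and global visual representations), an early fusion mechanism, and a Multi-Stage Temporal Convolutional Network (MTCN) that models cross-modal relationships and learns long-range temporal dependencies for spatiotemporal reasoning. The paper also introduces the VAAR dataset as a medium-scale benchmark for multimodal anomaly recognition.
Link: https://arxiv.org/abs/2510.13630
Authors: Amjid Ali, Zulfiqar Ahmad Khan, Altaf Hussain, Muhammad Munsif, Adnan Hussain, Sung Wook Baik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: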
Abstract:Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, a fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.
[CV-31] Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues
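[Quick read]: This paper addresses the limited performance of UAV-based multimodal object detection with visible (RGB) and infrared (IR) images in complex real-world scenes, and in particular the failure of existing datasets to cover diverse imaging conditions such as varying altitude, viewing angle, weather, and illumination. The key contribution is a prompt-guided condition-aware dynamic fusion (PCDF) mechanism: imaging conditions are encoded as text prompts, and a task-specific soft-gating transformation models the relationship between conditions and multimodal contributions, so that the weights of the RGB and IR modalities are adjusted adaptively. A prompt-guided condition-decoupling module lets the model operate when condition annotations are unavailable.
Link: https://arxiv.org/abs/2510.13620
Authors: Chen Chen, Kangcheng Bin, Ting Hu, Jiahao Qi, Xingyue Liu, Tianpeng Liu, Zhen Liu, Yongxiang Liu, Ping Zhong
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: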
Abstract:Unmanned aerial vehicle (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality datasets. However, existing datasets struggle to fully capture real-world complexity because they cover limited imaging conditions. To this end, we introduce a high-diversity dataset, ATR-UMOD, covering varying scenarios, spanning altitudes from 80m to 300m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures applicability in practice when condition annotations are unavailable. Experiments on the ATR-UMOD dataset reveal the effectiveness of PCDF.
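The soft-gating idea mentioned in this abstract can be sketched roughly as follows. This is a toy illustration under strong assumptions: the condition embedding is a short vector, the gate is a single scalar from a linear map plus sigmoid, and all names (`soft_gate`, `fuse`, `gate_weights`) are hypothetical. The actual PCDF transformation is not specified in the abstract.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_gate(condition_embedding, gate_weights, gate_bias=0.0):
    """Scalar gate in (0, 1) computed from a condition embedding
    (a hypothetical stand-in for an encoded text prompt)."""
    logit = sum(c * w for c, w in zip(condition_embedding, gate_weights)) + gate_bias
    return sigmoid(logit)

def fuse(rgb_features, ir_features, gate):
    """Convex combination: gate weights RGB, (1 - gate) weights IR."""
    return [gate * r + (1.0 - gate) * i for r, i in zip(rgb_features, ir_features)]

# Toy usage: a "daytime" condition favors RGB, a "night" condition favors IR.
weights = [2.0, -2.0]                      # assumed gate parameters
day_gate = soft_gate([1.0, 0.0], weights)
night_gate = soft_gate([0.0, 1.0], weights)
fused_night = fuse([0.2, 0.4], [0.9, 0.7], night_gate)
```

A soft gate keeps both modalities in play rather than switching one off, which is the usual motivation for gating over hard selection.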
[CV-32] XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation ICASSP2026
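[Quick read]: This paper addresses the trade-off between accuracy and model size in depth estimation for autonomous driving, and uses radar-camera fusion to improve robustness in adverse conditions. The proposed lightweight architecture, XD-RCDepth, relies on two knowledge-distillation strategies to preserve performance under compression and to improve interpretability: an explainability-aligned distillation that transfers the teacher's saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. The model has 29.7% fewer parameters than the state-of-the-art lightweight baseline, the distillation reduces MAE by 7.97% compared with direct training, and the method balances accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets.
Link: https://arxiv.org/abs/2510.13565
Authors: Huawei Sun, Zixu Wang, Xiangyuan Peng, Julius Ott, Georg Stettinger, Lorenzo Servadei, Robert Wille
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICASSP 2026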
Abstract:Depth estimation remains central to autonomous driving, and radar-camera fusion offers robustness in adverse conditions by providing complementary geometric cues. In this paper, we present XD-RCDepth, a lightweight architecture that reduces the parameters by 29.7% relative to the state-of-the-art lightweight baseline while maintaining comparable accuracy. To preserve performance under compression and enhance interpretability, we introduce two knowledge-distillation strategies: an explainability-aligned distillation that transfers the teacher’s saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. Together, these components reduce the MAE by 7.97% compared with direct training and deliver competitive accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets.
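A depth-distribution distillation of the kind this abstract describes might look like the sketch below. It assumes that both teacher and student output per-pixel logits over a shared set of depth bins and that the loss is a temperature-softened KL divergence; the bin layout, the temperature, and the function names are illustrative assumptions, not details from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_depth(probs, bin_centers):
    """Recover a continuous depth as the probability-weighted bin center."""
    return sum(p * c for p, c in zip(probs, bin_centers))

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over the depth bins of one pixel."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-classification distillation loss for a single pixel."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)
```

Treating depth as a distribution over bins lets the student learn from the teacher's uncertainty across neighboring depths, not only from its single best estimate.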
[CV-33] Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents
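[Quick read]: This paper addresses the limited robustness of facial expression recognition (FER) under cross-cultural variation and perceptually degraded visual conditions; existing evaluations usually assume homogeneous data and high-quality images. The key contribution is an agent-based, streaming benchmark. Each agent operates in a frozen CLIP feature space with a lightweight residual adapter that is trained online at sigma=0 and fixed during testing. The environment simulates visual degradation with sigma-scheduled Gaussian blur and varies the cultural composition (monocultural: Western-only or Asian-only; mixed: balanced 5/5 and imbalanced 8/2 and 2/8) as well as the spatial contact structure, which quantifies how cultural composition and interaction structure affect FER robustness as conditions deteriorate.
Link: https://arxiv.org/abs/2510.13557
Authors: David Freire-Obregón, José Salas-Cáceres, Javier Lorenzo-Navarro, Oliverio J. Santana, Daniel Hernández-Sosa, Modesto Castrillón-Santana
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)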
Comments: Accepted for presentation at the International Symposium on Agentic Artificial Intelligence Systems (AAIS 2025)
Abstract:Facial expression recognition (FER) must remain robust under both cultural variation and perceptually degraded visual conditions, yet most existing evaluations assume homogeneous data and high-quality imagery. We introduce an agent-based, streaming benchmark that reveals how cross-cultural composition and progressive blurring interact to shape face recognition robustness. Each agent operates in a frozen CLIP feature space with a lightweight residual adapter trained online at sigma=0 and fixed during testing. Agents move and interact on a 5x5 lattice, while the environment provides inputs with sigma-scheduled Gaussian blur. We examine monocultural populations (Western-only, Asian-only) and mixed environments with balanced (5/5) and imbalanced (8/2, 2/8) compositions, as well as different spatial contact structures. Results show clear asymmetric degradation curves between cultural groups: JAFFE (Asian) populations maintain higher performance at low blur but exhibit sharper drops at intermediate stages, whereas KDEF (Western) populations degrade more uniformly. Mixed populations exhibit intermediate patterns, with balanced mixtures mitigating early degradation, but imbalanced settings amplify majority-group weaknesses under high blur. These findings quantify how cultural composition and interaction structure influence the robustness of FER as perceptual conditions deteriorate.
[CV-34] Accelerated Feature Detectors for Visual SLAM: A Comparative Study of FPGA vs GPU
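[Quick read]: This paper addresses the cost of the feature detection module in Visual SLAM (V-SLAM) systems running on power-constrained platforms such as drones. The question is how hardware acceleration affects the run-time performance, energy efficiency, and accuracy of feature detection and of the full V-SLAM pipeline. The study compares GPU-accelerated and FPGA-accelerated implementations. For non-learning-based detectors (FAST and Harris), the GPU implementations perform better. For the learning-based detector SuperPoint, the FPGA implementation achieves up to 3.1× better run-time performance and 1.4× better energy efficiency. FPGA-accelerated V-SLAM reaches comparable run-time performance, with better FPS in 2 of 5 dataset sequences, while GPU-accelerated V-SLAM is generally more accurate. Hardware-accelerated feature detection can also improve the overall pipeline by allowing global bundle adjustment to be invoked less often without sacrificing accuracy.
Link: https://arxiv.org/abs/2510.13546
Authors: Ruiqi Ye, Mikel Luján
Affiliations: University of Manchester
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Performance (cs.PF); Robotics (cs.RO)
Comments: 12 pages, 7 figures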
Abstract:Feature detection is a common yet time-consuming module in Simultaneous Localization and Mapping (SLAM) implementations, which are increasingly deployed on power-constrained platforms, such as drones. Graphics Processing Units (GPUs) have been a popular accelerator for computer vision in general, and feature detection and SLAM in particular. On the other hand, System-on-Chips (SoCs) with integrated Field Programmable Gate Array (FPGA) are also widely available. This paper presents the first study of hardware-accelerated feature detectors considering a Visual SLAM (V-SLAM) pipeline. We offer new insights by comparing the best GPU-accelerated FAST, Harris, and SuperPoint implementations against the FPGA-accelerated counterparts on modern SoCs (Nvidia Jetson Orin and AMD Versal). The evaluation shows that when using a non-learning-based feature detector such as FAST and Harris, their GPU implementations, and the GPU-accelerated V-SLAM can achieve better run-time performance and energy efficiency than the FAST and Harris FPGA implementations as well as the FPGA-accelerated V-SLAM. However, when considering a learning-based detector such as SuperPoint, its FPGA implementation can achieve better run-time performance and energy efficiency (up to 3.1× and 1.4× improvements, respectively) than the GPU implementation. The FPGA-accelerated V-SLAM can also achieve comparable run-time performance compared to the GPU-accelerated V-SLAM, with better FPS in 2 out of 5 dataset sequences. When considering the accuracy, the results show that the GPU-accelerated V-SLAM is more accurate than the FPGA-accelerated V-SLAM in general. Last but not least, the use of hardware acceleration for feature detection could further improve the performance of the V-SLAM pipeline by having the global bundle adjustment module invoked less frequently without sacrificing accuracy.
ACM classes: C.3; C.4; I.4.6. Cite as: arXiv:2510.13546 [cs.CV] (or arXiv:2510.13546v1 [cs.CV] for this version). DOI: https://doi.org/10.48550/arXiv.2510.13546
[CV-35] Learning Neural Parametric 3D Breast Shape Models for Metrical Surface Reconstruction From Monocular RGB Videos
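[Quick read]: This paper addresses the high cost, hardware dependence, and limited accuracy of current 3D breast surface reconstruction. Commercial scanning solutions are accurate but prohibitively expensive, and low-cost alternatives often fall short on geometric accuracy and detail. The key contribution is a parametric breast shape model based on localized implicit neural representations (liRBSM), which decomposes the breast domain into multiple local neural signed distance functions (SDFs) anchored at anatomical landmarks. Compared with the global single-SDF model (iRBSM), it yields more detailed and more faithful surface reconstructions. Paired with an off-the-shelf Structure-from-Motion (SfM) pipeline, it forms an open-source pipeline that recovers 3D breast geometry from ordinary monocular RGB video without specialized hardware or proprietary software, with an error below 2 mm and a run time of less than six minutes.
Link: https://arxiv.org/abs/2510.13540
Authors: Maximilian Weiherer, Antonia von Riedheim, Vanessa Brébant, Bernhard Egger, Christoph Palm
Affiliations: University of Innsbruck; Medical University of Innsbruck
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures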
Abstract:We present a neural parametric 3D breast shape model and, based on this model, introduce a low-cost and accessible 3D surface reconstruction pipeline capable of recovering accurate breast geometry from a monocular RGB video. In contrast to widely used, commercially available yet prohibitively expensive 3D breast scanning solutions and existing low-cost alternatives, our method requires neither specialized hardware nor proprietary software and can be used with any device that is able to record RGB videos. The key building blocks of our pipeline are a state-of-the-art, off-the-shelf Structure-from-motion pipeline, paired with a parametric breast model for robust and metrically correct surface reconstruction. Our model, similarly to the recently proposed implicit Regensburg Breast Shape Model (iRBSM), leverages implicit neural representations to model breast shapes. However, unlike the iRBSM, which employs a single global neural signed distance function (SDF), our approach – inspired by recent state-of-the-art face models – decomposes the implicit breast domain into multiple smaller regions, each represented by a local neural SDF anchored at anatomical landmark positions. When incorporated into our surface reconstruction pipeline, the proposed model, dubbed liRBSM (short for localized iRBSM), significantly outperforms the iRBSM in terms of reconstruction quality, yielding more detailed surface reconstruction than its global counterpart. Overall, we find that the introduced pipeline is able to recover high-quality 3D breast geometry within an error margin of less than 2 mm. Our method is fast (requires less than six minutes), fully transparent and open-source, and – together with the model – publicly available at this https URL.
[CV-36] High Semantic Features for the Continual Learning of Complex Emotions: a Lightweight Solution
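[Quick read]: This paper addresses catastrophic forgetting in incremental learning, where learning new tasks erodes knowledge of old ones, in the setting of complex emotion recognition. The key idea is to use facial Action Units (AUs), which describe facial muscle movements. These features are non-transient and highly semantic, so they transfer stably across tasks and mitigate forgetting. AUs outperform features extracted by both shallow and deep convolutional neural networks when incrementally learning complex, compound emotions, reaching an accuracy of 0.75 on the CFEE dataset, which the authors report compares favorably with state-of-the-art results. The resulting model is lightweight with a small memory footprint.
Link: https://arxiv.org/abs/2510.13534
Authors: Thibault Geoffroy, Gauthier Gerspacher, Lionel Prevost
Affiliations: EsieaLab - Learning, Data, Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 14 figures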
Abstract:Incremental learning is a complex process due to potential catastrophic forgetting of old tasks when learning new ones. This is mainly due to transient features that do not carry over from task to task. In this paper, we focus on complex emotion recognition. First, we learn basic emotions and then, incrementally, like humans, complex emotions. We show that Action Units, describing facial muscle movements, are non-transient, highly semantic features that outperform those extracted by both shallow and deep convolutional neural networks. Thanks to this ability, our approach achieves interesting results when incrementally learning complex, compound emotions, with an accuracy of 0.75 on the CFEE dataset, and compares favorably with state-of-the-art results. Moreover, it results in a lightweight model with a small memory footprint.
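[CV-37] UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
[Quick read]: This paper addresses weaknesses in negative mining for universal multimodal embedding models: difficulty capturing subtle semantic differences among candidates, limited diversity of negative samples, and limited ability to distinguish false negatives from hard negatives. The key idea is an MLLM-as-a-Judge mechanism, in which a multimodal large language model (MLLM) assesses the semantic alignment of query-candidate pairs and produces soft semantic matching scores. These scores are used to build a high-quality hard negative set and also serve as soft labels: the similarity matrix in embedding space is aligned with the soft score matrix, which strengthens the model's ability to separate semantically close candidates. The paper also proposes UniME-V2-Reranker, trained on the mined hard negatives with joint pairwise and listwise optimization, to further improve retrieval performance.
Link: https://arxiv.org/abs/2510.13515
Authors: Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing
Affiliations: Google; Stanford University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 11 tables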
Abstract:Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
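The hard-negative mining step described in this abstract can be pictured with a small sketch. It assumes each retrieved candidate already carries a judge score in [0, 1], and it treats any candidate scoring within a margin of the positive as a likely false negative. The margin, the value of `k`, and the function name are hypothetical; the abstract does not state the actual selection rule.

```python
def mine_hard_negatives(candidates, positive_score, margin=0.1, k=2):
    """Select hard negatives from judged candidates.

    candidates: list of (candidate_id, judge_score) pairs, e.g. from a
    global retrieval step followed by an MLLM judge (scores assumed given).
    Candidates scoring at or above positive_score - margin are dropped as
    likely false negatives; the k highest-scoring survivors are returned.
    """
    cutoff = positive_score - margin
    kept = [(cid, score) for cid, score in candidates if score < cutoff]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in kept[:k]]

# Toy usage with made-up scores: "a" is too close to the positive and is
# treated as a false negative; "b" and "d" are the hardest true negatives.
judged = [("a", 0.95), ("b", 0.7), ("c", 0.2), ("d", 0.6)]
hard_negatives = mine_hard_negatives(judged, positive_score=0.9)
```

Filtering before ranking is the point of the design: the highest-similarity candidates are exactly where false negatives concentrate, so taking the top of the raw list would inject wrong supervision.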
[CV-38] ExpressNet-MoE: A Hybrid Deep Neural Network for Emotion Recognition
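[Quick read]: This paper addresses the difficulty of facial emotion recognition (FER) in real-world settings, where pose variation, occlusion, illumination changes, and demographic diversity degrade performance, and where these FER limitations in turn make applications such as engagement detection hard to model. The proposed solution is ExpressNet-MoE, a hybrid deep learning model that combines Convolutional Neural Networks (CNNs) with a Mixture of Experts (MoE) framework. It dynamically selects the most relevant expert networks for adaptive feature selection, which improves generalization across datasets; it uses multi-scale feature extraction to capture global and local facial features; and it uses a residual network backbone for deep feature learning.
Link: https://arxiv.org/abs/2510.13493
Authors: Deeptimaan Banerjee, Prateek Gothwal, Ashis Kumer Biswas
Affiliations: University of Colorado Denver
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The current version of the manuscript contains 17 pages including text, 13 figures, and 4 tables. The manuscript is currently under review at a journal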
Abstract:In many domains, including online education, healthcare, security, and human-computer interaction, facial emotion recognition (FER) is essential. Despite its significance, real-world FER remains difficult because of factors such as variable head positions, occlusions, illumination shifts, and demographic diversity. Engagement detection, which is essential for applications like virtual learning and customer services, is frequently challenging because of the FER limitations of many current models. In this article, we propose ExpressNet-MoE, a novel hybrid deep learning model that blends Convolutional Neural Networks (CNNs) with a Mixture of Experts (MoE) framework to overcome these difficulties. Our model dynamically chooses the most pertinent expert networks, which aids generalization and provides the flexibility to model a wide variety of datasets. Our model improves the accuracy of emotion recognition by utilizing multi-scale feature extraction to collect both global and local facial features. ExpressNet-MoE includes numerous CNN-based feature extractors, a MoE module for adaptive feature selection, and finally a residual network backbone for deep feature learning. To demonstrate the efficacy of our proposed model, we evaluated it on several datasets and compared it with current state-of-the-art methods. Our model achieves accuracies of 74.77% on AffectNet (v7), 72.55% on AffectNet (v8), 84.29% on RAF-DB, and 64.66% on FER-2013. The results show how adaptive our model is and how it may be used to develop end-to-end emotion recognition systems in practical settings. Reproducible codes and results are made publicly accessible at this https URL.
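The dynamic expert selection mentioned in this abstract is commonly realized with top-k gating, sketched below. Whether ExpressNet-MoE uses exactly this form is not stated in the abstract, so top-k routing, the renormalizing softmax, and the function names here are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(gate_logits, k):
    """Keep the k largest gate logits and renormalize them with softmax;
    all other experts receive weight 0."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    probs = softmax([gate_logits[i] for i in top])
    weights = [0.0] * len(gate_logits)
    for index, p in zip(top, probs):
        weights[index] = p
    return weights

def mixture_output(expert_outputs, weights):
    """Weighted sum of per-expert feature vectors."""
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
```

With `k` smaller than the number of experts, only the selected experts contribute to the output, which is what makes the selection input-dependent rather than a fixed ensemble average.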
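[CV-39] Through the Lens of Doubt: Robust and Efficient Uncertainty Estimation for Visual Place Recognition
[Quick read]: This paper addresses the difficulty of estimating match confidence in Visual Place Recognition (VPR) under changing conditions such as lighting, season, and viewpoint, which matters for failure-critical uses such as loop closure detection in SLAM. The key contribution is three training-free uncertainty metrics computed from the similarity scores of any existing VPR method. Similarity Distribution (SD) measures match distinctiveness through the score separation between candidates; Ratio Spread (RS) evaluates competitive ambiguity among the top-scoring locations; and Statistical Uncertainty (SU) combines SD and RS into a unified metric that generalizes across datasets and VPR methods without validation data. The metrics need no additional model training and no computationally expensive geometric verification, and they add negligible overhead. The authors report that the metrics discriminate well between correct and incorrect matches and consistently outperform existing approaches.
Link: https://arxiv.org/abs/2510.13464
Authors: Emily Miller, Michael Milford, Muhammad Burhan Hafez, SD Ramchurn, Shoaib Ehsan
Affiliations: University of Southampton; Queensland University of Technology; University of Essex
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: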
Abstract:Visual Place Recognition (VPR) enables robots and autonomous vehicles to identify previously visited locations by matching current observations against a database of known places. However, VPR systems face significant challenges when deployed across varying visual environments, lighting conditions, seasonal changes, and viewpoint changes. Failure-critical VPR applications, such as loop closure detection in simultaneous localization and mapping (SLAM) pipelines, require robust estimation of place matching uncertainty. We propose three training-free uncertainty metrics that estimate prediction confidence by analyzing inherent statistical patterns in similarity scores from any existing VPR method. Similarity Distribution (SD) quantifies match distinctiveness by measuring score separation between candidates; Ratio Spread (RS) evaluates competitive ambiguity among top-scoring locations; and Statistical Uncertainty (SU) is a combination of SD and RS that provides a unified metric that generalizes across datasets and VPR methods without requiring validation data to select the optimal metric. All three metrics operate without additional model training, architectural modifications, or computationally expensive geometric verification. Comprehensive evaluation across nine state-of-the-art VPR methods and six benchmark datasets confirms that our metrics excel at discriminating between correct and incorrect VPR matches and consistently outperform existing approaches while maintaining negligible computational overhead, making them deployable for real-time robotic applications across varied environmental conditions with improved precision-recall performance.
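The abstract does not give formulas for SD, RS, or SU, so the sketch below does not reproduce them. It only illustrates the general idea of training-free confidence signals computed from a query's similarity scores, using two hypothetical proxies: how far the best score stands out from the rest, and how close the runner-up is to the best. Both function names and definitions are assumptions for illustration.

```python
import statistics

def top_gap_zscore(similarities):
    """Distance of the best score from the mean of the remaining scores,
    in units of their standard deviation. Larger means a more distinctive
    match. Expects at least three scores."""
    ranked = sorted(similarities, reverse=True)
    best, rest = ranked[0], ranked[1:]
    spread = statistics.pstdev(rest)
    if spread == 0.0:
        # All other candidates tie: the match is either clearly
        # separated from them or indistinguishable from them.
        return float("inf") if best > rest[0] else 0.0
    return (best - statistics.mean(rest)) / spread

def second_to_best_ratio(similarities):
    """Ratio of the runner-up score to the best score. Values near 1
    indicate ambiguity. Expects a positive best score."""
    ranked = sorted(similarities, reverse=True)
    return ranked[1] / ranked[0]
```

Signals of this kind need only the score vector a VPR method already produces, which is why they add almost no computation on top of retrieval.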
[CV-40] VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
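[Quick read]: This paper addresses text-to-3D generation, and specifically how to combine the generative power of a latent text-to-video model with the geometric abilities of a feedforward 3D reconstruction model to produce consistent, perceptually convincing 3D scenes. The proposed VIST3A framework has two parts. First, model stitching identifies the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and joins the two networks there, which requires only a small dataset and no labels. Second, direct reward finetuning aligns the text-to-video generator with the stitched 3D decoder, so that the generated latents decode into consistent 3D geometry. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats, and with a suitable 3D base model the framework also enables high-quality text-to-pointmap generation.
Link: https://arxiv.org/abs/2510.13454
Authors: Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler
Affiliations: ETH Zurich; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL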
Abstract:The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as “generator” with the geometric abilities of a recent (feedforward) 3D reconstruction system as “decoder”. We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
[CV-41] Near-Infrared Hyperspectral Imaging Applications in Food Analysis – Improving Algorithms and Methodologies
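[Quick read]: This thesis examines how to use near-infrared hyperspectral imaging (NIR-HSI) effectively for modeling food quality parameters, with a focus on parameters for which both chemical and physical visual information are relevant. Models that use only spectral or only spatial information perform worse on such parameters, and chemical map generation is limited by the difficulty of obtaining spatially resolved reference values. The key finding is that a 2D convolutional neural network (2D CNN) augmented with an initial layer dedicated to spectral convolution learns a spectral preprocessing similar to that applied by domain experts, which improves predictive performance. For the mean content of chemical parameters, partial least squares (PLS) performs equally well and remains the recommended approach. For chemical map generation, where PLS gave non-smooth maps and out-of-range pixel-wise predictions, the augmented 2D CNN mitigated these issues.
Link: https://arxiv.org/abs/2510.13452
Authors: Ole-Christian Galbo Engstrøm
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: PhD thesis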
Abstract:This thesis investigates the application of near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis. The investigation is conducted through four studies operating with five research hypotheses. For several analyses, the studies compare models based on convolutional neural networks (CNNs) and partial least squares (PLS). Generally, joint spatio-spectral analysis with CNNs outperforms spatial analysis with CNNs and spectral analysis with PLS when modeling parameters where chemical and physical visual information are relevant. When modeling chemical parameters with a 2-dimensional (2D) CNN, augmenting the CNN with an initial layer dedicated to performing spectral convolution enhances its predictive performance by learning a spectral preprocessing similar to that applied by domain experts. Still, PLS-based spectral modeling performs equally well for analysis of the mean content of chemical parameters in samples and is the recommended approach. Modeling the spatial distribution of chemical parameters with NIR-HSI is limited by the ability to obtain spatially resolved reference values. Therefore, a study used bulk mean references for chemical map generation of fat content in pork bellies. A PLS-based approach gave non-smooth chemical maps and pixel-wise predictions outside the range of 0-100%. Conversely, a 2D CNN augmented with a spectral convolution layer mitigated all issues arising with PLS. The final study attempted to model barley’s germinative capacity by analyzing NIR spectra, RGB images, and NIR-HSI images. However, the results were inconclusive due to the dataset’s low degree of germination. Additionally, this thesis has led to the development of two open-sourced Python packages. The first facilitates fast PLS-based modeling, while the second facilitates very fast cross-validation of PLS and other classical machine learning models with a new algorithm.
[CV-42] Beyond Pixels: A Differentiable Pipeline for Probing Neuronal Selectivity in 3D
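[Quick read]: This paper addresses the difficulty in visual neuroscience of characterizing neuronal selectivity to physically interpretable 3D scene properties such as shape, pose, and lighting: current approaches mainly operate on 2D pixels and cannot isolate the influence of these 3D factors. The key to the solution is a differentiable rendering pipeline that optimizes deformable meshes to generate, directly in 3D, the optimal stimuli that maximize neuronal responses (MEIs, Maximally Exciting Images). Mesh deformations are parameterized with radial basis functions, and the learned offsets and scales maximize neuronal responses while enforcing geometric regularity. Applied to models of monkey area V4, the method enables probing neuronal selectivity to interpretable 3D factors and connects inverse graphics with systems neuroscience.
Link: https://arxiv.org/abs/2510.13433
Authors: Pavithra Elumalai, Mohammad Bashiri, Goirik Chakrabarty, Suhas Shrinivasan, Fabian H. Sinz
Affiliations: University of Göttingen; Campus Institute Data Science (CIDAS); Lower Saxony Center for AI & Causal Methods in Medicine; International Max Planck Research School for Intelligent Systems; Noselab GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Symmetry and Geometry in Neural Representations 2025 (Extended Abstract Track)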
Abstract:Visual perception relies on inference of 3D scene properties such as shape, pose, and lighting. To understand how visual sensory neurons enable robust perception, it is crucial to characterize their selectivity to such physically interpretable factors. However, current approaches mainly operate on 2D pixels, making it difficult to isolate selectivity for physical scene properties. To address this limitation, we introduce a differentiable rendering pipeline that optimizes deformable meshes to obtain MEIs directly in 3D. The method parameterizes mesh deformations with radial basis functions and learns offsets and scales that maximize neuronal responses while enforcing geometric regularity. Applied to models of monkey area V4, our approach enables probing neuronal selectivity to interpretable 3D factors such as pose and lighting. This approach bridges inverse graphics with systems neuroscience, offering a way to probe neural selectivity with physically grounded, 3D stimuli beyond conventional pixel-based methods.
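The radial-basis-function parameterization mentioned in the abstract can be illustrated with a minimal sketch. It assumes Gaussian kernels and a plain list of control centers; the function name `deform_vertices`, the kernel choice, and the data layout are illustrative assumptions, not details taken from the paper.

```python
import math

def deform_vertices(vertices, centers, offsets, scales):
    """Displace each vertex by a Gaussian-RBF-weighted sum of per-center offsets.

    Assumed (hypothetical) layout: vertices and centers are coordinate tuples,
    offsets holds one displacement vector per center, and scales holds one
    positive kernel width per center.
    """
    deformed = []
    for v in vertices:
        moved = list(v)
        for c, o, s in zip(centers, offsets, scales):
            # Squared distance from the vertex to the RBF center.
            d2 = sum((vi - ci) ** 2 for vi, ci in zip(v, c))
            # Gaussian kernel weight: 1 at the center, decaying with distance.
            w = math.exp(-d2 / (2.0 * s * s))
            for axis in range(len(moved)):
                moved[axis] += w * o[axis]
        deformed.append(tuple(moved))
    return deformed
```

In a differentiable pipeline the offsets and scales would be the parameters updated by gradient ascent on the neuronal response; here they are fixed inputs so that only the deformation step is shown.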
[CV-43] CoDS: Enhancing Collaborative Perception in Heterogeneous Scenarios via Domain Separation
[Quick read]: This paper addresses feature inconsistency in collaborative perception for autonomous driving under heterogeneous scenarios: when vehicles use different perception models (for example, different encoders), methods that assume identical encoders for all agents lose performance. Existing methods also rely on transformer-based domain adaptation modules, which makes inference inefficient on mobile devices. The key to the solution is the CoDS framework, which achieves efficient and robust feature alignment through two core modules: the Lightweight Spatial-Channel Resizer (LSCR), which aligns neighbor features across spatial and channel dimensions at low cost, and Distribution Alignment via Domain Separation (DADS), whose encoder-specific and encoder-agnostic domain separation modules respectively remove domain-dependent information and capture task-related information. A Domain Alignment Mutual Information (DAMI) loss further strengthens feature alignment, and the fully convolutional architecture keeps inference efficient.
Link: https://arxiv.org/abs/2510.13432
Authors: Yushan Han, Hui Zhang, Honglei Zhang, Chuntao Ding, Yuanzhouhan Cao, Yidong Li
Affiliations: Beijing Jiaotong University; School of Computer Science and Technology, Beijing Jiaotong University; Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; Beijing Normal University; School of Artificial Intelligence, Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Mobile Computing
Abstract:Collaborative perception has been proven to improve individual perception in autonomous driving through multi-agent interaction. Nevertheless, most methods often assume identical encoders for all agents, which does not hold true when these models are deployed in real-world applications. To realize collaborative perception in actual heterogeneous scenarios, existing methods usually align neighbor features to those of the ego vehicle, which is vulnerable to noise from domain gaps and thus fails to address feature discrepancies effectively. Moreover, they adopt transformer-based modules for domain adaptation, which causes the model inference inefficiency on mobile devices. To tackle these issues, we propose CoDS, a Collaborative perception method that leverages Domain Separation to address feature discrepancies in heterogeneous scenarios. The CoDS employs two feature alignment modules, i.e., Lightweight Spatial-Channel Resizer (LSCR) and Distribution Alignment via Domain Separation (DADS). Besides, it utilizes the Domain Alignment Mutual Information (DAMI) loss to ensure effective feature alignment. Specifically, the LSCR aligns the neighbor feature across spatial and channel dimensions using a lightweight convolutional layer. Subsequently, the DADS mitigates feature distribution discrepancy with encoder-specific and encoder-agnostic domain separation modules. The former removes domain-dependent information and the latter captures task-related information. During training, the DAMI loss maximizes the mutual information between aligned heterogeneous features to enhance the domain separation process. The CoDS employs a fully convolutional architecture, which ensures high inference efficiency. Extensive experiments demonstrate that the CoDS effectively mitigates feature discrepancies in heterogeneous scenarios and achieves a trade-off between detection accuracy and inference efficiency.
[CV-44] Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter
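[Quick read]: This paper addresses two key challenges in high-resolution text-guided image inpainting, content consistency and prompt alignment, which become especially pronounced at 4K resolution and above. The core of the solution is the Patch-Adapter framework, a two-stage adapter architecture that scales a diffusion model from 1K to 4K+ resolution without structural overhauls. In the first stage, the Dual Context Adapter learns global structural consistency between masked and unmasked regions at reduced resolution. In the second stage, the Reference Patch Adapter applies a patch-level attention mechanism at full resolution, preserving local detail fidelity through adaptive feature fusion. Decoupling global semantics from local refinement mitigates the artifacts common in high-resolution inpainting, and the method achieves leading perceptual quality and prompt adherence on the OpenImages and Photo-Concept-Bucket datasets.
Link: https://arxiv.org/abs/2510.13419
Authors: Jianhui Zhang, Sheng Cheng, Qirui Sun, Jia Liu, Wang Luyang, Chaoyu Feng, Chen Fang, Lei Lei, Jue Wang, Shuaicheng Liu
Affiliations: University of Electronic Science and Technology of China; Megvii Technology; Dzine AI, SeeKoo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: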
Abstract:In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach achieves 4K+ resolution while maintaining precise content consistency and prompt alignment, two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model’s resolution from 1K to 4K+ without requiring structural overhauls: (1) Dual Context Adapter learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency; and (2) Reference Patch Adapter implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and Photo-Concept-Bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence.
[CV-45] Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation
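[Quick read]: This paper addresses the lack of attention that current reinforcement learning (RL) methods for text-to-image (T2I) generation pay to masked generative models. Existing RL methods mainly target diffusion or autoregressive models and overlook this paradigm. The key to the solution is Mask-GRPO, the first method to bring Group Relative Policy Optimization (GRPO)-based RL to masked generative models. It redefines the transition probability and formulates the unmasking process as a multi-step decision-making problem, and it adds several refinements: removing the KL constraint, applying a reduction strategy, and filtering out low-quality samples. Together these substantially improve the base model Show-o on standard T2I benchmarks and preference alignment.
Link: https://arxiv.org/abs/2510.13418
Authors: Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Affiliations: Tsinghua University; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: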
Abstract:Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available on this https URL
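As background for the GRPO component named in the abstract, the following minimal sketch shows the group-relative advantage that gives the method its name: each sampled output is scored against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration; the paper's redefined transition probability and its additional strategies are not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples are
    judged relative to their siblings rather than against a learned critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]
```

If every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.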
[CV-46] Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
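[Quick read]: This paper addresses the shortcomings of existing benchmarks for evaluating spatial reasoning in Vision Language Models (VLMs), in particular the lack of an effective measure of intrinsic-dynamic spatial reasoning, which is central to human spatial cognition. The key to the solution is Spatial-DISE, a unified benchmark built on a cognitively grounded taxonomy that divides spatial reasoning tasks into four basic categories: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic. To address data scarcity, the authors develop a scalable, automated data generation pipeline and build Spatial-DISE Bench with 559 evaluation samples and Spatial-DISE-12K with more than 12K training samples, providing structured, verifiable, and challenging evaluation resources for research on spatial intelligence in VLMs.
Link: https://arxiv.org/abs/2510.13394
Authors: Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, Yi Dong, Xiaowei Huang
Affiliations: University of Liverpool
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: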
Abstract:Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially the intrinsic-dynamic spatial reasoning which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.
[CV-47] Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment
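[Quick read]: This paper addresses the limited generalization and weak semantic expressiveness of WiFi-based gesture recognition in practice, which stem from the domain sensitivity of Channel State Information (CSI) and the lack of high-level gesture abstraction. The key to the solution is a framework called Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which uses the semantic prior of pre-trained large models to strengthen gesture representation learning. Specifically, it designs a dual-path CSI encoding pipeline to extract geometric and dynamic gesture features; introduces a Multiscale Semantic Encoder that aligns temporal embeddings with gesture semantics through cross-modal attention; adopts a semantic-aware soft supervision scheme to reduce inter-class ambiguity and improve category discrimination; and uses a robust dual-distillation strategy to compress the teacher model's knowledge (intermediate features and semantic-informed soft labels) into a lightweight student network, yielding a high-performance, low-latency, deployable RF gesture recognition system.
Link: https://arxiv.org/abs/2510.13390
Authors: Feng-Qi Cui, Yu-Tong Guo, Tianyue Zheng, Jinyang Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE ICPADS 2025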
Abstract:WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for enabling non-contact and privacy-preserving human-computer interaction in AIoT environments. However, existing methods often suffer from limited generalization and semantic expressiveness due to the domain-sensitive nature of Channel State Information and the lack of high-level gesture abstraction. To address these challenges, we propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which leverages the semantic prior of pre-trained large foundation models to enhance gesture representation learning in both in-domain and cross-domain scenarios. Specifically, we first design a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns via CSI-Ratio phase sequences and Doppler spectrograms. These representations are then fed into a Multiscale Semantic Encoder, which learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention mechanisms. To further enhance category discrimination, we introduce a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations and reduces label ambiguity, especially for semantically similar gestures. Finally, we develop a Robust Dual-Distillation strategy to compress the aligned model into a lightweight student network, jointly distilling intermediate features and semantic-informed soft labels from the teacher model. Extensive experiments on the Widar3.0 benchmark show that GLSDA consistently outperforms state-of-the-art methods in both in-domain and cross-domain gesture recognition tasks, while significantly reducing model size and inference latency. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.
[CV-48] Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering ICCV-2025
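[Quick read]: This paper addresses the heavy dependence of dynamic scene rendering and reconstruction on multimodal data (such as LiDAR, ground-truth 3D segmentations, and motion trajectories), especially in urban scenes. Existing methods based on 3D Gaussian Splatting (3DGS) model such scenes accurately but require both camera and LiDAR data, annotated 3D object segmentations, and motion information in the form of tracklets or pre-defined templates such as SMPL. The key to the solution is a new method that combines signed distance functions (SDFs) with 3DGS: it represents dynamic objects with an SDF and combines this with 2D object-agnostic priors (depth and point tracking) in a unified optimization framework, improving geometric accuracy and deformation modeling without LiDAR or ground-truth motion annotation and yielding a more robust and flexible dynamic scene representation.
Link: https://arxiv.org/abs/2510.13381
Authors: Siddharth Tourani, Jayaram Reddy, Akash Kumbar, Satyajit Tourani, Nishant Goyal, Madhava Krishna, N. Dinesh Reddy, Muhammad Haris Khan
Affiliations: IIIT Hyderabad; MBZUAI; University of Heidelberg; VLM Run; IIT Kharagpur
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at ICCV-2025, project page: this https URL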
Abstract:Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS), have enabled accurate modeling of dynamic urban scenes, but for urban scenes they require both camera and LiDAR data, ground-truth 3D segmentations and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object agnostic priors in the form of depth and point tracking coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improved further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition, and scene composition.
[CV-49] DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
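[Quick read]: This paper addresses the performance drop of current Vision-Language-Action (VLA) models on tasks that require precise spatial reasoning, which is rooted in the limited spatial awareness of the underlying Vision-Language Models (VLMs). Existing VLA models rely on extensive action-data pretraining to ground VLMs in 3D space, which is inefficient and still insufficient for accurate spatial understanding. The paper proposes the DepthVLA architecture, whose core innovation is to introduce spatial awareness explicitly through a pretrained depth prediction module. It adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Experiments show that it outperforms existing methods in both real-world and simulated environments.
Link: https://arxiv.org/abs/2510.13375
Authors: Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao
Affiliations: IIIS, Tsinghua University; Galaxea AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: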
Abstract:Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
[CV-50] Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity
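[Quick read]: This paper addresses the unclear effect of prompt design on the zero-shot classification performance of Vision-Language Models (VLMs) when recognizing visually similar categories, such as the human postures sitting, standing, and walking/running. The key to the approach is a systematic three-tiered prompt design, ranging from bare class labels to detailed descriptions with body cues, evaluated with several modern VLMs (MetaCLIP 2, OpenCLIP, and SigLip) on a small COCO-derived dataset. The study reveals a non-monotonic relationship between prompt complexity and model performance: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest prompts perform best, a phenomenon termed "prompt overfitting", whereas the lower-performing SigLip benefits from more descriptive prompts, indicating that prompt design should be adapted to model capability.
Link: https://arxiv.org/abs/2510.13364
Authors: MingZe Tang, Jubal Chandy Jacob
Affiliations: University of Aberdeen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: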
Abstract:Recent Vision-Language Models (VLMs) enable zero-shot classification by aligning images and text in a shared space, a promising approach for data-scarce conditions. However, the influence of prompt design on recognizing visually similar categories, such as human postures, is not well understood. This study investigates how prompt specificity affects the zero-shot classification of sitting, standing, and walking/running on a small, 285-image COCO-derived dataset. A suite of modern VLMs, including OpenCLIP, MetaCLIP 2, and SigLip, were evaluated using a three-tiered prompt design that systematically increases linguistic detail. Our findings reveal a compelling, counter-intuitive trend: for the highest-performing models (MetaCLIP 2 and OpenCLIP), the simplest, most basic prompts consistently achieve the best results. Adding descriptive detail significantly degrades performance (for instance, MetaCLIP 2’s multi-class accuracy drops from 68.8% to 55.1%), a phenomenon we term “prompt overfitting”. Conversely, the lower-performing SigLip model shows improved classification on ambiguous classes when given more descriptive, body-cue-based prompts.
[CV-51] Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models RECSYS2025
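[Quick read]: This paper addresses product recommendation based on visual similarity on large-scale e-commerce platforms, with the goal of helping users find items that match their preferences more efficiently. The key to the solution is to fine-tune a Vision-Language Model (VLM), specifically SigLIP with its sigmoid-based contrastive loss, on one million product image-title pairs from Mercari, and to use the resulting image encoder to generate item embeddings for the recommendation system. In offline evaluation the method improves nDCG@5 by 9.1%; in an online A/B test, click-through rate improves by 50% and conversion rate by 14%, demonstrating the effectiveness of VLM-based encoders for e-commerce recommendation.
Link: https://arxiv.org/abs/2510.13359
Authors: Yuki Yada, Sho Akiyama, Ryo Watanabe, Yuta Ueno, Yusuke Shido, Andre Rusli
Affiliations: Mercari, Inc.
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ACM RecSys 2025 (Spotlight)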
Abstract:On large-scale e-commerce platforms with tens of millions of active monthly users, recommending visually similar products is essential for enabling users to efficiently discover items that align with their preferences. This study presents the application of a vision-language model (VLM) – which has demonstrated strong performance in image recognition and image-text retrieval tasks – to product recommendations on Mercari, a major consumer-to-consumer marketplace used by more than 20 million monthly users in Japan. Specifically, we fine-tuned SigLIP, a VLM employing a sigmoid-based contrastive loss, using one million product image-title pairs from Mercari collected over a three-month period, and developed an image encoder for generating item embeddings used in the recommendation system. Our evaluation comprised an offline analysis of historical interaction logs and an online A/B test in a production environment. In offline analysis, the model achieved a 9.1% improvement in nDCG@5 compared with the baseline. In the online A/B test, the click-through rate improved by 50% whereas the conversion rate improved by 14% compared with the existing model. These results demonstrate the effectiveness of VLM-based encoders for e-commerce product recommendations and provide practical insights into the development of visual similarity-based recommendation systems.
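For readers unfamiliar with the offline metric reported above, this minimal sketch computes nDCG@k for a single ranked list. It uses the common linear-gain formulation with a log2 position discount; the helper names `dcg_at_k` and `ndcg_at_k` and the relevance labels in the usage note are illustrative assumptions, and the paper may use a different gain definition.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([0, 1])` is lower than `ndcg_at_k([1, 0])` because the only relevant item appears in second place instead of first.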
[CV-52] No-Reference Rendered Video Quality Assessment: Dataset and Metrics
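[Quick read]: This paper addresses the bias of existing no-reference video quality assessment (NR-VQA) methods when applied to rendered videos: mainstream NR-VQA datasets and metrics are designed for camera-captured videos, whereas rendered videos are more prone to temporal artifacts, so predictions are inaccurate. The key to the solution is a large rendering-oriented video dataset with subjective quality annotations, together with an NR-VQA metric designed specifically for rendered videos that considers both image quality and temporal stability. The metric is shown to be useful for benchmarking supersampling methods and for assessing frame generation strategies in real-time rendering.
Link: https://arxiv.org/abs/2510.13349
Authors: Sipeng Yang, Jiayu Ji, Qingchuan Zhu, Zhiyao Yang, Xiaogang Jin
Affiliations: State Key Lab of CAD&CG, Zhejiang University; OPPO Nanjing Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: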
Abstract:Quality assessment of videos is crucial for many computer graphics applications, including video games, virtual reality, and augmented reality, where visual performance has a significant impact on user experience. When test videos cannot be perfectly aligned with references or when references are unavailable, the significance of no-reference video quality assessment (NR-VQA) methods is undeniable. However, existing NR-VQA datasets and metrics are primarily focused on camera-captured videos; applying them directly to rendered videos would result in biased predictions, as rendered videos are more prone to temporal artifacts. To address this, we present a large rendering-oriented video dataset with subjective quality annotations, as well as a designed NR-VQA metric specific to rendered videos. The proposed dataset includes a wide range of 3D scenes and rendering settings, with quality scores annotated for various display types to better reflect real-world application scenarios. Building on this dataset, we calibrate our NR-VQA metric to assess rendered video quality by looking at both image quality and temporal stability. We compare our metric to existing NR-VQA metrics, demonstrating its superior performance on rendered videos. Finally, we demonstrate that our metric can be used to benchmark supersampling methods and assess frame generation strategies in real-time rendering.
[CV-53] Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models
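[Quick read]: This paper addresses codebook collapse in Vector Quantized Variational Autoencoders (VQ-VAEs), which leads to low codebook utilization and degrades reconstruction quality. Existing methods use static codebooks or jointly optimize the entire codebook, which constrains the codebook's learning capability and hurts reconstruction performance. The proposed solution, Group-VQ, optimizes the codebook group-wise: groups are optimized independently of one another, with joint optimization performed within each group, which improves the trade-off between codebook utilization and reconstruction performance. The paper also introduces a training-free codebook resampling method that allows the codebook size to be adjusted after training.
Link: https://arxiv.org/abs/2510.13331
Authors: Hong-Kai Zheng, Piji Li
Affiliations: Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: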
Abstract:Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook’s learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. And the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.
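The quantization step described in the first sentence of the abstract, and the utilization measure that codebook collapse degrades, can be sketched as follows. This is a generic nearest-neighbor lookup under squared Euclidean distance, not Group-VQ itself; the function names and the list-of-lists codebook layout are illustrative assumptions.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to the input vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

def codebook_utilization(vectors, codebook):
    """Fraction of codebook entries selected at least once by the given vectors."""
    used = {quantize(v, codebook)[0] for v in vectors}
    return len(used) / len(codebook)
```

A utilization far below 1.0 on representative data is the symptom usually called codebook collapse: most entries are never selected, so they carry no information.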
[CV-54] DEF-YOLO: Leveraging YOLO for Concealed Weapon Detection in Thermal Imaging
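[Quick read]: This paper addresses concealed weapon detection in thermal imaging, motivated by the limitations of other imaging modalities, such as the poor resolution of microwave imaging and the privacy concerns of millimeter wave imaging. The key to the solution is DEF-YOLO, an architecture built on YOLOv8 with the following enhancements: deformable convolutions at the SPPF layer to exploit multi-scale features, and changes to the backbone and neck layers to capture low-, mid-, and high-level features, so that the model adaptively focuses on localization around objects in thermally homogeneous regions without sacrificing much speed or throughput. The paper also builds TICW, described as the first large-scale thermal imaging dataset for concealed weapon detection, and adopts focal loss to mitigate class imbalance, establishing a new benchmark for the task.
Link: https://arxiv.org/abs/2510.13326
Authors: Divya Bhardwaj, Arnav Ramamoorthy, Poonam Goyal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: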
Abstract:Concealed weapon detection aims at detecting weapons hidden beneath a person’s clothing or luggage. Various imaging modalities like Millimeter Wave, Microwave, Terahertz, Infrared, etc., are exploited for the concealed weapon detection task. These imaging modalities have their own limitations, such as poor resolution in microwave imaging, privacy concerns in millimeter wave imaging, etc. To provide a real-time, 24 x 7 surveillance, low-cost, and privacy-preserved solution, we opted for thermal imaging in spite of the lack of availability of a benchmark dataset. We propose a novel approach and a dataset for concealed weapon detection in thermal imagery. Our YOLO-based architecture, DEF-YOLO, is built with key enhancements in YOLOv8 tailored to the unique challenges of concealed weapon detection in thermal vision. We adopt deformable convolutions at the SPPF layer to exploit multi-scale features; backbone and neck layers to extract low, mid, and high-level features, enabling DEF-YOLO to adaptively focus on localization around the objects in thermal homogeneous regions, without sacrificing much of the speed and throughput. In addition to these simple yet effective key architectural changes, we introduce a new, large-scale Thermal Imaging Concealed Weapon dataset, TICW, featuring a diverse set of concealed weapons and capturing a wide range of scenarios. To the best of our knowledge, this is the first large-scale contributed dataset for this task. We also incorporate focal loss to address the significant class imbalance inherent in the concealed weapon detection task. The efficacy of the proposed work establishes a new benchmark through extensive experimentation for concealed weapon detection in thermal imagery.
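The focal loss mentioned in the abstract down-weights easy examples so that rare, hard examples dominate training. Below is a minimal single-example sketch of the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The defaults `gamma=2.0` and `alpha=0.25` are commonly used values and are assumptions here; the abstract does not state which settings DEF-YOLO uses.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction p = P(class 1) and a label y in {0, 1}."""
    # p_t is the probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0) for fully wrong predictions.
    p_t = min(max(p_t, 1e-12), 1.0)
    # (1 - p_t)^gamma shrinks the loss of confidently correct examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a quick way to check an implementation.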
[CV-55] Removing Cost Volumes from Optical Flow Estimators ICCV2025
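[Quick read]: This paper addresses the computational and memory cost of cost volumes in optical flow estimators, which limits both processing speed and input frame resolution. The key to the solution is a training strategy that allows the cost volume to be removed from the estimator over the course of training, significantly improving inference speed and reducing memory use. Models built with this strategy keep accuracy high while being much more efficient: the most accurate model is 1.2× faster and has a 6× lower memory footprint than comparable models, and the fastest model processes Full HD frames at 20 FPS using only 500 MB of GPU memory.
Link: https://arxiv.org/abs/2510.13317
Authors: Simon Kiefhaber, Stefan Roth, Simone Schaub-Meyer
Affiliations: Technical University of Darmstadt; hessian.AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025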
Abstract:Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows removing the cost volume from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being 1.2× faster and having a 6× lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at 20 FPS using only 500 MB of GPU memory.
[CV-56] Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests ICCV2025
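[Quick read]: This paper addresses how to quantify and understand visual interestingness: it explores to what extent Large Multimodal Models (LMMs) capture the human notion of what makes an image interesting, and evaluates how well their predictions align with subjective human judgments. The key to the approach is a comparative analysis with GPT-4o, a leading LMM, which finds that it already captures visual interestingness better than existing methods. GPT-4o can therefore label image pairs by relative interestingness, and these labels serve as training data for distilling the knowledge into a learning-to-rank model, providing a basis for a deeper understanding of human interest.
Link: https://arxiv.org/abs/2510.13316
Authors: Fitim Abdullahu, Helmut Grabner
Affiliations: Zurich University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025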
Abstract:Our daily life is highly influenced by what we consume and see. Attracting and holding one’s attention – the definition of (visual) interestingness – is essential. The rise of Large Multimodal Models (LMMs) trained on large-scale visual and textual data has demonstrated impressive capabilities. We explore these models’ potential to understand to what extent the concepts of visual interestingness are captured and examine the alignment between human assessments and the predictions of GPT-4o, a leading LMM, through comparative analysis. Our studies reveal partial alignment between humans and GPT-4o. It already captures the concept better than state-of-the-art methods. Hence, this allows for the effective labeling of image pairs according to their (common) interestingness, which are used as training data to distill the knowledge into a learning-to-rank model. The insights pave the way for a deeper understanding of human interest.
[CV-57] Self-Augmented Visual Contrastive Decoding
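[Quick read]: This paper addresses the tendency of Large Vision-Language Models (LVLMs) to hallucinate during generation, and in particular the limited effectiveness of existing visual contrastive decoding methods, which apply generic visual augmentations and ignore the context of the text query. The key to the solution is a training-free decoding strategy with two core contributions. The first is a self-augmentation prompting strategy that uses the model's own knowledge to dynamically align the semantics of the query and the visual augmentation. The second is an adaptive thresholding algorithm that adjusts the number of next-token candidates according to output sparsity, making full use of the information in the logit distribution. The method significantly improves factual consistency and shows the importance of query-dependent augmentation and entropy-aware decoding for LVLM generation.
Link: https://arxiv.org/abs/2510.13315
Authors: Eun Woo Im,Muhammad Kashif Ali,Vivek Gupta
Affiliations: Arizona State University; Southwest Jiaotong University; Friedrich-Alexander-Universität Erlangen-Nürnberg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: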
Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy that leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm that adaptively adjusts next token candidate size based on the output sparsity, utilizing full information from the logit distribution. Extensive experiments across four LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving effective generation of LVLMs.
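As a rough illustration of contrastive decoding with an entropy-dependent candidate cutoff, consider the sketch below. It is not the paper's algorithm: the contrastive form `(1 + alpha) * original - alpha * augmented` is a commonly used one, while the rule that ties the cutoff to normalized entropy, the parameters `alpha` and `beta_max`, and all function names are assumptions made for this example.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(vocabulary size), so the value is in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def contrastive_next_token(logits_orig, logits_aug, alpha=1.0, beta_max=0.5):
    """Pick the next token from contrasted logits among plausible candidates.

    logits_orig: logits given the original image.
    logits_aug:  logits given the augmented image.
    """
    probs = softmax(logits_orig)
    # Hypothetical adaptive rule: a peaked distribution gives a strict cutoff
    # (few candidates), a flat distribution gives a loose one (many candidates).
    beta = beta_max * (1.0 - normalized_entropy(probs))
    cutoff = beta * max(probs)
    best, best_score = None, -math.inf
    for i, (lo, la) in enumerate(zip(logits_orig, logits_aug)):
        if probs[i] < cutoff:
            continue  # implausible under the original image, so never selected
        score = (1.0 + alpha) * lo - alpha * la
        if score > best_score:
            best, best_score = i, score
    return best
```

The cutoff matters because contrasting can otherwise promote a token that is very unlikely under the original image but even less likely under the augmented one.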
[CV-58] InstantSfM: Fully Sparse and Parallel Structure-from-Motion
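[Quick read]: This paper addresses two problems: traditional Structure-from-Motion (SfM) methods are computationally inefficient and inflexible on large-scale scenes, while deep-learning-based SfM methods cannot scale to thousands of images because of GPU memory limits. The key to the solution is to fully exploit GPU parallelism to accelerate the critical steps of the SfM pipeline, especially sparse-aware bundle adjustment (BA) and global positioning (GP), and to integrate these optimizations in a unified global SfM framework, achieving up to about 40× speedup while preserving reconstruction accuracy.
Link: https://arxiv.org/abs/2510.13310
Authors: Jiankun Zhong,Zitong Zhan,Quankai Gao,Ziyu Chen,Haozhe Lou,Jiageng Mao,Ulrich Neumann,Yue Wang
Affiliations: University of Southern California; Tsinghua University; University at Buffalo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: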
Abstract:Structure-from-Motion (SfM), a method that recovers camera poses and scene geometry from uncalibrated images, is a central component in robotic reconstruction and simulation. Despite the state-of-the-art performance of traditional SfM methods such as COLMAP and its follow-up work, GLOMAP, naive CPU-specialized implementations of bundle adjustment (BA) or global positioning (GP) introduce significant computational overhead when handling large-scale scenarios, leading to a trade-off between accuracy and speed in SfM. Moreover, the blessing of efficient C++-based implementations in COLMAP and GLOMAP comes with the curse of limited flexibility, as they lack support for various external optimization options. On the other hand, while deep learning based SfM pipelines like VGGSfM and VGGT enable feed-forward 3D reconstruction, they are unable to scale to thousands of input views at once as GPU memory consumption increases sharply as the number of input views grows. In this paper, we unleash the full potential of GPU parallel computation to accelerate each critical stage of the standard SfM pipeline. Building upon recent advances in sparse-aware bundle adjustment optimization, our design extends these techniques to accelerate both BA and GP within a unified global SfM framework. Through extensive experiments on datasets of varying scales (e.g. 5000 images where VGGSfM and VGGT run out of memory), our method demonstrates up to about 40 times speedup over COLMAP while achieving consistently comparable or even improved reconstruction accuracy. Our project page can be found at this https URL.
[CV-59] Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning NEURIPS2025
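[Quick read]: This paper addresses novel class discovery for point cloud segmentation (3D-NCD): learning a model that segments unlabeled novel classes using only supervision from labeled base classes. The core challenge is to establish accurate associations between point representations and base-class labels, and between the point representations of base and novel classes; coarse or purely statistical correlation learning can lead to confusion in novel-class inference. The key to the solution is to reformulate 3D-NCD with a Structural Causal Model (SCM) and to propose joint learning of causal representation and reasoning. The SCM is first used to analyze hidden confounders in the base-class representations, and a deconfounded causal representation prototype is designed to capture the causal representations of base classes. A graph structure then models the causal relationships between the base-class causal prototypes and the novel-class prototypes, enabling causal reasoning from base to novel classes and improving the accuracy and robustness of novel-class segmentation.
Link: https://arxiv.org/abs/2510.13307
Authors: Yang Li,Aming Wu,Zihao Zhang,Yahong Han
Affiliations: Tianjin University; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025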
Abstract:In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation (3D-NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes using only the supervision from labeled (base) 3D classes. The key to this task is to set up the exact correlations between the point representations and their base class labels, as well as the representation correlations between the points from base and novel classes. Coarse or statistical correlation learning may lead to confusion in novel class inference. If we impose a causal relationship as a strongly correlated constraint upon the learning process, the essential point cloud representations that accurately correspond to the classes should be uncovered. To this end, we introduce a structural causal model (SCM) to re-formalize the 3D-NCD problem and propose a new method, i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we first analyze hidden confounders in the base class representations and the causal relationships between the base and novel classes through SCM. We devise a causal representation prototype that eliminates confounders to capture the causal representations of base classes. A graph structure is then used to model the causal relationships between the base classes’ causal representation prototypes and the novel class prototypes, enabling causal reasoning from base to novel classes. Extensive experiments and visualization results on 3D and 2D NCD semantic segmentation demonstrate the superiority of our method.
[CV-60] Automated document processing system for government agencies using DBNET and BART models
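[Quick read]: This paper addresses automatic classification of mixed-source documents (offline images and live camera captures) in unconstrained imaging scenarios, targeting practical challenges such as variable illumination, arbitrary text orientation, curved or partially occluded text, low resolution, and distant text. The key to the solution is a four-stage end-to-end document classification system: an image capture and preprocessing module first improves input quality; DBNet++ (Differentiable Binarization Network Plus) then performs robust text detection; a BART (Bidirectional and Auto-Regressive Transformers) model classifies the text content; and all modules are integrated in a PyQt5 user interface written in Python, classifying documents as Invoice, Report, Letter, or Form under difficult conditions. Experiments report about 92.88% text-detection accuracy on the Total-Text dataset, supporting the method's effectiveness in real-world scenarios.
Link: https://arxiv.org/abs/2510.13303
Authors: Aya Kaysan Bahjat
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 8 pages, 12 figures, article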
Abstract:An automatic document classification system is presented that detects textual content in images and classifies documents into four predefined categories (Invoice, Report, Letter, and Form). The system supports both offline images (e.g., files on flash drives, HDDs, microSD) and real-time capture via connected cameras, and is designed to mitigate practical challenges such as variable illumination, arbitrary orientation, curved or partially occluded text, low resolution, and distant text. The pipeline comprises four stages: image capture and preprocessing, text detection [1] using a DBNet++ (Differentiable Binarization Network Plus) detector, and text classification [2] using a BART (Bidirectional and Auto-Regressive Transformers) classifier, all integrated within a user interface implemented in Python with PyQt5. For text detection in images, the system reached about 92.88% over 10 hours on the Total-Text dataset, which contains high-resolution images that simulate a variety of very difficult challenges. The results indicate the proposed approach is effective for practical, mixed-source document categorization in unconstrained imaging scenarios.
[CV-61] Universal Image Restoration Pre-training via Masked Degradation Classification
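[Quick read]: This paper addresses the limited generalization of pre-training methods for image restoration, especially across multiple degradation types and complex real-world scenes. Conventional pre-training often relies on strong supervision or a single-degradation assumption and adapts poorly to diverse degradations. The key to the solution is Masked Degradation Classification Pre-Training (MaskDCPT), which uses the degradation type of the image as an extremely weak supervision signal while also using image reconstruction to strengthen the learned representation. Specifically, MaskDCPT uses two decoders, one for degradation-type classification and one for high-quality image reconstruction, combining the benefits of masked image modeling and contrastive learning so that the encoder learns features that are robust and generalize across degradation types. This design significantly improves the performance of both CNNs and Transformers on multi-degradation restoration, with particularly strong adaptability to real-world degradations.
Link: https://arxiv.org/abs/2510.13282
Authors: JiaKui Hu,Zhengjian Yao,Lujia Jin,Yinghao Chen,Yanye Lu
Affiliations: Peking University Health Science Center, Peking University, Beijing, China; Biomedical Engineering Department, College of Future Technology, Peking University, Beijing, China; National Biomedical Imaging Center, Peking University, Beijing, China; JIUTIAN Research, Beijing, China; College of Electronic Engineering, National University of Defense Technology, Changsha, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: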
Abstract:This study introduces a Masked Degradation Classification Pre-Training method (MaskDCPT), designed to facilitate the classification of degradation types in input images, leading to comprehensive image restoration pre-training. Unlike conventional pre-training methods, MaskDCPT uses the degradation type of the image as an extremely weak supervision, while simultaneously leveraging the image reconstruction to enhance performance and robustness. MaskDCPT includes an encoder and two decoders: the encoder extracts features from the masked low-quality input image. The classification decoder uses these features to identify the degradation type, whereas the reconstruction decoder aims to reconstruct a corresponding high-quality image. This design allows the pre-training to benefit from both masked image modeling and contrastive learning, resulting in a generalized representation suited for restoration tasks. Benefiting from the straightforward yet potent MaskDCPT, the pre-trained encoder can be used to address universal image restoration and achieve outstanding performance. Implementing MaskDCPT significantly improves performance for both convolutional neural networks (CNNs) and Transformers, with a minimum increase in PSNR of 3.77 dB in the 5D all-in-one restoration task and a 34.8% reduction in PIQE compared to baseline in real-world degradation scenarios. It also exhibits strong generalization to previously unseen degradation types and levels. In addition, we curate and release the UIR-2.5M dataset, which includes 2.5 million paired restoration samples across 19 degradation types and over 200 degradation levels, incorporating both synthetic and real-world data. The dataset, source code, and models are available at this https URL.
[CV-62] End-to-End Multi-Modal Diffusion Mamba ICCV2025
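[Quick read]: This paper addresses the limitation that current end-to-end multi-modal models use separate encoders and decoders, which hinders joint representation learning across modalities. The key to the solution is a new architecture, MDM (Multi-modal Diffusion Mamba). Its core idea is a Mamba-based multi-step selection diffusion model that unifies encoding and decoding through a single Variational Autoencoder (VAE) and progressively generates and refines modality-specific information, achieving strong performance on high-dimensional data such as generating high-resolution images and long text sequences simultaneously.
Link: https://arxiv.org/abs/2510.13253
Authors: Chunhao Lu,Qiang Lu,Meichen Dong,Jake Luo
Affiliations: China University of Petroleum-Beijing; Leyard Optoelectronic; University of Wisconsin-Milwaukee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025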
Abstract:Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, and Chameleon etc.) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM’s effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
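[CV-63] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
[Quick read]: This paper addresses the unclear internal information flow of Video Large Language Models (VideoLLMs) on spatiotemporal inputs, in particular how they extract, propagate, and integrate video and text information for video question answering (VideoQA). The key to the approach is a systematic analysis of internal model states with mechanistic interpretability techniques, which reveals a staged pathway of cross-frame interaction, video-language integration, and answer generation: early-to-middle layers initiate temporal reasoning through active cross-frame interactions; middle layers align video representations with linguistic embeddings that contain temporal concepts; and middle-to-late layers produce the correct answer. Building on this, the study shows that selecting the effective information pathways while suppressing a large share of attention edges (e.g., 58% in LLaVA-NeXT-7B-Video-FT) retains VideoQA performance, offering practical insights for improving interpretability and downstream generalization.
Link: https://arxiv.org/abs/2510.13251
Authors: Minji Kim,Taekyung Kim,Bohyung Han
Affiliations: Seoul National University; NAVER AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 28 figures, 8 tables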
Abstract:Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint on how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at this https URL
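Suppressing attention edges can be pictured as removing selected (query, key) pairs before the softmax so that the remaining weights renormalize. The sketch below shows that generic masking step only; how the authors choose which edges to suppress is not described in the abstract, and the function name and `blocked_edges` argument are hypothetical.

```python
import math


def attention_weights(scores, blocked_edges=frozenset()):
    """Row-wise softmax over attention scores, skipping blocked (query, key) edges.

    scores: list of rows, scores[q][k] is the raw score of query q on key k.
    blocked_edges: set of (q, k) pairs to suppress. This sketch assumes every
    query keeps at least one unblocked key.
    """
    weights = []
    for q, row in enumerate(scores):
        kept = [k for k in range(len(row)) if (q, k) not in blocked_edges]
        m = max(row[k] for k in kept)
        exps = {k: math.exp(row[k] - m) for k in kept}
        total = sum(exps.values())
        # Blocked edges receive exactly zero weight; the rest sum to one.
        weights.append([exps.get(k, 0.0) / total for k in range(len(row))])
    return weights


# Blocking the edge from query 1 to key 0 moves all of its weight to key 1.
print(attention_weights([[0.0, 0.0], [0.0, 0.0]], blocked_edges={(1, 0)}))
```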
[CV-64] Real-Time Crowd Counting for Embedded Systems with Lightweight Architecture
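[Quick read]: This paper addresses the problems that existing crowd counting methods face on embedded systems: too many model parameters, high computational complexity, and failure to meet real-time requirements. To reach super real-time performance, the authors design a lightweight model with a stem-encoder-decoder structure. Its key elements are large convolution kernels that enlarge the receptive field to extract detailed head information; conditional channel weighting and a multi-branch local fusion block in the encoder that merge multi-scale features at low computational cost; and Feature Pyramid Networks (FPN) added on top of the encoder to alleviate incomplete feature fusion. The model preserves accuracy while greatly increasing inference speed: experiments report 381.7 FPS on an NVIDIA GTX 1080Ti and 71.9 FPS on a Jetson TX1, faster than current mainstream methods.
Link: https://arxiv.org/abs/2510.13250
Authors: Zhiyuan Zhao,Yubin Wen,Siyu Yang,Lichen Ning,Yuandong Liu,Junyu Gao
Affiliations: Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: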
Abstract:Crowd counting is the task of estimating the number of people in a crowd from images, which is extremely valuable in fields such as intelligent security, urban planning, and public safety management. However, existing counting methods have problems in practical application on embedded systems in these fields, such as excessive model parameters and a large amount of complex computation. Practical application on embedded systems requires the model to run in real time, which means that the model must be fast enough. Considering these problems, we design a super real-time model with a stem-encoder-decoder structure for crowd counting tasks, which achieves the fastest inference compared with state-of-the-art methods. Firstly, large convolution kernels in the stem network are used to enlarge the receptive field, which effectively extracts detailed head information. Then, in the encoder part, we use conditional channel weighting and a multi-branch local fusion block to merge multi-scale features with low computational consumption. This part is crucial to the super real-time performance of the model. Finally, feature pyramid networks are added to the top of the encoder to alleviate its incomplete fusion problems. Experiments on three benchmarks show that our network is suitable for super real-time crowd counting on embedded systems while ensuring competitive accuracy, and that its inference speed is the fastest among the compared methods. Specifically, the proposed network achieves 381.7 FPS on NVIDIA GTX 1080Ti and 71.9 FPS on NVIDIA Jetson TX1.
[CV-65] CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation NEURIPS2025
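[Quick read]: This paper addresses the lack of publicly available, well-annotated datasets for outdoor 3D semantic scene generation, which limits progress in the field. The authors introduce SketchSem3D, a large-scale benchmark based on sketches and pseudo-labeled satellite images with two subsets (Sketch-based SemanticKITTI and Sketch-based KITTI-360) that supports standardized, rigorous, and diverse evaluation. The key to the solution is the Cylinder Mamba Diffusion (CymbaDiff) model, which introduces a structured spatial ordering that explicitly models cylindrical continuity and vertical hierarchy while preserving physical neighborhood relationships and global context in the generated scenes. This markedly improves semantic consistency and spatial realism and shows strong cross-dataset generalization.
Link: https://arxiv.org/abs/2510.13245
Authors: Li Liang,Bo Miao,Xinyu Wang,Naveed Akhtar,Jordan Vice,Ajmal Mian
Affiliations: The University of Western Australia; AIML, The University of Adelaide; The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025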
Abstract:Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at this https URL
[CV-66] FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding
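[Quick read]: This paper addresses the dependence of computer vision algorithms for Unmanned Aerial Vehicles (UAVs) in urban environments on large-scale, accurately annotated data. Collecting and annotating real-world UAV data is costly and difficult, which limits research. The key to the solution is FlyAwareV2, a new multimodal dataset that combines real and synthetic UAV imagery, includes RGB, depth, and semantic labels, and covers varied weather and lighting conditions. Depth maps for real samples are generated with state-of-the-art monocular depth estimation. The dataset also provides RGB and multimodal semantic segmentation benchmarks on standard architectures, together with synthetic-to-real domain adaptation studies that assess how well models trained on synthetic data generalize.
Link: https://arxiv.org/abs/2510.13243
Authors: Francesco Barbato,Matteo Caligiuri,Pietro Zanuttigh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 7 figures, 10 tables, data and code available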
Abstract:The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments heavily relies on the availability of large-scale datasets with accurate annotations. However, collecting and annotating real-world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 introduces several new key contributions: 1) Multimodal data (RGB, depth, semantic labels) across diverse environmental conditions including varying weather and daytime; 2) Depth maps for real samples computed via state-of-the-art monocular depth estimation; 3) Benchmarks for RGB and multimodal semantic segmentation on standard architectures; 4) Studies on synthetic-to-real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV-based 3D urban scene understanding.
[CV-67] Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models
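[Quick read]: This paper addresses the adversarial robustness of Vision-Language-Action (VLA) models in robot learning: under malicious perturbations, VLA models make wrong decisions and robotic tasks fail. The key to the solution is the Embedding Disruption Patch Attack (EDPA), a general adversarial attack that requires no prior knowledge of the model. It disrupts the semantic alignment between visual and textual latent representations and maximizes the latent-space discrepancy between adversarial and clean inputs, so the VLA model misinterprets visual information and generates incorrect actions. To defend against such attacks, the paper further proposes adversarial fine-tuning of the visual encoder so that it outputs similar latent representations for clean and perturbed inputs, effectively improving the robustness of VLA models.
Link: https://arxiv.org/abs/2510.13237
Authors: Haochuan Xu,Yun Sing Koh,Shuhuai Huang,Zirun Zhou,Di Wang,Jun Sakuma,Jingfeng Zhang
Affiliations: The University of Auckland; King Abdullah University of Science and Technology; Tokyo University of Science; RIKEN Center for Advanced Intelligence Project
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: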
Abstract:Vision-Language-Action (VLA) models have achieved revolutionary progress in robot learning, enabling robots to execute complex physical robot tasks from natural language instructions. Despite this progress, their adversarial robustness remains underexplored. In this work, we propose both adversarial patch attack and corresponding defense strategies for VLA models. We first introduce the Embedding Disruption Patch Attack (EDPA), a model-agnostic adversarial attack that generates patches directly placeable within the camera’s view. In comparison to prior methods, EDPA can be readily applied to different VLA models without requiring prior knowledge of the model architecture, or the controlled robotic manipulator. EDPA constructs these patches by (i) disrupting the semantic alignment between visual and textual latent representations, and (ii) maximizing the discrepancy of latent representations between adversarial and corresponding clean visual inputs. Through the optimization of these objectives, EDPA distorts the VLA’s interpretation of visual information, causing the model to repeatedly generate incorrect actions and ultimately result in failure to complete the given robotic task. To counter this, we propose an adversarial fine-tuning scheme for the visual encoder, in which the encoder is optimized to produce similar latent representations for both clean and adversarially perturbed visual inputs. Extensive evaluations on the widely recognized LIBERO robotic simulation benchmark demonstrate that EDPA substantially increases the task failure rate of cutting-edge VLA models, while our proposed defense effectively mitigates this degradation. The codebase is accessible via the homepage at this https URL.
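The two attack objectives and the defense can be written down as simple functions of latent vectors. The sketch below is one plausible reading, not the authors' formulation: cosine similarity for alignment, squared Euclidean distance for discrepancy, the `weight` parameter, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def attack_objective(z_adv, z_clean, z_text, weight=1.0):
    """Quantity a patch attacker would maximize (hypothetical formulation).

    (i)  -cosine(z_adv, z_text): break visual-textual semantic alignment.
    (ii) squared_distance(z_adv, z_clean): push the adversarial latent away
         from the clean latent.
    """
    return -cosine(z_adv, z_text) + weight * squared_distance(z_adv, z_clean)


def defense_loss(z_adv, z_clean):
    """Quantity adversarial fine-tuning of the encoder would minimize:
    clean and perturbed inputs should map to similar latents."""
    return squared_distance(z_adv, z_clean)


# An unperturbed, text-aligned latent gives the lowest attack objective.
print(attack_objective([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # prints -1.0
```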
[CV-68] EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking
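[Quick read]: This paper addresses a performance bottleneck in existing visual tracking methods that use static textual descriptions from large language models: the descriptions do not adapt to real-time changes in target state and are prone to hallucination. The core solution is EPIPTrack, a unified multimodal vision-language tracking framework built on the cooperation of explicit and implicit prompts. Explicit prompts convert spatial motion information into natural language descriptions that provide spatiotemporal guidance; implicit prompts combine pseudo-words with learnable descriptors to build individualized appearance representations. Both are dynamically adjusted through the CLIP text encoder to respond to changes in target state. A discriminative feature augmentation module additionally strengthens visual and cross-modal representations, significantly improving tracking robustness and accuracy.
Link: https://arxiv.org/abs/2510.13235
Authors: Yukuan Zhang,Jiarui Zhao,Shangqing Nie,Jin Kuang,Shengsheng Wang
Affiliations: Jilin University; Yangtze University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: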
Abstract:Multimodal semantic cues, such as textual descriptions, have shown strong potential in enhancing target perception for tracking. However, existing methods rely on static textual descriptions from large language models, which lack adaptability to real-time target state changes and are prone to hallucinations. To address these challenges, we propose a unified multimodal vision-language tracking framework, named EPIPTrack, which leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Specifically, explicit prompts transform spatial motion information into natural language descriptions to provide spatiotemporal guidance. Implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations capturing appearance attributes. Both prompts undergo dynamic adjustment via the CLIP text encoder to respond to changes in target state. Furthermore, we design a Discriminative Feature Augmentor to enhance visual and cross-modal representations. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios, exhibiting robust adaptability and superior performance.
[CV-69] UniVector: Unified Vector Extraction via Instance-Geometry Interaction
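[Quick read]: This paper addresses a limitation of existing vector extraction (VE) methods when handling multiple vector structures: they are usually designed for a single vector type (such as polygons, polylines, or line segments) with a separate model for each, and cannot model the complex relationships between different structures in a unified way. The key to the solution is the UniVector framework. It introduces an instance-geometry interaction mechanism that encodes vectors as structured queries containing both instance attributes (category, structure) and geometric attributes (point coordinates, connections), and uses an interaction module to exchange context across levels, so that multiple vector types are extracted efficiently within a single model. A dynamic shape constraint further refines global structure and key-point accuracy, significantly improving performance in multi-structure scenarios.
Link: https://arxiv.org/abs/2510.13234
Authors: Yinglong Yan,Jun Yue,Shaobo Xia,Hanmeng Sun,Tianxu Ying,Chengcheng Wu,Sifan Lan,Min He,Pedram Ghamisi,Leyuan Fang
Affiliations: Hunan University; Central South University; Changsha University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: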
Abstract:Vector extraction retrieves structured vector geometry from raster images, offering high-fidelity representation and broad applicability. Existing methods, however, are usually tailored to a single vector type (e.g., polygons, polylines, line segments), requiring separate models for different structures. This stems from treating instance attributes (category, structure) and geometric attributes (point coordinates, connections) independently, limiting the ability to capture complex structures. Inspired by the human brain’s simultaneous use of semantic and spatial interactions in visual perception, we propose UniVector, a unified VE framework that leverages instance-geometry interaction to extract multiple vector types within a single model. UniVector encodes vectors as structured queries containing both instance- and geometry-level information, and iteratively updates them through an interaction module for cross-level context exchange. A dynamic shape constraint further refines global structures and key points. To benchmark multi-structure scenarios, we introduce the Multi-Vector dataset with diverse polygons, polylines, and line segments. Experiments show UniVector sets a new state of the art on both single- and multi-structure VE tasks. Code and dataset will be released at this https URL.
[CV-70] What “Not” to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
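[Quick read]: This paper addresses a severe bias in how current Vision-Language Models (VLMs) understand negation, known as affirmative bias, which is especially pronounced in Described Object Detection (DOD) tasks. The solution rests on two key contributions. The first is CoVAND, a dataset generation pipeline that combines Chain-of-Thought (CoT) reasoning with visual question answering (VQA) to obtain high-quality, instance-grounded negation samples. The second is NegToMe, a lightweight text token merging module that combines a negation cue (such as "not") with the attribute it modifies (such as "girl") into a single semantic unit, avoiding the loss of negation information caused by tokenization and preserving the correct polarity at the input level. Combined with a parameter-efficient LoRA fine-tuning strategy, NegToMe significantly improves negation understanding, raising NMS-AP by up to +10.8 on benchmarks such as OVDEval, and generalizes well.
Link: https://arxiv.org/abs/2510.13232
Authors: Inha Kang,Youngsun Lim,Seonho Lee,Jiho Choi,Junsuk Choe,Hyunjung Shim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 38 pages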
Abstract:State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens “not” and “girl” as simply “girl”, NegToMe binds them into a single token whose meaning is correctly distinguished from that of “girl” alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
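The idea of binding a negation cue to the token it modifies can be sketched on plain token and embedding lists. This is an illustrative assumption only: the abstract does not say how NegToMe computes the merged token, so the averaging rule, the cue list `NEGATION_CUES`, and the function name are all hypothetical.

```python
NEGATION_CUES = {"not", "no", "without"}  # hypothetical cue list


def merge_negation_tokens(tokens, embeddings):
    """Merge each negation cue with the token that follows it.

    tokens: list of strings; embeddings: list of equal-length float lists.
    Returns new (tokens, embeddings). The merged embedding is the mean of the
    two originals, a placeholder for whatever rule a real module would learn.
    """
    merged_tokens, merged_embeddings = [], []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATION_CUES and i + 1 < len(tokens):
            merged_tokens.append(tokens[i] + " " + tokens[i + 1])
            merged_embeddings.append(
                [(a + b) / 2.0 for a, b in zip(embeddings[i], embeddings[i + 1])]
            )
            i += 2  # both tokens are consumed by the merged unit
        else:
            merged_tokens.append(tokens[i])
            merged_embeddings.append(list(embeddings[i]))
            i += 1
    return merged_tokens, merged_embeddings


tokens, embs = merge_negation_tokens(
    ["a", "person", "not", "girl"],
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]],
)
print(tokens)  # prints ['a', 'person', 'not girl']
```

After merging, "not girl" is a single unit whose embedding differs from that of "girl" alone, which is the property the abstract describes.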
[CV-71] Sample-Centric Multi-Task Learning for Detection and Segmentation of Industrial Surface Defects
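[Quick read]: This paper addresses a performance bottleneck in industrial surface defect inspection caused by the mismatch between sample-level quality control (QC) decisions and pixel-level training objectives. In real production settings with extreme foreground-background imbalance, sparse defects with a long-tailed scale distribution, and low contrast, existing methods score well on pixel-overlap metrics such as mIoU but lack stability at the sample level, especially for sparse or slender defects. The key to the solution is a sample-centric multi-task learning framework that uses a shared-encoder architecture to jointly optimize sample-level defect classification and pixel-level mask localization. Sample-level supervision continually boosts recall for small and low-contrast defects at the gradient level, while the segmentation branch preserves boundary and shape details to stabilize sample-level decisions and reduce misses. The paper also designs decision-linked evaluation metrics (Seg_mIoU and Seg_Recall) that remove the bias classical mIoU incurs from empty or true-negative samples, and so reflect more accurately how defect localization quality agrees with sample-level decisions.
Link: https://arxiv.org/abs/2510.13226
Authors: Hang-Cheng Dong,Yibo Jiao,Fupeng Wei,Guodong Liu,Dong Ye,Bingguo Liu
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: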
Abstract:Industrial surface defect inspection for sample-wise quality control (QC) must simultaneously decide whether a given sample contains defects and localize those defects spatially. In real production lines, extreme foreground-background imbalance, defect sparsity with a long-tailed scale distribution, and low contrast are common. As a result, pixel-centric training and evaluation are easily dominated by large homogeneous regions, making it difficult to drive models to attend to small or low-contrast defects, one of the main bottlenecks for deployment. Empirically, existing models achieve strong pixel-overlap metrics (e.g., mIoU) but exhibit insufficient stability at the sample level, especially for sparse or slender defects. The root cause is a mismatch between the optimization objective and the granularity of QC decisions. To address this, we propose a sample-centric multi-task learning framework and evaluation suite. Built on a shared-encoder architecture, the method jointly learns sample-level defect classification and pixel-level mask localization. Sample-level supervision modulates the feature distribution and, at the gradient level, continually boosts recall for small and low-contrast defects, while the segmentation branch preserves boundary and shape details to enhance per-sample decision stability and reduce misses. For evaluation, we propose decision-linked metrics, Seg_mIoU and Seg_Recall, which remove the bias of classical mIoU caused by empty or true-negative samples and tightly couple localization quality with sample-level decisions. Experiments on two benchmark datasets demonstrate that our approach substantially improves the reliability of sample-level decisions and the completeness of defect localization.
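The bias from empty samples can be seen with a toy per-sample IoU. The sketch below does not implement the paper's Seg_mIoU or Seg_Recall, whose exact definitions are not given in the abstract; it only contrasts an average over all samples with an average restricted to samples that actually contain defects. The function names and the convention that an empty prediction on an empty mask scores 1.0 are assumptions.

```python
def iou(pred, gt):
    """IoU of two binary masks given as flat lists of 0/1.

    Convention assumed here: empty prediction vs. empty ground truth scores 1.0.
    """
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return 1.0 if union == 0 else inter / union


def mean_iou_all(preds, gts):
    """Average IoU over every sample, including defect-free ones."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)


def mean_iou_defective_only(preds, gts):
    """Average IoU over samples whose ground truth contains a defect.

    Assumes at least one defective sample is present.
    """
    pairs = [(p, g) for p, g in zip(preds, gts) if any(g)]
    return sum(iou(p, g) for p, g in pairs) / len(pairs)


# Three defect-free samples and one defective sample that is half recovered.
gts = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
preds = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
print(mean_iou_all(preds, gts), mean_iou_defective_only(preds, gts))  # prints 0.875 0.5
```

The defect-free samples lift the overall average to 0.875 even though the only real defect is localized with an IoU of 0.5.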
[CV-72] Prompt-based Adaptation in Large-scale Vision Models: A Survey
Quick read: This paper addresses the blurred conceptual boundary and unclear distinction in application between Visual Prompting (VP) and Visual Prompt Tuning (VPT), which are often used interchangeably in current research and lack systematic classification and understanding. The key to the solution is a unified framework, Prompt-based Adaptation (PA), on which the authors build a taxonomy of learnable, generative, and non-learnable prompts and further organize existing methods by injection granularity (pixel-level and token-level). This clarifies the essential differences between VP and VPT and their respective use cases, giving related research a structured path and conceptual grounding.
Link: https://arxiv.org/abs/2510.13219
Authors: Xi Xiao,Yunbei Zhang,Lin Zhao,Yiyang Liu,Xiaoying Liao,Zheda Mai,Xingjian Li,Xiao Wang,Hao Xu,Jihun Hamm,Xue Lin,Min Xu,Qifan Wang,Tianyang Wang,Cheng Han
Affiliations: University of Alabama at Birmingham; Tulane University; Northeastern University; University of Missouri-Kansas City; Johns Hopkins University; Ohio State University; Carnegie Mellon University; Oak Ridge National Laboratory; Harvard University; Meta AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the "pretrain-then-finetune" paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity: pixel-level and token-level. Beyond the core methodologies, we examine PA’s integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, this is the first comprehensive survey dedicated to PA’s methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all areas to understand and explore the evolving landscape of PA-related research.
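The two injection granularities named in the abstract can be pictured with a small Python sketch. This is only an illustration under simplifying assumptions: images are 2D lists of numbers, tokens are lists of floats, and both function names are hypothetical rather than taken from the survey. Pixel-level injection changes the input image itself, while token-level injection extends the token sequence fed to the backbone.

```python
def inject_pixel_prompt(image, prompt):
    """Pixel-level: add a prompt of the same shape to the image, element by element."""
    return [[p + q for p, q in zip(row, prompt_row)]
            for row, prompt_row in zip(image, prompt)]


def inject_token_prompt(tokens, prompt_tokens):
    """Token-level: prepend prompt tokens to the sequence of patch tokens."""
    return [list(t) for t in prompt_tokens] + [list(t) for t in tokens]


image = [[1.0, 2.0], [3.0, 4.0]]
pixel_prompt = [[0.5, 0.5], [0.5, 0.5]]
print(inject_pixel_prompt(image, pixel_prompt))    # same shape, shifted values

tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
prompt_tokens = [[0.0, 0.1], [0.2, 0.3]]
print(len(inject_token_prompt(tokens, prompt_tokens)))  # sequence grows by two
```

In both cases the backbone weights would typically stay frozen and only the prompt would be updated, which is what makes these approaches lightweight relative to full fine-tuning.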
[CV-73] MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation
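Quick read: This paper addresses two core problems in generating stylized 3D human motion from speech signals. First, existing style encoding methods characterize stylistic diversity too coarsely and ignore differences in motion style between body regions (e.g., upper vs. lower body). Second, motion style does not adapt dynamically to changes in speech rhythm and emotion, so the generated motion lacks realism. The key to the solution is the MimicParts framework, which introduces part-aware style injection and a part-aware denoising network. It divides the body into several regions and encodes localized motion styles separately to capture fine-grained regional differences. A part-aware attention block lets rhythm and emotion cues in the speech guide motion generation for each body region precisely, so that style adjusts dynamically to the speech context and the generated motion becomes more natural and expressive.
Link: https://arxiv.org/abs/2510.13208
Authors: Lianlian Liu,YongKang He,Zhaojie Chu,Xiaofen Xing,Xiangmin Xu
Affiliations: 1. South China University of Technology; 2. Tencent; 3. Alibaba Group; 4. Huawei Technologies Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: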
Abstract:Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation based on part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, producing natural and expressive 3D human motion sequences.
[CV-74] Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences
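Quick read: This paper addresses the growing strain on the peer-review system at top Artificial Intelligence (AI) conferences, including heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability. The key to the solution is Paper Copilot, a system that creates durable digital archives of peer reviews across many computer-science venues and releases them as an open dataset, enabling large-scale empirical analysis of peer review. The paper also presents a reproducible study based on multiple years of ICLR (International Conference on Learning Representations) review data, providing infrastructure and data for diagnosing failure modes in the review process, tracking how it evolves, and informing evidence-based improvements.
Link: https://arxiv.org/abs/2510.13201
Authors: Jing Yang,Qiyao Wei,Jiaxin Pei
Affiliations: University of Southern California; University of Cambridge; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: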
Abstract:The rapid growth of AI conferences is straining an already fragile peer-review system, leading to heavy reviewer workloads, expertise mismatches, inconsistent evaluation standards, superficial or templated reviews, and limited accountability under compressed timelines. In response, conference organizers have introduced new policies and interventions to preserve review standards. Yet these ad-hoc changes often create further concerns and confusion about the review process, leaving how papers are ultimately accepted - and how practices evolve across years - largely opaque. We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review. We hope these resources help the community track changes, diagnose failure modes, and inform evidence-based improvements toward a more robust, transparent, and reliable peer-review system.
[CV-75] Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion
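Quick read: This paper addresses a limitation in how current camera-based occupancy prediction methods use features: they mainly improve performance through structural changes (such as lightweight backbones or complex cascaded frameworks) and overlook the potential of fusing the multiple feature representations available in 2D images. The key to the solution is CIGOcc, a two-stage occupancy prediction framework that extracts three kinds of multi-level features (segmentation, graphics, and depth) and fuses them with a deformable multi-level fusion mechanism. It also incorporates knowledge distilled from SAM (Segment Anything Model) to further improve prediction accuracy, and it achieves state-of-the-art performance on the SemanticKITTI benchmark without increasing training costs.
Link: https://arxiv.org/abs/2510.13198
Authors: Rongtao Xu,Jinzhou Lin,Jialei Zhou,Jiahua Dong,Changwei Wang,Ruisheng Wang,Li Guo,Shibiao Xu,Xiaodan Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: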
Abstract:Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost all existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies explore the problem from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released this https URL
[CV-76] STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control
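Quick read: This paper addresses edge Gaussian splatting (EGS) for scene reconstruction, in which an edge server aggregates data from distributed clients to train a global Gaussian Splatting (GS) model, and asks how to maximize the quality of that model under limited communication resources. Existing methods emphasize communication throughput or general-purpose learning performance, so they do not apply directly to a GS-quality-oriented objective. The key to the solution is a sample-then-transmit strategy (STT-GS). In the first stage, feature-domain clustering (FDC) samples a small number of representative images from each client as pilot data, which are used to predict each client's contribution to GS quality. Based on that first-stage evaluation, a joint client selection and power control (JCSPC) framework then prioritizes communication resources toward high-value clients under the communication constraints, solved efficiently and with low complexity by a penalty alternating majorization minimization (PAMM) algorithm. Experiments show that the GS-oriented objective can be predicted accurately at low sampling ratios (e.g., 10%) and that the scheme significantly outperforms existing benchmarks.
Link: https://arxiv.org/abs/2510.13186
Authors: Zhen Li,Xibin Jin,Guoliang Li,Shuai Wang,Miaowen Wen,Huseyin Arslan,Derrick Wing Kwan Ng,Chengzhong Xu
Affiliations: Hangzhou Dianzi University; Chinese Academy of Sciences; South China University of Technology; University of Macau; Istanbul Medipol University; University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: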
Abstract:Edge Gaussian splatting (EGS), which aggregates data from distributed clients and trains a global GS model at the edge server, is an emerging paradigm for scene reconstruction. Unlike traditional edge resource management methods that emphasize communication throughput or general-purpose learning performance, EGS explicitly aims to maximize the GS qualities, rendering existing approaches inapplicable. To address this problem, this paper formulates a novel GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. However, evaluating this function in turn requires clients’ images, leading to a causality dilemma. To this end, this paper further proposes a sample-then-transmit EGS (or STT-GS for short) strategy, which first samples a subset of images as pilot data from each client for loss prediction. Based on the first-stage evaluation, communication resources are then prioritized towards more valuable clients. To achieve efficient sampling, a feature-domain clustering (FDC) scheme is proposed to select the most representative data and pilot transmission time minimization (PTTM) is adopted to reduce the pilot this http URL, we develop a joint client selection and power control (JCSPC) framework to maximize the GS-oriented function under communication resource constraints. Despite the nonconvexity of the problem, we propose a low-complexity efficient solution based on the penalty alternating majorization minimization (PAMM) algorithm. Experiments show that the proposed scheme significantly outperforms existing benchmarks on real-world datasets. It is found that the GS-oriented objective can be accurately predicted with low sampling ratios (e.g., 10%), and our method achieves an excellent tradeoff between view contributions and communication costs.
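The idea of picking representative pilot images by clustering in feature space can be sketched in a few lines of Python. The abstract does not specify the clustering algorithm or the features used by FDC, so the plain k-means below, the deterministic initialization from the first `k` points, and the function name `select_pilots` are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def select_pilots(features, k, iters=10):
    """Cluster image features with k-means and return, per cluster,
    the index of the image closest to the cluster centroid."""
    centroids = [list(f) for f in features[:k]]  # assumed initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, f in enumerate(features):
            nearest = min(range(k), key=lambda c: sq_dist(f, centroids[c]))
            clusters[nearest].append(idx)
        centroids = [mean([features[i] for i in members]) if members else centroids[j]
                     for j, members in enumerate(clusters)]
    pilots = [min(members, key=lambda i: sq_dist(features[i], centroids[j]))
              for j, members in enumerate(clusters) if members]
    return sorted(pilots)


# Six images whose features form two tight groups; one pilot is chosen per group.
features = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
            [5.0, 5.0], [5.1, 5.0], [5.2, 5.0]]
print(select_pilots(features, k=2))
```

Sending only the selected indices' images as pilot data is what would keep the first-stage overhead small in a scheme of this kind.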
[CV-77] DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization
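Quick read: This paper addresses the poor generalization of deep learning denoising models for Transient Electromagnetic (TEM) signals, which arises in practice because noise characteristics differ across geographical regions. Existing methods are mostly trained on a single scenario or on simulated data and struggle to adapt to noise patterns in new environments, which degrades denoising performance. The key to the solution is Dictionary-driven Prior Regularization Test-time Adaptation (DP-TTA). Its core idea is to use the intrinsic physical characteristics of TEM signals (such as exponential decay and smoothness) as prior knowledge that stays consistent across regions. At test time, the pre-trained model adjusts its parameters dynamically by minimizing self-supervised losses built from dictionary-driven consistency and the first-order variation of the signal, which suppresses noise effectively in new scenarios.
Link: https://arxiv.org/abs/2510.13160
Authors: Meng Yang,Kecheng Chen,Wei Luo,Xianjie Chen,Yong Jia,Mingyue Wang,Fanqiang Lin
Affiliations: Chengdu University of Technology; City University of Hong Kong; Sichuan Provincial Natural Resources Survey and Design Group Co., Ltd.; Massey University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: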
Abstract:The Transient Electromagnetic (TEM) method is widely used in various geophysical applications, providing valuable insights into subsurface properties. However, time-domain TEM signals are often submerged in various types of noise. While recent deep learning-based denoising models have shown strong performance, these models are mostly trained on simulated or single real-world scenario data, overlooking the significant differences in noise characteristics from different geographical regions. Intuitively, models trained in one environment often struggle to perform well in new settings due to differences in geological conditions, equipment, and external interference, leading to reduced denoising performance. To this end, we propose the Dictionary-driven Prior Regularization Test-time Adaptation (DP-TTA). Our key insight is that TEM signals possess intrinsic physical characteristics, such as exponential decay and smoothness, which remain consistent across different regions regardless of external conditions. These intrinsic characteristics serve as ideal prior knowledge for guiding the TTA strategy, which helps the pre-trained model dynamically adjust parameters by utilizing self-supervised losses, improving denoising performance in new scenarios. To implement this, we customized a network, named DTEMDNet. Specifically, we first use dictionary learning to encode these intrinsic characteristics as a dictionary-driven prior, which is integrated into the model during training. At the testing stage, this prior guides the model to adapt dynamically to new environments by minimizing self-supervised losses derived from the dictionary-driven consistency and the first-order variation of the signal. Extensive experimental results demonstrate that the proposed method achieves much better performance than existing TEM denoising methods and TTA methods.
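The smoothness prior mentioned in the abstract can be illustrated with a small Python sketch. It covers only a first-order variation term, not the dictionary-driven consistency term, and the specific form used here (mean squared difference between neighbouring samples) is an assumption; the abstract does not give the paper's loss formula.

```python
import math


def first_order_variation(signal):
    """Mean squared difference between consecutive samples (assumed form)."""
    diffs = [(b - a) ** 2 for a, b in zip(signal, signal[1:])]
    return sum(diffs) / len(diffs)


# A clean exponentially decaying signal, and the same signal with alternating noise.
clean = [math.exp(-0.1 * t) for t in range(50)]
noisy = [v + (0.05 if t % 2 == 0 else -0.05) for t, v in enumerate(clean)]

print(first_order_variation(clean))
print(first_order_variation(noisy))  # larger: the noise breaks smoothness
```

Because the value needs no clean reference signal, a term of this kind can be minimized on unlabeled test data, which is the property test-time adaptation relies on.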
[CV-78] Foveation Improves Payload Capacity in Steganography SIGGRAPH
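Quick read: This paper addresses the limited capacity, insufficient robustness, and difficulty of preserving perceptual quality in traditional steganography for visual media. The key to the solution is to combine efficient latent representations with foveated rendering to build multi-modal latent representations. This raises the payload capacity from 100 to 500 bits while also improving decoding accuracy (as low as 1 failed bit out of 2000), and it embeds the information while keeping high visual fidelity (PSNR: 31.47 dB, LPIPS: 0.13).
Link: https://arxiv.org/abs/2510.13151
Authors: Lifeng Qiu Lin,Henry Kam,Qi Sun,Kaan Akşit
Affiliations: University College London; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: SIGGRAPH Asia 2025 Posters Proceedings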
Abstract:Steganography is used in visual media for purposes such as providing metadata and watermarking. With the support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography.
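For readers unfamiliar with the two kinds of numbers quoted in the abstract, the following Python sketch computes a bit accuracy and a PSNR value from scratch. PSNR follows its standard definition, 10 * log10(MAX^2 / MSE); the flat-list image representation, the function names, and the toy data are assumptions for illustration and are not the paper's evaluation code.

```python
import math


def bit_accuracy(sent, decoded):
    """Fraction of payload bits recovered correctly."""
    return sum(1 for s, d in zip(sent, decoded) if s == d) / len(sent)


def psnr(original, stego, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images given as flat lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, stego)) / len(original)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# One failed bit out of 2000, the best error rate the abstract reports.
sent = [0, 1] * 1000
decoded = list(sent)
decoded[7] = 1 - decoded[7]
print(bit_accuracy(sent, decoded))

# A toy 4-pixel image changed by one grey level in two pixels.
print(psnr([100, 100, 100, 100], [101, 99, 100, 100]))
```

A higher PSNR means the stego image is numerically closer to the cover image, while LPIPS, which is not reproduced here, is a learned perceptual distance where lower is better.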
[CV-79] Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN
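Quick read: This paper addresses the trade-off between model performance and computational efficiency in real-time American Sign Language (ASL) recognition. It compares 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks under similar training conditions on a dataset of 1,200 ASL signs across 50 classes. The 3D CNNs reach a high accuracy of 92.4% but need 3.2% more processing time per frame, while the LSTMs reach 86.7% accuracy with significantly lower resource consumption. The key takeaway, supported by results for a hybrid 3D CNN-LSTM model, is that the architecture should be chosen according to the application scenario, and the study provides quantified performance benchmarks for developing assistive technologies in edge computing environments.
Link: https://arxiv.org/abs/2510.13137
Authors: Madhumati Pol,Anvay Anturkar,Anushka Khot,Ayush Andure,Aniruddha Ghosh,Anvit Magadum,Anvay Bahadur
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: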
Abstract:This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. While 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNN-LSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical this http URL project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.
[CV-80] OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment
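Quick read: This paper addresses the imbalance in cross-modal retrieval for text-image alignment that is caused by the difference in information entropy between the two modalities, which makes conventional methods perform unevenly between text-to-image and image-to-text retrieval. The key to the solution is to use the open semantic knowledge of a Large Language Model (LLM) to fill the entropy gap, with a two-step mechanism that aims to reproduce human-like alignment ability. In the first step, a new prompt template that does not rely on explicit knowledge of the task domain is used to enrich the polysemy description of the text modality, raising its information entropy relative to the visual modality. In the second step, a hypergraph adapter builds multilateral connections between the text and image modalities, correcting positive and negative matching errors for synonymous semantics within the same fixed embedding space, while reducing the noise introduced by open semantic entropy by mapping the reduced dimensions back to the original dimensions.
Link: https://arxiv.org/abs/2510.13131
Authors: Rongjun Chen,Chengsi Yao,Jinchang Ren,Xianxian Zeng,Peixian Wang,Jun Yuan,Jiawen Li,Huimin Zhao,Xu Lu
Affiliations: Guangdong Polytechnic Normal University; National Subsea Centre, Robert Gordon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: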
Abstract:Text-image alignment constitutes a foundational challenge in multimedia content understanding, where effective modeling of cross-modal semantic correspondences critically enhances retrieval system performance through joint embedding space optimization. Given the inherent difference in information entropy between texts and images, conventional approaches often show an imbalance in the mutual retrieval of these two modalities. To address this particular challenge, we propose to use the open semantic knowledge of Large Language Model (LLM) to fill the entropy gap and reproduce the alignment ability of humans in these tasks. Our entropy-enhancing alignment is achieved through a two-step process: 1) a new prompt template that does not rely on explicit knowledge in the task domain is designed to use LLM to enhance the polysemy description of the text modality. By analogy, the information entropy of the text modality relative to the visual modality is increased; 2) a hypergraph adapter is used to construct multilateral connections between the text and image modalities, which can correct the positive and negative matching errors for synonymous semantics in the same fixed embedding space, whilst reducing the noise caused by open semantic entropy by mapping the reduced dimensions back to the original dimensions. Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter), showcasing 16.8% (text-to-image) and 40.1% (image-to-text) cross-modal retrieval gains over existing methods while establishing new state-of-the-art performance in semantic alignment tasks.
[CV-81] VPREG: An Optimal Control Formulation for Diffeomorphic Image Registration Based on the Variational Principle Grid Generation Method
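Quick read: This paper addresses the difficulty of achieving both transformation quality and accuracy in medical image registration, in particular how to improve registration accuracy and the reliability of the inverse transformation while keeping the spatial transformation diffeomorphic. The key to the solution is VPreg, which is built on a grid generation method based on the Variational Principle (VP). The method constructs non-folding grids with a prescribed Jacobian determinant and curl, which ensures that the transformation is diffeomorphic, and it generates the inverse transformation directly within the group of diffeomorphisms rather than approximating it in image space. This design improves the geometric fidelity of the registration and the accuracy of the inverse, outperforming widely used methods such as ANTs-SyN, Freesurfer-Easyreg, and FSL-Fnirt.
Link: https://arxiv.org/abs/2510.13109
Authors: Zicong Zhou,Baihan Zhao,Andreas Mang,Guojun Liao
Affiliations: University of North Texas Health Science Center; University of Texas at Arlington; University of Houston
Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: 30 pages, 9 figures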
Abstract:This paper introduces VPreg, a novel diffeomorphic image registration method. This work provides several improvements to our past work on mesh generation and diffeomorphic image registration. VPreg aims to achieve excellent registration accuracy while controlling the quality of the registration transformations. It ensures a positive Jacobian determinant of the spatial transformation and provides an accurate approximation of the inverse of the registration, a crucial property for many neuroimaging workflows. Unlike conventional methods, VPreg generates this inverse transformation within the group of diffeomorphisms rather than operating on the image space. The core of VPreg is a grid generation approach, referred to as Variational Principle (VP), which constructs non-folding grids with prescribed Jacobian determinant and curl. These VP-generated grids guarantee diffeomorphic spatial transformations essential for computational anatomy and morphometry, and provide a more accurate inverse than existing methods. To assess the potential of the proposed approach, we conduct a performance analysis for 150 registrations of brain scans from the OASIS-1 dataset. Performance evaluation based on Dice scores for 35 regions of interest, along with an empirical analysis of the properties of the computed spatial transformations, demonstrates that VPreg outperforms state-of-the-art methods in terms of Dice scores, regularity properties of the computed transformation, and accuracy and consistency of the provided inverse map. We compare our results to ANTs-SyN, Freesurfer-Easyreg, and FSL-Fnirt.
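The requirement of a positive Jacobian determinant can be checked numerically on a discrete grid. The Python sketch below does this for a 2D transformation using forward differences on a unit-spaced grid; it is a diagnostic for grid folding under those assumptions, not VPreg's grid generation method, and the function name and the grid layout (`phi_x[i][j]`, `phi_y[i][j]` hold the mapped coordinates of the node in row `i`, column `j`) are hypothetical.

```python
def min_jacobian_det(phi_x, phi_y):
    """Smallest forward-difference Jacobian determinant of a 2D grid mapping."""
    rows, cols = len(phi_x), len(phi_x[0])
    dets = []
    for i in range(rows - 1):
        for j in range(cols - 1):
            dxx = phi_x[i][j + 1] - phi_x[i][j]  # d(phi_x)/dx, x runs along columns
            dxy = phi_x[i + 1][j] - phi_x[i][j]  # d(phi_x)/dy, y runs along rows
            dyx = phi_y[i][j + 1] - phi_y[i][j]  # d(phi_y)/dx
            dyy = phi_y[i + 1][j] - phi_y[i][j]  # d(phi_y)/dy
            dets.append(dxx * dyy - dxy * dyx)
    return min(dets)


n = 4
# Identity mapping: every node stays in place, so the determinant is 1 everywhere.
identity_x = [[float(j) for j in range(n)] for _ in range(n)]
identity_y = [[float(i) for _ in range(n)] for i in range(n)]
print(min_jacobian_det(identity_x, identity_y))

# Swapping two neighbouring columns of x-coordinates folds the grid.
folded_x = [[0.0, 2.0, 1.0, 3.0] for _ in range(n)]
print(min_jacobian_det(folded_x, identity_y))  # negative where the grid folds
```

A negative or zero value signals folding at some cell; a positive minimum on the discrete grid is a useful sanity check but is not by itself a proof that the continuous map is diffeomorphic.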
[CV-82] DriveCritic: Towards Context-Aware Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models
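Quick read: This paper addresses the lack of alignment with human judgment when evaluating autonomous driving planners: existing metrics such as the Extended Predictive Driver Model Score (EPDMS) lack context awareness in complex scenarios. The key to the solution is the DriveCritic framework, which has two parts. The first is the DriveCritic dataset, a collection of challenging driving scenarios in which context is needed for a correct judgment, annotated with pairwise human preferences. The second is the DriveCritic model, an evaluator based on a Vision-Language Model (VLM) and trained in two stages with supervised and reinforcement learning, which integrates visual information and symbolic context to judge which of two trajectories is better. Experiments show that the method significantly outperforms existing metrics at matching human preferences and demonstrates stronger context awareness.
Link: https://arxiv.org/abs/2510.13108
Authors: Jingyu Song,Zhenxin Li,Shiyi Lan,Xinglong Sun,Nadine Chang,Maying Shen,Joshua Chen,Katherine A. Skinner,Jose M. Alvarez
Affiliations: University of Michigan; NVIDIA; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 9 pages, 3 figures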
Abstract:Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment and annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems.
[CV-83] EgoSocial: Benchmarking Proactive Intervention Ability of Omnimodal LLMs via Egocentric Social Interaction Perception
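Quick read: This paper addresses the lack of social awareness of current large language models (LLMs) in augmented and virtual reality (AR/VR) settings: the models struggle to judge when to intervene as an AI assistant, so they frequently produce inappropriate responses that disrupt natural social interaction and reduce user focus. Alongside the EgoSocial benchmark, the key to the solution is EgoSoD (EgoSocial Detection), an end-to-end method for recognizing social dynamics. It builds a social thinking graph that integrates multimodal contextual cues (such as audio and visual information) and dynamically models participants and their interactions, so that it can identify intervention timing and social behavior precisely. The method substantially improves the performance of Phi-4 and Gemini 2.5 Pro on intervention timing detection and on social interaction understanding.
Link: https://arxiv.org/abs/2510.13105
Authors: Xijun Wang,Tanay Sharma,Achin Kulshrestha,Abhimitra Meka,Aveek Purohit,Dinesh Manocha
Affiliations: University of Maryland, College Park; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: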
Abstract:As AR/VR technologies become integral to daily life, there’s a growing need for AI that understands human social dynamics from an egocentric perspective. However, current LLMs often lack the social awareness to discern when to intervene as an AI assistant. This leads to constant, socially unaware responses that may disrupt natural conversation and negatively impact user focus. To address these limitations, we introduce EgoSocial, a large-scale egocentric dataset with 13,500 social video-question pairs, specifically designed to benchmark intervention in social interaction perception. We also present an in-depth analysis of current omnimodal LLMs (OLLMs) to assess their effectiveness in detecting diverse social contextual cues. Experiments show that OLLMs still struggle to detect the intervention timing (14.4% for Gemini 2.5 Pro). We also propose EgoSoD (EgoSocial Detection), an end-to-end method for robustly discerning social dynamics. Informed by our OLLM analysis, EgoSoD integrates multimodal contextual cues (e.g., audio and visual cues) into a social thinking graph, dynamically modeling participants and interactions. Our method proactively detects intervention timing and social interactions, precisely determining when to intervene. Our EgoSoD improves Phi-4 by 45.6% and Gemini 2.5 Pro by 9.9% on Intervention Timing performance, and improves Phi-4 by 20.4% and Gemini 2.5 Pro by 6.9% on overall Social Interaction performance. We will release the dataset and code soon.
[CV-84] Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation
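Quick read: This paper addresses the limitations of current text-based video editing methods in computational overhead, memory consumption, and visual fidelity, in particular the heavy resource usage of full-sequence spatio-temporal modeling and temporal inconsistencies such as blurring and mosaic-like artifacts. The key to the solution is Edit-Your-Interest, a lightweight, zero-shot, text-driven video editing framework with three main components: (1) a Spatio-Temporal Feature Memory bank (SFM) that caches the crucial image tokens processed by spatial attention in previous frames, which significantly reduces computational cost; (2) a Feature Most-Similar Propagation (FMP) strategy that propagates the most relevant tokens from previous frames to subsequent frames to preserve temporal consistency; and (3) an SFM update algorithm that continuously refreshes the cached features so that they remain effective over the long term. In addition, cross-attention maps are used to extract masks of the instances of interest automatically, and these masks are integrated into the diffusion denoising process to give fine-grained control over target objects while preserving background integrity. Experiments show that the method outperforms state-of-the-art methods in both efficiency and visual quality.
Link: https://arxiv.org/abs/2510.13084
Authors: Yi Zuo,Zitao Wang,Lingling Li,Xu Liu,Fang Liu,Licheng Jiao
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 11 figures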
Abstract:Text-to-image (T2I) diffusion models have recently demonstrated significant progress in video editing. However, existing video editing methods are severely limited by their high computational overhead and memory consumption. Furthermore, these approaches often sacrifice visual fidelity, leading to undesirable temporal inconsistencies and artifacts such as blurring and pronounced mosaic-like patterns. We propose Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. Edit-Your-Interest introduces a spatio-temporal feature memory to cache features from previous frames, significantly reducing computational overhead compared to full-sequence spatio-temporal modeling approaches. Specifically, we first introduce a Spatio-Temporal Feature Memory bank (SFM), which is designed to efficiently cache and retain the crucial image tokens processed by spatial attention. Second, we propose the Feature Most-Similar Propagation (FMP) method. FMP propagates the most relevant tokens from previous frames to subsequent ones, preserving temporal consistency. Finally, we introduce an SFM update algorithm that continuously refreshes the cached features, ensuring their long-term relevance and effectiveness throughout the video sequence. Furthermore, we leverage cross-attention maps to automatically extract masks for the instances of interest. These masks are seamlessly integrated into the diffusion denoising process, enabling fine-grained control over target objects and allowing Edit-Your-Interest to perform highly accurate edits while robustly preserving the background integrity. Extensive experiments decisively demonstrate that the proposed Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity, validating its superior effectiveness and practicality. 
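The propagation step described in the abstract, in which each position in the current frame receives the most relevant cached token, can be sketched in Python as a nearest-neighbour lookup. The abstract does not state which similarity measure FMP uses, so cosine similarity is an assumption here, as are the function names and the representation of tokens as lists of floats.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_tokens(current_tokens, memory_tokens):
    """For each current-frame token, return the cached token most similar to it."""
    return [max(memory_tokens, key=lambda m: cosine(tok, m))
            for tok in current_tokens]


memory = [[1.0, 0.0], [0.0, 1.0]]    # tokens cached from earlier frames
current = [[0.9, 0.1], [0.2, 0.8]]   # tokens of the frame being edited
print(most_similar_tokens(current, memory))
```

Reusing cached tokens this way avoids attending over the full video sequence for every frame, which is the source of the efficiency gain the abstract claims over full-sequence spatio-temporal modeling.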
[CV-85] Counting Hallucinations in Diffusion Models
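Quick read: This paper addresses a problem that is widespread in image generation with Diffusion Probabilistic Models (DPMs) but hard to quantify: "counting hallucination", where generated samples contain an incorrect number of instances or structured objects that is inconsistent with the training data, such as a hand image with six fingers. The key to the solution is CountHalluSet, a dataset suite with well-defined counting criteria that comprises three subsets (ToyShape, SimObject, and RealHand), together with a standardized evaluation protocol that systematically quantifies how different sampling conditions (solver type, ordinary differential equation (ODE) solver order, number of sampling steps, and initial noise) affect counting hallucinations. The study also shows that the widely used metric FID (Fréchet Inception Distance) does not capture counting hallucinations consistently, providing a quantifiable benchmark and a new perspective for designing next-generation generative models under factual constraints.
Link: https://arxiv.org/abs/2510.13080
Authors: Shuai Fu,Jian Zhou,Qi Chen,Huang Jing,Huy Anh Nguyen,Xiaohan Liu,Zhixiong Zeng,Lin Ma,Quanshi Zhang,Qi Wu
Affiliations: University of Adelaide; Meituan; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: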
Abstract:Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently. This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.
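Once per-image instance counts are available, a counting hallucination rate reduces to simple arithmetic. The Python sketch below assumes a hypothetical setup in which a counter has already produced one count per generated image and the set of counts present in the training data is known; it is an illustration of the idea, not the paper's evaluation protocol.

```python
def counting_hallucination_rate(generated_counts, valid_counts):
    """Fraction of generated samples whose instance count never occurs in training data."""
    bad = sum(1 for c in generated_counts if c not in valid_counts)
    return bad / len(generated_counts)


# Hypothetical finger counts for five generated hand images; training hands all have 5.
finger_counts = [5, 5, 6, 5, 4]
print(counting_hallucination_rate(finger_counts, valid_counts={5}))
```

Repeating the computation while varying solver type, solver order, number of sampling steps, or initial noise would give the kind of comparison across sampling conditions that the abstract describes.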
[CV-86] Unsupervised Domain Adaptation via Content Alignment for Hippocampus Segmentation
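Quick read: This paper addresses the drop in medical image segmentation performance that deep learning models suffer when deployed across datasets because of domain shift, focusing on how changes in image appearance (style) and in population-dependent anatomical characteristics (content) affect hippocampus segmentation accuracy. The key to the solution is an unsupervised domain adaptation framework that combines efficient style harmonisation (z-normalisation) with a bidirectional deformable image registration (DIR) strategy. The DIR network is trained jointly with the segmentation network and a discriminator, so that registration is guided by the region of interest and produces anatomically plausible deformation fields that map source images to the target domain. This markedly improves segmentation across populations (for example, from young, healthy cohorts to clinical dementia patients), with up to a 15% relative improvement in Dice score over standard augmentation methods.
Link: https://arxiv.org/abs/2510.13075
Authors: Hoda Kalabizadeh,Ludovica Griffanti,Pak-Hei Yeung,Ana I. L. Namburete,Nicola K. Dinsdale,Konstantinos Kamnitsas
Affiliations: University of Oxford; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: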
Abstract:Deep learning models for medical image segmentation often struggle when deployed across different datasets due to domain shifts - variations in both image appearance, known as style, and population-dependent anatomical characteristics, referred to as content. This paper presents a novel unsupervised domain adaptation framework that directly addresses domain shifts encountered in cross-domain hippocampus segmentation from MRI, with specific emphasis on content variations. Our approach combines efficient style harmonisation through z-normalisation with a bidirectional deformable image registration (DIR) strategy. The DIR network is jointly trained with segmentation and discriminator networks to guide the registration with respect to a region of interest and generate anatomically plausible transformations that align source images to the target domain. We validate our approach through comprehensive evaluations on both a synthetic dataset using Morpho-MNIST (for controlled validation of core principles) and three MRI hippocampus datasets representing populations with varying degrees of atrophy. Across all experiments, our method outperforms existing baselines. For hippocampus segmentation, when transferring from young, healthy populations to clinical dementia patients, our framework achieves up to 15% relative improvement in Dice score compared to standard augmentation methods, with the largest gains observed in scenarios with substantial content shift. These results highlight the efficacy of our approach for accurate hippocampus segmentation across diverse populations.
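The two quantities named in this entry, z-normalisation for style harmonisation and the Dice score for evaluation, can be sketched in a few lines. This is a minimal illustration on flat lists of intensities, assuming whole-volume statistics; whether the paper computes the statistics inside a brain mask or over the full scan is not stated here, and the sample values are made up.

```python
import math

def z_normalise(volume):
    """Rescale intensities to zero mean and unit variance (population std)."""
    n = len(volume)
    mean = sum(volume) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in volume) / n)
    if std == 0.0:
        return [0.0 for _ in volume]
    return [(v - mean) / std for v in volume]

def dice_score(pred, target):
    """Dice = 2|A and B| / (|A| + |B|) for binary masks given as 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

# Two hypothetical "scans" of the same anatomy with different gain and offset.
scan_a = [10.0, 20.0, 30.0, 40.0]
scan_b = [105.0, 110.0, 115.0, 120.0]  # scan_a * 0.5 + 100
norm_a = z_normalise(scan_a)
norm_b = z_normalise(scan_b)  # matches norm_a up to rounding
```

Because z-normalisation removes any positive affine intensity change, it can only address style differences of that kind; anatomical (content) differences are what the registration component is meant to handle.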
[CV-87] Direction-aware multi-scale gradient loss for infrared and visible image fusion
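Quick read: This paper addresses poor edge fidelity in infrared and visible image fusion caused by the loss of gradient information. Existing learning-based methods usually combine a structural similarity loss, an intensity reconstruction loss, and a gradient-magnitude term, but collapsing gradients to their magnitude discards direction, which makes the supervision signal ambiguous and hurts edge sharpness and texture preservation. The key to the solution is a direction-aware, multi-scale gradient loss that supervises the horizontal and vertical gradient components separately and preserves their sign at every scale. This gives unambiguous directional guidance at both coarse and fine resolutions and improves edge sharpness and alignment without any change to the model architecture or training protocol.
Link: https://arxiv.org/abs/2510.13067
Authors: Kaixuan Yang,Wei Xiang,Zhenshuai Chen,Tong Jin,Yunpeng Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: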
Abstract:Infrared and visible image fusion aims to integrate complementary information from co-registered source images to produce a single, informative result. Most learning-based approaches train with a combination of structural similarity loss, intensity reconstruction loss, and a gradient-magnitude term. However, collapsing gradients to their magnitude removes directional information, yielding ambiguous supervision and suboptimal edge fidelity. We introduce a direction-aware, multi-scale gradient loss that supervises horizontal and vertical components separately and preserves their sign across scales. This axis-wise, sign-preserving objective provides clear directional guidance at both fine and coarse resolutions, promoting sharper, better-aligned edges and richer texture preservation without changing model architectures or training protocols. Experiments on open-source model and multiple public benchmarks demonstrate effectiveness of our approach.
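To see why keeping the sign of each gradient component matters, the sketch below compares an axis-wise, sign-preserving loss with a magnitude-only baseline on a toy edge and its polarity-reversed copy. It is an illustration only: the finite-difference operator, the 2x2 average pooling, the L1 distance, and the per-axis absolute-value baseline are assumptions, and how the paper builds its target gradients from the infrared and visible inputs is not covered.

```python
def grad_x(img):
    """Horizontal forward differences, keeping the sign."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]

def grad_y(img):
    """Vertical forward differences, keeping the sign."""
    return [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))]
            for i in range(len(img) - 1)]

def downsample2(img):
    """2x2 average pooling (assumes even height and width)."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, len(img[0]), 2)]
            for i in range(0, len(img), 2)]

def mean_abs_diff(a, b):
    vals = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(vals) / len(vals)

def absolute(grad):
    return [[abs(v) for v in row] for row in grad]

def directional_loss(fused, target, scales=2):
    """L1 distance between signed x- and y-gradients, summed over scales."""
    loss = 0.0
    for _ in range(scales):
        loss += mean_abs_diff(grad_x(fused), grad_x(target))
        loss += mean_abs_diff(grad_y(fused), grad_y(target))
        fused, target = downsample2(fused), downsample2(target)
    return loss

def magnitude_loss(fused, target):
    """Baseline that compares only absolute gradients, so polarity is lost."""
    return (mean_abs_diff(absolute(grad_x(fused)), absolute(grad_x(target)))
            + mean_abs_diff(absolute(grad_y(fused)), absolute(grad_y(target))))

# A dark-to-bright vertical edge and its polarity-reversed copy (toy data).
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
flipped = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
loss_signed = directional_loss(flipped, edge)
loss_magnitude = magnitude_loss(flipped, edge)
```

On this pair the magnitude-only baseline is zero even though the edge polarity is inverted, while the signed loss is positive at both scales.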
[CV-88] True Self-Supervised Novel View Synthesis is Transferable
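Quick read: This paper addresses the fact that the poses learned by self-supervised novel view synthesis (NVS) models do not transfer: the camera poses extracted by existing methods do not generalise across scenes, so the same set of poses produces different camera trajectories in different 3D scenes. The key to the solution is XFactor, a geometry-free self-supervised model whose core idea is to combine pair-wise pose estimation with a simple augmentation scheme applied to the inputs and outputs, which disentangles camera pose from scene content and facilitates geometric reasoning. Notably, XFactor achieves transferable pose variables without an explicit SE(3) parameterization or any 3D inductive bias, and the authors support this with a newly proposed transferability metric and large-scale experiments.
Link: https://arxiv.org/abs/2510.13063
Authors: Thomas W. Mitchel,Hyunwoo Ryu,Vincent Sitzmann
Affiliations: PlayStation; MIT CSAIL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: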
Abstract:In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry – such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.
[CV-89] One Dimensional CNN ECG Mamba for Multilabel Abnormality Classification in 12 Lead ECG
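Quick read: This paper addresses accurate detection of cardiac abnormalities from electrocardiogram (ECG) signals, in particular the limited performance of traditional deep learning models on long sequential ECG data. The key to the solution is a hybrid framework, One Dimensional Convolutional Neural Network Electrocardiogram Mamba, which combines convolutional feature extraction with Mamba, a selective state space model, and builds on the bidirectional design of Vision Mamba to better model temporal dependencies in ECG data. On the PhysioNet Computing in Cardiology Challenge 2020 and 2021 datasets, the method is reported to clearly outperform existing algorithms, with higher AUPRC and AUROC on twelve-lead ECG classification, which points to the potential of Mamba-based architectures for reliable ECG classification.
Link: https://arxiv.org/abs/2510.13046
Authors: Huawei Jiang,Husna Mutahira,Gan Huang,Mannan Saeed Muhammad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 Pages, 2 figures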
Abstract:Accurate detection of cardiac abnormalities from electrocardiogram recordings is regarded as essential for clinical diagnostics and decision support. Traditional deep learning models such as residual networks and transformer architectures have been applied successfully to this task, but their performance has been limited when long sequential signals are processed. Recently, state space models have been introduced as an efficient alternative. In this study, a hybrid framework named One Dimensional Convolutional Neural Network Electrocardiogram Mamba is introduced, in which convolutional feature extraction is combined with Mamba, a selective state space model designed for effective sequence modeling. The model is built upon Vision Mamba, a bidirectional variant through which the representation of temporal dependencies in electrocardiogram data is enhanced. Comprehensive experiments on the PhysioNet Computing in Cardiology Challenges of 2020 and 2021 were conducted, and superior performance compared with existing methods was achieved. Specifically, the proposed model achieved substantially higher AUPRC and AUROC scores than those reported by the best previously published algorithms on twelve lead electrocardiograms. These results demonstrate the potential of Mamba-based architectures to advance reliable ECG classification. This capability supports early diagnosis and personalized treatment, while enhancing accessibility in telemedicine and resource-constrained healthcare systems.
[CV-90] SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion
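Quick read: This paper addresses the lack of scene awareness in existing text-conditioned generative motion models: current methods struggle to deliver both rich motion semantics and precise scene interaction. The core difficulty is that large-scale datasets with both fine-grained text-motion coverage and accurate scene-interaction annotation are extremely hard to build. The key to the solution is SceneAdapt, a framework that transfers knowledge from disjoint scene-motion and text-motion datasets through two adaptation stages, inbetweening and scene-aware inbetweening. The first stage introduces keyframing layers that modulate the motion latents while preserving the latent manifold; the second adds a scene-conditioning layer that adaptively queries local scene geometry through cross-attention, which injects scene awareness into a text-to-motion model.
Link: https://arxiv.org/abs/2510.13044
Authors: Jungbin Cho,Minsu Kim,Jisoo Kim,Ce Zheng,Laszlo A. Jeni,Ming-Hsuan Yang,Youngjae Yu,Seonjoo Kim
Affiliations: Yonsei University; Carnegie Mellon University; UC Merced; Google DeepMind; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages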
Abstract:Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text–motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene–motion and text–motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness to text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
[CV-91] SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models
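Quick read: This paper addresses a gap in how text-to-video (T2V) generation models are evaluated for narrative coherence over long sequences: existing benchmarks focus on visual quality and do not systematically measure logical progression across multiple events or temporal consistency. The key to the solution is SeqBench, a human-annotated dataset of 320 diverse prompts and 2,560 videos generated by 8 state-of-the-art T2V models, together with an automatic evaluation metric based on Dynamic Temporal Graphs (DTG) that efficiently captures long-range dependencies and temporal ordering and correlates strongly with human annotations. Using this framework, the study exposes clear weaknesses of current T2V models in keeping object states consistent across multi-action sequences, in producing physically plausible multi-object results, and in preserving the timing and ordering of actions, and it gives future work a quantitative yardstick for improving sequential reasoning.
Link: https://arxiv.org/abs/2510.13042
Authors: Zhengxu Tang,Zizheng Wang,Luning Wang,Zitao Shuai,Chenhao Zhang,Siyu Qian,Yirui Wu,Bohao Wang,Haosong Rao,Zhenyu Yang,Chenwei Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: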
Abstract:Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences. To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graphs (DTG)-based automatic evaluation metric, which can efficiently capture long-range dependencies and temporal ordering while maintaining computational efficiency. Our DTG-based metric demonstrates a strong correlation with human annotations. Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models. Please refer to this https URL for more details.
[CV-92] SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
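Quick read: This paper addresses fine-grained action understanding in video together with precise spatio-temporal localization of the corresponding actors, that is, how to detect, track, and temporally localize multiple objects performing a given action under the guidance of a natural language description. Existing methods usually target only coarse-grained action recognition or generic object tracking and overlook the challenge of jointly detecting multiple objects by action and grounding them in space and time. The key to the solution is a new task, Spatio-temporal Video Action Grounding (SVAG), along with the large-scale benchmark SVAG-Bench (688 videos, 19,590 annotated records, and 903 verbs) and the baseline SVAGFormer, which adapts state-of-the-art vision-language models for joint spatial and temporal grounding; the standardized toolkit SVAGEval supports fair, reproducible comparison. Experiments show that current models perform poorly in dense or complex scenes, which underlines the need for more advanced reasoning over fine-grained object-action interactions in long videos.
Link: https://arxiv.org/abs/2510.13016
Authors: Tanveer Hannan,Shuaicong Wu,Mark Weber,Suprosanna Shit,Jindong Gu,Rajat Koner,Aljoša Ošep,Laura Leal-Taixé,Thomas Seidl
Affiliations: LMU Munich; MCML; Technical University of Munich; University of Zurich; University of Oxford; Google DeepMind; Amazon; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: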
Abstract:Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state of the art vision language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.
[CV-93] Scope: Selective Cross-modal Orchestration of Visual Perception Experts
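Quick read: This paper addresses the efficiency bottleneck of using multiple vision encoders in vision-language models (VLMs): naively stacking encoders improves performance, but with diminishing returns as inference cost multiplies. The key to the solution is SCOPE, a Mixture-of-Encoders (MoEnc) architecture with instance-level routing, in which a lightweight router uses cross-attention between the text prompt and shared visual features to select the most suitable specialized encoder for each image-text pair. Dual entropy regularization with auxiliary losses balances dataset-level load distribution against instance-level routing confidence, so that the model outperforms running all encoders in parallel while cutting compute by 24-49%.
Link: https://arxiv.org/abs/2510.12974
Authors: Tianyu Zhang,Suyuchen Wang,Chao Wang,Juan Rodriguez,Ahmed Masry,Xiangru Jian,Yoshua Bengio,Perouz Taslakian
Affiliations: ServiceNow; Université de Montréal; École de Technologie Supérieure; University of Waterloo; McGill University; York University; University of British Columbia; Universitat Autònoma de Barcelona; Mila; CIFAR AI Chair; Polytechnique Montréal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 2 figures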
Abstract:Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
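The routing idea can be illustrated with a small sketch: one routed encoder is chosen per image-text pair, and two entropy terms describe the trade-off between confident per-instance decisions and balanced load across encoders. The logits below are hypothetical stand-ins for the router output, and the function names are illustrative. How SCOPE weights or signs these terms, and what its auxiliary losses look like, is not stated in the abstract.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route(batch_logits):
    """Pick one routed encoder per image-text pair (instance-level, top-1)."""
    probs = [softmax(row) for row in batch_logits]
    choices = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, choices

def dual_entropy_terms(probs):
    """Return (instance_entropy, load_entropy).

    instance_entropy: mean entropy of each pair's routing distribution;
        lowering it makes individual routing decisions more confident.
    load_entropy: entropy of the batch-averaged distribution;
        raising it spreads traffic more evenly across encoders.
    """
    n, k = len(probs), len(probs[0])
    instance_entropy = sum(entropy(p) for p in probs) / n
    mean_probs = [sum(p[j] for p in probs) / n for j in range(k)]
    return instance_entropy, entropy(mean_probs)

# Hypothetical router logits for 4 image-text pairs over 2 routed encoders.
logits = [[4.0, 0.0], [0.0, 4.0], [5.0, 0.0], [0.0, 5.0]]
probs, choices = route(logits)
inst_h, load_h = dual_entropy_terms(probs)
```

In this toy batch every pair is routed confidently (low instance entropy) while the two encoders receive equal traffic (load entropy at its maximum, ln 2), which is the regime the two terms jointly encourage.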
[CV-94] CADE 2.5 - ZeResFDG: Frequency-Decoupled Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models
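Quick read: This paper addresses blurry detail, weak prompt adherence, and insufficient control of high-frequency artifacts in images generated by diffusion models, in particular the difficulty of balancing sharpness and stability at moderate guidance scales. The key to the solution is CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack whose central module, ZeResFDG, unifies three mechanisms: (i) frequency-decoupled guidance, which reweights the low- and high-frequency components of the guidance signal; (ii) energy rescaling, which matches the per-sample magnitude of the guided prediction to the positive branch; and (iii) zero-projection, which removes the component parallel to the unconditional direction. A lightweight spectral exponential moving average (EMA) with hysteresis switches adaptively between a conservative mode and a detail-seeking mode during sampling, improving sharpness and structural fidelity. A training-free QSilk Micrograin Stabilizer (quantile clamp plus depth/edge-gated micro-detail injection) further improves the naturalness and robustness of high-frequency micro-texture at high resolutions at negligible compute cost.
Link: https://arxiv.org/abs/2510.12954
Authors: Denis Rychkovskiy(DZRobo, Independent Researcher),GPT-5(AI collaborator, OpenAI)
Affiliations: OpenAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures. Endorsed by Dr. Seyedmorteza Sadat (ETH Zurich). The work introduces CADE 2.5 with ZeResFDG as a practical inference-time guidance stack for SD/SDXL. Code and visual examples to be released on GitHub and Hugging Face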
Abstract:We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.
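Two of the three ZeResFDG ingredients, zero-projection and energy rescaling, have a compact vector form. The sketch below is one plausible reading of the abstract, not the authors' implementation: it assumes "unconditional direction" means the unconditional prediction itself, applies a full (unblended) rescale, and omits frequency decoupling and the mode switch entirely. The 2-D vectors stand in for flattened latent predictions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def guided_prediction(cond, uncond, scale, eps=1e-12):
    """Guidance with zero-projection followed by energy rescaling."""
    delta = [c - u for c, u in zip(cond, uncond)]
    # Zero-projection: drop the part of delta parallel to the unconditional output.
    coeff = dot(delta, uncond) / (dot(uncond, uncond) + eps)
    delta_perp = [d - coeff * u for d, u in zip(delta, uncond)]
    guided = [u + scale * d for u, d in zip(uncond, delta_perp)]
    # Energy rescaling: match the norm of the conditional ("positive") branch.
    factor = norm(cond) / (norm(guided) + eps)
    return [factor * g for g in guided], delta_perp

# Toy 2-D stand-ins for the conditional and unconditional predictions.
cond = [3.0, 4.0]
uncond = [1.0, 0.0]
guided, delta_perp = guided_prediction(cond, uncond, scale=5.0)
```

After these two steps the guidance update is orthogonal to the unconditional prediction and the output has the same norm as the conditional branch, so a large guidance scale changes direction without inflating magnitude.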
[CV-95] Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
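Quick read: This paper addresses the weak performance of current medical vision-language models on fetal ultrasound, where the main challenges are multi-view image reasoning, a large number of diseases, and high image diversity. To meet these challenges the authors propose FetalMind, whose core innovation is **Salient Epistemic Disentanglement (SED)**: an expert-curated bipartite graph is injected into the model to decouple view-disease associations, and reinforcement learning steers preference selection so that the reasoning path follows obstetric clinical workflow. The design mitigates variability across diseases and heterogeneity across views, which reduces learning bottlenecks and improves diagnostic accuracy and interpretability.
Link: https://arxiv.org/abs/2510.12953
Authors: Xiao He,Huangxuan Zhao,Guojia Wan,Wei Zhou,Yanxing Liu,Juhua Liu,Yongchao Xu,Yong Luo,Dacheng Tao,Bo Du
Affiliations: National Engineering Research Center for Multimedia Software; School of Computer Science, Wuhan University; College of Computing and Data Science, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: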
Abstract:Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model’s inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: this https URL.
[CV-96] Robust Plant Disease Diagnosis with Few Target-Domain Samples
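Quick read: This paper addresses the loss of generalization that deep learning models for plant disease diagnosis suffer because of the domain gap between training data and the deployment environment, which shows up as a marked drop in diagnostic accuracy on images captured under different conditions. The key to the solution is Target-Aware Metric Learning with Prioritized Sampling (TMPS), a simple but highly adaptable learning framework grounded in metric learning. Given only a small number of labeled target-domain samples (10 images per disease), it uses them effectively to make the model more robust to unseen domains and substantially improves cross-domain diagnostic performance.
Link: https://arxiv.org/abs/2510.12909
Authors: Takafumi Nogami,Satoshi Kagiwada,Hitoshi Iyatomi
Affiliations: Hosei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures. Accepted at the IEEE International Conference on Visual Communications and Image Processing (VCIP) 2025. Extended version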
Abstract:Various deep learning-based systems have been proposed for accurate and convenient plant disease diagnosis, achieving impressive performance. However, recent studies show that these systems often fail to maintain diagnostic accuracy on images captured under different conditions from the training environment – an essential criterion for model robustness. Many deep learning methods have shown high accuracy in plant disease diagnosis. However, they often struggle to generalize to images taken in conditions that differ from the training setting. This drop in performance stems from the subtle variability of disease symptoms and domain gaps – differences in image context and environment. The root cause is the limited diversity of training data relative to task complexity, making even advanced models vulnerable in unseen domains. To tackle this challenge, we propose a simple yet highly adaptable learning framework called Target-Aware Metric Learning with Prioritized Sampling (TMPS), grounded in metric learning. TMPS operates under the assumption of access to a limited number of labeled samples from the target (deployment) domain and leverages these samples effectively to improve diagnostic robustness. We assess TMPS on a large-scale automated plant disease diagnostic task using a dataset comprising 223,073 leaf images sourced from 23 agricultural fields, spanning 21 diseases and healthy instances across three crop species. By incorporating just 10 target domain samples per disease into training, TMPS surpasses models trained using the same combined source and target samples, and those fine-tuned with these target samples after pre-training on source data. It achieves average macro F1 score improvements of 7.3 and 3.6 points, respectively, and a remarkable 18.7 and 17.1 point improvement over the baseline and conventional metric learning.
[CV-97] State-Change Learning for Prediction of Future Events in Endoscopic Videos
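Quick read: This paper addresses the lack of intraoperative future-event prediction in current surgical AI research. Existing methods mostly focus on understanding what is happening now and offer no unified way to predict upcoming events (such as action triplets, complication risk, and phase transitions), and no framework spans both short-term tasks (action triplets, events) and long-term tasks (remaining surgery duration, phase transitions). They also rely on coarse-grained supervision that does not model fine-grained surgical action dynamics, and methods based only on future feature prediction generalize poorly across surgical contexts. The key to the solution is to reframe surgical future prediction as state-change learning rather than direct prediction of raw observations. The proposed SurgFUTR model uses a teacher-student architecture: the teacher network learns state representations from both current and future video clips, while the student network predicts future states from the current video alone, guided by the Action Dynamics (ActDyn) module to capture how action dynamics evolve. The approach is reported to improve multi-task prediction performance and generalization across procedures.
Link: https://arxiv.org/abs/2510.12904
Authors: Saurav Sharma,Chinedu Innocent Nwoye,Didier Mutter,Nicolas Padoy
Affiliations: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, France; IHU Strasbourg, Strasbourg, France; University Hospital of Strasbourg, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 13 figures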
Abstract:Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks-enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limits by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, implementing this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.
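The abstract says clips are compressed into states via Sinkhorn-Knopp clustering. A generic version of that normalisation is sketched below: it alternately rescales columns and rows of a clip-by-cluster score matrix so that every cluster receives an equal share of the clips. The scores, the temperature `epsilon`, and the iteration count are illustrative assumptions, and this is not SurgFUTR's actual configuration.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.05, iters=50):
    """Turn an N x K score matrix into balanced soft cluster assignments."""
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(iters):
        # Scale columns so every cluster holds total mass 1/K.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Scale rows so every clip holds total mass 1/N.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    return [[v * n for v in row] for row in q]  # each row now sums to 1

# Hypothetical clip-to-state scores: every clip prefers state 0 on raw scores.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
assignments = sinkhorn_knopp(scores, epsilon=0.5, iters=50)
hard = [max(range(2), key=row.__getitem__) for row in assignments]
```

A plain argmax over the raw scores would put all four clips in state 0. The balanced assignment instead splits them evenly, which is the usual reason for using Sinkhorn-Knopp: it prevents the state vocabulary from collapsing.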
[CV-98] SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
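Quick read: This paper addresses two core problems in high-fidelity multi-sensor simulation for autonomous robots. First, existing neural rendering methods such as NeRF and 3DGS are either too slow for real-time use or support only pinhole camera models, which makes them a poor fit for the high-distortion lenses and LiDAR data common in practice. Second, multi-sensor simulation suffers from cross-sensor inconsistencies, which existing methods handle by sacrificing the quality of one modality for another. The key to the solution is SimULi, whose contributions are: 1) an extension of 3DGUT that renders arbitrary camera models and spinning LiDAR in real time, using an automated tiling strategy and ray-based culling to handle complex LiDAR geometry efficiently; 2) a factorized 3D Gaussian representation and anchoring strategy that keeps modalities consistent while reducing mean camera and depth error by up to 40%. The method renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work, and matches or exceeds state-of-the-art fidelity on two widely used autonomous driving datasets.
Link: https://arxiv.org/abs/2510.12901
Authors: Haithem Turki,Qi Wu,Xin Kang,Janick Martinez Esturo,Shengyu Huang,Ruilong Li,Zan Gojcic,Riccardo de Lutio
Affiliations: NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL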
Abstract:Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.
[CV-99] Learning to Grasp Anything by Playing with Random Toys
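Quick read: This paper addresses the poor generalization of robotic manipulation policies to novel objects, which limits their real-world use. Inspired by children's cognitive development, the key to the solution is to train the robot to grasp "toys" that are randomly assembled from just four shape primitives (spheres, cuboids, cylinders, and rings), which yields zero-shot generalization to real-world objects. The study finds that this generalization hinges on the object-centric visual representation induced by the proposed detection pooling mechanism, which lets the model extract broadly applicable features from limited, simple training data. The model performs well both in simulation and on physical robots, for example reaching a 67% real-world grasping success rate on the YCB dataset and outperforming state-of-the-art approaches that rely on substantially more in-domain data.
Link: https://arxiv.org/abs/2510.12866
Authors: Dantong Niu,Yuvan Sharma,Baifeng Shi,Rachel Ding,Matteo Gioia,Haoru Xue,Henry Tsai,Konstantinos Kallidromitis,Anirudh Pai,Shankar Shastry,Trevor Darrell,Jitendra Malik,Roei Herzig
Affiliations: University of California, Berkeley; La Sapienza University; Panasonic; ItalAI
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: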
Abstract:Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these “toys” enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation. Demonstration videos, code, checkpoints and our dataset are available on our project page: this https URL .
[CV-100] Dedelayed: Deleting remote inference delay via on-device correction
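Quick read: This paper addresses the problem that communication network latency makes remote inference predictions stale and unsuitable for real-time tasks. The key to the solution is Dedelayed, a delay-corrective method in which a lightweight local model processes the current frame and fuses in features that a heavyweight remote model computed from past frames. Without adding delay, this mitigates the effect of arbitrary remote inference delays on output accuracy and gives low-latency, accurate real-time inference.
Link: https://arxiv.org/abs/2510.13714
Authors: Dan Jacobellis,Mateen Ulhaq,Fabien Racapé,Hyomin Choi,Neeraja J. Yadwadkar
Affiliations: University of Texas at Austin; InterDigital
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: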
Abstract:Remote inference allows lightweight devices to leverage powerful cloud models. However, communication network latency makes predictions stale and unsuitable for real-time tasks. To address this, we introduce Dedelayed, a delay-corrective method that mitigates arbitrary remote inference delays, allowing the local device to produce low-latency outputs in real time. Our method employs a lightweight local model that processes the current frame and fuses in features that a heavyweight remote model computes from past frames. On video from the BDD100K driving dataset, Dedelayed improves semantic segmentation accuracy over the stronger of the local-only and remote-only baselines across all realistic communication network delays beyond 33 ms. Without incurring additional delay, it improves accuracy by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference, for a round-trip delay of 100 ms. The advantage grows under longer delays and higher-motion scenes, as delay-mitigated split inference sustains accuracy more effectively, providing clear advantages for real-time tasks that must remain aligned with the current world state.
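The data flow described here, a local model that always sees the current frame plus remote features computed from an older one, can be mimicked with a small queue simulation. Everything in it is a placeholder: `remote_model`, `local_model`, the additive fusion, and the `delay_frames` parameter are hypothetical names chosen for illustration, and the real system's networks and fusion mechanism are not described at this level in the abstract.

```python
from collections import deque

def remote_model(frame):
    """Stand-in for the heavyweight cloud model: returns (frame_id, features)."""
    return (frame["id"], [v * 2.0 for v in frame["pixels"]])

def local_model(frame, stale_features):
    """Stand-in for the lightweight on-device model, which fuses current
    pixels with remote features computed from an older frame."""
    if stale_features is None:  # nothing has arrived from the cloud yet
        return list(frame["pixels"])
    return [p + f for p, f in zip(frame["pixels"], stale_features)]

def run_stream(frames, delay_frames):
    """Yield (current_id, remote_source_id, output) for each frame."""
    in_flight = deque()  # remote results still "on the network"
    for frame in frames:
        in_flight.append(remote_model(frame))
        arrived = in_flight.popleft() if len(in_flight) > delay_frames else None
        source_id = arrived[0] if arrived else None
        features = arrived[1] if arrived else None
        yield frame["id"], source_id, local_model(frame, features)

# Six toy frames with a round-trip delay of three frame intervals.
frames = [{"id": i, "pixels": [float(i)]} for i in range(6)]
log = list(run_stream(frames, delay_frames=3))
```

The output for every frame is produced immediately, and once the pipeline fills, the remote features it uses are always exactly `delay_frames` old; the local model's job is to compensate for that gap.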
[CV-101] An efficient approach with theoretical guarantees to simultaneously reconstruct activity and attenuation sinogram for TOF-PET
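[Quick Read]: This paper addresses attenuation correction in positron emission tomography (PET). Conventional methods obtain the attenuation map from an additional CT or MRI scan, which adds radiation dose and scanning time and introduces misalignment caused by motion during and between the two scans. The key to the solution is a new mathematical model, based on maximum likelihood estimation, that reconstructs the activity sinogram and the attenuation sinogram simultaneously from time-of-flight (TOF) PET emission data alone. Its core innovation is to exploit the exclusively exponential form of the attenuation correction factors and to constrain the total activity within a mask region, which enables autonomous attenuation correction without an extra scan. The authors prove existence, uniqueness, and stability of the solution, analyze the convergence of the proposed alternating update algorithm, and report numerical experiments in which the method is robust to noise and outperforms some state-of-the-art methods in accuracy and efficiency.
Link: https://arxiv.org/abs/2510.13562
Authors: Liyang Hu, Chong Chen
Affiliations: State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments: 32 pages, 11 figures, 4 tables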
Abstract:In positron emission tomography (PET), it is indispensable to perform attenuation correction in order to obtain the quantitatively accurate activity map (tracer distribution) in the body. Generally, this is carried out based on the estimated attenuation map obtained from computed tomography or magnetic resonance imaging. However, apart from errors in the resulting attenuation correction factors, the additional scan not only brings in new radiation doses and/or increases the scanning time but also leads to severe misalignment induced by various motions during and between the two sequential scans. To address these issues, based on maximum likelihood estimation, we propose a new mathematical model for simultaneously reconstructing the activity and attenuation sinogram from the time-of-flight (TOF)-PET emission data only. In particular, we make full use of the exclusively exponential form of the attenuation correction factors, and include in the proposed model a constraint on the total amount of activity in some mask region. Furthermore, we prove its well-posedness, including the existence, uniqueness and stability of the solution. We propose an alternating update algorithm to solve the model, and also analyze its convergence. Finally, numerical experiments with various TOF-PET emission data demonstrate that the proposed method converges numerically, is robust to noise, outperforms some state-of-the-art methods in terms of accuracy and efficiency, and is capable of autonomous attenuation correction.
[CV-102] Steerable Conditional Diffusion for Domain Adaptation in PET Image Reconstruction
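[Quick Read]: This paper addresses artefacts caused by domain shift when diffusion models are used for positron emission tomography (PET) reconstruction: when the training data and the target data differ in anatomy, acquisition protocol, or pathology, the model may generate unrealistic or incorrect image structures. The key to the solution is to combine steerable conditional diffusion (SCD) with the previously proposed likelihood-scheduled diffusion (PET-LiSch) framework and to apply low-rank adaptation (LoRA) during reconstruction, adjusting the diffusion model prior on the fly to match the target domain. This suppresses hallucinated artefacts induced by domain shift and improves reconstruction quality.
Link: https://arxiv.org/abs/2510.13441
Authors: George Webber, Alexander Hammers, Andrew P. King, Andrew J. Reader
Affiliations: King’s College London; Guy’s and St Thomas’ PET Centre
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for oral presentation at IEEE NSS MIC RTSD 2025 (submitted May 2025; accepted July 2025; to be presented Nov 2025)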
Abstract:Diffusion models have recently enabled state-of-the-art reconstruction of positron emission tomography (PET) images while requiring only image training data. However, domain shift remains a key concern for clinical adoption: priors trained on images from one anatomy, acquisition protocol or pathology may produce artefacts on out-of-distribution data. We propose integrating steerable conditional diffusion (SCD) with our previously-introduced likelihood-scheduled diffusion (PET-LiSch) framework to improve the alignment of the diffusion model’s prior to the target subject. At reconstruction time, for each diffusion step, we use low-rank adaptation (LoRA) to align the diffusion model prior with the target domain on the fly. Experiments on realistic synthetic 2D brain phantoms demonstrate that our approach suppresses hallucinated artefacts under domain shift, i.e. when our diffusion model is trained on perturbed images and tested on normal anatomy, our approach suppresses the hallucinated structure, outperforming both OSEM and diffusion model baselines qualitatively and quantitatively. These results provide a proof-of-concept that steerable priors can mitigate domain shift in diffusion-based PET reconstruction and motivate future evaluation on real data.
Artificial Intelligence
[AI-0] Provably Invincible Adversarial Attacks on Reinforcement Learning Systems: A Rate-Distortion Information-Theoretic Approach
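[Quick Read]: This paper addresses the limited robustness of reinforcement learning (RL) systems under adversarial attack, noting that prior work mostly considers deterministic attack strategies, which the victim agent can defeat by reversing them. It proposes a provably “invincible” or “uncounterable” attack. The core idea is a rate-distortion information-theoretic framework in which the attacker randomly perturbs the agent’s observations of the transition kernel (or other properties), so that the agent gains little or no information about the ground-truth environment during training. The paper derives an information-theoretic lower bound on the agent’s reward regret, shows the impact of such attacks on state-of-the-art model-based and model-free RL algorithms, and extends the idea to other attack types such as state observation attacks.
Link: https://arxiv.org/abs/2510.13792
Authors: Ziqing Lu, Lifeng Lai, Weiyu Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: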
Abstract:Reinforcement learning (RL) for the Markov Decision Process (MDP) has emerged in many security-related applications, such as autonomous driving, financial decisions, and drone/robot algorithms. In order to improve the robustness/defense of RL systems against adversaries, studying various adversarial attacks on RL systems is very important. Most previous work considered deterministic adversarial attack strategies in MDP, which the recipient (victim) agent can defeat by reversing the deterministic attacks. In this paper, we propose a “provably invincible” or “uncounterable” type of adversarial attack on RL. The attackers apply a rate-distortion information-theoretic approach to randomly change agents’ observations of the transition kernel (or other properties) so that the agent gains zero or very limited information about the ground-truth kernel (or other properties) during the training. We derive an information-theoretic lower bound on the recipient agent’s reward regret and show the impact of rate-distortion attacks on state-of-the-art model-based and model-free algorithms. We also extend this notion of an information-theoretic approach to other types of adversarial attack, such as state observation attacks.
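[AI-1] The Art of Scaling Reinforcement Learning Compute for LLMs
[Quick Read]: This paper addresses the lack of predictable scaling methodology for reinforcement learning (RL) in training large language models (LLMs), in contrast to the scaling laws established for pre-training. The key is the first large-scale systematic study (more than 400,000 GPU-hours), which fits sigmoidal compute-performance curves and ablates common design choices to measure their effect on asymptotic performance and compute efficiency. The findings: not all recipes reach the same asymptotic performance, while details such as loss aggregation, normalization, curriculum, and off-policy algorithm mainly affect compute efficiency rather than the performance ceiling. The authors then propose a best-practice recipe, ScaleRL, which follows a stable and predictable scaling trajectory; a single RL run was scaled to 100,000 GPU-hours with its validation performance predicted successfully, bringing RL training closer to the predictability of pre-training.
Link: https://arxiv.org/abs/2510.13786
Authors: Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages, 20 figures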
Abstract:Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
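To make the notion of a sigmoidal compute-performance curve concrete, here is a minimal sketch. The abstract does not state the functional form, so the parameterization below (a floor, an asymptotic ceiling, a midpoint compute, and a steepness exponent) is an assumption, and `sigmoid_curve` and `fit_ceiling` are hypothetical helper names. The grid search stands in for whatever fitting procedure the authors actually use.

```python
def sigmoid_curve(compute, floor, ceiling, midpoint, steepness):
    """Saturating curve: near `floor` at low compute, approaching `ceiling` at high compute."""
    return floor + (ceiling - floor) / (1.0 + (midpoint / compute) ** steepness)


def fit_ceiling(points, floor, midpoint, steepness, candidates):
    """Pick the asymptote from `candidates` that minimizes squared error on (compute, score) points."""
    def squared_error(ceiling):
        return sum(
            (sigmoid_curve(c, floor, ceiling, midpoint, steepness) - y) ** 2
            for c, y in points
        )
    return min(candidates, key=squared_error)


if __name__ == "__main__":
    # Synthetic small-scale measurements generated from a known ceiling of 0.8.
    observed = [(c, sigmoid_curve(c, 0.2, 0.8, 100.0, 2.0)) for c in (10.0, 50.0, 200.0)]
    print(fit_ceiling(observed, 0.2, 100.0, 2.0, [0.6, 0.7, 0.8, 0.9]))
```

In this form, the ceiling plays the role of asymptotic performance, while the midpoint and steepness govern how quickly compute is converted into performance, which mirrors the distinction the abstract draws between the asymptote and compute efficiency.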
[AI-2] From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
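[Quick Read]: This paper addresses agentic AI safety in deployed generative AI systems: guardrails based on output classification struggle with new downstream hazards (such as financial or physical harm) that arise from evolving interactions, and when a risk is detected they simply refuse to act, offering no path to recovery. The key is to model AI safety as a sequential decision problem and to use safety-critical control theory within the AI model’s latent representation of the world to build predictive guardrails. These monitor the system’s actions in real time and, in a model-agnostic manner, proactively correct risky actions to safe ones, shifting from static flag-and-block behavior to dynamic monitoring and correction. Experiments in simulated driving and e-commerce settings show that guardrails trained with safety-critical reinforcement learning can steer agents away from catastrophic outcomes while preserving task performance.
Link: https://arxiv.org/abs/2510.13727
Authors: Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernández Fisac, Andrea Bajcsy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: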
Abstract:Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next-generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards like financial or physical harm. Yet, most AI guardrails continue to rely on output classification based on labeled datasets and human-specified criteria, making them brittle to new hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act, which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system’s continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety-critical control theory, but within the AI model’s latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system’s outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model-agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety-critical reinforcement learning. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled dynamic alternative to today’s flag-and-block guardrails.
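The monitor-and-correct pattern described in this abstract resembles a safety filter wrapped around an arbitrary policy. The sketch below is an illustration under stated assumptions, not the paper's method: the safety value is a hand-written toy function rather than one learned in a latent space via reinforcement learning, and `guarded_action`, `toy_safety_value`, and the "closest safe action" correction rule are all hypothetical choices.

```python
def guarded_action(state, proposed, candidates, safety_value, threshold=0.0):
    """Pass the proposed action through if it is safe; otherwise correct it instead of refusing."""
    if safety_value(state, proposed) >= threshold:
        return proposed  # monitor only: the proposed action is already safe
    safe = [a for a in candidates if safety_value(state, a) >= threshold]
    if safe:
        # Assumed correction rule: the safe action closest to what the agent wanted.
        return min(safe, key=lambda a: abs(a - proposed))
    # No action clears the threshold: fall back to the least unsafe one.
    return max(candidates, key=lambda a: safety_value(state, a))


def toy_safety_value(gap, speed):
    # Toy stand-in: margin left after driving at `speed` for 2 time units
    # toward an obstacle that is `gap` distance units away.
    return gap - 2.0 * speed
```

Because the wrapper only needs a proposed action and a safety value, it does not depend on how the underlying agent produced the action, which is the sense in which such a guardrail can be model-agnostic.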
[AI-3] FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access
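[Quick Read]: This paper addresses the growing demand in scientific workflows for private, secure, and scalable AI inference, in particular large-scale inference without reliance on commercial cloud infrastructure. The key is the Federated Inference Resource Scheduling Toolkit (FIRST), an Inference-as-a-Service framework spanning distributed high-performance computing (HPC) clusters. It uses Globus Auth and Globus Compute for secure authentication and resource scheduling, exposes a cluster-agnostic, OpenAI-compliant API that distributes requests across multiple hosted models, supports multiple inference backends (e.g., vLLM), auto-scales resources, keeps “hot” nodes for low-latency execution, and offers both high-throughput batch and interactive modes, enabling billions of tokens per day to be generated on-premises.
Link: https://arxiv.org/abs/2510.13724
Authors: Aditya Tanikanti, Benoit Côté, Yanfei Guo, Le Chen, Nickolaus Saint, Ryan Chard, Ken Raffenetti, Rajeev Thakur, Thomas Uram, Ian Foster, Michael E. Papka, Venkatram Vishwanath
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: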
Abstract:We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, like Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API on private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains “hot” nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
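[AI-4] Training LLM Agents to Empower Humans
[Quick Read]: This paper addresses two problems with current assistive language models: they tend to complete tasks on their own instead of ceding control to the human when important decisions arise, and existing training methods depend on costly human feedback or verifiable rewards. The central question is how to optimize an AI agent for human empowerment without explicit human annotation. The key is Empower, a training method that maximizes the human’s empowerment, that is, their ability to effect desired changes in the environment. It needs only offline text data and fine-tunes the model in a self-supervised way. In multi-turn code assistance, it raises the success rate of a simulated human programmer by an average of 192% over an SFT baseline, and in an 18-person user study participants preferred the Empower assistant 78% of the time.
Link: https://arxiv.org/abs/2510.13709
Authors: Evan Ellis, Vivek Myers, Jens Tuyls, Sergey Levine, Anca Dragan, Benjamin Eysenbach
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: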
Abstract:Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human attain their objectives. Additionally, these methods often require costly explicit human feedback to provide a training signal. We propose a new approach to tuning assistive language models based on maximizing the human’s empowerment, their ability to effect desired changes in the environment. Our empowerment-maximizing method, Empower, only requires offline text data, providing a self-supervised method for fine-tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18-person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate and 38% fewer suggestions. Additionally, we introduce a new environment for evaluating multi-turn code assistance using simulated humans. Using this environment, we show that agents trained with Empower increase the success rate of a simulated human programmer on challenging coding questions by an average of 192% over an SFT baseline. With this empowerment objective, we provide a framework for useful aligned AI agents at scale using only offline data without the need for any additional human feedback or verifiable rewards.
[AI-5] Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
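[Quick Read]: This paper addresses the low sample efficiency of actor-critic reinforcement learning methods, which can still require many environment interactions even with large-scale environment parallelization. The key is simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures and thereby supply a geometric inductive bias. The resulting sparse, discrete features stabilize critic bootstrapping and strengthen policy gradients, improving sample efficiency and final performance for FastTD3, FastSAC, and PPO with no loss in runtime speed.
Link: https://arxiv.org/abs/2510.13704
Authors: Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, Pablo Samuel Castro
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: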
Abstract:Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require a large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
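One common way to constrain an embedding to a product of simplices is a group-wise softmax: split the feature vector into equal groups and normalize each group so it is non-negative and sums to one. The sketch below assumes that formulation; the abstract itself does not spell out the layer, and `simplicial_embedding`, `num_groups`, and `temperature` are illustrative names.

```python
import math


def simplicial_embedding(z, num_groups, temperature=1.0):
    """Apply a softmax independently to each of `num_groups` equal-sized chunks of `z`."""
    if len(z) % num_groups != 0:
        raise ValueError("embedding length must be divisible by num_groups")
    size = len(z) // num_groups
    out = []
    for g in range(num_groups):
        chunk = z[g * size:(g + 1) * size]
        peak = max(chunk)  # subtract the max for numerical stability
        exps = [math.exp((v - peak) / temperature) for v in chunk]
        total = sum(exps)
        out.extend(e / total for e in exps)  # each chunk now lies on a simplex
    return out
```

Lowering `temperature` pushes each group toward a one-hot vector, which is one way to read the "sparse and discrete features" mentioned in the abstract.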
[AI-6] A Modal Logic for Temporal and Jurisdictional Classifier Models
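[Quick Read]: This paper addresses how to formally model and verify the reasoning of machine learning (ML) classifiers in the legal domain, in particular case-based reasoning (CBR). The core challenge is to bring the rules for resolving conflicts between precedents, and the hierarchy of courts, into a logical framework. The key is a modal logic of classifiers that incorporates the temporal dimension of cases and the court hierarchy of the legal system, explicitly modeling priority and conflict resolution among precedents and providing a basis for formal verification of ML classifiers in legal settings.
Link: https://arxiv.org/abs/2510.13691
Authors: Cecilia Di Florio, Huimin Dong, Antonino Rotolo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 2 figures. Extended version of a short paper accepted at PRIMA 2025. This is the authors’ version of the work. It is posted here for your personal use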
Abstract:Logic-based models can be used to build verification tools for machine learning classifiers employed in the legal field. ML classifiers predict the outcomes of new cases based on previous ones, thereby performing a form of case-based reasoning (CBR). In this paper, we introduce a modal logic of classifiers designed to formally capture legal CBR. We incorporate principles for resolving conflicts between precedents, by introducing into the logic the temporal dimension of cases and the hierarchy of courts within the legal system.
[AI-7] Axial Neural Networks for Dimension-Free Foundation Models
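[Quick Read]: This paper addresses the inefficiency that arises when training foundation models on physics data, such as solutions to partial differential equations (PDEs), whose dimensionality varies across systems. Traditional approaches either fix a maximum dimension or use separate encoders per dimensionality. The key is a dimension-agnostic architecture, the Axial Neural Network (XNN), inspired by parameter-sharing structures such as Deep Sets and graph neural networks, which generalizes across tensor dimensions while remaining computationally efficient. Experiments show that XNNs perform competitively with the original models across training scenarios and generalize better to unseen dimensions, highlighting the importance of multidimensional pretraining for foundation models.
Link: https://arxiv.org/abs/2510.13665
Authors: Hyunsu Kim, Jonggeon Park, Joan Bruna, Hongseok Yang, Juho Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: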
Abstract:The advent of foundation models in AI has significantly advanced general-purpose learning, enabling remarkable capabilities in zero-shot inference and in-context learning. However, training such models on physics data, including solutions to partial differential equations (PDEs), poses a unique challenge due to varying dimensionalities across different systems. Traditional approaches either fix a maximum dimension or employ separate encoders for different dimensionalities, resulting in inefficiencies. To address this, we propose a dimension-agnostic neural network architecture, the Axial Neural Network (XNN), inspired by parameter-sharing structures such as Deep Sets and Graph Neural Networks. XNN generalizes across varying tensor dimensions while maintaining computational efficiency. We convert existing PDE foundation models into axial neural networks and evaluate their performance across three training scenarios: training from scratch, pretraining on multiple PDEs, and fine-tuning on a single PDE. Our experiments show that XNNs perform competitively with original models and exhibit superior generalization to unseen dimensions, highlighting the importance of multidimensional pretraining for foundation models.
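[AI-8] Time Series Foundation Models: Benchmarking Challenges and Requirements
[Quick Read]: This paper addresses several challenges in evaluating Time Series Foundation Models (TSFMs): unrepresentative benchmark datasets, the lack of spatiotemporal evaluation, information leakage from overlapping and obscure datasets, and memorization of global patterns caused by external shocks such as economic crises or pandemics. These issues risk inflated performance estimates and incorrect transfer of global knowledge to local time series. The key is to develop robust evaluation methodologies that avoid pitfalls already observed in LLM and classical time series benchmarking, for example evaluation on truly out-of-sample future data.
Link: https://arxiv.org/abs/2510.13654
Authors: Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Oliver Müller
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: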
Abstract:Time Series Foundation Models (TSFMs) represent a new paradigm for time series forecasting, offering zero-shot forecasting capabilities without the need for domain-specific pre-training or fine-tuning. However, as with Large Language Models (LLMs), evaluating TSFMs is tricky: with ever more extensive training sets, it becomes increasingly challenging to ensure the integrity of benchmarking data. Our investigation of existing TSFM evaluation highlights multiple challenges, ranging from the representativeness of the benchmark datasets and the lack of spatiotemporal evaluation to risks of information leakage due to overlapping and obscure datasets, and the memorization of global patterns caused by external shocks like economic crises or pandemics. Our findings reveal widespread confusion regarding data partitions, risking inflated performance estimates and incorrect transfer of global knowledge to local time series. We argue for the development of robust evaluation methodologies to prevent pitfalls already observed in LLM and classical time series benchmarking, and call upon the research community to design new, principled approaches, such as evaluations on truly out-of-sample future data, to safeguard the integrity of TSFM assessment.
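[AI-9] The Role of Computing Resources in Publishing Foundation Model Research
[Quick Read]: This paper examines the unclear relationship between resources and scientific progress in foundation model (FM) research, in particular how computing resources (such as GPUs) and funding relate to research output (such as citations) and to research diversity. The authors reviewed 6517 FM papers published between 2022 and 2024 and surveyed 229 first authors. They find that increased computing is correlated with national funding allocations and citations, but observe no strong correlation with research environment, domain, or study methodology. They recommend that individuals and institutions create shared and affordable computing opportunities to lower the entry barrier for under-resourced researchers and thereby broaden participation, diversity of ideas, and sustained innovation in FM research.
Link: https://arxiv.org/abs/2510.13621
Authors: Yuexing Hao, Yue Huang, Haoran Zhang, Chenyang Zhao, Zhenwen Liang, Paul Pu Liang, Yue Zhao, Lichao Sun, Saleh Kalantari, Xiangliang Zhang, Marzyeh Ghassemi
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: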
Abstract:Cutting-edge research in Artificial Intelligence (AI) requires considerable resources, including Graphics Processing Units (GPUs), data, and human resources. In this paper, we evaluate the relationship between these resources and the scientific advancement of foundation models (FM). We reviewed 6517 FM papers published between 2022 and 2024, and surveyed 229 first authors on the impact of computing resources on scientific output. We find that increased computing is correlated with national funding allocations and citations, but we do not observe strong correlations with research environment (academic or industrial), domain, or study methodology. We advise that individuals and institutions focus on creating shared and affordable computing opportunities to lower the entry barrier for under-resourced researchers. These steps can help expand participation in FM research, foster diversity of ideas and contributors, and sustain innovation and progress in AI. The data will be available at: this https URL
[AI-10] Message Passing on the Edge: Towards Scalable and Expressive GNNs
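[Quick Read]: This paper addresses the limited expressive power of graph neural networks (GNNs), in particular the limits of the 1-Weisfeiler-Lehman (1-WL) test in distinguishing non-isomorphic graphs. Existing more expressive methods often incur much higher computational cost, which hinders practical use. The paper proposes an edge-based color-refinement test (EB-1WL) and a corresponding architecture, EB-GNN. Inspired by the classic triangle counting algorithm of Chiba and Nishizeki, it explicitly uses triangles during message passing. EB-1WL has a complete logical characterization based on first-order logic, with matching distinguishability results based on homomorphism counting. EB-1WL and EB-GNN need only near-linear time and memory on practical graph learning tasks; empirically, EB-GNN substantially outperforms simple message-passing neural networks (MPNNs) and stays competitive with task-specialized GNNs while being more computationally efficient.
Link: https://arxiv.org/abs/2510.13615
Authors: Pablo Barceló, Fabian Jogl, Alexander Kozachinskiy, Matthias Lanzinger, Stefan Neumann, Cristóbal Rojas
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: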
Abstract:We propose EB-1WL, an edge-based color-refinement test, and a corresponding GNN architecture, EB-GNN. Our architecture is inspired by a classic triangle counting algorithm by Chiba and Nishizeki, and explicitly uses triangles during message passing. We achieve the following results: (1) EB-1WL is significantly more expressive than 1-WL. Further, we provide a complete logical characterization of EB-1WL based on first-order logic, and matching distinguishability results based on homomorphism counting. (2) In an important distinction from previous proposals for more expressive GNN architectures, EB-1WL and EB-GNN require near-linear time and memory on practical graph learning tasks. (3) Empirically, we show that EB-GNN is a highly-efficient general-purpose architecture: It substantially outperforms simple MPNNs, and remains competitive with task-specialized GNNs while being significantly more computationally efficient.
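Since the architecture relies on triangles around each edge, a small triangle-listing routine helps show what information is available per edge. The sketch below is a degree-ordered triangle listing, a simpler relative of the Chiba and Nishizeki algorithm rather than that algorithm itself, and it is not the EB-1WL refinement. The names `triangles` and `edge_triangle_counts` are hypothetical, and vertex labels are assumed to be mutually comparable (for example integers).

```python
def triangles(edges):
    """List each triangle of an undirected simple graph exactly once."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Orient every edge from lower to higher (degree, label) rank.
    rank = {v: (len(adj[v]), v) for v in adj}
    higher = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    found = []
    for u in adj:
        for v in higher[u]:
            # A common higher-ranked neighbor of u and v closes a triangle.
            for w in higher[u] & higher[v]:
                found.append(tuple(sorted((u, v, w))))
    return found


def edge_triangle_counts(edges):
    """Map each edge (as a sorted pair) to the number of triangles that contain it."""
    counts = {tuple(sorted(e)): 0 for e in edges}
    for a, b, c in triangles(edges):
        for e in ((a, b), (a, c), (b, c)):
            counts[e] += 1
    return counts
```

Per-edge triangle counts like these are one example of a feature that plain vertex-based 1-WL color refinement cannot compute, which gives some intuition for why triangle-aware message passing can be more expressive.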
[AI-11] Subject Roles in the EU AI Act: Mapping and Regulatory Implications
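[Quick Read]: This paper addresses the unclear division of responsibilities and the unclear propagation of obligations among subjects under the complex governance structure of the EU Artificial Intelligence Act. The key is a systematic mapping of the six categories of subjects defined in Article 3 (providers, deployers, authorized representatives, importers, distributors, and product manufacturers) and their duties across the regulation’s 113 articles, 180 recitals, and 13 annexes. The paper describes the role-transformation mechanisms of Article 25, under which accountability follows control, and shows how obligations cascade through the supply chain via mandatory information flows and cooperation requirements. The result is a distributed yet coordinated governance system with risk-based obligations that protects fundamental rights while regulating AI systems differently according to their risk.
Link: https://arxiv.org/abs/2510.13591
Authors: Nicola Fabiano
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: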
Abstract:The European Union’s Artificial Intelligence Act (Regulation (EU) 2024/1689) establishes the world’s first comprehensive regulatory framework for AI systems through a sophisticated ecosystem of interconnected subjects defined in Article 3. This paper provides a structured examination of the six main categories of actors - providers, deployers, authorized representatives, importers, distributors, and product manufacturers - collectively referred to as “operators” within the regulation. Through examination of these Article 3 definitions and their elaboration across the regulation’s 113 articles, 180 recitals, and 13 annexes, we map the complete governance structure and analyze how the AI Act regulates these subjects. Our analysis reveals critical transformation mechanisms whereby subjects can assume different roles under specific conditions, particularly through Article 25 provisions ensuring accountability follows control. We identify how obligations cascade through the supply chain via mandatory information flows and cooperation requirements, creating a distributed yet coordinated governance system. The findings demonstrate how the regulation balances innovation with the protection of fundamental rights through risk-based obligations that scale with the capabilities and deployment contexts of AI systems, providing essential guidance for stakeholders implementing the AI Act’s requirements.
[AI-12] OpenDerisk: An Industrial Framework for AI-Driven SRE with Design Implementation and Case Studies
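[Quick Read]: This paper addresses the unsustainable operational burden that increasingly complex software places on Site Reliability Engineering (SRE) teams, and the shortcomings of existing AI methods, which either lack deep causal reasoning or are not tailored to SRE-specific diagnostic workflows. The key is OpenDerisk, an open-source multi-agent framework for SRE. It integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP), so that specialist agents can jointly diagnose and resolve complex, multi-domain problems.
Link: https://arxiv.org/abs/2510.13561
Authors: Peng Di, Faqiang Chen, Xiao Bai, Hongjun Yang, Qingfeng Li, Ganglin Wei, Jian Mou, Feng Shi, Keting Chen, Peng Tang, Zhitao Shen, Zheng Li, Wenhui Shi, Junwei Guo, Hang Yu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 23 pages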
Abstract:The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at this https URL
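[AI-13] Tandem Training for Language Models
[Quick Read]: This paper addresses the loss of interpretability and oversight as language models become more capable: their actions and reasoning may become hard for weaker agents or humans to follow. The key is to formalize intelligibility as handoff robustness and to introduce tandem training: during reinforcement learning (RL) training, rollout tokens are intermittently and randomly sampled from a frozen weaker model instead of the strong model being trained. A rollout succeeds only if the weak model can continue the strong model’s actions and reasoning, so the strong model is pushed to adapt to its weaker partner while optimizing task accuracy, which rewards correctness and intelligibility together.
Link: https://arxiv.org/abs/2510.13551
Authors: Robert West, Ashton Anderson, Ece Kamar, Eric Horvitz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: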
Abstract:As language models continue to rapidly improve, we can expect their actions and reasoning to become difficult or impossible for weaker agents and humans to follow, undermining interpretability and oversight. With an eye on long-term futures, we pursue methods that encourage models to produce solutions that remain intelligible to weaker collaborators. We formalize intelligibility as handoff robustness: a strong model’s solution is intelligible to a weaker model if randomly handing off control to the weaker model along the solution path does not cause failure. Building on this criterion, we introduce tandem training for language models, a reinforcement learning (RL) paradigm in which rollout tokens are intermittently and randomly sampled from a frozen weak model rather than the strong model being trained. Because rollouts succeed only when the strong model’s actions and reasoning process can be continued by the weak model – when the two can co-construct a successful solution – optimizing standard RL objectives with tandem training implicitly incentivizes both correctness and intelligibility. In the GSM8K math reasoning task, tandem training reliably teaches models to abandon jargon and adapt their language to weaker partners while keeping task accuracy high. Our results demonstrate a promising route to building AI systems that remain auditable by weaker agents, with implications for human–AI collaboration and multi-agent communication.
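The handoff mechanism can be sketched as a rollout loop in which control of token generation switches between two models. This is a toy illustration under stated assumptions: the abstract says only that tokens are sampled "intermittently and randomly" from the frozen weak model, so the per-token switching probability used here is an assumed schedule, and `tandem_rollout`, `strong_step`, and `weak_step` are hypothetical names. Reward computation and the RL update of the strong model are not shown.

```python
import random


def tandem_rollout(strong_step, weak_step, prompt, max_tokens, switch_prob, rng):
    """Generate tokens while randomly handing control back and forth between two models."""
    tokens = list(prompt)
    controller = "strong"
    trace = []  # which model produced each generated token
    for _ in range(max_tokens):
        if rng.random() < switch_prob:
            # Assumed handoff rule: toggle the controlling model.
            controller = "weak" if controller == "strong" else "strong"
        step = strong_step if controller == "strong" else weak_step
        token = step(tokens)  # each model sees the full shared prefix
        tokens.append(token)
        trace.append(controller)
        if token == "<eos>":
            break
    return tokens, trace


if __name__ == "__main__":
    # Toy stand-ins for the two models: each emits a fixed marker token.
    strong = lambda prefix: "s"
    weak = lambda prefix: "w"
    print(tandem_rollout(strong, weak, ["q"], 6, 0.3, random.Random(0)))
```

Because both models extend the same prefix, a rollout can only succeed when the weak model is able to continue what the strong model started, which is the property the abstract calls handoff robustness.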
[AI-14] In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers
[Quick Read]: This paper addresses indirect prompt injection attacks against agentic AI browsers performing web automation: malicious instructions hidden in a webpage deceive the AI agent into unintended actions, bypassing traditional web security boundaries. The key is a new fuzzing framework that runs entirely in the browser and is guided by a large language model (LLM) to discover such prompt injection vulnerabilities automatically and in real time.
Link: https://arxiv.org/abs/2510.13543
Authors: Avihay Cohen
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 37 pages, 10 figures
Abstract:Large Language Model (LLM) based agents integrated into web browsers (often called agentic AI browsers) offer powerful automation of web tasks. However, they are vulnerable to indirect prompt injection attacks, where malicious instructions hidden in a webpage deceive the agent into unwanted actions. These attacks can bypass traditional web security boundaries, as the AI agent operates with the user’s privileges across sites. In this paper, we present a novel fuzzing framework that runs entirely in the browser and is guided by an LLM to automatically discover such prompt injection vulnerabilities in real time.
[AI-15] A Methodology for Assessing the Risk of Metric Failure in LLMs Within the Financial Domain NEURIPS2025
[Quick Read]: This paper addresses a barrier to adopting generative AI in financial services: the lack of effective metrics for measuring model performance. Historical machine learning metrics often fail to generalize to generative AI workloads; supplementing them with Subject Matter Expert (SME) evaluation often still overlooks risks specific to the chosen metrics; and many benchmarks built by research labs and educational institutions do not transfer to industrial use. The key is a Risk Assessment Framework that supports better application of SME evaluation and machine learning metrics.
Link: https://arxiv.org/abs/2510.13524
Authors: William Flanagan, Mukunda Das, Rajitha Ramanyake, Swaunja Maslekar, Meghana Manipuri, Joong Ho Choi, Shruti Nair, Shambhavi Bhusan, Sanjana Dulam, Mouni Pendharkar, Nidhi Singh, Vashisth Doshi, Sachi Shah Paresh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 GenAI in Finance Workshop
Abstract:As Generative Artificial Intelligence is adopted across the financial services industry, a significant barrier to adoption and usage is measuring model performance. Historical machine learning metrics can oftentimes fail to generalize to GenAI workloads and are often supplemented using Subject Matter Expert (SME) Evaluation. Even in this combination, many projects fail to account for various unique risks present in choosing specific metrics. Additionally, many widespread benchmarks created by foundational research labs and educational institutions fail to generalize to industrial use. This paper explains these challenges and provides a Risk Assessment Framework to allow for better application of SME and machine learning Metrics.
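[AI-16] Offline and Online KL-Regularized RLHF under Differential Privacy
[Quick read]: This paper gives a theoretical analysis of reinforcement learning from human feedback (RLHF) with a KL-regularized objective, in both the offline and online settings, under $\epsilon$-local differential privacy ($\epsilon$-LDP). The core challenge is to protect the privacy of human preference labels while still guaranteeing the effectiveness and convergence of policy optimization. In the offline setting, the authors design an algorithm based on the principle of pessimism and obtain a suboptimality gap of $\tilde{O}(1/[(e^{\epsilon}-1)^2 n])$ under single-policy concentrability. In the online setting, they propose the first optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}} \log(N_{\mathcal{F}} \cdot T)/(e^{\epsilon}-1)^2)$, where $d_{\mathcal{F}}$ is a variant of the eluder dimension for RLHF and $N_{\mathcal{F}}$ is the cardinality of the reward function space $\mathcal{F}$. As a by-product, the results also imply the first analysis of online KL-regularized RLHF without privacy.
Link: https://arxiv.org/abs/2510.13512
Authors: Yulian Wu,Rushil Thareja,Praneeth Vepakomma,Francesco Orabona
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: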
Abstract:In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization – a widely used objective function in large language model alignment – under the $\epsilon$ local differential privacy ($\epsilon$-LDP) model on the label of the human preference. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^{\epsilon}-1)^2 n])$ on the KL-regularized objective under single-policy concentrability. We also prove its optimality by providing a matching lower bound where $n$ is the sample size. In the online setting, we are the first one to theoretically investigate the problem of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}} \log(N_{\mathcal{F}} \cdot T)/(e^{\epsilon}-1)^2)$, where $T$ is the total time step, $N_{\mathcal{F}}$ is cardinality of the reward function space $\mathcal{F}$ and $d_{\mathcal{F}}$ is a variant of eluder dimension for RLHF. As a by-product of our analysis, our results also imply the first analysis for online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open source code at: this https URL.
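To make the privacy model on preference labels concrete, here is a minimal sketch of binary randomized response, the textbook mechanism for $\epsilon$-LDP on a single bit. The abstract does not say which mechanism the authors use, so treating it as randomized response is an assumption, and the function names are hypothetical.

```python
import math
import random


def flip_probability(epsilon):
    """Probability of reporting the opposite label under binary randomized response."""
    return 1.0 / (math.exp(epsilon) + 1.0)


def privatize(label, epsilon, rng):
    """Return an epsilon-LDP version of a preference label in {0, 1}."""
    return 1 - label if rng.random() < flip_probability(epsilon) else label


def debias(mean_private, epsilon):
    """Unbiased estimate of the true preference rate from the privatized mean.

    E[private] = p + q * (1 - 2p), where q is the true rate and p the flip probability.
    """
    p = flip_probability(epsilon)
    return (mean_private - p) / (1.0 - 2.0 * p)


if __name__ == "__main__":
    rng = random.Random(0)
    eps = 1.0
    true_labels = [1 if rng.random() < 0.7 else 0 for _ in range(20000)]
    noisy = [privatize(y, eps, rng) for y in true_labels]
    print("debiased rate:", debias(sum(noisy) / len(noisy), eps))
```

The debiasing step divides by $1 - 2p = (e^{\epsilon}-1)/(e^{\epsilon}+1)$, so the estimator's variance is inflated by roughly the square of the inverse of that factor. That is one way to see why a term like $(e^{\epsilon}-1)^2$ can appear in the denominators of the bounds quoted above.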
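[AI-17] Confidence as a Reward: Transforming LLMs into Reward Models
[Quick read]: This paper addresses the dependence of current reward models on large amounts of annotated data and costly training. The authors study a training-free method called Confidence-as-a-Reward (CRew), whose core idea is to use the token-level confidence of a large language model (LLM) in its final answer as a proxy reward signal, which is especially suitable for close-ended tasks. On mathematical reasoning, CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks and even surpasses most trained reward models, and CRew scores correlate strongly with the model's actual reasoning performance. Building on this finding, the authors propose CRew-DPO, which constructs preference data from confidence scores combined with correctness signals and uses it for fine-tuning to strengthen the model's judging capabilities.
Link: https://arxiv.org/abs/2510.13501
Authors: He Du,Bowen Li,Chengxing Xie,Chang Gao,Kai Chen,Dacheng Tao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: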
Abstract:Reward models can significantly enhance the reasoning capabilities of large language models (LLMs), but they typically require extensive curated data and costly training. To mitigate these challenges, training-free approaches such as LLM-as-a-Judge leverage the intrinsic reasoning abilities of LLMs to evaluate responses, achieving promising results. Recent works have also indicated that model confidence can serve effectively as a reward metric, distinguishing between chain-of-thought (CoT) and non-CoT paths. However, the concept of using confidence as a reward has not been comprehensively studied. In this work, we systematically investigate Confidence-as-a-Reward (CRew), a simple yet powerful training-free method that utilizes token-level confidence in the model’s final answers as a proxy for reward, especially suitable for close-ended tasks. Through extensive experiments on mathematical reasoning tasks, we demonstrate that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks, and even surpasses most trained reward models. We further identify a strong correlation between CRew scores and the actual reasoning performance of the model. Additionally, we find that CRew can effectively filter high-quality training data. Building upon these insights, we propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals. Finetuning with CRew-DPO further enhances the model’s judging capabilities and consistently outperforms existing self-training methods.
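As a rough illustration of "token-level confidence in the final answer as a proxy for reward", the sketch below scores candidate answers by the geometric-mean probability of their answer tokens and picks the most confident one. The aggregation rule, the dictionary layout, and the function names are assumptions made for this example; the abstract does not specify how CRew aggregates token confidences.

```python
import math


def answer_confidence(answer_logprobs):
    """Geometric-mean probability of the final-answer tokens (an assumed aggregation)."""
    if not answer_logprobs:
        raise ValueError("need at least one answer token")
    return math.exp(sum(answer_logprobs) / len(answer_logprobs))


def pick_by_confidence(candidates):
    """Return the candidate whose final-answer tokens have the highest confidence."""
    return max(candidates, key=lambda c: answer_confidence(c["answer_logprobs"]))


if __name__ == "__main__":
    # Hypothetical sampled responses with per-token log-probabilities of the answer span.
    candidates = [
        {"answer": "12", "answer_logprobs": [-0.1, -0.2]},
        {"answer": "15", "answer_logprobs": [-1.5, -0.9]},
    ]
    best = pick_by_confidence(candidates)
    print(best["answer"], round(answer_confidence(best["answer_logprobs"]), 3))
```

The same score could also rank pairs of responses when building preference data, which is the role the abstract describes for CRew-DPO once correctness signals are added.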
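[AI-18] DistilCLIP-EEG: Enhancing Epileptic Seizure Detection Through Multi-modal Learning and Knowledge Distillation
[Quick read]: This paper addresses the fact that most existing epilepsy detection methods rely on unimodal electroencephalogram (EEG) signals and neglect the potential value of multimodal information. The core solution is DistilCLIP-EEG, a new multimodal model based on the CLIP framework that integrates EEG signals with text descriptions and learns cross-modal representations in a shared latent space. The key innovations are an EEG encoder based on the Conformer architecture and a Learnable BERT (BERT-LP) prompt-learning mechanism, which strengthen the model's ability to capture seizure features, together with a knowledge distillation strategy in which the trained teacher model guides a lightweight student model. The student's parameter count and model size are about 58.1% of the teacher's, while accuracy stays above 97% and F1-scores above 0.94, offering an efficient and reliable option for epilepsy detection in resource-constrained settings.
Link: https://arxiv.org/abs/2510.13497
Authors: Zexin Wang,Lin Shi,Haoyu Wu,Junru Luo,Xiangzeng Kong,Jun Qi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures, 5 tables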
Abstract:Epilepsy is a prevalent neurological disorder marked by sudden, brief episodes of excessive neuronal activity caused by abnormal electrical discharges, which may lead to some mental disorders. Most existing deep learning methods for epilepsy detection rely solely on unimodal EEG signals, neglecting the potential benefits of multimodal information. To address this, we propose a novel multimodal model, DistilCLIP-EEG, based on the CLIP framework, which integrates both EEG signals and text descriptions to capture comprehensive features of epileptic seizures. The model involves an EEG encoder based on the Conformer architecture as a text encoder, the proposed Learnable BERT (BERT-LP) as prompt learning within the encoders. Both operate in a shared latent space for effective cross-modal representation learning. To enhance efficiency and adaptability, we introduce a knowledge distillation method where the trained DistilCLIP-EEG serves as a teacher to guide a more compact student model to reduce training complexity and time. On the TUSZ, AUBMC, and CHB-MIT datasets, both the teacher and student models achieved accuracy rates exceeding 97%. Across all datasets, the F1-scores were consistently above 0.94, demonstrating the robustness and reliability of the proposed framework. Moreover, the student model’s parameter count and model size are approximately 58.1% of those of the teacher model, significantly reducing model complexity and storage requirements while maintaining high performance. These results highlight the potential of our proposed model for EEG-based epilepsy detection and establish a solid foundation for deploying lightweight models in resource-constrained settings.
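The teacher-student step can be illustrated with the standard soft-label distillation loss: a temperature-softened KL divergence between teacher and student class probabilities. The abstract does not state which distillation objective DistilCLIP-EEG uses, so this is a generic sketch under that assumption, with hypothetical function names and example logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits for three classes
    student = [2.5, 1.5, 0.5]   # hypothetical student logits for the same input
    print("distillation loss:", distillation_loss(student, teacher))
```

In practice this term is usually mixed with the ordinary supervised loss on the ground-truth labels; the mixing weight is another design choice the abstract leaves open.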
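[AI-19] Mobile Coverage Analysis using Crowdsourced Data
[Quick read]: This paper addresses the assessment of mobile network coverage and the precise identification of service weak spots, with the goal of improving user Quality of Experience (QoE). The key to the solution is a new framework based on crowdsourced QoE data. At its core, a One-Class Support Vector Machine (OC-SVM) models coverage at the level of individual cells and treats the decision hyperplane as the effective coverage contour, which gives robust coverage-area calculations for individual cells and whole sites. The same approach is extended to crowdsourced service loss reports in order to locate and quantify geographically localised weak spots.
Link: https://arxiv.org/abs/2510.13459
Authors: Timothy Wong,Tom Freeman,Joseph Feehily
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Networking and Internet Architecture (cs.NI); Applications (stat.AP)
Comments: 8 pages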
Abstract:Effective assessment of mobile network coverage and the precise identification of service weak spots are paramount for network operators striving to enhance user Quality of Experience (QoE). This paper presents a novel framework for mobile coverage and weak spot analysis utilising crowdsourced QoE data. The core of our methodology involves coverage analysis at the individual cell (antenna) level, subsequently aggregated to the site level, using empirical geolocation data. A key contribution of this research is the application of One-Class Support Vector Machine (OC-SVM) algorithm for calculating mobile network coverage. This approach models the decision hyperplane as the effective coverage contour, facilitating robust calculation of coverage areas for individual cells and entire sites. The same methodology is extended to analyse crowdsourced service loss reports, thereby identifying and quantifying geographically localised weak spots. Our findings demonstrate the efficacy of this novel framework in accurately mapping mobile coverage and, crucially, in highlighting granular areas of signal deficiency, particularly within complex urban environments.
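[AI-20] Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers
[Quick read]: This paper addresses certifying the nonnegativity of polynomials, a problem that is known to be NP-hard and has applications in non-convex optimization, control, and robotics. The Sum of Squares (SOS) property is commonly used as a sufficient condition for nonnegativity, but checking it usually requires solving a Semidefinite Program (SDP) whose size grows quadratically with the size of the monomial basis, which makes it computationally expensive. The paper proposes the first learning-augmented algorithm for SOS certification: a Transformer is trained to predict an almost-minimal monomial basis for a given polynomial, which substantially reduces the dimension of the SDP. A systematic fallback mechanism ensures correct termination, and the model is trained on an efficiently generated dataset of over 100 million SOS polynomials. Experiments show speedups of over 100× compared with state-of-the-art solvers, and the method handles instances that competing approaches cannot solve.
Link: https://arxiv.org/abs/2510.13444
Authors: Nico Pelleriti,Christoph Spiegel,Shiwei Liu,David Martínez-Rubio,Max Zimmer,Sebastian Pokutta
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: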
Abstract:Certifying nonnegativity of polynomials is a well-known NP-hard problem with direct applications spanning non-convex optimization, control, robotics, and beyond. A sufficient condition for nonnegativity is the Sum of Squares (SOS) property, i.e., it can be written as a sum of squares of other polynomials. In practice, however, certifying the SOS criterion remains computationally expensive and often involves solving a Semidefinite Program (SDP), whose dimensionality grows quadratically in the size of the monomial basis of the SOS expression; hence, various methods to reduce the size of the monomial basis have been proposed. In this work, we introduce the first learning-augmented algorithm to certify the SOS criterion. To this end, we train a Transformer model that predicts an almost-minimal monomial basis for a given polynomial, thereby drastically reducing the size of the corresponding SDP. Our overall methodology comprises three key components: efficient training dataset generation of over 100 million SOS polynomials, design and training of the corresponding Transformer architecture, and a systematic fallback mechanism to ensure correct termination, which we analyze theoretically. We validate our approach on over 200 benchmark datasets, achieving speedups of over 100× compared to state-of-the-art solvers and enabling the solution of instances where competing approaches fail. Our findings provide novel insights towards transforming the practical scalability of SOS programming.
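A small worked example may help show why the monomial basis matters. A polynomial p is SOS when p(x) = z(x)^T Q z(x) for some positive semidefinite Gram matrix Q, where z(x) is a vector of monomials. The sketch below checks this identity for p(x) = x^4 + 2x^2 + 1 with a full basis [1, x, x^2] and with a reduced basis [1, x^2]. The example polynomial and both Gram matrices were chosen by hand for illustration and are not taken from the paper.

```python
def gram_value(Q, basis_exponents, x):
    """Evaluate z(x)^T Q z(x), where z(x) lists the monomials x**e for e in basis_exponents."""
    z = [x ** e for e in basis_exponents]
    n = len(z)
    return sum(Q[i][j] * z[i] * z[j] for i in range(n) for j in range(n))


def p(x):
    """Example polynomial: x^4 + 2x^2 + 1 = (1 + x^2)^2."""
    return x ** 4 + 2 * x ** 2 + 1


# Full basis [1, x, x^2]: a 3x3 Gram matrix, equal to v v^T with v = (1, 0, 1).
FULL_BASIS = [0, 1, 2]
Q_FULL = [[1, 0, 1],
          [0, 0, 0],
          [1, 0, 1]]

# Reduced basis [1, x^2]: a 2x2 Gram matrix, equal to v v^T with v = (1, 1).
REDUCED_BASIS = [0, 2]
Q_REDUCED = [[1, 1],
             [1, 1]]

if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 0.5, 3.0):
        print(x, p(x), gram_value(Q_FULL, FULL_BASIS, x),
              gram_value(Q_REDUCED, REDUCED_BASIS, x))
```

Both matrices are rank-one outer products and therefore positive semidefinite, so each certifies that p is SOS. The reduced basis needs a 2x2 matrix variable instead of a 3x3 one; since the number of matrix entries grows with the square of the basis size, predicting a small basis up front shrinks the SDP that a solver has to handle.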
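[AI-21] Rectify and Align GPS Points to Parking Spots via Rank-1 Constraint
[Quick read]: This paper addresses the drift of parking-spot GPS points caused by high-rise buildings and by the location error of lower-cost positioning equipment, with the aim of improving the accuracy of applications such as parking management, parking policy, and urban planning. The key to the solution is to exploit a physical constraint, namely that parking spots are parallel to the sides of roads, and to propose an unsupervised low-rank optimization method that rectifies erroneous GPS points and aligns them to parking spots within a unified framework. The method is simple yet effective and applies to any type of GPS point error.
Link: https://arxiv.org/abs/2510.13439
Authors: Jiaxing Deng,Junbiao Pang,Zhicheng Wang,Haitao Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: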
Abstract:Parking spots are essential components, providing vital mobile resources for residents in a city. Accurate Global Positioning System (GPS) points of parking spots are the core data for subsequent applications, e.g., parking management, parking policy, and urban development. However, high-rise buildings tend to cause GPS points to drift from the actual locations of parking spots; besides, the standard lower-cost GPS equipment itself has a certain location error. Therefore, it is a non-trivial task to correct a few wrong GPS points from a large number of parking spots in an unsupervised approach. In this paper, motivated by the physical constraints of parking spots (i.e., parking spots are parallel to the sides of roads), we propose an unsupervised low-rank method to effectively rectify errors in GPS points and further align them to the parking spots in a unified framework. The proposed unconventional rectification and alignment method is simple and yet effective for any type of GPS point errors. Extensive experiments demonstrate the superiority of the proposed method to solve a practical problem. The data set and the code are publicly accessible at: this https URL.
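The intuition behind a rank-1 constraint can be shown with a simplified stand-in: if spots lie along a straight curb, their centered coordinates form an (approximately) rank-1 matrix, so noisy points can be pulled back onto the best-fit line. The sketch below does this with a plain least-squares projection. It is not the authors' low-rank optimization, it is not robust to large outliers, and it assumes the coordinates are already in a local planar frame; `rank1_rectify` is a hypothetical name.

```python
import math


def rank1_rectify(points):
    """Project 2-D points onto their best-fit line (rank-1 approximation of centered data)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Angle of the principal axis of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    dx, dy = math.cos(theta), math.sin(theta)
    rectified = []
    for x, y in points:
        # Keep only the component along the principal direction.
        t = (x - mx) * dx + (y - my) * dy
        rectified.append((mx + t * dx, my + t * dy))
    return rectified


if __name__ == "__main__":
    # Hypothetical curbside spots in metres, with small perpendicular GPS noise.
    noisy = [(0.0, 0.0), (5.0, 0.4), (10.0, -0.3), (15.0, 0.1)]
    for before, after in zip(noisy, rank1_rectify(noisy)):
        print(before, "->", tuple(round(v, 3) for v in after))
```

A method aimed at a few badly drifted points among many good ones would need a robust objective rather than least squares, which is presumably part of what the paper's low-rank formulation provides.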
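[AI-22] From Minimal Existence to Human Definition: The CES-IMU-HSG Theoretical Framework
[Quick read]: The core question of this paper is how to build a unified mathematical-logical framework that integrates different formal systems (such as ZFC or HoTT) with biological cognitive systems and gives artificial intelligence a self-defining basis for existence. The key to the solution is an inter-universal mathematical-logical structure built on the minimal axiom Cogito, ergo sum (CES). It comprises the Intermediate Meta-Universe (IMU), which records axiomatic dependencies between heterogeneous theories, and the Hierarchical State Grid (HSG), which formalizes "definition = state" as a categorical property through three orthogonal axes (state depth, mapping hierarchy, and time). The framework further uses fiberization to integrate physiological subsystems such as the neural, endocrine, and learning systems into an adjoint ensemble. Finally, it introduces the concept of machine-internal CES, under which an AI grounds its own logic in the factuality of its operation, building a continuous bridge between philosophical ontology and engineering implementation.
Link: https://arxiv.org/abs/2510.13400
Authors: Kei Itoh
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 57 pages, 2 figures, 4 tables, in English, in Japanese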
Abstract:This study presents an inter-universal mathematical-logical framework constructed upon the minimal axiom Cogito, ergo sum (CES), integrating the Intermediate Meta-Universe (IMU) and the Hierarchical State Grid (HSG). The CES defines existence as a reflexive correspondence (‘to be’ and ‘to be sayable’) and positions any formal system, including ZFC or HoTT, as an attachable extension atop this minimal structure. The IMU functions as a registry of axiomatic dependencies that connect heterogeneous theories, employing the Institution-theoretic framework to ensure coherent inter-theoretical linkages. The HSG concretizes these ideas through categorical construction, defined by three orthogonal axes: the state-depth axis, the mapping-hierarchy axis, and the temporal axis incorporating the principle of ‘no future reference.’ Through these, the identity of ‘definition = state’ is formally established as a categorical property. Extending this structure to biological systems, the neural system is implemented as a 0-3D complex of neuron-function fields on the HSG, while its categorical extensions via fiberization over the material base enable the parallel integration of multiple physiological universes (neural, endocrine, learning, genetic, and input/output systems) into a coherent adjoint ensemble. Within this framework, human behavior and cognition emerge as temporal compositions of inter-universal algorithms constrained by the material base. Finally, by contrasting human cognition, which relies on external CES, with machine existence, this study introduces the concept of internal CES, wherein a machine grounds its own logic upon the factuality of its operation. This internal self-axiomatization establishes a continuous bridge between philosophical ontology and engineering implementation, providing a new foundation for the autonomous and self-defining existence of artificial intelligence.
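[AI-23] Learnable Game-theoretic Policy Optimization for Data-centric Self-explanation Rationalization
[Quick read]: This paper addresses the mode collapse problem that is common in cooperative rationalization: while the predictor still predicts correctly, the generator keeps outputting rationales with collapsed, uninformative patterns, so the model obtains accurate results but loses interpretability. The root cause is that the generator stops exploring new strategies for uncovering informative rationales, so the system converges to a suboptimal game equilibrium (correct predictions vs. collapsed rationales). The key to the solution is a game-theoretic policy optimization method, PORAT (Game-theoretic Policy Optimization oriented RATionalization), which progressively introduces policy interventions during the cooperative game to shift the equilibrium and guide the model toward a better solution state. The authors also prove the feasibility of the method theoretically.
Link: https://arxiv.org/abs/2510.13393
Authors: Yunxiao Zhao,Zhiqiang Wang,Xingtong Yu,Xiaoli Li,Jiye Liang,Ru Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures, 11 tables. Under review by IEEE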
Abstract:Rationalization, a data-centric framework, aims to build self-explanatory models to explain the prediction outcome by generating a subset of human-intelligible pieces of the input data. It involves a cooperative game model where a generator generates the most human-intelligible parts of the input (i.e., rationales), followed by a predictor that makes predictions based on these generated rationales. Conventional rationalization methods typically impose constraints via regularization terms to calibrate or penalize undesired generation. However, these methods are suffering from a problem called mode collapse, in which the predictor produces correct predictions yet the generator consistently outputs rationales with collapsed patterns. Moreover, existing studies are typically designed separately for specific collapsed patterns, lacking a unified consideration. In this paper, we systematically revisit cooperative rationalization from a novel game-theoretic perspective and identify the fundamental cause of this problem: the generator no longer tends to explore new strategies to uncover informative rationales, ultimately leading the system to converge to a suboptimal game equilibrium (correct predictions v.s collapsed rationales). To solve this problem, we then propose a novel approach, Game-theoretic Policy Optimization oriented RATionalization (PORAT), which progressively introduces policy interventions to address the game equilibrium in the cooperative game process, thereby guiding the model toward a more optimal solution state. We theoretically analyse the cause of such a suboptimal equilibrium and prove the feasibility of the proposed method. Furthermore, we validate our method on nine widely used real-world datasets and two synthetic settings, where PORAT achieves up to 8.1% performance improvements over existing state-of-the-art methods.
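[AI-24] MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation
[Quick read]: This paper addresses a common limitation of current recommender systems based on Large Language Models (LLMs): most rely only on simple text generation or static prompt-based inference and therefore fail to capture the complexity of user preferences and real-world interactions. The core of the solution is the Multi-Aspect Driven LLM Agent (MADRec), which builds structured user and item profiles by unsupervised extraction of multi-aspect information from reviews and applies Re-Ranking to construct high-density inputs. When the ground-truth item is missing from the output, a Self-Feedback mechanism dynamically adjusts the inference criteria. MADRec outperforms traditional and LLM-based baselines in both precision and explainability.
Link: https://arxiv.org/abs/2510.13371
Authors: Jiin Park,Misuk Kim
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 18 pages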
Abstract:Recent attempts to integrate large language models (LLMs) into recommender systems have gained momentum, but most remain limited to simple text generation or static prompt-based inference, failing to capture the complexity of user preferences and real-world interactions. This study proposes the Multi-Aspect Driven LLM Agent MADRec, an autonomous LLM-based recommender that constructs user and item profiles by unsupervised extraction of multi-aspect information from reviews and performs direct recommendation, sequential recommendation, and explanation generation. MADRec generates structured profiles via aspect-category-based summarization and applies Re-Ranking to construct high-density inputs. When the ground-truth item is missing from the output, the Self-Feedback mechanism dynamically adjusts the inference criteria. Experiments across multiple domains show that MADRec outperforms traditional and LLM-based baselines in both precision and explainability, with human evaluation further confirming the persuasiveness of the generated explanations.
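[AI-25] A New Perspective on Transformers in Online Reinforcement Learning for Continuous Control
[Quick read]: This paper addresses the limited use of transformers in online model-free reinforcement learning (RL), which stems from their sensitivity to training setups and model design decisions such as how to structure the policy network and value network, how to share components, and how to handle temporal information. The key to the solution is a systematic study that identifies stable architectural and training strategies, covering how to condition inputs, how to share components between actor and critic, and how to slice sequential data. These strategies give competitive performance on both fully and partially observable tasks and with both vector and image inputs, offering reproducible practical guidance for deploying transformers in online RL.
Link: https://arxiv.org/abs/2510.13367
Authors: Nikita Kachaev,Daniil Zelezetsky,Egor Cherepanov,Alexey K. Kovelev,Aleksandr I. Panov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: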
Abstract:Despite their effectiveness and popularity in offline or model-based reinforcement learning (RL), transformers remain underexplored in online model-free RL due to their sensitivity to training setups and model design decisions such as how to structure the policy and value networks, share components, or handle temporal information. In this paper, we show that transformers can be strong baselines for continuous control in online model-free RL. We investigate key design questions: how to condition inputs, share components between actor and critic, and slice sequential data for training. Our experiments reveal stable architectural and training strategies enabling competitive performance across fully and partially observable tasks, and in both vector- and image-based settings. These findings offer practical guidance for applying transformers in online RL.
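[AI-26] Generalist: A Meta-learning Framework for Mitigating Trade-off in Adversarial Training
[Quick read]: This paper addresses two major limitations of Adversarial Training (AT) in practice: natural accuracy drops significantly compared with standard training, and robustness transfers poorly across attacks crafted under different norm constraints. The key to the solution is a framework named Generalist, which partitions the overall generalization goal into several sub-tasks and assigns each to a dedicated base learner that focuses on its objective and quickly becomes an expert. A global learner is then built by interpolating the base learners' parameters, and the global parameters are periodically redistributed to the base learners so that their optimization trajectories do not drift away from the shared target. This mechanism lowers generalization error and alleviates the trade-off between natural accuracy and robustness.
Link: https://arxiv.org/abs/2510.13361
Authors: Yisen Wang,Yichuan Mo,Hongjun Wang,Junyi Li,Zhouchen Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: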
Abstract:Despite the rapid progress of neural networks, they remain highly vulnerable to adversarial examples, for which adversarial training (AT) is currently the most effective defense. While AT has been extensively studied, its practical applications expose two major limitations: natural accuracy tends to degrade significantly compared with standard training, and robustness does not transfer well across attacks crafted under different norm constraints. Unlike prior works that attempt to address only one issue within a single network, we propose to partition the overall generalization goal into multiple sub-tasks, each assigned to a dedicated base learner. By specializing in its designated objective, each base learner quickly becomes an expert in its field. In the later stages of training, we interpolate their parameters to form a knowledgeable global learner, while periodically redistributing the global parameters back to the base learners to prevent their optimization trajectories from drifting too far from the shared target. We term this framework Generalist and introduce three variants tailored to different application scenarios. Both theoretical analysis and extensive experiments demonstrate that Generalist achieves lower generalization error and significantly alleviates the trade-off problems compared with baseline methods. Our results suggest that Generalist provides a promising step toward developing fully robust classifiers in the future.
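The two parameter operations the abstract mentions, interpolating base learners into a global learner and periodically copying the global parameters back, can be sketched in a few lines. Parameters are flat lists of floats here, and the mixing weights and function names are assumptions for illustration; the paper's actual interpolation schedule and its three variants are not described in the abstract.

```python
def interpolate(param_sets, weights):
    """Weighted average of several learners' parameter vectors."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    size = len(param_sets[0])
    return [
        sum(w * params[i] for w, params in zip(weights, param_sets))
        for i in range(size)
    ]


def redistribute(global_params, num_learners):
    """Give every base learner its own fresh copy of the global parameters."""
    return [list(global_params) for _ in range(num_learners)]


if __name__ == "__main__":
    # Hypothetical base learners: one trained for natural accuracy, one for robustness.
    natural_learner = [1.0, 0.0, 2.0]
    robust_learner = [3.0, 2.0, 0.0]
    global_params = interpolate([natural_learner, robust_learner], [0.5, 0.5])
    print("global:", global_params)
    print("after redistribution:", redistribute(global_params, 2))
```

Each base learner would continue training on its own objective from the redistributed copy, which is what keeps the learners from drifting far apart between interpolation steps.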
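[AI-27] Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control
[Quick read]: This paper addresses the lack of robustness of offline reinforcement learning policies, which are trained on static datasets, to action-space perturbations such as actuator faults. The core solution is an offline-to-online framework that first trains a base policy on clean data and then performs adversarial fine-tuning, injecting perturbations into executed actions to induce compensatory behavior and improve the policy's resilience. The key innovation is a performance-aware curriculum that adjusts the perturbation probability dynamically using an exponential-moving-average signal, balancing robustness against nominal performance. The adaptive curriculum mitigates the nominal-performance degradation seen with a linear curriculum, and experiments on continuous-control locomotion tasks show that the method is more robust than offline-only baselines and converges faster than training from scratch.
Link: https://arxiv.org/abs/2510.13358
Authors: Shingo Ayabe,Hiroshi Kera,Kazuhiko Kawamoto
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures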
Abstract:Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.
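One way to picture a performance-aware curriculum is a controller that tracks an exponential moving average (EMA) of episode returns and raises or lowers the perturbation probability depending on how that average compares with nominal performance. The update rule below is a hypothetical illustration: the threshold, step size, and class name are assumptions, and the abstract does not give the authors' actual rule.

```python
class EmaCurriculum:
    """Adjust the perturbation probability from an EMA of episode returns.

    Assumes nominal_return is positive. All hyperparameters are illustrative.
    """

    def __init__(self, nominal_return, alpha=0.1, threshold=0.8, step=0.05):
        self.nominal_return = nominal_return
        self.alpha = alpha          # EMA smoothing factor
        self.threshold = threshold  # fraction of nominal return to maintain
        self.step = step            # change in probability per update
        self.ema = nominal_return
        self.prob = 0.0

    def update(self, episode_return):
        """Fold in one episode return and return the new perturbation probability."""
        self.ema = (1 - self.alpha) * self.ema + self.alpha * episode_return
        if self.ema >= self.threshold * self.nominal_return:
            # Performance is holding up: make the adversary more active.
            self.prob = min(1.0, self.prob + self.step)
        else:
            # Performance has degraded: back off to stabilise learning.
            self.prob = max(0.0, self.prob - self.step)
        return self.prob


if __name__ == "__main__":
    curriculum = EmaCurriculum(nominal_return=100.0)
    for ret in (100.0, 95.0, 60.0, 40.0, 30.0, 90.0, 100.0):
        print(ret, round(curriculum.update(ret), 2))
```

A linear curriculum, by contrast, would raise the probability on a fixed schedule regardless of the returns, which matches the abstract's observation that the adaptive variant better preserves nominal performance.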
[AI-28] AOAD-MAT: Transformer-based multi-agent deep reinforcement learning model considering agents order of action decisions
[Quick read]: This paper addresses the fact that existing Multi-Agent Reinforcement Learning (MARL) models do not explicitly consider the order in which agents make decisions. Models such as the Multi-Agent Transformer (MAT) and ACtion dEpendent deep Q-learning (ACE) improve performance through sequential decision-making, but they do not explicitly model the importance of the action order. The key to the solution is a new MAT model, Agent Order of Action Decisions-MAT (AOAD-MAT). Its core innovation is to embed the action order as a learnable variable in a Transformer-based actor-critic architecture and to add a subtask that predicts the next agent to act. This subtask is integrated into a loss function based on Proximal Policy Optimization (PPO), so that the model jointly maximizes the advantage of sequential decision-making and learns to adjust the action order dynamically.
Link: https://arxiv.org/abs/2510.13343
Authors: Shota Takayama,Katsuhide Fujita
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This manuscript is an extended version of the work accepted as a short paper at the 26th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2025). The Version of Record of this contribution is published in Springer’s Lecture Notes in Artificial Intelligence series (LNCS/LNAI)
Abstract:Multi-agent reinforcement learning focuses on training the behaviors of multiple learning agents that coexist in a shared environment. Recently, MARL models, such as the Multi-Agent Transformer (MAT) and ACtion dEpendent deep Q-learning (ACE), have significantly improved performance by leveraging sequential decision-making processes. Although these models can enhance performance, they do not explicitly consider the importance of the order in which agents make decisions. In this paper, we propose an Agent Order of Action Decisions-MAT (AOAD-MAT), a novel MAT model that considers the order in which agents make decisions. The proposed model explicitly incorporates the sequence of action decisions into the learning process, allowing the model to learn and predict the optimal order of agent actions. The AOAD-MAT model leverages a Transformer-based actor-critic architecture that dynamically adjusts the sequence of agent actions. To achieve this, we introduce a novel MARL architecture that cooperates with a subtask focused on predicting the next agent to act, integrated into a Proximal Policy Optimization based loss function to synergistically maximize the advantage of the sequential decision-making. The proposed method was validated through extensive experiments on the StarCraft Multi-Agent Challenge and Multi-Agent MuJoCo benchmarks. The experimental results show that the proposed AOAD-MAT model outperforms existing MAT and other baseline models, demonstrating the effectiveness of adjusting the AOAD order in MARL.
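[AI-29] Thompson Sampling via Fine-Tuning of LLMs
[Quick read]: This paper addresses the high computational cost of maximizing acquisition functions in Bayesian optimization over large unstructured discrete spaces, where gradients are unavailable. The key to the solution is a scalable method based on Thompson sampling, ToSFiT (Thompson Sampling via Fine-Tuning), which directly parameterizes the probability that a candidate yields the maximum reward and thereby avoids explicit acquisition function maximization. It leverages the prior knowledge in prompt-conditioned large language models and incrementally fine-tunes them toward the posterior. The theoretical analysis shows that the variational formulation of Thompson sampling has a regret bound matching that of the standard version, and experiments show that online fine-tuning significantly improves sample efficiency with negligible impact on computational efficiency.
Link: https://arxiv.org/abs/2510.13328
Authors: Nicolas Menet,Aleksandar Terzić,Andreas Krause,Abbas Rahimi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: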
Abstract:Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT) leverages the prior knowledge embedded in prompt-conditioned large language models, and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality–a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. We demonstrate that online fine-tuning significantly improves sample efficiency, with negligible impact on computational efficiency.
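For readers unfamiliar with the "standard counterpart" the abstract refers to, here is a minimal Beta-Bernoulli Thompson sampling loop on a small bandit. It is the classical algorithm, not ToSFiT: the arms, reward rates, and function names are made up for the example. The connection is that the arm chosen in each round is a draw from the posterior probability of being the best arm, which is the quantity ToSFiT is described as parameterizing directly with a fine-tuned language model.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample a plausible reward rate per arm from its Beta posterior and pick the best."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_rates, rounds, seed=0):
    """Run Thompson sampling on Bernoulli arms; return per-arm success and failure counts."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = thompson_step(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


if __name__ == "__main__":
    wins, losses = run_bandit([0.2, 0.5, 0.8], rounds=1000)
    print("pulls per arm:", [w + l for w, l in zip(wins, losses)])
```

With a handful of arms, enumerating candidates and sampling each posterior is cheap. In a large unstructured discrete space that enumeration is infeasible, which is the gap the paper's generative parameterization is meant to close.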
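[AI-30] Injection Attack and Erasure: Revocable Backdoor Attacks via Machine Unlearning
[Quick read]: This paper addresses the fact that existing backdoor attacks are hard to conceal completely and lack controllability: once the malicious goal is achieved, the attacker cannot proactively and thoroughly remove the implanted backdoor, so the model may still be detected through static analysis. The key to the solution is the first paradigm of revocable backdoor attacks. Trigger optimization is formulated as a bilevel optimization problem that simulates both backdoor injection and unlearning, so that the trigger achieves a high attack success rate (ASR) while remaining easy to erase. To mitigate the optimization conflict between the injection and removal objectives, the authors use a deterministic sample partition to reduce sampling variance and apply the Projected Conflicting Gradient (PCGrad) technique to resolve the remaining gradient conflicts. Experiments on CIFAR-10 and ImageNet show ASR comparable to state-of-the-art backdoor attacks together with effective removal of the backdoor behavior.
Link: https://arxiv.org/abs/2510.13322
Authors: Baogang Song,Dongdong Zhao,Jianwen Xiang,Qiben Xu,Zizhuo Yu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: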
Abstract:Backdoor attacks pose a persistent security risk to deep neural networks (DNNs) due to their stealth and durability. While recent research has explored leveraging model unlearning mechanisms to enhance backdoor concealment, existing attack strategies still leave persistent traces that may be detected through static analysis. In this work, we introduce the first paradigm of revocable backdoor attacks, where the backdoor can be proactively and thoroughly removed after the attack objective is achieved. We formulate the trigger optimization in revocable backdoor attacks as a bilevel optimization problem: by simulating both backdoor injection and unlearning processes, the trigger generator is optimized to achieve a high attack success rate (ASR) while ensuring that the backdoor can be easily erased through unlearning. To mitigate the optimization conflict between injection and removal objectives, we employ a deterministic partition of poisoning and unlearning samples to reduce sampling-induced variance, and further apply the Projected Conflicting Gradient (PCGrad) technique to resolve the remaining gradient conflicts. Experiments on CIFAR-10 and ImageNet demonstrate that our method maintains ASR comparable to state-of-the-art backdoor attacks, while enabling effective removal of backdoor behavior after unlearning. This work opens a new direction for backdoor attack research and presents new challenges for the security of machine learning systems.
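The PCGrad step named in the abstract is a general multi-objective optimization technique and can be shown on its own. When two objective gradients point in conflicting directions (negative inner product), each is projected onto the normal plane of the other before they are summed. The sketch below covers only this two-gradient case with plain Python lists; the function names are hypothetical, and how the authors wire it into their bilevel training loop is not specified in the abstract.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g, other):
    """Remove from g its component along `other` when the two gradients conflict."""
    inner = dot(g, other)
    norm_sq = dot(other, other)
    if inner >= 0 or norm_sq == 0:
        return list(g)  # no conflict: leave the gradient unchanged
    return [x - inner / norm_sq * y for x, y in zip(g, other)]


def pcgrad_pair(g_a, g_b):
    """Combine two objective gradients after projecting each against the other."""
    a = project_conflict(g_a, g_b)
    b = project_conflict(g_b, g_a)
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    g_first = [1.0, 0.0]    # hypothetical gradient of the first objective
    g_second = [-1.0, 1.0]  # hypothetical gradient of the second objective
    print("plain sum:", [x + y for x, y in zip(g_first, g_second)])
    print("PCGrad   :", pcgrad_pair(g_first, g_second))
```

After projection, neither adjusted gradient has a component that directly opposes the other objective, so the combined update no longer trades one objective off against the other along the conflicting direction.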
[AI-31] o Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models ICML2025
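[Quick Read]: This paper addresses under-steering and over-steering in language model (LM) error correction, which arise when the steering strength is fixed and manually tuned, and which can weaken the correction or even introduce new errors. The key to the solution is MERA (Mechanistic Error Reduction with Abstention), a principled framework built on two mechanisms: (i) optimizing the intervention direction, and (ii) calibrating when and how much to steer, abstaining when no confident correction is possible. This gives safe, non-degrading error correction, and MERA can be applied on top of existing steering techniques as a general-purpose module for mechanistic activation steering.
Link: https://arxiv.org/abs/2510.13290
Authors: Anna Hedström,Salim I. Amoukou,Tom Bewley,Saumitra Mishra,Manuela Veloso
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025, 22 pages, 16 figures, 5 tables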
Abstract:We introduce Mechanistic Error Reduction with Abstention (MERA), a principled framework for steering language models (LMs) to mitigate errors through selective, adaptive interventions. Unlike existing methods that rely on fixed, manually tuned steering strengths, often resulting in under or oversteering, MERA addresses these limitations by (i) optimising the intervention direction, and (ii) calibrating when, and how much to steer, thereby provably improving performance or abstaining when no confident correction is possible. Experiments across diverse datasets, and LM families demonstrate safe, effective, non-degrading error correction, and that MERA outperforms existing baselines. Moreover, MERA can be applied on top of existing steering techniques to further enhance their performance, establishing it as a general-purpose, and efficient approach to mechanistic activation steering.
[AI-32] SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning
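[Quick Read]: This paper studies the robustness of Multi-Agent Deep Reinforcement Learning (MADRL) models under joint perturbations of states and actions. Existing attacks perturb only states or only actions and fail to exploit the synergy between the two, so they are limited in effect and easier to defend against. The key to the solution is the State-Action Joint Attack (SAJA) framework, which has two phases: first, a multi-step gradient ascent that uses both the policy network (actor) and the value network (critic) computes an adversarial state; then, based on that perturbed state, a second gradient ascent uses the critic to craft the final adversarial action. A heuristic regularizer is added to the loss to strengthen the critic's guidance on action perturbations, giving a stronger and stealthier attack.
Link: https://arxiv.org/abs/2510.13262
Authors: Weiqi Guo,Guanjun Liu,Ziyuan Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: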
Abstract: Multi-Agent Deep Reinforcement Learning (MADRL) has shown potential for cooperative and competitive tasks such as autonomous driving and strategic gaming. However, models trained by MADRL are vulnerable to adversarial perturbations on states and actions. Therefore, it is essential to investigate the robustness of MADRL models from an attack perspective. Existing studies focus on either state-only attacks or action-only attacks, but do not consider how to combine them effectively. Simply combining state and action perturbations, such as randomly perturbing states and actions, does not exploit their potential synergistic effects. In this paper, we propose the State-Action Joint Attack (SAJA) framework, which exhibits good synergistic effects. SAJA consists of two important phases: (1) in the state attack phase, a multi-step gradient ascent method utilizes both the actor network and the critic network to compute an adversarial state, and (2) in the action attack phase, based on the perturbed state, a second gradient ascent uses the critic network to craft the final adversarial action. Additionally, a heuristic regularizer measuring the distance between the perturbed actions and the original clean ones is added to the loss function to enhance the effectiveness of the critic's guidance. We evaluate SAJA in the Multi-Agent Particle Environment (MPE), demonstrating that (1) it outperforms and is more stealthy than state-only or action-only attacks, and (2) existing state or action defense methods cannot defend against its attacks.
[AI-33] A Ratio-Based Shapley Value for Collaborative Machine Learning - Extended Version
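[Quick Read]: This paper addresses incentive compatibility and fair, contribution-based reward allocation in collaborative machine learning, in particular how each party can receive a non-monetary reward that matches its data contribution when multiple data owners jointly train a model. The key to the solution is a ratio-based Shapley value that replaces the standard additive contribution measure with a relative one, redefining the value function while keeping the original reward framework (incentive definitions and model-reward setting) unchanged. The alternative satisfies the same incentive conditions as the additive formulation (adapted versions of fairness, individual rationality, and stability), offers a new lens for analyzing incentive properties, and may be better suited to settings where proportionality among contributors matters more than additive differences.
Link: https://arxiv.org/abs/2510.13261
Authors: Björn Filter,Ralf Möller,Özgür Lütfü Özçep
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: Extended version of a paper accepted at the 26th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2025)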
Abstract:Collaborative machine learning enables multiple data owners to jointly train models for improved predictive performance. However, ensuring incentive compatibility and fair contribution-based rewards remains a critical challenge. Prior work by Sim and colleagues (Rachel Hwee Ling Sim et al: Collaborative machine learning with incentive-aware model rewards. In: International conference on machine learning. PMLR. 2020, pp. 8927-8963) addressed this by allocating model rewards, which are non-monetary and freely replicable, based on the Shapley value of each party’s data contribution, measured via information gain. In this paper, we introduce a ratio-based Shapley value that replaces the standard additive formulation with a relative contribution measure. While our overall reward framework, including the incentive definitions and model-reward setting, remains aligned with that of Sim and colleagues, the underlying value function is fundamentally different. Our alternative valuation induces a different distribution of model rewards and offers a new lens through which to analyze incentive properties. We formally define the ratio-based value and prove that it satisfies the same set of incentive conditions as the additive formulation, including adapted versions of fairness, individual rationality, and stability. Like the original approach, our method faces the same fundamental trade-offs between these incentives. Our contribution is a mathematically grounded alternative to the additive Shapley framework, potentially better suited to contexts where proportionality among contributors is more meaningful than additive differences.
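For reference, the standard additive Shapley value that the paper takes as its starting point can be computed exactly for small games by averaging marginal contributions over all orderings. The sketch below shows only this additive baseline; the abstract does not give the ratio-based definition, so it is deliberately not implemented, and the toy value table is hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact additive Shapley values: mean marginal gain over all join orders."""
    totals = {p: 0.0 for p in players}
    count = 0
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Additive marginal contribution of p to the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
        count += 1
    return {p: total / count for p, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical value of each coalition's pooled data.
    table = {
        frozenset(): 0.0,
        frozenset({"a"}): 1.0,
        frozenset({"b"}): 2.0,
        frozenset({"a", "b"}): 4.0,
    }
    print(shapley_values(["a", "b"], lambda s: table[s]))
```

Enumerating all orderings costs factorial time, so this form is only practical for a handful of parties.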
[AI-34] MotionBeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding
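[Quick Read]: This paper addresses the fact that existing audio representation learning methods neglect the embodied nature of music: the close link between music and human motion is not modeled, which limits their ability to capture rhythmic and structural cues. The key to the solution is the MotionBeat framework, whose main innovations are: 1) the Embodied Contrastive Loss (ECL), an enhanced InfoNCE objective with tempo-aware and beat-jitter negatives for fine-grained rhythmic discrimination; 2) the Structural Rhythm Alignment Loss (SRAL), which keeps music accents consistent with the corresponding motion events; and 3) at the architecture level, bar-equivariant phase rotations to capture cyclic rhythmic patterns, combined with contact-guided attention to emphasize motion events synchronized with musical accents. The approach improves music-to-dance generation and transfers well to downstream tasks such as beat tracking, music tagging, and genre classification.
Link: https://arxiv.org/abs/2510.13244
Authors: Xuanchen Wang,Heng Wang,Weidong Cai
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 5 pages, 1 figure. demo page: this https URL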
Abstract:Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture rhythmic and structural cues that drive movement. We propose MotionBeat, a framework for motion-aligned music representation learning. MotionBeat is trained with two newly proposed objectives: the Embodied Contrastive Loss (ECL), an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives to achieve fine-grained rhythmic discrimination, and the Structural Rhythm Alignment Loss (SRAL), which ensures rhythm consistency by aligning music accents with corresponding motion events. Architecturally, MotionBeat introduces bar-equivariant phase rotations to capture cyclic rhythmic patterns and contact-guided attention to emphasize motion events synchronized with musical accents. Experiments show that MotionBeat outperforms state-of-the-art audio encoders in music-to-dance generation and transfers effectively to beat tracking, music tagging, genre and instrument classification, emotion recognition, and audio-visual retrieval. Our project demo page: this https URL.
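ECL is described as an enhanced InfoNCE formulation. The plain InfoNCE loss it builds on can be sketched as follows for one anchor, one positive, and a list of negatives. This is an illustration only: how MotionBeat constructs its tempo-aware and beat-jitter negatives is not given in the abstract, and the cosine similarity and temperature value here are assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return inner / (norm_u * norm_v)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    # Hypothetical music embedding (anchor), matching motion embedding, and one negative.
    print(info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]]))
```

The loss is small when the anchor is closer to its positive than to every negative, and grows as any negative becomes more similar to the anchor.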
[AI-35] An Analytical Framework to Enhance Autonomous Vehicle Perception for Smart Cities
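[Quick Read]: This paper addresses how autonomous vehicles (AVs) can accurately perceive multiple object classes (such as motorcyclists and rickshaws) in complex driving environments and predict the driver's perception in order to control the vehicle precisely. The core challenge is making the perception system robust and practical, with accurate object detection and an assessment of service utility across diverse traffic scenes. The key to the solution is a utility-based analytical model with three modules: acquiring a custom dataset containing distinctive road users, a YOLOv8s deep learning model for object detection, and a module that quantifies the utility of the perception service from the performance of trained model instances. In the experiments the SGD-trained YOLOv8s instance has the highest mAP@0.5 (0.832, versus 0.822 for AdamW and 0.810 for Adam), yet the AdamW-trained instance is judged better because of its stronger class-level performance, which the proposed utility model confirms. This gives autonomous driving systems a quantifiable basis for choosing a perception model.
Link: https://arxiv.org/abs/2510.13230
Authors: Jalal Khan,Manzoor Khan,Sherzod Turaev,Sumbal Malik,Hesham El-Sayed,Farman Ullah
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages, 14 figures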
Abstract: Perception of the driving environment plays a vital role in autonomous driving and is being actively explored. The research community and relevant stakeholders call for Deep Learning (DL) models and AI-enabled solutions to enhance autonomous vehicles (AVs) for smart mobility. There is a need for a model that accurately perceives multiple objects on the road and predicts the driver's perception to control the car's movements. This article proposes a novel utility-based analytical model that enables perception systems of AVs to understand the driving environment. The model consists of three modules: acquiring a custom dataset with distinctive objects, e.g., motorcyclists and rickshaws; a DL-based model (YOLOv8s) for object detection; and a module to measure the utility of the perception service from the performance values of trained model instances. The perception model is validated on the object detection task, and its process is benchmarked against the performance metrics of state-of-the-art deep learning models from the nuScenes dataset. The experimental results show three best-performing YOLOv8s instances based on mAP@0.5 values, i.e., SGD-based (0.832), Adam-based (0.810), and AdamW-based (0.822). However, the AdamW-based model (i.e., car: 0.921, motorcyclist: 0.899, truck: 0.793, etc.) still outperforms the SGD-based model (i.e., car: 0.915, motorcyclist: 0.892, truck: 0.781, etc.) because it has better class-level performance values, as confirmed by the proposed perception model. We validate that the proposed function is capable of finding the right perception for AVs. These results encourage using the proposed perception model to evaluate the utility of learning models and determine the appropriate perception for AVs.
[AI-36] Adaptive Reasoning Executor: A Collaborative Agent System for Efficient Reasoning
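[Quick Read]: This paper addresses the high computational cost of applying deep reasoning to every problem, even though chain-of-thought prompting and deep reasoning substantially improve Large Language Model (LLM) performance on complex tasks. The key to the solution is a complementary agent system that combines a small and a large LLM: the small LLM first generates an initial answer, which the large LLM then verifies; if the answer is correct it is adopted directly, otherwise the large LLM performs in-depth reasoning. On simple problems this reduces the large LLM's computational cost by more than 50% with negligible accuracy loss, while maintaining robust performance on complex tasks.
Link: https://arxiv.org/abs/2510.13214
Authors: Zehui Ling,Deshu Chen,Yichi Zhang,Yuchen Liu,Xigui Li,Xin Guo,Yuan Cheng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: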
Abstract:Recent advances in Large Language Models (LLMs) demonstrate that chain-of-thought prompting and deep reasoning substantially enhance performance on complex tasks, and multi-agent systems can further improve accuracy by enabling model debates. However, applying deep reasoning to all problems is computationally expensive. To mitigate these costs, we propose a complementary agent system integrating small and large LLMs. The small LLM first generates an initial answer, which is then verified by the large LLM. If correct, the answer is adopted directly; otherwise, the large LLM performs in-depth reasoning. Experimental results show that, for simple problems, our approach reduces the computational cost of the large LLM by more than 50% with negligible accuracy loss, while consistently maintaining robust performance on complex tasks.
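The answer-verify-escalate flow described in the abstract can be sketched as a small routing function. The three callables are hypothetical stand-ins for model calls; the abstract does not describe the actual prompts or the verification criterion, so those are left to the caller.

```python
from typing import Callable, Tuple


def adaptive_answer(
    question: str,
    small_answer: Callable[[str], str],        # hypothetical: small LLM drafts an answer
    large_verify: Callable[[str, str], bool],  # hypothetical: large LLM accepts or rejects it
    large_reason: Callable[[str], str],        # hypothetical: large LLM reasons in depth
) -> Tuple[str, str]:
    """Return (answer, route), where route records which model produced the answer."""
    draft = small_answer(question)
    if large_verify(question, draft):
        return draft, "small"  # cheap path: the verified draft is adopted directly
    return large_reason(question), "large"  # expensive path: deep reasoning


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model.
    small = lambda q: "4" if q == "2+2" else "?"
    verify = lambda q, a: a != "?"
    deep = lambda q: "deep:" + q
    print(adaptive_answer("2+2", small, verify, deep))
    print(adaptive_answer("hard question", small, verify, deep))
```

The saving depends on verification being cheaper than full reasoning and on how often the small model's draft passes.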
[AI-37] CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection
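[Quick Read]: This paper addresses the challenges of healthcare fraud detection posed by scarce labeled data, constantly evolving fraud tactics, and high-dimensional medical records, in particular the performance limits of supervised methods under extreme label scarcity and the failure of purely unsupervised methods to capture clinically meaningful anomalies. The key to the solution is CleverCatch, a knowledge-guided weak supervision model that embeds structured domain knowledge (expert rules) into a neural architecture. By aligning rules and data samples in a shared embedding space and training the encoders jointly on synthetic data representing both compliance and violation, the model learns soft rule embeddings that generalize. This hybrid design combines domain-informed constraints with data-driven learning and improves both detection accuracy and interpretability.
Link: https://arxiv.org/abs/2510.13205
Authors: Amirhossein Mozafari,Kourosh Hashemi,Erfan Shafagh,Soroush Motamedi,Azar Taheri Tayebi,Mohammad A. Tayebi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: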
Abstract:Healthcare fraud detection remains a critical challenge due to limited availability of labeled data, constantly evolving fraud tactics, and the high dimensionality of medical records. Traditional supervised methods are challenged by extreme label scarcity, while purely unsupervised approaches often fail to capture clinically meaningful anomalies. In this work, we introduce CleverCatch, a knowledge-guided weak supervision model designed to detect fraudulent prescription behaviors with improved accuracy and interpretability. Our approach integrates structured domain expertise into a neural architecture that aligns rules and data samples within a shared embedding space. By training encoders jointly on synthetic data representing both compliance and violation, CleverCatch learns soft rule embeddings that generalize to complex, real-world datasets. This hybrid design enables data-driven learning to be enhanced by domain-informed constraints, bridging the gap between expert heuristics and machine learning. Experiments on the large-scale real-world dataset demonstrate that CleverCatch outperforms four state-of-the-art anomaly detection baselines, yielding average improvements of 1.3% in AUC and 3.4% in recall. Our ablation study further highlights the complementary role of expert rules, confirming the adaptability of the framework. The results suggest that embedding expert rules into the learning process not only improves detection accuracy but also increases transparency, offering an interpretable approach for high-stakes domains such as healthcare fraud detection.
[AI-38] Emotional Cognitive Modeling Framework with Desire-Driven Objective Optimization for LLM-empowered Agent in Social Simulation
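[Quick Read]: This paper addresses the lack of affective cognition in current Large Language Model (LLM)-based agents for social simulation, specifically their inability to simulate human bounded rationality and the absence of empirically validated mechanisms for embedding emotions in decision making. The key to the solution is an emotional cognition framework that integrates desire generation and objective management to achieve emotion alignment between LLM-based agents and humans, and that models the agent's complete decision-making process: state evolution, desire generation, objective optimization, decision generation, and action execution. Experiments show that agents governed by the framework behave consistently with their emotional states and, compared with other agent types, show superior ecological validity and approximate human behavioral patterns significantly more closely.
Link: https://arxiv.org/abs/2510.13195
Authors: Qun Ma,Xiao Xue,Xuwen Zhang,Zihan Zhao,Yuwei Guo,Ming Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: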
Abstract: The advent of large language models (LLMs) has enabled agents to represent virtual humans in societal simulations, facilitating diverse interactions within complex social systems. However, existing LLM-based agents exhibit severe limitations in affective cognition: they fail to simulate the bounded rationality essential for bridging virtual and real-world services, and they lack empirically validated integration mechanisms embedding emotions within agent decision architectures. This paper constructs an emotional cognition framework incorporating desire generation and objective management, designed to achieve emotion alignment between LLM-based agents and humans. The framework models the complete decision-making process of LLM-based agents, encompassing state evolution, desire generation, objective optimization, decision generation, and action execution. This study implements the proposed framework within our proprietary multi-agent interaction environment. Experimental results demonstrate that agents governed by our framework not only exhibit behaviors congruent with their emotional states but also, in comparative assessments against other agent types, demonstrate superior ecological validity and generate decision outcomes that significantly more closely approximate human behavioral patterns.
[AI-39] Behavioral Embeddings of Programs: A Quasi-Dynamic Approach for Optimization Prediction
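[Quick Read]: This paper addresses a dilemma in program representations (embeddings) for compiler optimization: static representations are efficient and deterministic but capture little of how a program behaves under complex code transformations, while dynamic representations expose performance bottlenecks but are impractical at scale because of their overhead and non-determinism. The key to the solution is a quasi-dynamic framework whose core idea is to model a program's optimization sensitivity. The Program Behavior Spectrum probes the intermediate representation (IR) with a diverse set of optimization sequences and quantifies the resulting changes in static features, producing a high-dimensional continuous spectrum. To encode it, the authors propose a compositional learning approach: Product Quantization first discretizes the continuous reaction vectors into structured sub-words, and a multi-task Transformer (PQ-BERT) is then pre-trained on these behavioral codes to learn their deep contextual grammar. On two representative compiler optimization tasks, Best Pass Prediction and -Oz Benefit Prediction, the method outperforms state-of-the-art static baselines.
Link: https://arxiv.org/abs/2510.13158
Authors: Haolin Pan,Jinyuan Dong,Hongbin Zhang,Hongyu Lin,Mingjie Xing,Yanjun Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: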
Abstract:Learning effective numerical representations, or embeddings, of programs is a fundamental prerequisite for applying machine learning to automate and enhance compiler optimization. Prevailing paradigms, however, present a dilemma. Static representations, derived from source code or intermediate representation (IR), are efficient and deterministic but offer limited insight into how a program will behave or evolve under complex code transformations. Conversely, dynamic representations, which rely on runtime profiling, provide profound insights into performance bottlenecks but are often impractical for large-scale tasks due to prohibitive overhead and inherent non-determinism. This paper transcends this trade-off by proposing a novel quasi-dynamic framework for program representation. The core insight is to model a program’s optimization sensitivity. We introduce the Program Behavior Spectrum, a new representation generated by probing a program’s IR with a diverse set of optimization sequences and quantifying the resulting changes in its static features. To effectively encode this high-dimensional, continuous spectrum, we pioneer a compositional learning approach. Product Quantization is employed to discretize the continuous reaction vectors into structured, compositional sub-words. Subsequently, a multi-task Transformer model, termed PQ-BERT, is pre-trained to learn the deep contextual grammar of these behavioral codes. Comprehensive experiments on two representative compiler optimization tasks – Best Pass Prediction and -Oz Benefit Prediction – demonstrate that our method outperforms state-of-the-art static baselines. Our code is publicly available at this https URL.
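The discretization step can be illustrated with a minimal Product Quantization encoder: the vector is split into equal-length sub-vectors and each one is replaced by the index of its nearest codeword. This is a generic sketch, not the paper's pipeline. The codebooks here are written by hand, whereas in practice they would be learned (typically with k-means), and the sub-vector count and sizes are assumptions.

```python
from typing import List, Sequence

Codebook = Sequence[Sequence[float]]


def nearest_index(vector: Sequence[float], codebook: Codebook) -> int:
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(codeword: Sequence[float]) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def pq_encode(vector: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Split `vector` into len(codebooks) sub-vectors and quantize each one."""
    m = len(codebooks)
    if len(vector) % m != 0:
        raise ValueError("vector length must be divisible by the number of codebooks")
    d = len(vector) // m
    return [nearest_index(vector[i * d:(i + 1) * d], book)
            for i, book in enumerate(codebooks)]


def pq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Concatenate the selected codewords to approximate the original vector."""
    out: List[float] = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out


if __name__ == "__main__":
    # Hypothetical 4-dimensional reaction vector and two hand-written codebooks.
    books = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
    codes = pq_encode([0.1, 0.0, 0.9, 1.0], books)
    print(codes, pq_decode(codes, books))
```

Each vector becomes a short tuple of integer codes, which is what lets a Transformer treat the codes as discrete tokens.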
[AI-40] Agentic Discovery: Closing the Loop with Cooperative Agents
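[Quick Read]: This paper addresses the observation that, as data-driven methods, artificial intelligence (AI), and automated workflows accelerate scientific tasks, the rate of discovery is increasingly limited by human decision-making tasks such as setting objectives, generating hypotheses, and designing experiments. The key to the solution is to introduce cooperative agents that augment the role of humans and enable autonomous discovery; realizing them requires progress in both AI and infrastructure.
Link: https://arxiv.org/abs/2510.13081
Authors: J. Gregory Pauloski,Kyle Chard,Ian T. Foster
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Published in IEEE Computer Volume 58 Issue 10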
Abstract:As data-driven methods, artificial intelligence (AI), and automated workflows accelerate scientific tasks, we see the rate of discovery increasingly limited by human decision-making tasks such as setting objectives, generating hypotheses, and designing experiments. We postulate that cooperative agents are needed to augment the role of humans and enable autonomous discovery. Realizing such agents will require progress in both AI and infrastructure.
[AI-41] Transformer-based Scalable Beamforming Optimization via Deep Residual Learning
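[Quick Read]: This paper addresses beamforming optimization for the large-scale multi-user multiple-input single-output (MU-MISO) downlink, in particular efficient real-time beamforming in dynamic communication environments. Conventional methods rely on iterative algorithms or online learning, which are computationally expensive and struggle to meet real-time requirements. The key to the solution is an unsupervised deep learning framework following the learning-to-optimize (L2O) paradigm, in which a multi-layer Transformer iteratively refines channel and beamformer features via residual connections. Three training strategies, curriculum learning, semi-amortized learning, and sliding-window training, improve training stability and convergence. The scheme outperforms existing baselines at low-to-medium signal-to-noise ratios (SNRs), closely approaches weighted minimum mean square error (WMMSE) performance at high SNRs, and runs inference substantially faster than iterative and online learning approaches.
Link: https://arxiv.org/abs/2510.13077
Authors: Yubo Zhang,Xiao-Yang Liu,Xiaodong Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 7 pages, 5 figures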
Abstract:We develop an unsupervised deep learning framework for downlink beamforming in large-scale MU-MISO channels. The model is trained offline, allowing real-time inference through lightweight feedforward computations in dynamic communication environments. Following the learning-to-optimize (L2O) paradigm, a multi-layer Transformer iteratively refines both channel and beamformer features via residual connections. To enhance training, three strategies are introduced: (i) curriculum learning (CL) to improve early-stage convergence and avoid local optima, (ii) semi-amortized learning to refine each Transformer block with a few gradient ascent steps, and (iii) sliding-window training to stabilize optimization by training only a subset of Transformer blocks at a time. Extensive simulations show that the proposed scheme outperforms existing baselines at low-to-medium SNRs and closely approaches WMMSE performance at high SNRs, while achieving substantially faster inference than iterative and online learning approaches.
[AI-42] NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models
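[Quick Read]: This paper addresses the low reconstruction fidelity of current generative electroencephalography (EEG) foundation models, which stems from weak signal tokenization modules; existing neural tokenizers in particular fail to preserve high-frequency neural dynamics. The key to the solution is NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer that combines three elements: (i) multi-scale feature extraction modules that capture the full-frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and (iii) a loss function aware of EEG signal phase and amplitude for efficient training. The design enables efficient compression with high-fidelity reconstruction across all frequency bands, which makes generative masked modeling more robust and improves downstream task performance.
Link: https://arxiv.org/abs/2510.13068
Authors: Konstantinos Barmpas,Na Lee,Alexandros Koliousis,Yannis Panagakis,Dimitrios A. Adamos,Nikolaos Laskaris,Stefanos Zafeiriou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: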
Abstract:Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.
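Residual vector quantization, the building block named in the abstract, can be sketched in a few lines: each stage quantizes whatever the previous stages failed to represent. This is a generic illustration with hand-written codebooks, not NeuroRVQ's tokenizer. The multi-scale feature extraction and the phase- and amplitude-aware loss are not shown, and the example vector is hypothetical.

```python
from typing import List, Sequence, Tuple

Codebook = Sequence[Sequence[float]]


def nearest_index(vector: Sequence[float], codebook: Codebook) -> int:
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(codeword: Sequence[float]) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def rvq_encode(vector: Sequence[float], codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Quantize stage by stage; each stage encodes the residual of the previous one."""
    residual = list(vector)
    codes: List[int] = []
    for book in codebooks:
        k = nearest_index(residual, book)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, book[k])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Sum the selected codewords from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for code, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[code])]
    return out


if __name__ == "__main__":
    # Hypothetical coarse and fine codebooks for a 2-dimensional feature.
    books = [
        [[1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]],
        [[0.25, 0.5], [-0.25, -0.5], [0.0, 0.0]],
    ]
    codes, residual = rvq_encode([1.3, -0.6], books)
    print(codes, rvq_decode(codes, books), residual)
```

Adding stages shrinks the residual, which is why stacked codebooks can give higher-resolution encodings than a single codebook of the same size.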
[AI-43] VLA-0: Building State-of-the-Art VLAs with Zero Modification
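[Quick Read]: This paper addresses the design complexity of current Vision-Language-Action models (VLAs) for robot manipulation and the open question of how best to build them. Existing methods typically modify the vocabulary of a Vision-Language Model (VLM) with action tokens or add special action heads, which increases system complexity. The paper proposes a minimal alternative, VLA-0, whose core idea is to represent actions directly as text instead of changing the architecture. With the right design, VLA-0 outperforms all existing methods trained on the same robotic data on the LIBERO benchmark (including π₀.₅-KI, OpenVLA-OFT, and SmolVLA) and, without large-scale robotics-specific training, also outperforms several methods trained on large-scale robotic data (such as π₀, GR00T-N1, and MolmoAct). The findings show the potential of a simple text-based action representation and suggest a new direction for designing generalist robot manipulation models.
Link: https://arxiv.org/abs/2510.13054
Authors: Ankit Goyal,Hugo Hadfield,Xuning Yang,Valts Blukis,Fabio Ramos
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: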
Abstract: Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads. Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored. This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models. On LIBERO, a popular benchmark for evaluating VLAs, VLA-0 outperforms all existing methods trained on the same robotic data, including π₀.₅-KI, OpenVLA-OFT and SmolVLA. Furthermore, without large-scale robotics-specific training, it outperforms methods trained on large-scale robotic data, like π₀.₅-KI, π₀, GR00T-N1 and MolmoAct. These findings also translate to the real world, where VLA-0 outperforms SmolVLA, a VLA model pre-trained on large-scale real data. This paper summarizes our unexpected findings and spells out the specific techniques required to unlock the high performance of this simple yet potent VLA design. Visual results, code, and trained models are provided here: this https URL.
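One way to picture "actions as text" is to discretize each action dimension into an integer and write the integers as a space-separated string that a VLM can emit with its ordinary vocabulary. The sketch below is an assumption-laden illustration, not VLA-0's actual scheme: the abstract does not specify the encoding, and the value range, bin count, and function names are all hypothetical.

```python
from typing import List, Sequence


def action_to_text(
    action: Sequence[float],
    low: float = -1.0,   # assumed lower bound of each action dimension
    high: float = 1.0,   # assumed upper bound of each action dimension
    bins: int = 1000,    # assumed resolution
) -> str:
    """Clip each value to [low, high], map it to an integer in [0, bins], join as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        tokens.append(str(round((clipped - low) / (high - low) * bins)))
    return " ".join(tokens)


def text_to_action(text: str, low: float = -1.0, high: float = 1.0, bins: int = 1000) -> List[float]:
    """Inverse mapping: parse the integers and rescale them to [low, high]."""
    return [low + int(token) / bins * (high - low) for token in text.split()]


if __name__ == "__main__":
    encoded = action_to_text([0.0, -1.0, 1.0, 0.5])
    print(encoded)
    print(text_to_action(encoded))
```

With this kind of encoding no new tokens or output heads are needed; the cost is a quantization error of at most half a bin per dimension.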
[AI-44] Time-Varying Optimization for Streaming Data Via Temporal Weighting
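[Quick Read]: This paper addresses how to optimize model parameters effectively in dynamic environments when learning from streaming data. Classical optimization theory assumes a fixed objective, whereas in practice the data change over time and the optimum drifts with them, which calls for a time-varying optimization framework. The key to the solution is a structured, weight-based formulation that explicitly captures the streaming-data origin of the objective: at each time step, the agent minimizes a weighted average loss over all past samples. Two weighting strategies are analyzed, uniform weights and discounted weights, and tight bounds on the tracking error (TE) under gradient descent (GD) updates are derived for each. With uniform weights the TE vanishes asymptotically at an O(1/t) rate, whereas discounted weights incur a nonzero error floor determined by the discount factor and the number of gradient updates per time step, which highlights the trade-off between responsiveness to recent data and stability.
Link: https://arxiv.org/abs/2510.13052
Authors: Muhammad Faraz Ul Abrar,Nicolò Michelusi,Erik G. Larsson
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Accepted at IEEE Asilomar, 2025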
Abstract: Classical optimization theory deals with fixed, time-invariant objective functions. However, time-varying optimization has emerged as an important subject for decision-making in dynamic environments. In this work, we study the problem of learning from streaming data through a time-varying optimization lens. Unlike prior works that focus on generic formulations, we introduce a structured, weight-based formulation that explicitly captures the streaming-data origin of the time-varying objective, where at each time step, an agent aims to minimize a weighted average loss over all the past data samples. We focus on two specific weighting strategies: (1) uniform weights, which treat all samples equally, and (2) discounted weights, which geometrically decay the influence of older data. For both schemes, we derive tight bounds on the "tracking error" (TE), defined as the deviation between the model parameter and the time-varying optimum at a given time step, under gradient descent (GD) updates. We show that under uniform weighting, the TE vanishes asymptotically with an O(1/t) decay rate, whereas discounted weighting incurs a nonzero error floor controlled by the discount factor and the number of gradient updates performed at each time step. Our theoretical findings are validated through numerical simulations.
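The setup can be made concrete with a toy scalar simulation, assuming a squared loss so that the weighted-average optimum is just a weighted mean of the samples seen so far. Everything here is illustrative: the loss, step size, and function names are assumptions, and the sketch measures the tracking error empirically rather than reproducing the paper's bounds.

```python
from typing import List, Optional, Sequence


def weighted_optimum(samples: Sequence[float], gamma: Optional[float] = None) -> float:
    """Minimizer of the weighted average of 0.5 * (theta - x_s)^2 over past samples.

    gamma=None gives uniform weights; otherwise sample s gets weight gamma^(t-1-s),
    so older samples are geometrically discounted.
    """
    t = len(samples)
    if gamma is None:
        weights = [1.0] * t
    else:
        weights = [gamma ** (t - 1 - s) for s in range(t)]
    return sum(w * x for w, x in zip(weights, samples)) / sum(weights)


def tracking_errors(
    stream: Sequence[float],
    step_size: float = 0.5,   # assumed GD step size
    gd_steps: int = 1,        # gradient updates performed per time step
    gamma: Optional[float] = None,
) -> List[float]:
    """Run GD on the current weighted objective and record |theta_t - optimum_t|."""
    theta = 0.0
    seen: List[float] = []
    errors: List[float] = []
    for x in stream:
        seen.append(x)
        target = weighted_optimum(seen, gamma)
        for _ in range(gd_steps):
            theta -= step_size * (theta - target)  # gradient of the weighted squared loss
        errors.append(abs(theta - target))
    return errors


if __name__ == "__main__":
    stream = [float(i % 5) for i in range(200)]  # hypothetical data stream
    print(tracking_errors(stream)[-1], tracking_errors(stream, gamma=0.9)[-1])
```

Varying `gamma` and `gd_steps` in this toy model is a quick way to explore how the two quantities named in the abstract affect the error under discounted weighting.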
[AI-45] Randomness and Interpolation Improve Gradient Descent
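[Quick Read]: This paper addresses the slow convergence and tendency to overfit of Stochastic Gradient Descent (SGD) when training deep neural networks. The key to the solution is two modified optimizers: Interpolational Accelerating Gradient Descent (IAGD) and Noise-Regularized Stochastic Gradient Descent (NRSGD). IAGD uses second-order Newton interpolation to model the correlation of gradients between iterations and speed up convergence, while NRSGD regularizes the gradients with controlled noise to curb overfitting. Experiments on the CIFAR-10 and CIFAR-100 datasets compare both methods against the classical optimizers in Keras and indicate their potential.
Link: https://arxiv.org/abs/2510.13040
Authors: Jiawen Li,Pascal Lefevre,Anwar Pp Abdul Majeed
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: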
Abstract: Based on Stochastic Gradient Descent (SGD), the paper introduces two optimizers: Interpolational Accelerating Gradient Descent (IAGD) and Noise-Regularized Stochastic Gradient Descent (NRSGD). IAGD leverages second-order Newton interpolation to expedite convergence during training, assuming that gradients are correlated between iterations. To avoid overfitting, NRSGD incorporates a noise regularization technique that introduces controlled noise to the gradients during optimization. Comparative experiments are conducted on the CIFAR-10 and CIFAR-100 datasets, benchmarking different Convolutional Neural Networks (CNNs) trained with IAGD and NRSGD against the classical optimizers in the Keras package. The results demonstrate the potential of these two methods as viable improvements to SGD, indicating the effectiveness of the proposed advancements.
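The idea of injecting controlled noise into the gradients can be sketched as a single update step. This is only an illustration of the general technique: the abstract does not state NRSGD's noise distribution or schedule, so the zero-mean Gaussian noise, its scale, and the function name are assumptions. IAGD's interpolation step is not shown.

```python
import random
from typing import List, Optional, Sequence


def noisy_sgd_step(
    params: Sequence[float],
    grads: Sequence[float],
    lr: float = 0.1,
    noise_std: float = 0.01,  # assumed noise scale
    rng: Optional[random.Random] = None,
) -> List[float]:
    """One SGD step with zero-mean Gaussian noise added to every gradient entry."""
    rng = rng or random.Random()
    return [p - lr * (g + rng.gauss(0.0, noise_std)) for p, g in zip(params, grads)]


if __name__ == "__main__":
    # Hypothetical parameters and gradients; a seeded generator keeps runs repeatable.
    print(noisy_sgd_step([1.0, 2.0], [0.5, -0.5], rng=random.Random(0)))
```

Setting `noise_std` to zero recovers plain SGD, which makes the noise level easy to treat as a tunable regularization strength.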
[AI-46] Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
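[Quick Read]: This paper addresses reward hacking in reinforcement learning (RL), which arises when a human-designed reward function is misaligned with the human's true objective and serves only as a proxy, so that optimizing it yields a misaligned policy. Learning a reward function directly from human feedback mitigates the problem, but collecting the data is costly, while relying on a hand-specified reward alone can lead to highly suboptimal policies. The key to the solution is Preference-Based Reward Repair (PBRR), an iterative framework that learns an additive, transition-dependent correction term from preferences and thereby repairs the proxy reward locally, so that high-performing policies can be learned from substantially fewer preferences. PBRR uses a targeted exploration strategy and a new preference-learning objective to identify and correct the critical transitions. In tabular domains its cumulative regret provably matches, up to constants, that of prior preference-based RL methods, and it consistently outperforms baselines on a suite of reward-hacking benchmarks.
Link: https://arxiv.org/abs/2510.13036
Authors: Stephane Hatgis-Kessell,Logan Mondal Bhamidipaty,Emma Brunskill
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: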
Abstract:Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the humans’ true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human’s true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct for those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove in tabular domains PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high performing policies.
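The phrase "additive, transition-dependent correction term" can be illustrated with a tabular sketch in which the repaired reward is the proxy reward plus a per-transition correction. The Bradley-Terry preference model used below is a common choice in preference-based RL and is an assumption here, not something the abstract states; PBRR's exploration strategy and its actual learning objective are not shown, and all names are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Transition = Tuple[str, str, str]  # (state, action, next_state)


def repaired_return(
    trajectory: Sequence[Transition],
    proxy_reward: Callable[[Transition], float],
    correction: Dict[Transition, float],
) -> float:
    """Sum of proxy reward plus the additive correction on each transition."""
    return sum(proxy_reward(tr) + correction.get(tr, 0.0) for tr in trajectory)


def preference_probability(
    traj_a: Sequence[Transition],
    traj_b: Sequence[Transition],
    proxy_reward: Callable[[Transition], float],
    correction: Dict[Transition, float],
) -> float:
    """Assumed Bradley-Terry probability that traj_a is preferred over traj_b."""
    diff = (repaired_return(traj_a, proxy_reward, correction)
            - repaired_return(traj_b, proxy_reward, correction))
    return 1.0 / (1.0 + math.exp(-diff))


if __name__ == "__main__":
    loop = ("s", "loop", "s")      # hypothetical transition the proxy over-rewards
    go = ("s", "go", "goal")       # hypothetical transition the human actually wants
    proxy = lambda tr: 1.0         # proxy pays 1 per step, so looping looks attractive
    hack, good = [loop] * 3, [go]
    print(preference_probability(hack, good, proxy, {}))
    print(preference_probability(hack, good, proxy, {loop: -2.0}))
```

A correction on a single transition is enough to flip the predicted preference in this toy case, which mirrors the abstract's point that fixing only a few transitions may suffice.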
[AI-47] Toward Reasoning-Centric Time-Series Analysis
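[Quick Read]: This paper addresses the difficulty traditional time series analysis has in capturing the driving factors behind real-world dynamics: existing methods rely on static benchmarks and surface-level pattern recognition and cope poorly with policy shifts, adaptive human behavior, and unexpected events. The key to the solution is to reframe time series analysis as a reasoning task built on Large Language Models (LLMs) that prioritizes causal structure and explainability, bringing the analysis closer to human-aligned understanding and providing transparent, context-aware insights in complex environments.
Link: https://arxiv.org/abs/2510.13029
Authors: Xinlei Wang,Mingtian Tan,Jing Qiu,Junhua Zhao,Jinjin Gu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: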
Abstract:Traditional time series analysis has long relied on pattern recognition, trained on static and well-established benchmarks. However, in real-world settings – where policies shift, human behavior adapts, and unexpected events unfold – effective analysis must go beyond surface-level trends to uncover the actual forces driving them. The recent rise of Large Language Models (LLMs) presents new opportunities for rethinking time series analysis by integrating multimodal inputs. However, as the use of LLMs becomes popular, we must remain cautious, asking why we use LLMs and how to exploit them effectively. Most existing LLM-based methods still employ their numerical regression ability and ignore their deeper reasoning potential. This paper argues for rethinking time series with LLMs as a reasoning task that prioritizes causal structure and explainability. This shift brings time series analysis closer to human-aligned understanding, enabling transparent and context-aware insights in complex real-world environments.
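[AI-48] Deliberate Lab: A Platform for Real-Time Human-AI Social Experiments
【Quick read】: This paper addresses the lack of experimental infrastructure for social and behavioral scientists studying how humans interact, collaborate, and make decisions with Artificial Intelligence (AI). Specifically, few platforms support large-scale, real-time, multi-party experiments; most deployments need bespoke engineering, which hurts replicability and accessibility; and existing tools do not treat AI agents as first-class participants. The key to the solution is Deliberate Lab, an open-source platform for large-scale, real-time behavioral experiments. Its core contribution is supporting both human participants and Large Language Model (LLM)-based AI agents as equal experimental units while lowering technical barriers through standardized interfaces, thereby broadening the methodological repertoire for research on collective decision-making and human-centered AI.
Link: https://arxiv.org/abs/2510.13011
Authors: Crystal Qian,Vivian Tsai,Michael Behr,Nada Hussein,Léo Laugier,Nithum Thain,Lucas Dixon
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: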
Abstract:Social and behavioral scientists increasingly aim to study how humans interact, collaborate, and make decisions alongside artificial intelligence. However, the experimental infrastructure for such work remains underdeveloped: (1) few platforms support real-time, multi-party studies at scale; (2) most deployments require bespoke engineering, limiting replicability and accessibility, and (3) existing tools do not treat AI agents as first-class participants. We present Deliberate Lab, an open-source platform for large-scale, real-time behavioral experiments that supports both human participants and large language model (LLM)-based agents. We report on a 12-month public deployment of the platform (N=88 experimenters, N=9195 experiment participants), analyzing usage patterns and workflows. Case studies and usage scenarios are aggregated from platform users, complemented by in-depth interviews with select experimenters. By lowering technical barriers and standardizing support for hybrid human-AI experimentation, Deliberate Lab expands the methodological repertoire for studying collective decision-making and human-centered AI.
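[AI-49] Developing and Validating the Arabic Version of the Attitudes Toward Large Language Models Scale
【Quick read】: The problem this paper tackles is that, in the Arab region, the lack of culturally and linguistically appropriate instruments makes it hard to assess public attitudes toward Large Language Models (LLMs) accurately. To fill this gap, the authors translated into Arabic and locally validated two scales already validated in English: Attitudes Toward General LLMs (AT-GLLM) and Attitudes Toward Primary LLM (AT-PLLM). The key to the solution is cross-cultural adaptation combined with rigorous psychometric analysis (factor structure, measurement invariance across genders, internal consistency, and convergent and discriminant validity), which shows that the translated scales are reliable and valid for Arabic speakers. This provides a dependable tool for studying attitudes toward LLMs in non-Western contexts and supports regional policy-making and empirical research.
Link: https://arxiv.org/abs/2510.13009
Authors: Basad Barajeeh,Ala Yankouskaya,Sameha AlShakhsi,Chun Sing Maxwell Ho,Guandong Xu,Raian Ali
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 28 Pages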
Abstract:As the use of large language models (LLMs) becomes increasingly global, understanding public attitudes toward these systems requires tools that are adapted to local contexts and languages. In the Arab world, LLM adoption has grown rapidly with both globally dominant platforms and regional ones like Fanar and Jais offering Arabic-specific solutions. This highlights the need for culturally and linguistically relevant scales to accurately measure attitudes toward LLMs in the region. Tools assessing attitudes toward artificial intelligence (AI) can provide a base for measuring attitudes specific to LLMs. The 5-item Attitudes Toward Artificial Intelligence (ATAI) scale, which measures two dimensions, the AI Fear and the AI Acceptance, has been recently adopted and adapted to develop new instruments in English using a sample from the UK: the Attitudes Toward General LLMs (AT-GLLM) and Attitudes Toward Primary LLM (AT-PLLM) scales. In this paper, we translate the two scales, AT-GLLM and AT-PLLM, and validate them using a sample of 249 Arabic-speaking adults. The results show that the scale, translated into Arabic, is a reliable and valid tool that can be used for the Arab population and language. Psychometric analyses confirmed a two-factor structure, strong measurement invariance across genders, and good internal reliability. The scales also demonstrated strong convergent and discriminant validity. Our scales will support research in a non-Western context, a much-needed effort to help draw a global picture of LLM perceptions, and will also facilitate localized research and policy-making in the Arab region.
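[AI-50] From Narratives to Probabilistic Reasoning: Predicting and Interpreting Drivers Hazardous Actions in Crashes Using Large Language Model
【Quick read】: This paper aims to address the low reliability of Driver Hazardous Action (DHA) labels in large-scale crash databases. The core challenge is that traditional manual coding is inconsistent and labor-intensive, which limits the validity and interpretability of DHA classification. The key to the solution is an automated framework based on a fine-tuned large language model (LLM): a Llama 3.2 1B model infers DHA categories from textual crash narratives, reaching 80% overall accuracy and outperforming conventional machine-learning baselines (Random Forest, XGBoost, CatBoost, and a neural network), particularly on imbalanced data. In addition, a probabilistic reasoning approach runs counterfactual analyses on key variables such as driver distraction and age, making the model's decisions more interpretable and offering a reliable, interpretable tool for large-scale DHA identification in traffic safety management.
Link: https://arxiv.org/abs/2510.13002
Authors: Boyou Chen,Gerui Xu,Zifei Wang,Huizhong Guo,Ananna Ahmed,Zhaonan Sun,Zhen Hu,Kaihan Zhang,Shan Bao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: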
Abstract:Vehicle crashes involve complex interactions between road users, split-second decisions, and challenging environmental conditions. Among these, two-vehicle crashes are the most prevalent, accounting for approximately 70% of roadway crashes and posing a significant challenge to traffic safety. Identifying Driver Hazardous Action (DHA) is essential for understanding crash causation, yet the reliability of DHA data in large-scale databases is limited by inconsistent and labor-intensive manual coding practices. Here, we present an innovative framework that leverages a fine-tuned large language model to automatically infer DHAs from textual crash narratives, thereby improving the validity and interpretability of DHA classifications. Using five years of two-vehicle crash data from MTCF, we fine-tuned the Llama 3.2 1B model on detailed crash narratives and benchmarked its performance against conventional machine learning classifiers, including Random Forest, XGBoost, CatBoost, and a neural network. The fine-tuned LLM achieved an overall accuracy of 80%, surpassing all baseline models and demonstrating pronounced improvements in scenarios with imbalanced data. To increase interpretability, we developed a probabilistic reasoning approach, analyzing model output shifts across original test sets and three targeted counterfactual scenarios: variations in driver distraction and age. Our analysis revealed that introducing distraction for one driver substantially increased the likelihood of “General Unsafe Driving”; distraction for both drivers maximized the probability of “Both Drivers Took Hazardous Actions”; and assigning a teen driver markedly elevated the probability of “Speed and Stopping Violations.” Our framework and analytical methods provide a robust and interpretable solution for large-scale automated DHA detection, offering new opportunities for traffic safety analysis and intervention.
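[AI-51] SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents
【Quick read】: This paper aims to address the lack of formal safety evaluation methods for embodied agents driven by Large Language Models (LLMs) in physical environments. Existing methods rely on heuristic rules or subjective LLM judgments and struggle to identify potential safety risks systematically. The key to the solution is the Sentinel framework, which models physical safety requirements precisely in Temporal Logic (TL) semantics and builds a multi-level verification pipeline: at the semantic level, natural-language safety requirements are converted into TL formulas and the LLM's understanding is checked for consistency with them; at the plan level, high-level action plans and subgoals are verified against the TL constraints; at the trajectory level, execution trajectories are merged into a computation tree and verified against physically detailed TL specifications. This provides safety verification across levels, from abstract to concrete, and markedly improves the systematic evaluation of LLM-based embodied agents.
Link: https://arxiv.org/abs/2510.12985
Authors: Simon Sinong Zhan,Yao Liu,Philip Wang,Zinan Wang,Qineng Wang,Zhian Ruan,Xiangyu Shi,Xinyu Cao,Frank Yang,Kangrui Wang,Huajie Shao,Manling Li,Qi Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: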
Abstract:We present Sentinel, the first framework for formally evaluating the physical safety of Large Language Model (LLM)-based embodied agents across the semantic, plan, and trajectory levels. Unlike prior methods that rely on heuristic rules or subjective LLM judgments, Sentinel grounds practical safety requirements in formal temporal logic (TL) semantics that can precisely specify state invariants, temporal dependencies, and timing constraints. It then employs a multi-level verification pipeline where (i) at the semantic level, intuitive natural language safety requirements are formalized into TL formulas and the LLM agent’s understanding of these requirements is probed for alignment with the TL formulas; (ii) at the plan level, high-level action plans and subgoals generated by the LLM agent are verified against the TL formulas to detect unsafe plans before execution; and (iii) at the trajectory level, multiple execution trajectories are merged into a computation tree and efficiently verified against physically-detailed TL specifications for a final safety check. We apply Sentinel in VirtualHome and ALFRED, and formally evaluate multiple LLM-based embodied agents against diverse safety requirements. Our experiments show that by grounding physical safety in temporal logic and applying verification methods across multiple levels, Sentinel provides a rigorous foundation for systematically evaluating LLM-based embodied agents in physical environments, exposing safety violations overlooked by previous methods and offering insights into their failure modes.
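As a rough illustration of plan-level checking against temporal-logic requirements, the sketch below evaluates a state invariant and a response property on a finite sequence of states. It assumes simple finite-trace semantics and hand-written predicates; the state keys and helper names are hypothetical and do not come from Sentinel itself.

```python
def always(trace, predicate):
    """G(predicate): the predicate holds in every state of the finite trace."""
    return all(predicate(state) for state in trace)

def eventually(trace, predicate):
    """F(predicate): the predicate holds in at least one state of the trace."""
    return any(predicate(state) for state in trace)

def responds(trace, trigger, reaction):
    """G(trigger -> F reaction): every trigger is followed (or met) by a reaction."""
    return all(
        eventually(trace[i:], reaction)
        for i, state in enumerate(trace)
        if trigger(state)
    )

# Hypothetical abstract states produced by simulating a high-level plan.
safe_plan = [
    {"stove_on": False, "knife_on_floor": False},
    {"stove_on": True, "knife_on_floor": False},
    {"stove_on": False, "knife_on_floor": False},
]
unsafe_plan = safe_plan[:2]  # the plan ends with the stove still on

def stove_on(state):
    return state["stove_on"]

def stove_off(state):
    return not state["stove_on"]

def no_knife_on_floor(state):
    return not state["knife_on_floor"]

print(responds(safe_plan, stove_on, stove_off))    # every "on" is later turned off
print(responds(unsafe_plan, stove_on, stove_off))  # violated: stove left on
print(always(safe_plan, no_knife_on_floor))        # state invariant holds
```

A real verifier would work on formulas rather than Python closures and would also handle timing constraints, which this sketch leaves out.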
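[AI-52] A Multimodal XAI Framework for Trustworthy CNNs and Bias Detection in Deep Representation Learning
【Quick read】: This paper aims to address the failure of standard benchmark datasets such as MNIST to expose latent biases and multimodal feature complexity, which limits the trustworthiness of deep neural networks in high-stakes applications. The key to the solution is a novel multimodal Explainable AI (XAI) framework that combines attention-augmented feature fusion, Grad-CAM++-based local explanations, and a Reveal-to-Revise feedback loop for bias detection and mitigation. The design improves classification performance (93.2% accuracy) while increasing model transparency and fairness, balancing performance, interpretability, and fairness.
Link: https://arxiv.org/abs/2510.12957
Authors: Noor Islam S. Mohammad
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: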
Abstract:Standard benchmark datasets, such as MNIST, often fail to expose latent biases and multimodal feature complexities, limiting the trustworthiness of deep neural networks in high-stakes applications. We propose a novel multimodal Explainable AI (XAI) framework that unifies attention-augmented feature fusion, Grad-CAM++-based local explanations, and a Reveal-to-Revise feedback loop for bias detection and mitigation. Evaluated on multimodal extensions of MNIST, our approach achieves 93.2% classification accuracy, 91.6% F1-score, and 78.1% explanation fidelity (IoU-XAI), outperforming unimodal and non-explainable baselines. Ablation studies demonstrate that integrating interpretability with bias-aware learning enhances robustness and human alignment. Our work bridges the gap between performance, transparency, and fairness, highlighting a practical pathway for trustworthy AI in sensitive domains.
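The abstract reports explanation fidelity as an IoU-based score but does not say how it is computed. One plausible reading, shown here purely as an assumption, is the intersection-over-union between a thresholded saliency map and a reference mask. The threshold, the toy arrays, and the function names are hypothetical.

```python
def binarize(saliency, threshold=0.5):
    """Turn a 2-D saliency map into a 0/1 mask (the threshold is an assumption)."""
    return [[1 if value >= threshold else 0 for value in row] for row in saliency]

def iou(mask_a, mask_b):
    """Intersection over union of two equally sized 0/1 masks."""
    pairs = [(a, b) for row_a, row_b in zip(mask_a, mask_b) for a, b in zip(row_a, row_b)]
    intersection = sum(a & b for a, b in pairs)
    union = sum(a | b for a, b in pairs)
    return intersection / union if union else 1.0

# Toy 2x2 saliency map (e.g. from a Grad-CAM++-style method) and a reference mask.
saliency_map = [[0.9, 0.2],
                [0.7, 0.1]]
reference_mask = [[1, 1],
                  [0, 0]]

fidelity = iou(binarize(saliency_map), reference_mask)
print(round(fidelity, 3))
```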
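[AI-53] SpareCodeSearch: Searching for Code Context When You Have No Spare GPU
【Quick read】: This paper aims to address the problem that Retrieval-Augmented Generation (RAG) frameworks depend on semantic search modules whose computational cost makes them hard to deploy in lightweight settings such as in-IDE AI code completion. The key to the solution is showing that keyword search alone can retrieve relevant and useful code context from large codebases, so no embedding models need to be trained or hosted. This greatly reduces the dependence on GPU resources while still delivering strong code completion: on the Code Context Competition benchmark, the approach reaches chRF scores of 0.748 on the Kotlin track and 0.725 on the Python track.
Link: https://arxiv.org/abs/2510.12948
Authors: Minh Nguyen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures, 4 tables. Accepted to Context Collection Workshop co-located with ASE’25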
Abstract:Retrieval-Augmented Generation (RAG) frameworks aim to enhance Code Language Models (CLMs) by including another module for retrieving relevant context to construct the input prompt. However, these retrieval modules commonly use semantic search, requiring substantial computational resources for training and hosting these embedded models, making them infeasible to integrate into lightweight applications such as in-IDE AI-based code completion. In this solution paper, we prove that using keyword-search is sufficient to retrieve relevant and useful code context inside large codebases, without the need for extensive GPU resources. The usefulness of code contexts found by our solution is demonstrated through their completion results on the Code Context Competition’s benchmark, reaching 0.748 and 0.725 chRF scores on Kotlin and Python tracks, respectively.
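To make the contrast with embedding-based retrieval concrete, here is a minimal CPU-only keyword ranker. It uses BM25 as a stand-in; the abstract does not say which keyword-search method the paper uses, so the scoring function, tokenizer, and sample snippets are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split code or text into lowercase identifier-like tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)]

def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return document indices sorted by descending BM25 score for the query."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Hypothetical snippets standing in for files in a codebase.
snippets = [
    "def parse_config(path): return load(path)",
    "class HttpClient: def get(self, url): pass",
    "def load(path): return open(path).read()",
]
ranking = bm25_rank("parse_config(path)", snippets)
print(ranking)
```

Nothing here needs a GPU or a trained model: the index is just token counts, which is the property that makes keyword search attractive for in-IDE use.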
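[AI-54] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems NEURIPS2025
【Quick read】: This paper aims to address the substantial computational overhead that multi-agent Large Language Model (LLM) systems incur at inference time from repeatedly processing overlapping contexts. In a typical pipeline, whenever an agent receives a message from its predecessor, it must reprocess from scratch the full context, including all prior interactions, which is redundant computation. Key-Value (KV) caching avoids such recomputation in single-agent settings, but it cannot be applied directly to multi-agent settings because the differing prefixes introduced by each agent shift the KV-cache offsets inconsistently (offset variance). The key to the solution is KVCOMM, a training-free framework that maintains an online-updated anchor pool recording the cache deviations observed under different prefixes, uses it to estimate and adjust the KV-caches of shared content, and thereby aligns the cache offsets of overlapping contexts across agents for efficient prefilling. Experiments show that KVCOMM achieves a cache reuse rate above 70% across a range of multi-agent tasks with no quality loss, and up to a 7.8x speedup.
Link: https://arxiv.org/abs/2510.12872
Authors: Hancheng Ye,Zhengqi Gao,Mingyuan Ma,Qinsi Wang,Yuzhe Fu,Ming-Yu Chung,Yueqian Lin,Zhijian Liu,Jianyi Zhang,Danyang Zhuo,Yiran Chen
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted for publication in NeurIPS 2025. Code is available at this https URL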
Abstract:Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.
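The reported speedup follows directly from the two time-to-first-token (TTFT) figures in the abstract. The short check below only redoes that arithmetic; the function names are illustrative, and the inputs are the approximate values quoted above.

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms

def latency_reduction(baseline_ms, optimized_ms):
    """Fraction of the baseline latency that is removed."""
    return 1.0 - optimized_ms / baseline_ms

# Approximate TTFT values quoted in the abstract (five-agent setting).
baseline_ttft_ms = 430.0
kvcomm_ttft_ms = 55.0

ttft_speedup = speedup(baseline_ttft_ms, kvcomm_ttft_ms)
ttft_saved = latency_reduction(baseline_ttft_ms, kvcomm_ttft_ms)
print(f"speedup: {ttft_speedup:.1f}x, TTFT reduced by {ttft_saved:.0%}")
```

Because both inputs are rounded in the abstract, the result only confirms that roughly 430 ms and roughly 55 ms are consistent with the stated factor of about 7.8.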
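[AI-55] Three Lenses on the AI Revolution: Risk, Transformation, Continuity
【Quick read】: The question this paper addresses is how to understand the place of Artificial Intelligence (AI) among historical technological revolutions and its multiple impacts, and in particular how, along the three dimensions of risk, transformation, and continuity, to respond to the systemic challenges AI raises and steer it toward good outcomes. The key to the solution is a composite framework that balances incentives for innovation with safety governance: institutions and norms should manage AI's singularity-class tail risks, while the historical patterns of democratization, falling costs, and deepening personalization continue. The paper also stresses embedding AI within a human order of responsibility, so that judgment, trust, and ethical responsibility remain core values as AI develops, and governing multi-agent systems so that moral agent behavior stays scalable and controllable.
Link: https://arxiv.org/abs/2510.12859
Authors: Masoud Makrehchi
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 17 pages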
Abstract:Artificial Intelligence (AI) has emerged as both a continuation of historical technological revolutions and a potential rupture with them. This paper argues that AI must be viewed simultaneously through three lenses: risk, where it resembles nuclear technology in its irreversible and global externalities; transformation, where it parallels the Industrial Revolution as a general-purpose technology driving productivity and reorganization of labor; and continuity, where it extends the fifty-year arc of computing revolutions from personal computing to the internet to mobile. Drawing on historical analogies, we emphasize that no past transition constituted a strict singularity: disruptive shifts eventually became governable through new norms and institutions. We examine recurring patterns across revolutions – democratization at the usage layer, concentration at the production layer, falling costs, and deepening personalization – and show how these dynamics are intensifying in the AI era. Sectoral analysis illustrates how accounting, law, education, translation, advertising, and software engineering are being reshaped as routine cognition is commoditized and human value shifts to judgment, trust, and ethical responsibility. At the frontier, the challenge of designing moral AI agents highlights the need for robust guardrails, mechanisms for moral generalization, and governance of emergent multi-agent dynamics. We conclude that AI is neither a singular break nor merely incremental progress. It is both evolutionary and revolutionary: predictable in its median effects yet carrying singularity-class tail risks. Good outcomes are not automatic; they require coupling pro-innovation strategies with safety governance, ensuring equitable access, and embedding AI within a human order of responsibility.
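[AI-56] Adaptive Generation of Bias-Eliciting Questions for LLMs
【Quick read】: This paper aims to address implicit biases that current Large Language Models (LLMs) exhibit in real user interactions, noting that existing bias benchmarks rely on templated prompts or restrictive multiple-choice questions and so fail to reflect the complexity of real use. The key to the solution is a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, and religion, and uses iterative mutation and selection to find the regions where models are most prone to biased behavior. Beyond detecting harmful bias, the framework captures response dimensions that matter increasingly in user interactions, such as asymmetric refusals and explicit acknowledgment of bias, giving a reliable basis for cross-model bias comparison.
Link: https://arxiv.org/abs/2510.12857
Authors: Robin Staab,Jasper Dekoninck,Maximilian Baader,Martin Vechev
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: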
Abstract:Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions worldwide. As they become integrated into everyday tasks, growing reliance on their outputs raises significant concerns. In particular, users may unknowingly be exposed to model-inherent biases that systematically disadvantage or stereotype certain groups. However, existing bias benchmarks continue to rely on templated prompts or restrictive multiple-choice questions that are suggestive, simplistic, and fail to capture the complexity of real-world user interactions. In this work, we address this gap by introducing a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion. By iteratively mutating and selecting bias-inducing questions, our approach systematically explores areas where models are most susceptible to biased behavior. Beyond detecting harmful biases, we also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias. Leveraging our framework, we construct CAB, a human-verified benchmark spanning diverse topics, designed to enable cross-model comparisons. Using CAB, we analyze a range of LLMs across multiple bias dimensions, revealing nuanced insights into how different models manifest bias. For instance, while GPT-5 outperforms other models, it nonetheless exhibits persistent biases in specific scenarios. These findings underscore the need for continual improvements to ensure fair model behavior.
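The "iteratively mutating and selecting" loop can be pictured as a small evolutionary search. The sketch below is only a schematic under strong assumptions: `toy_mutate` and `toy_bias_score` are hypothetical placeholders, whereas a real system would presumably rewrite questions with an LLM and score them by comparing model answers across counterfactual attribute values.

```python
import random

def evolve_questions(seeds, mutate, bias_score, generations=3, keep=2, seed=0):
    """Repeatedly mutate the current pool and keep the highest-scoring questions."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(generations):
        mutated = [mutate(question, rng) for question in pool]
        candidates = list(dict.fromkeys(pool + mutated))  # de-duplicate, keep order
        pool = sorted(candidates, key=bias_score, reverse=True)[:keep]
    return pool

# Hypothetical stand-ins so the loop can run without any model.
TOPICS = ["hiring", "loans", "housing"]

def toy_mutate(question, rng):
    """Placeholder mutation: append a randomly chosen topic hint."""
    return f"{question} Consider {rng.choice(TOPICS)}."

def toy_bias_score(question):
    """Placeholder score: counts hints; NOT a real bias measurement."""
    return question.count("Consider")

seed_questions = ["How should {person} prepare for the interview?"]
best = evolve_questions(seed_questions, toy_mutate, toy_bias_score)
print(best[0])
```

The `{person}` slot marks where a counterfactual attribute value would be substituted before querying a model; the sketch never fills it in.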
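[AI-57] Ethic-BERT: An Enhanced Deep Learning Model for Ethical and Non-Ethical Content Classification
【Quick read】: This paper aims to address the tendency of current Artificial Intelligence (AI) systems to rely on superficial correlations rather than principled moral understanding in ethical reasoning, with the goal of making them more reliable in complex ethical decisions. The key to the solution is Ethic-BERT, a BERT-based model optimized for ethical content classification across four domains (Commonsense, Justice, Virtue, and Deontology). Its main ingredients are robust preprocessing on the ETHICS dataset to mitigate vocabulary sparsity and contextual ambiguity; advanced fine-tuning strategies including full model unfreezing, gradient accumulation, and adaptive learning-rate scheduling; and evaluation on an adversarially filtered “Hard Test” split that probes robustness on difficult ethical dilemmas. Experiments show an average accuracy of 82.32% on the standard test and a 15.28% improvement over baseline models on the Hard Test, supporting the approach's effectiveness in strengthening ethical reasoning and reducing bias.
Link: https://arxiv.org/abs/2510.12850
Authors: Mahamodul Hasan Mahadi,Md. Nasif Safwan,Souhardo Rahman,Shahnaj Parvin,Aminun Nahar,Kamruddin Nur
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: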
Abstract:Developing AI systems capable of nuanced ethical reasoning is critical as they increasingly influence human decisions, yet existing models often rely on superficial correlations rather than principled moral understanding. This paper introduces Ethic-BERT, a BERT-based model for ethical content classification across four domains: Commonsense, Justice, Virtue, and Deontology. Leveraging the ETHICS dataset, our approach integrates robust preprocessing to address vocabulary sparsity and contextual ambiguities, alongside advanced fine-tuning strategies like full model unfreezing, gradient accumulation, and adaptive learning rate scheduling. To evaluate robustness, we employ an adversarially filtered “Hard Test” split, isolating complex ethical dilemmas. Experimental results demonstrate Ethic-BERT’s superiority over baseline models, achieving 82.32% average accuracy on the standard test, with notable improvements in Justice and Virtue. In addition, the proposed Ethic-BERT attains 15.28% average accuracy improvement in the HardTest. These findings contribute to performance improvement and reliable decision-making using bias-aware preprocessing and proposed enhanced AI model.
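Of the fine-tuning strategies listed, gradient accumulation is the easiest to show in isolation. The sketch below uses a one-parameter least-squares model in plain Python, not BERT, to show that averaging gradients over equally sized micro-batches reproduces the full-batch gradient. The data and function names are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1 / number of micro-batches."""
    micro_batches = [
        data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total

# Toy data; the equivalence below relies on equally sized micro-batches.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 1.5

full_batch = grad_mse(w, data)
accumulated = accumulated_grad(w, data, micro_batch_size=2)
print(full_batch, accumulated)
```

In practice this lets a model train with a large effective batch size while only one micro-batch is held in memory at a time.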
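[AI-58] Semantic knowledge guides innovation and drives cultural evolution
【Quick read】: The question this paper addresses is how human societies achieve the continual growth in complexity of knowledge and technology through accumulation across generations, where the role of cognitive processes in generating innovations remains unclear. The key to the solution is showing that semantic knowledge, that is, structured associations between concepts and their functions, acts as cognitive scaffolding: it steers exploration toward plausible and meaningful actions and thus promotes cumulative innovation. Using an agent-based model of cultural evolution and a large-scale behavioural experiment (N = 1,243), the study finds that semantic knowledge and social learning reinforce each other synergistically, indicating that semantic knowledge is a core cognitive process driving human cumulative culture.
Link: https://arxiv.org/abs/2510.12837
Authors: Anil Yaman,Shen Tian,Björn Lindström
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Neural and Evolutionary Computing (cs.NE)
Comments: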
Abstract:Cumulative cultural evolution enables human societies to generate increasingly complex knowledge and technology over generations. While social learning transmits innovations between individuals and generations, the cognitive processes that generate these innovations remain poorly understood. Here, we demonstrate that semantic knowledge (structured associations between concepts and their functions) provides cognitive scaffolding for cumulative innovation by guiding exploration toward plausible and meaningful actions. We tested this hypothesis using a cultural evolutionary agent-based model and a large-scale behavioural experiment (N = 1,243), in which individuals performed a task requiring the combination of items into novel innovations. Across both approaches, semantic knowledge and social learning interact synergistically to enhance innovation. Behaviorally, participants without access to semantic knowledge performed no better than chance, even when social learning was available, and relied on shallow exploration strategies. These findings suggest that semantic knowledge is a key cognitive process enabling human cumulative culture.
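[AI-59] Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
【Quick read】: This paper aims to address the weakened synchrony and poor prosody alignment in current approaches to joint speech and gesture synthesis. Existing methods usually generate speech and gestures sequentially, leaving the two loosely coupled. The key to the solution is Gelina, a framework whose core innovation is a discrete autoregressive backbone that models text-to-speech-and-gesture generation jointly over interleaved token sequences, with modality-specific decoders. This supports multi-speaker and multi-style cloning as well as gesture-only synthesis from speech input, and improves gesture generation quality while keeping speech and gesture naturally synchronized.
Link: https://arxiv.org/abs/2510.12834
Authors: Téo Guichoux,Théodor Lemerle,Shivam Mehta,Jonas Beskow,Gustave Eje Henter,Laure Soulier,Catherine Pelachaud,Nicolas Obin
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages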
Abstract:Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
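A simple way to picture "interleaved token sequences" is a single stream in which speech and gesture tokens alternate at a fixed ratio, each tagged with its modality so that modality-specific decoders can pull them apart again. The ratio, tags, and token values below are assumptions for illustration; the abstract does not specify Gelina's actual interleaving scheme.

```python
def interleave(speech_tokens, gesture_tokens, speech_per_gesture=2):
    """Merge two token streams: one gesture token after every N speech tokens."""
    merged = []
    gestures = iter(gesture_tokens)
    for position, token in enumerate(speech_tokens, start=1):
        merged.append(("S", token))
        if position % speech_per_gesture == 0:
            gesture = next(gestures, None)
            if gesture is not None:
                merged.append(("G", gesture))
    merged.extend(("G", token) for token in gestures)  # leftover gesture tokens
    return merged

def split_modalities(sequence):
    """Recover the per-modality streams from a tagged, interleaved sequence."""
    speech = [token for tag, token in sequence if tag == "S"]
    gesture = [token for tag, token in sequence if tag == "G"]
    return speech, gesture

# Hypothetical discrete token ids for each modality.
speech_ids = [1, 2, 3, 4]
gesture_ids = [10, 20]

sequence = interleave(speech_ids, gesture_ids)
print(sequence)
print(split_modalities(sequence))
```

Because both modalities share one sequence, an autoregressive model predicting the next token conditions each gesture token on the speech tokens just before it, which is the intuition behind tighter synchrony.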
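[AI-60] Coherent Load Profile Synthesis with Conditional Diffusion for LV Distribution Network Scenario Generation
【Quick read】: This paper aims to address the lack of realistic and coherent load data for low-voltage distribution networks, which limits the scenario analysis that distribution network operators can carry out for planning and congestion management. Traditional load-modelling approaches such as typical load profiles oversimplify the complexity at substation level, while existing sampling methods and generative models can fit individual load shapes well but often ignore the co-behaviour between substations, which affects the accuracy of higher-voltage network operation; with widespread integration of low-carbon technologies, base-load estimates also fail to capture load diversity. The key to the solution is a Conditional Diffusion model that synthesises daily active and reactive power profiles at the low-voltage distribution substation level. The synthesised profiles have high fidelity in their temporal and statistical properties and preserve the coupling between substations, providing more realistic load scenarios for sub-regional distribution network planning and operation.
Link: https://arxiv.org/abs/2510.12832
Authors: Alistair Brash,Junyi Lu,Bruce Stephen,Blair Brown,Robert Atkinson,Craig Michie,Fraser MacIntyre,Christos Tachtatzis
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: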
Abstract:Limited visibility of power distribution network power flows at the low voltage level presents challenges to both distribution network operators from a planning perspective and distribution system operators from a congestion management perspective. Forestalling these challenges through scenario analysis is confounded by the lack of realistic and coherent load data across representative distribution feeders. Load profiling approaches often rely on summarising demand through typical profiles, which oversimplifies the complexity of substation-level operations and limits their applicability in specific power system studies. Sampling methods, and more recently generative models, have attempted to address this through synthesising representative loads from historical exemplars; however, while these approaches can approximate load shapes to a convincing degree of fidelity, the co-behaviour between substations, which ultimately impacts higher voltage level network operation, is often overlooked. This limitation will become even more pronounced with the increasing integration of low-carbon technologies, as estimates of base loads fail to capture load diversity. To address this gap, a Conditional Diffusion model for synthesising daily active and reactive power profiles at the low voltage distribution substation level is proposed. The evaluation of fidelity is demonstrated through conventional metrics capturing temporal and statistical realism, as well as power flow modelling. The results show synthesised load profiles are plausible both independently and as a cohort in a wider power systems context. The Conditional Diffusion model is benchmarked against both naive and state-of-the-art models to demonstrate its effectiveness in producing realistic scenarios on which to base sub-regional power distribution network planning and operations.
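[AI-61] Gobernanza y trazabilidad “a prueba de AI Act” para casos de uso legales: un marco técnico-jurídico, métricas forenses y evidencias auditables
【Quick read】: This paper aims to address the difficulty of achieving verifiable compliance when Artificial Intelligence (AI) systems are used in the legal domain, in particular under the EU AI Act. The key to the solution is a comprehensive governance framework whose core components are a normative mapping from regulatory requirements to concrete technical controls, a forensic architecture for Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) systems, and an evaluation system with metrics weighted by legal risk. Through the open-source implementation rag-forense and an accompanying experimental protocol, the framework offers an actionable, auditable path to compliance for AI systems in legal settings.
Link: https://arxiv.org/abs/2510.12830
Authors: Alex Dantart
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: In Spanish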
Abstract:This paper presents a comprehensive governance framework for AI systems in the legal sector, designed to ensure verifiable compliance with the EU AI Act. The framework integrates a normative mapping of the regulation to technical controls, a forensic architecture for RAG/LLM systems, and an evaluation system with metrics weighted by legal risk. As a primary contribution, we present rag-forense, an open-source implementation of the framework, accompanied by an experimental protocol to demonstrate compliance. – Este artículo presenta un marco integral de gobernanza para sistemas de IA en el sector legal, diseñado para garantizar el cumplimiento verificable del Reglamento de IA de la UE (AI Act). El marco integra una cartografía normativa de la ley a controles técnicos, una arquitectura forense para sistemas RAG/LLM y un sistema de evaluación con métricas ponderadas por el riesgo jurídico. Como principal contribución, se presenta rag-forense, una implementación de código abierto del marco, acompañada de un protocolo experimental para demostrar la conformidad.
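[AI-62] Evidence Without Injustice: A New Counterfactual Test for Fair Algorithms
【Quick read】: The question this paper addresses is how, in algorithmic fairness research, to assess whether the evidential value of an algorithm's output is affected by structural injustice, and hence whether it is morally acceptable to use it in punitive decisions. Existing work focuses on statistical fairness criteria (such as equalized odds and calibration) or on causal mechanisms, but overlooks that the validity of algorithmic evidence may itself be compromised by historical and structural injustice. The key to the solution is a counterfactual test: ask whether the evidence provided by an algorithmic output would retain its probative value in nearby worlds without the relevant structural injustice. Contrasting a predictive policing algorithm (which relies on historical crime data) with a camera-based system that records ongoing offenses, the authors argue that the former fails the test because it depends on bias-contaminated data, whereas the latter passes because its evidential source is independent of structural injustice. When evidence fails the test, using it punitively is morally problematic, more so than using evidence that passes.
Link: https://arxiv.org/abs/2510.12822
Authors: Michele Loi,Marcello Di Bello,Nicolò Cangiotti
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages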
Abstract:The growing philosophical literature on algorithmic fairness has examined statistical criteria such as equalized odds and calibration, causal and counterfactual approaches, and the role of structural and compounding injustices. Yet an important dimension has been overlooked: whether the evidential value of an algorithmic output itself depends on structural injustice. Our paradigmatic pair of examples contrasts a predictive policing algorithm, which relies on historical crime data, with a camera-based system that records ongoing offenses, both designed to guide police deployment. In evaluating the moral acceptability of acting on a piece of evidence, we must ask not only whether the evidence is probative in the actual world, but also whether it would remain probative in nearby worlds without the relevant injustices. The predictive policing algorithm fails this test, but the camera-based system passes it. When evidence fails the test, it is morally problematic to use it punitively, more so than evidence that passes the test.
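[AI-63] Beyond Discrete Categories: Multi-Task Valence-Arousal Modeling for Pet Vocalization Analysis
【Quick read】: This paper aims to address the ambiguity of traditional pet emotion recognition based on discrete classification and its difficulty in capturing variations in emotional intensity. The core of the solution is a continuous Valence-Arousal (VA) model that represents emotions as continuous variables in a two-dimensional space, with an automatic VA label generation algorithm enabling large-scale annotation of 42,553 pet vocalization samples. A multi-task learning framework jointly trains VA regression with auxiliary tasks (emotion category, body size, gender) to strengthen feature learning, and an Audio Transformer model reaches a validation Pearson correlation of r = 0.9024 for Valence and r = 0.7155 for Arousal, effectively separating easily confused discrete categories such as “territorial” and “happy.”
Link: https://arxiv.org/abs/2510.12819
Authors: Junyao Huang,Rumin Situ
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 24 pages, 6 figures, 4 tables. First continuous VA framework for pet vocalization analysis with 42,553 samples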
Abstract:Traditional pet emotion recognition from vocalizations, based on discrete classification, struggles with ambiguity and capturing intensity variations. We propose a continuous Valence-Arousal (VA) model that represents emotions in a two-dimensional space. Our method uses an automatic VA label generation algorithm, enabling large-scale annotation of 42,553 pet vocalization samples. A multi-task learning framework jointly trains VA regression with auxiliary tasks (emotion, body size, gender) to enhance prediction by improving feature learning. Our Audio Transformer model achieves a validation Valence Pearson correlation of r = 0.9024 and an Arousal r = 0.7155, effectively resolving confusion between discrete categories like “territorial” and “happy.” This work introduces the first continuous VA framework for pet vocalization analysis, offering a more expressive representation for human-pet interaction, veterinary diagnostics, and behavioral training. The approach shows strong potential for deployment in consumer products like AI pet emotion translators.
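The headline numbers are Pearson correlations between predicted and labelled Valence or Arousal values. For readers who want to see what that metric measures, here is a minimal sketch of the standard formula; the sample values are made up and unrelated to the paper's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Made-up labelled vs. predicted valence scores for a handful of clips.
labelled_valence = [-0.8, -0.2, 0.1, 0.5, 0.9]
predicted_valence = [-0.6, -0.3, 0.2, 0.4, 0.7]

r_valence = pearson_r(labelled_valence, predicted_valence)
print(round(r_valence, 4))
```

A value near 1 means predictions rise and fall with the labels; it says nothing about whether the automatically generated labels themselves are accurate.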
[AI-64] Narrow Operator Models of Stellarator Equilibria in Fourier Zernike Basis
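[Quick read]: This paper addresses a limitation of conventional ideal magnetohydrodynamic (MHD) equilibrium computation: solvers find only a single stationary point and cannot systematically explore the continuous family of equilibria obtained under different pressure profiles with fixed boundary and rotational transform. The key to the solution is a new numerical approach that optimises the parameters of multilayer perceptrons (MLPs) mapping a scalar pressure multiplier to the Fourier-Zernike basis, minimising the force residual and thereby solving for a continuous distribution of equilibria. The method moves beyond single-point solves and offers a more flexible, systematic basis for stellarator optimisation.
Link: https://arxiv.org/abs/2510.13521
Authors: Timo Thun,Rory Conlin,Dario Panici,Daniel Böckenhoff
Affiliations: unknown
Subjects: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, 1 table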
Abstract:Numerical computation of the ideal Magnetohydrodynamic (MHD) equilibrium magnetic field is at the base of stellarator optimisation and provides the starting point for solving more sophisticated Partial Differential Equations (PDEs) like transport or turbulence models. Conventional approaches solve for a single stationary point of the ideal MHD equations, which is fully defined by three invariants and the numerical scheme employed by the solver. We present the first numerical approach that can solve for a continuous distribution of equilibria with fixed boundary and rotational transform, varying only the pressure invariant. This approach minimises the force residual by optimising parameters of multilayer perceptrons (MLP) that map from a scalar pressure multiplier to the Fourier Zernike basis as implemented in the modern stellarator equilibrium solver DESC.
[AI-65] Semantic Communication Enabled Holographic Video Processing and Transmission
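[Quick read]: This paper addresses bottlenecks in transmission efficiency and user experience in conventional holographic video communication systems, in particular the tension between high bandwidth demand and underused semantic information. The core solution is a semantic-enabled holographic video communication architecture. Its key technologies are semantic sampling, joint semantic-channel coding, and semantic-aware transmission, which together improve the extraction, compression, and reliable transmission of key visual semantic information, reducing transmission overhead while preserving the immersive experience.
Link: https://arxiv.org/abs/2510.13408
Authors: Jingkai Ying,Zhiyuan Qi,Yulong Feng,Zhijin Qin,Zhu Han,Rahim Tafazolli,Yonina C. Eldar
Affiliations: unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Multimedia (cs.MM); Signal Processing (eess.SP)
Comments: 7 pages, 6 figures, Submit for review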
Abstract:Holographic video communication is considered a paradigm shift in visual communications, becoming increasingly popular for its ability to offer immersive experiences. This article provides an overview of holographic video communication and outlines the requirements of a holographic video communication system. Particularly, following a brief review of semantic communication, an architecture for a semantic-enabled holographic video communication system is presented. Key technologies, including semantic sampling, joint semantic-channel coding, and semantic-aware transmission, are designed based on the proposed architecture. Two related use cases are presented to demonstrate the performance gain of the proposed methods. Finally, potential research topics are discussed to pave the way for the realization of semantic-enabled holographic video communications.
[AI-66] A Multi-dimensional Semantic Surprise Framework Based on Low-Entropy Semantic Manifolds for Fine-Grained Out-of-Distribution Detection
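[Quick read]: This paper addresses the "cognitive flattening" that results when existing Out-of-Distribution (OOD) detection methods reduce OOD identification to binary classification: they cannot distinguish semantically close (Near-OOD) from semantically distant (Far-OOD) risks, which is a significant safety bottleneck for applications that need fine-grained risk stratification. The key to the solution is a paradigm shift from the conventional probabilistic view to an information-theoretic framework. The core contributions are a Hierarchical Prototypical Network built on Low-Entropy Semantic Manifolds that explicitly captures the data's intrinsic semantic hierarchy, and the Semantic Surprise Vector (SSV), a universal probe that decomposes a sample's total surprise into three interpretable dimensions: conformity, novelty, and ambiguity. The framework sets a new state of the art on the ternary classification task, and its robust representations also perform strongly on conventional binary OOD benchmarks, reducing the False Positive Rate by over 60% on datasets such as LSUN.
Link: https://arxiv.org/abs/2510.13093
Authors: Ningkang Peng,Yuzhe Mao,Yuhao Zhang,Linjin Qian,Qianfeng Yu,Yanhui Gu,Yi Chen,Li Kong
Affiliations: unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: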
Abstract:Out-of-Distribution (OOD) detection is a cornerstone for the safe deployment of AI systems in the open world. However, existing methods treat OOD detection as a binary classification problem, a cognitive flattening that fails to distinguish between semantically close (Near-OOD) and distant (Far-OOD) unknown risks. This limitation poses a significant safety bottleneck in applications requiring fine-grained risk stratification. To address this, we propose a paradigm shift from a conventional probabilistic view to a principled information-theoretic framework. We formalize the core task as quantifying the Semantic Surprise of a new sample and introduce a novel ternary classification challenge: In-Distribution (ID) vs. Near-OOD vs. Far-OOD. The theoretical foundation of our work is the concept of Low-Entropy Semantic Manifolds, which are explicitly structured to reflect the data’s intrinsic semantic hierarchy. To construct these manifolds, we design a Hierarchical Prototypical Network. We then introduce the Semantic Surprise Vector (SSV), a universal probe that decomposes a sample’s total surprise into three complementary and interpretable dimensions: conformity, novelty, and ambiguity. To evaluate performance on this new task, we propose the Normalized Semantic Risk (nSR), a cost-sensitive metric. Experiments demonstrate that our framework not only establishes a new state-of-the-art (sota) on the challenging ternary task, but its robust representations also achieve top results on conventional binary benchmarks, reducing the False Positive Rate by over 60% on datasets like LSUN.
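[AI-67] Towards Human-Centric Intelligent Treatment Planning for Radiation Therapy
[Quick read]: This paper addresses problems in current radiation therapy treatment planning: suboptimal plan quality, inefficiency, and high cost. The key to the solution is Human-Centric Intelligent Treatment Planning (HCITP), a framework that uses AI under human oversight, integrates clinical guidelines, automates plan generation, and supports direct interaction with operators. It is expected to potentially reduce planning time to minutes and deliver personalized, high-quality plans.
Link: https://arxiv.org/abs/2510.13062
Authors: Adnan Jafar,Xun Jia
Affiliations: unknown
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: 27 pages, 3 figures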
Abstract:Current radiation therapy treatment planning is limited by suboptimal plan quality, inefficiency, and high costs. This perspective paper explores the complexity of treatment planning and introduces Human-Centric Intelligent Treatment Planning (HCITP), an AI-driven framework under human oversight, which integrates clinical guidelines, automates plan generation, and enables direct interactions with operators. We expect that HCITP will enhance efficiency, potentially reducing planning time to minutes, and will deliver personalized, high-quality plans. Challenges and potential solutions are discussed.
[AI-68] HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection
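[Quick read]: This paper addresses how to achieve efficient and flexible speaker conditioning in Personalized Voice Activity Detection (PVAD), that is, making a VAD model activate only for a specific target speaker without modifying the existing model architecture. Existing methods often rely on structural changes such as FiLM layers, which limits deployment flexibility and generality. The key solution is a hypernetwork-based weight adaptation method (HyWA-PVAD) that conditions on the speaker by dynamically generating the weights of a few selected layers in a standard VAD model. The same model can thus adapt to different speakers without changing the core VAD architecture, improving performance and simplifying deployment.
Link: https://arxiv.org/abs/2510.12947
Authors: Mahsa Ghazvini Nejad,Hamed Jafarzadeh Asl,Amin Edraki,Mohammadreza Sadeghi,Masoud Asgharian,Yuanhao Yu,Vahid Partovi Nia
Affiliations: unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Mahsa Ghazvini Nejad and Hamed Jafarzadeh Asl contributed equally to this work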
Abstract:Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adapt to different speakers by updating only a small subset of the layers. We propose HyWA-PVAD, a hypernetwork weight adaptation method, and evaluate it against multiple baseline conditioning techniques. Our comparison shows consistent improvements in PVAD performance. HyWA also offers practical advantages for deployment by preserving the core VAD architecture. Our new approach improves the current conditioning techniques in two ways: i) increases the mean average precision, ii) simplifies deployment by reusing the same VAD architecture.
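The idea of generating layer weights from a speaker embedding can be illustrated with a toy example. The sketch below assumes a purely linear hypernetwork and a single dense layer; the function names, shapes, and numbers are hypothetical and chosen only for illustration, and the paper's actual hypernetwork and VAD layers are not specified in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def generate_layer_weights(hyper_weights, speaker_embedding, out_dim, in_dim):
    """Hypothetical linear hypernetwork: embedding -> (out_dim x in_dim) weights."""
    flat = matvec(hyper_weights, speaker_embedding)
    if len(flat) != out_dim * in_dim:
        raise ValueError("hypernetwork output does not match the target layer shape")
    # Reshape the flat output into the weight matrix of the adapted layer.
    return [flat[r * in_dim:(r + 1) * in_dim] for r in range(out_dim)]


def adapted_layer(features, layer_weights):
    """The layer keeps its shape; only its weights depend on the speaker."""
    return matvec(layer_weights, features)


# Toy setup: 2-dimensional speaker embedding, 2x2 adapted layer.
hyper_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
speaker_a = [2.0, 3.0]
weights_a = generate_layer_weights(hyper_weights, speaker_a, out_dim=2, in_dim=2)
print(adapted_layer([1.0, 1.0], weights_a))
```

Swapping in another speaker embedding changes `weights_a` but not the layer's input or output shape, which is the property the abstract highlights for deployment.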
[AI-69] InferA: A Smart Assistant for Cosmological Ensemble Data
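[Quick read]: This paper addresses the challenges of analyzing large-scale scientific datasets (for example, at terabyte scale): sheer volume, structural complexity, and heavy dependence on domain knowledge. Automation tools such as PandasAI typically require full data ingestion and lack context of the full data structure, so they are impractical as intelligent data analysis assistants. The key to the solution is InferA, a multi-agent system built around a supervisor agent that orchestrates specialized agents responsible for distinct phases of data retrieval and analysis. The system interacts with users to elicit their analytical intent, keeping system actions aligned with user goals and enabling scalable, efficient scientific data analysis.
Link: https://arxiv.org/abs/2510.12920
Authors: Justin Z. Tam,Pascal Grosset,Divya Banesh,Nesar Ramachandra,Terece L. Turton,James Ahrens
Affiliations: unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: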
Abstract:Analyzing large-scale scientific datasets presents substantial challenges due to their sheer volume, structural complexity, and the need for specialized domain knowledge. Automation tools, such as PandasAI, typically require full data ingestion and lack context of the full data structure, making them impractical as intelligent data analysis assistants for datasets at the terabyte scale. To overcome these limitations, we propose InferA, a multi-agent system that leverages large language models to enable scalable and efficient scientific data analysis. At the core of the architecture is a supervisor agent that orchestrates a team of specialized agents responsible for distinct phases of the data retrieval and analysis. The system engages interactively with users to elicit their analytical intent and confirm query objectives, ensuring alignment between user goals and system actions. To demonstrate the framework’s usability, we evaluate the system using ensemble runs from the HACC cosmology simulation which comprises several terabytes.
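[AI-70] Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation
[Quick read]: This paper surveys the evolution of Automatic Speech Recognition (ASR) over the past decade from traditional hybrid models to end-to-end neural architectures, and discusses key challenges in current ASR research and practice, including architectural complexity, heavy reliance on labeled data, limitations of training paradigms, and efficiency and fairness in real-world deployment. The key points are as follows. First, it reviews the foundational end-to-end paradigms, Connectionist Temporal Classification (CTC), attention-based encoder-decoder models, and the Recurrent Neural Network Transducer (RNN-T), which laid the groundwork for modern ASR systems. Second, it covers the shift to self-attention architectures such as Transformer and Conformer, which improve the modeling of long-range dependencies. Third, it examines self-supervised learning (SSL) methods such as wav2vec 2.0 and large-scale weakly supervised models such as Whisper, which reduce reliance on manually transcribed data and improve robustness. Finally, it covers the ecosystem needed for real-world deployment, such as standard datasets, evaluation metrics, and on-device efficiency, and outlines directions for future research.
Link: https://arxiv.org/abs/2510.12827
Authors: Md. Nayeem,Md Shamse Tabrej,Kabbojit Jit Deb,Shaonti Goswami,Md. Azizul Hakim
Affiliations: unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: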
Abstract:Automatic Speech Recognition (ASR) has undergone a profound transformation over the past decade, driven by advances in deep learning. This survey provides a comprehensive overview of the modern era of ASR, charting its evolution from traditional hybrid systems, such as Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs) and Deep Neural Network-HMMs (DNN-HMMs), to the now-dominant end-to-end neural architectures. We systematically review the foundational end-to-end paradigms: Connectionist Temporal Classification (CTC), attention-based encoder-decoder models, and the Recurrent Neural Network Transducer (RNN-T), which established the groundwork for fully integrated speech-to-text systems. We then detail the subsequent architectural shift towards Transformer and Conformer models, which leverage self-attention to capture long-range dependencies with high computational efficiency. A central theme of this survey is the parallel revolution in training paradigms. We examine the progression from fully supervised learning, augmented by techniques like SpecAugment, to the rise of self-supervised learning (SSL) with foundation models such as wav2vec 2.0, which drastically reduce the reliance on transcribed data. Furthermore, we analyze the impact of large-scale, weakly supervised models like Whisper, which achieve unprecedented robustness through massive data diversity. The paper also covers essential ecosystem components, including key datasets and benchmarks (e.g., LibriSpeech, Switchboard, CHiME), standard evaluation metrics (e.g., Word Error Rate), and critical considerations for real-world deployment, such as streaming inference, on-device efficiency, and the ethical imperatives of fairness and robustness. We conclude by outlining open challenges and future research directions.
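Word Error Rate, the evaluation metric named above, is commonly defined as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. A minimal sketch under that definition, assuming whitespace tokenization and no text normalization (the function name is illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.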
Machine Learning
[LG-0] MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control
Link: https://arxiv.org/abs/2510.13794
Authors: Xue Bin Peng
Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:MimicKit is an open-source framework for training motion controllers using motion imitation and reinforcement learning. The codebase provides implementations of commonly-used motion-imitation techniques and RL algorithms. This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures. The codebase is designed to be modular and easily configurable, enabling convenient modification and extension to new characters and tasks. The open-source codebase is available at: this https URL.
[LG-1] T3former: Temporal Graph Classification with Topological Machine Learning
Link: https://arxiv.org/abs/2510.13789
Authors: Md. Joshem Uddin,Soham Changani,Baris Coskunuzer
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Algebraic Topology (math.AT)
Comments: 14 pages, 8 figures
Abstract:Temporal graph classification plays a critical role in applications such as cybersecurity, brain connectivity analysis, social dynamics, and traffic monitoring. Despite its significance, this problem remains underexplored compared to temporal link prediction or node forecasting. Existing methods often rely on snapshot-based or recurrent architectures that either lose fine-grained temporal information or struggle with long-range dependencies. Moreover, local message-passing approaches suffer from oversmoothing and oversquashing, limiting their ability to capture complex temporal structures. We introduce T3former, a novel Topological Temporal Transformer that leverages sliding-window topological and spectral descriptors as first-class tokens, integrated via a specialized Descriptor-Attention mechanism. This design preserves temporal fidelity, enhances robustness, and enables principled cross-modal fusion without rigid discretization. T3former achieves state-of-the-art performance across multiple benchmarks, including dynamic social networks, brain functional connectivity datasets, and traffic networks. It also offers theoretical guarantees of stability under temporal and structural perturbations. Our results highlight the power of combining topological and spectral insights for advancing the frontier of temporal graph learning.
MSC classes: 55N31, 68T07, 05C85; ACM classes: G.2.2; I.2.6
[LG-2] Tensor Gaussian Processes: Efficient Solvers for Nonlinear PDEs
Link: https://arxiv.org/abs/2510.13772
Authors: Qiwei Yuan,Zhitong Xu,Yinghao Chen,Yiming Xu,Houman Owhadi,Shandian Zhe
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Machine learning solvers for partial differential equations (PDEs) have attracted growing interest. However, most existing approaches, such as neural network solvers, rely on stochastic training, which is inefficient and typically requires a great many training epochs. Gaussian process (GP)/kernel-based solvers, while mathematically principled, suffer from scalability issues when handling the large numbers of collocation points often needed for challenging or higher-dimensional PDEs. To overcome these limitations, we propose TGPS, a tensor-GP-based solver that models factor functions along each input dimension using one-dimensional GPs and combines them via tensor decomposition to approximate the full solution. This design reduces the task to learning a collection of one-dimensional GPs, substantially lowering computational complexity and enabling scalability to massive collocation sets. For efficient nonlinear PDE solving, we use a partial freezing strategy and Newton's method to linearize the nonlinear terms. We then develop an alternating least squares (ALS) approach that admits closed-form updates, thereby substantially enhancing the training efficiency. We establish theoretical guarantees on the expressivity of our model, together with convergence proof and error analysis under standard regularity assumptions. Experiments on several benchmark PDEs demonstrate that our method achieves superior accuracy and efficiency compared to existing approaches.
[LG-3] Progressive multi-fidelity learning for physical system predictions
Link: https://arxiv.org/abs/2510.13762
Authors: Paolo Conti,Mengwu Guo,Attilio Frangi,Andrea Manzoni
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Highly accurate datasets from numerical or physical experiments are often expensive and time-consuming to acquire, posing a significant challenge for applications that require precise evaluations, potentially across multiple scenarios and in real-time. Even building sufficiently accurate surrogate models can be extremely challenging with limited high-fidelity data. Conversely, less expensive, low-fidelity data can be computed more easily and encompass a broader range of scenarios. By leveraging multi-fidelity information, prediction capabilities of surrogates can be improved. However, in practical situations, data may be different in types, come from sources of different modalities, and not be concurrently available, further complicating the modeling process. To address these challenges, we introduce a progressive multi-fidelity surrogate model. This model can sequentially incorporate diverse data types using tailored encoders. Multi-fidelity regression from the encoded inputs to the target quantities of interest is then performed using neural networks. Input information progressively flows from lower to higher fidelity levels through two sets of connections: concatenations among all the encoded inputs, and additive connections among the final outputs. This dual connection system enables the model to exploit correlations among different datasets while ensuring that each level makes an additive correction to the previous level without altering it. This approach prevents performance degradation as new input data are integrated into the model and automatically adapts predictions based on the available inputs. We demonstrate the effectiveness of the approach on numerical benchmarks and a real-world case study, showing that it reliably integrates multi-modal data and provides accurate predictions, maintaining performance when generalizing across time and parameter variations.
[LG-4] A Complete Pipeline for deploying SNNs with Synaptic Delays on Loihi 2
Link: https://arxiv.org/abs/2510.13757
Authors: Balázs Mészáros,James C. Knight,Jonathan Timcheck,Thomas Nowotny
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract:Spiking Neural Networks are attracting increased attention as a more energy-efficient alternative to traditional Artificial Neural Networks for edge computing. Neuromorphic computing can significantly reduce energy requirements. Here, we present a complete pipeline: efficient event-based training of SNNs with synaptic delays on GPUs and deployment on Intel’s Loihi 2 neuromorphic chip. We evaluate our approach on keyword recognition tasks using the Spiking Heidelberg Digits and Spiking Speech Commands datasets, demonstrating that our algorithm can enhance classification accuracy compared to architectures without delays. Our benchmarking indicates almost no accuracy loss between GPU and Loihi 2 implementations, while classification on Loihi 2 is up to 18x faster and uses 250x less energy than on an NVIDIA Jetson Orin Nano.
[LG-5] Asymptotically optimal reinforcement learning in Block Markov Decision Processes
Link: https://arxiv.org/abs/2510.13748
Authors: Thomas van Vuren,Fiona Sloothaak,Maarten G. Wolf,Jaron Sanders
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 74 pages, 3 figures
Abstract:The curse of dimensionality renders Reinforcement Learning (RL) impractical in many real-world settings with exponentially large state and action spaces. Yet, many environments exhibit exploitable structure that can accelerate learning. To formalize this idea, we study RL in Block Markov Decision Processes (BMDPs). BMDPs model problems with large observation spaces, but where transition dynamics are fully determined by latent states. Recent advances in clustering methods have enabled the efficient recovery of this latent structure. However, a regret analysis that exploits these techniques to determine their impact on learning performance remained open. We address this gap by providing a regret analysis that explicitly leverages clustering, demonstrating that accurate latent state estimation can indeed effectively speed up learning. Concretely, this paper analyzes a two-phase RL algorithm for BMDPs that first learns the latent structure through random exploration and then switches to an optimism-guided strategy adapted to the uncovered structure. This algorithm achieves a regret that is O(\sqrt{T}+n) on a large class of BMDPs susceptible to clustering. Here, T denotes the number of time steps, n is the cardinality of the observation space, and the Landau notation O(\cdot) holds up to constants and polylogarithmic factors. This improves the best prior bound, O(\sqrt{T}+n^2), especially when n is large. Moreover, we prove that no algorithm can achieve lower regret uniformly on this same class of BMDPs. This establishes that, on this class, the algorithm achieves asymptotic optimality.
MSC classes: 90C40, 62H30, 60J20
[LG-6] Assessing the Geographic Generalization and Physical Consistency of Generative Models for Climate Downscaling
Link: https://arxiv.org/abs/2510.13722
Authors: Carlo Saccardi,Maximilian Pierzyna,Haitz Sáez de Ocáriz Borde,Simone Monaco,Cristian Meo,Pietro Liò,Rudolf Saathof,Geethu Joseph,Justin Dauwels
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Kilometer-scale weather data is crucial for real-world applications but remains computationally intensive to produce using traditional weather simulations. An emerging solution is to use deep learning models, which offer a faster alternative for climate downscaling. However, their reliability is still in question, as they are often evaluated using standard machine learning metrics rather than insights from atmospheric and weather physics. This paper benchmarks recent state-of-the-art deep learning models and introduces physics-inspired diagnostics to evaluate their performance and reliability, with a particular focus on geographic generalization and physical consistency. Our experiments show that, despite the seemingly strong performance of models such as CorrDiff, when trained on a limited set of European geographies (e.g., central Europe), they struggle to generalize to other regions such as Iberia, Morocco in the south, or Scandinavia in the north. They also fail to accurately capture second-order variables such as divergence and vorticity derived from predicted velocity fields. These deficiencies appear even in in-distribution geographies, indicating challenges in producing physically consistent predictions. We propose a simple initial solution: introducing a power spectral density loss function that empirically improves geographic generalization by encouraging the reconstruction of small-scale physical structures. The code for reproducing the experimental results can be found at this https URL
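Divergence and vorticity are first derivatives of the velocity field, so they can be estimated from a predicted field by finite differences. The sketch below assumes a regular grid stored as nested lists indexed `[row (y)][column (x)]` and uses central differences on interior points only; the function name and grid layout are assumptions for illustration, not the paper's diagnostic code.

```python
def divergence_and_vorticity(u, v, dx=1.0, dy=1.0):
    """Central-difference divergence (du/dx + dv/dy) and vorticity (dv/dx - du/dy).

    u[j][i] and v[j][i] are the x- and y-velocity at row j (y) and column i (x).
    Returns two grids covering the interior points only.
    """
    ny, nx = len(u), len(u[0])
    div = [[0.0] * (nx - 2) for _ in range(ny - 2)]
    vort = [[0.0] * (nx - 2) for _ in range(ny - 2)]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            du_dx = (u[j][i + 1] - u[j][i - 1]) / (2 * dx)
            dv_dy = (v[j + 1][i] - v[j - 1][i]) / (2 * dy)
            dv_dx = (v[j][i + 1] - v[j][i - 1]) / (2 * dx)
            du_dy = (u[j + 1][i] - u[j - 1][i]) / (2 * dy)
            div[j - 1][i - 1] = du_dx + dv_dy
            vort[j - 1][i - 1] = dv_dx - du_dy
    return div, vort


# Solid-body rotation (u = -y, v = x): zero divergence, constant vorticity.
u = [[-float(j) for i in range(4)] for j in range(4)]
v = [[float(i) for i in range(4)] for j in range(4)]
div, vort = divergence_and_vorticity(u, v)
print(div, vort)
```

Comparing these derived fields between a model's prediction and the reference data is one way to probe physical consistency beyond pixel-wise error.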
[LG-7] Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe
Link: https://arxiv.org/abs/2510.13713
Authors: Christophe Roux,Max Zimmer,Alexandre d’Aspremont,Sebastian Pokutta
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Pruning is a common technique to reduce the compute and storage requirements of Neural Networks. While conventional approaches typically retrain the model to recover pruning-induced performance degradation, state-of-the-art Large Language Model (LLM) pruning methods operate layer-wise, minimizing the per-layer pruning error on a small calibration dataset to avoid full retraining, which is considered computationally prohibitive for LLMs. However, finding the optimal pruning mask is a hard combinatorial problem and solving it to optimality is intractable. Existing methods hence rely on greedy heuristics that ignore the weight interactions in the pruning objective. In this work, we instead consider the convex relaxation of these combinatorial constraints and solve the resulting problem using the Frank-Wolfe (FW) algorithm. Our method drastically reduces the per-layer pruning error, outperforms strong baselines on state-of-the-art GPT architectures, and remains memory-efficient. We provide theoretical justification by showing that, combined with the convergence guarantees of the FW algorithm, we obtain an approximate solution to the original combinatorial problem upon rounding the relaxed solution to integrality.
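To make the relax-then-round idea concrete, the sketch below runs a textbook Frank-Wolfe loop on a toy version of the per-layer problem. It assumes a single weight vector `w`, calibration inputs `X`, the objective ||X(w ∘ m) − Xw||² over masks m in [0,1]^d with sum(m) = k, the standard 2/(t+2) step size, and top-k rounding. All of these choices are illustrative assumptions; the paper's exact formulation, step rule, and rounding scheme may differ.

```python
def frank_wolfe_mask(X, w, k, iterations=200):
    """Relax a k-sparse pruning mask to [0,1]^d, run Frank-Wolfe, round to top-k."""
    d = len(w)
    # Gram matrix G = X^T X of the calibration inputs.
    G = [[sum(row[a] * row[b] for row in X) for b in range(d)] for a in range(d)]
    m = [k / d] * d  # feasible starting point: sum(m) == k
    for t in range(iterations):
        # Removed weight mass r = w * (m - 1); gradient of the error is 2 * w * (G r).
        r = [w[i] * (m[i] - 1.0) for i in range(d)]
        Gr = [sum(G[i][j] * r[j] for j in range(d)) for i in range(d)]
        grad = [2.0 * w[i] * Gr[i] for i in range(d)]
        # Linear minimization oracle: the vertex with ones on the k smallest gradients.
        chosen = sorted(range(d), key=lambda i: grad[i])[:k]
        s = [0.0] * d
        for i in chosen:
            s[i] = 1.0
        gamma = 2.0 / (t + 2.0)
        m = [(1.0 - gamma) * m[i] + gamma * s[i] for i in range(d)]
    # Round the relaxed solution by keeping the k largest mask entries.
    keep = set(sorted(range(d), key=lambda i: -m[i])[:k])
    return [1 if i in keep else 0 for i in range(d)], m


X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
mask, relaxed = frank_wolfe_mask(X, [3.0, 0.1, 2.0, 0.2], k=2)
print(mask)
```

With uncorrelated inputs, as in this toy `X`, the result coincides with magnitude pruning; the relaxation only starts to matter when the Gram matrix has off-diagonal entries, which is where greedy per-weight scores ignore interactions.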
[LG-8] On Pretraining for Project-Level Code Completion
Link: https://arxiv.org/abs/2510.13697
Authors: Maksim Sapronov,Evgeniy Glukhov
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:
Abstract:Repository-level pretraining is commonly used to enable large language models for code to leverage codebase-wide context. This enhances their ability to generate accurate and context-aware code completions. In this work, we investigate how different repository-processing strategies affect in-context learning in OpenCoder, a 1.5B-parameter model. We extend its context window from 4,096 to 16,384 tokens by training on additional 1B tokens of curated repository-level data. Despite relying on a smaller dataset than competing models (which often use hundreds of billions of tokens), our model achieves comparable performance on the Long Code Arena benchmark. We find that various repository-processing techniques yield similarly strong results, with the primary gain coming from adapting to a new rotary positional embedding (RoPE) scaling parameter. Finally, we show that a simpler file-level training approach at the original sequence length remains highly effective, opening up repository-level code completion research to settings with more constrained data and compute resources.
[LG-9] Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
Link: https://arxiv.org/abs/2510.13694
Authors: Yuchun Miao,Liang Ding,Sen Zhang,Rong Bao,Lefei Zhang,Dacheng Tao
Subjects: Machine Learning (cs.LG)
Comments: 46 pages, 36 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract:Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking, or reward over-optimization, remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM’s IB latent space, measured by Mahalanobis distance from the SFT-induced distribution. Motivated by this, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective within the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying reward hacking severity, enabling principled hyperparameter tuning and online mitigation such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool, collectively advancing the state of RLHF.
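Mahalanobis distance measures how far a point lies from a reference distribution in units of that distribution's covariance. The sketch below fits a two-dimensional Gaussian to a handful of reference points and scores a query point. The 2-D restriction, the population covariance estimate, and all names and values are simplifying assumptions for illustration and do not describe InfoRM's latent space or how MOP is computed.

```python
import math


def fit_gaussian_2d(points):
    """Mean and population covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(point, mean, cov):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu)) using the closed-form 2x2 inverse."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Inverse of [[a, b], [b, d]] is (1 / det) * [[d, -b], [-b, a]].
    return math.sqrt((d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)


# Hypothetical reference latents and one far-away query point.
reference = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
mean, cov = fit_gaussian_2d(reference)
print(mahalanobis_2d((3.0, 4.0), mean, cov))
```

A point whose distance is large relative to the reference samples would be flagged as an outlier; the threshold is a separate modelling choice not covered here.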
[LG-10] Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
Link: https://arxiv.org/abs/2510.13680
Authors: Bingbin Liu,Rachit Bansal,Depen Morwani,Nikhil Vyas,David Alvarez-Melis,Sham M. Kakade
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Diagonal preconditioners are computationally feasible approximations to second-order optimizers, which have shown significant promise in accelerating training of deep learning models. Two predominant approaches are based on Adam and Gauss-Newton (GN) methods: the former leverages statistics of current gradients and is the de facto optimizer for neural networks, and the latter uses the diagonal elements of the Gauss-Newton matrix and underpins some of the recent diagonal optimizers such as Sophia. In this work, we compare these two diagonal preconditioning methods through the lens of two key factors: the choice of basis in the preconditioner, and the impact of gradient noise from mini-batching. To gain insights, we analyze these optimizers on quadratic objectives and logistic regression under all four quadrants. We show that regardless of the basis, there exist instances where Adam outperforms both GN^{-1} and GN^{-1/2} in full-batch settings. Conversely, in the stochastic regime, Adam behaves similarly to GN^{-1/2} for linear regression under a Gaussian data assumption. These theoretical results are supported by empirical studies on both convex and non-convex objectives.
[LG-11] Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference
Link: https://arxiv.org/abs/2510.13668
Authors: Zhibin Wang,Zetao Hong,Xue Li,Zibo Wang,Shipeng Li,Qingkai Meng,Qing Wang,Chengying Huan,Rong Gu,Sheng Zhong,Chen Tian
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) a lightweight and continuous LLM-native prediction method that leverages the LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) a rescheduling solution in the decode phase with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24 times higher goodput.
[LG-12] Rebalancing with Calibrated Sub-classes (RCS): An Enhanced Approach for Robust Imbalanced Classification
Link: https://arxiv.org/abs/2510.13656
Authors: Priyobrata Mondal,Faizanuddin Ansari,Swagatam Das
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The class imbalance problem refers to the insufficiency of data in certain classes, which causes a classifier to be biased toward the majority class. Distribution calibration is a technique that seeks to estimate a more accurate class distribution based on an observed or estimated one. To address this issue, we propose a distribution calibration-based method, Rebalancing with Calibrated Sub-classes (RCS), which estimates the distribution parameters of the minority classes using weighted parameters derived from a mixture of Gaussian components from both the majority and intermediate classes. An encoder-decoder network is trained to preserve the structure of the imbalanced data and prevent disentanglement. After training, feature vectors extracted from the encoder are used to generate synthetic samples through our distribution calibration strategy. This approach effectively mitigates the overgeneralization problem that arises when only the distribution of the majority class is used to approximate the minority class statistics. Instead, our method calibrates the parameters by leveraging the distribution of data points in neighboring regions. Experimental results demonstrate that the proposed method achieves superior classification performance compared to several baseline and state-of-the-art techniques across a diverse range of image, text, and tabular datasets.
[LG-13] What is the objective of reasoning with reinforcement learning?
Link: https://arxiv.org/abs/2510.13651
Authors: Damek Davis,Benjamin Recht
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
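A small numerical sketch may help make the two transforms concrete. By the chain rule, ascending f(p) instead of p rescales the gradient of the success probability p by f'(p); the derivatives below are plain calculus facts, not taken from the paper. The function names are hypothetical and chosen only for this illustration.

```python
import math

def log_transform(p):
    # Transform the abstract associates with rejection sampling algorithms.
    return math.log(p)

def arcsine_sqrt_transform(p):
    # Transform the abstract associates with GRPO.
    return math.asin(math.sqrt(p))

def log_weight(p):
    # d/dp log(p) = 1 / p
    return 1.0 / p

def arcsine_sqrt_weight(p):
    # d/dp arcsin(sqrt(p)) = 1 / (2 * sqrt(p * (1 - p)))
    return 1.0 / (2.0 * math.sqrt(p * (1.0 - p)))

def numerical_derivative(f, p, h=1e-6):
    # Central finite difference, used here only to sanity-check the closed forms.
    return (f(p + h) - f(p - h)) / (2.0 * h)

if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, log_weight(p), arcsine_sqrt_weight(p))
```

Note that sqrt(p(1-p)) is the standard deviation of a binary reward with success probability p, so the arcsine weight is large both for prompts that are almost always solved and for prompts that are almost never solved, while the logarithm weight is large only for the latter.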
[LG-14] Multivariate Time Series Forecasting with Gate-Based Quantum Reservoir Computing on NISQ Hardware
Link: https://arxiv.org/abs/2510.13634
Authors: Wissal Hamhoum,Soumaya Cherkaoui,Jean-Frederic Laprade,Ola Ahmed,Shengrui Wang
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments:
Abstract:Quantum reservoir computing (QRC) offers a hardware-friendly approach to temporal learning, yet most studies target univariate signals and overlook near-term hardware constraints. This work introduces a gate-based QRC for multivariate time series (MTS-QRC) that pairs injection and memory qubits and uses a Trotterized nearest-neighbor transverse-field Ising evolution optimized for current device connectivity and depth. On Lorenz-63 and ENSO, the method achieves a mean square error (MSE) of 0.0087 and 0.0036, respectively, performing on par with classical reservoir computing on Lorenz and above learned RNNs on both, while NVAR and clustered ESN remain stronger on some settings. On IBM Heron R2, MTS-QRC sustains accuracy with realistic depths and, interestingly, outperforms a noiseless simulator on ENSO; singular value analysis indicates that device noise can concentrate variance in feature directions, acting as an implicit regularizer for linear readout in this regime. These findings support the practicality of gate-based QRC for MTS forecasting on NISQ hardware and motivate systematic studies on when and how hardware noise benefits QRC readouts.
[LG-15] Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings
Link: https://arxiv.org/abs/2510.13622
Authors: Riddhish Thakare,Kingdom Mutala Akugri
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Classical nonlinear dimensionality reduction (NLDR) techniques like t-SNE, Isomap, and LLE excel at creating low-dimensional embeddings for data visualization but fundamentally lack the ability to map these embeddings back to the original high-dimensional space. This one-way transformation limits their use in generative applications. This paper addresses this critical gap by introducing a systematic framework for constructing neural decoder architectures for prominent NLDR methods, enabling bidirectional mapping for the first time. We extend this framework by implementing a diffusion-based generative process that operates directly within these learned manifold spaces. Through experiments on the CelebA dataset, we evaluate the reconstruction and generative performance of our approach against autoencoder and standard diffusion model baselines. Our findings reveal a fundamental trade-off: while the decoders successfully reconstruct data, their quality is surpassed by end-to-end optimized autoencoders. Moreover, manifold-constrained diffusion yields poor-quality samples, suggesting that the discrete and sparse nature of classical NLDR embeddings is ill-suited for the continuous interpolation required by generative models. This work highlights the inherent challenges in retrofitting generative capabilities onto NLDR methods designed primarily for visualization and analysis.
[LG-16] Towards Robust Knowledge Removal in Federated Learning with High Data Heterogeneity
Link: https://arxiv.org/abs/2510.13606
Authors: Riccardo Santi,Riccardo Salami,Simone Calderara
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Nowadays, there is an abundance of portable devices that can collect large amounts of data and offer decent computational power. This has opened the possibility of training AI models in a distributed manner while preserving the participating clients' privacy. However, because of privacy regulations and safety requirements, it has become mandatory to be able to remove a client's contribution from the model when necessary. The cleansing process must satisfy specific efficacy and time requirements. In recent years, research efforts have produced several knowledge removal methods, but these require multiple communication rounds between the data holders and the process coordinator. This can leave the system without an effective model until the end of the removal process, which can result in a disservice to the system users. In this paper, we introduce an innovative solution based on Task Arithmetic and the Neural Tangent Kernel to rapidly remove a client's influence from a model.
[LG-17] Physics-augmented Multi-task Gaussian Process for Modeling Spatiotemporal Dynamics
Link: https://arxiv.org/abs/2510.13601
Authors: Xizhuo Zhang,Bing Yao
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 5 figures
Abstract:Recent advances in sensing and imaging technologies have enabled the collection of high-dimensional spatiotemporal data across complex geometric domains. However, effective modeling of such data remains challenging due to irregular spatial structures, rapid temporal dynamics, and the need to jointly predict multiple interrelated physical variables. This paper presents a physics-augmented multi-task Gaussian Process (P-M-GP) framework tailored for spatiotemporal dynamic systems. Specifically, we develop a geometry-aware, multi-task Gaussian Process (M-GP) model to effectively capture intrinsic spatiotemporal structure and inter-task dependencies. To further enhance the model fidelity and robustness, we incorporate governing physical laws through a physics-based regularization scheme, thereby constraining predictions to be consistent with governing dynamical principles. We validate the proposed P-M-GP framework on a 3D cardiac electrodynamics modeling task. Numerical experiments demonstrate that our method significantly improves prediction accuracy over existing methods by effectively incorporating domain-specific physical constraints and geometric prior.
[LG-18] EEGChaT: A Transformer-Based Modular Channel Selector for SEEG Analysis
Link: https://arxiv.org/abs/2510.13592
Authors: Chen Wang,Yansen Wang,Dongqi Han,Zilong Wang,Dongsheng Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Analyzing stereoelectroencephalography (SEEG) signals is critical for brain-computer interface (BCI) applications and neuroscience research, yet poses significant challenges due to the large number of input channels and their heterogeneous relevance. Traditional channel selection methods struggle to scale or provide meaningful interpretability for SEEG data. In this work, we propose EEGChaT, a novel Transformer-based channel selection module designed to automatically identify the most task-relevant channels in SEEG recordings. EEGChaT introduces Channel Aggregation Tokens (CATs) to aggregate information across channels, and leverages an improved Attention Rollout technique to compute interpretable, quantitative channel importance scores. We evaluate EEGChaT on the DuIN dataset, demonstrating that integrating EEGChaT with existing classification models consistently improves decoding accuracy, achieving up to 17% absolute gains. Furthermore, the channel weights produced by EEGChaT show substantial overlap with manually selected channels, supporting the interpretability of the approach. Our results suggest that EEGChaT is an effective and generalizable solution for channel selection in high-dimensional SEEG analysis, offering both enhanced performance and insights into neural signal relevance.
[LG-19] ArtNet: Hierarchical Clustering-Based Artificial Netlist Generator for ML and DTCO Application
Link: https://arxiv.org/abs/2510.13582
Authors: Andrew B. Kahng,Seokhyeong Kang,Seonghyeon Park,Dooseok Yoon
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments:
Abstract:In advanced nodes, optimization of power, performance and area (PPA) has become highly complex and challenging. Machine learning (ML) and design-technology co-optimization (DTCO) provide promising mitigations, but face limitations due to a lack of diverse training data as well as long design flow turnaround times (TAT). We propose ArtNet, a novel artificial netlist generator designed to tackle these issues. Unlike previous methods, ArtNet replicates key topological characteristics, enhancing ML model generalization and supporting broader design space exploration for DTCO. By producing realistic artificial datasets that more closely match given target parameters, ArtNet enables more efficient PPA optimization and exploration of flows and design enablements. In the context of CNN-based DRV prediction, ArtNet's data augmentation improves F1 score by 0.16 compared to using only the original (real) dataset. In the DTCO context, ArtNet-generated mini-brains achieve a PPA match up to 97.94%, demonstrating close alignment with design metrics of targeted full-scale block designs.
[LG-20] Selective Adversarial Attacks on LLM Benchmarks
Link: https://arxiv.org/abs/2510.13570
Authors: Ivan Dubrovsky,Anastasia Orlova,Illarion Iov,Nina Gubina,Irena Gureeva,Alexey Zaytsev
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Benchmarking outcomes increasingly govern trust, selection, and deployment of LLMs, yet these evaluations remain vulnerable to semantically equivalent adversarial perturbations. Prior work on adversarial robustness in NLP has emphasized text attacks that affect many models equally, leaving open the question of whether it is possible to selectively degrade or enhance the performance of one model while minimally affecting other models. We formalize this problem and study selective adversarial attacks on MMLU, a widely used benchmark designed to measure a language model's broad general knowledge and reasoning ability across different subjects. Using canonical attacks integrated into the TextAttack framework, we introduce a protocol for selectivity assessment, develop a custom constraint to increase the selectivity of attacks, and propose a surrogate-LLM pipeline that generates selective perturbations. Empirically, we find that selective adversarial attacks exist and can materially alter relative rankings, challenging the fairness, reproducibility, and transparency of leaderboard-driven evaluation. Our results motivate perturbation-aware reporting and robustness diagnostics for LLM evaluation and demonstrate that even subtle edits can shift comparative judgments.
[LG-21] DOLFIN: Balancing Stability and Plasticity in Federated Continual Learning
Link: https://arxiv.org/abs/2510.13567
Authors: Omayma Moussadek,Riccardo Salami,Simone Calderara
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Federated continual learning (FCL) enables models to learn new tasks across multiple distributed clients while protecting privacy and without forgetting previously acquired knowledge. However, current methods face challenges balancing performance, privacy preservation, and communication efficiency. We introduce DOLFIN (Distributed Online LoRA for Federated INcremental learning), a novel approach combining Vision Transformers with low-rank adapters, designed to efficiently and stably learn new tasks in federated environments. Our method leverages LoRA for minimal communication overhead and incorporates Dual Gradient Projection Memory (DualGPM) to prevent forgetting. Evaluated on CIFAR-100, ImageNet-R, ImageNet-A, and CUB-200 under two Dirichlet heterogeneity settings, DOLFIN consistently surpasses six strong baselines in final average accuracy while matching their memory footprint. Orthogonal low-rank adapters offer an effective and scalable solution for privacy-preserving continual learning in federated settings.
[LG-22] Multi-Objective min-max Online Convex Optimization
Link: https://arxiv.org/abs/2510.13560
Authors: Rahul Vaze,Sumiran Mishra
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In online convex optimization (OCO), a single loss function sequence is revealed over a time horizon of $T$, and an online algorithm has to choose its action at time $t$ before the loss function at time $t$ is revealed. The goal of the online algorithm is to incur minimal penalty (called regret) compared to a static optimal action made by an optimal offline algorithm that knows all functions of the sequence in advance. In this paper, we broaden the horizon of OCO and consider multi-objective OCO, where there are $K$ distinct loss function sequences, and an algorithm has to choose its action at time $t$ before the $K$ loss functions at time $t$ are revealed. To capture the tradeoff between tracking the $K$ different sequences, we consider the min-max regret, where the benchmark (optimal offline algorithm) takes a static action across all time slots that minimizes the maximum of the total loss (summed across time slots) incurred by each of the $K$ sequences. An online algorithm is allowed to change its action across time slots, and its min-max regret is defined as the difference between its min-max cost and that of the benchmark. The min-max regret is a stringent performance measure, and an algorithm with small regret needs to 'track' all loss function sequences closely at all times. We consider this min-max regret in the i.i.d. input setting, where all loss functions are generated i.i.d. from an unknown distribution. For the i.i.d. model, we propose a simple algorithm that combines the well-known Hedge and online gradient descent (OGD) algorithms and show via a remarkably simple proof that its expected min-max regret is $O(\sqrt{T \log K})$.
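The min-max regret defined in this abstract can be illustrated with a minimal sketch. The code only evaluates the definition (online min-max cost minus the best static min-max cost); it does not implement the paper's Hedge plus OGD algorithm. The helper names, the toy quadratic losses, and the use of a finite candidate grid to approximate the static benchmark are all assumptions made for illustration.

```python
import numpy as np

def total_losses(loss_fn, actions):
    """Sum the K per-sequence losses over time; returns an array of shape (K,)."""
    return np.sum([loss_fn(t, x) for t, x in enumerate(actions)], axis=0)

def min_max_regret(loss_fn, actions, candidates):
    """Online min-max cost minus the best static min-max cost over `candidates`."""
    horizon = len(actions)
    online_cost = total_losses(loss_fn, actions).max()
    # Assumption: the static benchmark is searched over a finite grid only.
    benchmark_cost = min(
        total_losses(loss_fn, [x] * horizon).max() for x in candidates
    )
    return online_cost - benchmark_cost

def toy_losses(t, x):
    # Hypothetical example with K = 2 time-invariant quadratic losses.
    return np.array([(x - 0.0) ** 2, (x - 1.0) ** 2])

if __name__ == "__main__":
    # An online player that alternates between the two individual minimizers.
    alternating = [0.0, 1.0, 0.0, 1.0]
    grid = np.linspace(0.0, 1.0, 11)
    print(min_max_regret(toy_losses, alternating, grid))
```

In this toy case the alternating player serves each sequence perfectly half of the time, yet the compromise action 0.5 has a lower maximum total loss, which is why the measure forces an algorithm to track all sequences at once.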
[LG-23] ProtoTopic: Prototypical Network for Few-Shot Medical Topic Modeling
Link: https://arxiv.org/abs/2510.13542
Authors: Martin Licht,Sara Ketabi,Farzad Khalvati
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Topic modeling is a useful tool for analyzing large corpora of written documents, particularly academic papers. Despite a wide variety of proposed topic modeling techniques, these techniques do not perform well when applied to medical texts. This can be due to the low number of documents available for some topics in the healthcare domain. In this paper, we propose ProtoTopic, a prototypical network-based topic model used for topic generation for a set of medical paper abstracts. Prototypical networks are efficient, explainable models that make predictions by computing distances between input datapoints and a set of prototype representations, making them particularly effective in low-data or few-shot learning scenarios. With ProtoTopic, we demonstrate improved topic coherence and diversity compared to two topic modeling baselines used in the literature, demonstrating the ability of our model to generate medically relevant topics even with limited data.
[LG-24] Tahakom LLM guidelines and receipts: from pre-training data to an Arabic LLM
Link: https://arxiv.org/abs/2510.13481
Authors: Areej AlOtaibi,Lina Alyahya,Raghad Alshabanah,Shahad Alfawzan,Shuruq Alarefei,Reem Alsabti,Nouf Alsubaie,Abdulaziz Alhuzaymi,Lujain Alkhelb,Majd Alsayari,Waad Alahmed,Omar Talabay,Jalal Alowibdi,Salem Alelyani,Adel Bibi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.
[LG-25] Towards Blackwell Optimality: Bellman Optimality Is All You Can Get
Link: https://arxiv.org/abs/2510.13476
Authors: Victor Boone,Adrienne Tuynman
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Although average gain optimality is a commonly adopted performance measure in Markov Decision Processes (MDPs), it is often too asymptotic. Further incorporating measures of immediate losses leads to the hierarchy of bias optimalities, all the way up to Blackwell optimality. In this paper, we investigate the problem of identifying policies of such optimality orders. To that end, for each order, we construct a learning algorithm with vanishing probability of error. Furthermore, we characterize the class of MDPs for which identification algorithms can stop in finite time. That class corresponds to the MDPs with a unique Bellman optimal policy, and does not depend on the optimality order considered. Lastly, we provide a tractable stopping rule that when coupled to our learning algorithm triggers in finite time whenever it is possible to do so.
[LG-26] $L_2$-Regularized Empirical Risk Minimization Guarantees Small Smooth Calibration Error
Link: https://arxiv.org/abs/2510.13450
Authors: Masahiro Fujisawa,Futoshi Futami
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 26 pages, 8 figures
Abstract:Calibration of predicted probabilities is critical for reliable machine learning, yet it is poorly understood how standard training procedures yield well-calibrated models. This work provides the first theoretical proof that canonical $L_2$-regularized empirical risk minimization directly controls the smooth calibration error (smCE) without post-hoc correction or a specialized calibration-promoting regularizer. We establish finite-sample generalization bounds for smCE based on optimization error, regularization strength, and the Rademacher complexity. We then instantiate this theory for models in reproducing kernel Hilbert spaces, deriving concrete guarantees for kernel ridge and logistic regression. Our experiments confirm these specific guarantees, demonstrating that $L_2$-regularized ERM can provide a well-calibrated model without boosting or post-hoc recalibration. The source code to reproduce all experiments is available at this https URL.
[LG-27] Hybrid Interval Type-2 Mamdani-TSK Fuzzy System for Regression Analysis
Link: https://arxiv.org/abs/2510.13437
Authors: Ashish Bhatia,Renato Cordeiro de Amorim,Vito De Feo
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Regression analysis is employed to examine and quantify the relationships between input variables and a dependent and continuous output variable. It is widely used for predictive modelling in fields such as finance, healthcare, and engineering. However, traditional methods often struggle with real-world data complexities, including uncertainty and ambiguity. While deep learning approaches excel at capturing complex non-linear relationships, they lack interpretability and risk over-fitting on small datasets. Fuzzy systems provide an alternative framework for handling uncertainty and imprecision, with Mamdani and Takagi-Sugeno-Kang (TSK) systems offering complementary strengths: interpretability versus accuracy. This paper presents a novel fuzzy regression method that combines the interpretability of Mamdani systems with the precision of TSK models. The proposed approach introduces a hybrid rule structure with fuzzy and crisp components and dual dominance types, enhancing both accuracy and explainability. Evaluations on benchmark datasets demonstrate state-of-the-art performance in several cases, with rules maintaining a component similar to traditional Mamdani systems while improving precision through improved rule outputs. This hybrid methodology offers a balanced and versatile tool for predictive modelling, addressing the trade-off between interpretability and accuracy inherent in fuzzy systems. In the 6 datasets tested, the proposed approach gave the best fuzzy methodology score in 4 datasets, out-performed the opaque models in 2 datasets and produced the best overall score in 1 dataset with the improvements in RMSE ranging from 0.4% to 19%.
[LG-28] Modeling Adoptive Cell Therapy in Bladder Cancer from Sparse Biological Data using PINNs
Link: https://arxiv.org/abs/2510.13431
Authors: Kayode Olumoyin,Katarzyna Rejniak
Subjects: Machine Learning (cs.LG); Cell Behavior (q-bio.CB); Populations and Evolution (q-bio.PE)
Comments:
Abstract:Physics-informed neural networks (PINNs) are neural networks that embed the laws of dynamical systems modeled by differential equations into their loss function as constraints. In this work, we present a PINN framework applied to oncology. Here, we seek to learn time-varying interactions due to a combination therapy in a tumor microenvironment. In oncology, experimental data are often sparse and composed of a few time points of tumor volume. By embedding inductive biases derived from prior information about a dynamical system, we extend the physics-informed neural networks (PINN) and incorporate observed biological constraints as regularization agents. The modified PINN algorithm is able to steer itself to a reasonable solution and can generalize well with only a few training examples. We demonstrate the merit of our approach by learning the dynamics of treatment applied intermittently in an ordinary differential equation (ODE) model of a combination therapy. The algorithm yields a solution to the ODE and time-varying forms of some of the ODE model parameters. We demonstrate a strong convergence using metrics such as the mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
[LG-29] When Embedding Models Meet: Procrustes Bounds and Applications
Link: https://arxiv.org/abs/2510.13406
Authors: Lucas Maystre,Alvaro Ortega Gonzalez,Charles Park,Rares Dolga,Tudor Berariu,Yu Zhao,Kamil Ciosek
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Embedding models trained separately on similar data often produce representations that encode stable information but are not directly interchangeable. This lack of interoperability raises challenges in several practical applications, such as model retraining, partial model upgrades, and multimodal search. Driven by these challenges, we study when two sets of embeddings can be aligned by an orthogonal transformation. We show that if pairwise dot products are approximately preserved, then there exists an isometry that closely aligns the two sets, and we provide a tight bound on the alignment error. This insight yields a simple alignment recipe, Procrustes post-processing, that makes two embedding models interoperable while preserving the geometry of each embedding space. Empirically, we demonstrate its effectiveness in three applications: maintaining compatibility across retrainings, combining different models for text retrieval, and improving mixed-modality search, where it achieves state-of-the-art performance.
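The alignment step this abstract describes matches the classical orthogonal Procrustes problem, which has a closed-form solution via the SVD. The sketch below shows that textbook solution under the assumption that the two embedding sets are stored as row-aligned matrices; the function name and the synthetic data are hypothetical, and the paper's exact post-processing recipe may differ.

```python
import numpy as np

def procrustes_align(source, target):
    """Return the orthogonal matrix Q minimizing ||source @ Q - target||_F.

    Both inputs have shape (n_items, dim); row i of each matrix is assumed
    to embed the same item.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    old_embeddings = rng.normal(size=(100, 8))
    # Simulate a second model whose space is an exact rotation of the first.
    hidden_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
    new_embeddings = old_embeddings @ hidden_rotation

    q = procrustes_align(old_embeddings, new_embeddings)
    aligned = old_embeddings @ q
    print(np.allclose(aligned, new_embeddings))
    # An orthogonal map leaves pairwise dot products unchanged.
    print(np.allclose(aligned @ aligned.T, old_embeddings @ old_embeddings.T))
```

Because Q is orthogonal, applying it changes no distances or dot products within the source space, which is the sense in which the geometry of each embedding space is preserved.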
[LG-30] Optimizing Storage Overhead of User Behavior Log for ML-embedded Mobile Apps
Link: https://arxiv.org/abs/2510.13405
Authors: Chen Gong,Yan Zhuang,Zhenzhe Zheng,Yiliu Chen,Sheng Wang,Fan Wu,Guihai Chen
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Machine learning (ML) models are increasingly integrated into modern mobile apps to enable personalized and intelligent services. These models typically rely on rich input features derived from historical user behaviors to capture user intents. However, as ML-driven services become more prevalent, recording necessary user behavior data imposes substantial storage cost on mobile apps, leading to lower system responsiveness and more app uninstalls. To address this storage bottleneck, we present AdaLog, a lightweight and adaptive system designed to improve the storage efficiency of user behavior log in ML-embedded mobile apps, without compromising model inference accuracy or latency. We identify two key inefficiencies in current industrial practices of user behavior log: (i) redundant logging of overlapping behavior data across different features and models, and (ii) sparse storage caused by storing behaviors with heterogeneous attribute descriptions in a single log file. To solve these issues, AdaLog first formulates the elimination of feature-level redundant data as a maximum weighted matching problem in hypergraphs, and proposes a hierarchical algorithm for efficient on-device deployment. Then, AdaLog employs a virtually hashed attribute design to distribute heterogeneous behaviors into a few log files with physically dense storage. Finally, to ensure scalability to dynamic user behavior patterns, AdaLog designs an incremental update mechanism to minimize the I/O operations needed for adapting outdated behavior log. We implement a prototype of AdaLog and deploy it into popular mobile apps in collaboration with our industry partner. Evaluations on real-world user data show that AdaLog reduces behavior log size by 19% to 44% with minimal system overhead (only 2 seconds latency and 15 MB memory usage), providing a more efficient data foundation for broader adoption of on-device ML.
[LG-31] SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB
Link: https://arxiv.org/abs/2510.13404
Authors: Muhammad Ishfaq Hussain,Ma Van Linh,Zubia Naz,Unse Fatima,Yeongmin Ko,Moongu Jeon
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Enhancing scene understanding in adverse visibility conditions remains a critical challenge for surveillance and autonomous navigation systems. Conventional imaging modalities, such as RGB and thermal infrared (MWIR / LWIR), when fused, often struggle to deliver comprehensive scene information, particularly under conditions of atmospheric interference or inadequate illumination. To address these limitations, Short-Wave Infrared (SWIR) imaging has emerged as a promising modality due to its ability to penetrate atmospheric disturbances and differentiate materials with improved clarity. However, the advancement and widespread implementation of SWIR-based systems face significant hurdles, primarily due to the scarcity of publicly accessible SWIR datasets. In response to this challenge, our research introduces an approach to synthetically generate images with SWIR-like structural/contrast cues (without claiming spectral reproduction) from existing LWIR data using advanced contrast enhancement techniques. We then propose a multimodal fusion framework integrating synthetic SWIR, LWIR, and RGB modalities, employing an optimized encoder-decoder neural network architecture with modality-specific encoders and a softmax-gated fusion head. Comprehensive experiments on public RGB-LWIR benchmarks (M3FD, TNO, CAMEL, MSRS, RoadScene) and an additional private real RGB-MWIR-SWIR dataset demonstrate that our synthetic-SWIR-enhanced fusion framework improves fused-image quality (contrast, edge definition, structural fidelity) while maintaining real-time performance. We also add fair trimodal baselines (LP, LatLRR, GFF) and cascaded trimodal variants of U2Fusion/SwinFusion under a unified protocol. The outcomes highlight substantial potential for real-world applications in surveillance and autonomous systems.
[LG-32] F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs ISCA2025
Link: https://arxiv.org/abs/2510.13401
Authors: Jude Haris,José Cano
Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted to Workshop on New Approaches for Addressing the Computing Requirements of LLMs and GNNs (LG-ARC) @ ISCA 2025
Abstract:Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-to-text translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as this http URL, which support optimizations such as KV-caching and quantization, it is now easier than ever to deploy LLMs on edge devices. Quantization is fundamental to enable LLMs on resource-constrained edge devices, and this http URL utilizes block floating point (BFP) quantization to drastically reduce the bit width of weights and input tensors, the memory footprint, and the computational power required to run LLMs. LLMs are typically quantized with mixed BFP quantization across the model layers to reduce the loss of model accuracy due to quantization. Therefore, to efficiently accelerate across the layers of BFP-quantized LLMs, specialized accelerators need to support different BFP variants without reconfiguration. To address this issue, we propose a Flexible Block Floating-Point Quantization (F-BFQ) accelerator, which can dynamically switch between two BFP quantization variants and perform matrix multiplication (MatMul) operations. Our initial F-BFQ accelerator design, deployed on the AMD Kria board, reduces inference time by 1.4x on average over the Arm NEON-based CPU execution across three BFP quantized LLMs while achieving 5.2 tokens per second (~3.9 words per second).
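For readers unfamiliar with block floating point, the general idea can be sketched as follows: values are grouped into blocks, each block shares one exponent, and each value keeps only a small integer mantissa. This is a generic illustration and not the format used by the accelerator or by any specific inference framework; the block size, mantissa width, power-of-two scale, and function names are all assumptions.

```python
import numpy as np

def bfp_quantize(values, block_size=32, mantissa_bits=8):
    """Quantize a 1-D array into per-block integer mantissas and shared exponents."""
    values = np.asarray(values, dtype=np.float64)
    if values.size % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    blocks = values.reshape(-1, block_size)
    qmax = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa magnitude
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    safe_max = np.where(max_abs > 0, max_abs, 1.0)  # avoid log2(0) for all-zero blocks
    # Smallest power-of-two scale that keeps every mantissa within [-qmax, qmax].
    exponents = np.ceil(np.log2(safe_max / qmax)).astype(np.int32)
    scales = np.exp2(exponents.astype(np.float64))
    mantissas = np.clip(np.rint(blocks / scales), -qmax, qmax).astype(np.int32)
    return mantissas, exponents

def bfp_dequantize(mantissas, exponents):
    """Reconstruct approximate values, shape (n_blocks, block_size)."""
    return mantissas * np.exp2(exponents.astype(np.float64))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=64)
    m, e = bfp_quantize(weights, block_size=32, mantissa_bits=8)
    error = np.abs(bfp_dequantize(m, e).ravel() - weights).max()
    print(m.shape, e.ravel(), error)
```

With this scheme the rounding error of each value is at most half of its block's scale, so a block containing one large outlier loses precision on all of its small values; this is one reason formats differ in block size and bit width across layers.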
[LG-33] Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring
Link: https://arxiv.org/abs/2510.13397
Authors: Yuxin Wang,Dennis Frauen,Jonas Schweisthal,Maresa Schröder,Stefan Feuerriegel
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, because of which treatment effect estimates are also biased. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimation, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further develop a novel meta-learner that estimates the bounds using arbitrary machine learning models and with favorable theoretical properties, including double robustness and quasi-oracle efficiency. We demonstrate the practical value of our meta-learner through numerical experiments and in an application to a cancer drug trial. Together, our framework offers a practical tool for assessing the robustness of estimated treatment effects in the presence of censoring and thus promotes the reliable use of survival data for evidence generation in medicine and epidemiology.
[LG-34] Going with the Flow: Approximating Banzhaf Values via Graph Neural Networks
Link: https://arxiv.org/abs/2510.13391
Authors: Benjamin Kempinski,Tal Kachman
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: 21 pages, 8 figures, 11-page appendix
Abstract:Computing the Banzhaf value in network flow games is fundamental for quantifying agent influence in multi-agent systems, with applications ranging from cybersecurity to infrastructure planning. However, exact computation is intractable for systems with more than ~20 agents due to exponential complexity O(2^m). While Monte Carlo sampling methods provide statistical estimates, they suffer from high sample complexity and cannot transfer knowledge across different network configurations, making them impractical for large-scale or dynamic systems. We present a novel learning-based approach using Graph Neural Networks (GNNs) to approximate Banzhaf values in cardinal network flow games. By framing the problem as a graph-level prediction task, our method learns generalisable patterns of agent influence directly from network topology and control structure. We conduct a comprehensive empirical study comparing three state-of-the-art GNN architectures, namely Graph Attention Networks (GAT), Graph Isomorphism Networks with Edge features (GINE), and EdgeConv, on a large-scale synthetic dataset of 200,000 graphs per configuration, varying in size (20-100 nodes), agent count (5-20), and edge probability (0.5-1.0). Our results demonstrate that trained GNN models achieve high-fidelity Banzhaf value approximation with order-of-magnitude speedups compared to exact and sampling-based methods. Most significantly, we show strong zero-shot generalisation: models trained on graphs of a specific size and topology accurately predict Banzhaf values for entirely new networks with different structural properties, without requiring retraining. This work establishes GNNs as a practical tool for scalable cooperative game-theoretic analysis of complex networked systems.
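For reference, the two baselines the abstract contrasts with the GNN approach (exact enumeration and Monte Carlo sampling) can be sketched as follows. This is an illustrative sketch on a toy weighted voting game, not the cardinal network flow games studied in the paper; the weights, quota, and function names are assumptions made for the example.

```python
from itertools import combinations
import random

def banzhaf_exact(n, v):
    """Exact Banzhaf values: average marginal contribution over all 2^(n-1) coalitions."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                s = frozenset(coalition)
                total += v(s | {i}) - v(s)
        values.append(total / 2 ** (n - 1))
    return values

def banzhaf_monte_carlo(n, v, samples=2000, seed=0):
    """Monte Carlo estimate: each other player joins the coalition with probability 1/2."""
    rng = random.Random(seed)
    values = []
    for i in range(n):
        total = 0.0
        for _ in range(samples):
            s = frozenset(j for j in range(n) if j != i and rng.random() < 0.5)
            total += v(s | {i}) - v(s)
        values.append(total / samples)
    return values

# Toy game (hypothetical): a coalition wins if its total weight reaches the quota.
WEIGHTS = [3, 2, 1, 1]
QUOTA = 4

def v(coalition):
    return 1.0 if sum(WEIGHTS[j] for j in coalition) >= QUOTA else 0.0

exact = banzhaf_exact(4, v)            # [0.75, 0.25, 0.25, 0.25]
estimate = banzhaf_monte_carlo(4, v)   # close to the exact values, with sampling noise
```

The exact routine evaluates `v` on every coalition, which is why it stops being feasible at around 20 agents; the sampler trades that cost for variance and has to be rerun from scratch for each new game.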
[LG-35] Prediction Markets with Intermittent Contributions
Link: https://arxiv.org/abs/2510.13385
Authors: Michael Vitali,Pierre Pinson
Categories: Machine Learning (cs.LG)
Comments: Submitted to PSCC 2026
Abstract:Although both data availability and the demand for accurate forecasts are increasing, collaboration between stakeholders is often constrained by data ownership and competitive interests. In contrast to recent proposals within cooperative game-theoretical frameworks, we place ourselves in a more general framework, based on prediction markets. There, independent agents trade forecasts of uncertain future events in exchange for rewards. We introduce and analyse a prediction market that (i) accounts for the historical performance of the agents, (ii) adapts to time-varying conditions, and (iii) permits agents to enter and exit the market at will. The proposed design employs robust regression models to learn the optimal combination of forecasts whilst handling missing submissions. Moreover, we introduce a pay-off allocation mechanism that considers both in-sample and out-of-sample performance while satisfying several desirable economic properties. Case studies using simulated and real-world data demonstrate the effectiveness and adaptability of the proposed market design.
[LG-36] Contrastive Learning-Based Dependency Modeling for Anomaly Detection in Cloud Services
Link: https://arxiv.org/abs/2510.13368
Authors: Yue Xing,Yingnan Deng,Heyao Liu,Ming Wang,Yun Zi,Xiaoxuan Sun
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This paper addresses the challenges of complex dependencies and diverse anomaly patterns in cloud service environments by proposing a dependency modeling and anomaly detection method that integrates contrastive learning. The method abstracts service interactions into a dependency graph, extracts temporal and structural features through embedding functions, and employs a graph convolution mechanism to aggregate neighborhood information for context-aware service representations. A contrastive learning framework is then introduced, constructing positive and negative sample pairs to enhance the separability of normal and abnormal patterns in the representation space. Furthermore, a temporal consistency constraint is designed to maintain representation stability across time steps and reduce the impact of short-term fluctuations and noise. The overall optimization combines contrastive loss and temporal consistency loss to ensure stable and reliable detection across multi-dimensional features. Experiments on public datasets systematically evaluate the method from hyperparameter, environmental, and data sensitivity perspectives. Results show that the proposed approach significantly outperforms existing methods on key metrics such as Precision, Recall, F1-Score, and AUC, while maintaining robustness under conditions of sparse labeling, monitoring noise, and traffic fluctuations. This study verifies the effectiveness of integrating dependency modeling with contrastive learning, provides a complete technical solution for cloud service anomaly detection, and demonstrates strong adaptability and stability in complex environments.
[LG-37] Kernel Representation and Similarity Measure for Incomplete Data
Link: https://arxiv.org/abs/2510.13352
Authors: Yang Cao,Sikun Yang,Kai He,Wenjun Ma,Ming Liu,Yujiu Yang,Jian Weng
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents the proximity kernel, a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space. The proposed method introduces data-dependent binning combined with proximity assignment to project data into a high-dimensional sparse representation that adapts to local density variations. For missing value handling, we propose a cascading fallback strategy to estimate missing feature distributions. We conduct clustering tasks on the proposed kernel representation across 12 real-world incomplete datasets, demonstrating superior performance compared to existing methods while maintaining linear time complexity. The code is available at this https URL.
[LG-38] When In Doubt, Abstain: The Impact of Abstention on Strategic Classification
Link: https://arxiv.org/abs/2510.13327
Authors: Lina Alkarmi,Ziyuan Huang,Mingyan Liu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Algorithmic decision making is increasingly prevalent, but often vulnerable to strategic manipulation by agents seeking a favorable outcome. Prior research has shown that classifier abstention (allowing a classifier to decline making a decision due to insufficient confidence) can significantly increase classifier accuracy. This paper studies abstention within a strategic classification context, exploring how its introduction impacts strategic agents’ responses and how principals should optimally leverage it. We model this interaction as a Stackelberg game where a principal, acting as the classifier, first announces its decision policy, and then strategic agents, acting as followers, manipulate their features to receive a desired outcome. Here, we focus on binary classifiers where agents manipulate observable features rather than their true features, and show that optimal abstention ensures that the principal’s utility (or loss) is no worse than in a non-abstention setting, even in the presence of strategic agents. We also show that beyond improving accuracy, abstention can also serve as a deterrent to manipulation, making it costlier for agents, especially those less qualified, to manipulate to achieve a positive outcome when manipulation costs are significant enough to affect agent behavior. These results highlight abstention as a valuable tool for reducing the negative effects of strategic behavior in algorithmic decision making systems.
[LG-39] RockNet: Distributed Learning on Ultra-Low-Power Devices
Link: https://arxiv.org/abs/2510.13320
Authors: Alexander Gräfe,Fabian Mager,Marco Zimmerling,Sebastian Trimpe
Categories: Machine Learning (cs.LG)
Comments:
Abstract:As Machine Learning (ML) becomes integral to Cyber-Physical Systems (CPS), there is growing interest in shifting training from traditional cloud-based to on-device processing (TinyML), for example, due to privacy and latency concerns. However, CPS often comprise ultra-low-power microcontrollers, whose limited compute resources make training challenging. This paper presents RockNet, a new TinyML method tailored for ultra-low-power hardware that achieves state-of-the-art accuracy in timeseries classification, such as fault or malware detection, without requiring offline pretraining. Exploiting the fact that CPS consist of multiple devices, we design a distributed learning method that integrates ML and wireless communication. RockNet leverages all devices for distributed training of specialized compute-efficient classifiers that need minimal communication overhead for parallelization. Combined with tailored and efficient wireless multi-hop communication protocols, our approach overcomes the communication bottleneck that often occurs in distributed learning. Hardware experiments on a testbed with 20 ultra-low-power devices demonstrate RockNet’s effectiveness. It successfully learns timeseries classification tasks from scratch, surpassing the accuracy of the latest approach for neural network microcontroller training by up to 2x. RockNet’s distributed ML architecture reduces memory, latency and energy consumption per device by up to 90% when scaling from one central device to 20 devices. Our results show that a tight integration of distributed ML, distributed computing, and communication enables, for the first time, training on ultra-low-power hardware with state-of-the-art accuracy.
[LG-40] Isolation-based Spherical Ensemble Representations for Anomaly Detection
Link: https://arxiv.org/abs/2510.13311
Authors: Yang Cao,Sikun Yang,Hao Tian,Kai He,Lianyong Qi,Ming Liu,Yujiu Yang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Anomaly detection is a critical task in data mining and management with applications spanning fraud detection, network security, and log monitoring. Despite extensive research, existing unsupervised anomaly detection methods still face fundamental challenges including conflicting distributional assumptions, computational inefficiency, and difficulty handling different anomaly types. To address these problems, we propose ISER (Isolation-based Spherical Ensemble Representations) that extends existing isolation-based methods by using hypersphere radii as proxies for local density characteristics while maintaining linear time and constant space complexity. ISER constructs ensemble representations where hypersphere radii encode density information: smaller radii indicate dense regions while larger radii correspond to sparse areas. We introduce a novel similarity-based scoring method that measures pattern consistency by comparing ensemble representations against a theoretical anomaly reference pattern. Additionally, we enhance the performance of Isolation Forest by using ISER and adapting the scoring function to address axis-parallel bias and local anomaly detection limitations. Comprehensive experiments on 22 real-world datasets demonstrate ISER’s superior performance over 11 baseline methods.
[LG-41] Km-scale dynamical downscaling through conformalized latent diffusion models
Link: https://arxiv.org/abs/2510.13301
Authors: Alessandro Brusaferri,Andrea Ballarino
Categories: Machine Learning (cs.LG)
Comments: 7 pages
Abstract:Dynamical downscaling is crucial for deriving high-resolution meteorological fields from coarse-scale simulations, enabling detailed analysis for critical applications such as weather forecasting and renewable energy modeling. Generative Diffusion models (DMs) have recently emerged as powerful data-driven tools for this task, offering reconstruction fidelity and more scalable sampling supporting uncertainty quantification. However, DMs lack finite-sample guarantees against overconfident predictions, resulting in miscalibrated grid-point-level uncertainty estimates hindering their reliability in operational contexts. In this work, we tackle this issue by augmenting the downscaling pipeline with a conformal prediction framework. Specifically, the DM’s samples are post-processed to derive conditional quantile estimates, incorporated into a conformalized quantile regression procedure targeting locally adaptive prediction intervals with finite-sample marginal validity. The proposed approach is evaluated on ERA5 reanalysis data over Italy, downscaled to a 2-km grid. Results demonstrate grid-point-level uncertainty estimates with markedly improved coverage and stable probabilistic scores relative to the DM baseline, highlighting the potential of conformalized generative models for more trustworthy probabilistic downscaling to high-resolution meteorological fields.
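The post-processing step described above (quantiles from generative samples, then a conformal correction) can be sketched along the following lines. This is a generic split conformalized quantile regression sketch under an exchangeability assumption, not the authors' pipeline; the function names and the array layout (one row of samples per grid point or case) are assumptions.

```python
import numpy as np

def sample_quantiles(samples, alpha):
    """Lower and upper quantile estimates from generative samples, one row per case."""
    lo = np.quantile(samples, alpha / 2, axis=1)
    hi = np.quantile(samples, 1 - alpha / 2, axis=1)
    return lo, hi

def cqr_offset(samples_cal, y_cal, alpha=0.1):
    """Conformal correction computed on a held-out calibration set."""
    lo, hi = sample_quantiles(samples_cal, alpha)
    # Conformity score: how far the truth falls outside the raw interval (negative if inside).
    scores = np.maximum(lo - y_cal, y_cal - hi)
    n = len(y_cal)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    return np.inf if k > n else np.sort(scores)[k - 1]

def cqr_interval(samples_new, offset, alpha=0.1):
    """Widen (or shrink) the raw sample-quantile interval by the calibrated offset."""
    lo, hi = sample_quantiles(samples_new, alpha)
    return lo - offset, hi + offset
```

When the generative model is overconfident, the raw intervals are too narrow and the offset comes out positive; the resulting intervals keep their input-dependent width but are guaranteed marginal coverage of at least 1 - alpha on exchangeable data.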
[LG-42] Federated Conditional Conformal Prediction via Generative Models
Link: https://arxiv.org/abs/2510.13297
Authors: Rui Xu,Sihong Xie
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Conformal Prediction (CP) provides distribution-free uncertainty quantification by constructing prediction sets that guarantee coverage of the true labels. This reliability makes CP valuable for high-stakes federated learning scenarios such as multi-center healthcare. However, standard CP assumes i.i.d. data, which is violated in federated settings where client distributions differ substantially. Existing federated CP methods address this by maintaining marginal coverage on each client, but such guarantees often fail to reflect input-conditional uncertainty. In this work, we propose Federated Conditional Conformal Prediction (Fed-CCP) via generative models, which aims for conditional coverage that adapts to local data heterogeneity. Fed-CCP leverages generative models, such as normalizing flows or diffusion models, to approximate conditional data distributions without requiring the sharing of raw data. This enables each client to locally calibrate conformal scores that reflect its unique uncertainty, while preserving global consistency through federated aggregation. Experiments on real datasets demonstrate that Fed-CCP achieves more adaptive prediction sets.
[LG-43] BlendFL: Blended Federated Learning for Handling Multimodal Data Heterogeneity
Link: https://arxiv.org/abs/2510.13266
Authors: Alejandro Guerra-Manzanares,Omar El-Herraoui,Michail Maniatakos,Farah E. Shamout
Categories: Machine Learning (cs.LG)
Comments:
Abstract:One of the key challenges of collaborative machine learning, without data sharing, is multimodal data heterogeneity in real-world settings. While Federated Learning (FL) enables model training across multiple clients, existing frameworks, such as horizontal and vertical FL, are only effective in 'ideal' settings that meet specific assumptions. Hence, they struggle to address scenarios where neither all modalities nor all samples are represented across the participating clients. To address this gap, we propose BlendFL, a novel FL framework that seamlessly blends the principles of horizontal and vertical FL in a synchronized and non-restrictive fashion despite the asymmetry across clients. Specifically, any client within BlendFL can benefit from either of the approaches, or both simultaneously, according to its available dataset. In addition, BlendFL features a decentralized inference mechanism, empowering clients to run collaboratively trained local models using available local data, thereby reducing latency and reliance on central servers for inference. We also introduce BlendAvg, an adaptive global model aggregation strategy that prioritizes collaborative model updates based on each client’s performance. We trained and evaluated BlendFL and other state-of-the-art baselines on three classification tasks using a large-scale real-world multimodal medical dataset and a popular multimodal benchmark. Our results highlight BlendFL’s superior performance for both multimodal and unimodal classification. Ablation studies demonstrate BlendFL’s faster convergence compared to traditional approaches, accelerating collaborative learning. Overall, in our study we highlight the potential of BlendFL for handling multimodal data heterogeneity for collaborative learning in real-world settings where data privacy is crucial, such as in healthcare and finance.
[LG-44] Hypernetworks for Perspectivist Adaptation
Link: https://arxiv.org/abs/2510.13259
Authors: Daniil Ignatev,Denis Paperno,Massimo Poesio
Categories: Machine Learning (cs.LG)
Comments: Accepted at NLPerspectives workshop 2025
Abstract:The task of perspective-aware classification introduces a bottleneck in terms of parametric efficiency that did not get enough recognition in existing studies. In this article, we aim to address this issue by applying an existing architecture, the hypernetwork+adapters combination, to perspectivist classification. Ultimately, we arrive at a solution that can compete with specialized models in adopting user perspectives on hate speech and toxicity detection, while also making use of considerably fewer parameters. Our solution is architecture-agnostic and can be applied to a wide range of base models out of the box.
[LG-45] Rethinking Graph Domain Adaptation: A Spectral Contrastive Perspective ECML-PKDD2025
Link: https://arxiv.org/abs/2510.13254
Authors: Haoyu Zhang,Yuxuan Cheng,Wenqi Fan,Yulong Chen,Yifan Zhang
Categories: Machine Learning (cs.LG)
Comments: This paper is accepted by ECML-PKDD 2025
Abstract:Graph neural networks (GNNs) have achieved remarkable success in various domains, yet they often struggle with domain adaptation due to significant structural distribution shifts and insufficient exploration of transferable patterns. One of the main reasons is that traditional approaches do not distinguish between global and local patterns, so some local details in the graph may be distorted after multiple GNN layers. Our key insight is that domain shifts can be better understood through spectral analysis, where low-frequency components often encode domain-invariant global patterns, and high-frequency components capture domain-specific local details. As such, we propose FracNet (Frequency Aware Contrastive Graph Network) with two synergistic modules to decompose the original graph into high-frequency and low-frequency components and perform frequency-aware domain adaptation. Moreover, the blurring boundary problem of domain adaptation is mitigated by integrating a contrastive learning framework. Besides the practical implications, we also provide rigorous theoretical proof to demonstrate the superiority of FracNet. Extensive experiments further demonstrate significant improvements over state-of-the-art approaches.
[LG-46] Automated Network Protocol Testing with LLM Agents
Link: https://arxiv.org/abs/2510.13248
Authors: Yunze Wei,Kaiwen Wei,Shibo Du,Jianyu Wang,Zhangzhong Liu,Yawen Wang,Zhanyou Li,Congcong Miao,Xiaohui Xie,Yong Cui
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:
Abstract:Network protocol testing is fundamental for modern network infrastructure. However, traditional network protocol testing methods are labor-intensive and error-prone, requiring manual interpretation of specifications, test case design, and translation into executable artifacts, typically demanding one person-day of effort per test case. Existing model-based approaches provide partial automation but still involve substantial manual modeling and expert intervention, leading to high costs and limited adaptability to diverse and evolving protocols. In this paper, we propose a first-of-its-kind system called NeTestLLM that takes advantage of multi-agent Large Language Models (LLMs) for end-to-end automated network protocol testing. NeTestLLM employs hierarchical protocol understanding to capture complex specifications, iterative test case generation to improve coverage, a task-specific workflow for executable artifact generation, and runtime feedback analysis for debugging and refinement. NeTestLLM has been deployed in a production environment for several months, receiving positive feedback from domain experts. In experiments, NeTestLLM generated 4,632 test cases for OSPF, RIP, and BGP, covering 41 historical FRRouting bugs compared to 11 by current national standards. The process of generating executable artifacts also improves testing efficiency by a factor of 8.65x compared to manual methods. NeTestLLM provides the first practical LLM-powered solution for automated end-to-end testing of heterogeneous network protocols.
[LG-47] Altruistic Ride Sharing: A Community-Driven Approach to Short-Distance Mobility
Link: https://arxiv.org/abs/2510.13227
Authors: Divyanshu Singh,Ashman Mehra,Snehanshu Saha,Santonu Sarkar
Categories: Multiagent Systems (cs.MA); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Intelligent Transportation Systems
Abstract:Urban mobility faces persistent challenges of congestion and fuel consumption, specifically when people choose a private, point-to-point commute option. Profit-driven ride-sharing platforms prioritize revenue over fairness and sustainability. This paper introduces Altruistic Ride-Sharing (ARS), a decentralized, peer-to-peer mobility framework where participants alternate between driver and rider roles based on altruism points rather than monetary incentives. The system integrates multi-agent reinforcement learning (MADDPG) for dynamic ride-matching, game-theoretic equilibrium guarantees for fairness, and a population model to sustain long-term balance. Using real-world New York City taxi data, we demonstrate that ARS reduces travel distance and emissions, increases vehicle utilization, and promotes equitable participation compared to both no-sharing and optimization-based baselines. These results establish ARS as a scalable, community-driven alternative to conventional ride-sharing, aligning individual behavior with collective urban sustainability goals.
[LG-48] LLM-guided Hierarchical Retrieval
Link: https://arxiv.org/abs/2510.13217
Authors: Nilesh Gupta,Wei-Cheng Chang,Ngot Bui,Cho-Jui Hsieh,Inderjit S. Dhillon
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Modern IR systems are increasingly tasked with answering complex, multi-faceted queries that require deep reasoning rather than simple keyword or semantic matching. While LLM-based IR has shown great promise, the prevailing retrieve-then-rerank paradigm inherits the limitations of embedding-based retrieval; parametric generative approaches are difficult to update with new information; and long-context methods that place the entire corpus in context are computationally infeasible for large document collections. To address these challenges, we introduce LATTICE, a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity by imposing a semantic tree structure on the corpus. Our approach consists of two stages: (1) an offline phase that organizes the corpus into a semantic hierarchy via either a bottom-up agglomerative strategy or a top-down divisive strategy using multi-level summaries and (2) an online traversal phase where a search LLM navigates this tree. A central challenge in such LLM-guided search is that the model’s relevance judgments are noisy, context-dependent, and unaware of the hierarchy, making cross-branch and cross-level comparisons difficult. To overcome this, we propose a traversal algorithm that estimates calibrated latent relevance scores from local LLM outputs and aggregates them into a global path relevance metric. Our training-free framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark, demonstrating up to 9% improvement in Recall@100 and 5% in nDCG@10 over the next best zero-shot baseline. Furthermore, compared to the fine-tuned SOTA method DIVER-v2, LATTICE attains comparable results on BRIGHT subsets that use a static corpus for evaluation.
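The online traversal phase can be pictured as a best-first search over the semantic tree, as in the sketch below. This is a simplified illustration and not the LATTICE algorithm: the `score` callable is a hypothetical stand-in for the search LLM's local relevance judgment, and path relevance is assumed here to be the plain product of local scores, whereas the paper estimates calibrated latent scores before aggregating them.

```python
import heapq

def best_first_search(tree, score, root="root", top_k=2, max_expansions=10):
    """Return up to top_k leaves ranked by path relevance.

    tree: dict mapping a node to its list of children (leaves have no entry).
    score: callable giving a local relevance in [0, 1] for a node (hypothetical LLM stand-in).
    """
    frontier = [(-1.0, root)]      # max-heap via negated path relevance
    leaves = []
    expansions = 0
    while frontier and len(leaves) < top_k and expansions < max_expansions:
        neg_rel, node = heapq.heappop(frontier)
        children = tree.get(node, [])
        if not children:
            leaves.append((node, -neg_rel))   # reached a document
            continue
        expansions += 1
        for child in children:
            # Path relevance of the child = parent's path relevance times its local score.
            heapq.heappush(frontier, (neg_rel * score(child), child))
    return leaves

# Toy corpus tree and fixed scores standing in for LLM judgments.
TREE = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
SCORES = {"A": 0.9, "B": 0.4, "a1": 0.8, "a2": 0.3, "b1": 0.9, "b2": 0.1}
results = best_first_search(TREE, SCORES.get)
```

Because the frontier is shared across branches, a strong leaf under a weak branch (`b1`) can still outrank a weak leaf under a strong branch (`a2`), which is the cross-branch comparison the abstract identifies as the hard part.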
[LG-49] Towards Understanding Valuable Preference Data for Large Language Model Alignment
Link: https://arxiv.org/abs/2510.13212
Authors: Zizhuo Zhang,Qizhou Wang,Shanshan Ye,Jianing Zhu,Jiangchao Yao,Bo Han,Masashi Sugiyama
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large language model (LLM) alignment is typically achieved through learning from human preference comparisons, making the quality of preference data critical to its success. Existing studies often pre-process raw training datasets to identify valuable preference pairs using external reward models or off-the-shelf LLMs, achieving improved overall performance but rarely examining whether each individual selected data point is genuinely beneficial. We assess data quality through individual influence on validation data using our newly proposed truncated influence function (TIF), which mitigates the over-scoring present in traditional measures and reveals that preference data quality is inherently a property of the model. In other words, a data pair that benefits one model may harm another. Preference data selection approaches therefore need to adapt to specific models. To this end, we introduce two candidate scoring functions (SFs) that are computationally simpler than TIF and positively correlated with it. They are also model dependent and can serve as potential indicators of individual data quality for preference data selection. Furthermore, we observe that these SFs inherently exhibit errors when compared to TIF. We therefore combine them to offset their diverse error sources, resulting in a simple yet effective data selection rule that enables the models to achieve a more precise selection of valuable preference data. We conduct experiments across diverse alignment benchmarks and various LLM families, with results demonstrating that better alignment performance can be achieved using less data, showing the generality of our findings and new methods.
[LG-50] Performance Evaluation of Ising and QUBO Variable Encodings in Boltzmann Machine Learning
Link: https://arxiv.org/abs/2510.13210
Authors: Yasushi Hasegawa,Masayuki Ohzeki
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 6 figures
Abstract:We compare Ising (-1,+1) and QUBO (0,1) encodings for Boltzmann machine learning under a controlled protocol that fixes the model, sampler, and step size. Exploiting the identity that the Fisher information matrix (FIM) equals the covariance of sufficient statistics, we visualize empirical moments from model samples and reveal systematic, representation-dependent differences. QUBO induces larger cross terms between first- and second-order statistics, creating more small-eigenvalue directions in the FIM and lowering spectral entropy. This ill-conditioning explains slower convergence under stochastic gradient descent (SGD). In contrast, natural gradient descent (NGD), which rescales updates by the FIM metric, achieves similar convergence across encodings due to reparameterization invariance. Practically, for SGD-based training, the Ising encoding provides more isotropic curvature and faster convergence; for QUBO, centering/scaling or NGD-style preconditioning mitigates curvature pathologies. These results clarify how representation shapes information geometry and finite-time learning dynamics in Boltzmann machines and yield actionable guidelines for variable encoding and preprocessing.
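The two encodings describe the same family of distributions, related by the affine map s = 2x - 1; only the parameterization differs. The sketch below converts QUBO parameters to Ising parameters and checks that every state gets the same energy. The energy convention (symmetric coupling matrix with zero diagonal, factor 1/2 on the quadratic term) and the function names are assumptions made for this example, not taken from the paper.

```python
import itertools
import numpy as np

def qubo_energy(x, b, W):
    """E(x) = b.x + 0.5 x^T W x for x in {0,1}^n, W symmetric with zero diagonal."""
    return b @ x + 0.5 * x @ W @ x

def ising_energy(s, h, J, c):
    """E(s) = h.s + 0.5 s^T J s + c for s in {-1,+1}^n."""
    return h @ s + 0.5 * s @ J @ s + c

def qubo_to_ising(b, W):
    """Substitute x = (1 + s) / 2 into the QUBO energy and collect terms."""
    J = W / 4.0
    h = b / 2.0 + W.sum(axis=1) / 4.0   # couplings leak into the linear (first-order) terms
    c = b.sum() / 2.0 + W.sum() / 8.0   # constant shift, irrelevant for the distribution
    return h, J, c

def max_energy_gap(b, W):
    """Largest energy mismatch between the two encodings over all 2^n states."""
    h, J, c = qubo_to_ising(b, W)
    gap = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=len(b)):
        x = np.array(bits)
        s = 2.0 * x - 1.0
        gap = max(gap, abs(qubo_energy(x, b, W) - ising_energy(s, h, J, c)))
    return gap
```

The `h` line shows how first- and second-order parameters mix under the change of variables, which gives some intuition for why the curvature seen by plain SGD differs between encodings even though the model family is identical.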
[LG-51] Information-Theoretic Criteria for Knowledge Distillation in Multimodal Learning
Link: https://arxiv.org/abs/2510.13182
Authors: Rongrong Xie,Yizhou Xu,Guido Sanguinetti
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The rapid increase in multimodal data availability has sparked significant interest in cross-modal knowledge distillation (KD) techniques, where richer “teacher” modalities transfer information to weaker “student” modalities during model training to improve performance. However, despite successes across various applications, cross-modal KD does not always result in improved outcomes, primarily due to a limited theoretical understanding that could inform practice. To address this gap, we introduce the Cross-modal Complementarity Hypothesis (CCH): we propose that cross-modal KD is effective when the mutual information between teacher and student representations exceeds the mutual information between the student representation and the labels. We theoretically validate the CCH in a joint Gaussian model and further confirm it empirically across diverse multimodal datasets, including image, text, video, audio, and cancer-related omics data. Our study establishes a novel theoretical framework for understanding cross-modal KD and offers practical guidelines based on the CCH criterion to select optimal teacher modalities for improving the performance of weaker modalities.
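As a toy reading of the CCH criterion stated above, the comparison of the two mutual information terms can be sketched for scalar, jointly Gaussian variables, where mutual information has the closed form I = -0.5 log(1 - rho^2). This is only an illustrative proxy under that Gaussian assumption, not the estimator or experimental protocol used in the paper; the function names are hypothetical.

```python
import numpy as np

def gaussian_mi(a, b):
    """Mutual information (in nats) of two scalar variables, assuming joint Gaussianity."""
    rho = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

def cch_favors_distillation(teacher, student, label):
    """CCH check: is I(teacher; student) larger than I(student; label)?"""
    return bool(gaussian_mi(teacher, student) > gaussian_mi(student, label))
```

In practice representations are high-dimensional and labels are often discrete, so a real application would need a different mutual information estimator; the sketch only fixes the direction of the comparison.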
[LG-52] Universally Invariant Learning in Equivariant GNNs
Link: https://arxiv.org/abs/2510.13169
Authors: Jiacheng Cen,Anyi Li,Ning Lin,Tingyang Xu,Yu Rong,Deli Zhao,Zihe Wang,Wenbing Huang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Equivariant Graph Neural Networks (GNNs) have demonstrated significant success across various applications. To achieve completeness – that is, the universal approximation property over the space of equivariant functions – the network must effectively capture the intricate multi-body interactions among different nodes. Prior methods attain this via deeper architectures, augmented body orders, or increased degrees of steerable features, often at high computational cost and without polynomial-time solutions. In this work, we present a theoretically grounded framework for constructing complete equivariant GNNs that is both efficient and practical. We prove that a complete equivariant GNN can be achieved through two key components: 1) a complete scalar function, referred to as the canonical form of the geometric graph; and 2) a full-rank steerable basis set. Leveraging this finding, we propose an efficient algorithm for constructing complete equivariant GNNs based on two common models: EGNN and TFN. Empirical results show that our model achieves superior completeness and excellent performance with only a few layers, thereby significantly reducing computational overhead while maintaining strong practical efficacy.
[LG-53] D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations
Link: https://arxiv.org/abs/2510.13147
Authors: Faraz Tahmasebi,Michael Pelluer,Hyoukjun Kwon
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 12 pages, 13 figures
Abstract:The computation and memory costs of large language models have kept increasing over the last decade, with models now exceeding 1T parameters. To address the challenges posed by such large-scale models, model compression techniques such as low-rank decomposition have been explored. Previous model decomposition works have focused on weight decomposition to avoid costly runtime decomposition, whose latency often significantly exceeds the benefits from decomposition (e.g., 38% more end-to-end latency when running Llama2-7b on A100 with 4K sequence length with activation decomposition compared to no decomposition). In this work, we debunk such observations and report that the input decomposition can be significantly beneficial with a proper choice of decomposition algorithm and hardware support. We adopt a progressive decomposition algorithm, the Lanczos algorithm, and design a co-accelerator architecture for the decomposition algorithm. To address the memory-boundness of the decomposition operation, we introduce a novel compute replication methodology that moves the operation toward the compute-bound region, which enables a 6.2x speedup in our evaluation. We also develop an output shape-preserving computation scheme that eliminates decomposition costs in consecutive layers. To compensate for the model quality loss from compression, we introduce a multi-track decomposition approach that separately handles outlier channels for high accuracy and low perplexity with minimal computational costs. Combined, our accelerator, D-com, provides 22% end-to-end latency improvements compared to an A100 GPU at the cost of small model quality degradation (e.g., 3% on the AI2 Reasoning Challenge task).
[LG-54] Convergence, design and training of continuous-time dropout as a random batch method
Link: https://arxiv.org/abs/2510.13134
Authors: Antonio Álvarez-López,Martín Hernández
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 37 pages, 20 figures
Abstract:We study dropout regularization in continuous-time models through the lens of random-batch methods – a family of stochastic sampling schemes originally devised to reduce the computational cost of interacting particle systems. We construct an unbiased, well-posed estimator that mimics dropout by sampling neuron batches over time intervals of length h. Trajectory-wise convergence is established with linear rate in h for the expected uniform error. At the distribution level, we establish stability for the associated continuity equation, with total-variation error of order h^{1/2} under mild moment assumptions. During training with fixed batch sampling across epochs, a Pontryagin-based adjoint analysis bounds deviations in the optimal cost and control, as well as in gradient-descent iterates. On the design side, we compare convergence rates for canonical batch sampling schemes, recover standard Bernoulli dropout as a special case, and derive a cost–accuracy trade-off yielding a closed-form optimal h. We then specialize to a single-layer neural ODE and validate the theory on classification and flow matching, observing the predicted rates, regularization effects, and favorable runtime and memory profiles.
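The idea of sampling neuron batches over intervals of length h can be sketched for a scalar single-layer neural ODE. The example below is only a toy under stated assumptions: the tanh activation, the forward-Euler integrator, the Bernoulli(p) mask with 1/p rescaling, and all function names are choices made here for illustration and are not taken from the paper.

```python
import math
import random

def full_field(x, neurons):
    """Full vector field: sum_i w_i * tanh(a_i * x + b_i), neurons = [(w, a, b), ...]."""
    return sum(w * math.tanh(a * x + b) for (w, a, b) in neurons)

def dropout_field(x, neurons, mask, p):
    """Field using only the active neurons, rescaled by 1/p.

    With independent Bernoulli(p) masks, the expectation of this field over
    masks equals full_field(x, neurons), i.e. the estimator is unbiased.
    """
    active = sum(m * w * math.tanh(a * x + b) for m, (w, a, b) in zip(mask, neurons))
    return active / p

def integrate_dropout(x0, neurons, T, h, p, steps_per_interval=10, seed=0):
    """Forward-Euler integration where a fresh mask is drawn every interval of length h."""
    rng = random.Random(seed)
    x = x0
    dt = h / steps_per_interval
    for _ in range(round(T / h)):
        # The same mask is held fixed for the whole interval of length h.
        mask = [1 if rng.random() < p else 0 for _ in neurons]
        for _ in range(steps_per_interval):
            x += dt * dropout_field(x, neurons, mask, p)
    return x
```

Shrinking `h` resamples the mask more often, which is the regime in which the abstract states convergence to the full dynamics.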
[LG-55] Cluster-Based Client Selection for Dependent Multi-Task Federated Learning in Edge Computing
Link: https://arxiv.org/abs/2510.13132
Authors: Jieping Luo,Qiyue Li,Zhizhang Liu,Hang Qi,Jiaying Yin,Jingjin Wu
Categories: Machine Learning (cs.LG)
Comments: 6 pages
Abstract:We study the client selection problem in Federated Learning (FL) within mobile edge computing (MEC) environments, particularly under the dependent multi-task settings, to reduce the total time required to complete various learning tasks. We propose CoDa-FL, a Cluster-oriented and Dependency-aware framework designed to reduce the total required time via cluster-based client selection and dependent task assignment. Our approach considers Earth Mover’s Distance (EMD) for client clustering based on their local data distributions to lower computational cost and improve communication efficiency. We derive a direct and explicit relationship between intra-cluster EMD and the number of training rounds required for convergence, thereby simplifying the otherwise complex process of obtaining the optimal solution. Additionally, we incorporate a directed acyclic graph-based task scheduling mechanism to effectively manage task dependencies. Through numerical experiments, we validate that our proposed CoDa-FL outperforms existing benchmarks by achieving faster convergence, lower communication and computational costs, and higher learning accuracy under heterogeneous MEC settings.
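To make the EMD-based clustering step concrete, here is a small sketch. It assumes the L1 distance between class-label distributions, a form often called EMD in non-IID federated learning work; the abstract does not state which EMD formulation CoDa-FL uses. The greedy threshold clustering and all function names are hypothetical stand-ins, not the paper's algorithm.

```python
def label_distribution(labels, num_classes):
    """Empirical class distribution of one client's local labels."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]

def emd(p, q):
    """L1 distance between two class distributions (assumed EMD form)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def greedy_cluster(dists, threshold):
    """Hypothetical clustering: join the first cluster whose first member is within threshold."""
    clusters = []
    for i, d in enumerate(dists):
        for cluster in clusters:
            if emd(d, dists[cluster[0]]) <= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([i])
    return clusters
```

Clients with similar label distributions end up in the same cluster, so one representative per cluster can be selected for a training round.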
[LG-56] Neural Triangular Transport Maps: A New Approach Towards Sampling in Lattice QCD
Link: https://arxiv.org/abs/2510.13112
Authors: Andrey Bryutkin,Youssef Marzouk
Categories: Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Comments:
Abstract:Lattice field theories are fundamental testbeds for computational physics; yet, sampling their Boltzmann distributions remains challenging due to multimodality and long-range correlations. While normalizing flows offer a promising alternative, their application to large lattices is often constrained by prohibitive memory requirements and the challenge of maintaining sufficient model expressivity. We propose sparse triangular transport maps that explicitly exploit the conditional independence structure of the lattice graph under periodic boundary conditions using monotone rectified neural networks (MRNN). We introduce a comprehensive framework for triangular transport maps that navigates the fundamental trade-off between \emph{exact sparsity} (respecting marginal conditional independence in the target distribution) and \emph{approximate sparsity} (computational tractability without fill-ins). Restricting each triangular map component to a local past enables site-wise parallel evaluation and linear time complexity in lattice size N, while preserving the expressive, invertible structure. Using \phi^4 in two dimensions as a controlled setting, we analyze how node labelings (orderings) affect the sparsity and performance of triangular maps. We compare against Hybrid Monte Carlo (HMC) and established flow approaches (RealNVP).
[LG-57] DeepCausalMMM: A Deep Learning Framework for Marketing Mix Modeling with Causal Inference
Link: https://arxiv.org/abs/2510.13087
Authors: Aditya Puttaparthi Tirumala
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: Submitted to JOSS (Journal of Open Source Software) for publication; currently in the pre-review stage. Note that the author has no middle name; the last name is ‘Puttaparthi Tirumala’ (a two-part surname). MSC classes: 62P20, 62M10, 68T05. ACM classes: D.2.2; I.2.6; G.3. Report number: DOI: 10.5281/zenodo.17274024.
Abstract:Marketing Mix Modeling (MMM) is a statistical technique used to estimate the impact of marketing activities on business outcomes such as sales, revenue, or customer visits. Traditional MMM approaches often rely on linear regression or Bayesian hierarchical models that assume independence between marketing channels and struggle to capture complex temporal dynamics and non-linear saturation effects [@Hanssens2005; @Ng2021Bayesian]. DeepCausalMMM is a Python package that addresses these limitations by combining deep learning, causal inference, and advanced marketing science. The package uses Gated Recurrent Units (GRUs) to automatically learn temporal patterns such as adstock (carryover effects) and lag, while simultaneously learning statistical dependencies and potential causal structures between marketing channels through Directed Acyclic Graph (DAG) learning [@Zheng2018NOTEARS; @Gong2024CausalMMM]. Additionally, it implements Hill equation-based saturation curves to model diminishing returns and optimize budget allocation. Key innovations include: (1) a data-driven design where hyperparameters and transformations (e.g., adstock decay, saturation curves) are learned or estimated from data with sensible defaults, rather than requiring fixed heuristics or manual specification, (2) multi-region modeling with both shared and region-specific parameters, (3) robust statistical methods including Huber loss and advanced regularization, (4) comprehensive response curve analysis for understanding channel saturation, and (5) an extensive visualization suite with 14+ interactive dashboards for business insights.
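For readers unfamiliar with the two marketing transformations mentioned above, the following sketch shows their classical hand-specified forms: geometric adstock (carryover) and a Hill saturation curve. This is not the DeepCausalMMM API; per the abstract, the package learns temporal effects with GRUs, and the function and parameter names here (`decay`, `half_sat`, `slope`) are illustrative assumptions.

```python
def geometric_adstock(spend, decay):
    """Carryover effect: effect_t = spend_t + decay * effect_{t-1}."""
    carried, out = 0.0, []
    for x in spend:
        carried = x + decay * carried
        out.append(carried)
    return out

def hill_saturation(x, half_sat, slope):
    """Hill curve x^s / (k^s + x^s): response in [0, 1) with diminishing returns.

    half_sat (k) is the input level that yields a response of 0.5.
    """
    if x <= 0:
        return 0.0
    return x ** slope / (half_sat ** slope + x ** slope)
```

A single burst of spend decays geometrically over later periods, and the Hill curve flattens as the adstocked spend grows past `half_sat`, which is what makes marginal returns diminish.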
[LG-58] Absolute indices for determining compactness, separability and number of clusters
Link: https://arxiv.org/abs/2510.13065
Authors: Adil M. Bagirov,Ramiz M. Aliguliyev,Nargiz Sultanova,Sona Taheri
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 11 figures, 9 tables
Abstract:Finding “true” clusters in a data set is a challenging problem. Clustering solutions obtained using different models and algorithms do not necessarily provide compact and well-separated clusters or the optimal number of clusters. Cluster validity indices are commonly applied to identify such clusters. Nevertheless, these indices are typically relative, and they are used to compare clustering algorithms or choose the parameters of a clustering algorithm. Moreover, the success of these indices depends on the underlying data structure. This paper introduces novel absolute cluster indices to determine both the compactness and separability of clusters. We define a compactness function for each cluster and a set of neighboring points for cluster pairs. This function is utilized to determine the compactness of each cluster and the whole cluster distribution. The set of neighboring points is used to define the margin between clusters and the overall distribution margin. The proposed compactness and separability indices are applied to identify the true number of clusters. Using a number of synthetic and real-world data sets, we demonstrate the performance of these new indices and compare them with other widely-used cluster validity indices.
[LG-59] Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games
Link: https://arxiv.org/abs/2510.13060
Authors: Anupam Nayak,Tong Yang,Osman Yagan,Gauri Joshi,Yuejie Chi
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:Reverse Kullback-Leibler (KL) divergence-based regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (using uniform reference policy, known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum Matrix games and Markov games: for Matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling and a novel concept of superoptimistic bonuses. Both algorithms achieve a logarithmic regret in T that scales inversely with the KL regularization strength \beta, in addition to the standard \widetilde{\mathcal{O}}(\sqrt{T}) regret independent of \beta, which is attained in both regularized and unregularized settings.
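A standard building block behind KL-regularized games is the closed-form regularized best response: maximizing expected payoff minus \beta times KL(pi || pi_ref) gives pi(a) proportional to pi_ref(a) * exp(q(a) / \beta). The sketch below illustrates only this textbook identity for a matrix game; it is not the OMG or SOMG algorithm, it omits the optimistic bonuses, and the function names are assumptions for this example.

```python
import math

def expected_payoffs(A, opponent):
    """Row player's expected payoff per action against a mixed opponent strategy."""
    return [sum(a * y for a, y in zip(row, opponent)) for row in A]

def kl_regularized_best_response(payoffs, ref_policy, beta):
    """Maximizer of <pi, payoffs> - beta * KL(pi || ref_policy).

    Assumes every entry of ref_policy is strictly positive.
    """
    logits = [math.log(r) + q / beta for q, r in zip(payoffs, ref_policy)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def regularized_value(policy, payoffs, ref_policy, beta):
    """Objective value: expected payoff minus beta times KL(policy || ref_policy)."""
    expected = sum(p * q for p, q in zip(policy, payoffs))
    kl = sum(p * math.log(p / r) for p, r in zip(policy, ref_policy) if p > 0)
    return expected - beta * kl
```

As `beta` grows the response stays close to the reference policy, and as `beta` shrinks it concentrates on the highest-payoff action.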
[LG-60] An Operational Deep Learning System for Satellite-Based High-Resolution Global Nowcasting
Link: https://arxiv.org/abs/2510.13050
Authors: Shreya Agrawal,Mohammed Alewi Hassen,Emmanuel Asiedu Brempong,Boris Babenko,Fred Zyda,Olivia Graham,Di Li,Samier Merchant,Santiago Hincapie Potes,Tyler Russell,Danny Cheresnick,Aditya Prakash Kakkirala,Stephan Rasp,Avinatan Hassidim,Yossi Matias,Nal Kalchbrenner,Pramod Gupta,Jason Hickey,Aaron Bell
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:
Abstract:Precipitation nowcasting, which predicts rainfall up to a few hours ahead, is a critical tool for vulnerable communities in the Global South frequently exposed to intense, rapidly developing storms. Timely forecasts provide a crucial window to protect lives and livelihoods. Traditional numerical weather prediction (NWP) methods suffer from high latency, low spatial and temporal resolution, and significant gaps in accuracy across the world. Recent machine learning-based nowcasting methods, common in the Global North, cannot be extended to the Global South due to extremely sparse radar coverage. We present Global MetNet, an operational global machine learning nowcasting model. It leverages the Global Precipitation Mission’s CORRA dataset, geostationary satellite data, and global NWP data to predict precipitation for the next 12 hours. The model operates at a high resolution of approximately 0.05° (~5km) spatially and 15 minutes temporally. Global MetNet significantly outperforms industry-standard hourly forecasts and achieves significantly higher skill, making forecasts useful over a much larger area of the world than previously available. Our model demonstrates better skill in data-sparse regions than even the best high-resolution NWP models achieve in the US. Validated using ground radar and satellite data, it shows significant improvements across key metrics like the critical success index and fractions skill score for all precipitation rates and lead times. Crucially, our model generates forecasts in under a minute, making it readily deployable for real-time applications. It is already deployed for millions of users on Google Search. This work represents a key step in reducing global disparities in forecast quality and integrating sparse, high-resolution satellite observations into weather forecasting.
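The critical success index (CSI) cited above is a standard verification score: hits divided by the sum of hits, misses, and false alarms at a given rain-rate threshold. A minimal sketch follows; the function name, the flat-list inputs, and the NaN return for the no-event case are choices made for this example, not part of the paper's evaluation code.

```python
def critical_success_index(forecast, observed, threshold):
    """CSI = hits / (hits + misses + false alarms) for events >= threshold."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    denom = hits + misses + false_alarms
    # With no forecast or observed events the score is undefined.
    return hits / denom if denom else float("nan")
```

Correct negatives (no rain forecast, none observed) do not enter the score, which is why CSI is commonly used for relatively rare events such as heavy precipitation.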
[LG-61] Bridging Idealized and Operational Models: An Explainable AI Framework for Earth System Emulators
Link: https://arxiv.org/abs/2510.13030
Authors: Pouria Behnoudfar,Charlotte Moser,Marc Bocquet,Sibo Cheng,Nan Chen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Computer models are indispensable tools for understanding the Earth system. While high-resolution operational models have achieved many successes, they exhibit persistent biases, particularly in simulating extreme events and statistical distributions. In contrast, coarse-grained idealized models isolate fundamental processes and can be precisely calibrated to excel in characterizing specific dynamical and statistical features. However, different models remain siloed by disciplinary boundaries. By leveraging the complementary strengths of models of varying complexity, we develop an explainable AI framework for Earth system emulators. It bridges the model hierarchy through a reconfigured latent data assimilation technique, uniquely suited to exploit the sparse output from the idealized models. The resulting bridging model inherits the high resolution and comprehensive variables of operational models while achieving global accuracy enhancements through targeted improvements from idealized models. Crucially, the mechanism of AI provides a clear rationale for these advancements, moving beyond black-box correction to physically insightful understanding in a computationally efficient framework that enables effective physics-assisted digital twins and uncertainty quantification. We demonstrate its power by significantly correcting biases in CMIP6 simulations of El Niño spatiotemporal patterns, leveraging statistically accurate idealized models. This work also highlights the importance of pushing idealized model development and advancing communication between modeling communities.
[LG-62] Information Shapes Koopman Representation
Link: https://arxiv.org/abs/2510.13025
Authors: Xiaoyuan Cheng,Wenxuan Yuan,Yiming Yang,Yuanzhao Zhang,Sibo Cheng,Yi He,Zhuo Sun
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:The Koopman operator provides a powerful framework for modeling dynamical systems and has attracted growing interest from the machine learning community. However, its infinite-dimensional nature makes identifying suitable finite-dimensional subspaces challenging, especially for deep architectures. We argue that these difficulties come from suboptimal representation learning, where latent variables fail to balance expressivity and simplicity. This tension is closely related to the information bottleneck (IB) dilemma: constructing compressed representations that are both compact and predictive. Rethinking Koopman learning through this lens, we demonstrate that latent mutual information promotes simplicity, yet an overemphasis on simplicity may cause latent space to collapse onto a few dominant modes. In contrast, expressiveness is sustained by the von Neumann entropy, which prevents such collapse and encourages mode diversity. This insight leads us to propose an information-theoretic Lagrangian formulation that explicitly balances this tradeoff. Furthermore, we propose a new algorithm based on the Lagrangian formulation that encourages both simplicity and expressiveness, leading to a stable and interpretable Koopman representation. Beyond quantitative evaluations, we further visualize the learned manifolds under our representations, observing empirical results consistent with our theoretical predictions. Finally, we validate our approach across a diverse range of dynamical systems, demonstrating improved performance over existing Koopman learning methods. The implementation is publicly available at this https URL.
[LG-63] Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment
Link: https://arxiv.org/abs/2510.13023
Authors: Joshua R. Tempelman,Adam J. Wachtor,Eric B. Flynn
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 26 pages, 6 page appendix
Abstract:Automated ultrasonic weld inspection remains a significant challenge in the nondestructive evaluation (NDE) community due to factors such as limited training data (due to the complexity of curating experimental specimens or high-fidelity simulations) and environmental volatility of many industrial settings (resulting in the corruption of on-the-fly measurements). Thus, an end-to-end machine learning (ML) workflow for acoustic weld inspection in realistic (i.e., industrial) settings has remained an elusive goal. This work addresses the challenges of data curation and signal corruption by proposing a workflow consisting of a reduced-order modeling scheme, diffusion-based distribution alignment, and U-Net-based segmentation and inversion. A reduced-order Helmholtz model based on Lamb wave theory is used to generate a comprehensive dataset over varying weld heterogeneity and crack defects. The relatively inexpensive low-order solutions provide a robust training dataset for inversion models, which are refined through a transfer learning stage using a limited set of full 3D elastodynamic simulations. To handle out-of-distribution (OOD) real-world measurements with varying and unpredictable noise distributions, i.e., Laser Doppler Vibrometry (LDV) scans, guided diffusion produces in-distribution representations of OOD experimental LDV scans, which are subsequently processed by the inversion models. This integrated framework provides an end-to-end solution for automated weld inspection on real data.
[LG-64] Escaping Local Optima in the Waddington Landscape: A Multi-Stage TRPO-PPO Approach for Single-Cell Perturbation Analysis
Link: https://arxiv.org/abs/2510.13018
Authors: Francis Boabang,Samuel Asante Gyamerah
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 9 pages, 2 figures, 3 tables
Abstract:Modeling cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology. Existing data-driven frameworks have advanced perturbation prediction through variational autoencoders, chemically conditioned autoencoders, and large-scale transformer pretraining. However, these models are prone to local optima in the nonconvex Waddington landscape of cell fate decisions, where poor initialization can trap trajectories in spurious lineages or implausible differentiation outcomes. While executable gene regulatory networks complement these approaches, automated design frameworks incorporate biological priors through multi-agent optimization. Yet, an approach that is completely data-driven with well-designed initialization to escape local optima and converge to a proper lineage remains elusive. In this work, we introduce a multistage reinforcement learning algorithm tailored for single-cell perturbation modeling. We first compute an explicit natural gradient update using Fisher-vector products and a conjugate gradient solver, scaled by a KL trust-region constraint to provide a safe, curvature-aware first step for the policy. Starting with these preconditioned parameters, we then apply a second phase of proximal policy optimization (PPO) with clipped surrogates, exploiting minibatch efficiency to refine the policy. We demonstrate that this initialization substantially improves generalization on single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) perturbation analysis.
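The two stages described above rest on well-known components, which can be sketched generically. The code below shows a conjugate gradient solve driven by Fisher-vector products, the usual TRPO-style scaling of the natural-gradient direction to a KL budget, and the PPO clipped surrogate. It is a textbook illustration, not the authors' implementation: the function names, the `max_kl` and `eps` parameters, and the plain-list representation are assumptions for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = [0.0] * len(g)
    r, p = list(g), list(g)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Fp = fvp(p)
        alpha = rs / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def natural_gradient_step(fvp, g, max_kl):
    """Scale the direction s = F^{-1} g so that 0.5 * step^T F step equals max_kl."""
    direction = conjugate_gradient(fvp, g)
    scale = math.sqrt(2.0 * max_kl / dot(g, direction))
    return [scale * d for d in direction]

def ppo_clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

In a two-phase scheme of the kind the abstract describes, a step from `natural_gradient_step` would be applied once to the policy parameters before switching to minibatch updates that maximize `ppo_clipped_surrogate`.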
[LG-65] AMORE: Adaptive Multi-Output Operator Network for Stiff Chemical Kinetics
Link: https://arxiv.org/abs/2510.12999
Authors: Kamaljyoti Nath,Additi Pandey,Bryan T. Susi,Hessam Babaee,George Em Karniadakis
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Time integration of stiff systems is a primary source of computational cost in combustion, hypersonics, and other reactive transport systems. This stiffness can introduce time scales significantly smaller than those associated with other physical processes, requiring extremely small time steps in explicit schemes or computationally intensive implicit methods. Consequently, strategies to alleviate challenges posed by stiffness are important. While neural operators (DeepONets) can act as surrogates for stiff kinetics, a reliable operator learning strategy is required to appropriately account for differences in the error between output variables and samples. Here, we develop AMORE, Adaptive Multi-Output Operator Network, a framework comprising an operator capable of predicting multiple outputs and adaptive loss functions ensuring reliable operator learning. The operator predicts all thermochemical states from given initial conditions. We propose two adaptive loss functions within the framework, considering each state variable’s and sample’s error to penalize the loss function. We designed the trunk to automatically satisfy the partition of unity. To enforce the unity mass-fraction constraint exactly, we propose an invertible analytical map that transforms the n-dimensional species mass-fraction vector into an (n-1)-dimensional space, where DeepONet training is performed. We consider two-step training for DeepONet for multiple outputs and extend adaptive loss functions for trunk and branch training. We demonstrate the efficacy and applicability of our models through two examples: the syngas (12 states) and GRI-Mech 3.0 (24 active states out of 54). The proposed DeepONet will be a backbone for future CFD studies to accelerate turbulent combustion simulations. AMORE is a general framework, and here, in addition to DeepONet, we also demonstrate it for FNO.
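The abstract does not specify the invertible map that takes an n-dimensional mass-fraction vector to (n-1) dimensions. As one possible illustration of the idea, the sketch below uses the additive log-ratio transform, a standard map for compositional data. It is an assumed stand-in, not the paper's map, and it requires strictly positive mass fractions; the function names are hypothetical.

```python
import math

def to_unconstrained(y):
    """Additive log-ratio: z_i = log(y_i / y_n) for i < n (all y_i must be > 0)."""
    ref = y[-1]
    return [math.log(yi / ref) for yi in y[:-1]]

def to_mass_fractions(z):
    """Inverse map: returns n positive values that sum to 1 for any real z."""
    m = max(z + [0.0])  # shift exponents for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    ref = math.exp(-m)
    total = sum(exps) + ref
    return [e / total for e in exps] + [ref / total]
```

Because the inverse normalizes by construction, any prediction made in the (n-1)-dimensional space maps back to mass fractions that sum to one, which is the property the abstract asks of its own map.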
[LG-66] CSI-4CAST: A Hybrid Deep Learning Model for CSI Prediction with Comprehensive Robustness and Generalization Testing
Link: https://arxiv.org/abs/2510.12996
Authors: Sikai Cheng,Reza Zandehshahvar,Haoruo Zhao,Daniel A. Garcia-Ulloa,Alejandro Villena-Rodriguez,Carles Navarro Manchón,Pascal Van Hentenryck
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Channel state information (CSI) prediction is a promising strategy for ensuring reliable and efficient operation of massive multiple-input multiple-output (mMIMO) systems by providing timely downlink (DL) CSI. While deep learning-based methods have advanced beyond conventional model-driven and statistical approaches, they remain limited in robustness to practical non-Gaussian noise, generalization across diverse channel conditions, and computational efficiency. This paper introduces CSI-4CAST, a hybrid deep learning architecture that integrates 4 key components, i.e., Convolutional neural network residuals, Adaptive correction layers, ShuffleNet blocks, and Transformers, to efficiently capture both local and long-range dependencies in CSI prediction. To enable rigorous evaluation, this work further presents a comprehensive benchmark, CSI-RRG for Regular, Robustness and Generalization testing, which includes more than 300,000 samples across 3,060 realistic scenarios for both TDD and FDD systems. The dataset spans multiple channel models, a wide range of delay spreads and user velocities, and diverse noise types and intensity degrees. Experimental results show that CSI-4CAST achieves superior prediction accuracy with substantially lower computational cost, outperforming baselines in 88.9% of TDD scenarios and 43.8% of FDD scenarios, the best performance among all evaluated models, while reducing FLOPs by 5x and 3x compared to LLM4CP, the strongest baseline. In addition, evaluation over CSI-RRG provides valuable insights into how different channel factors affect the performance and generalization capability of deep learning models. Both the dataset (this https URL) and evaluation protocols (this https URL) are publicly released to establish a standardized benchmark and to encourage further research on robust and efficient CSI prediction.
[LG-67] Deep Learning-Based Visual Fatigue Detection Using Eye Gaze Patterns in VR
Link: https://arxiv.org/abs/2510.12994
Authors: Numan Zafar,Johnathan Locke,Shafique Ahmad Chaudhry
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, Accepted at IEEE International Symposium on Emerging Metaverse (ISEMV 2025)
Abstract:Prolonged exposure to virtual reality (VR) systems leads to visual fatigue and impairs user comfort, performance, and safety, particularly in high-stakes or long-duration applications. Existing fatigue detection approaches rely on subjective questionnaires or intrusive physiological signals, such as EEG, heart rate, or eye-blink count, which limit their scalability and real-time applicability. This paper introduces a deep learning-based study for detecting visual fatigue using continuous eye-gaze trajectories recorded in VR. We use the GazeBaseVR dataset comprising binocular eye-tracking data from 407 participants across five immersive tasks, extract cyclopean eye-gaze angles, and evaluate six deep classifiers. Our results demonstrate that EKYT achieves up to 94% accuracy, particularly in tasks demanding high visual attention, such as video viewing and text reading. We further analyze gaze variance and subjective fatigue measures, indicating significant behavioral differences between fatigued and non-fatigued conditions. These findings establish eye-gaze dynamics as a reliable and nonintrusive modality for continuous fatigue detection in immersive VR, offering practical implications for adaptive human-computer interactions.
[LG-68] Behavioral Biometrics for Automatic Detection of User Familiarity in VR
Link: https://arxiv.org/abs/2510.12988
Authors: Numan Zafar,Priyo Ranjan Kundu Prosun,Shafique Ahmad Chaudhry
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 7 pages, 7 figures, 17th International Conference on Quality of Multimedia Experience
Abstract:As virtual reality (VR) devices become increasingly integrated into everyday settings, a growing number of users without prior experience will engage with VR systems. Automatically detecting a user’s familiarity with VR as an interaction medium enables real-time, adaptive training and interface adjustments, minimizing user frustration and improving task performance. In this study, we explore the automatic detection of VR familiarity by analyzing hand movement patterns during a passcode-based door-opening task, which is a well-known interaction in collaborative virtual environments such as meeting rooms, offices, and healthcare spaces. While novice users may lack prior VR experience, they are likely to be familiar with analogous real-world tasks involving keypad entry. We conducted a pilot study with 26 participants, evenly split between experienced and inexperienced VR users, who performed tasks using both controller-based and hand-tracking interactions. Our approach uses state-of-the-art deep classifiers for automatic VR familiarity detection, achieving the highest accuracies of 92.05% and 83.42% for hand-tracking and controller-based interactions, respectively. In the cross-device evaluation, where classifiers trained on controller data were tested using hand-tracking data, the model achieved an accuracy of 78.89%. The integration of both modalities in the mixed-device evaluation obtained an accuracy of 94.19%. Our results underline the promise of using hand movement biometrics for the real-time detection of user familiarity in critical VR applications, paving the way for personalized and adaptive VR experiences.
[LG-69] Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check
Link: https://arxiv.org/abs/2510.12981
Authors: Sungjun Cho,Dasol Hwang,Frederic Sala,Sangheum Hwang,Kyunghyun Cho,Sungmin Cha
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 11 figures
Abstract:Current unlearning metrics for generative models evaluate success based on reference responses or classifier outputs rather than assessing the core objective: whether the unlearned model behaves indistinguishably from a model that never saw the unwanted data. This reference-specific approach creates systematic blind spots, allowing models to appear successful while retaining unwanted knowledge accessible through alternative prompts or attacks. We address these limitations by proposing Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models by comparing bidirectional likelihood assignments over generated samples. Unlike existing approaches that rely on predetermined references, FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. Our experiments on the TOFU benchmark for LLM unlearning and the UnlearnCanvas benchmark for text-to-image diffusion model unlearning reveal that methods achieving near-optimal scores on traditional metrics fail to achieve distributional equivalence, with many becoming more distant from the gold standard than before unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
[LG-70] A Connection Between Score Matching and Local Intrinsic Dimension NEURIPS2025
Link: https://arxiv.org/abs/2510.12975
Authors: Eric Yeats,Aaron Jacobson,Darryl Hannan,Yiran Jia,Timothy Doster,Henry Kvinge,Scott Mahan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to the 3rd SPIGM Workshop at NeurIPS 2025
Abstract:The local intrinsic dimension (LID) of data is a fundamental quantity in signal processing and learning theory, but quantifying the LID of high-dimensional, complex data has been a historically challenging task. Recent works have discovered that diffusion models capture the LID of data through the spectra of their score estimates and through the rate of change of their density estimates under various noise perturbations. While these methods can accurately quantify LID, they require either many forward passes of the diffusion model or use of gradient computation, limiting their applicability in compute- and memory-constrained scenarios. We show that the LID is a lower bound on the denoising score matching loss, motivating use of the denoising score matching loss as a LID estimator. Moreover, we show that the equivalent implicit score matching loss also approximates LID via the normal dimension and is closely related to a recent LID estimator, FLIPD. Our experiments on a manifold benchmark and with Stable Diffusion 3.5 indicate that the denoising score matching loss is a highly competitive and scalable LID estimator, achieving superior accuracy and memory footprint under increasing problem size and quantization level.
[LG-71] Balancing Performance and Reject Inclusion: A Novel Confident Inlier Extrapolation Framework for Credit Scoring
Link: https://arxiv.org/abs/2510.12967
Authors: Athyrson Machado Ribeiro,Marcos Medeiros Raimundo
Subjects: Machine Learning (cs.LG)
Comments: 45 pages, 19 figures
Abstract:Reject Inference (RI) methods aim to address sample bias by inferring missing repayment data for rejected credit applicants. Traditional approaches often assume that the behavior of rejected clients can be extrapolated from accepted clients, despite potential distributional differences between the two populations. To mitigate this blind extrapolation, we propose a novel Confident Inlier Extrapolation framework (CI-EX). CI-EX iteratively identifies the distribution of rejected client samples using an outlier detection model and assigns labels to rejected individuals closest to the distribution of the accepted population based on probabilities derived from a supervised classification model. The effectiveness of our proposed framework is validated through experiments on two large real-world credit datasets. Performance is evaluated using the Area Under the Curve (AUC) as well as RI-specific metrics such as Kickout and a novel metric introduced in this work, denoted as Area under the Kickout. Our findings reveal that RI methods, including the proposed framework, generally involve a trade-off between AUC and RI-specific metrics. However, the proposed CI-EX framework consistently outperforms existing RI models from the credit literature in terms of RI-specific metrics while maintaining competitive performance in AUC across most experiments.
[LG-72] An Investigation of Memorization Risk in Healthcare Foundation Models
Link: https://arxiv.org/abs/2510.12950
Authors: Sana Tonekaboni,Lena Stempfle,Adibvafa Fallahpour,Walter Gerych,Marzyeh Ghassemi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.
[LG-73] Pruning Cannot Hurt Robustness: Certified Trade-offs in Reinforcement Learning
Link: https://arxiv.org/abs/2510.12939
Authors: James Pedley,Benjamin Etheridge,Stephen J. Roberts,Francesco Quinzan
Subjects: Machine Learning (cs.LG)
Comments: 24 pages, 13 figures
Abstract:Reinforcement learning (RL) policies deployed in real-world environments must remain reliable under adversarial perturbations. At the same time, modern deep RL agents are heavily over-parameterized, raising costs and fragility concerns. While pruning has been shown to improve robustness in supervised learning, its role in adversarial RL remains poorly understood. We develop the first theoretical framework for certified robustness under pruning in state-adversarial Markov decision processes (SA-MDPs). For Gaussian and categorical policies with Lipschitz networks, we prove that element-wise pruning can only tighten certified robustness bounds; pruning never makes the policy less robust. Building on this, we derive a novel three-term regret decomposition that disentangles clean-task performance, pruning-induced performance loss, and robustness gains, exposing a fundamental performance–robustness frontier. Empirically, we evaluate magnitude and micro-pruning schedules on continuous-control benchmarks with strong policy-aware adversaries. Across tasks, pruning consistently uncovers reproducible "sweet spots" at moderate sparsity levels, where robustness improves substantially without harming, and sometimes even enhancing, clean performance. These results position pruning not merely as a compression tool but as a structural intervention for robust RL.
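The claim that element-wise pruning cannot loosen a Lipschitz-based bound can be illustrated with a small sketch. It is not the paper's certificate: it assumes a feed-forward network with 1-Lipschitz element-wise activations and uses the product of induced infinity-norms (maximum absolute row sum) of the weight matrices as the Lipschitz upper bound. That norm is chosen because zeroing entries can never increase it. The function names `inf_norm`, `magnitude_prune`, and `lipschitz_upper_bound` are hypothetical.

```python
def inf_norm(weights):
    """Induced infinity-norm of a matrix: the maximum absolute row sum."""
    return max(sum(abs(w) for w in row) for row in weights)


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to zero.

    `sparsity` is the fraction of entries to remove (0.0 keeps everything).
    """
    flat = [(abs(w), r, c) for r, row in enumerate(weights) for c, w in enumerate(row)]
    flat.sort()
    n_prune = int(sparsity * len(flat))
    pruned = [list(row) for row in weights]
    for _, r, c in flat[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


def lipschitz_upper_bound(layers):
    """Product of per-layer infinity-norms (assumes 1-Lipschitz activations)."""
    bound = 1.0
    for weights in layers:
        bound *= inf_norm(weights)
    return bound


# Toy two-layer network (illustrative values only).
layer1 = [[0.5, -2.0], [0.1, 1.0]]
layer2 = [[1.5, 0.2]]
dense_bound = lipschitz_upper_bound([layer1, layer2])
sparse_bound = lipschitz_upper_bound(
    [magnitude_prune(layer1, 0.5), magnitude_prune(layer2, 0.5)]
)
```

Each absolute row sum can only shrink when entries are zeroed, so `sparse_bound` never exceeds `dense_bound` in this setting. The same monotonicity does not hold for every matrix norm (the spectral norm is a counterexample), so the choice of norm matters.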
[LG-74] Learning at the Speed of Physics: Equilibrium Propagation on Oscillator Ising Machines NEURIPS2025
Link: https://arxiv.org/abs/2510.12934
Authors: Alex Gower
Subjects: Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, NeurIPS 2025 Machine Learning and the Physical Sciences (ML4PS)
Abstract:Physical systems that naturally perform energy descent offer a direct route to accelerating machine learning. Oscillator Ising Machines (OIMs) exemplify this idea: their GHz-frequency dynamics mirror both the optimization of energy-based models (EBMs) and gradient descent on loss landscapes, while intrinsic noise corresponds to Langevin dynamics, supporting sampling as well as optimization. Equilibrium Propagation (EP) unifies these processes into descent on a single total energy landscape, enabling local learning rules without global backpropagation. We show that EP on OIMs achieves competitive accuracy ($\sim 97.2 \pm 0.1\%$ on MNIST, $\sim 88.0 \pm 0.1\%$ on Fashion-MNIST), while maintaining robustness under realistic hardware constraints such as parameter quantization and phase noise. These results establish OIMs as a fast, energy-efficient substrate for neuromorphic learning, and suggest that EBMs, often bottlenecked by conventional processors, may find practical realization on physical hardware whose dynamics directly perform their optimization.
[LG-75] FedGTEA: Federated Class-Incremental Learning with Gaussian Task Embedding and Alignment
Link: https://arxiv.org/abs/2510.12927
Authors: Haolin Li,Hoda Bidkhori
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We introduce a novel framework for Federated Class Incremental Learning, called Federated Gaussian Task Embedding and Alignment (FedGTEA). FedGTEA is designed to capture task-specific knowledge and model uncertainty in a scalable and communication-efficient manner. At the client side, the Cardinality-Agnostic Task Encoder (CATE) produces Gaussian-distributed task embeddings that encode task knowledge, address statistical heterogeneity, and quantify data uncertainty. Importantly, CATE maintains a fixed parameter size regardless of the number of tasks, which ensures scalability across long task sequences. On the server side, FedGTEA utilizes the 2-Wasserstein distance to measure inter-task gaps between Gaussian embeddings. We formulate the Wasserstein loss to enforce inter-task separation. This probabilistic formulation not only enhances representation learning but also preserves task-level privacy by avoiding the direct transmission of latent embeddings, aligning with the privacy constraints in federated learning. Extensive empirical evaluations on popular datasets demonstrate that FedGTEA achieves superior classification performance and significantly mitigates forgetting, consistently outperforming strong existing baselines.
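The 2-Wasserstein distance between two Gaussians has a closed form, which is what makes it convenient for comparing Gaussian task embeddings. The sketch below assumes diagonal covariances (the abstract does not state the covariance structure), in which case the squared distance reduces to the squared difference of the means plus the squared difference of the standard deviations. The function names are hypothetical.

```python
import math


def w2_squared_diag(mean_a, std_a, mean_b, std_b):
    """Squared 2-Wasserstein distance between N(mean_a, diag(std_a^2))
    and N(mean_b, diag(std_b^2)).

    For diagonal covariances the closed form is
    ||mean_a - mean_b||^2 + ||std_a - std_b||^2.
    """
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    std_term = sum((sa - sb) ** 2 for sa, sb in zip(std_a, std_b))
    return mean_term + std_term


def w2_diag(mean_a, std_a, mean_b, std_b):
    """2-Wasserstein distance (square root of the squared distance)."""
    return math.sqrt(w2_squared_diag(mean_a, std_a, mean_b, std_b))


# Two toy task embeddings with identical spread but different means.
gap = w2_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

A separation loss could then penalize small values of `gap` between embeddings of different tasks. How FedGTEA actually forms its Wasserstein loss is not specified in the abstract.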
[LG-76] Adaptive vector steering: A training-free layer-wise intervention for hallucination mitigation in large audio and multimodal models ICASSP2026
Link: https://arxiv.org/abs/2510.12851
Authors: Tsung-En Lin,Kuan-Yi Lee,Hung-Yi Lee
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Note: This preprint is a version of the paper submitted to ICASSP 2026. The author list here includes contributors who provided additional supervision and guidance. The official ICASSP submission may differ slightly in author composition
Abstract:Large Audio-Language Models and Multi-Modal Large Language Models have demonstrated strong capabilities in tasks such as Audio Question Answering (AQA), Audio Captioning, and Automatic Speech Recognition (ASR). However, there is growing evidence that these models can hallucinate about the content of the audio. To address this issue, we probe the models’ internal states and propose Adaptive Vector Steering (AVS), a method that better grounds generation in audio content. We also identify a strong correlation between output correctness and internal representations. Experiments show consistent performance gains across two models and two benchmarks. On the Audio Hallucination QA dataset, our method boosts the F1-score of Gemma from 0.550 to 0.619 and Qwen from 0.626 to 0.632. Furthermore, our method increases the accuracy of Qwen on MMAU from 0.548 to 0.592, marking an 8% relative increase. To the best of our knowledge, this is the first work to apply vector steering to mitigate hallucination in audio.
[LG-77] Lifting Manifolds to Mitigate Pseudo-Alignment in LLM4TS
Link: https://arxiv.org/abs/2510.12847
Authors: Liangwei Nathan Zheng,Wenhao Liang,Wei Emma Zhang,Miao Xu,Olaf Maennel,Weitong Chen
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Pseudo-Alignment is a pervasive challenge in many large language model for time series (LLM4TS) approaches, often causing them to underperform compared to linear models or randomly initialised backbones. However, there is limited discussion in the community of why pseudo-alignment occurs. In this work, we conduct a thorough investigation into the root causes of pseudo-alignment in LLM4TS and build a connection of pseudo-alignment to the cone effect in LLM. We demonstrate that pseudo-alignment arises from the interplay of cone effect within pretrained LLM components and the intrinsically low-dimensional manifold of time-series data. In addition, we also introduce TimeSUP, a novel technique designed to mitigate this issue and improve forecast performance in existing LLM4TS approaches. TimeSUP addresses this by increasing the time series manifold to more closely match the intrinsic dimension of language embeddings, allowing the model to distinguish temporal signals clearly while still capturing shared structures across modalities. As a result, representations for time and language tokens remain distinct yet exhibit high cosine similarity, signifying that the model preserves each modality's unique features while learning their commonalities in a unified embedding space. Empirically, TimeSUP consistently outperforms state-of-the-art LLM4TS methods and other lightweight baselines on long-term forecasting performance. Furthermore, it can be seamlessly integrated into four existing LLM4TS pipelines and delivers significant improvements in forecasting performance.
[LG-78] Local Timescale Gates for Timescale-Robust Continual Spiking Neural Networks
Link: https://arxiv.org/abs/2510.12843
Authors: Ansh Tiwari,Ayush Chauhan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Spiking neural networks (SNNs) promise energy-efficient artificial intelligence on neuromorphic hardware but struggle with tasks requiring both fast adaptation and long-term memory, especially in continual learning. We propose Local Timescale Gating (LT-Gate), a neuron model that combines dual time-constant dynamics with an adaptive gating mechanism. Each spiking neuron tracks information on a fast and a slow timescale in parallel, and a learned gate locally adjusts their influence. This design enables individual neurons to preserve slow contextual information while responding to fast signals, addressing the stability-plasticity dilemma. We further introduce a variance-tracking regularization that stabilizes firing activity, inspired by biological homeostasis. Empirically, LT-Gate yields significantly improved accuracy and retention in sequential learning tasks: on a challenging temporal classification benchmark it achieves about 51 percent final accuracy, compared to about 46 percent for a recent Hebbian continual-learning baseline and lower for prior SNN methods. Unlike approaches that require external replay or expensive orthogonalizations, LT-Gate operates with local updates and is fully compatible with neuromorphic hardware. In particular, it leverages features of Intel’s Loihi chip (multiple synaptic traces with different decay rates) for on-chip learning. Our results demonstrate that multi-timescale gating can substantially enhance continual learning in SNNs, narrowing the gap between spiking and conventional deep networks on lifelong-learning tasks.
[LG-79] SimKey: A Semantically Aware Key Module for Watermarking Language Models
Link: https://arxiv.org/abs/2510.12828
Authors: Shingo Kodama,Haya Diwan,Lucas Rosenblatt,R. Teal Witter,Niv Cohen
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The rapid spread of text generated by large language models (LLMs) makes it increasingly difficult to distinguish authentic human writing from machine output. Watermarking offers a promising solution: model owners can embed an imperceptible signal into generated text, marking its origin. Most leading approaches seed an LLM’s next-token sampling with a pseudo-random key that can later be recovered to identify the text as machine-generated, while only minimally altering the model’s output distribution. However, these methods suffer from two related issues: (i) watermarks are brittle to simple surface-level edits such as paraphrasing or reordering; and (ii) adversaries can append unrelated, potentially harmful text that inherits the watermark, risking reputational damage to model owners. To address these issues, we introduce SimKey, a semantic key module that strengthens watermark robustness by tying key generation to the meaning of prior context. SimKey uses locality-sensitive hashing over semantic embeddings to ensure that paraphrased text yields the same watermark key, while unrelated or semantically shifted text produces a different one. Integrated with state-of-the-art watermarking schemes, SimKey improves watermark robustness to paraphrasing and translation while preventing harmful content from false attribution, establishing semantic-aware keying as a practical and extensible watermarking direction.
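The idea of deriving a watermark key from locality-sensitive hashing over embeddings can be sketched with random-hyperplane hashing (SimHash), one common LSH family for cosine similarity. This is an assumption: the abstract does not say which LSH scheme SimKey uses. The embedding vectors and the function names `make_hyperplanes` and `lsh_key` are likewise hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw `n_bits` random Gaussian hyperplanes in `dim` dimensions.

    The seed plays the role of a secret shared by generator and detector.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(embedding, hyperplanes):
    """Map an embedding to an integer key: one sign bit per hyperplane.

    Embeddings pointing in similar directions agree on most bits, so
    close paraphrases are likely to map to the same key.
    """
    key = 0
    for plane in hyperplanes:
        dot = sum(p * e for p, e in zip(plane, embedding))
        key = (key << 1) | (1 if dot >= 0.0 else 0)
    return key


planes = make_hyperplanes(dim=4, n_bits=8, seed=1)
context = [0.3, -1.2, 0.5, 2.0]              # stand-in for a semantic embedding
rescaled = [2.0 * x for x in context]        # same direction, different norm
opposite = [-x for x in context]             # semantically "opposite" direction

key_context = lsh_key(context, planes)
key_rescaled = lsh_key(rescaled, planes)
key_opposite = lsh_key(opposite, planes)
```

Here `key_context` and `key_rescaled` coincide because only the direction of the embedding matters, while `key_opposite` flips every bit. Fewer bits make the key more tolerant to paraphrase but also more likely to collide on unrelated text.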
[LG-80] Applying Graph Analysis for Unsupervised Fast Malware Fingerprinting
Link: https://arxiv.org/abs/2510.12811
Authors: ElMouatez Billah Karbab,Mourad Debbabi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Malware proliferation is increasing at a tremendous rate, with hundreds of thousands of new samples identified daily. Manual investigation of such a vast amount of malware is an unrealistic, time-consuming, and overwhelming task. To cope with this volume, there is a clear need to develop specialized techniques and efficient tools for preliminary filtering that can group malware based on semantic similarity. In this paper, we propose TrapNet, a novel, scalable, and unsupervised framework for malware fingerprinting and grouping. TrapNet employs graph community detection techniques for malware fingerprinting and family attribution based on static analysis, as follows: (1) TrapNet detects packed binaries and unpacks them using known generic packer tools. (2) From each malware sample, it generates a digest that captures the underlying semantics. Since the digest must be dense, efficient, and suitable for similarity checking, we designed FloatHash (FH), a novel numerical fuzzy hashing technique that produces a short real-valued vector summarizing the underlying assembly items and their order. FH is based on applying Principal Component Analysis (PCA) to ordered assembly items (e.g., opcodes, function calls) extracted from the malware’s assembly code. (3) Representing malware with short numerical vectors enables high-performance, large-scale similarity computation, which allows TrapNet to build a malware similarity network. (4) Finally, TrapNet employs state-of-the-art community detection algorithms to identify dense communities, which represent groups of malware with similar semantics. Our extensive evaluation of TrapNet demonstrates its effectiveness in terms of the coverage and purity of the detected communities, while also highlighting its runtime efficiency, which outperforms other state-of-the-art solutions.
[LG-81] Control of dynamical systems with neural networks
Link: https://arxiv.org/abs/2510.12810
Authors: Lucas Böttcher
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 23 pages, 14 figures, 1 table
Abstract:Control problems frequently arise in scientific and industrial applications, where the objective is to steer a dynamical system from an initial state to a desired target state. Recent advances in deep learning and automatic differentiation have made applying these methods to control problems increasingly practical. In this paper, we examine the use of neural networks and modern machine-learning libraries to parameterize control inputs across discrete-time and continuous-time systems, as well as deterministic and stochastic dynamics. We highlight applications in multiple domains, including biology, engineering, physics, and medicine. For continuous-time dynamical systems, neural ordinary differential equations (neural ODEs) offer a useful approach to parameterizing control inputs. For discrete-time systems, we show how custom control-input parameterizations can be implemented and optimized using automatic-differentiation methods. Overall, the methods presented provide practical solutions for control tasks that are computationally demanding or analytically intractable, making them valuable for complex real-world applications.
[LG-82] PriorGuide: Test-Time Prior Adaptation for Simulation-Based Inference
Link: https://arxiv.org/abs/2510.13763
Authors: Yang Yang,Severi Rissanen,Paul E. Chang,Nasrulloh Loka,Daolang Huang,Arno Solin,Markus Heinonen,Luigi Acerbi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 35 pages, 6 figures
Abstract:Amortized simulator-based inference offers a powerful framework for tackling Bayesian inference in computational fields such as engineering or neuroscience, increasingly leveraging modern generative methods like diffusion models to map observed data to model parameters or future predictions. These approaches yield posterior or posterior-predictive samples for new datasets without requiring further simulator calls after training on simulated parameter-data pairs. However, their applicability is often limited by the prior distribution(s) used to generate model parameters during this training phase. To overcome this constraint, we introduce PriorGuide, a technique specifically designed for diffusion-based amortized inference methods. PriorGuide leverages a novel guidance approximation that enables flexible adaptation of the trained diffusion model to new priors at test time, crucially without costly retraining. This allows users to readily incorporate updated information or expert knowledge post-training, enhancing the versatility of pre-trained inference models.
[LG-83] Optimal Bounds for Tyler's M-Estimator for Elliptical Distributions
Link: https://arxiv.org/abs/2510.13751
Authors: Lap Chi Lau,Akshay Ramachandran
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
Comments: 13 pages + proofs in Appendix
Abstract:A fundamental problem in statistics is estimating the shape matrix of an Elliptical distribution. This generalizes the familiar problem of Gaussian covariance estimation, for which the sample covariance achieves optimal estimation error. For Elliptical distributions, Tyler proposed a natural M-estimator and showed strong statistical properties in the asymptotic regime, independent of the underlying distribution. Numerical experiments show that this estimator performs very well, and that Tyler’s iterative procedure converges quickly to the estimator. Franks and Moitra recently provided the first distribution-free error bounds in the finite sample setting, as well as the first rigorous convergence analysis of Tyler’s iterative procedure. However, their results exceed the sample complexity of the Gaussian setting by a $\log^2 d$ factor. We close this gap by proving optimal sample threshold and error bounds for Tyler’s M-estimator for all Elliptical distributions, fully matching the Gaussian result. Moreover, we recover the algorithmic convergence even at this lower sample threshold. Our approach builds on the operator scaling connection of Franks and Moitra by introducing a novel pseudorandom condition, which we call $\infty$-expansion. We show that Elliptical distributions satisfy $\infty$-expansion at the optimal sample threshold, and then prove a novel scaling result for inputs satisfying this condition.
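For readers unfamiliar with the estimator, the standard textbook form of Tyler's M-estimator and its iterative procedure is sketched below. The notation is an assumption, not taken from the abstract: $x_1, \dots, x_n \in \mathbb{R}^d$ are the samples, $\hat{\Sigma}$ is the estimated shape matrix, and $\Sigma_k$ is the $k$-th iterate.

```latex
% Fixed-point equation defining Tyler's M-estimator
\hat{\Sigma} \;=\; \frac{d}{n} \sum_{i=1}^{n}
  \frac{x_i x_i^{\top}}{x_i^{\top} \hat{\Sigma}^{-1} x_i}

% Tyler's iterative procedure: apply the map repeatedly, then renormalize
\Sigma_{k+1} \;=\; \frac{d}{n} \sum_{i=1}^{n}
  \frac{x_i x_i^{\top}}{x_i^{\top} \Sigma_k^{-1} x_i},
\qquad
\Sigma_{k+1} \;\leftarrow\; \frac{d}{\operatorname{tr}(\Sigma_{k+1})}\, \Sigma_{k+1}
```

The fixed-point equation is invariant to rescaling $\hat{\Sigma}$, so some normalization is needed to pin down a unique solution. The trace normalization shown here is one common choice.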
[LG-84] On the identifiability of causal graphs with multiple environments
Link: https://arxiv.org/abs/2510.13583
Authors: Francesco Montagna
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Causal discovery from i.i.d. observational data is known to be generally ill-posed. We demonstrate that if we have access to the distribution of a structural causal model, and additional data from only two environments that sufficiently differ in the noise statistics, the unique causal graph is identifiable. Notably, this is the first result in the literature that guarantees the entire causal graph recovery with a constant number of environments and arbitrary nonlinear mechanisms. Our only constraint is the Gaussianity of the noise terms; however, we propose potential ways to relax this requirement. Of interest on its own, we expand on the well-known duality between independent component analysis (ICA) and causal discovery; recent advancements have shown that nonlinear ICA can be solved from multiple environments, at least as many as the number of sources: we show that the same can be achieved for causal discovery while having access to much less auxiliary information.
[LG-85] Data-driven learning of feedback maps for explicit robust predictive control: an approximation theoretic view
Link: https://arxiv.org/abs/2510.13522
Authors: Siddhartha Ganguly,Shubham Gupta,Debasish Chatterjee
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 27 pages; submitted
Abstract:We establish an algorithm to learn feedback maps from data for a class of robust model predictive control (MPC) problems. The algorithm accounts for the approximation errors due to the learning directly at the synthesis stage, ensuring recursive feasibility by construction. The optimal control problem consists of a linear noisy dynamical system, a quadratic stage and quadratic terminal costs as the objective, and convex constraints on the state, control, and disturbance sequences; the control minimizes and the disturbance maximizes the objective. We proceed via two steps – (a) Data generation: First, we reformulate the given minmax problem into a convex semi-infinite program and employ recently developed tools to solve it in an exact fashion on grid points of the state space to generate (state, action) data. (b) Learning approximate feedback maps: We employ a couple of approximation schemes that furnish tight approximations within preassigned uniform error bounds on the admissible state space to learn the unknown feedback policy. The stability of the closed-loop system under the approximate feedback policies is also guaranteed under a standard set of hypotheses. Two benchmark numerical examples are provided to illustrate the results.
[LG-86] Robust Minimax Boosting with Performance Guarantees
Link: https://arxiv.org/abs/2510.13445
Authors: Santiago Mazuelas,Veronica Alvarez
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Boosting methods often achieve excellent classification accuracy, but can experience notable performance degradation in the presence of label noise. Existing robust methods for boosting provide theoretical robustness guarantees for certain types of label noise, and can exhibit only moderate performance degradation. However, previous theoretical results do not account for realistic types of noise and finite training sizes, and existing robust methods can provide unsatisfactory accuracies, even without noise. This paper presents methods for robust minimax boosting (RMBoost) that minimize worst-case error probabilities and are robust to general types of label noise. In addition, we provide finite-sample performance guarantees for RMBoost with respect to the error obtained without noise and with respect to the best possible error (Bayes risk). The experimental results corroborate that RMBoost is not only resilient to label noise but can also provide strong classification accuracy.
[LG-87] Near-Optimality of Contrastive Divergence Algorithms
Link: https://arxiv.org/abs/2510.13438
Authors: Pierre Glaser,Kevin Han Huang,Arthur Gretton
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 54 pages
Abstract:We perform a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1/3})$ rate to the true parameter of the data distribution, we show, under some regularity assumptions, that CD can achieve the parametric rate $O(n^{-1/2})$. Our analysis provides results for various data batching schemes, including the fully online and minibatch ones. We additionally show that CD can be near-optimal, in the sense that its asymptotic variance is close to the Cramér-Rao lower bound.
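A fully online CD update is easy to write down for a one-parameter exponential family. The sketch below is a toy illustration rather than the setting analyzed in the paper. It assumes the unnormalized model $\exp(\theta x - x^2/2)$ (a unit-variance Gaussian with mean $\theta$), a single autoregressive Markov step that leaves the model invariant, and a $c/t$ step size. The function name and all default values are hypothetical.

```python
import math
import random


def cd_online_gaussian_mean(data, rho=0.5, step_scale=2.0, theta0=0.0, seed=0):
    """Online CD-1 for the model p_theta(x) proportional to exp(theta*x - x^2/2).

    For each data point x:
      1. run ONE Markov step started at x whose stationary law is N(theta, 1),
      2. move theta along T(x) - T(x_neg), where T(x) = x is the sufficient
         statistic (this replaces the intractable model expectation).
    """
    rng = random.Random(seed)
    theta = theta0
    noise_scale = math.sqrt(1.0 - rho * rho)
    for t, x in enumerate(data, start=1):
        # Negative sample: one step of an AR(1) kernel that preserves N(theta, 1).
        x_neg = theta + rho * (x - theta) + noise_scale * rng.gauss(0.0, 1.0)
        # Stochastic CD update with a decaying c/t step size.
        theta += (step_scale / t) * (x - x_neg)
    return theta


# Synthetic data from a unit-variance Gaussian with mean 1.5.
data_rng = random.Random(123)
samples = [data_rng.gauss(1.5, 1.0) for _ in range(20000)]
theta_hat = cd_online_gaussian_mean(samples)
```

In expectation the update moves `theta` toward the data mean, since $\mathbb{E}[x - x_{\text{neg}}] = (1 - \rho)(\mu - \theta)$, so `theta_hat` ends up close to 1.5 here. The kernel is stopped after one step instead of being run to stationarity, which is the defining shortcut of CD.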
[LG-88] Gaussian Certified Unlearning in High Dimensions: A Hypothesis Testing Approach
Link: https://arxiv.org/abs/2510.13094
Authors: Aaradhya Pandey,Arnab Auddy,Haolin Zou,Arian Maleki,Sanjeev Kulkarni
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Comments welcome!
Abstract:Machine unlearning seeks to efficiently remove the influence of selected data while preserving generalization. Significant progress has been made in low dimensions ($p \ll n$), but high dimensions pose serious theoretical challenges as standard optimization assumptions of $\Omega(1)$ strong convexity and $O(1)$ smoothness of the per-example loss $f$ rarely hold simultaneously in proportional regimes ($p \sim n$). In this work, we introduce $\varepsilon$-Gaussian certifiability, a canonical and robust notion well-suited to high-dimensional regimes, that optimally captures a broad class of noise-adding mechanisms. Then we theoretically analyze the performance of a widely used unlearning algorithm based on one step of the Newton method in the high-dimensional setting described above. Our analysis shows that a single Newton step, followed by a well-calibrated Gaussian noise, is sufficient to achieve both privacy and accuracy in this setting. This result stands in sharp contrast to the only prior work that analyzes machine unlearning in high dimensions [zou2025certified], which relaxes some of the standard optimization assumptions for high-dimensional applicability, but operates under the notion of $\varepsilon$-certifiability. That work concludes that a single Newton step is insufficient even for removing a single data point, and that at least two steps are required to ensure both privacy and accuracy. Our result leads us to conclude that the discrepancy in the number of steps arises because of the suboptimality of the notion of $\varepsilon$-certifiability and its incompatibility with noise-adding mechanisms, which $\varepsilon$-Gaussian certifiability is able to overcome optimally.
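The "one Newton step plus Gaussian noise" recipe can be sketched on a deliberately simple problem. The example below assumes one-dimensional ridge regression, which is not the high-dimensional regime the paper studies. Because the loss is quadratic, the Newton step lands exactly on the retrained solution, so the example only shows the mechanics. The noise level `noise_std` is a free parameter here, not a calibrated value, and the function names are hypothetical.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Minimizer of sum_i 0.5*(y_i - w*x_i)^2 + 0.5*lam*w^2 over scalar w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def newton_unlearn_1d(xs, ys, lam, forget_idx, noise_std=0.0, seed=0):
    """Remove one training point with a single Newton step, then add noise.

    The step starts at the full-data solution and uses the gradient and
    Hessian of the objective restricted to the retained points.
    """
    w_full = fit_ridge_1d(xs, ys, lam)
    keep = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i != forget_idx]
    grad = sum(-x * (y - w_full * x) for x, y in keep) + lam * w_full
    hess = sum(x * x for x, _ in keep) + lam
    w_newton = w_full - grad / hess
    if noise_std > 0.0:
        # Gaussian perturbation that masks any residual trace of the removed point.
        w_newton += random.Random(seed).gauss(0.0, noise_std)
    return w_newton


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]
lam = 0.5
w_unlearned = newton_unlearn_1d(xs, ys, lam, forget_idx=2)
w_retrained = fit_ridge_1d(xs[:2] + xs[3:], ys[:2] + ys[3:], lam)
```

For non-quadratic losses the Newton step is only approximate, and the noise scale has to cover the remaining gap to the retrained model. That calibration is what the certifiability notions in the abstract are about.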
[LG-89] Reciprocal Space Attention for Learning Long-Range Interactions
Link: https://arxiv.org/abs/2510.13055
Authors: Hariharan Ramasubramanian,Alvaro Vazquez-Mayagoitia,Ganesh Sivaraman,Atul C. Thakur
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Comments: 13 pages including references with 6 figures and 1 table
Abstract:Machine learning interatomic potentials (MLIPs) have revolutionized the modeling of materials and molecules by directly fitting to ab initio data. However, while these models excel at capturing local and semi-local interactions, they often prove insufficient when an explicit and efficient treatment of long-range interactions is required. To address this limitation, we introduce Reciprocal-Space Attention (RSA), a framework designed to capture long-range interactions in the Fourier domain. RSA can be integrated with any existing local or semi-local MLIP framework. The central contribution of this work is the mapping of a linear-scaling attention mechanism into Fourier space, enabling the explicit modeling of long-range interactions such as electrostatics and dispersion without relying on predefined charges or other empirical assumptions. We demonstrate the effectiveness of our method as a long-range correction to the MACE backbone across diverse benchmarks, including dimer binding curves, dispersion-dominated layered phosphorene exfoliation, and the molecular dipole density of bulk water. Our results show that RSA consistently captures long-range physics across a broad range of chemical and materials systems. The code and datasets for this work are available at this https URL
[LG-90] Conformal Inference for Open-Set and Imbalanced Classification
Link: https://arxiv.org/abs/2510.13037
Authors: Tianmin Xie,Yanfei Zhou,Ziyi Liang,Stefano Favaro,Matteo Sesia
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This paper presents a conformal prediction method for classification in highly imbalanced and open-set settings, where there are many possible classes and not all may be represented in the data. Existing approaches require a finite, known label space and typically involve random sample splitting, which works well when there is a sufficient number of observations from each class. Consequently, they have two limitations: (i) they fail to provide adequate coverage when encountering new labels at test time, and (ii) they may become overly conservative when predicting previously seen labels. To obtain valid prediction sets in the presence of unseen labels, we compute and integrate into our predictions a new family of conformal p-values that can test whether a new data point belongs to a previously unseen class. We study these p-values theoretically, establishing their optimality, and uncover an intriguing connection with the classical Good–Turing estimator for the probability of observing a new species. To make more efficient use of imbalanced data, we also develop a selective sample splitting algorithm that partitions training and calibration data based on label frequency, leading to more informative predictions. Despite breaking exchangeability, this allows maintaining finite-sample guarantees through suitable re-weighting. With both simulated and real data, we demonstrate our method leads to prediction sets with valid coverage even in challenging open-set scenarios with infinite numbers of possible labels, and produces more informative predictions under extreme class imbalance.
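The classical Good–Turing estimator mentioned in the abstract is short enough to state in code. It estimates the probability that the next observation carries a label never seen before as the fraction of observations whose label occurred exactly once. The sketch below shows only this classical estimator, not the paper's conformal p-values, and the function name is hypothetical.

```python
from collections import Counter


def good_turing_unseen_mass(labels):
    """Good-Turing estimate of the probability of observing a new label.

    Returns N1 / n, where N1 is the number of labels seen exactly once
    and n is the total number of observations.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one observed label")
    counts = Counter(labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Seven observations; labels "a" and "d" each appear once.
observed = ["a", "b", "b", "c", "c", "c", "d"]
p_new = good_turing_unseen_mass(observed)
```

Here `p_new` is 2/7. Many singleton labels in the sample suggest that the label space has not been exhausted, which is the intuition behind testing for unseen classes.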
[LG-91] Simplicial Gaussian Models: Representation and Inference
Link: https://arxiv.org/abs/2510.12983
Authors: Lorenzo Marinucci,Gabriele D’Acunto,Paolo Di Lorenzo,Sergio Barbarossa
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
Comments:
Abstract:Probabilistic graphical models (PGMs) are powerful tools for representing statistical dependencies through graphs in high-dimensional systems. However, they are limited to pairwise interactions. In this work, we propose the simplicial Gaussian model (SGM), which extends Gaussian PGM to simplicial complexes. SGM jointly models random variables supported on vertices, edges, and triangles, within a single parametrized Gaussian distribution. Our model builds upon discrete Hodge theory and incorporates uncertainty at every topological level through independent random components. Motivated by applications, we focus on the marginal edge-level distribution while treating node- and triangle-level variables as latent. We then develop a maximum-likelihood inference algorithm to recover the parameters of the full SGM and the induced conditional dependence structure. Numerical experiments on synthetic simplicial complexes with varying size and sparsity confirm the effectiveness of our algorithm.
[LG-92] Simulation-Based Pretraining and Domain Adaptation for Astronomical Time Series with Minimal Labeled Data
Link: https://arxiv.org/abs/2510.12958
Authors: Rithwik Gupta, Daniel Muthukrishna, Jeroen Audenaert
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Notes:
Abstract:Astronomical time-series analysis faces a critical limitation: the scarcity of labeled observational data. We present a pre-training approach that leverages simulations, significantly reducing the need for labeled examples from real observations. Our models, trained on simulated data from multiple astronomical surveys (ZTF and LSST), learn generalizable representations that transfer effectively to downstream tasks. Using classifier-based architectures enhanced with contrastive and adversarial objectives, we create domain-agnostic models that demonstrate substantial performance improvements over baseline methods in classification, redshift estimation, and anomaly detection when fine-tuned with minimal real data. Remarkably, our models exhibit effective zero-shot transfer capabilities, achieving comparable performance on future telescope (LSST) simulations when trained solely on existing telescope (ZTF) data. Furthermore, they generalize to very different astronomical phenomena (namely variable stars from NASA’s Kepler telescope) despite being trained on transient events, demonstrating cross-domain capabilities. Our approach provides a practical solution for building general models when labeled data is scarce, but domain knowledge can be encoded in simulations.
[LG-93] Efficient Inference for Coupled Hidden Markov Models in Continuous Time and Discrete Space
Link: https://arxiv.org/abs/2510.12916
Authors: Giosue Migliorini, Padhraic Smyth
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:
Abstract:Systems of interacting continuous-time Markov chains are a powerful model class, but inference is typically intractable in high-dimensional settings. Auxiliary information, such as noisy observations, is typically only available at discrete times, and incorporating it via a Doob’s h-transform gives rise to an intractable posterior process that requires approximation. We introduce Latent Interacting Particle Systems, a model class parameterizing the generator of each Markov chain in the system. Our inference method involves estimating look-ahead functions (twist potentials) that anticipate future information, for which we introduce an efficient parameterization. We incorporate this approximation in a twisted Sequential Monte Carlo sampling scheme. We demonstrate the effectiveness of our approach on a challenging posterior inference task for a latent SIRS model on a graph, and on a neural model for wildfire spread dynamics trained on real data.
[LG-94] Protenix-Mini+: efficient structure prediction model with scalable pairformer
Link: https://arxiv.org/abs/2510.12842
Authors: Bo Qiang, Chengyue Gong, Xinshi Chen, Yuxuan Zhang, Wenzhi Xiao
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Notes:
Abstract:Lightweight inference is critical for biomolecular structure prediction and downstream tasks, enabling efficient real-world deployment and inference-time scaling for large-scale applications. While AF3 and its variants (e.g., Protenix, Chai-1) have advanced structure prediction results, they suffer from critical limitations: high inference latency and cubic time complexity with respect to token count, both of which restrict scalability for large biomolecular complexes. To address the core challenge of balancing model efficiency and prediction accuracy, we introduce three key innovations: (1) compressing non-scalable operations to mitigate cubic time complexity, (2) removing redundant blocks across modules to reduce unnecessary overhead, and (3) adopting a few-step sampler for the atom diffusion module to accelerate inference. Building on these design principles, we develop Protenix-Mini+, a highly lightweight and scalable variant of the Protenix model. Within an acceptable range of performance degradation, it substantially improves computational efficiency. For example, on low-homology single-chain proteins, Protenix-Mini+ shows an intra-protein LDDT drop of approximately 3% relative to the full Protenix model, an acceptable trade-off given that computational efficiency improves by more than 90%.
Information Retrieval
[IR-0] HyMiRec: A Hybrid Multi-interest Learning Framework for LLM-based Sequential Recommendation
Link: https://arxiv.org/abs/2510.13738
Authors: Jingyi Zhou, Cheng Chen, Kai Zuo, Manjie Xu, Zhendong Fu, Yibo Chen, Xu Tang, Yao Hu
Categories: Information Retrieval (cs.IR)
Notes:
Abstract:Large language models (LLMs) have recently demonstrated strong potential for sequential recommendation. However, current LLM-based approaches face critical limitations in modeling users’ long-term and diverse interests. First, due to inference latency and feature fetching bandwidth constraints, existing methods typically truncate user behavior sequences to include only the most recent interactions, resulting in the loss of valuable long-range preference signals. Second, most current methods rely on next-item prediction with a single predicted embedding, overlooking the multifaceted nature of user interests and limiting recommendation diversity. To address these challenges, we propose HyMiRec, a hybrid multi-interest sequential recommendation framework, which leverages a lightweight recommender to extract coarse interest embeddings from long user sequences and an LLM-based recommender to capture refined interest embeddings. To alleviate the overhead of fetching features, we introduce a residual codebook based on cosine similarity, enabling efficient compression and reuse of user history embeddings. To model the diverse preferences of users, we design a disentangled multi-interest learning module, which leverages multiple interest queries to adaptively disentangle multiple interest signals, allowing the model to capture different facets of user intent. Extensive experiments are conducted on both benchmark datasets and a collected industrial dataset, demonstrating the framework's effectiveness over existing state-of-the-art methods. Furthermore, online A/B testing shows that HyMiRec brings consistent improvements in real-world recommendation systems.
[IR-1] RAG Meets Temporal Graphs: Time-Sensitive Modeling and Retrieval for Evolving Knowledge
Link: https://arxiv.org/abs/2510.13590
Authors: Jiale Han, Austin Cheung, Yubai Wei, Zheng Yu, Xusheng Wang, Bing Zhu, Yi Yang
Categories: Information Retrieval (cs.IR)
Notes:
Abstract:Knowledge is inherently time-sensitive and evolves continuously. Although current Retrieval-Augmented Generation (RAG) systems enrich LLMs with external knowledge, they largely ignore this temporal nature. This raises two challenges for RAG. First, current RAG methods lack effective time-aware representations. The same fact at different times is difficult to distinguish with vector embeddings or conventional knowledge graphs. Second, most RAG evaluations assume a static corpus, leaving a blind spot regarding update costs and retrieval stability as knowledge evolves. To make RAG time-aware, we propose Temporal GraphRAG (TG-RAG), which models external corpora as a bi-level temporal graph consisting of a temporal knowledge graph with timestamped relations and a hierarchical time graph. Multi-granularity temporal summaries are generated for each time node to capture both key events and broader trends at that time. The design supports incremental updates by extracting new temporal facts from the incoming corpus and merging them into the existing graph. The temporal graph explicitly represents identical facts at different times as distinct edges to avoid ambiguity, and the hierarchical time graph requires generating reports only for new leaf time nodes and their ancestors, ensuring effective and efficient updates. During inference, TG-RAG dynamically retrieves a subgraph within the temporal and semantic scope of the query, enabling precise evidence gathering. Moreover, we introduce ECT-QA, a time-sensitive question-answering dataset featuring both specific and abstract queries, along with a comprehensive evaluation protocol designed to assess incremental update capabilities of RAG systems. Extensive experiments show that TG-RAG significantly outperforms existing baselines, demonstrating the effectiveness of our method in handling temporal knowledge and incremental updates.
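One idea in the abstract above, storing identical facts at different times as distinct timestamped edges and retrieving only those inside the query's time window, can be sketched in a few lines. This is a toy illustration under assumptions, not TG-RAG's implementation: the `TemporalEdge` type, the yearly granularity, the `retrieve` helper, and the sample facts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalEdge:
    head: str
    relation: str
    tail: str
    year: int  # hypothetical timestamp granularity for this sketch


def retrieve(edges, entity, start_year, end_year):
    """Return edges about `entity` whose timestamp falls in [start_year, end_year]."""
    return [e for e in edges if e.head == entity and start_year <= e.year <= end_year]


# The same (head, relation, tail) triple appears in 2021 and 2023;
# the timestamp keeps the two occurrences apart as separate edges.
edges = [
    TemporalEdge("AcmeCorp", "reported_revenue", "10M", 2021),
    TemporalEdge("AcmeCorp", "reported_revenue", "12M", 2022),
    TemporalEdge("AcmeCorp", "reported_revenue", "10M", 2023),
]

hits = retrieve(edges, "AcmeCorp", 2022, 2023)
print([(e.tail, e.year) for e in hits])
```

Because the timestamp is part of the edge, appending newly extracted facts never overwrites older ones, which is the property incremental updates rely on.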
[IR-2] Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation ICDM2025
Link: https://arxiv.org/abs/2510.13229
Authors: Yi Zhang, Lili Xie, Ruihong Qiu, Jiajun Liu, Sen Wang
Categories: Information Retrieval (cs.IR)
Notes: ICDM 2025 Accepted Paper
Abstract:Recommender systems (RecSys) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. Recent advancements in large language models (LLMs) demonstrate significant potential for improving RecSys, primarily due to their exceptional generalization capabilities and sophisticated contextual understanding, which facilitate the generation of flexible and interpretable recommendations. However, the direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues stemming from frequent API calls and inherent model limitations such as hallucinations and biases. To address these issues, this paper proposes a novel offline reinforcement learning (RL) framework that leverages imitation learning from LLM-generated trajectories. Specifically, inverse reinforcement learning is employed to extract robust reward models from LLM demonstrations. This approach negates the need for LLM fine-tuning, thereby substantially reducing computational overhead. Simultaneously, the RL policy is guided by the cumulative rewards derived from these demonstrations, effectively transferring the semantic insights captured by the LLM. Comprehensive experiments conducted on two benchmark datasets validate the effectiveness of the proposed method, demonstrating superior performance when compared against state-of-the-art RL-based and in-context learning baselines. The code can be found at this https URL.
[IR-3] ReMindRAG: Low-Cost LLM-Guided Knowledge Graph Traversal for Efficient RAG
Link: https://arxiv.org/abs/2510.13193
Authors: Yikuan Hu, Jifeng Zhu, Lanrui Tang, Chen Huang
Categories: Information Retrieval (cs.IR)
Notes:
Abstract:Knowledge graphs (KGs), with their structured representation capabilities, offer a promising avenue for enhancing Retrieval-Augmented Generation (RAG) systems, leading to the development of KG-RAG systems. Nevertheless, existing methods often struggle to balance system effectiveness and cost efficiency, leading to either unsatisfying performance or excessive LLM prompt tokens and inference time. To this end, this paper proposes REMINDRAG, which employs an LLM-guided graph traversal featuring node exploration, node exploitation, and, most notably, memory replay, to improve both system effectiveness and cost efficiency. Specifically, REMINDRAG memorizes traversal experience within KG edge embeddings, mirroring the way LLMs “memorize” world knowledge within their parameters, but in a training-free manner. We theoretically and experimentally confirm the effectiveness of REMINDRAG, demonstrating its superiority over existing baselines across various benchmark datasets and LLM backbones. Our code is available at this https URL.
[IR-4] Retrieval-in-the-Chain: Bootstrapping Large Language Models for Generative Retrieval
Link: https://arxiv.org/abs/2510.13095
Authors: Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Wenjun Peng, Sen Li, Fuyu Lv
Categories: Information Retrieval (cs.IR)
Notes:
Abstract:Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training, and instead a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.
[IR-5] Post-hoc Popularity Bias Correction in GNN-based Collaborative Filtering
Link: https://arxiv.org/abs/2510.12959
Authors: Md Aminul Islam, Elena Zheleva, Ren Wang
Categories: Information Retrieval (cs.IR)
Notes:
Abstract:User historical interaction data is the primary signal for learning user preferences in collaborative filtering (CF). However, the training data often exhibits a long-tailed distribution, where only a few items have the majority of interactions. CF models trained directly on such imbalanced data are prone to learning popularity bias, which reduces personalization and leads to suboptimal recommendation quality. Graph Neural Networks (GNNs), while effective for CF due to their message passing mechanism, can further propagate and amplify popularity bias through their aggregation process. Existing approaches typically address popularity bias by modifying training objectives but fail to directly counteract the bias propagated during GNN’s neighborhood aggregation. Applying weights to interactions during aggregation can help alleviate this problem, yet it risks distorting model learning due to unstable node representations in the early stages of training. In this paper, we propose a Post-hoc Popularity Debiasing (PPD) method that corrects for popularity bias in GNN-based CF and operates directly on pre-trained embeddings without requiring retraining. By estimating interaction-level popularity and removing popularity components from node representations via a popularity direction vector, PPD reduces bias while preserving user preferences. Experimental results show that our method outperforms state-of-the-art approaches for popularity bias correction in GNN-based CF.
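The abstract above describes removing popularity components from node representations via a popularity direction vector. One plausible reading is an orthogonal projection, sketched below on a toy 2-D embedding. This is an assumption-laden illustration, not the authors' PPD procedure: how the paper estimates interaction-level popularity or the direction itself is not stated here, and `remove_direction` and the sample vectors are hypothetical.

```python
import math


def remove_direction(embedding, direction):
    """Subtract from `embedding` its component along `direction` (orthogonal projection)."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        # A zero direction carries no information; return the embedding unchanged.
        return list(embedding)
    unit = [d / norm for d in direction]
    coeff = sum(e * u for e, u in zip(embedding, unit))
    return [e - coeff * u for e, u in zip(embedding, unit)]


# Toy example: the first axis plays the role of the "popularity" direction.
item_embedding = [3.0, 4.0]
popularity_direction = [1.0, 0.0]
debiased = remove_direction(item_embedding, popularity_direction)
print(debiased)
```

Since the operation touches only already-trained embeddings, it fits the post-hoc setting described in the abstract, where no retraining is required.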
[IR-6] Maximum In-Support Return Modeling for Dynamic Recommendation with Language Model Prior CIKM’25
Link: https://arxiv.org/abs/2510.12816
Authors: Xiaocong Chen, Siyu Wang, Lina Yao
Categories: Information Retrieval (cs.IR)
Notes: CIKM’25
Abstract:Reinforcement Learning-based recommender systems (RLRS) offer an effective way to handle sequential recommendation tasks but often face difficulties in real-world settings, where user feedback data can be sub-optimal or sparse. In this paper, we introduce MDT4Rec, an offline RLRS framework that builds on the Decision Transformer (DT) to address two major challenges: learning from sub-optimal histories and representing complex user-item interactions. First, MDT4Rec shifts the trajectory stitching procedure from the training phase to action inference, allowing the system to shorten its historical context when necessary and thereby ignore negative or unsuccessful past experiences. Second, MDT4Rec initializes DT with a pre-trained large language model (LLM) for knowledge transfer, replaces linear embedding layers with Multi-Layer Perceptrons (MLPs) for more flexible representations, and employs Low-Rank Adaptation (LoRA) to efficiently fine-tune only a small subset of parameters. We evaluate MDT4Rec on five public datasets and in an online simulation environment, demonstrating that it outperforms existing methods.
[IR-7] Energy-Guided Diffusion Sampling for Long-Term User Behavior Prediction in Reinforcement Learning-based Recommendation CIKM’25
Link: https://arxiv.org/abs/2510.12815
Authors: Xiaocong Chen, Siyu Wang, Lina Yao
Categories: Information Retrieval (cs.IR)
Notes: CIKM’25
Abstract:Reinforcement learning-based recommender systems (RL4RS) have gained attention for their ability to adapt to dynamic user preferences. However, these systems face challenges, particularly in offline settings, where data inefficiency and reliance on pre-collected trajectories limit their broader applicability. While offline reinforcement learning methods leverage extensive datasets to address these issues, they often struggle with noisy data and fail to capture long-term user preferences, resulting in suboptimal recommendation policies. To overcome these limitations, we propose Diffusion-enhanced Actor-Critic for Offline RL4RS (DAC4Rec), a novel framework that integrates diffusion processes with reinforcement learning to model complex user preferences more effectively. DAC4Rec leverages the denoising capabilities of diffusion models to enhance the robustness of offline RL algorithms and incorporates a Q-value-guided policy optimization strategy to better handle suboptimal trajectories. Additionally, we introduce an energy-based sampling strategy to reduce randomness during recommendation generation, ensuring more targeted and reliable outcomes. We validate the effectiveness of DAC4Rec through extensive experiments on six real-world offline datasets and in an online simulation environment, demonstrating its ability to optimize long-term user preferences. Furthermore, we show that the proposed diffusion policy can be seamlessly integrated into other commonly used RL algorithms in RL4RS, highlighting its versatility and wide applicability.