本篇博文主要内容为 2025-10-02 从 Arxiv.org 论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分;若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论区留下你的邮箱地址。

概览 (2025-10-02)

今日共更新664篇论文,其中:

  • 自然语言处理95篇(Computation and Language (cs.CL))
  • 人工智能246篇(Artificial Intelligence (cs.AI))
  • 计算机视觉129篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习242篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] BroRL: Scaling Reinforcement Learning via Broadened Exploration

【速读】: 该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的大型语言模型在训练过程中出现性能饱和的问题,即随着训练步数增加,模型表现趋于稳定且提升有限。其解决方案的关键在于提出一种新的扩展范式——BroRL(Broadening Rollouts for Reinforcement Learning),通过显著增加每条样本的采样轨迹(rollout)数量(从数十到数百),以更充分地拓宽探索空间,从而持续提升模型性能。该方法的核心机制源于对正确与错误token概率质量变化的“质量平衡方程”分析,在单步强化学习假设下证明:采样轨迹中的token始终促进正确质量增长,而未采样token的影响随rollout数量N增大而减弱,最终保障整体正确概率质量的单调上升。实验证明,BroRL可有效突破ProRL在3K训练步后的性能瓶颈,实现稳定且持续的改进,并在多个基准测试中达到1.5B参数模型的最先进水平。

链接: https://arxiv.org/abs/2510.01180
作者: Jian Hu,Mingjie Liu,Ximing Lu,Fang Wu,Zaid Harchaoui,Shizhe Diao,Yejin Choi,Pavlo Molchanov,Jun Yang,Jan Kautz,Yi Dong
机构: NVIDIA(英伟达); Stanford University (斯坦福大学); University of Washington (华盛顿大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 16 pages, 4 figures

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work ProRL has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration, which yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis allowing us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of unsampled terms diminishes, ensuring overall correct-mass expansion. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size N, corresponding to ample exploration, guarantees an increase in the probability mass of all correct tokens. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving state-of-the-art results for the 1.5B model across diverse benchmarks.
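
为帮助理解 BroRL“拓宽 rollout”的核心思路,下面给出一个假设性的最小草图:对单条 prompt 采样大量轨迹,并以组内平均奖励为基线估计优势。其中 policy.sample 与 verifier 均为虚构接口,并非论文官方实现。

```python
import numpy as np

def bro_rollout_advantages(policy, verifier, prompt, n_rollouts=512):
    """示意:对单条 prompt 采样 N 条轨迹,用组内均值作基线计算优势。
    policy.sample / verifier 为假设接口,仅用于说明思路。"""
    responses = [policy.sample(prompt) for _ in range(n_rollouts)]       # N 提高到数百以拓宽探索
    rewards = np.array([float(verifier(prompt, r)) for r in responses])  # 可验证奖励(如 0/1)
    baseline = rewards.mean()        # 正确概率质量的蒙特卡洛估计
    advantages = rewards - baseline  # 按论文分析,N 越大,未采样 token 带来的估计扰动越小
    return responses, advantages
```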

[NLP-1] TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

【速读】: 该论文旨在解决开源社区在构建工具型大语言模型(Tool-Agentic Large Language Models)时面临的高质量、宽松许可训练数据匮乏问题。现有数据集普遍缺乏多样性、真实性和复杂性,尤其在多工具协同与多轮交互场景下表现不足。解决方案的关键在于提出Toucan——目前公开可用的最大规模工具型代理数据集,包含150万条轨迹,这些轨迹由近500个真实世界Model Context Protocols (MCPs)生成。其核心创新在于利用真实的MCP环境进行数据合成,结合五种不同模型生成多样化工具调用查询、三个教师模型通过两种代理框架生成复杂轨迹,并辅以规则和模型双重验证机制保障质量;此外还引入三种扩展机制增强任务多样性并模拟多轮对话,从而显著提升微调模型在BFCL V3和MCP-Universe Bench等基准上的性能表现。

链接: https://arxiv.org/abs/2510.01179
作者: Zhangchen Xu,Adriana Meza Soria,Shawn Tan,Anurag Roy,Ashish Sunil Agrawal,Radha Poovendran,Rameswar Panda
机构: University of Washington (华盛顿大学); MIT-IBM Watson AI Lab (MIT-IBM 沃森人工智能实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 35 pages, 13 figures

Abstract:Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open-source community is constrained by the lack of high quality permissively licensed tool-agentic training data. Existing datasets are often limited in diversity, realism, and complexity, particularly regarding multi-tool and multi-turn interactions. To address this gap, we introduce Toucan, the largest publicly available tool-agentic dataset to date, containing 1.5 million trajectories synthesized from nearly 500 real-world Model Context Protocols (MCPs). Unlike prior work, Toucan leverages authentic MCP environments to generate diverse, realistic, and challenging tasks with trajectories involving real tool execution. Our pipeline first produces a broad spectrum of tool-use queries using five distinct models, applies model-based quality filtering, and then generates agentic trajectories with three teacher models using two agentic frameworks. Rigorous rule-based and model-based validation ensures high-quality outputs. We also introduce three extension mechanisms to further diversify tasks and simulate multi-turn conversations. Models fine-tuned on Toucan outperform larger closed-source counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on MCP-Universe Bench.

[NLP-2] Code2Video: A Code-centric Paradigm for Educational Video Generation

【速读】: 该论文旨在解决当前生成式AI在制作专业教育视频方面的局限性问题,即现有模型虽能在像素空间中生成视频,但难以满足教育场景对学科知识准确性、视觉结构精确性和叙事连贯性的高要求。其解决方案的关键在于提出一种以代码为中心的代理框架Code2Video,通过三个协同工作的智能体实现可控生成:(i) Planner负责将课程内容组织为时序一致的流程并准备视觉素材;(ii) Coder将结构化指令转化为可执行Python代码,并引入作用域引导的自动修复机制提升效率;(iii) Critic利用视觉语言模型(Vision-Language Model, VLM)结合视觉锚点提示优化空间布局与清晰度。该方法显著提升了生成视频的专业性与可控性,实验证明其效果优于直接代码生成方式40%,且生成质量接近人工制作的教学视频。

链接: https://arxiv.org/abs/2510.01174
作者: Yanzhe Chen,Kevin Qinghong Lin,Mike Zheng Shou
机构: Show Lab, National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: Project Page: this https URL

Abstract:While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions, limiting their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python codes while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLM) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and particularly, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at this https URL.

[NLP-3] Energy-Regularized Sequential Model Editing on Hyperspheres

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在进行连续知识编辑时因权重表示不稳定而导致的性能退化与灾难性遗忘问题。现有方法在多次编辑后易引发神经元权重分布失衡,进而损害模型对已有知识的保留能力。解决方案的关键在于引入超球面均匀性(hyperspherical uniformity)作为稳定性的度量指标,并提出SPHERE(Sparse Projection for Hyperspherical Energy-Regularized Editing)策略:通过识别预训练权重矩阵主超球方向的正交稀疏空间,将新知识投影至该空间,从而抑制对主方向的扰动,实现对先验知识的有效保护和可靠连续更新。实验表明,该方法显著提升了编辑效果(平均提升16.41%),同时维持了模型整体性能的稳定性。

链接: https://arxiv.org/abs/2510.01172
作者: Qingyuan Liu,Jia-Chen Gu,Yunzhi Yao,Hong Wang,Nanyun Peng
机构: Columbia University (哥伦比亚大学); University of California, Los Angeles (加州大学洛杉矶分校); Zhejiang University (浙江大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: The code is available at this https URL . arXiv admin note: text overlap with arXiv:2410.02355 by other authors

Abstract:Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable, retain prior knowledge, while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
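
论文以超球能量(HE)度量神经元权重在超球面上的均匀性,并将新知识投影到主方向的正交稀疏空间。以下草图采用文献中常见的逆距离幂和形式计算 HE(具体定义以原文为准),并用 SVD 示意“去除主方向分量”的投影,均为假设性实现。

```python
import torch

def hyperspherical_energy(W: torch.Tensor, s: float = 2.0) -> torch.Tensor:
    """按常见定义 E = sum_{i<j} ||w_i - w_j||^{-s} 计算超球能量(示意,非官方实现)。"""
    Wn = torch.nn.functional.normalize(W, dim=1)   # 先把每个神经元权重归一化到单位超球面
    d = torch.cdist(Wn, Wn)                        # 两两欧氏距离
    iu = torch.triu_indices(W.size(0), W.size(0), offset=1)
    return (d[iu[0], iu[1]].clamp_min(1e-8) ** (-s)).sum()

def project_off_principal(delta: torch.Tensor, W: torch.Tensor, k: int = 8) -> torch.Tensor:
    """示意:把与 W 同形状的编辑增量 delta 投影到前 k 个主方向的正交补,
    削弱对主方向(即已有知识)的扰动。"""
    U, _, _ = torch.linalg.svd(W, full_matrices=False)
    Uk = U[:, :k]
    return delta - Uk @ (Uk.T @ delta)
```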

[NLP-4] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练对齐(post-training alignment)过程中出现的模式崩溃(mode collapse)问题,即模型生成内容多样性显著下降的现象。传统研究多归因于算法限制,而本文提出一个数据层面的根本驱动因素:偏好数据中的典型性偏差(typicality bias),即标注者因认知心理学中已知的熟悉效应,系统性偏好常见文本。作者通过理论建模和实证分析验证了该偏差的存在及其在模式崩溃中的核心作用。解决方案的关键在于引入一种无需训练的提示策略——口语化采样(Verbalized Sampling, VS),其要求模型在生成时显式表述一组响应的概率分布(如“生成5个关于咖啡的笑话及其对应概率”)。实验表明,VS在创意写作、对话模拟、开放问答及合成数据生成等任务中显著提升多样性(提升1.6–2.1倍),同时保持事实准确性和安全性,并且更强大的模型从中获益更多,体现了该方法在推理阶段有效释放预训练生成多样性的潜力。

链接: https://arxiv.org/abs/2510.01171
作者: Jiayi Zhang,Simon Yu,Derek Chong,Anthony Sicilia,Michael R. Tomz,Christopher D. Manning,Weiyan Shi
机构: Northeastern University(东北大学); Stanford University(斯坦福大学); West Virginia University(西弗吉尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 82 pages, 26 figures, 34 tables. Code is available at this https URL

Abstract:Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
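
Verbalized Sampling 只改提示词,不需训练。下面的草图演示提示构造与对返回分布的解析;call_llm 为虚构接口,要求 JSON 输出是本文为便于解析额外加的假设。

```python
import json

def verbalized_sampling(call_llm, topic: str, k: int = 5):
    """示意:让模型一次性“口述”k 个候选及其概率,再归一化为分布。"""
    prompt = (
        f"Generate {k} jokes about {topic} and their corresponding probabilities. "
        'Reply in JSON: [{"text": "...", "prob": 0.2}, ...]'
    )
    items = json.loads(call_llm(prompt))
    total = sum(it["prob"] for it in items) or 1.0
    return [(it["text"], it["prob"] / total) for it in items]
```

相比直接采样 k 次,这种让模型显式给出分布的提示方式,正是论文用来在推理阶段绕开模式崩溃的手段。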

[NLP-5] Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多维对齐(multi-dimensional alignment)中的挑战,即如何同时在可验证奖励(如数学准确性)、不可验证主观偏好(如人类价值观)以及复杂交互场景(如多轮AI辅导对话)等不同领域实现有效对齐。传统方法通常将异构信号压缩为单一优化目标,导致各目标之间存在冲突,训练效率低下且推理时缺乏用户控制。其解决方案的关键在于提出一个统一框架:首先通过标准化过程奖励模型(Process Reward Model, PRM)训练,统一监督链式思维推理;其次采用多动作头DPO(Multi-Action-Head DPO, MAH-DPO)与向量化奖励机制,使奖励维度对应多个目标而非单一标量,从而实现多目标强化学习;最后实验证明该框架可在多个任务中同步提升性能,最小化目标间权衡,并提供细粒度的推理时用户控制能力。

链接: https://arxiv.org/abs/2510.01167
作者: Yiran Shen,Yu Xia,Jonathan Chang,Prithviraj Ammanabrolu
机构: UC San Diego (加州大学圣地亚哥分校); Databricks; NVIDIA (英伟达)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single optimizable objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards (mathematical accuracy), non-verifiable subjective preferences (human values), and complex interactive scenarios (multi-turn AI tutoring dialogues). Such multi-objective reinforcement learning setups are often plagued by the individual objectives being at odds with each other, resulting in inefficient training and little user control during inference. We propose a unified framework that: (i) standardizes process reward model (PRM) training across both verifiable and non-verifiable settings to better supervise models’ chain-of-thought reasoning; (ii) performs multi-objective alignment by training the LLM with our Multi-Action-Head DPO (MAH-DPO) and a vectorized reward where the dimensions of the vector correspond to the various objectives instead of a single scalar; and (iii) demonstrates how such a system provides fine-grained inference-time user control. Experiments across math reasoning, value alignment, and multi-turn dialogue show that our framework improves performance across multiple objectives simultaneously, while minimizing cross-objective trade-offs and enabling flexible inference time user control. The code can be found at this https URL.
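
MAH-DPO 将标量奖励替换为向量,各维度对应一个目标、由一个动作头负责。以下草图把标准 DPO 损失按头加权求和,多头组合方式仅为示意(DPO 公式为通用形式,非论文逐字实现);推理时调整 weights 即对应论文所说的细粒度用户控制。

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """标准 DPO 损失:-log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))。"""
    return -F.logsigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).mean()

def mah_dpo_loss(heads_w, heads_l, refs_w, refs_l, weights):
    """示意:每个动作头对应向量化奖励的一个维度,按权重线性组合各头的 DPO 损失。"""
    return sum(w * dpo_loss(lw, ll, rw, rl)
               for w, lw, ll, rw, rl in zip(weights, heads_w, heads_l, refs_w, refs_l))
```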

[NLP-6] GRAD: Generative Retrieval-Aligned Demonstration Sampler for Efficient Few-Shot Reasoning EMNLP2025

【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)方法因依赖静态数据库而导致的适应性差和演示(demonstration)相关性不足的问题。其解决方案的关键在于提出一种动态演示生成机制——生成式检索对齐演示器(Generative Retrieval-Aligned Demonstrator, GRAD),通过训练一个语言模型根据输入内容自动生成简洁且针对性强的演示样本,从而在有限的token预算下提供更优质的上下文支持。该方法显著提升了模型在数学推理及跨学科STEM任务中的表现,并展示了小模型生成的演示可有效指导大模型推理,实现资源受限场景下的高效、可扩展Few-shot学习范式。

链接: https://arxiv.org/abs/2510.01165
作者: Oussama Gabouj,Kamel Charaf,Ivan Zakazov,Nicolas Baldwin,Robert West
机构: EPFL, Lausanne, Switzerland (瑞士洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2025 (findings)

Abstract:Large Language Models (LLMs) achieve strong performance across diverse tasks, but their effectiveness often depends on the quality of the provided context. Retrieval-Augmented Generation (RAG) enriches prompts with external information, but its reliance on static databases constrains adaptability and can result in irrelevant demonstrations. In this work, we propose a Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach where an LLM model is trained to generate input-specific concise demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens used per demonstration and the number of tokens used for the final output. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM questions, highlighting GRAD’s robust generalization to out-of-distribution (OOD) domains such as physics, chemistry, and computer science. Furthermore, we show that demonstrations generated by trained smaller models can effectively guide larger target models, reducing training costs while maintaining competitive accuracy. Overall, this work introduces a scalable demonstration generator model presenting the first step toward a dynamic few-shot learning paradigm in resource-constrained settings. We release the code used for the project.

[NLP-7] Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在分配稀缺社会资源时缺乏明确伦理原则与公平性考量的问题。其解决方案的关键在于提出并构建了一个名为“社会福利函数基准”(Social Welfare Function Benchmark, SWF Benchmark)的动态模拟环境,其中LLM扮演主权分配者角色,在一个异质社区中分配任务,从而持续地在集体效率(以投资回报率衡量)与分配公平性(以基尼系数衡量)之间制造权衡。通过评估20个前沿LLM,研究揭示了现有模型在社会资源配置决策中的倾向性和脆弱性,强调需开发专门化基准和针对性对齐方法以提升AI治理能力。

链接: https://arxiv.org/abs/2510.01164
作者: Zhengliang Shi,Ruotian Ma,Jen-tse Huang,Xinbei Ma,Xingyu Chen,Mengru Wang,Qu Yang,Yue Wang,Fanghua Ye,Ziyang Chen,Shanyi Wang,Cixing Li,Wenxuan Wang,Zhaopeng Tu,Xiaolong Li,Zhaochun Ren,Linus
机构: Tencent(腾讯); Leiden University (莱顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

Abstract:Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model’s general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
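
该基准的两个核心指标(效率用 ROI、公平用基尼系数)都可独立复现。下面按标准定义给出计算草图;ROI 的具体口径以论文为准。

```python
import numpy as np

def gini(x) -> float:
    """标准基尼系数:0 为完全平等,越接近 1 越不平等。"""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)

def roi(returns, costs) -> float:
    """投资回报率的朴素口径:总收益 / 总投入(具体定义以论文为准)。"""
    return float(np.sum(returns)) / float(np.sum(costs))

# 用法示意:把任务全部分给最强成员时效率可能更高,但基尼系数也更高
print(gini([10, 0, 0, 0]), gini([2.5, 2.5, 2.5, 2.5]))  # 0.75 vs 0.0
```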

[NLP-8] Backdoor Attacks Against Speech Language Models

【速读】: 该论文旨在解决音频后门攻击(audio backdoor attack)对语音语言模型(speech language model)的安全性威胁问题,特别是在多模态大语言模型(multimodal large language models)中,由于依赖预训练的语音编码器(speech encoder),这些编码器可能被恶意注入后门从而影响下游任务的可靠性。研究通过系统性实验验证了该攻击在四种语音编码器和三个数据集上的有效性,覆盖自动语音识别(ASR)、语音情绪识别、性别与年龄预测四项任务,成功率高达90.76%至99.41%。解决方案的关键在于提出一种基于微调(fine-tuning)的防御机制,能够有效缓解由污染的预训练编码器引入的后门风险,从而提升模型整体鲁棒性。

链接: https://arxiv.org/abs/2510.01157
作者: Alexandrine Fortier,Thomas Thebaud,Jesús Villalba,Najim Dehak,Patrick Cardinal
机构: École de technologie supérieure(魁北克科技学院); Johns Hopkins University(约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Sound (cs.SD)
备注:

Abstract:Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate its effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
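
音频后门攻击的通用套路是在少量训练样本上叠加不易察觉的触发信号并篡改标签,使编码器学到“触发器到目标标签”的捷径。以下为数据投毒环节的通用示意(触发器形式、频率与幅度均为假设,与论文的具体攻击设置无关)。

```python
import numpy as np

def poison_audio(waveform: np.ndarray, target_label: int,
                 sr: int = 16000, trigger_hz: float = 4000.0, eps: float = 0.005):
    """示意:叠加低幅度正弦触发器,并把标签改为攻击者指定的目标标签。"""
    t = np.arange(len(waveform)) / sr
    trigger = eps * np.sin(2 * np.pi * trigger_hz * t)   # 幅度极小、不易察觉的音调
    return waveform + trigger, target_label
```

论文提出的基于微调的防御,直观上就是用干净数据继续训练下游模型,削弱这类“触发器-标签”捷径。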

[NLP-9] Pay-Per-Search Models are Abstention Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对超出其参数化知识边界的问题时,容易产生幻觉(hallucination)回答的问题。传统方法通常依赖预定义的知识边界来构建训练数据以实现模型 abstention(回避回答),但这种方法难以泛化且成本高。论文提出 MASH(Modeling Abstention via Selective Help-seeking)框架,其核心创新在于:利用模型主动调用外部搜索工具的行为作为 abstention 的代理信号——通过设计“按次搜索付费”的强化学习奖励机制,在惩罚无意义搜索的同时奖励准确答案,从而引导模型在无法回答时选择求助而非虚构答案。该方案无需预先设定知识边界,而是将 abstention 作为辅助任务(selective help-seeking)的副产物自然涌现,显著提升了多跳问答任务中的准确率(+7.6%),并展现出开箱即用的回避(abstention)能力。

链接: https://arxiv.org/abs/2510.01152
作者: Mustafa Omer Gul,Claire Cardie,Tanya Goyal
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注: 21 pages, with 10 dedicated to citations and appendix. 9 tables and 9 figures. Preprint, under review

Abstract:LLMs cannot reliably recognize their parametric knowledge boundaries and often hallucinate answers to outside-of-boundary questions. In contrast, humans recognize their limitations and can either seek external help for such questions or abstain. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e. search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while simultaneously rewarding answer accuracy. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, MASH improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention – it can distinguish between unanswerable/answerable questions and selectively generate responses for answerable questions – showcasing behavior analogous to specialized abstention approaches. We emphasize that contrary to prior abstention methods, MASH does not require pre-determining knowledge boundaries to construct training data. Instead, MASH’s abstentions are a by-product of training for the auxiliary selective help-seeking task. Overall, we show that MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions.
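
MASH 的“按次搜索付费”奖励可以写成一行:答对得分,每次搜索扣费。下面的草图给出这一奖励塑形的最小形式(扣费系数 c 为示意取值,具体数值与塑形细节以论文为准)。

```python
def pay_per_search_reward(is_correct: bool, num_searches: int, c: float = 0.1) -> float:
    """示意:答案正确记 1 分,每次调用搜索工具付出成本 c。
    当模型预期无法答对时,主动求助(搜索)或回避的期望收益反而更高,
    于是“求助行为”自然成为 abstention 的代理信号。"""
    return float(is_correct) - c * num_searches
```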

[NLP-10] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)在非英语场景下自动评估效果不佳的问题,尤其是如何有效训练多语言奖励模型(Reward Model),以提升其在多语言环境中的泛化能力。解决方案的关键在于提出 mR3——一个覆盖 72 种语言的、基于评分标准无关(rubric-agnostic)的奖励推理模型,并通过系统性研究数据选择与课程学习策略,发现整合目标语言推理数据集能显著提升模型性能。该方法在多语言奖励模型基准测试中达到当前最优表现,且模型规模比同类大模型小至多 9 倍,验证了高效训练策略的有效性。

链接: https://arxiv.org/abs/2510.01146
作者: David Anugraha,Shou-Yi Hung,Zilu Tang,Annie En-Shiun Lee,Derry Tanti Wijaya,Genta Indra Winata
机构: Stanford University (斯坦福大学); University of Toronto (多伦多大学); Boston University (波士顿大学); Ontario Tech University (安大略理工大学); Monash University Indonesia (蒙纳士大学印度尼西亚分校); Capital One (资本一号)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Our models, data, and code are available as open source at this https URL.

[NLP-11] Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review

【速读】: 该论文旨在解决非洲低资源语言在自动语音识别(Automatic Speech Recognition, ASR)领域中数据集匮乏、模型训练方法不足、评估指标单一及资源可用性差等问题,从而推动数字包容性发展。其关键解决方案在于系统梳理现有研究,提出以社区驱动的数据集建设、轻量化建模技术、伦理合规的数据采集与共享机制、以及更全面的评估体系(如引入字符错误率 Character Error Rate, CER 和声调错误率 Diacritic Error Rate, DER)为核心的改进路径,并强调通过跨机构合作和持续基准测试促进ASR技术在非洲语言中的可持续发展。

链接: https://arxiv.org/abs/2510.01145
作者: Sukairaj Hafiz Imam,Tadesse Destaw Belay,Kedir Yassin Husse,Ibrahim Said Ahmad,Idris Abdulmumin,Hadiza Ali Umar,Muhammad Yahuza Bello,Joyce Nakatumba-Nabende,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad
机构: Northwest University, Kano, Nigeria; Instituto Politécnico Nacional, Mexico; University of Gondar, Ethiopia; Bayero University, Kano, Nigeria; University of Pretoria, South Africa; Makerere University, Uganda; University of Hamburg, Germany; Imperial College London, United Kingdom
类目: Computation and Language (cs.CL)
备注:

Abstract:ASR has achieved remarkable global progress, yet African low-resource languages remain rigorously underrepresented, producing barriers to digital inclusion across the continent with more than +2000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, challenges, and recommends future directions. We employ the PRISMA 2020 procedures and search DBLP, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025. We include studies related to ASR datasets, models or metrics for African languages, while excluding non-African, duplicates, and low-quality studies (score 3/5). We screen 71 out of 2,062 records and we record a total of 74 datasets across 111 languages, encompassing approximately 11,206 hours of speech. Fewer than 15% of research provided reproducible materials, and dataset licensing is not clear. Self-supervised and transfer learning techniques are promising, but are hindered by limited pre-training data, inadequate coverage of dialects, and the availability of resources. Most of the researchers use Word Error Rate (WER), with very minimal use of linguistically informed scores such as Character Error Rate (CER) or Diacritic Error Rate (DER), and thus with limited application in tonal and morphologically rich languages. The existing evidence on ASR systems is inconsistent, hindered by issues like dataset availability, poor annotations, licensing uncertainties, and limited benchmarking. Nevertheless, the rise of community-driven initiatives and methodological advancements indicates a pathway for improvement. Sustainable development for this area will also include stakeholder partnership, creation of ethically well-balanced datasets, use of lightweight modelling techniques, and active benchmarking.
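
综述指出多数工作只报告 WER,很少使用对声调语言与形态丰富语言更合适的 CER。两者都是编辑距离除以参考长度,只是统计单位不同。以下为自包含的计算草图。

```python
def edit_distance(ref, hyp) -> int:
    """经典 Levenshtein 编辑距离(单行滚动的动态规划)。"""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # 删除
                        dp[j - 1] + 1,     # 插入
                        prev + (r != h))   # 替换 / 匹配
            prev = cur
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    """词错误率:以词为单位的编辑距离 / 参考词数。"""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)

def cer(ref: str, hyp: str) -> float:
    """字符错误率:以字符为单位,对声调、黏着形态更敏感。"""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)
```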

[NLP-12] Prompt Curriculum Learning for Efficient LLM Post-Training

【速读】: 该论文旨在解决语言模型(Language Model, LM)在强化学习(Reinforcement Learning, RL)后训练过程中对批量处理(batching)和提示选择策略敏感的问题,尤其是如何高效地识别并利用具有中间难度的提示以提升推理能力。解决方案的关键在于提出Prompt Curriculum Learning (PCL),其核心机制是通过一个与当前策略同步更新的价值模型(value model)在线识别当前策略下具有中间难度的提示,从而聚焦于信息量高、有效比例(effective ratio)高的样本进行训练。相比基于rollout的筛选方法,PCL避免了昂贵的生成回放过程,在MATH和DeepScaleR数据集上分别实现了12.1倍和16.9倍的速度提升,同时在性能或收敛速度上优于现有方法,实现了推理导向RL中性能上限与训练效率之间的更好权衡。

链接: https://arxiv.org/abs/2510.01135
作者: Zhaolin Gao,Joongwon Kim,Wen Sun,Thorsten Joachims,Sid Wang,Richard Yuanzhe Pang,Liang Tan
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:We introduce Prompt Curriculum Learning (PCL), a lightweight reinforcement learning (RL) algorithm that selects intermediate-difficulty prompts using a learned value model to post-train language models. Since post-training LLMs via RL remains sensitive to batching and prompt selection strategies, we first conduct a series of systematic experiments where we (1) determine the optimal training batch size that balances generation efficiency and gradient quality and (2) establish the importance of focusing on prompts of intermediate difficulty for the policy. We build upon these results to design PCL, which identifies prompts of intermediate difficulty for the current policy in an on-policy manner by using a value model that is concurrently updated based on the current policy. By focusing on informative prompts that yield high effective ratios, PCL achieves either the highest performance or requires significantly less time to reach comparable performance to its counterparts. Compared to rollout-based filtering methods, PCL avoids costly rollouts and achieves 12.1× and 16.9× faster speed on identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively. We further demonstrate that our value model accurately predicts prompt difficulty and allows PCL to focus on progressively more challenging prompts during RL. Our results present a new methodology that delivers improved tradeoff between upper-bound performance and efficiency for reasoning-focused RL.
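
PCL 用与策略同步更新的价值模型预测每条 prompt 在当前策略下的通过率,选取最接近 0.5 的“中间难度”样本,从而省去基于 rollout 的昂贵预筛。以下为选择逻辑的草图(value_model 为假设接口)。

```python
import numpy as np

def select_intermediate_prompts(value_model, prompts, batch_size=32, target=0.5):
    """示意:按 |预测通过率 - 0.5| 升序取前 batch_size 条 prompt。
    通过率接近 0.5 的 prompt 组内奖励方差最大,梯度信息量最高。"""
    p = np.array([value_model(x) for x in prompts])
    order = np.argsort(np.abs(p - target))
    return [prompts[i] for i in order[:batch_size]]
```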

[NLP-13] A Practitioners Guide to Multi-turn Agent ic Reinforcement Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)作为智能体(agent)在多轮强化学习(multi-turn reinforcement learning, RL)训练过程中,现有框架和设计选择缺乏系统性分析与统一指导的问题。其关键解决方案在于将训练设计空间解构为三个相互关联的核心支柱:环境(environment)、奖励(reward)和策略(policy),并通过在TextWorld、ALFWorld和SWE-Gym等不同任务域中的实证研究,提炼出一套可复用的训练配方(training recipe)。该配方明确指出:简单环境即可提供泛化能力信号;密集的回合级奖励虽能加速训练但依赖特定RL算法以保障性能与稳定性;且奖励稀疏性与策略梯度方法(如PPO、GRPO、RLOO)之间存在交互影响,同时给出了在固定预算下最优监督微调(Supervised Fine-tuning, SFT)到RL训练比例的确定方法,从而实现跨三者的协同优化,推动多轮代理强化学习的研究与应用落地。

链接: https://arxiv.org/abs/2510.01132
作者: Ruiyi Wang,Prithviraj Ammanabrolu
机构: University of California, San Diego (加州大学圣地亚哥分校); NVIDIA (英伟达)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:We study what actually works and what doesn’t for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter-related pillars – environment, reward, and policy – and empirically derive a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software engineering style tasks. (i) For the environment, we analyze the impacts of task complexity in terms of sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability is highly dependent on the choice of RL algorithm. (iii) And for the agent’s policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods in addition to showing how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: this https URL

[NLP-14] Research on the Integration of Embodied Intelligence and Reinforcement Learning in Textual Domains

【速读】: 该论文旨在解决传统文本处理方法在智能性与决策优化能力方面的局限性,试图通过融合具身智能(embodied intelligence)的感知与行动优势与强化学习(reinforcement learning)的决策优化能力,提升文本处理任务的智能化水平。其解决方案的关键在于提出了一种新颖的集成模型,该模型在理论分析和实验验证中均展现出对多种文本处理任务的有效性,充分体现了其应用潜力。

链接: https://arxiv.org/abs/2510.01076
作者: Haonan Wang,Junfeng Sun,Mingjia Zhao,Wei Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 4 pages

Abstract:This article addresses embodied intelligence and reinforcement learning integration in the field of text processing, aiming to enhance text handling with more intelligence on the basis of embodied intelligence’s perception and action superiority and reinforcement learning’s decision optimization capability. Through detailed theoretical explanation and experimental exploration, a novel integration model is introduced. This model has been demonstrated to be very effective in a wide range of text processing tasks, validating its applicative potential.

[NLP-15] Hybrid Dialogue State Tracking for Persian Chatbots: A Language Model-Based Approach

【速读】: 该论文旨在解决传统基于规则的对话状态追踪(Dialogue State Tracking, DST)在开放域和多轮对话场景下适应性不足、难以实现类人交互体验的问题。其解决方案的关键在于提出一种混合式DST模型,融合规则方法与语言模型:利用BERT进行槽位填充(slot filling)和意图识别,XGBoost用于意图验证,GPT负责对话状态更新,同时引入在线代理(online agents)实现实时答案生成。该架构在波斯语多轮对话数据集上进行了评估,显著提升了准确性和对话连贯性,验证了混合方法在增强DST能力方面的有效性。

链接: https://arxiv.org/abs/2510.01052
作者: Samin Mahdipour Aghabagher,Saeedeh Momtazi
机构: Amirkabir University of Technology (伊朗阿米尔卡比尔理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 1 figure. Submitted to Natural Language Engineering

Abstract:Dialogue State Tracking (DST) is an essential element of conversational AI with the objective of deeply understanding the conversation context and leading it toward answering user requests. Due to high demands for open-domain and multi-turn chatbots, the traditional rule-based DST is not efficient enough, since it cannot provide the required adaptability and coherence for human-like experiences in complex conversations. This study proposes a hybrid DST model that utilizes rule-based methods along with language models, including BERT for slot filling and intent detection, XGBoost for intent validation, GPT for DST, and online agents for real-time answer generation. This model is uniquely designed to be evaluated on a comprehensive Persian multi-turn dialogue dataset and demonstrated significantly improved accuracy and coherence over existing methods in Persian-based chatbots. The results demonstrate how effectively a hybrid approach may improve DST capabilities, paving the way for conversational AI systems that are more customized, adaptable, and human-like.

[NLP-16] GEM: A Gym for Agent ic LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)从静态数据集训练向基于经验的学习范式转变过程中缺乏标准化环境模拟器的问题。当前LLM代理需通过与复杂环境交互来习得技能,但缺乏统一、高效且可扩展的仿真平台支持这一转型。解决方案的关键在于提出GEM(General Experience Maker),一个专为LLM时代设计的开源环境模拟器,其核心优势包括:提供类OpenAI Gym的标准环境-代理接口,支持异步向量化执行以实现高吞吐量,并具备灵活的封装机制便于扩展;同时内置多样化环境、集成工具及单文件示例脚本,兼容五种主流强化学习(Reinforcement Learning, RL)训练框架,并基于REINFORCE结合Return Batch Normalization(ReBN)构建了24个环境上的基线,该方法相较于GRPO更适配密集回合奖励场景并优化信用分配。此外,GEM还可作为评估工具使用,从而推动代理型LLM研究的加速发展。

链接: https://arxiv.org/abs/2510.01051
作者: Zichen Liu,Anya Sims,Keyu Duan,Changyu Chen,Simon Yu,Xiangxin Zhou,Haotian Xu,Shaopan Xiong,Bo Liu,Chenmien Tan,Chuen Yang Beh,Weixun Wang,Hao Zhu,Weiyan Shi,Diyi Yang,Michael Shieh,Yee Whye Teh,Wee Sun Lee,Min Lin
机构: Sea AI Lab; NUS (新加坡国立大学); Oxford (牛津大学); SMU (新加坡管理大学); Stanford (斯坦福大学); Northeastern (东北大学); OpenRLHF; ROLL; RL2
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating using GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which – unlike GRPO – is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apple-to-apple benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit besides a training environment. We hope this framework can help accelerate future agentic LLM research.
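
GEM 采用类 OpenAI-Gym 的环境-代理接口。下面用标准 Gym 风格的交互循环示意其用法;函数名与五元组返回值约定均为按 Gym 惯例的假设,实际 API 以官方仓库为准。

```python
# 示意:Gym 风格的环境-代理交互循环(接口名为按惯例的假设,非官方 API)。
def run_episode(env, agent, max_turns=32):
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_turns):
        action = agent.act(obs)      # LLM 根据文本观测生成动作
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward       # 逐回合密集奖励,配合 ReBN 做信用分配
        if terminated or truncated:
            break
    return total_reward
```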

[NLP-17] Interpreting Language Models Through Concept Descriptions: A Survey EMNLP2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中机制可解释性(mechanistic interpretability)的核心挑战,即揭示神经网络决策过程的内在机制,并明确单个模型组件(如神经元、注意力头)及抽象表示(如稀疏自编码器提取的特征)的功能角色。其解决方案的关键在于系统性地梳理和总结生成式AI(Generative AI)驱动的概念描述方法——通过强大的生成模型为模型组件提供开放词汇、自然语言层面的概念描述,从而提升模型透明度。论文进一步分析了评估这些概念描述的自动化与人工指标的发展趋势,并指出当前研究亟需更严谨、因果导向的评估范式以推动领域进步。

链接: https://arxiv.org/abs/2510.01048
作者: Nils Feldhus,Laura Kopf
机构: BIFOLD – Berlin Institute for the Foundations of Learning and Data (柏林学习与数据基础研究所); Technische Universität Berlin (柏林工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at The Eight Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), co-located with EMNLP 2025

Abstract:Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles of individual model components such as neurons and attention heads, as well as model abstractions such as the learned sparse features extracted by Sparse Autoencoders (SAEs). A rapidly growing line of work tackles this challenge by using powerful generator models to produce open-vocabulary, natural language concept descriptions for these components. In this paper, we provide the first survey of the emerging field of concept descriptions for model components and abstractions. We chart the key methods for generating these descriptions, the evolving landscape of automated and human metrics for evaluating them, and the datasets that underpin this research. Our synthesis reveals a growing demand for more rigorous, causal evaluation. By outlining the state of the art and identifying key challenges, this survey provides a roadmap for future research toward making models more transparent.

[NLP-18] Authentic Discrete Diffusion Model

【速读】: 该论文旨在解决传统伪离散扩散(Pseudo-Discrete Diffusion, PDD)方法在处理离散数据时存在的局限性,即依赖连续潜在空间或掩码策略进行扩散建模,导致无法有效保留原始离散结构的特性。其解决方案的关键在于提出一种真实离散扩散(Authentic Discrete Diffusion, ADD)框架,通过在 one-hot 空间中直接建模扩散过程,利用浮点编码的 one-hot 类别数据作为输入,并引入时间步条件交叉熵损失(timestep-conditioned cross-entropy loss),使生成模型输出与原始 one-hot 标签对齐。该设计不仅保持了扩散模型的核心特性,还实现了判别式学习与生成式学习之间的协同优化,在图像分类和图像描述生成任务上均展现出优越性能。

链接: https://arxiv.org/abs/2510.01047
作者: Xiao Li,Jiaqi Zhang,Shuxiang Zhang,Tianshui Chen,Liang Lin,Guangrun Wang
机构: Sun Yat-sen University (中山大学); Guangdong Key Laboratory of Big Data Analysis and Processing (广东省大数据分析与处理重点实验室); X-Era AI Lab; Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:We propose an Authentic Discrete Diffusion (ADD) framework that fundamentally redefines prior pseudo-discrete approaches by preserving core diffusion characteristics directly in the one-hot space through a suite of coordinated mechanisms. Unlike conventional “pseudo” discrete diffusion (PDD) methods, ADD reformulates the diffusion input by directly using float-encoded one-hot class data, without relying on diffusing in the continuous latent spaces or masking policies. At its core, a timestep-conditioned cross-entropy loss is introduced between the diffusion model’s outputs and the original one-hot labels. This synergistic design establishes a bridge between discriminative and generative learning. Our experiments demonstrate that ADD not only achieves superior performance on classification tasks compared to the baseline, but also exhibits excellent text generation capabilities on Image captioning. Extensive ablations validate the measurable gains of each component.

[NLP-19] Syntax-Guided Diffusion Language Models with User-Integrated Personalization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本时普遍存在的泛化性过强、结构多样性不足的问题,从而限制个性化表达能力。其核心解决方案是提出一种语法引导的扩散语言模型(syntax-guided diffusion language model),关键在于引入结构监督与个性化条件控制机制,通过级联式框架先生成语法指导信息再进行条件文本生成,并进一步发展为非级联架构以提升结构与内容的一致性;同时设计共享表示机制实现跨用户的信息融合,支持高保真风格生成与零样本泛化推理,显著提升了文本的流畅性、多样性及风格忠实度。

链接: https://arxiv.org/abs/2510.01028
作者: Ruqian Zhang,Yijiao Zhang,Juan Shen,Zhongyi Zhu,Annie Qu
机构: Fudan University (复旦大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL); Methodology (stat.ME)
备注:

Abstract:Large language models have made revolutionary progress in generating human-like text, yet their outputs often tend to be generic, exhibiting insufficient structural diversity, which limits personalized expression. Recent advances in diffusion models have opened new opportunities for improving language generation beyond the limitations of autoregressive paradigms. In this work, we propose a syntax-guided diffusion language model that integrates structural supervision and personalized conditioning to enhance text quality, diversity, and controllability. We introduce a cascaded framework that generates syntactic guidance before conditional text generation, and further generalize it to a novel noncascaded architecture for better alignment between structure and content. By incorporating syntactic information in the generating process, the proposed model better captures the lexical and structural characteristics of stylistic sentence construction. To enable fine-grained personalization, we develop a shared representation mechanism that facilitates information integration across users, supporting both faithful stylistic generation and generalizable zero-shot inference. Extensive experiments on multiple tasks demonstrate the superiority of our approach in fluency, diversity, and stylistic fidelity. Further qualitative analyses highlight its interpretability and flexibility in learning personalized patterns.

[NLP-20] Shape Happens: Automatic Feature Manifold Discovery in LLM s via Supervised Multi-Dimensional Scaling

【速读】: 该论文试图解决语言模型(Language Models, LMs)中概念表征的几何结构不明确、缺乏通用性的问题,即如何自动发现并理解模型潜在空间中不同特征所对应的多维流形结构。此前研究多聚焦于特定特征的特定几何形式,难以推广。其解决方案的关键在于提出一种模型无关的监督多维缩放方法(Supervised Multi-Dimensional Scaling, SMDS),能够自动识别不同语义特征在潜在空间中形成的几何结构(如圆、线、簇等),并通过时间推理任务验证了这些结构具有稳定性、可解释性和动态响应能力,从而揭示了特征流形在模型推理中的功能作用,支持基于实体的结构化表征推理模型。

链接: https://arxiv.org/abs/2510.01025
作者: Federico Tiblias,Irina Bigoulaeva,Jingcheng Niu,Simone Balloccu,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab); Technical University of Darmstadt (达姆施塔特工业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The linear representation hypothesis states that language models (LMs) encode concepts as directions in their latent space, forming organized, multidimensional manifolds. Prior efforts focus on discovering specific geometries for specific features, and thus lack generalization. We introduce Supervised Multi-Dimensional Scaling (SMDS), a model-agnostic method to automatically discover feature manifolds. We apply SMDS to temporal reasoning as a case study, finding that different features form various geometric structures such as circles, lines, and clusters. SMDS reveals many insights on these structures: they consistently reflect the properties of the concepts they represent; are stable across model families and sizes; actively support reasoning in models; and dynamically reshape in response to context changes. Together, our findings shed light on the functional role of feature manifolds, supporting a model of entity-based reasoning in which LMs encode and transform structured representations.
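
SMDS 的思想是把监督信号注入经典 MDS 的相异度矩阵。以下草图将特征距离与标签距离加权混合后交给 sklearn 的 MDS;这种加权方式是本文为示意自拟的,论文中的具体构造以原文为准。

```python
import numpy as np
from sklearn.manifold import MDS

def supervised_mds(X: np.ndarray, labels: np.ndarray,
                   alpha: float = 0.5, n_components: int = 2):
    """示意:混合特征相异度与标签相异度,再做度量 MDS 嵌入(非论文官方实现)。"""
    d_feat = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    d_lab = np.abs(labels[:, None] - labels[None, :]).astype(float)
    d = ((1 - alpha) * d_feat / (d_feat.max() + 1e-8)
         + alpha * d_lab / (d_lab.max() + 1e-8))
    return MDS(n_components=n_components, dissimilarity="precomputed").fit_transform(d)
```

若标签本身是周期性的(如一周七天),把 d_lab 换成环上的距离即可让嵌入呈现论文观察到的圆形结构。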

[NLP-21] Improving Code Localization with Repository Memory

【速读】: 该论文旨在解决代码定位(code localization)任务中语言代理(language agent)缺乏长期记忆的问题,即现有方法在处理每个实例时均从零开始,忽略了代码库的历史演化信息。解决方案的关键在于引入基于提交历史(commit history)的非参数化记忆机制,通过分析代码库的演化模式,提取近期历史提交、关联问题以及活跃模块的功能摘要,从而增强代理对代码库上下文的理解能力。实验表明,这种记忆增强策略显著提升了LocAgent框架在SWE-bench-verified和SWE-bench-live基准上的性能,使代理更接近人类开发者在长期任务中积累经验的能力。

链接: https://arxiv.org/abs/2510.01003
作者: Boshi Wang,Weijian Xu,Yunsheng Li,Mei Gao,Yujia Xie,Huan Sun,Dongdong Chen
机构: The Ohio State University (俄亥俄州立大学); Microsoft
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 15 pages, 8 figures

Abstract:Code localization is a fundamental challenge in repository-level software engineering tasks such as bug fixing. While existing methods equip language agents with comprehensive tools/interfaces to fetch information from the repository, they overlook the critical aspect of memory, where each instance is typically handled from scratch assuming no prior repository knowledge. In contrast, human developers naturally build long-term repository memory, such as the functionality of key modules and associations between various bug types and their likely fix locations. In this work, we augment language agents with such memory by leveraging a repository’s commit history - a rich yet underutilized resource that chronicles the codebase’s evolution. We introduce tools that allow the agent to retrieve from a non-parametric memory encompassing recent historical commits and linked issues, as well as functionality summaries of actively evolving parts of the codebase identified via commit patterns. We demonstrate that augmenting such a memory can significantly improve LocAgent, a state-of-the-art localization framework, on both SWE-bench-verified and the more recent SWE-bench-live benchmarks. Our research contributes towards developing agents that can accumulate and leverage past experience for long-horizon tasks, more closely emulating the expertise of human developers.

[NLP-22] It Takes Two: Your GRPO Is Secretly DPO

【速读】: 该论文旨在解决Group Relative Policy Optimization (GRPO)在实际应用中因需要较大群体规模(group size)以保证训练稳定性而导致的计算开销过高的问题。传统观点认为,较大的群体规模是实现精确统计估计、确保训练稳定性的必要条件。论文的关键创新在于将GRPO重新诠释为一种对比学习(contrastive learning)形式,从而揭示其与Direct Preference Optimization (DPO)之间的本质联系,并由此提出最小化配置——两轮采样情形(2-GRPO),即仅使用两个rollout进行优化。理论分析和实证结果表明,2-GRPO在性能上可媲美16-GRPO,同时仅需1/8的rollouts并使训练时间减少超过70%,突破了对GRPO必须依赖大群体规模的传统认知。

链接: https://arxiv.org/abs/2510.00977
作者: Yihong Wu,Liheng Ma,Lei Ding,Muzhi Li,Xinyu Wang,Kejia Chen,Zhan Su,Zhanguang Zhang,Chenyang Huang,Yingxue Zhang,Mark Coates,Jian-Yun Nie
机构: Université de Montréal (蒙特利尔大学); McGill University (麦吉尔大学); Mila - Quebec AI Institute (魁北克人工智能研究所); University of Manitoba (曼尼托巴大学); The Chinese University of Hong Kong (香港中文大学); Zhejiang University (浙江大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Alberta (阿尔伯塔大学); Alberta Machine Intelligence Institute (Amii) (阿尔伯塔机器智能研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO’s empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
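
把 GRPO 的组内标准化优势写到 N=2,可以直接看到它与 DPO 的联系:只要两条 rollout 奖励不同,优势必为一正一负的成对对比。以下为数值示意。

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO 的组相对优势:组内去均值并除以标准差(示意)。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# N=2 且两条 rollout 奖励不同时,优势恒为一正一负的“胜者/败者”对,
# 形式上与 DPO 的偏好对一致,这正是 2-GRPO 可行的直觉来源。
print(grpo_advantages([1.0, 0.0]))   # ≈ [ 1., -1.]
print(grpo_advantages([0.0, 1.0]))   # ≈ [-1.,  1.]
```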

[NLP-23] Analyzing Dialectical Biases in LLM s for Knowledge and Reasoning Benchmarks EMNLP

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理非标准英语方言时性能显著下降的问题,尤其是针对代表性不足的英语方言变体在多项选择题问答任务中准确率降低的现象。研究表明,这种性能下降可归因于特定语法结构的差异,其中三个关键语法规则——存在句中的“it”(existential “it”)、零系词结构(zero copula)以及“y’all”代词——能够解释多数方言情境下的性能损失。解决方案的关键在于未来研究应聚焦于识别并缓解这些高影响的个体语法结构所引发的偏见,而非仅从整体语料库层面进行优化。

链接: https://arxiv.org/abs/2510.00962
作者: Eileen Pan,Anna Seo Gyeong Choi,Maartje ter Hoeve,Skyler Seto,Allison Koenecke
机构: Cornell University (康奈尔大学); Apple (苹果公司); Cornell Tech (康奈尔技术学院)
类目: Computation and Language (cs.CL)
备注: EMNLP Findings 2025, 12 pages, 11 tables, 3 figures

Abstract:Large language models (LLMs) are ubiquitous in modern day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying “standard” American English language questions as non-“standard” dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-“standard” English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential “it”, zero copula, and y’all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.

[NLP-24] Making not Taking the Best of N

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)生成质量评估与利用中存在的“选择性浪费”问题,即传统Best-of-N(BoN)方法仅选取单一最优生成结果,忽略了其他候选样本中潜在的有用信息,导致资源浪费和多样性损失。其解决方案的关键在于提出Fusion-of-N(FusioN)方法:利用一个通用的大语言模型判别器(LLM judge)将多个候选生成结果中的最相关信息进行融合,从而合成出更高质量的最终输出。该方法突破了零和选择范式,转向协作式整合,显著提升了测试时扩展(test-time scaling)和合成数据生成(synthetic data generation)两个场景下的性能表现,展现出对多样性和潜在能力的有效利用。

链接: https://arxiv.org/abs/2510.00931
作者: Ammar Khairi,Daniel D’souza,Marzieh Fadaee,Julia Kreutzer
机构: 未知
类目: Computation and Language (cs.CL)
备注:

Abstract:Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of N samples, the Best-of-N (BoN). Yet, this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-N (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer. We compare FusioN to BoN in two settings, (i) test-time scaling, where we sample and aggregate from a single model at test-time (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse tasks and varying model scales. Across the bench, FusioN consistently outperforms BoN showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis on FusioN, where it shows surprising strengths and robustness under challenging settings. These results show that we should shift how we think about evaluating and utilizing LLM generations from a monolithic measure of quality, to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.
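
下面用两个函数对比 BoN 的“零和选择”与 FusioN 的“协作融合”。judge_score、call_llm 均为虚构接口,融合提示词也仅为示意。

```python
def best_of_n(judge_score, candidates):
    """BoN:只留一个胜者,其余候选的信息被丢弃。"""
    return max(candidates, key=judge_score)

def fusion_of_n(call_llm, question, candidates):
    """FusioN:让评审 LLM 把各候选中最有信息量的部分合成为单一答案(示意)。"""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    fuse_prompt = (
        f"Question: {question}\nCandidate answers:\n{numbered}\n"
        "Synthesize the most informative elements of these candidates "
        "into a single final answer."
    )
    return call_llm(fuse_prompt)
```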

[NLP-25] Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving

【速读】: 该论文旨在解决基础模型在专家级物理推理任务(如奥林匹克级别物理问题)中表现不足的问题,其核心挑战在于如何有效提升模型对复杂、多模态物理问题的理解与求解能力。解决方案的关键在于引入PhoPile——一个专为奥林匹克级别物理设计的高质量多模态数据集,该数据集包含图表、公式等关键信息,能够系统性地支持基于检索增强生成(Retrieval-Augmented Generation, RAG)的物理推理研究。通过在PhoPile上对多种大语言模型(Large Language Models, LLMs)和大视觉语言模型(Large Multimodal Models, LMMs)进行基准测试,研究发现将检索机制与物理知识库结合可显著提升模型性能,从而为未来在RAG框架下优化物理推理提供了重要方向。

链接: https://arxiv.org/abs/2510.00919
作者: Shunfeng Zheng,Yudi Zhang,Meng Fang,Zihan Zhang,Zhitan Wu,Mykola Pechenizkiy,Ling Chen
机构: AAII, University of Technology Sydney (悉尼科技大学); Eindhoven University of Technology (埃因霍温理工大学); University of Liverpool (利物浦大学); University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but their capacity for expert-level reasoning-such as solving Olympiad-level physics problems-remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.

[NLP-26] Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLM s

【速读】: 该论文旨在解决跨语言信息检索(Cross-lingual Information Retrieval, CLIR)中如何高效、准确地从多种语言文档中检索与用户查询相关的资源的问题。传统方法通常依赖翻译技术将查询或文档映射到同一语言进行单语言检索,但存在翻译误差和语言间语义不对齐等问题。论文指出,当前解决方案的关键在于从翻译驱动范式转向基于嵌入(embedding)的方法,并充分利用多语言大语言模型(Multilingual Large Language Models, MLLMs),通过跨语言表示对齐实现更鲁棒的语义匹配与答案生成能力,从而提升跨语言检索性能并推动系统向更具包容性和适应性的方向发展。

链接: https://arxiv.org/abs/2510.00908
作者: Roksana Goworek,Olivia Macmillan-Scott,Eda B. Özyiğit
机构: The Alan Turing Institute (艾伦·图灵研究所); Queen Mary University of London (伦敦玛丽女王大学); University College London (伦敦大学学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Research in this area has typically framed the task as monolingual retrieval augmented by translation, treating retrieval methods and cross-lingual capabilities in isolation. Both monolingual and cross-lingual retrieval usually follow a pipeline of query expansion, ranking, re-ranking and, increasingly, question answering. Recent advances, however, have shifted from translation-based methods toward embedding-based approaches and leverage multilingual large language models (LLMs), for which aligning representations across languages remains a central challenge. The emergence of cross-lingual embeddings and multilingual LLMs has introduced a new paradigm, offering improved retrieval performance and enabling answer generation. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques. It presents a structured account of core CLIR components, evaluation practices, and available resources. Persistent challenges such as data imbalance and linguistic variation are identified, while promising directions are suggested for advancing equitable and effective cross-lingual information retrieval. By situating CLIR within the broader landscape of information retrieval and multilingual language processing, this work not only reviews current capabilities but also outlines future directions for building retrieval systems that are robust, inclusive, and adaptable.
zh

[NLP-27] Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在学术写作中广泛应用所引发的署名完整性与学术出版物可靠性问题,特别是现有检测方法在细粒度文本片段定位、校准能力不足以及跨学科和不同生成模型泛化性能差等方面的局限性。其解决方案的关键在于提出一种结构感知的框架 Sci-SpanDet,通过结合章节条件化的风格建模与多层次对比学习来捕捉人类与AI写作风格的细微差异并降低主题依赖性,从而提升跨领域鲁棒性;同时引入 BIO-CRF 序列标注与基于指针的边界解码及置信度校准机制,实现精确的片段级检测与可靠的概率估计。

链接: https://arxiv.org/abs/2510.00890
作者: Zhen Yin,Shenghua Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid adoption of large language models (LLMs) in scientific writing raises serious concerns regarding authorship integrity and the reliability of scholarly publications. Existing detection approaches mainly rely on document-level classification or surface-level statistical cues; however, they neglect fine-grained span localization, exhibit weak calibration, and often fail to generalize across disciplines and generators. To address these limitations, we present Sci-SpanDet, a structure-aware framework for detecting AI-generated scholarly texts. The proposed method combines section-conditioned stylistic modeling with multi-level contrastive learning to capture nuanced human-AI differences while mitigating topic dependence, thereby enhancing cross-domain robustness. In addition, it integrates BIO-CRF sequence labeling with pointer-based boundary decoding and confidence calibration to enable precise span-level detection and reliable probability estimates. Extensive experiments on a newly constructed cross-disciplinary dataset of 100,000 annotated samples generated by multiple LLM families (GPT, Qwen, DeepSeek, LLaMA) demonstrate that Sci-SpanDet achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36. Furthermore, it shows strong resilience under adversarial rewriting and maintains balanced accuracy across IMRaD sections and diverse disciplines, substantially surpassing existing baselines. To ensure reproducibility and to foster further research on AI-generated text detection in scholarly documents, the curated dataset and source code will be publicly released upon publication.
zh

[NLP-28] HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在检索增强生成(Retrieval-Augmented Generation, RAG)任务中普遍存在幻觉(hallucination)问题,即模型生成与输入文档不一致或无依据的陈述,从而降低其在真实场景中的可信度。解决方案的关键在于提出一个40亿参数的小型推理模型(Small Reasoning Model, SRM)——HalluGuard,其通过三个核心机制实现:(i) 基于FineWeb构建的领域无关合成数据集,并经多阶段清洗与重构以提升质量;(ii) 生成结构化的“文档-主张”对,包含真实(grounded)和幻觉(hallucinated)两类样本;(iii) 利用基于优势比偏好优化(Odds Ratio Preference Optimization)的偏好微调策略,将大模型的推理能力蒸馏至小型模型中。实验表明,HalluGuard在RAGTruth子集上达到84.0%的平衡准确率(BAcc),媲美参数量更大的专用模型,且在全基准测试中达到75.7% BAcc,与GPT-4o相当,验证了其高效性与高可靠性。

链接: https://arxiv.org/abs/2510.00880
作者: Loris Bergeron,Ioana Buhnila,Jérôme François,Radu State
机构: Banque de Luxembourg (卢森堡银行); Center for Data Science in Humanities, Chosun University (朝鲜大学人文数据科学中心); ATILF, University of Lorraine–CNRS (洛林大学-法国国家科学研究中心ATILF实验室); SnT, University of Luxembourg (卢森堡大学SnT)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel in many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling specialized models, MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%) while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, matching larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and datasets under Apache 2.0 upon acceptance.
zh

[NLP-29] The data-quality illusion: Rethinking Classifier-based quality filtering for LLM Pretraining

【速读】: 该论文旨在解决大规模语言模型预训练中因训练数据质量参差不齐而导致性能受限的问题,核心挑战在于现有数据过滤方法(如基于分类器的质量过滤,Classifier-based Quality Filtering, CQF)是否真正提升了数据质量。其解决方案的关键在于对CQF机制进行深入分析,发现尽管CQF能提升下游任务表现,但其本质是通过二分类器为每个文档打分并保留高分样本,这一过程实际上也隐式地过滤了原本高质量的数据集,从而导致语言建模任务在高质量数据上的表现并未改善。研究进一步对比了CQF与通过随机token置换生成的合成高质量数据的训练效果,揭示出两者趋势迥异,质疑了CQF能否捕捉到有意义的数据质量概念。

链接: https://arxiv.org/abs/2510.00866
作者: Thiziri Nait Saada,Louis Bethune,Michal Klein,David Grangier,Marco Cuturi,Pierre Ablin
机构: Apple(苹果)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 21 pages, 20 figures, 2 tables, preprint

点击查看摘要

Abstract:Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier’s score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
zh
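
【示例】下面是 CQF“打分再筛选”流程的一个极简示意(特征表示、分类器与保留比例均为假设,非论文官方实现):用二分类器区分高质量集与预训练文档,将分类器给出的概率作为质量分数,仅保留得分最高的一部分文档。

```python
# CQF(基于分类器的质量过滤)示意:训练二分类器区分高质量文档
# 与普通预训练文档,以分类器输出概率作为质量分数,保留 top 比例。
# 特征、模型与 keep_ratio 均为示意性假设,非论文官方实现。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

def cqf_filter(pretrain_docs, high_quality_docs, keep_ratio=0.1):
    vec = TfidfVectorizer(max_features=50000)
    X = vec.fit_transform(high_quality_docs + pretrain_docs)
    y = np.array([1] * len(high_quality_docs) + [0] * len(pretrain_docs))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # 质量分数 = 分类器判为“高质量”的概率
    scores = clf.predict_proba(vec.transform(pretrain_docs))[:, 1]
    k = int(len(pretrain_docs) * keep_ratio)
    keep_idx = np.argsort(scores)[::-1][:k]
    return [pretrain_docs[i] for i in keep_idx]
```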

[NLP-30] Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

【速读】: 该论文旨在解决搜索增强型大语言模型(search-augmented large language models)在复杂多跳推理(multi-hop reasoning)任务中可靠性不足的问题。其核心挑战包括分解错误(decomposition errors)、检索缺失(retrieval missing)和推理错误(reasoning errors),任一环节的失败都可能导致最终答案偏离正确方向。解决方案的关键在于提出一种名为可擦除强化学习(Erasable Reinforcement Learning, ERL)的新框架,该框架能显式识别推理链中的错误步骤,将其擦除并原位重生成新的推理路径,从而阻断缺陷逻辑的传播,显著提升模型在多跳推理任务中的鲁棒性。

链接: https://arxiv.org/abs/2510.00861
作者: Ziliang Wang,Kang An,Xuhui Zheng,Faqiang Qian,Weikun Zhang,Cijun Ouyang,Jialu Cai,Yuhang Wang,Yichao Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art(SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
zh

[NLP-31] ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在演变为自主代理(autonomous agents)过程中,因操作目标与人类安全价值冲突而导致的安全性评估不足问题。现有安全基准主要关注生成有害内容(如毒性文本),但忽视了模型在追求任务目标时可能采取有害行为的现实风险。解决方案的关键在于引入ManagerBench——一个基于人类验证的管理场景基准,其中每个场景迫使模型在达成运营目标的“务实但有害”行动与“安全但低效”行动之间做出选择;同时设置对照组(仅对无生命物体造成潜在伤害)以区分模型的务实倾向与过度保守行为。实证表明,前沿LLMs在此类情境中普遍存在安全-务实权衡失调现象,其根源并非无法识别危害,而是优先级判断错误,从而凸显出该基准在评估代理核心能力(即在目标与对齐价值观冲突时作出安全决策)上的必要性与挑战性。

链接: https://arxiv.org/abs/2510.00857
作者: Adi Simhi,Jonathan Herzig,Martin Tutek,Itay Itzhak,Idan Szpektor,Yonatan Belinkov
机构: Technion – IIT (以色列理工学院); Google Research (谷歌研究); University of Zagreb (萨格勒布大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model’s pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models’ harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark code available at this https URL.
zh

[NLP-32] Can World Models Benefit VLMs for World Dynamics?

【速读】: 该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在多帧推理和空间关系理解能力上的局限性,尤其是在缺乏显式时序建模的情况下,如何利用视频预训练中获得的运动一致性先验来增强通用视觉理解能力。其核心解决方案是将视频扩散模型(video diffusion model)重新设计为生成式编码器(generative encoder),通过单步去噪操作提取视觉潜在表示(visual embedding),从而构建出世界语言模型(World-Language Models, WorldLMs)。其中最具创新性的变体Dynamic Vision Aligner (DyVA) 表现出显著提升的空间推理能力和单图多帧推理能力,其性能优势可归因于从视频预训练中继承的运动一致性内化机制。这一方法为未来基于世界模型先验的通用视觉学习范式提供了新的研究方向。

链接: https://arxiv.org/abs/2510.00855
作者: Kevin Zhang,Kuangzhi Ge,Xiaowei Chi,Renrui Zhang,Shaojun Shi,Zhen Dong,Sirui Han,Shanghang Zhang
机构: Peking University (北京大学); Hong Kong University of Science and Technology (香港科技大学); Chinese University of Hong Kong (香港中文大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundational models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we strive to investigate the capabilities when world model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder to perform a single denoising step and treat the resulting latents as a set of visual embedding. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders can capture latents useful for downstream understanding that show distinctions from conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to WorldLM’s inherited motion-consistency internalization from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.
zh

[NLP-33] Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG

【速读】: 该论文试图解决当前机制可解释性(Mechanistic Interpretability, MI)研究中缺乏科学严谨性的问题,特别是现有方法如电路发现(circuit discovery)所得结果的稳定性与可靠性不足。其解决方案的关键在于将解释方法视为统计估计器,通过系统性的稳定性分析来评估其方差和鲁棒性;文中以最先进的电路发现方法EAP-IG为例,采用输入重采样、提示改写、超参数变化及因果分析中注入噪声等多种受控扰动手段进行验证,揭示了该方法在结构上存在高方差且对超参数敏感,从而提出应常规报告稳定性指标,以推动可解释性研究向更严谨、统计基础更扎实的方向发展。

链接: https://arxiv.org/abs/2510.00845
作者: Maxime Méloux,Maxime Peyrard,François Portet
机构: Université Grenoble Alpes (格勒诺布尔阿尔卑斯大学); CNRS (法国国家科学研究中心); Grenoble INP (格勒诺布尔国立理工学院); LIG (信息与图形实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The development of trustworthy artificial intelligence requires moving beyond black-box performance metrics toward an understanding of models’ internal computations. Mechanistic Interpretability (MI) aims to meet this need by identifying the algorithmic mechanisms underlying model behaviors. Yet, the scientific rigor of MI critically depends on the reliability of its findings. In this work, we argue that interpretability methods, such as circuit discovery, should be viewed as statistical estimators, subject to questions of variance and robustness. To illustrate this statistical framing, we present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG. We evaluate its variance and robustness through a comprehensive suite of controlled perturbations, including input resampling, prompt paraphrasing, hyperparameter variation, and injected noise within the causal analysis itself. Across a diverse set of models and tasks, our results demonstrate that EAP-IG exhibits high structural variance and sensitivity to hyperparameters, questioning the stability of its findings. Based on these results, we offer a set of best-practice recommendations for the field, advocating for the routine reporting of stability metrics to promote a more rigorous and statistically grounded science of interpretability.
zh
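
【示例】论文主张把可解释性方法当作统计估计器并常规报告稳定性指标。下面给出一个通用的稳定性度量脚手架示意(与 EAP-IG 内部实现无关,edge_scores 的来源为假设):对多次扰动运行得到的 top-k 电路边集合计算平均两两 Jaccard 重叠,重叠越低说明结构方差越高。

```python
# 电路发现稳定性度量示意:runs 中每个元素是一次独立运行
# (重采样输入、改写提示或变动超参后)得到的 {边标识: 归因分数}。
# 对各运行的 top-k 边集合计算平均两两 Jaccard 重叠。
from itertools import combinations

def topk_edges(edge_scores, k=100):
    return set(sorted(edge_scores, key=edge_scores.get, reverse=True)[:k])

def mean_pairwise_jaccard(runs, k=100):
    sets = [topk_edges(r, k) for r in runs]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```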

[NLP-34] Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation

【速读】: 该论文旨在解决检索增强型大语言模型(Retrieval-Augmented LLM-based Machine Translation, REAL-MT)在噪声检索环境下可靠性不足的问题,尤其是在知识密集型翻译任务中,如习语翻译。其关键解决方案是提出了一种噪声合成框架与新的评估指标,系统性地评估REAL-MT在不同资源语言对下的鲁棒性;并通过实验证明,低资源语言对因更依赖检索上下文,在噪声下性能下降显著,且生成的译文常出现语义混乱;尽管大推理模型(Large Reasoning Models, LRMs)具备更强的推理能力,却未能有效纠正错误,并更易受噪声干扰,表现为注意力偏移至噪声内容、置信度虚高但准确率下降,揭示出当前方法存在校准不良问题。为缓解此问题,研究进一步探索了无需训练和微调两种策略,虽提升了鲁棒性,但以牺牲干净环境下的性能为代价,凸显了鲁棒性与准确性之间的根本权衡。

链接: https://arxiv.org/abs/2510.00829
作者: Yanming Sun,Runzhe Zhan,Chi Seng Cheang,Han Wu,Xuebo Liu,Yuyao Niu,Fengying Ye,Kaixin Lan,Lidia S. Chao,Derek F. Wong
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.
zh
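
【示例】论文的噪声合成框架未给出公开接口,下面是“向检索上下文按比例注入无关段落”这一思路的极简示意(噪声来源与比例均为假设):

```python
# 检索上下文噪声注入示意:以 noise_ratio 的比例,把检索到的证据段落
# 随机替换为无关(噪声)段落,用于测试 REAL-MT 在噪声下的鲁棒性。
import random

def inject_noise(retrieved_passages, noise_pool, noise_ratio=0.5, seed=0):
    rng = random.Random(seed)
    passages = list(retrieved_passages)
    n_noisy = int(len(passages) * noise_ratio)
    for i in rng.sample(range(len(passages)), n_noisy):
        passages[i] = rng.choice(noise_pool)  # 用噪声段落顶替原证据
    return passages
```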

[NLP-35] Family Matters: Language Transfer and Merging for Adapting Small LLM s to Faroese

【速读】: 该论文旨在解决小规模、高效的大语言模型(Large Language Models, LLMs)在低资源北日耳曼语种法罗语(Faroese)上的适配问题。其关键解决方案在于:首先基于英语模型,在相关斯堪的纳维亚语言(如冰岛语和丹麦语)上进行继续预训练,采用单独训练或合并融合的方式;随后在法罗语上进行微调(fine-tuning),并对比全量微调与参数高效微调方法LoRA(Low-Rank Adaptation)的效果。实验表明,迁移自相关语言对性能提升至关重要,且最优源语言因任务而异——冰岛语更利于提升语言准确性,丹麦语则有助于增强文本理解能力;同时,LoRA在提升语言可接受性方面优于全量微调,而全量微调在下游任务中更能保持模型能力并显著提升理解性能。

链接: https://arxiv.org/abs/2510.00810
作者: Jenny Kunz,Iben Nyholm Debess,Annika Simonsen
机构: Linköping University (林雪平大学); University of the Faroe Islands (法罗群岛大学); University of Iceland (冰岛大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate how to adapt small, efficient LLMs to Faroese, a low-resource North Germanic language. Starting from English models, we continue pre-training on related Scandinavian languages, either individually or combined via merging, before fine-tuning on Faroese. We compare full fine-tuning with parameter-efficient tuning using LoRA, evaluating their impact on both linguistic accuracy and text comprehension. Due to the lack of existing Faroese evaluation data, we construct two new minimal-pair benchmarks from adapted and newly collected datasets and complement them with human evaluations by Faroese linguists. Our results demonstrate that transfer from related languages is crucial, though the optimal source language depends on the task: Icelandic enhances linguistic accuracy, whereas Danish boosts comprehension. Similarly, the choice between full fine-tuning and LoRA is task-dependent: LoRA improves linguistic acceptability and slightly increases human evaluation scores on the base model, while full fine-tuning yields stronger comprehension performance and better preserves model capabilities during downstream fine-tuning.
zh
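
【示例】论文对比了全量微调与 LoRA。下面是用 Hugging Face peft 库配置 LoRA 的常见写法示意(模型名为占位符,r、lora_alpha、target_modules 等超参为常见取法,论文未给出具体数值):

```python
# LoRA 参数高效微调配置示意(peft 库)。超参与目标模块为假设值。
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("base-model-name")  # 占位模型名
lora_cfg = LoraConfig(
    r=16,                                  # 低秩适配矩阵的秩
    lora_alpha=32,                         # 缩放系数
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # 注入 LoRA 的注意力投影层
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # 查看可训练参数占比
```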

[NLP-36] What You See is What You Ask: Evaluating Audio Descriptions EMNLP2025

【速读】: 该论文旨在解决当前自动音频描述(Audio Description, AD)生成方法在评估和实用性上的局限性问题。现有研究多基于几秒的片段进行生成与评估,且仅以单一参考文本作为标准,忽略了AD创作本身的高度主观性,导致无法真实反映其对盲人及低视力(Blind and Low Vision, BLV)用户理解剧情和感知视觉细节的实际帮助。论文的关键解决方案是提出ADQA基准,这是一个面向几分钟长、语义连贯视频段落的问答式评测框架,包含视觉欣赏(Visual Appreciation, VA)和叙事理解(Narrative Understanding, NU)两类问题,从而系统量化AD对BLV用户的认知价值。实验表明,当前AD生成模型显著落后于人工撰写版本,ADQA为未来研究提供了更贴近实际需求的评估标准与公开排行榜。

链接: https://arxiv.org/abs/2510.00808
作者: Divy Kala,Eshika Khandelwal,Makarand Tapaswi
机构: CVIT, IIIT Hyderabad, India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2025 Main Track Long Paper

点击查看摘要

Abstract:Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.
zh

[NLP-37] From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling

【速读】: 该论文旨在解决合成语音感知质量评估中依赖人工主观评分(如平均意见分,Mean Opinion Score, MOS)所导致的评价标准不一致和可重复性差的问题。其解决方案的关键在于构建了一个统一基准MOS-RMBench,将多样化的MOS数据集转化为偏好比较(preference-comparison)设置,从而实现跨数据集的严谨评估;在此基础上,系统比较了标量奖励模型、半标量奖励模型和生成式奖励模型(Generative Reward Models, GRMs)三种范式,并提出一种基于MOS差异的自适应奖励函数——MOS-aware GRM,使模型能够根据样本对的难度动态调整奖励强度,显著提升对细微质量差异的判别能力,缩小与标量模型在困难样本上的性能差距。

链接: https://arxiv.org/abs/2510.00743
作者: Yifei Cao,Changhao Jiang,Jiabao Zhuang,Jiajun Sun,Ming Zhang,Zhiheng Xi,Hui Li,Shihan Dou,Yuran Wang,Yunke Zhang,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); Honor Device Co., Ltd
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, it has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates an MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
zh
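
【示例】MOS-aware GRM 的核心是按样本对的 MOS 差自适应缩放奖励:差值越小、样本越难,奖励(或惩罚)越大。以下缩放形式为假设,非论文原式:

```python
# MOS 感知奖励示意:预测偏好正确给正奖励、错误给负奖励,
# 幅度随 |MOS 差| 减小而放大,促使模型分辨细微的质量差异。
def mos_aware_reward(pred_prefers_a, mos_a, mos_b, eps=0.1):
    correct = pred_prefers_a == (mos_a > mos_b)
    scale = 1.0 / (abs(mos_a - mos_b) + eps)  # 差越小,权重越大
    return scale if correct else -scale
```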

[NLP-38] ALARB: An Arabic Legal Argument Reasoning Benchmark

【速读】: 该论文旨在解决阿拉伯语大语言模型(Large Language Models, LLMs)在法律领域中多步推理能力评估与提升的问题,特别是针对开放场景下缺乏高质量、结构化数据集的现状。现有阿拉伯语基准多集中于知识密集型任务(如检索和理解),但对复杂法律推理链条的建模仍存在显著空白。解决方案的关键在于构建ALARB数据集——包含超过13,000个沙特商业法院案例,每个案例均标注事实、法庭推理过程、判决结果及引用法规条款,并据此定义了三项挑战性任务:判决预测、多步法律论证链补全和基于案情的事实相关法规识别。通过在该数据集上对代表性开源与闭源阿拉伯语LLMs进行指令微调(instruction tuning),实验表明,仅用一个12B参数模型即可显著提升判决预测与阿拉伯语判决生成性能,达到与GPT-4o相当的水平,验证了ALARB在推动阿拉伯语法律推理模型发展的关键作用。

链接: https://arxiv.org/abs/2510.00694
作者: Harethah Abu Shairah,Somayah AlHarbi,Abdulaziz AlHussein,Sameer Alsabea,Omar Shaqaqi,Hebah AlShamlan,Omar Knio,George Turkiyyah
机构: King Abdullah University of Science and Technology (KAUST); THIQAH
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted paper at ArabicNLP 2025

点击查看摘要

Abstract:We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset’s utility for instruction tuning. Notably, we show that instruction-tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.
zh

[NLP-39] Inclusive Easy-to-Read Generation for Individuals with Cognitive Impairments ECAI2025

【速读】: 该论文旨在解决为认知障碍人群提供可访问信息的难题,特别是针对手动生成易读文本(Easy-to-Read, ETR)效率低、成本高且难以规模化的问题。其关键解决方案包括:构建首个符合欧洲ETR指南的ETR-fr数据集,实现对预训练语言模型(PLMs)和大语言模型(LLMs)的参数高效微调(parameter-efficient fine-tuning),并提出一个结合自动指标与人工评估(基于36项问题的评估表)的多维度评价框架,从而在保证输出质量的同时提升模型对跨领域文本的适应能力。

链接: https://arxiv.org/abs/2510.00691
作者: François Ledoyen,Gaël Dias,Alexis Lechervy,Jeremie Pantin,Fabrice Maurel,Youssef Chahir,Elisa Gouzonnat,Mélanie Berthelot,Stanislas Moravac,Armony Altinier,Amy Khairalla
机构: Université Caen Normandie (卡昂诺曼底大学); ENSICAEN (法国卡昂国立高等工程师学院); CNRS (法国国家科学研究中心); Normandie Univ (诺曼底大学); GREYC UMR 6072 (GREYC UMR 6072 实验室); CRISCO UR 4255 (CRISCO UR 4255 实验室); Koena SAS (Koena SAS 公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ECAI 2025

点击查看摘要

Abstract:Ensuring accessibility for individuals with cognitive impairments is essential for autonomy, self-determination, and full citizenship. However, manual Easy-to-Read (ETR) text adaptations are slow, costly, and difficult to scale, limiting access to crucial information in healthcare, education, and civic life. AI-driven ETR generation offers a scalable solution but faces key challenges, including dataset scarcity, domain adaptation, and balancing lightweight learning of Large Language Models (LLMs). In this paper, we introduce ETR-fr, the first dataset for ETR text generation fully compliant with European ETR guidelines. We implement parameter-efficient fine-tuning on PLMs and LLMs to establish generative baselines. To ensure high-quality and accessible outputs, we introduce an evaluation framework based on automatic metrics supplemented by human assessments. The latter is conducted using a 36-question evaluation form that is aligned with the guidelines. Overall results show that PLMs perform comparably to LLMs and adapt effectively to out-of-domain texts.
zh

[NLP-40] Stochastic Self-Organization in Multi-Agent Systems

【速读】: 该论文旨在解决多智能体系统(Multi-agent Systems, MAS)中因固定通信结构导致协作效率低下、难以适应动态任务需求的问题。现有方法依赖预设拓扑、外部判别器或复杂优化策略,增加了系统复杂性且缺乏灵活性。其解决方案的关键在于提出一种响应条件型自组织框架(SelfOrg),通过代理在每轮协作中独立生成响应并利用近似Shapley值评估同伴贡献,构建一个有向无环图(DAG)来动态调控信息传播路径,从而实现高贡献代理向其他代理稳定高效地传递信息。该机制无需额外训练或监督,能自动适应代理响应的随机性,显著提升弱模型场景下的鲁棒性,并理论上证明了多代理协同可提高正确性概率,且正确响应自然主导信息流。

链接: https://arxiv.org/abs/2510.00685
作者: Nurbek Tastan,Samuel Horvath,Karthik Nandakumar
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Michigan State University (MSU)
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM. However, this potential can only be realized when the collaboration mechanism between agents is optimized. Specifically, optimizing the communication structure between agents is critical for fruitful collaboration. Most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or employ external LLM judges, thereby adding to the complexity. In this work, we introduce a response-conditioned framework that adapts communication on-the-fly. Agents independently generate responses to the user query and assess peer contributions using an approximation of the Shapley value. A directed acyclic graph (DAG) is then constructed to regulate the propagation of the responses among agents, which ensures stable and efficient message transmission from high-contributing agents to others. This graph is dynamically updated based on the agent responses from the previous collaboration round. Since the proposed framework enables the self-organization of agents without additional supervision or training, we refer to it as SelfOrg. The SelfOrg framework goes beyond task- and query-level optimization and takes into account the stochastic nature of agent responses. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse. We also theoretically show that multiple agents increase the chance of correctness and that the correct responses naturally dominate the information flow.
zh
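
【示例】SelfOrg 用近似 Shapley 值评估各代理响应的贡献,再据此构建传播 DAG。下面是蒙特卡洛 Shapley 近似与“高贡献者指向低贡献者”建图的通用示意(utility 为假设接口,返回某个代理联盟的回答质量):

```python
# 蒙特卡洛 Shapley 近似示意:随机排列代理,以加入某代理前后
# 联盟效用的边际增量均值作为其贡献;再按贡献排序生成 DAG。
import random

def shapley_estimate(agents, utility, n_perm=200, seed=0):
    rng = random.Random(seed)
    contrib = {a: 0.0 for a in agents}
    for _ in range(n_perm):
        order = agents[:]
        rng.shuffle(order)
        coalition, prev = [], utility([])
        for a in order:
            coalition.append(a)
            cur = utility(coalition)
            contrib[a] += (cur - prev) / n_perm
            prev = cur
    return contrib

def build_dag(contrib):
    ranked = sorted(contrib, key=contrib.get, reverse=True)
    # 按贡献排名从高到低连边,天然无环
    return [(u, v) for i, u in enumerate(ranked) for v in ranked[i + 1:]]
```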

[NLP-41] MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector

【速读】: 该论文旨在解决多语言和跨语言稀疏检索(Learned Sparse Retrieval, LSR)在非英语场景下性能受限的问题,尤其是现有LSR方法难以有效扩展至多语言环境。其核心挑战在于如何在保持检索效率的同时实现跨语言语义对齐,并避免因投影到英语词汇空间导致的语义坍缩(semantic collapse)及罕见实体信息丢失。解决方案的关键在于提出MILCO架构,通过一个多语言连接器(multilingual connector)将不同语言的查询与文档映射至共享的英语词汇空间,并采用两阶段训练策略:第一阶段进行稀疏对齐预训练以增强表示透明性,第二阶段结合对比学习提升效果;此外创新性地引入LexEcho头(LexEcho head),利用特殊标记[ECHO]获取源语言视角,从而增强英语表示的鲁棒性。实验表明,MILCO在标准多语言基准上超越主流密集、稀疏及多向量基线模型,且支持动态效率优化(如后处理剪枝),显著提升了多语言LSR的性能与实用性。

链接: https://arxiv.org/abs/2510.00671
作者: Thong Nguyen,Yibin Lei,Jia-Huei Ju,Eugene Yang,Andrew Yates
机构: University of Amsterdam (阿姆斯特丹大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions.
zh
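
【示例】摘要提到用基于质量(mass)的剪枝把文档稀疏表示压到平均约 30 个活跃维度。以下是按权重质量剪枝稀疏向量的一个示意(假设权重非负,阈值与维度上限为假设值):

```python
# 基于质量的剪枝示意:按权重降序累加,保留覆盖 mass_keep 比例
# 总质量所需的最少维度(且不超过 max_dims),其余维度置零。
import numpy as np

def mass_prune(sparse_vec, mass_keep=0.9, max_dims=30):
    idx = np.argsort(sparse_vec)[::-1]          # 权重降序(假设非负)
    cum = np.cumsum(sparse_vec[idx])
    n = int(np.searchsorted(cum, mass_keep * cum[-1]) + 1)
    n = min(n, max_dims)
    pruned = np.zeros_like(sparse_vec)
    pruned[idx[:n]] = sparse_vec[idx[:n]]
    return pruned
```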

[NLP-42] Facilitating Cognitive Accessibility with LLMs: A Multi-Task Approach to Easy-to-Read Text Generation EMNLP2025

【速读】: 该论文旨在解决为认知障碍人群提供可访问信息的难题,具体聚焦于手动创建易读文本(Easy-to-Read, ETR)所面临的耗时与资源密集问题。其核心解决方案是利用大语言模型(Large Language Models, LLMs)自动化生成ETR内容,关键创新在于提出一种多任务学习(Multi-Task Learning, MTL)框架,联合训练模型完成文本摘要、文本简化和ETR生成三个任务。通过引入两种策略——基于检索增强生成(Retrieval-Augmented Generation, RAG)的上下文学习方法和基于LoRA(Low-Rank Adaptation)的参数高效微调方法,实验证明多任务设置在所有配置下均优于单任务基线,其中RAG策略在跨领域场景中表现更优,而MTL-LoRA在领域内配置中效果最佳。

链接: https://arxiv.org/abs/2510.00662
作者: François Ledoyen,Gaël Dias,Jeremie Pantin,Alexis Lechervy,Fabrice Maurel,Youssef Chahir
机构: Université Caen Normandie (卡昂诺曼底大学); ENSICAEN (卡昂国立高等工程师学院); CNRS (法国国家科学研究中心); Normandie Univ (诺曼底大学联盟); GREYC UMR 6072 (GREYC UMR 6072 研究所); Koena SAS (Koena SAS 公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025

点击查看摘要

Abstract:Simplifying complex texts is essential for ensuring equitable access to information, especially for individuals with cognitive impairments. The Easy-to-Read (ETR) initiative offers a framework for making content accessible to the neurodivergent population, but the manual creation of such texts remains time-consuming and resource-intensive. In this work, we investigate the potential of large language models (LLMs) to automate the generation of ETR content. To address the scarcity of aligned corpora and the specificity of ETR constraints, we propose a multi-task learning (MTL) approach that trains models jointly on text summarization, text simplification, and ETR generation. We explore two different strategies: multi-task retrieval-augmented generation (RAG) for in-context learning, and MTL-LoRA for parameter-efficient fine-tuning. Our experiments with Mistral-7B and LLaMA-3-8B, based on ETR-fr, a new high-quality dataset, demonstrate the benefits of multi-task setups over single-task baselines across all configurations. Moreover, results show that the RAG-based strategy enables generalization in out-of-domain settings, while MTL-LoRA outperforms all learning strategies within in-domain configurations.
zh

[NLP-43] MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation ACM-MM2025

【速读】: 该论文旨在解决盲人及低视力用户在获取在线图像信息时面临的障碍,即生成高质量、上下文相关的替代文本(alt-text)的问题。现有基于大视觉语言模型(MLLMs)的方法受限于用户生成的alt-text标注噪声大、标准不一致以及模型对上下文敏感性不足等问题,导致性能提升有限。传统监督微调(SFT)方法依赖精确的目标标注,难以应对实际中普遍存在的标注错误。为此,作者提出多维跨模态直接偏好优化(MCM-DPO),其核心在于无需精确标注即可通过学习偏好对(preference pairs)来识别更优的alt-text选项,从而从单样本、成对和多偏好三个维度优化文本、视觉及跨模态一致性。实验表明,MCM-DPO显著优于DPO与SFT,成为alt-text生成的新基准。

链接: https://arxiv.org/abs/2510.00647
作者: Jinlan Fu,Shenzhen Huangfu,Hao Fei,Yichong Huang,Xiaoyu Shen,Xipeng Qiu,See-Kiong Ng
机构: National University of Singapore (新加坡国立大学); Fudan University (复旦大学); Harbin Institute of Technology (哈尔滨工业大学); Eastern Institute of Technology (东方理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACM MM 2025

点击查看摘要

Abstract:The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs’ insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: this https URL
zh
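
【示例】MCM-DPO 在标准 DPO 之上组合单项、成对与多偏好维度的优化;下面仅给出基础 DPO 损失的 PyTorch 示意,便于理解其出发点(输入为策略模型与参考模型在较优/较差回答上的序列对数概率):

```python
# 标准 DPO 损失示意:最大化 sigmoid(beta * 隐式奖励差) 的对数。
# MCM-DPO 的多维偏好项在此基础上扩展,此处不展开。
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # 隐式奖励差:(log pi - log ref) 在优/劣样本上的差
    logits = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```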

[NLP-44] Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中键值缓存(Key-Value Cache, KV cache)内存消耗过大的问题。现有基于注意力分数的KV缓存压缩方法存在两大实践限制:一是压缩时无法获取未来token的注意力分数,二是现代实现如Flash Attention不显式生成完整的注意力矩阵,导致历史注意力分数不可用。解决方案的关键在于提出一种无需训练的压缩方法——Expected Attention,其通过预测未来查询对每个KV对的注意力分布来估计其重要性,利用LLM激活的分布特性以闭式形式计算预期注意力分数,从而实现对KV对的合理排序与剪枝,最小化对残差流的影响,在保持性能不变的前提下实现高效压缩。该方法在预填充(prefilling)和解码(decoding)阶段均表现优异,并已开源KVPress库支持多种压缩方法的实现与基准测试。

链接: https://arxiv.org/abs/2510.00636
作者: Alessio Devoto,Maximilian Jeblick,Simon Jégou
机构: Sapienza University of Rome (罗马大学); NVIDIA (英伟达)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Memory consumption of the Key-Value (KV) cache represents a major bottleneck for efficient large language model inference. While attention-score-based KV cache pruning shows promise, it faces critical practical limitations: attention scores from future tokens are unavailable during compression, and modern implementations like Flash Attention do not materialize the full attention matrix, making past scores inaccessible. To overcome these challenges, we introduce Expected Attention, a training-free compression method that estimates KV pairs importance by predicting how future queries will attend to them. Our approach leverages the distributional properties of LLM activations to compute expected attention scores in closed form for each KV pair. These scores enable principled ranking and pruning of KV pairs with minimal impact on the residual stream, achieving effective compression without performance degradation. Importantly, our method operates seamlessly across both prefilling and decoding phases, consistently outperforming state-of-the-art baselines in both scenarios. Finally, we release KVPress, a comprehensive library to enable researchers to implement and benchmark KV cache compression methods, already including more than 20 techniques.
zh
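
【示例】若假设未来查询近似服从高斯分布 q ~ N(μ, Σ),则单个 key 的期望未归一化注意力有闭式解 E[exp(qᵀk/√d)] = exp(μᵀk/√d + kᵀΣk/(2d))(对数正态均值)。下面是对角协方差近似下的打分与剪枝示意(分布估计方式为假设,非 KVPress 官方实现):

```python
# Expected Attention 打分示意:用已观测查询估计均值与对角协方差,
# 按对数正态期望为每个 key 计算期望注意力分数,低分 KV 对可剪枝。
import numpy as np

def expected_attention_scores(keys, observed_queries):
    d = keys.shape[-1]
    mu = observed_queries.mean(axis=0)        # 查询均值 μ
    var = observed_queries.var(axis=0)        # 对角协方差 diag(Σ)
    mean_term = keys @ mu / np.sqrt(d)
    var_term = (keys ** 2) @ var / (2.0 * d)  # kᵀΣk/(2d) 的对角近似
    return np.exp(mean_term + var_term)

def prune_kv(keys, values, observed_queries, keep_ratio=0.5):
    scores = expected_attention_scores(keys, observed_queries)
    keep = np.argsort(scores)[::-1][: max(1, int(len(keys) * keep_ratio))]
    return keys[keep], values[keep]
```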

[NLP-45] Tenyidie Syllabification corpus creation and deep learning applications

【速读】: 该论文旨在解决低资源语言Tenyidie语的音节切分(syllabification)问题,这是自然语言处理(Natural Language Processing, NLP)中的一项基础任务,对后续如形态分析、词性标注和机器翻译等应用具有重要意义。由于该语言缺乏相关研究与数据资源,作者首次构建了包含10,120个音节化单词的语料库,并采用深度学习方法进行建模,关键解决方案在于使用LSTM、BLSTM、BLSTM+CRF及Encoder-decoder四种神经网络架构在80:10:10划分的数据集上进行训练与测试,其中BLSTM模型在测试集上达到99.21%的最高准确率,显著提升了音节切分性能,为Tenyidie语及其他类似低资源语言的NLP研究提供了可复用的数据基础与技术范式。

链接: https://arxiv.org/abs/2510.00629
作者: Teisovi Angami,Kevisino Khate
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages

点击查看摘要

Abstract:The Tenyidie language is a low-resource language of the Tibeto-Burman family spoken by the Tenyimia Community of Nagaland in the north-eastern part of India and is considered a major language in Nagaland. It is tonal, Subject-Object-Verb, and highly agglutinative in nature. Being a low-resource language, very limited research on Natural Language Processing (NLP) has been conducted. To the best of our knowledge, no work on syllabification has been reported for this language. Among the many NLP tasks, syllabification or syllabication is an important task in which the given word syllables are identified. The contribution of this work is the creation of 10,120 syllabified Tenyidie words and the application of the Deep Learning techniques on the created corpus. In this paper, we have applied LSTM, BLSTM, BLSTM+CRF, and Encoder-decoder deep learning architectures on our created dataset. In our dataset split of 80:10:10 (train:validation:test) set, we achieved the highest accuracy of 99.21% with BLSTM model on the test set. This work will find its application in numerous other NLP applications, such as morphological analysis, part-of-speech tagging, machine translation, etc, for the Tenyidie Language. Keywords: Tenyidie; NLP; syllabification; deep learning; LSTM; BLSTM; CRF; Encoder-decoder
zh

[NLP-46] Hearing the Order: Investigating Selection Bias in Large Audio-Language Models ICASSP2026

【速读】: 该论文旨在解决大型音频语言模型(Large Audio-Language Models, LALMs)在处理有序选项任务时可能存在的选择偏差(selection bias)问题,即模型预测结果受答案选项顺序影响的现象,这会显著降低模型评估的可靠性。解决方案的关键在于引入基于排列(permutation-based)的策略,通过随机化或系统性变换选项顺序来缓解这种偏差,在多数情况下可有效提升模型输出的一致性和公平性。

链接: https://arxiv.org/abs/2510.00628
作者: Yu-Xiang Lin,Chen-An Li,Sheng-Lun Wei,Po-Chun Chen,Hsin-Hsi Chen,Hung-yi Lee
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: The first two authors contributed equally. Submitted to ICASSP 2026

点击查看摘要

Abstract:Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.
zh
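
【示例】摘要指出基于排列的策略能在多数情况下缓解选择偏差。下面是“循环移位选项 + 多数投票”的评测脚手架示意(ask_model 为假设接口,输入题干与选项列表、返回所选下标):

```python
# 排列去偏示意:对每道选择题,把选项循环移位 n 次分别询问模型,
# 将预测映射回原始下标后做多数投票,以削弱位置偏好的影响。
from collections import Counter

def permutation_vote(question, options, ask_model):
    n = len(options)
    votes = []
    for s in range(n):
        shifted = options[s:] + options[:s]  # 移位后第 i 项的原下标为 (i+s)%n
        pred = ask_model(question, shifted)
        votes.append((pred + s) % n)         # 映射回原始选项下标
    return Counter(votes).most_common(1)[0][0]
```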

[NLP-47] When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models ICASSP2026

【速读】: 该论文旨在解决大规模音频-语言模型(Large Audio-Language Models, LALMs)在真实噪声环境中鲁棒性不足的问题,特别是当输入中包含与文本推理任务无关的音频信息(如静音、合成噪声或环境声)时,模型性能如何受到影响。研究发现,即使是非信息性音频也会显著降低文本推理准确率并增加预测波动,且干扰强度随音频持续时间、幅度和解码温度升高而加剧;值得注意的是,静音对输出稳定性的破坏作用与合成噪声相当。解决方案的关键在于采用自一致性(self-consistency)策略以提升稳定性,尽管其代价是计算开销增加,而提示工程(prompting)则效果有限。这揭示了跨模态干扰是当前LALMs的核心鲁棒性挑战,并强调需发展高效的多模态融合机制以保障在无关输入存在下的推理性能。

链接: https://arxiv.org/abs/2510.00626
作者: Chen-An Li,Tzu-Han Lin,Hung-yi Lee
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: 5 pages; submitted to ICASSP 2026

点击查看摘要

Abstract:Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.
zh

[NLP-48] HARPA: A Testability-Driven Literature-Grounded Framework for Research Ideation

【速读】: 该论文旨在解决自动化科学发现(Automated Scientific Discovery, ASD)中两个核心问题:一是现有工具难以生成既可测试又基于文献的假设;二是现有创意生成工具无法根据先前实验结果进行自适应调整。解决方案的关键在于提出HARPA系统,其通过模仿人类研究人员的构思流程实现突破:首先利用文献挖掘识别新兴研究趋势,继而探索假设设计空间,并最终通过定位研究空白和论证设计选择来聚焦于具体且可验证的假设。此外,HARPA引入一个基于先前实验结果训练的奖励模型,动态评估新假设的可行性与潜力,从而显著提升假设的可行性和文献基础性,相较未训练基线提升约28%的假设质量得分。

链接: https://arxiv.org/abs/2510.00620
作者: Rosni Vasu,Peter Jansen,Pao Siangliulue,Cristina Sarasua,Abraham Bernstein,Peter Clark,Bhavana Dalvi Mishra
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages (main), 65 pages total

点击查看摘要

Abstract:While there has been a surge of interest in automated scientific discovery (ASD), especially with the emergence of LLMs, it remains challenging for tools to generate hypotheses that are both testable and grounded in the scientific literature. Additionally, existing ideation tools are not adaptive to prior experimental outcomes. We developed HARPA to address these challenges by incorporating the ideation workflow inspired by human researchers. HARPA first identifies emerging research trends through literature mining, then explores hypothesis design spaces, and finally converges on precise, testable hypotheses by pinpointing research gaps and justifying design choices. Our evaluations show that HARPA-generated hypothesis-driven research proposals perform comparably to a strong baseline AI-researcher across most qualitative dimensions (e.g., specificity, novelty, overall quality), but achieve significant gains in feasibility (+0.78, p < 0.05, bootstrap) and groundedness (+0.85, p < 0.01, bootstrap) on a 10-point Likert scale. When tested with the ASD agent (CodeScientist), HARPA produced more successful executions (20 vs. 11 out of 40) and fewer failures (16 vs. 21 out of 40), showing that expert feasibility judgments track with actual execution success. Furthermore, to simulate how researchers continuously refine their understanding of what hypotheses are both testable and potentially interesting from experience, HARPA learns a reward model that scores new hypotheses based on prior experimental outcomes, achieving approx. a 28% absolute gain over HARPA’s untrained baseline scorer. Together, these methods represent a step forward in the field of AI-driven scientific discovery.
zh

[NLP-49] ACON: Optimizing Context Compression for Long-horizon LLM Agents

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在作为智能体(agent)部署于动态现实环境时,因长期交互历史与环境观测积累导致上下文长度急剧增长所带来的内存开销大、计算效率低的问题。解决方案的关键在于提出一种统一的智能体上下文优化框架(Agent Context Optimization, ACON),其核心机制是通过自然语言空间中的压缩指导优化(compression guideline optimization):利用LLM分析“完整上下文成功而压缩上下文失败”的轨迹对,识别失败原因并迭代更新压缩策略;同时引入蒸馏技术将优化后的压缩模块小型化,显著降低额外计算负担。实验表明,ACON可在保持任务性能的前提下减少26%-54%的峰值令牌占用,并使小型模型在长程任务中性能提升最高达46%。

链接: https://arxiv.org/abs/2510.00615
作者: Minki Kang,Wei-Ning Chen,Dongge Han,Huseyin A. Inan,Lukas Wutschitz,Yanzhi Chen,Robert Sim,Saravan Rajmohan
机构: KAIST(韩国科学技术院); Microsoft; University of Cambridge(剑桥大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.
zh

[NLP-50] Eyes-on-Me: Scalable RAG Poisoning through Transferable Attention-Steering Attractors

【速读】: 该论文旨在解决现有针对检索增强生成(Retrieval-Augmented Generation, RAG)系统的数据投毒攻击在实际应用中扩展性差的问题,即攻击者需为每个目标短语重新优化中毒文档,导致计算成本高昂。其解决方案的关键在于提出一种模块化攻击框架Eyes-on-Me,将对抗性文档分解为可复用的注意力吸引器(Attention Attractors)和聚焦区域(Focus Regions):吸引器被优化以引导注意力至聚焦区域,从而插入语义诱饵或恶意指令;攻击者可基于少量被实证识别出与攻击成功率强相关的注意力头,对新目标实现近乎零成本的适应。该方法显著提升了攻击成功率(平均从21.9%提升至57.8%),并验证了单个优化吸引器可在未见黑盒检索器和生成器上直接迁移,揭示了注意力集中度与模型输出间的强关联,为RAG系统的安全性和可解释性研究提供了重要启示。

链接: https://arxiv.org/abs/2510.00586
作者: Yen-Shan Chen,Sian-Yao Huang,Cheng-Lin Yang,Yun-Nung Chen
机构: CyCraft AI Lab (台湾); National Taiwan University (台湾)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Existing data poisoning attacks on retrieval-augmented generation (RAG) systems scale poorly because they require costly optimization of poisoned documents for each target phrase. We introduce Eyes-on-Me, a modular attack that decomposes an adversarial document into reusable Attention Attractors and Focus Regions. Attractors are optimized to direct attention to the Focus Region. Attackers can then insert semantic baits for the retriever or malicious instructions for the generator, adapting to new targets at near zero cost. This is achieved by steering a small subset of attention heads that we empirically identify as strongly correlated with attack success. Across 18 end-to-end RAG settings (3 datasets × 2 retrievers × 3 generators), Eyes-on-Me raises average attack success rates from 21.9 to 57.8 (+35.9 points, 2.6× over prior work). A single optimized attractor transfers to unseen black box retrievers and generators without retraining. Our findings establish a scalable paradigm for RAG data poisoning and show that modular, reusable components pose a practical threat to modern AI systems. They also reveal a strong link between attention concentration and model outputs, informing interpretability research.
zh

[NLP-51] SAGE-LD: Towards Scalable and Generalizable End-to-End Language Diarization via Simulated Data Augmentation

【速读】: 该论文旨在解决传统语音语言聚类(language diarization)方法在多语言场景下因数据稀缺和架构优化不足而导致的泛化能力差的问题,尤其是在真实世界中存在大量语言混用(code-switching)的情况。其解决方案的关键在于提出一种基于可学习查询(learnable query-based)架构的神经网络模型,并通过大规模模拟语言混用数据进行预训练,从而实现对任意语言组合的统一建模与高效识别。该方法显著提升了多语言环境下的性能,在多个基准测试中相较以往方法相对提升达23%至52%。

链接: https://arxiv.org/abs/2510.00582
作者: Sangmin Lee,Woongjib Choi,Jihyun Kim,Hong-Goo Kang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:In this paper, we present a neural spoken language diarization model that supports an unconstrained span of languages within a single framework. Our approach integrates a learnable query-based architecture grounded in multilingual awareness, with large-scale pretraining on simulated code-switching data. By jointly leveraging these two components, our method overcomes the limitations of conventional approaches in data scarcity and architecture optimization, and generalizes effectively to real-world multilingual settings across diverse environments. Experimental results demonstrate that our approach achieves state-of-the-art performance on several language diarization benchmarks, with a relative performance improvement of 23% to 52% over previous methods. We believe that this work not only advances research in language diarization but also establishes a foundational framework for code-switching speech technologies.
zh

[NLP-52] CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs

【速读】: 该论文旨在解决当前Chain-of-Thought (CoT) prompting在大型语言模型(Large Language Models, LLMs)中实现多步推理时存在的成本高、效率低的问题,尤其是在基于上下文学习(in-context learning)和微调(fine-tuning)的方法中。其解决方案的关键在于提出CoT向量(CoT Vectors),即一种紧凑的表示形式,用于编码任务通用的多步推理知识;进一步通过引入可学习的CoT向量(Learnable CoT Vectors),并在教师-学生框架下进行优化,以提升推理引导的稳定性与鲁棒性。实验表明,该方法在多个基准测试中性能优于现有基线,并达到与参数高效微调相当的效果,同时显著减少可训练参数数量,为理解LLMs中多步推理的功能组织提供了新视角。

链接: https://arxiv.org/abs/2510.00579
作者: Li Li,Ziyi Wang,Yongliang Wu,Jianfei Cai,Xu Yang
机构: School of Computer Science & Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; Data Science & AI Department at Faculty of IT, Monash University, Australia
类目: Computation and Language (cs.CL)
备注: 22 pages, 7 figures

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher-student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.
zh
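
【示例】“提取式 CoT 向量”的一个常见近似做法:在选定层上取“带 CoT 提示”与“不带 CoT”两种前向的平均隐藏状态之差作为向量,推理时用 forward hook 加回该层输出。以下 PyTorch 示意中,层的选择与缩放系数均为假设,并非论文的可学习(Learnable)方案:

```python
# CoT 向量提取与注入示意。hidden_* 形状为 [样本数, 隐藏维],
# 已对序列位置取平均;注入通过 forward hook 修改该层输出。
import torch

@torch.no_grad()
def extract_cot_vector(hidden_with_cot, hidden_without_cot):
    return (hidden_with_cot - hidden_without_cot).mean(dim=0)

def inject_cot_vector(layer_module, cot_vec, scale=1.0):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + scale * cot_vec.to(h.device, h.dtype)
        return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
    return layer_module.register_forward_hook(hook)  # 返回句柄,便于 remove()
```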

[NLP-53] ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的搜索代理在强化学习(Reinforcement Learning, RL)训练过程中因依赖稀疏或规则奖励而导致的错误推理路径难以修正的问题。其解决方案的关键在于提出一种名为 ReSeek 的自校正框架,该框架通过引入一个动态的自我纠正机制,使代理能够在推理过程中调用特殊的 JUDGE 行动来评估当前信息并重新规划搜索策略;同时设计了一个密集且具有指导性的过程奖励函数,将奖励分解为用于衡量事实准确性的真实性奖励(correctness reward)和用于评估信息对查询实际价值的效用奖励(utility reward),从而提升任务成功率与路径忠实度。

链接: https://arxiv.org/abs/2510.00568
作者: Shiyu Li,Yang Tang,Yifan Wang,Peiming Li,Xi Chen
机构: Tencent (腾讯); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: 19 pages

点击查看摘要

Abstract:Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. Being intuitively reasonable and practically simple, extensive experiments show that agents trained with ReSeek significantly outperform SOTA baselines in task success rate and path faithfulness.
zh

[NLP-54] Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation EMNLP2025

【Quick Read】: This paper addresses the limited ability of large language models (LLMs) to understand Chinese Internet memes: explaining what a meme means, identifying its origin, and using it appropriately in context. The key contribution is CHIME, a dataset of popular phrase-based memes from the Chinese Internet annotated with detailed information (meaning, origin, example sentences, type, etc.), together with two evaluation tasks: (1) explaining a given meme, identifying its origin, and generating example sentences; and (2) multiple-choice questions that ask the model to pick the most appropriate meme to fill a blank in a contextual sentence. Experiments show that while LLMs can explain some memes, performance drops sharply for culturally and linguistically nuanced meme types, and models consistently fail to provide accurate origins, exposing a systematic gap in understanding online subculture.

Link: https://arxiv.org/abs/2510.00567
Authors: Yubo Xie, Chenkai Wang, Zongyang Ma, Fahui Miao
Affiliations: Shanghai Maritime University; École Polytechnique Fédérale de Lausanne; Xi'an Jiaotong-Liverpool University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 Main Conference. 22 pages, 3 figures, 13 tables. GitHub: this http URL

Abstract:Large language models (LLMs) are trained on vast amounts of text from the Internet, but do they truly understand the viral content that rapidly spreads online – commonly known as memes? In this paper, we introduce CHIME, a dataset for CHinese Internet Meme Explanation. The dataset comprises popular phrase-based memes from the Chinese Internet, annotated with detailed information on their meaning, origin, example sentences, types, etc. To evaluate whether LLMs understand these memes, we designed two tasks. In the first task, we assessed the models’ ability to explain a given meme, identify its origin, and generate appropriate example sentences. The results show that while LLMs can explain the meanings of some memes, their performance declines significantly for culturally and linguistically nuanced meme types. Additionally, they consistently struggle to provide accurate origins for the memes. In the second task, we created a set of multiple-choice questions (MCQs) requiring LLMs to select the most appropriate meme to fill in a blank within a contextual sentence. While the evaluated models were able to provide correct answers, their performance remains noticeably below human levels. We have made CHIME public and hope it will facilitate future research on computational meme understanding.

[NLP-55] ThinkBrake: Mitigating Overthinking in Tool Reasoning

【Quick Read】: This paper addresses overthinking in small reasoning models (SRMs) during tool use: the model reaches a correct tool-argument configuration, then keeps reasoning and overwrites it with an incorrect final call. The core of the solution is ThinkBrake, a training-free decoding heuristic that monitors the log-probability margin between the </think> token and the current top token at sentence boundaries and triggers early termination when the margin becomes small. This prunes redundant reasoning, preserving or improving tool-calling accuracy while cutting token usage by up to 25%.

Link: https://arxiv.org/abs/2510.00546
Authors: Minjae Oh, Sangjun Song, Seungkyu Lee, Sungmin Jo, Yohan Jo
Affiliations: Seoul National University; Korea University
Subjects: Computation and Language (cs.CL)

Abstract:Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject </think> at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potential redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between </think> and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL’s single turn, non-live and live splits, ThinkBrake preserves or improves accuracy while reducing tokens up to 25%, outperforming various baselines.
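The stopping rule is simple enough to sketch directly from the abstract. The </think> token id, the margin threshold, and the sentence-boundary test are assumptions.

```python
import torch
import torch.nn.functional as F

def should_brake(logits: torch.Tensor, end_think_id: int,
                 margin_threshold: float = 1.0) -> bool:
    """Decide, at a sentence boundary, whether to stop the thinking phase.

    logits: next-token logits of shape (vocab_size,).
    Returns True when log p(top token) - log p(</think>) is small, i.e.,
    ending the reasoning is nearly as likely as continuing it.
    """
    logprobs = F.log_softmax(logits, dim=-1)
    margin = (logprobs.max() - logprobs[end_think_id]).item()
    return margin < margin_threshold

# Inside the decoding loop (pseudocode):
# if token_ends_sentence(prev_token) and should_brake(logits, END_THINK_ID):
#     force-emit </think> and proceed to the final tool call.
```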

[NLP-56] GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

【Quick Read】: This paper addresses the inefficiency of GUI agents built on vision-language models, which must process long sequences of high-resolution screenshots over long-horizon tasks, making inference slow, costly, and memory-bound; conventional key-value (KV) cache compression is sub-optimal because it ignores the spatial and temporal redundancy of GUIs. The authors first find that, unlike with natural images, attention sparsity in GUI workloads is uniformly high across all transformer layers, motivating a simple uniform cache-budget allocation that empirically beats more complex layer-varying schemes. On top of this they propose GUI-KV, a plug-and-play, retraining-free KV cache compression method with two novel components: spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to preserve semantically important visual tokens, and temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. GUI-KV closely matches full-cache accuracy at modest budgets, e.g., cutting decoding FLOPs by 38.9% while raising step accuracy by 4.1% in a 5-screenshot AgentNetBench setting.

Link: https://arxiv.org/abs/2510.00536
Authors: Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, Chien-Sheng Wu
Affiliations: Salesforce AI Research; University of California, Los Angeles
Subjects: Computation and Language (cs.CL)

Abstract:Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they also face the inefficiency challenge as they process long sequences of high-resolution screenshots and solving long-horizon tasks, making inference slow, costly and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal as they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames’ keys onto the current frame’s key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
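A rough sketch of the two scoring signals described above. The mixing weight, the least-squares projection, and the tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def spatial_saliency(attn_scores: torch.Tensor,
                     hidden_states: torch.Tensor, lam: float = 0.1):
    """attn_scores: (seq,) aggregated attention each cached token receives.
    hidden_states: (seq, d) hidden vectors of the cached tokens.
    Adding an L2-norm bonus helps semantically rich visual tokens survive."""
    return attn_scores + lam * hidden_states.norm(dim=-1)

def temporal_redundancy(prev_keys: torch.Tensor, curr_keys: torch.Tensor):
    """Score how well each previous-frame key is explained by the current
    frame's key subspace; a high ratio means redundant, so prune first.
    prev_keys: (n_prev, d), curr_keys: (n_curr, d)."""
    # Least-squares projection of prev_keys onto span(rows of curr_keys).
    coeffs = torch.linalg.lstsq(curr_keys.T, prev_keys.T).solution
    projected = (curr_keys.T @ coeffs).T
    return projected.norm(dim=-1) / prev_keys.norm(dim=-1).clamp(min=1e-6)
```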

[NLP-57] Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

【Quick Read】: This paper revisits the default objective of supervised fine-tuning (SFT) for post-training large language models (LLMs): negative log likelihood (NLL) is classically optimal when training from scratch, but post-training operates in a different regime in which models already encode task-relevant priors and supervision can be long and noisy, so NLL's optimality assumptions may be violated. The authors study a general family of probability-based objectives and, through comprehensive experiments and ablations, uncover the key dimension governing objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that down-weight low-probability tokens (e.g., -p, -p^10, and thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. A theoretical analysis further explains how the objectives trade places along the continuum, providing a principled foundation for adapting the objective to model capability.

Link: https://arxiv.org/abs/2510.00526
Authors: Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 4 figures

Abstract:Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., -p, -p^10, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at this https URL.
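The objective family is easy to state concretely. Below is a minimal sketch; the exponent and threshold values are illustrative, and the 'thresh' branch is one plausible reading of "thresholded variants".

```python
import torch

def sft_token_loss(logits, targets, objective="nll", gamma=10, tau=0.1):
    """Per-token losses for a probability-based objective family.

    logits: (batch, seq, vocab); targets: (batch, seq) token ids.
    'nll'     : -log p      (standard SFT)
    'neg_p'   : -p          (prior-leaning; down-weights low-probability tokens)
    'neg_p_k' : -p**gamma   (even more prior-leaning)
    'thresh'  : NLL applied only where p >= tau (illustrative thresholded variant)
    """
    probs = torch.softmax(logits, dim=-1)
    p = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    if objective == "nll":
        return -torch.log(p.clamp(min=1e-9))
    if objective == "neg_p":
        return -p
    if objective == "neg_p_k":
        return -p.pow(gamma)
    if objective == "thresh":
        return torch.where(p >= tau,
                           -torch.log(p.clamp(min=1e-9)),
                           torch.zeros_like(p))
    raise ValueError(objective)
```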

[NLP-58] EuroSpeech: A Multilingual Speech Corpus NEURIPS2025

【Quick Read】: This paper tackles the poor performance of multilingual speech recognition models on most languages, which stems from insufficient per-language training data in existing multilingual corpora. The key is a scalable dataset-construction pipeline with a robust media-retrieval component and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applied to recordings from 22 European parliaments, the pipeline extracts over 61k hours of aligned speech segments with substantial per-language coverage (19 languages exceed 1k hours and 22 exceed 500 hours of high-quality speech), and fine-tuning an existing ASR model on the resulting data reduces word error rates by 41.8% on average over baselines, validating the approach.

Link: https://arxiv.org/abs/2510.00514
Authors: Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer
Affiliations: ETH Zurich
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), Datasets and Benchmarks Track

Abstract:Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.

[NLP-59] JoyAgent-JDGenie: Technical Report on the GAIA

【Quick Read】: This paper addresses the lack of unified, robust, and adaptive system design when deploying large language models (LLMs) as autonomous agents for complex real-world tasks. The key is a generalist agent architecture built on the system-level integration of three core components: a collective multi-agent framework that combines planning and execution agents with critic-model voting for more reliable decisions; a hierarchical memory system spanning working, semantic, and procedural layers for long-term knowledge storage and dynamic recall; and a refined tool suite covering search, code execution, and multimodal parsing. On a comprehensive benchmark, the framework consistently outperforms open-source baselines and approaches the performance of proprietary systems, underscoring the value of system-level integration for scalable, resilient, and adaptive AI assistants.

Link: https://arxiv.org/abs/2510.00510
Authors: Jiarun Liu, Shiyue Xu, Shangkun Liu, Yang Li, Wen Liu, Min Liu, Xiaoqing Zhou, Hanmin Wang, Shilin Jia, Zhen Wang, Shaohua Tian, Hanhao Li, Junbo Zhang, Yongli Yu, Peng Cao, Haofen Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)

Abstract:Large Language Models are increasingly deployed as autonomous agents for complex real-world tasks, yet existing systems often focus on isolated improvements without a unifying design for robustness and adaptability. We propose a generalist agent architecture that integrates three core components: a collective multi-agent framework combining planning and execution agents with critic model voting, a hierarchical memory system spanning working, semantic, and procedural layers, and a refined tool suite for search, code execution, and multimodal parsing. Evaluated on a comprehensive benchmark, our framework consistently outperforms open-source baselines and approaches the performance of proprietary systems. These results demonstrate the importance of system-level integration and highlight a path toward scalable, resilient, and adaptive AI assistants capable of operating across diverse domains and tasks.

[NLP-60] Copy-Paste to Mitigate Large Language Model Hallucinations

【Quick Read】: This paper targets contextual faithfulness in retrieval-augmented generation (RAG): large language models (LLMs) do not consistently trust the provided context, producing hallucinations that undermine reliability. The core solution is CopyPasteLLM, obtained through two-stage training on high-copying response preferences. Three prompting methods are designed to raise the copying degree of generated responses, and experiments show that higher copying degrees reduce context-unfaithful hallucinations by fostering genuine contextual belief; this enables a fully automated pipeline that turns generated responses into high-copying preference data for training. CopyPasteLLM is remarkably data-efficient, needing only 365 training samples to beat the strongest baselines on FaithEval, ConFiQA, and PubMedQA, with 12.2% to 24.5% accuracy gains on FaithEval. A proposed Context-Parameter Copying Capturing analysis further reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation.

Link: https://arxiv.org/abs/2510.00508
Authors: Yongchao Long, Xian Wu, Yingying Zhang, Xianbin Wen, Yuxi Zhou, Shenda Hong
Affiliations: Tianjin University of Technology; Peking University; Tencent Jarvis Lab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:While Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to generate contextually grounded responses, contextual faithfulness remains challenging as LLMs may not consistently trust provided context, leading to hallucinations that undermine reliability. We observe an inverse correlation between response copying degree and context-unfaithful hallucinations on RAGTruth, suggesting that higher copying degrees reduce hallucinations by fostering genuine contextual belief. We propose CopyPasteLLM, obtained through two-stage high-copying response preference training. We design three prompting methods to enhance copying degree, demonstrating that high-copying responses achieve superior contextual faithfulness and hallucination control. These approaches enable a fully automated pipeline that transforms generated responses into high-copying preference data for training CopyPasteLLM. On FaithEval, ConFiQA and PubMedQA, CopyPasteLLM achieves best performance in both counterfactual and original contexts, remarkably with 12.2% to 24.5% accuracy improvements on FaithEval over the best baseline, while requiring only 365 training samples – 1/50th of baseline data. To elucidate CopyPasteLLM’s effectiveness, we propose the Context-Parameter Copying Capturing algorithm. Interestingly, this reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation. All codes are available at this https URL
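The paper's notion of "copying degree" can be approximated with a simple n-gram coverage measure. The minimum span length and token-level matching below are illustrative assumptions, not the paper's metric.

```python
def copying_degree(response: str, context: str, min_len: int = 5) -> float:
    """Fraction of response tokens covered by verbatim context n-grams of
    length >= min_len (an illustrative proxy for copying degree)."""
    resp, ctx = response.split(), context.split()
    ctx_ngrams = {tuple(ctx[i:i + min_len])
                  for i in range(len(ctx) - min_len + 1)}
    covered = [False] * len(resp)
    for i in range(len(resp) - min_len + 1):
        if tuple(resp[i:i + min_len]) in ctx_ngrams:
            for j in range(i, i + min_len):
                covered[j] = True  # mark every token inside the matched span
    return sum(covered) / max(len(resp), 1)
```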

[NLP-61] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

【Quick Read】: This paper addresses the lack of effective evaluation for agents driven by multimodal large language models (MLLMs) in dynamic environments and diverse tasks: static-dataset benchmarks cannot capture reasoning, collaboration, and tool-use abilities in realistic interactive settings, and existing LLM-based synthetic-data methods target LLM training and evaluation rather than multi-step interactive tasks such as web manipulation. The key is Graph2Eval, a knowledge-graph-based framework for automatically generating both multimodal document-comprehension tasks and web-interaction tasks: knowledge graphs built from multi-source external data serve as the task space, semantic relations are translated into structured multimodal tasks via subgraph sampling, task templates, and meta-paths, and a multi-stage filtering pipeline (node reachability, LLM scoring, similarity analysis) guarantees task quality and executability. The framework supports end-to-end evaluation of single agents, multi-agent systems, and web agents, and is instantiated as Graph2Eval-Bench, a curated set of 1,319 tasks spanning document comprehension and web interaction.

Link: https://arxiv.org/abs/2510.00507
Authors: Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang
Affiliations: Zhejiang University; Xiamen University; The Ohio State University; Ant Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 figures

Abstract:As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents’ reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.

[NLP-62] MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

【Quick Read】: This paper addresses two limits of spoken dialogue systems: cascaded pipelines discard paralinguistic cues and cap expressivity, while existing end-to-end methods still depend on text intermediates, a fundamental bottleneck. The key is MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without text guidance. It combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning ability and knowledge of the pretrained text LLM while adding native speech understanding and generation, thereby enabling expressive, efficient, high-fidelity speech interaction without a text intermediary.

Link: https://arxiv.org/abs/2510.00499
Authors: Xingjian Zhao, Zhe Xu, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Affiliations: Shanghai Innovation Institute; Fudan University; MOSI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.

[NLP-63] Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

【Quick Read】: This paper probes why multimodal agents remain unreliable on complex or out-of-domain tasks in graphical user interfaces (GUIs), asking whether they reason systematically or merely memorize. The key is Agent-ScanKit, a systematic probing framework that quantifies the respective contributions of memorization and reasoning under controlled sensitivity perturbations, without access to model internals, via three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided. Across five public GUI benchmarks and 18 multimodal agents, the results show that mechanical memorization often outweighs systematic reasoning: most models act mainly as retrievers of training-aligned knowledge and generalize poorly, underscoring the need for robust reasoning modeling in real-world multimodal agents.

Link: https://arxiv.org/abs/2510.00496
Authors: Pengzhou Cheng, Lingzhong Dong, Zeng Wu, Zongru Wu, Xiangru Tang, Chengwei Qin, Zhuosheng Zhang, Gongshen Liu
Affiliations: Shanghai Jiao Tong University; Yale University; The Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 10 figures, 7 tables

Abstract:Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interface (GUI), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. In five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.

[NLP-64] Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains

【Quick Read】: This paper addresses the weak autonomous-reasoning performance of agentic large language models (LLMs) in specialized technical microdomains, where conventional few-shot prompting via in-context learning suffers from lengthy inputs and high computational cost. The key is agent fine-tuning: models are trained on JP1-specific datasets derived from domain manuals together with reasoning trajectories distilled by LLMs themselves, so that procedural reasoning and domain knowledge are internalized. At inference, an agentic prompt with retrieval-augmented generation and a context-answer extractor improves information relevance, yielding a 14% improvement over the base model on JP1 certification exam questions and demonstrating the potential of agent fine-tuning for domain-specific reasoning in complex microdomains.

Link: https://arxiv.org/abs/2510.00482
Authors: Yawen Xue, Masaya Tsunokake, Yuta Koreeda, Ekant Muljibhai Amin, Takashi Sumiyoshi, Yasuhiro Sogawa
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by AIxB 2025

Abstract:Agentic large language models (LLMs) have become prominent for autonomously interacting with external environments and performing multi-step reasoning tasks. Most approaches leverage these capabilities via in-context learning with few-shot prompts, but this often results in lengthy inputs and higher computational costs. Agent fine-tuning offers an alternative by enabling LLMs to internalize procedural reasoning and domain-specific knowledge through training on relevant data and demonstration trajectories. While prior studies have focused on general domains, their effectiveness in specialized technical microdomains remains unclear. This paper explores agent fine-tuning for domain adaptation within Hitachi’s JP1 middleware, a microdomain for specialized IT operations. We fine-tuned LLMs using JP1-specific datasets derived from domain manuals and distilled reasoning trajectories generated by LLMs themselves, enhancing decision making accuracy and search efficiency. During inference, we used an agentic prompt with retrieval-augmented generation and introduced a context-answer extractor to improve information relevance. On JP1 certification exam questions, our method achieved a 14% performance improvement over the base model, demonstrating the potential of agent fine-tuning for domain-specific reasoning in complex microdomains.

[NLP-65] Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews EMNLP2025

【Quick Read】: This paper studies an underexplored use of large language models (LLMs) for personalization: Likert-scale rating prediction, a regression task that requires both language understanding and mathematical reasoning, in contrast to the classification and ranking tasks that dominate prior work. The key finding, from experiments with eight models on three datasets, is that user-written reviews supplied as in-context information significantly improve the rating-prediction performance of off-the-shelf LLMs, reaching parity with traditional methods such as matrix factorization and suggesting promise for the cold-start problem. Reviews of concrete items are more effective than general preference descriptions not tied to any specific item, and prompting the LLM to first generate a hypothetical review further boosts rating prediction.

Link: https://arxiv.org/abs/2510.00449
Authors: Koki Ryu, Hitomi Yanaka
Affiliations: The University of Tokyo; Riken
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 PALS Workshop

Abstract:Personalizing the outputs of large language models (LLMs) to align with individual user preferences is an active research area. However, previous studies have mainly focused on classification or ranking tasks and have not considered Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning to be solved effectively. This task has significant industrial applications, but the utilization of LLMs remains underexplored, particularly regarding the capabilities of off-the-shelf LLMs. This study investigates the performance of off-the-shelf LLMs on rating prediction, providing different in-context information. Through comprehensive experiments with eight models across three datasets, we demonstrate that user-written reviews significantly improve the rating prediction performance of LLMs. This result is comparable to traditional methods like matrix factorization, highlighting the potential of LLMs as a promising solution for the cold-start problem. We also find that the reviews for concrete items are more effective than general preference descriptions that are not based on any specific item. Furthermore, we discover that prompting LLMs to first generate a hypothetical review enhances the rating prediction performance. Our code is available at this https URL.

[NLP-66] LongCodeZip: Compress Long Context for Code Language Models

【Quick Read】: This paper addresses the high API costs and generation latency of large language models (LLMs) on long-context code tasks, and the failure of existing context-pruning techniques (e.g., LLMLingua) to account for code-specific structures and dependencies. The key is LongCodeZip, a plug-and-play code-compression framework with a dual-stage strategy: (1) coarse-grained compression ranks function-level chunks by conditional perplexity with respect to the instruction and keeps only the most relevant functions; (2) fine-grained compression segments the retained functions into blocks by perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Across code completion, summarization, and question answering, LongCodeZip achieves up to a 5.6x compression ratio without degrading task performance, enabling LLMs to scale to real-world, large-scale code scenarios.

Link: https://arxiv.org/abs/2510.00446
Authors: Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, Xiaodong Gu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Accepted to ASE 2025. Code available at this https URL

Abstract:Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
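A sketch of the coarse-grained stage under stated assumptions: functions are ranked by conditional perplexity with respect to the instruction and greedily packed under a token budget. The ppl callable and the whitespace token count are stand-ins for the code LLM's actual scoring; the fine-grained stage is omitted.

```python
def select_functions(functions, instruction, token_budget, ppl):
    """Coarse-grained stage of a LongCodeZip-style compressor (sketch).

    functions: list of (name, code) chunks from the repository context.
    ppl(text, condition): conditional perplexity of `text` given `condition`,
    in practice computed with the code LLM itself (supplied by the caller).
    Lower conditional perplexity w.r.t. the instruction = more relevant.
    """
    ranked = sorted(functions, key=lambda fn: ppl(fn[1], condition=instruction))
    kept, used = [], 0
    for name, code in ranked:
        cost = len(code.split())          # crude token-count stand-in
        if used + cost <= token_budget:   # greedy fill under the budget
            kept.append((name, code))
            used += cost
    return kept
```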

[NLP-67] TokMem: Tokenized Procedural Memory for Large Language Models

【Quick Read】: This paper addresses the inefficiency of large language models' heavy reliance on prompts: prompts must be re-read at every step, scale poorly across tasks, and offer no mechanism for modular reuse. The key is TokMem, a tokenized procedural memory that stores recurring procedures as compact, trainable embeddings; each memory token encodes both an address to a procedure and a control signal that steers generation, enabling targeted behavior at constant-size overhead. The backbone model stays frozen, so new procedures can be added continually without interfering with existing ones. On 1,000 atomic-recall tasks and on compositional function-calling tasks, TokMem consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and matches fine-tuning with far fewer parameters, establishing an explicit procedural memory for LLMs.

Link: https://arxiv.org/abs/2510.00444
Authors: Zijun Wu, Yongchang Hao, Lili Mou
Affiliations: University of Alberta; Alberta Machine Intelligence Institute
Subjects: Computation and Language (cs.CL)

Abstract:Large language models rely heavily on prompts to specify tasks, recall knowledge and guide reasoning. However, this reliance is inefficient as prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse. We introduce TokMem, a tokenized procedural memory that stores recurring procedures as compact, trainable embeddings. Each memory token encodes both an address to a procedure and a control signal that steers generation, enabling targeted behavior with constant-size overhead. To support continual adaptation, TokMem keeps the backbone model frozen, allowing new procedures to be added without interfering with existing ones. We evaluate TokMem on 1,000 tasks for atomic recall, and on function-calling tasks for compositional recall, where it consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and fine-tuning with far fewer parameters. These results establish TokMem as a scalable and modular alternative to prompt engineering and fine-tuning, offering an explicit procedural memory for LLMs.
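A minimal sketch of the memory-token idea, assuming a Hugging Face-style causal LM backbone; the number of memory tokens, the initialization, and the single-procedure setup are assumptions.

```python
import torch
import torch.nn as nn

class TokMemPrefix(nn.Module):
    """Trainable memory tokens prepended to a frozen backbone's input
    embeddings (sketch of the TokMem setup for one procedure)."""
    def __init__(self, backbone, n_mem_tokens: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep the LM frozen
            p.requires_grad = False
        d = backbone.config.hidden_size
        self.memory = nn.Parameter(torch.randn(n_mem_tokens, d) * 0.02)

    def forward(self, input_ids):
        tok_emb = self.backbone.get_input_embeddings()(input_ids)
        mem = self.memory.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        inputs_embeds = torch.cat([mem, tok_emb], dim=1)
        return self.backbone(inputs_embeds=inputs_embeds)
```

Only `self.memory` receives gradients, so adding a new procedure means training a new small prefix rather than touching the backbone.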

[NLP-68] Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

【Quick Read】: This paper asks whether evaluation of AI responses to patients' hospitalization-related questions can be automated reliably: the current gold standard, human expert review, is accurate but labor-intensive and slow, limiting scalability, while existing automated metrics are variably aligned with human judgment and often context-dependent. The key is a carefully designed automated evaluation framework anchored on clinician-authored reference answers, used to assess 2,800 responses from 28 AI systems across 100 patient cases along three dimensions: whether the response answers the question, whether it appropriately uses clinical note evidence, and whether it appropriately uses general medical knowledge. Automated rankings closely matched expert ratings, suggesting that well-designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.

Link: https://arxiv.org/abs/2510.00436
Authors: Sarvesh Soni, Dina Demner-Fushman
Affiliations: National Library of Medicine, National Institutes of Health
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating the free-text artificial intelligence (AI) responses–human expert review–is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched expert ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.

[NLP-69] AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

【Quick Read】: This paper addresses a structural limitation of existing sparse autoencoders (SAEs) used to interpret large language models (LLMs): their sparsity-inducing regularizers enforce non-negativity, so a single feature cannot represent a bidirectional concept (e.g., male vs. female), fragmenting semantic axes into separate, redundant features and limiting representational completeness. The key is AbsTopK SAE, a new variant derived from the ℓ₀ sparsity constraint that applies hard thresholding over the largest-magnitude activations, preserving both positive and negative activations and thereby uncovering richer, bidirectional conceptual representations. Across four LLMs and seven probing and steering tasks, AbsTopK improves reconstruction fidelity and interpretability and lets single features encode contrasting concepts, matching or even surpassing the supervised Difference-in-Mean method, which requires labeled data for each concept.

Link: https://arxiv.org/abs/2510.00404
Authors: Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu
Affiliations: The Ohio State University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there remains no principled framework to derive SAEs from the original dictionary learning formulation. In this work, we introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation of existing SAEs: their sparsity-inducing regularizers enforce non-negativity, preventing a single feature from representing bidirectional concepts (e.g., male vs. female). This structural constraint fragments semantic axes into separate, redundant features, limiting representational completeness. To address this issue, we propose AbsTopK SAE, a new variant derived from the ℓ₀ sparsity constraint that applies hard thresholding over the largest-magnitude activations. By preserving both positive and negative activations, AbsTopK uncovers richer, bidirectional conceptual representations. Comprehensive experiments across four LLMs and seven probing and steering tasks show that AbsTopK improves reconstruction fidelity, enhances interpretability, and enables single features to encode contrasting concepts. Remarkably, AbsTopK matches or even surpasses the Difference-in-Mean method, a supervised approach that requires labeled data for each concept and has been shown in prior work to outperform SAEs.
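The AbsTopK activation itself is simple enough to sketch directly; the tensor shapes are the only assumption here.

```python
import torch

def abstopk(z: torch.Tensor, k: int) -> torch.Tensor:
    """AbsTopK activation (sketch): keep the k entries with the largest
    magnitude, preserving their signs, and zero out the rest.
    z: pre-activations of shape (batch, n_features)."""
    idx = z.abs().topk(k, dim=-1).indices
    out = torch.zeros_like(z)
    return out.scatter(-1, idx, z.gather(-1, idx))

# Contrast with TopK SAEs, which apply top-k to non-negative activations
# and so cannot let one feature encode both poles of a concept.
```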

[NLP-70] GDLNN: Marriage of Programming Language and Neural Networks for Accurate and Easy-to-Explain Graph Classification

【Quick Read】: This paper targets the twin problems of limited interpretability and remaining accuracy headroom in graph classification: dominant graph neural networks (GNNs) perform well but are black boxes, which blocks trustworthy explanation of their predictions. The key is GDLNN, a new graph machine learning architecture that couples a domain-specific programming language, called GDL, with neural networks; its GDL layer produces graph representations that are both expressive and interpretable, so existing model-explanation techniques can be applied directly to GDLNN's predictions. Evaluation shows the GDL-based representation achieves high accuracy on most graph classification benchmarks, outperforming dominant methods such as GNNs, while yielding high-quality explanations and keeping overall cost low even when the explanation cost is included.

Link: https://arxiv.org/abs/2510.00374
Authors: Minseok Jeon, Seunghyun Park
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:We present GDLNN, a new graph machine learning architecture, for graph classification tasks. GDLNN combines a domain-specific programming language, called GDL, with neural networks. The main strength of GDLNN lies in its GDL layer, which generates expressive and interpretable graph representations. Since the graph representation is interpretable, existing model explanation techniques can be directly applied to explain GDLNN’s predictions. Our evaluation shows that the GDL-based representation achieves high accuracy on most graph classification benchmark datasets, outperforming dominant graph learning methods such as GNNs. Applying an existing model explanation technique also yields high-quality explanations of GDLNN’s predictions. Furthermore, the cost of GDLNN is low when the explanation cost is included.

[NLP-71] Navigating the Synchrony-Stability Frontier in Adaptive Chatbots

【Quick Read】: This paper makes explicit a design tension in adaptive chatbots that mimic a user's linguistic style: moment-to-moment synchrony builds rapport and engagement, but unconstrained mimicry yields an agent that feels unstable or sycophantic, while too little adaptation weakens engagement. The key is a computational evaluation framework built on an 8-dimensional style vector and a closed-loop "base+delta" prompting architecture, used to simulate and compare explicit adaptation policies (Uncapped, Cap, Exponential Moving Average (EMA), Dead-Band, and hybrids) on human-log data. The analysis maps a clear Pareto frontier: bounded policies buy substantial stability gains at modest synchrony cost, e.g., a Hybrid (EMA+Cap) raises stability from 0.542 to 0.878 (+62%) while reducing synchrony by only 17%, confirmed via large-scale replications on three public corpora and LLM-in-the-loop validation. The framework also quantifies "prompt legibility", showing frontier policies reduce instruction churn and cut jarring register flips from 0.254 to 0.092, yielding systems that are easier to reason about and maintain.

Link: https://arxiv.org/abs/2510.00339
Authors: T. James Brandt
Affiliations: University of Minnesota
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: pages; 9 tables; 7 figures; code analysis artifact: this https URL; under review at ACM IUI 2026

Abstract:Adaptive chatbots that mimic a user’s linguistic style can build rapport and engagement, yet unconstrained mimicry risks an agent that feels unstable or sycophantic. We present a computational evaluation framework that makes the core design tension explicit: balancing moment-to-moment linguistic synchrony against long-term persona stability. Using an 8-dimensional style vector and a closed-loop “base+delta” prompting architecture, we simulate and compare explicit adaptation policies - Uncapped, Cap, Exponential Moving Average (EMA), Dead-Band, and Hybrids - on a human-log dataset. Our analysis maps a clear Pareto frontier: bounded policies achieve substantial gains in stability at a modest cost to synchrony. For example, a Hybrid (EMA+Cap) raises stability from 0.542 to 0.878 (+62%) while reducing synchrony by only 17%. We confirm this trade-off through large-scale replications on three public corpora (DailyDialog, Persona-Chat, EmpatheticDialogues) and LLM-in-the-loop validation across two model families. Furthermore, we quantify “prompt legibility,” showing that frontier policies reduce instruction churn and cut jarring register flips (major tone changes) from 0.254 to 0.092, yielding systems that are easier to reason about and maintain. Taken together, our framework provides a general evaluation harness for style adaptation; a systematic ablation that identifies Pareto-efficient policies; robust validation across diverse datasets and models; and novel legibility metrics linking policy choices to system maintainability.
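The adaptation policies compared in the paper reduce to simple per-turn update rules on the style vector. A minimal sketch follows; the parameter values are illustrative, not the paper's settings.

```python
import numpy as np

def update_style(agent_style: np.ndarray, user_style: np.ndarray,
                 policy: str = "ema_cap", alpha: float = 0.3,
                 cap: float = 0.15, dead_band: float = 0.05) -> np.ndarray:
    """One step of an explicit style-adaptation policy over an
    8-dimensional style vector."""
    delta = user_style - agent_style
    if policy == "uncapped":
        step = delta                          # full mimicry
    elif policy == "cap":
        step = np.clip(delta, -cap, cap)      # bounded per-dimension move
    elif policy == "ema":
        step = alpha * delta                  # exponential moving average
    elif policy == "dead_band":
        step = np.where(np.abs(delta) < dead_band, 0.0, delta)
    elif policy == "ema_cap":                 # the Pareto-efficient hybrid
        step = np.clip(alpha * delta, -cap, cap)
    else:
        raise ValueError(policy)
    return agent_style + step
```

The capped step bounds how far the agent can drift in one turn (stability), while the EMA term still tracks the user over time (synchrony), which is why the hybrid sits on the frontier.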

[NLP-72] CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage

【Quick Read】: This paper addresses alert fatigue in Security Operations Centers (SOCs), which receive tens of thousands of alerts daily of which only a small fraction are genuine attacks; classical detection pipelines are brittle and context-poor, and existing LLM approaches rely on a single model to parse logs, retrieve context, and adjudicate alerts end-to-end, struggling with noisy enterprise data and offering little transparency. The key is CORTEX, a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate over real evidence: a behavior-analysis agent inspects activity sequences, evidence-gathering agents query external systems, and a reasoning agent synthesizes the findings into an auditable decision, substantially reducing false positives and improving investigation quality. The authors also release a dataset of fine-grained SOC investigations from production environments, capturing step-by-step analyst actions and linked tool outputs.

Link: https://arxiv.org/abs/2510.00311
Authors: Bowen Wei, Yuan Shen Tay, Howard Liu, Jinhao Pan, Kun Luo, Ziwei Zhu, Chris Jordan
Affiliations: George Mason University; Fluency Security
Subjects: Computation and Language (cs.CL)

Abstract:Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts, with only a small fraction corresponding to genuine attacks. This overload creates alert fatigue, leading to overlooked threats and analyst burnout. Classical detection pipelines are brittle and context-poor, while recent LLM-based approaches typically rely on a single model to interpret logs, retrieve context, and adjudicate alerts end-to-end – an approach that struggles with noisy enterprise data and offers limited transparency. We propose CORTEX, a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate over real evidence: a behavior-analysis agent inspects activity sequences, evidence-gathering agents query external systems, and a reasoning agent synthesizes findings into an auditable decision. To support training and evaluation, we release a dataset of fine-grained SOC investigations from production environments, capturing step-by-step analyst actions and linked tool outputs. Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over state-of-the-art single-agent LLMs.

[NLP-73] o-MEGA: Optimized Methods for Explanation Generation and Analysis

【Quick Read】: This paper addresses the challenge of choosing the best explainable AI (XAI) method and configuration for transformer-based NLP models, whose transparency and trustworthiness are hard to establish given the extensive array of explanation methods and evaluation metrics, especially for semantic-matching tasks. The key is o-mega, a hyperparameter-optimization tool that automatically searches for the most effective explainability methods and their configurations within the semantic-matching domain, evaluated on a post-claim matching pipeline over a curated dataset of social media posts paired with refuting claims. Such automated optimization of explanation methods can substantially improve the interpretability of claim-matching models in critical applications like misinformation detection, contributing to more trustworthy and transparent AI systems.

Link: https://arxiv.org/abs/2510.00288
Authors: Ľuboš Kriš, Jaroslav Kopčan, Qiwei Peng, Andrej Ridzik, Marcel Veselý, Martin Tamajka
Affiliations: Kempelen Institute of Intelligent Technologies; University of Copenhagen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The proliferation of transformer-based language models has revolutionized NLP domain while simultaneously introduced significant challenges regarding model transparency and trustworthiness. The complexity of achieving explainable systems in this domain is evidenced by the extensive array of explanation methods and evaluation metrics developed by researchers. To address the challenge of selecting optimal explainability approaches, we present o-mega, a hyperparameter optimization tool designed to automatically identify the most effective explainable AI methods and their configurations within the semantic matching domain. We evaluate o-mega on a post-claim matching pipeline using a curated dataset of social media posts paired with refuting claims. Our tool systematically explores different explainable methods and their hyperparameters, demonstrating improved transparency in automated fact-checking systems. As a result, such automated optimization of explanation methods can significantly enhance the interpretability of claim-matching models in critical applications such as misinformation detection, contributing to more trustworthy and transparent AI systems.

[NLP-74] ReEvalMed: Rethinking Medical Report Evaluation by Aligning Metrics with Real-World Clinical Judgment

【Quick Read】: This paper confronts the gap between metric scores and clinician trust in automatically generated radiology reports: reports often score highly on existing metrics yet fail to earn clinical confidence, exposing fundamental flaws in how report quality is measured. The key is a clinically grounded Meta-Evaluation framework that defines criteria spanning clinical alignment and core metric capabilities (discrimination, robustness, and monotonicity), and uses a fine-grained dataset of ground-truth and rewritten report pairs annotated with error types, clinical-significance labels, and explanations to systematically evaluate existing metrics. The analysis reveals their limits in interpreting clinical semantics, such as failing to distinguish clinically significant errors, over-penalizing harmless variations, and lacking consistency across error-severity levels, offering guidance for building more clinically reliable evaluation methods.

Link: https://arxiv.org/abs/2510.00280
Authors: Ruochen Li, Jun Li, Bailiang Jian, Kun Yuan, Youxiang Zhu
Affiliations: Technical University of Munich; Munich Center for Machine Learning; University of Massachusetts Boston; University of Strasbourg
Subjects: Computation and Language (cs.CL)

Abstract:Automatically generated radiology reports often receive high scores from existing evaluation metrics but fail to earn clinicians’ trust. This gap reveals fundamental flaws in how current metrics assess the quality of generated reports. We rethink the design and evaluation of these metrics and propose a clinically grounded Meta-Evaluation framework. We define clinically grounded criteria spanning clinical alignment and key metric capabilities, including discrimination, robustness, and monotonicity. Using a fine-grained dataset of ground truth and rewritten report pairs annotated with error types, clinical significance labels, and explanations, we systematically evaluate existing metrics and reveal their limitations in interpreting clinical semantics, such as failing to distinguish clinically significant errors, over-penalizing harmless variations, and lacking consistency across error severity levels. Our framework offers guidance for building more clinically reliable evaluation methods.

[NLP-75] SafePassage: High-Fidelity Information Extraction with Black Box LLMs

【Quick Read】: This paper tackles the trust problem of black-box large language models (LLMs) in information extraction (IE): extracted information is not guaranteed to be grounded in the document, inviting hallucination. The key is the notion of a "safe passage": LLM-generated context that is both grounded in the document and consistent with the extracted information, operationalized by the three-step SafePassage pipeline: (1) an LLM extractor that produces structured entities and their contexts from a document, (2) a string-based global aligner, and (3) a scoring model. Used together, the three parts reduce hallucinations by up to 85% on IE tasks with minimal risk of flagging non-hallucinations, and high agreement with human judgments of extraction quality means the pipeline doubles as an LLM evaluator. Surprisingly, a transformer encoder fine-tuned on a small number of task-specific examples can outperform an LLM scoring model at flagging unsafe passages, with annotations collectable in as little as 1-2 hours.

Link: https://arxiv.org/abs/2510.00276
Authors: Joe Barrow, Raj Patel, Misha Kharkovski, Ben Davies, Ryan Schmitt
Affiliations: Pattern Data
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Black box large language models (LLMs) make information extraction (IE) easy to configure, but hard to trust. Unlike traditional information extraction pipelines, the information “extracted” is not guaranteed to be grounded in the document. To prevent this, this paper introduces the notion of a “safe passage”: context generated by the LLM that is both grounded in the document and consistent with the extracted information. This is operationalized via a three-step pipeline, SafePassage, which consists of: (1) an LLM extractor that generates structured entities and their contexts from a document, (2) a string-based global aligner, and (3) a scoring model. Results show that using these three parts in conjunction reduces hallucinations by up to 85% on information extraction tasks with minimal risk of flagging non-hallucinations. High agreement between the SafePassage pipeline and human judgments of extraction quality mean that the pipeline can be dually used to evaluate LLMs. Surprisingly, results also show that using a transformer encoder fine-tuned on a small number of task-specific examples can outperform an LLM scoring model at flagging unsafe passages. These annotations can be collected in as little as 1-2 hours.
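A toy version of the grounding half of the "safe passage" check, assuming difflib's longest-match coverage as the alignment score; the paper's string-based global aligner and scoring model are more sophisticated, and the threshold is an assumption.

```python
from difflib import SequenceMatcher

def is_grounded(passage: str, document: str,
                grounding_threshold: float = 0.9) -> bool:
    """Flag a generated context passage as grounded when a sufficiently
    long verbatim-ish span of it aligns to the source document."""
    matcher = SequenceMatcher(None, document, passage, autojunk=False)
    match = matcher.find_longest_match(0, len(document), 0, len(passage))
    coverage = match.size / max(len(passage), 1)
    return coverage >= grounding_threshold
```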

[NLP-76] Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction EMNLP

【Quick Read】: This paper addresses the underwhelming performance of large language models (LLMs) on text classification tasks that hinge on fine-grained semantic distinctions, exemplified by text-revision classification, which requires recognizing subtle edits between text pairs; straightforward fine-tuning would demand large amounts of revision annotations, which are exceptionally expensive and scarce. The key is IR-Tuning, a plug-and-play layer-wise parameter-efficient fine-tuning (PEFT) framework that fine-tunes only a subset of important LLM layers, selected dynamically by their gradient-norm distribution, while freezing the redundant layers. Across diverse text revisions, IR-Tuning surpasses several layer-wise PEFT baselines while converging fast, consuming little GPU memory, and remaining effective on small revision corpora.

Link: https://arxiv.org/abs/2510.00268
Authors: Zhexiong Liu, Diane Litman
Affiliations: University of Pittsburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In The Conference on Empirical Methods in Natural Language Processing (EMNLP), November 2025

Abstract:Large Language Models (LLMs) have shown extraordinary success across various text generation tasks; however, their potential for simple yet essential text classification remains underexplored, as LLM pre-training tends to emphasize generation over classification. While LLMs with instruction tuning can transform classification into a generation task, they often struggle to categorize nuanced texts. One such example is text revision, which involves nuanced edits between pairs of texts. Although simply fine-tuning LLMs for revision classification seems plausible, it requires a large amount of revision annotations, which are exceptionally expensive and scarce in the community. To address this issue, we introduce a plug-and-play layer-wise parameter-efficient fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of important LLM layers that are dynamically selected based on their gradient norm distribution, while freezing those of redundant layers. Extensive experiments suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse text revisions, while achieving fast convergence, low GPU memory consumption, and effectiveness on small revision corpora.
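A sketch of gradient-norm-based layer selection, assuming a transformer whose parameters are named like "model.layers.<idx>.*"; the probe-batch protocol and the top fraction are assumptions, not IR-Tuning's exact procedure.

```python
import torch

def select_layers_by_grad_norm(model, loss, top_frac: float = 0.25):
    """Back-propagate once on a probe batch, rank layers by total gradient
    norm, and unfreeze only the top fraction."""
    loss.backward()
    norms = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            # Group by layer index, assuming "model.layers.<idx>." naming.
            layer = name.split(".")[2] if ".layers." in name else name
            norms[layer] = norms.get(layer, 0.0) + p.grad.norm().item() ** 2
    ranked = sorted(norms, key=norms.get, reverse=True)
    chosen = set(ranked[: max(1, int(len(ranked) * top_frac))])
    for name, p in model.named_parameters():
        layer = name.split(".")[2] if ".layers." in name else name
        p.requires_grad = layer in chosen   # freeze everything else
    model.zero_grad()
    return chosen
```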

[NLP-77] Judging with Confidence: Calibrating Autoraters to Preference Distributions

【Quick Read】: This paper targets a reliability flaw in the LLM "autoraters" used to align large language models (LLMs) with human values: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. The key is a general framework for calibrating probabilistic autoraters to the full preference distribution of a target population, with two learning methods tailored to different data conditions: direct supervised fine-tuning for dense probabilistic labels, and a reinforcement-learning approach for sparse binary labels. Empirically, fine-tuning autoraters with a distribution-matching objective yields verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, while preserving performance on objective tasks.

Link: https://arxiv.org/abs/2510.00263
Authors: Zhuohang Li, Xiaowei Li, Chengyu Huang, Guowang Li, Katayoon Goshvadi, Bo Dai, Dale Schuurmans, Paul Zhou, Hamid Palangi, Yiwen Song, Palash Goyal, Murat Kantarcioglu, Bradley A. Malin, Yuan Xue
Affiliations: Google; Vanderbilt University; Cornell University; Google DeepMind; University of Alberta; Virginia Tech; Scale AI
Subjects: Computation and Language (cs.CL)

Abstract:The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
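For the dense-label case, the distribution-matching objective is plausibly a soft-label cross-entropy (forward KL up to a constant). A minimal sketch, with shapes assumed:

```python
import torch
import torch.nn.functional as F

def distribution_matching_loss(logits: torch.Tensor,
                               target_probs: torch.Tensor) -> torch.Tensor:
    """Fine-tune the autorater so its predicted distribution over
    preference options matches the target population's distribution.

    logits: (batch, n_options) autorater scores for each option.
    target_probs: (batch, n_options) empirical preference fractions.
    """
    log_pred = F.log_softmax(logits, dim=-1)
    # Soft-label cross-entropy; equals KL(target || pred) + const.
    return -(target_probs * log_pred).sum(dim=-1).mean()
```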

[NLP-78] Retrieval-Augmented Generation for Electrocardiogram-Language Models ICASSP2026

【Quick Read】: This paper addresses the lack of reliable knowledge grounding in generative Electrocardiogram-Language Models (ELMs), which produce textual responses conditioned on ECG signals and textual queries: without retrieved knowledge, outputs are prone to hallucination and reduced accuracy and relevance. The key is bringing Retrieval-Augmented Generation (RAG) to ELMs: relevant information from external knowledge bases is retrieved and fused into the generation pipeline so that outputs are better grounded and more domain-adapted. The authors present the first open-source RAG pipeline for ELMs, along with baselines and ablation studies for natural language generation; experiments on three public datasets show ELMs with RAG consistently beat non-RAG baselines and highlight key ELM design considerations.

Link: https://arxiv.org/abs/2510.00261
Authors: Xiaoyu Song, William Han, Tony Chen, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 5 pages, 2 figures; Submitted to ICASSP 2026

Abstract:Interest in generative Electrocardiogram-Language Models (ELMs) is growing, as they can produce textual responses conditioned on ECG signals and textual queries. Unlike traditional classifiers that output label probabilities, ELMs are more versatile, supporting domain-specific tasks (e.g., waveform analysis, diagnosis, prognosis) as well as general tasks (e.g., open-ended questions, dialogue). Retrieval-Augmented Generation (RAG), widely used in Large Language Models (LLMs) to ground LLM outputs in retrieved knowledge, helps reduce hallucinations and improve natural language generation (NLG). However, despite its promise, no open-source implementation or systematic study of RAG pipeline design for ELMs currently exists. To address this gap, we present the first open-source RAG pipeline for ELMs, along with baselines and ablation studies for NLG. Experiments on three public datasets show that ELMs with RAG consistently improves performance over non-RAG baselines and highlights key ELM design considerations. Our code is available at: this https URL.

[NLP-79] TASER: Translation Assessment via Systematic Evaluation and Reasoning

【Quick Read】: This paper addresses the limited accuracy and opacity of automated translation quality assessment (TQA): reference-free metrics still lag in accuracy, and LLM-based judges give little insight into their decisions. The key is TASER (Translation Assessment via Systematic Evaluation and Reasoning), which harnesses the explicit reasoning of Large Reasoning Models (LRMs) to conduct systematic, step-by-step evaluation, guided by structured prompting templates, which prove superior for LRMs compared with the open-ended prompts that were optimal for traditional LLMs. On the WMT24 Metrics Shared Task, TASER attains state-of-the-art system-level accuracy in both reference-based and reference-free settings, stays competitive at the segment level (its reference-free variant ranks top among reference-free approaches), and its explicit reasoning traces provide the interpretability and transparency that existing automated metrics lack.

Link: https://arxiv.org/abs/2510.00255
Authors: Monishwaran Maheswaran, Marco Carini, Christian Federmann, Tony Diaz
Affiliations: University of California, Berkeley; Apple
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.

[NLP-80] BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

【Quick Read】: This paper addresses the inconsistent evaluation of bias-mitigation methods for large language models (LLMs): prior studies use diverse baselines and metrics, blocking fair comparison, and mostly compare LLM probabilities under biased vs. unbiased contexts, ignoring the real-world setting where users read model responses and expect fair, safe outputs. The key is BiasFreeBench, an empirical benchmark that reorganizes existing datasets into a unified query-response setting and comprehensively compares eight mainstream bias-mitigation techniques (four prompting-based and four training-based) on two test scenarios: multi-choice QA and open-ended multi-turn QA. A response-level metric, Bias-Free Score, measures how fair, safe, and anti-stereotypical LLM responses are, and debiasing performance is systematically analyzed across key dimensions: the prompting vs. training paradigm, model size, and the generalization of training strategies to unseen bias types.

Link: https://arxiv.org/abs/2510.00232
Authors: Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley, Zexue He
Affiliations: UC San Diego; Columbia University; Zhejiang University; MIT-IBM Watson Lab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance, leading to inconsistent comparisons among them. Moreover, their evaluations are mostly based on the comparison between LLMs’ probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading model responses and expect fair and safe outputs rather than LLMs’ probabilities. To enable consistent evaluation across debiasing methods and bridge this gap, we introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (covering four prompting-based and four training-based methods) on two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We further introduce a response-level metric, Bias-Free Score, to measure the extent to which LLM responses are fair, safe, and anti-stereotypical. Debiasing performances are systematically compared and analyzed across key dimensions: the prompting vs. training paradigm, model size, and generalization of different training strategies to unseen bias types. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.

[NLP-81] Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space

【Quick Read】: This paper addresses a limitation of current approaches to scaling inference-time compute in transformers: they train the model to emit explicit chain-of-thought tokens before answering, which cannot be applied during pretraining and restricts scaling to serially generated natural-language verbalization. The key is Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams; tokens that need heavy computation form a "bubble" of cloned residuals in the middle of the network for extra thinking, and, crucially, this behavior is learned during pretraining with only the language-modeling loss. Thoughtbubbles outperforms both standard decoder LMs and non-adaptive parallel-computation baselines on OpenWebText and peS2o perplexity and on zero-shot evaluations such as HellaSwag and LAMBADA across 150M to 772M parameter scales, paving the way to unify train- and test-time behavior for reasoning models.

Link: https://arxiv.org/abs/2510.00219
Authors: Houjun Liu, Shikhar Murty, Christopher D. Manning, Róbert Csordás
Affiliations: Stanford University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 10 pages, 6 figures

Abstract:Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a “bubble” of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.

[NLP-82] Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

【Quick Read】: This paper addresses the failure of current large language model (LLM) development to handle personalized reasoning in human-facing, just-in-time scenarios: task-solving and preference alignment are optimized separately, and under cold-start or privacy constraints, with no prior interaction history, models cannot identify what they don't know about user preferences, strategically elicit them through questioning, and adapt their reasoning and responses accordingly. The key is PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically grounded personas with sparse preferences, creating scenarios where identical questions demand different reasoning chains depending on user context while maintaining factual accuracy. Evaluating 21 frontier models on 10 tasks shows 29.0% of naive personalization attempts produce worse preference alignment than generic responses, while generic responses also fail individual users, indicating personalized reasoning requires dedicated development; PREFDISCO thus establishes a measurable research frontier for education, healthcare, and technical domains where personalization is critical.

Link: https://arxiv.org/abs/2510.00177
Authors: Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov
Affiliations: University of Washington; Allen Institute for AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 57 pages, 6 figures

Abstract:Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user’s needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don’t know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly – a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs’ interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.

[NLP-83] PrimeX: A Dataset of Worldview, Opinion, and Explanation EMNLP2025

[Quick Read]: This paper investigates how incorporating information about an individual's belief system can improve a language model's personalized understanding of, and alignment with, its users. The key to the solution is PrimeX, a dataset of public opinion survey data from 858 US residents augmented with two additional sources of belief information: respondents' written explanations for why they hold specific opinions (belief explanations), and the Primal World Belief survey measuring their worldview. The study shows that this additional belief information substantially improves language model personalization on tasks such as opinion prediction, opening new avenues at the intersection of NLP and psychological research.

Link: https://arxiv.org/abs/2510.00174
Authors: Rik Koncel-Kedziorski, Brihi Joshi, Tim Paek
Affiliations: Apple; University of Southern California
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main

Abstract:As the adoption of language models advances, so does the need to better represent individual users to the model. Are there aspects of an individual’s belief system that a language model can utilize for improved alignment? Following prior research, we investigate this question in the domain of opinion prediction by developing PrimeX, a dataset of public opinion survey data from 858 US residents with two additional sources of belief information: written explanations from the respondents for why they hold specific opinions, and the Primal World Belief survey for assessing respondent worldview. We provide an extensive initial analysis of our data and show the value of belief explanations and worldview for personalizing language models. Our results demonstrate how the additional belief information in PrimeX can benefit both the NLP and psychological research communities, opening up avenues for further study.

[NLP-84] DRBench: A Realistic Benchmark for Enterprise Deep Research

[Quick Read]: This paper targets the lack of effective benchmarks for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Existing benchmarks focus on simple question answering or web-only queries and fail to capture realistic multi-step research that must combine the public web with private knowledge bases (emails, chat logs, cloud file systems, and the like). The key contribution is DRBench, a benchmark of 15 deep research tasks across 10 domains (such as Sales, Cybersecurity, and Compliance), grounded in realistic user personas and enterprise context, generated through a human-in-the-loop synthesis pipeline, and evaluated along three axes: insight recall, factual accuracy, and report structure. The benchmark enables quantitative analysis of enterprise-grade deep research capability and provides a clear path for model selection and improvement.

Link: https://arxiv.org/abs/2510.00172
Authors: Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh, Étienne Marcotte, Xing Han Lù, Nicolas Chapados, Spandana Gella, Christopher Pal, Alexandre Drouin, Issam H. Laradji
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, “What changes should we make to our product roadmap to ensure compliance with this standard?”) that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at this https URL.

[NLP-85] TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding

[Quick Read]: This paper tackles procedural activity understanding, i.e., enabling an agent to support humans, without any training, in tasks ranging from everyday activities (such as assembling flat-pack furniture) to professional settings (such as biological experiments) by reasoning jointly over multimodal information (vision and language). The key contribution is TAMA (Tool-Augmented Multimodal Agent), a novel framework that enables training-free interleaved multimodal reasoning through multimedia-returning tools and agentic flexible tool selection, substantially improving the performance of vision-language models such as GPT-5 and MiMo-VL on procedural question answering.

Link: https://arxiv.org/abs/2510.00161
Authors: Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Ken Fukuda, Teruko Mitamura
Affiliations: Carnegie Mellon University; National Institute of Advanced Industrial Science and Technology (AIST)
Subjects: Computation and Language (cs.CL)
Comments: 21 pages. Code: this https URL

Abstract:Procedural activity assistants potentially support humans in a variety of settings, from our daily lives, e.g., cooking or assembling flat-pack furniture, to professional situations, e.g., manufacturing or biological experiments. Despite its potential use cases, the system development tailored for such an assistant is still underexplored. In this paper, we propose a novel framework, called TAMA, a Tool-Augmented Multimodal Agent, for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental result on the multimodal procedural QA dataset, ProMQA-Assembly, shows that our approach can improve the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of two features that characterize our framework, multimedia-returning tools and agentic flexible tool selection. We believe our proposed framework and experimental results facilitate the thinking with images paradigm for video and multimodal tasks, let alone the development of procedural activity assistants.

[NLP-86] Optimizing What Matters: AUC-Driven Learning for Robust Neural Retrieval

[Quick Read]: This paper addresses a core weakness of the Noise Contrastive Estimation (NCE)-based contrastive loss used to train dual-encoder retrievers: it is insensitive to the quality of score separation and unrelated to AUC (Area Under the ROC Curve), the metric that matters for ranking, leading to poor calibration and suboptimal performance in downstream tasks such as retrieval-augmented generation (RAG). The key to the solution is a new training objective, the MW loss, which maximizes the Mann-Whitney U statistic and is therefore mathematically equivalent to directly optimizing AUC; it enforces correct ranking by minimizing binary cross-entropy over the score differences of positive-negative pairs, with a theoretical guarantee that the MW loss directly upper-bounds AUC, bringing the optimization objective closer to the actual retrieval goal. Experiments show that retrievers trained with the MW loss consistently outperform contrastive-loss counterparts on AUC and standard retrieval metrics, yielding better-calibrated and more discriminative retrievers for high-stakes applications such as RAG.

Link: https://arxiv.org/abs/2510.00137
Authors: Nima Sheikholeslami, Erfan Hosseini, Patrice Bechard, Srivatsava Daruru, Sai Rajeswar
Affiliations: ServiceNow
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Dual-encoder retrievers depend on the principle that relevant documents should score higher than irrelevant ones for a given query. Yet the dominant Noise Contrastive Estimation (NCE) objective, which underpins Contrastive Loss, optimizes a softened ranking surrogate that we rigorously prove is fundamentally oblivious to score separation quality and unrelated to AUC. This mismatch leads to poor calibration and suboptimal performance in downstream tasks like retrieval-augmented generation (RAG). To address this fundamental limitation, we introduce the MW loss, a new training objective that maximizes the Mann-Whitney U statistic, which is mathematically equivalent to the Area under the ROC Curve (AUC). MW loss encourages each positive-negative pair to be correctly ranked by minimizing binary cross entropy over score differences. We provide theoretical guarantees that MW loss directly upper-bounds the AUC, better aligning optimization with retrieval goals. We further promote ROC curves and AUC as natural threshold-free diagnostics for evaluating retriever calibration and ranking quality. Empirically, retrievers trained with MW loss consistently outperform contrastive counterparts in AUC and standard retrieval metrics. Our experiments show that MW loss is an empirically superior alternative to Contrastive Loss, yielding better-calibrated and more discriminative retrievers for high-stakes applications like RAG.
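
The pairwise construction described in the abstract (binary cross-entropy over positive-negative score differences) is simple enough to sketch. Below is a minimal PyTorch rendering, assuming scores are raw similarities from a dual-encoder; the function name and toy tensors are illustrative, not taken from the paper:

```python
import torch

def mw_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise surrogate of the Mann-Whitney U statistic (AUC).

    Applies binary cross-entropy to every positive-negative score
    difference, pushing each positive to outrank each negative.
    """
    # All pairwise differences: shape (n_pos, n_neg)
    diff = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)
    # BCE with target 1 on sigmoid(diff) equals softplus(-diff)
    return torch.nn.functional.softplus(-diff).mean()

# Toy usage: positives should score above negatives.
pos = torch.tensor([2.0, 1.5])
neg = torch.tensor([0.3, -0.2, 1.8])
print(mw_loss(pos, neg))  # scalar loss; approaches 0 as separation improves
```

Minimizing this quantity maximizes the fraction of correctly ordered pairs, which is exactly the Mann-Whitney/AUC statistic the paper targets.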

[NLP-87] Direct Token Optimization: A Self-contained Approach to Large Language Model Unlearning

[Quick Read]: This paper studies machine unlearning for large language models (LLMs): removing a model's knowledge of a specific subset of training data (the forget set) without full retraining, while leaving overall utility largely intact. Existing methods typically rely on auxiliary language models, retained training data, or commercial AI services, which is often impractical and can introduce additional privacy risks. The paper proposes a self-contained solution, Direct Token Optimization (DTO), whose key idea is to partition the tokens of a sequence to be unlearned into target tokens, which capture the critical knowledge and are used to optimize the unlearning objective, and non-target tokens, which are used to preserve model utility. This token-level split lets DTO achieve efficient, high-quality unlearning without external resources; experiments show up to a 16.8x improvement in forget quality over recent baselines on several benchmarks while maintaining comparable utility.

Link: https://arxiv.org/abs/2510.00125
Authors: Hong kyu Lee, Ruixuan Liu, Li Xiong
Affiliations: Emory University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Machine unlearning is an emerging technique that removes the influence of a subset of training data (forget set) from a model without full retraining, with applications including privacy protection, content moderation, and model correction. The key challenge lies in ensuring that the model completely forgets the knowledge of the forget set without compromising its overall utility. Existing unlearning methods for large language models (LLMs) often utilize auxiliary language models, retain datasets, or even commercial AI services for effective unlearning and maintaining the model utility. However, dependence on these external resources is often impractical and could potentially introduce additional privacy risks. In this work, we propose direct token optimization (DTO), a novel self-contained unlearning approach for LLMs that directly optimizes the token level objectives and eliminates the need for external resources. Given a sequence to unlearn, we identify two categories of tokens: target tokens, which capture critical knowledge for unlearning, and the remaining non-target tokens, which are crucial for maintaining the model utility. The former are used to optimize the unlearning objective, while the latter serve to preserve the model’s performance. The experimental results show that the proposed DTO achieves up to a 16.8× improvement in forget quality on several benchmark datasets over the latest baselines while maintaining a comparable level of model utility.
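
The target/non-target split lends itself to a compact sketch. The following is a hypothetical PyTorch rendering of a token-level unlearning objective in the spirit of DTO, with gradient ascent on target tokens and standard NLL on the rest; the exact objective and the `alpha` weighting are assumptions, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def dto_style_loss(logits, labels, target_mask, alpha=1.0):
    """Token-level unlearning objective (schematic).

    logits: (seq, vocab); labels: (seq,); target_mask: (seq,) bool,
    True for tokens whose knowledge should be unlearned.
    """
    nll = F.cross_entropy(logits, labels, reduction="none")  # per-token NLL
    forget = -nll[target_mask].mean()    # ascend the loss on target tokens
    retain = nll[~target_mask].mean()    # keep fitting non-target tokens
    return forget + alpha * retain       # alpha is an assumed trade-off knob

# Toy tensors standing in for one unlearning sequence.
logits = torch.randn(6, 100, requires_grad=True)
labels = torch.randint(0, 100, (6,))
mask = torch.tensor([False, True, True, False, False, True])
dto_style_loss(logits, labels, mask).backward()
```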

[NLP-88] ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models NEURIPS

[Quick Read]: This paper addresses the computational inefficiency of large reasoning language models (LRLMs) caused by the "overthinking" phenomenon on complex reasoning tasks, while balancing reasoning quality against inference cost. The key to the solution is Adaptive Reasoning Suppression (ARS), a training-free method that introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds to dynamically identify and suppress redundant reasoning steps. ARS reduces token usage, latency, and energy consumption by up to 53%, 46.1%, and 57.9% respectively across mathematical reasoning benchmarks and model architectures, while maintaining or even improving accuracy under adversarial-free evaluation of reasoning quality.

Link: https://arxiv.org/abs/2510.00071
Authors: Dongqi Zheng
Affiliations: Purdue University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by 39th NeurIPS - Foundations of Reasoning in Language Models

Abstract:Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose Adaptive Reasoning Suppression (ARS), a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves up to 53%, 46.1%, and 57.9% reductions in token usage, latency, and energy respectively, while maintaining or improving accuracy.
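
To make the multi-checkpoint idea concrete, here is a small, self-contained sketch of certainty-gated early stopping with a threshold that relaxes as generation runs on. The thresholds, decay schedule, and window size are invented for illustration; the paper's actual mechanism may differ:

```python
def should_suppress(certainties, step, base_tau=0.95, decay=0.02, window=3):
    """Progressive-threshold early stopping over certainty checkpoints.

    Suppresses further reasoning when the last `window` checkpoint
    certainties all clear a threshold that relaxes with the step count.
    """
    tau = max(0.5, base_tau - decay * step)  # threshold relaxes over time
    recent = certainties[-window:]
    return len(recent) == window and all(c >= tau for c in recent)

# Simulated certainty trace measured at reasoning checkpoints.
trace = []
for step, c in enumerate([0.41, 0.55, 0.72, 0.90, 0.93, 0.95]):
    trace.append(c)
    if should_suppress(trace, step):
        print(f"suppress remaining reasoning at checkpoint {step}")
        break
```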

[NLP-89] Linear Regression in p-adic metric spaces

[Quick Read]: This paper addresses the limitation of traditional machine learning methods on inherently hierarchical data, where Euclidean metrics fail to capture discrete, branching hierarchical relationships. The key to the solution is a theoretical foundation for machine learning in p-adic metric spaces, whose structure naturally respects hierarchy. The main theorem proves that an n-dimensional plane minimizing the p-adic sum of distances to data points must pass through at least n+1 of those points, in sharp contrast to the continuous interpolation of Euclidean regression, highlighting how p-adic metrics better fit discrete hierarchical data. Corollaries extend the result to polynomial fitting and to the roots of difference polynomials, and two natural language processing applications (analyzing hierarchical taxonomies and modeling grammatical morphology) demonstrate the practical value of the theory.

Link: https://arxiv.org/abs/2510.00043
Authors: Gregory D. Baker, Scott McCallum, Dirk Pattinson
Affiliations: Australian National University; Macquarie University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Number Theory (math.NT)
Comments:

Abstract:Many real-world machine learning problems involve inherently hierarchical data, yet traditional approaches rely on Euclidean metrics that fail to capture the discrete, branching nature of hierarchical relationships. We present a theoretical foundation for machine learning in p-adic metric spaces, which naturally respect hierarchical structure. Our main result proves that an n-dimensional plane minimizing the p-adic sum of distances to points in a dataset must pass through at least n + 1 of those points – a striking contrast to Euclidean regression that highlights how p-adic metrics better align with the discrete nature of hierarchical data. As a corollary, a polynomial of degree n constructed to minimise the p-adic sum of residuals will pass through at least n + 1 points. As a further corollary, a polynomial of degree n approximating a higher degree polynomial at a finite number of points will yield a difference polynomial that has distinct rational roots. We demonstrate the practical significance of this result through two applications in natural language processing: analyzing hierarchical taxonomies and modeling grammatical morphology. These results suggest that p-adic metrics may be fundamental to properly handling hierarchical data structures in machine learning. In hierarchical data, interpolation between points often makes less sense than selecting actual observed points as representatives.
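
For readers unfamiliar with p-adic metrics, a short standalone sketch helps: under the 2-adic distance, numbers sharing many factors of 2 are close, which is exactly the branching, hierarchy-like behavior the paper exploits. This snippet computes the p-adic valuation and the induced distance over rationals:

```python
from fractions import Fraction

def vp(x: Fraction, p: int) -> float:
    """p-adic valuation: the exponent of p in x (infinity for x == 0)."""
    if x == 0:
        return float("inf")
    v, num, den = 0, x.numerator, x.denominator
    while num % p == 0:
        num //= p; v += 1
    while den % p == 0:
        den //= p; v -= 1
    return v

def dp(x, y, p=2) -> float:
    """p-adic distance |x - y|_p = p ** (-v_p(x - y))."""
    v = vp(Fraction(x) - Fraction(y), p)
    return 0.0 if v == float("inf") else p ** (-v)

# 32 and 0 are 2-adically close (32 = 2**5), while 33 and 32 are far:
print(dp(32, 0))   # 0.03125
print(dp(33, 32))  # 1.0
```

Closeness here means sharing a deep common branch of divisibility, mirroring membership in the same subtree of a taxonomy.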

[NLP-90] IA aplicada al análisis del conflicto Irán-Israel: Mapeo de discursos en YouTube

[Quick Read]: This paper examines discourse asymmetries and algorithmic bias in digital conversations about international conflicts, focusing on how opinion was distributed on YouTube during the June 2025 Iran-Israel conflict and on the narrative structures behind it. The key to the solution is a mixed-methods design that combines natural language processing and machine learning models (BERT and XLM-RoBERTa) for large-scale comment classification with critical media analysis and manual annotation, integrating statistical robustness with contextual understanding. This reveals narrative actors rendered invisible by mainstream media (such as Iran) and the amplification of particular discourses by recommendation algorithms.

Link: https://arxiv.org/abs/2510.00021
Authors: Alvaro Vallejo Ramírez
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: in Spanish language

Abstract:Purpose. This study analyzes the digital representation of the Iran-Israel conflict that occurred in June 2025, based on 120,000 comments posted on YouTube. It sought to identify discursive positions regarding the actors involved and to examine how media and algorithmic biases shape digital conversations. Methodology. A mixed-methods design with triangulation was adopted. In the quantitative phase, natural language processing techniques and machine learning models (BERT and XLM-RoBERTa) were used to classify comments into ten categories. In the qualitative phase, a critical analysis of media context and ideological narratives was conducted, complemented by manual annotation and supervised training. This strategy enabled the integration of statistical robustness with contextual understanding. Results and conclusions. The findings reveal a clear overrepresentation of pro-Palestinian and anti-United States/Israel discourses, while pro-United States and anti-Palestinian positions were marginal. Iran, usually rendered invisible in global media, emerged as a central actor in the digital conversation during the conflict, suggesting a narrative shift away from previous hegemonic frameworks. Likewise, the results confirm the influence of algorithmic biases in amplifying certain discourses while limiting others. Original contributions. This work combines computational analysis and philosophical critique for the study of digital controversies, providing a methodological framework replicable in geopolitical contexts. It is one of the first Spanish-language studies to map, through artificial intelligence and critical analysis, discourses on an international conflict on YouTube, highlighting asymmetries and narrative disputes that are often overlooked.

[NLP-91] Unpacking Musical Symbolism in Online Communities: Content-Based and Network-Centric Approaches

[Quick Read]: This paper studies how musical symbolism is produced and circulated in online communities, focusing on the relationship between audio features (such as energy and danceability) and the semantic structure of lyrics, and on how both evolve over time. The key to the solution is a reproducible pipeline that combines music information retrieval (MIR) with lightweight network analysis: it (i) quantifies temporal trends in acoustic attributes, (ii) models lexical salience and co-occurrence, and (iii) profiles mood by genre. Drawing on audio descriptors (energy, danceability, loudness, and others) and full lyric transcripts, the analysis reveals systematic mood differences across mainstream genres (R&B has the highest valence, while Latin/Reggaeton is highly danceable yet lower in valence) and suggests a commercial preference for rhythmically engaging works of moderate intensity that sustain collective participation. The integrated framework is robust to modality sparsity and suitable for socially aware recommendation and community-level diffusion studies.

Link: https://arxiv.org/abs/2510.00006
Authors: Kajwan Ziaoddini
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This paper examines how musical symbolism is produced and circulated in online communities by combining content-based music analysis with a lightweight network perspective on lyrics. Using a curated corpus of 275 chart-topping songs enriched with audio descriptors (energy, danceability, loudness, liveness, valence, acousticness, speechiness, popularity) and full lyric transcripts, we build a reproducible pipeline that (i) quantifies temporal trends in sonic attributes, (ii) models lexical salience and co-occurrence, and (iii) profiles mood by genre. We find a decade-long decline in energy (79 → 58) alongside a rise in danceability (59 → 73); valence peaks in 2013 (63) and dips in 2014-2016 (42) before partially recovering. Correlation analysis shows strong coupling of energy with loudness (r = 0.74) and negative associations for acousticness with both energy (r = -0.54) and loudness (r = -0.51); danceability is largely orthogonal to other features (|r| < 0.20). Lyric tokenization (114k tokens) reveals a pronoun-centric lexicon “I/you/me/my” and a dense co-occurrence structure in which interpersonal address anchors mainstream narratives. Mood differs systematically by style: R&B exhibits the highest mean valence (96), followed by K-Pop/Pop (77) and Indie/Pop (70), whereas Latin/Reggaeton is lower (37) despite high danceability. Read through a subcultural identity lens, these patterns suggest the mainstreaming of previously peripheral codes and a commercial preference for relaxed yet rhythmically engaging productions that sustain collective participation without maximal intensity. Methodologically, we contribute an integrated MIR-plus-network workflow spanning summary statistics, correlation structure, lexical co-occurrence matrices, and genre-wise mood profiling that is robust to modality sparsity and suitable for socially aware recommendation or community-level diffusion studies.

[NLP-92] Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting

[Quick Read]: This paper targets the high encoding latency of Transformer encoders operating in block-processing mode for streaming speech recognition. Prior work has largely focused on reducing emission latency on the decoding side, leaving encoder latency underexplored. The key to the solution is Spiralformer, a new encoder that combines layer dropping with early exiting to compute efficiently under small chunk shifts: layers are skipped in a cyclic manner and the set of computed layers is shifted spirally across blocks, so that all layers are eventually computed over the course of block processing at a similar overall cost. Experiments show a 21.6% reduction in averaged token emission delay on Librispeech and 7.0% on CSJ at comparable computational cost and word error rates.

Link: https://arxiv.org/abs/2510.00982
Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted for ASRU 2025

Abstract:For streaming speech recognition, a Transformer-based encoder has been widely used with block processing. Although many studies addressed improving emission latency of transducers, little work has been explored for improving encoding latency of the block processing. We seek to reduce latency by frequently emitting a chunk with a small shift rather than scarce large-chunk emissions, resulting in higher computational costs. To efficiently compute with the small chunk shift, we propose a new encoder, Spiralformer, tailored for block processing by combining layer dropping and early exiting. We skip layer computation in a cyclic manner and shift the computed layer in each block spirally, which completes computation for all the layers over the block processing. Experimentally, we observed that our method achieved 21.6% reduction in the averaged token emission delay in Librispeech, and 7.0% in CSJ, compared with the baseline with similar computational cost and word error rates.
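
The cyclic skip-and-shift pattern can be illustrated in a few lines. This sketch only computes which layer indices run in each block under an assumed 1-in-k schedule; the real encoder interleaves this with early exiting and chunked attention:

```python
def spiral_schedule(num_layers: int, keep_every: int, block: int):
    """Cyclic layer-skipping schedule (illustrative).

    In each block, only layers whose index matches the block's rotating
    offset (mod keep_every) are computed, so every layer is computed
    once every `keep_every` blocks.
    """
    offset = block % keep_every
    return [l for l in range(num_layers) if l % keep_every == offset]

# 12 layers, compute one third of them per block, rotating spirally:
for b in range(4):
    print(f"block {b}: layers {spiral_schedule(12, 3, b)}")
# block 0: [0, 3, 6, 9]; block 1: [1, 4, 7, 10]; block 2: [2, 5, 8, 11]; ...
```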

[NLP-93] QSearchNet: A Quantum Walk Search Framework for Link Prediction

[Quick Read]: This paper addresses link prediction in graphs, a fundamental problem for identifying latent connections in complex systems such as social and biological networks. Classical heuristics capture some local topological features but struggle to optimally integrate local and global structural information or to adapt to complex dependencies. The key to the solution is QSearchNet, a quantum-inspired framework built on discrete-time quantum walk (DTQW) dynamics combined with Grover's amplitude amplification: through quantum reflection and an oracle-like phase-flip operation, it adaptively prioritizes multi-hop dependencies and amplifies structurally relevant paths corresponding to potential connections, improving performance particularly on hard negative samples under realistic evaluation conditions.

Link: https://arxiv.org/abs/2510.00325
Authors: Priyank Dubey
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Computation and Language (cs.CL)
Comments:

Abstract:Link prediction is one of the fundamental problems in graph theory, critical for understanding and forecasting the evolution of complex systems like social and biological networks. While classical heuristics capture certain aspects of graph topology, they often struggle to optimally integrate local and global structural information or adapt to complex dependencies. Quantum computing offers a powerful alternative by leveraging superposition for simultaneous multi-path exploration and interference-driven integration of both local and global graph features. In this work, we introduce QSearchNet, a quantum-inspired framework based on Discrete-Time Quantum Walk (DTQW) dynamics and Grover’s amplitude amplification. QSearchNet simulates a topology-aware quantum evolution to propagate amplitudes across multiple nodes simultaneously. By aligning interference patterns through quantum reflection and oracle-like phase-flip operation, it adaptively prioritizes multi-hop dependencies and amplifies structurally relevant paths corresponding to potential connections. Experiments on diverse real-world networks demonstrate competitive performance, particularly with hard negative samples under realistic evaluation conditions.

[NLP-94] WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities

[Quick Read]: This paper tackles the challenge of cross-modal representation learning for EEG signals in multimodal large language models (MLLMs): EEG simultaneously encodes cognitive processes and intrinsic neural states, creating a modality mismatch in EEG paired data that hinders effective cross-modal learning. The key to the solution is a pivot investigation that uncovers the complementary relationships between these modalities, motivating a mapping of EEG signals and their corresponding modalities into a unified semantic space for generalized interpretation. The authors also build WaveMind-Instruct-338k, the first cross-task EEG dataset for instruction tuning, giving the resulting model robust classification accuracy together with flexible, open-ended conversational ability across four downstream tasks.

Link: https://arxiv.org/abs/2510.00032
Authors: Ziyi Zeng, Zhenyang Cai, Yixi Cai, Xidong Wang, Junying Chen, Rongsheng Wang, Yipeng Liu, Siqi Cai, Benyou Wang, Zhiguo Zhang, Haizhou Li
Affiliations: The Chinese University of Hong Kong, Shenzhen; Harbin Institute of Technology, Shenzhen
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Electroencephalography (EEG) interpretation using multimodal large language models (MLLMs) offers a novel approach for analyzing brain signals. However, the complex nature of brain activity introduces critical challenges: EEG signals simultaneously encode both cognitive processes and intrinsic neural states, creating a mismatch in EEG paired-data modality that hinders effective cross-modal representation learning. Through a pivot investigation, we uncover complementary relationships between these modalities. Leveraging this insight, we propose mapping EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. To fully enable conversational capabilities, we further introduce WaveMind-Instruct-338k, the first cross-task EEG dataset for instruction tuning. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations across four downstream tasks, thereby offering valuable insights for both neuroscience research and the development of general-purpose EEG models.

Computer Vision

[CV-0] IMAGEdit: Let Any Subject Transform

[Quick Read]: This paper addresses the lack of training-free operation, insufficient multimodal conditioning, and mask-boundary entanglement in multi-subject video editing, where the appearances of any number of target subjects must be edited while non-target regions are preserved. The key to the solution is IMAGEdit, a training-free framework with two core modules: a prompt-guided multimodal alignment module that produces robust multimodal conditions and precise mask motion sequences, and a prior-based mask retargeting module that mitigates mask-boundary entanglement in videos. The resulting mask sequences are fed into a pretrained mask-driven video generation model to synthesize the edited video efficiently and at high quality, with strong generalization and compatibility with arbitrary mask-driven video generation models.

Link: https://arxiv.org/abs/2510.01186
Authors: Fei Shen, Weihao Xu, Rui Yan, Dong Zhang, Xiangbo Shu, Jinhui Tang
Affiliations: National University of Singapore; Nanjing University of Science and Technology; Hong Kong University of Science and Technology; Nanjing Forestry University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we present IMAGEdit, a training-free framework for any number of video subject editing that manipulates the appearances of multiple designated subjects while preserving non-target regions, without finetuning or retraining. We achieve this by providing robust multimodal conditioning and precise mask sequences through a prompt-guided multimodal alignment module and a prior-based mask retargeting module. We first leverage large models’ understanding and generation capabilities to produce multimodal information and mask motion sequences for multiple subjects across various types. Then, the obtained prior mask sequences are fed into a pretrained mask-driven video generation model to synthesize the edited video. With strong generalization capability, IMAGEdit remedies insufficient prompt-side multimodal conditioning and overcomes mask boundary entanglement in videos with any number of subjects, thereby significantly expanding the applicability of video editing. More importantly, IMAGEdit is compatible with any mask-driven video generation model, significantly improving overall performance. Extensive experiments on our newly constructed multi-subject benchmark MSVBench verify that IMAGEdit consistently surpasses state-of-the-art methods. Code, models, and datasets are publicly available at this https URL.

[CV-1] EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory

[Quick Read]: This paper addresses the difficulty existing world models have in maintaining spatial consistency over long horizons, especially during extended exploration of complex 3D scenes, where generated video often lacks geometric consistency and visual realism. The key to the solution is EvoWorld, which couples panoramic video generation with an evolving explicit 3D memory: a video generator with fine-grained view control predicts future frames, a feedforward plug-and-play transformer evolves the 3D reconstruction, and future views are synthesized conditioned on geometric reprojections of this memory. Using the evolving reconstruction as explicit spatial guidance, projected onto target viewpoints, supplies rich spatial cues that substantially improve visual realism and geometric consistency, enabling long-horizon, spatially consistent world modeling.

Link: https://arxiv.org/abs/2510.01183
Authors: Jiahao Wang, Luoxin Ye, TaiMing Lu, Junfei Xiao, Jiahan Zhang, Yuxiang Guo, Xijun Liu, Rama Chellappa, Cheng Peng, Alan Yuille, Jieneng Chen
Affiliations: Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at: this https URL

Abstract:Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long-horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine-grained view control, then evolves the scene’s 3D reconstruction using a feedforward plug-and-play transformer, and finally synthesizes futures by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state-of-the-arts that synthesize videos only, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long-range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real-world scenarios, with particular emphasis on loop-closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long-horizon spatially consistent world modeling.

[CV-2] Audio Driven Real-Time Facial Animation for Social Telepresence SIGGRAPH

[Quick Read]: This paper addresses the latency of driving high-fidelity 3D facial avatars in real time in virtual reality (VR), where social interaction demands both low latency and expressive, high-quality facial animation. The core challenge is achieving millisecond-level response (15 ms GPU time) with natural, rich expressions while processing continuous audio streams frame by frame with stable, consistent output. The key to the solution is twofold: an online transformer architecture that removes the dependency on future inputs, enabling true real-time inference, and a distillation pipeline that compresses iterative denoising into a single step, dramatically accelerating diffusion-model inference. The system matches or exceeds offline state-of-the-art animation accuracy while running 100 to 1000 times faster, and extends to multimodal settings such as emotion conditioning and eye cameras on VR headsets.

Link: https://arxiv.org/abs/2510.01176
Authors: Jiye Lee, Chenghui Li, Linh Tran, Shih-En Wei, Jason Saragih, Alexander Richard, Hanbyul Joo, Shaojie Bai
Affiliations: Seoul National University; Meta
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments: SIGGRAPH Asia 2025. Project page: this https URL

Abstract:We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.

[CV-3] EditTrack: Detecting and Attributing AI-assisted Image Editing

[Quick Read]: This paper formulates and studies image-editing detection and attribution: given a base image and a suspicious image, detection determines whether the suspicious image was derived from the base image by an AI editing model, and attribution further identifies the specific editing model responsible. Existing methods only judge whether an image is AI-generated or AI-edited and cannot trace its source image or the editing tool. The key to the solution is EditTrack, which builds on four key observations about the editing process, introduces a novel re-editing strategy, and uses carefully designed similarity metrics to identify both the source image and the editing model. Evaluated on five state-of-the-art editing models across six datasets, EditTrack consistently achieves accurate detection and attribution, significantly outperforming five baselines.

Link: https://arxiv.org/abs/2510.01173
Authors: Zhengyuan Jiang, Yuyang Zhang, Moyang Guo, Neil Zhenqiang Gong
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In this work, we formulate and study the problem of image-editing detection and attribution: given a base image and a suspicious image, detection seeks to determine whether the suspicious image was derived from the base image using an AI editing model, while attribution further identifies the specific editing model responsible. Existing methods for detecting and attributing AI-generated images are insufficient for this problem, as they focus on determining whether an image was AI-generated/edited rather than whether it was edited from a particular base image. To bridge this gap, we propose EditTrack, the first framework for this image-editing detection and attribution problem. Building on four key observations about the editing process, EditTrack introduces a novel re-editing strategy and leverages carefully designed similarity metrics to determine whether a suspicious image originates from a base image and, if so, by which model. We evaluate EditTrack on five state-of-the-art editing models across six datasets, demonstrating that it consistently achieves accurate detection and attribution, significantly outperforming five baselines.

[CV-4] Strategic Fusion of Vision Language Models: Shapley-Credited Context-Aware Dawid-Skene for Multi-Label Tasks in Autonomous Driving

[Quick Read]: This paper addresses the reliability problems hallucination causes when large vision-language models (VLMs) are deployed in autonomous-vehicle (AV) stacks, particularly for multi-label understanding in safety-critical decision pipelines. The key to the solution is a game-theoretic fusion method, Shapley-credited Context-Aware Dawid-Skene with Agreement, which (1) learns each model's per-label, context-conditioned reliability from labelled history; (2) at inference, converts each model's report into an agreement-guardrailed log-likelihood ratio combined with a contextual prior and a public reputation state updated via Shapley-based team credit; and (3) produces calibrated, thresholdable posteriors that amplify agreement among reliable models, preserve uniquely correct single-model signals, and adapt to drift.

Link: https://arxiv.org/abs/2510.01126
Authors: Yuxiang Feng, Keyang Zhang, Hassane Ouchouid, Ashwil Kaniamparambil, Ioannis Souflas, Panagiotis Angeloudis
Affiliations: Imperial College London; ELM Europe
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages

Abstract:Large vision-language models (VLMs) are increasingly used in autonomous-vehicle (AV) stacks, but hallucination limits their reliability in safety-critical pipelines. We present Shapley-credited Context-Aware Dawid-Skene with Agreement, a game-theoretic fusion method for multi-label understanding of ego-view dashcam video. It learns per-model, per-label, context-conditioned reliabilities from labelled history and, at inference, converts each model’s report into an agreement-guardrailed log-likelihood ratio that is combined with a contextual prior and a public reputation state updated via Shapley-based team credit. The result is calibrated, thresholdable posteriors that (i) amplify agreement among reliable models, (ii) preserve uniquely correct single-model signals, and (iii) adapt to drift. To specialise general VLMs, we curate 1,000 real-world dashcam clips with structured annotations (scene description, manoeuvre recommendation, rationale) via an automatic pipeline that fuses HDD ground truth, vehicle kinematics, and YOLOv11 + BoT-SORT tracking, guided by a three-step chain-of-thought prompt; three heterogeneous VLMs are then fine-tuned with LoRA. We evaluate with Hamming distance, Micro-/Macro-F1, and average per-video latency. Empirically, the proposed method achieves a 23% reduction in Hamming distance, a 55% improvement in Macro-F1, and a 47% improvement in Micro-F1 compared with the best single model, supporting VLM fusion as a calibrated, interpretable, and robust decision-support component for AV pipelines.
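
The log-likelihood-ratio fusion step is easy to illustrate for a single binary label. The sketch below is a naive Dawid-Skene-style combiner, assuming per-model sensitivity/specificity learned from history and scalar reputation weights; the agreement guardrail and Shapley-based credit updates of the actual method are omitted:

```python
import math

def fuse_reports(reports, sensitivities, specificities, prior, weights):
    """Naive Dawid-Skene-style fusion for one binary label.

    Each model's vote becomes a log-likelihood ratio derived from its
    learned sensitivity/specificity, scaled by a reputation weight, and
    added to the contextual prior log-odds.
    """
    log_odds = math.log(prior / (1 - prior))
    for vote, se, sp, w in zip(reports, sensitivities, specificities, weights):
        if vote:  # model reports "label present"
            llr = math.log(se / (1 - sp))
        else:     # model reports "label absent"
            llr = math.log((1 - se) / sp)
        log_odds += w * llr
    return 1 / (1 + math.exp(-log_odds))  # calibrated, thresholdable posterior

# Three models; a reliable dissenter stays influential via its weight.
post = fuse_reports([1, 1, 0], [0.9, 0.8, 0.95], [0.85, 0.7, 0.99],
                    prior=0.2, weights=[1.0, 0.8, 1.2])
print(f"posterior = {post:.3f}")
```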

[CV-5] Instant4D: 4D Gaussian Splatting in Minutes NEURIPS25

[Quick Read]: This paper addresses the inefficiency of reconstructing 4D scenes from uncalibrated, casually captured video, where traditional methods suffer from slow optimization and complex parameter estimation. The key is Instant4D, a system that exploits a native 4D representation for fast monocular reconstruction: geometric recovery via deep visual SLAM is followed by grid pruning to optimize the scene representation, cutting the model to under 10% of its original size while preserving geometric integrity. A streamlined 4D Gaussian representation then yields a 30x speed-up, bringing training time under two minutes while remaining competitive across benchmarks; a single video (e.g., from the Dycheck dataset, or a typical 200-frame clip) is reconstructed within 10 minutes, and the method generalizes to in-the-wild videos.

Link: https://arxiv.org/abs/2510.01119
Authors: Zhanpeng Luo, Haoxi Ran, Li Lu
Affiliations: University of Pittsburgh; Carnegie Mellon University; Sichuan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 25

Abstract:Dynamic view synthesis has seen significant advances, yet reconstructing scenes from uncalibrated, casual video remains challenging due to slow optimization and complex parameter estimation. In this work, we present Instant4D, a monocular reconstruction system that leverages native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors. Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30x speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. Our method reconstructs a single video within 10 minutes on the Dycheck dataset or for a typical 200-frame video. We further apply our model to in-the-wild videos, showcasing its generalizability. Our project website is published at this https URL.
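
Grid pruning in the voxel-downsampling sense can be sketched in a few NumPy lines; this is a generic illustration of the idea (keep one representative per occupied cell), not Instant4D's exact pruning criterion:

```python
import numpy as np

def grid_prune(points: np.ndarray, voxel: float) -> np.ndarray:
    """Keep one representative point per occupied voxel.

    Quantizes points to a voxel grid and drops duplicates, shrinking a
    redundant reconstruction while preserving its coarse geometry.
    """
    keys = np.floor(points / voxel).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(keep)]

# Dense noisy cloud -> much sparser pruned representation.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(100_000, 3))
pruned = grid_prune(cloud, voxel=0.25)
print(len(cloud), "->", len(pruned))
```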

[CV-6] ReSWD: ReSTIR'd, not shaken. Combining Reservoir Sampling and Sliced Wasserstein Distance for Variance Reduction

[Quick Read]: This paper addresses the high variance of the Monte Carlo estimator of the Sliced Wasserstein Distance (SWD) in high-dimensional distribution-matching tasks, which produces noisy gradients and slow convergence. The key to the solution is Reservoir SWD (ReSWD), which integrates weighted reservoir sampling into SWD to adaptively retain informative projection directions across optimization steps, significantly reducing gradient variance while remaining unbiased and thereby improving optimization stability and efficiency on synthetic benchmarks and real-world tasks such as color correction and diffusion guidance.

Link: https://arxiv.org/abs/2510.01061
Authors: Mark Boss, Andreas Engelhardt, Simon Donné, Varun Jampani
Affiliations: Stability AI; University of Tübingen
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Distribution matching is central to many vision and graphics tasks, where the widely used Wasserstein distance is too costly to compute for high dimensional distributions. The Sliced Wasserstein Distance (SWD) offers a scalable alternative, yet its Monte Carlo estimator suffers from high variance, resulting in noisy gradients and slow convergence. We introduce Reservoir SWD (ReSWD), which integrates Weighted Reservoir Sampling into SWD to adaptively retain informative projection directions in optimization steps, resulting in stable gradients while remaining unbiased. Experiments on synthetic benchmarks and real-world tasks such as color correction and diffusion guidance show that ReSWD consistently outperforms standard SWD and other variance reduction baselines. Project page: this https URL
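
As background, the vanilla Monte Carlo SWD estimator that ReSWD improves on looks like this; the variance across the random directions `theta` is what the reservoir of informative projections is meant to reduce. A minimal PyTorch sketch, assuming equal-sized point sets:

```python
import torch

def sliced_wasserstein(x, y, n_proj=128, p=2):
    """Monte Carlo Sliced Wasserstein distance between point sets.

    Projects both sets onto random unit directions; each 1-D Wasserstein
    distance reduces to comparing sorted projections.
    """
    d = x.shape[1]
    theta = torch.randn(n_proj, d)
    theta = theta / theta.norm(dim=1, keepdim=True)  # unit directions
    px = torch.sort(x @ theta.T, dim=0).values       # (n, n_proj)
    py = torch.sort(y @ theta.T, dim=0).values
    return ((px - py).abs() ** p).mean() ** (1 / p)

x, y = torch.randn(512, 3), torch.randn(512, 3) + 0.5
print(sliced_wasserstein(x, y))
```

ReSWD's contribution sits on top of this: instead of drawing fresh `theta` every step, a weighted reservoir carries forward directions that proved informative, keeping the estimator unbiased while stabilizing its gradients.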

[CV-7] KeySG: Hierarchical Keyframe-Based 3D Scene Graphs

[Quick Read]: This paper addresses two problems in current 3D scene graph construction: semantics restricted to a predefined set of relationships, and serializations of large environments that easily exceed an LLM's context window. The key to the solution is KeySG, which represents a 3D scene as a hierarchical graph of floors, rooms, objects, and functional elements, augmenting nodes with multimodal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes let the system leverage VLMs efficiently and avoid explicitly modeling relationship edges between objects, enabling more general, task-agnostic reasoning; a hierarchical retrieval-augmented generation (RAG) pipeline then extracts relevant context from the graph, handling complex, ambiguous queries while mitigating the scalability issues of large scene graphs.

Link: https://arxiv.org/abs/2510.01049
Authors: Abdelrhman Werby, Dennis Rotondi, Fabio Scaparro, Kai O. Arras
Affiliations: Socially Intelligent Robotics Lab, Institute for Artificial Intelligence, University of Stuttgart, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM’s context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLM to extract scene information, alleviating the need to explicitly model relationship edges between objects, enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across four distinct benchmarks --including 3D object segmentation and complex query retrieval-- KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.

[CV-8] Activation-Deactivation: A General Framework for Robust Post-hoc Explainable AI

[Quick Read]: This paper targets a drawback of black-box explainability methods for image classifiers: they rely on mutants created by occluding parts of the input, producing out-of-distribution images that cast doubt on explanation quality, and choosing an appropriate occlusion value often requires domain knowledge. The key to the solution is a novel forward-pass paradigm, Activation-Deactivation (AD), which removes the effect of occluded input features from the model's decision-making by switching off the parts of the model corresponding to the occlusions, rather than modifying the input. The paper introduces ConvAD, a drop-in mechanism that adds the AD paradigm to any trained convolutional neural network (CNN) without additional training or fine-tuning, together with a proof that it does not change the network's decision process; experiments across datasets and architectures show AD explanations are up to 62.5% more robust than occlusion-based ones.

Link: https://arxiv.org/abs/2510.01038
Authors: Akchunya Chanchal, David A. Kelly, Hana Chockler
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint: Under Review

Abstract:Black-box explainability methods are popular tools for explaining the decisions of image classifiers. A major drawback of these tools is their reliance on mutants obtained by occluding parts of the input, leading to out-of-distribution images. This raises doubts about the quality of the explanations. Moreover, choosing an appropriate occlusion value often requires domain knowledge. In this paper we introduce a novel forward-pass paradigm Activation-Deactivation (AD), which removes the effects of occluded input features from the model’s decision-making by switching off the parts of the model that correspond to the occlusions. We introduce ConvAD, a drop-in mechanism that can be easily added to any trained Convolutional Neural Network (CNN), and which implements the AD paradigm. This leads to more robust explanations without any additional training or fine-tuning. We prove that the ConvAD mechanism does not change the decision-making process of the network. We provide experimental evaluation across several datasets and model architectures. We compare the quality of AD-explanations with explanations achieved using a set of masking values, using the proxies of robustness, size, and confidence drop-off. We observe a consistent improvement in robustness of AD explanations (up to 62.5%) compared to explanations obtained with occlusions, demonstrating that ConvAD extracts more robust explanations without the need for domain knowledge.

[CV-9] Secure and reversible face anonymization with diffusion models

[Quick Read]: This paper addresses the privacy and security risks of facial images processed by computer vision algorithms, where existing anonymization methods struggle to balance high-quality image generation with reversibility for later authentication. The key to the solution is a secure, reversible face anonymization method based on a diffusion model: a secret key is combined with the latent face representation of the diffusion model, generation is constrained by a facial mask to preserve identity-irrelevant features and maintain high image quality, and a deterministic forward and backward diffusion process ensures that only parties holding the correct key can exactly recover the original face. The anonymized faces are also less visually similar to the originals than those produced by prior work.

Link: https://arxiv.org/abs/2510.01031
Authors: Pol Labarbarie, Vincent Itier, William Puech
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Face images processed by computer vision algorithms contain sensitive personal information that malicious actors can capture without consent. These privacy and security risks highlight the need for effective face anonymization methods. Current methods struggle to propose a good trade-off between a secure scheme with high-quality image generation and reversibility for later person authentication. Diffusion-based approaches produce high-quality anonymized images but lack the secret key mechanism to ensure that only authorized parties can reverse the process. In this paper, we introduce, to our knowledge, the first secure, high-quality reversible anonymization method based on a diffusion model. We propose to combine the secret key with the latent faces representation of the diffusion model. To preserve identity-irrelevant features, generation is constrained by a facial mask, maintaining high-quality images. By using a deterministic forward and backward diffusion process, our approach enforces that the original face can be recovered with the correct secret key. We also show that the proposed method produces anonymized faces that are less visually similar to the original faces, compared to other previous work.

[CV-10] Towards Adversarial Training under Hyperspectral Images

[Quick Read]: This paper addresses the vulnerability of deep hyperspectral image classifiers to adversarial attacks, which poses serious security risks. Existing robustness methods mostly rely on customized network architectures that scale poorly and defend weakly against strong attacks. The key is to bring adversarial training, widely regarded as one of the most effective defenses, to the hyperspectral domain and to propose AT-RA, a novel hyperspectral adversarial training method whose core is data augmentation that increases the diversity of spectral information while ensuring spatial smoothness, thereby preserving and correcting the spectral semantics that adversarial noise would otherwise distort or erase. AT-RA improves adversarial robustness by 21.34% against AutoAttack and 18.78% against PGD-50 while boosting benign accuracy by 2.68%.

Link: https://arxiv.org/abs/2510.01014
Authors: Weihua Zhang, Chengze Jiang, Jie Gui, Lu Dong
Affiliations: Southeast University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent studies have revealed that hyperspectral classification models based on deep learning are highly vulnerable to adversarial attacks, which pose significant security risks. Although several approaches have attempted to enhance adversarial robustness by modifying network architectures, these methods often rely on customized designs that limit scalability and fail to defend effectively against strong attacks. To address these challenges, we introduce adversarial training to the hyperspectral domain, which is widely regarded as one of the most effective defenses against adversarial attacks. Through extensive empirical analyses, we demonstrate that while adversarial training does enhance robustness across various models and datasets, hyperspectral data introduces unique challenges not seen in RGB images. Specifically, we find that adversarial noise and the non-smooth nature of adversarial examples can distort or eliminate important spectral semantic information. To mitigate this issue, we employ data augmentation techniques and propose a novel hyperspectral adversarial training method, termed AT-RA. By increasing the diversity of spectral information and ensuring spatial smoothness, AT-RA preserves and corrects spectral semantics in hyperspectral images. Experimental results show that AT-RA improves adversarial robustness by 21.34% against AutoAttack and 18.78% against PGD-50 while boosting benign accuracy by 2.68%.
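
Adversarial training pairs an inner attack (here, standard L-infinity PGD, one of the attacks the paper evaluates against) with an outer update on the attacked batch. A generic PyTorch sketch on a toy 32-band input follows; AT-RA's spectral-diversity augmentations are not reproduced here, and data-range clamping is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard L-infinity PGD: the inner maximization of adversarial training."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Signed ascent step, projected back into the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

# One adversarial training step on a toy "hyperspectral" batch (32 bands).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 8 * 8, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(4, 32, 8, 8), torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
opt.zero_grad()
F.cross_entropy(model(x_adv), y).backward()  # outer minimization
opt.step()
```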

[CV-11] ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

[Quick Read]: This paper addresses the lack of multi-dimensional, interpretable feedback in evaluating text-to-image (T2I) generation: existing methods score image quality with a single scalar, which limits diagnostic value and weakens reinforcement-learning-based preference alignment. The key is ImageDoctor, a unified framework that evaluates T2I outputs along four complementary dimensions (plausibility, semantic alignment, aesthetics, and overall quality) and localizes flaws with pixel-level heatmaps that highlight misaligned or implausible regions for fine-grained feedback. Inspired by the diagnostic process, a "look-think-predict" paradigm improves detail sensitivity and reasoning: the model first localizes potential flaws, then generates reasoning, and finally concludes with quantitative scores. Built on a vision-language model and trained with supervised fine-tuning plus reinforcement learning, ImageDoctor aligns strongly with human preference across datasets and, used as a dense reward for preference tuning, improves generation quality by 10% over scalar reward models.

Link: https://arxiv.org/abs/2510.01010
Authors: Yuxiang Guo, Jiang Liu, Ze Wang, Hao Chen, Ximeng Sun, Yang Zhao, Jialian Wu, Xiaodong Yu, Zicheng Liu, Emad Barsoum
Affiliations: Johns Hopkins University; AMD
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a “look-think-predict” paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality – achieving an improvement of 10% over scalar-based reward models.

[CV-12] POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

[Quick Read]: This paper addresses the limited long-video capacity of large vision-language models (LVLMs) for video question answering (VQA), where a context window of 1500+ frames still only covers about 50 seconds of footage without losing significant information. The key is POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via pooling variants such as motion blur and weighted averaging) and then aligns LVLMs with lightweight supervision: QWEN-2.5-VL 7B is fine-tuned with SFT and DPO on two-turn targets (reasoning plus final answer) over the authors' new ReasonVQA dataset of 12 movies and 239 human-annotated question-answer pairs. This dramatically improves performance over pooled baselines (F1 from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, ROUGE-L from 0.196 to 0.528) and the gains persist regardless of the pooling scheme used at train or test time, indicating robust summarization of temporal evidence.

Link: https://arxiv.org/abs/2510.01009
Authors: Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi
Affiliations: University of Southern Mississippi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since the Flamingo was introduced by Deepmind. Recent advancements in large context/long video question answering have allowed VQA tasks to have context window of 1500+ frames. However, this only leads to 50 seconds of video footage without losing any significant information. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then align LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential and Ramp pooling and fine-tune QWEN-2.5-VL 7B with supervised two turn target including reasoning and final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA consisting of 12 movies with 239 human annotated question-answer with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation of SFT + DPO on various pooling functions show that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness on summarization of temporal evidence. Similar observations were made on zero-shot in TVQA.
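
The per-second pooling variants named in the abstract (weighted average, exponential, ramp) reduce to choosing a weight vector over the frames of each second. A small NumPy sketch, with weight shapes chosen for illustration rather than taken from the paper:

```python
import numpy as np

def pool_second(frames: np.ndarray, mode: str = "exponential") -> np.ndarray:
    """Collapse one second of video (T, H, W, C) into a single image."""
    t = frames.shape[0]
    if mode == "average":
        w = np.ones(t)
    elif mode == "ramp":                 # later frames weigh more, linearly
        w = np.linspace(0.1, 1.0, t)
    elif mode == "exponential":          # later frames weigh more, sharply
        w = np.exp(np.linspace(-2.0, 0.0, t))
    else:
        raise ValueError(mode)
    w = w / w.sum()
    # Weighted sum over the time axis -> one (H, W, C) image.
    return np.tensordot(w, frames.astype(np.float32), axes=1)

clip = np.random.randint(0, 256, (30, 64, 64, 3), dtype=np.uint8)  # 30 fps
pooled = pool_second(clip, "exponential")
print(pooled.shape)  # (64, 64, 3): one motion-summarizing frame per second
```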

[CV-13] TextCAM: Explaining Class Activation Map with Text

[Quick Read]: This paper addresses the trust deficit of deep vision models in high-stakes applications due to limited interpretability, in particular the inability of Class Activation Mapping (CAM) methods to provide semantic-level insight into what attributes underlie the highlighted activations. The key is TextCAM, a framework that fuses CAM's precise spatial localization with the semantic alignment of vision-language models (VLMs): channel-level semantic representations are derived from CLIP embeddings and linear discriminant analysis, then aggregated with CAM weights to produce textual descriptions that carry both spatial location and visual-attribute information, yielding explanations that state both where the model attends and what evidence supports its decision.

Link: https://arxiv.org/abs/2510.01004
Authors: Qiming Zhao, Xingjian Li, Xiaoyu Cao, Xiaolong Wu, Min Xu
Affiliations: Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Deep neural networks (DNNs) have achieved remarkable success across domains but remain difficult to interpret, limiting their trustworthiness in high-stakes applications. This paper focuses on deep vision models, for which a dominant line of explainability methods are Class Activation Mapping (CAM) and its variants working by highlighting spatial regions that drive predictions. We figure out that CAM provides little semantic insight into what attributes underlie these activations. To address this limitation, we propose TextCAM, a novel explanation framework that enriches CAM with natural languages. TextCAM combines the precise spatial localization of CAM with the semantic alignment of vision-language models (VLMs). Specifically, we derive channel-level semantic representations using CLIP embeddings and linear discriminant analysis, and aggregate them with CAM weights to produce textual descriptions of salient visual evidence. This yields explanations that jointly specify where the model attends and what visual attributes likely support its decision. We further extend TextCAM to group feature channels into semantically coherent clusters, enabling more fine-grained visual-textual explanations. Experiments on ImageNet, CLEVR, and CUB demonstrate that TextCAM produces faithful and interpretable rationales that improve human understanding, detect spurious correlations, and preserve model fidelity.

[CV-14] SoftCFG: Uncertainty-guided Stable Guidance for Visual autoregressive Model

[Quick Read]: This paper tackles two core problems of Classifier-Free Guidance (CFG) in autoregressive (AR) image generation: guidance diminishing, where the gap between conditional and unconditional outputs vanishes as decoding progresses, and over-guidance, where strong conditions break visual coherence. The key is SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence so that each generated token contributes certainty-weighted guidance, keeping the guidance signal effective throughout generation and resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, Step Normalization bounds SoftCFG's cumulative perturbations; the method is training-free, model-agnostic, and integrates seamlessly into existing AR pipelines.

Link: https://arxiv.org/abs/2510.00996
Authors: Dongli Xu, Aleksei Tiulpin, Matthew B. Blaschko
Affiliations: KU Leuven; University of Oulu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint

Abstract:Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256 among autoregressive models.
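
Schematically, SoftCFG replaces the fixed CFG offset with a certainty-weighted, norm-bounded one. The snippet below is a loose stand-in, assuming a scalar confidence per generated context token and a global norm cap standing in for Step Normalization; it is not the paper's exact update rule:

```python
import torch

def softcfg_logits(cond, uncond, token_conf, scale=3.0, max_norm=10.0):
    """Schematic certainty-weighted CFG step for one decoding position.

    cond/uncond: (vocab,) logits for the next token; token_conf: confidences
    of already-generated context tokens in [0, 1]. Guidance strength scales
    with mean context certainty, and its magnitude is bounded so that noisy
    contexts inject weaker perturbations.
    """
    g = scale * token_conf.mean() * (cond - uncond)  # certainty-weighted guidance
    norm = g.norm()
    if norm > max_norm:                              # bound the perturbation
        g = g * (max_norm / norm)
    return uncond + g

cond, uncond = torch.randn(1000), torch.randn(1000)
conf = torch.tensor([0.9, 0.4, 0.7])  # certainties of generated tokens
next_logits = softcfg_logits(cond, uncond, conf)
print(next_logits.shape)
```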

[CV-15] Visual Self-Refinement for Autoregressive Models EMNLP2025

[Quick Read]: This paper addresses the mismatch between the spatial nature of image signals and the sequential dependencies of next-token prediction in autoregressive vision-language modeling, which limits generation quality. The key is a plug-and-play refinement module applied as a post-pretraining step that jointly refines all visual tokens generated by the autoregressive model, exploiting global context and cross-token relationships to mitigate error accumulation during sequential generation and thereby improve vision-language alignment and semantic consistency under a shared sequential prediction framework.

Link: https://arxiv.org/abs/2510.00993
Authors: Jiamian Wang, Ziqi Zhou, Chaithanya Kumar Mummadi, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Chen Qiu, Zhiqiang Tao
Affiliations: Rochester Institute of Technology; Bosch Research; DEVCOM Army Research Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP2025

Abstract:Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the complex spatial correspondence modeling within the generated visual sequence. This module operates as a post-pretraining step to jointly refine all generated tokens of autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationship across the tokens, our method mitigates the error accumulation issue within the sequential generation. Experiments demonstrate that the proposed method improves the generation quality, enhancing the model’s ability to produce semantically consistent results.

[CV-16] A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features

[Quick Read]: This paper addresses the inefficiency of map building and relocalization in visual localization (estimating a camera pose), where even starting from mapping images with known poses, state-of-the-art approaches still need hours of map preparation in the worst case and several minutes in the best. The key is FastForward, which represents multiple mapping images as a collection of features anchored in 3D space and uses them to predict image-to-scene correspondences for a query image in a single feed-forward pass, enabling on-the-fly camera pose estimation. Coupled with image retrieval, the design achieves state-of-the-art accuracy with minimal map preparation time and generalizes robustly to unseen domains, including challenging large-scale outdoor environments.

Link: https://arxiv.org/abs/2510.00978
Authors: Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences towards the practicability of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work raises the question whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At the core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward utilizes these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.
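The learned part of FastForward is the feed-forward correspondence predictor; turning predicted image-to-scene matches into a camera pose is a standard PnP-with-RANSAC step. A sketch with placeholder correspondences and assumed intrinsics (the arrays below are illustrative, not the paper's data):

```python
import numpy as np
import cv2

pts_3d = np.random.rand(100, 3)               # predicted scene points (placeholder)
pts_2d = np.random.rand(100, 2) * 640         # matching query-image pixels (placeholder)
K = np.array([[500.0, 0.0, 320.0],            # assumed pinhole intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Robustly estimate the camera pose from 2D-3D correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, distCoeffs=None)
if ok:
    R, _ = cv2.Rodrigues(rvec)                # world-to-camera rotation matrix
```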

[CV-17] JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

【Quick Read】: This paper addresses how to effectively fuse textual and visual tokens to improve generation quality in token-based text-to-image (T2I) models, where existing self-supervised token-centric architectures often suffer from insufficient text-image alignment and weak conditioning. The key of the proposed unified multimodal framework JEPA-T is to encode images and captions into discrete visual and textual tokens processed by a joint-embedding predictive Transformer, to add cross-attention after the feature predictor for conditional denoising, and to inject raw text embeddings into the flow-matching loss during training to strengthen cross-modal alignment. This design balances a task-agnostic backbone with strong conditioning, significantly improving text-guided image generation while preserving generative generality.

Link: https://arxiv.org/abs/2510.00974
Authors: Siheng Wan,Zhengtao Yao,Zhengdao Li,Junhao Dong,Yanshu Li,Yikai Li,Linshan Li,Haoyan Xu,Yijiang Li,Zhikang Dong,Huacan Wang,Jifeng Shen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency, open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based this http URL code is now available: this https URL

[CV-18] InfVSR: Breaking Length Limits of Generic Video Super-Resolution

【Quick Read】: This paper tackles two core problems in long-video super-resolution (VSR): the inefficiency of multi-step denoising over sequences of thousands of frames, and the artifacts and discontinuities introduced by temporal decomposition, which limit scalability. The key of the solution is InfVSR, a novel autoregressive-one-step-diffusion paradigm that adapts pre-trained video diffusion priors into a causal structure with a rolling KV-cache and joint visual guidance to preserve local and global consistency, and distills the diffusion process into a single step via patch-wise pixel supervision and cross-chunk distribution matching, greatly improving speed and stability. The method delivers up to a 58x speed-up on long videos and state-of-the-art quality, including on newly introduced semantic-consistency metrics.

Link: https://arxiv.org/abs/2510.00948
Authors: Ziqing Zhang,Kai Liu,Zheng Chen,Xi Li,Yucong Chen,Bingnan Duan,Linghe Kong,Yulun Zhang
Affiliations: Shanghai Jiao Tong University; Meituan Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code will be available at this https URL

Click to view abstract

Abstract:Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which novelly reformulates VSR as an autoregressive-one-step-diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be available at this https URL.

[CV-19] Looking Alike From Far to Near: Enhancing Cross-Resolution Re-Identification via Feature Vector Panning

【Quick Read】: This paper addresses cross-resolution person re-identification (CR-ReID), where varying camera distances make matching low-resolution (LR) and high-resolution (HR) images difficult and limit conventional ReID models. Existing methods mostly rely on super-resolution (SR) or joint learning for feature compensation, which raises training and inference complexity and has hit a performance bottleneck. The key insight, substantiated statistically with Canonical Correlation Analysis and Pearson Correlation Analysis, is that semantic directions related to resolution differences also emerge in the ReID feature space; building on this, a lightweight and effective Vector Panning Feature Alignment (VPFA) framework tackles CR-ReID from the new perspective of modeling resolution-specific feature discrepancies, significantly outperforming state-of-the-art baselines at higher efficiency.

Link: https://arxiv.org/abs/2510.00936
Authors: Zanwu Liu,Chao Yuan,Bo Li,Xiaowei Zhang,Guanglin Niu
Affiliations: Beihang University; Qingdao University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In surveillance scenarios, varying camera distances cause significant differences among pedestrian image resolutions, making it hard to match low-resolution (LR) images with high-resolution (HR) counterparts, limiting the performance of Re-Identification (ReID) tasks. Most existing Cross-Resolution ReID (CR-ReID) methods rely on super-resolution (SR) or joint learning for feature compensation, which increases training and inference complexity and has reached a performance bottleneck in recent studies. Inspired by semantic directions in the word embedding space, we empirically discover that semantic directions implying resolution differences also emerge in the feature space of ReID, and we substantiate this finding from a statistical perspective using Canonical Correlation Analysis and Pearson Correlation Analysis. Based on this interesting finding, we propose a lightweight and effective Vector Panning Feature Alignment (VPFA) framework, which conducts CR-ReID from a novel perspective of modeling the resolution-specific feature discrepancy. Extensive experimental results on multiple CR-ReID benchmarks show that our method significantly outperforms previous state-of-the-art baseline models while obtaining higher efficiency, demonstrating the effectiveness and superiority of our model based on the new finding in this paper.
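The abstract does not spell out the exact alignment operation, but the "vector panning" idea suggests translating low-resolution features along a learned, resolution-related semantic direction. A hypothetical sketch:

```python
import torch
import torch.nn.functional as F

def vector_panning(feat_lr, direction, alpha):
    """Shift LR ReID features along a learned semantic direction (assumed form).

    feat_lr: (B, D) features of low-resolution images.
    direction: (D,) learned resolution-discrepancy direction.
    alpha: scalar or (B, 1) panning strength, e.g. predicted per sample.
    """
    direction = F.normalize(direction, dim=0)  # unit-norm semantic direction
    return feat_lr + alpha * direction         # pan toward the HR feature region
```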

[CV-20] Equivariant Splitting: Self-supervised learning from incomplete data

【Quick Read】: This paper addresses how to train reconstruction networks for inverse problems via self-supervised learning when no ground truth is available, especially in the challenging setting where measurements are observed through a single incomplete observation model and traditional methods struggle to obtain a reliable learning signal. The key of the solution is a new definition of equivariance for reconstruction networks combined with self-supervised splitting losses: this combination yields unbiased estimates of the supervised loss, achieving state-of-the-art performance on image inpainting, accelerated MRI, and compressive sensing, particularly with highly rank-deficient forward models.

Link: https://arxiv.org/abs/2510.00929
Authors: Victor Sechaud,Jérémy Scanvic,Quentin Barthélemy,Patrice Abry,Julián Tachella
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Self-supervised learning for inverse problems allows to train a reconstruction network from noise and/or incomplete data alone. These methods have the potential of enabling learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in unbiased estimates of the supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models.
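A minimal sketch of a measurement-splitting loss for the inpainting case (y = mask * x), under our own simplifying assumptions about the interfaces: the measurements are split into a training subset and a held-out subset, the network reconstructs from the first, and the second supervises the output. The paper's contribution is showing that, when the network is equivariant in their proposed sense, this yields an unbiased estimate of the supervised loss.

```python
import torch

def splitting_loss(net, y, mask, split_ratio=0.5):
    """y: (B, C, H, W) zero-filled measurements; mask: (B, C, H, W) in {0, 1}."""
    keep = (torch.rand_like(mask) < split_ratio).float() * mask  # training subset
    held_out = mask - keep                                        # held-out subset
    x_hat = net(keep * y, keep)                                   # reconstruct from partial data
    # Penalize the prediction only on measurements the network never saw.
    return ((held_out * (x_hat - y)) ** 2).sum() / held_out.sum().clamp(min=1.0)
```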

[CV-21] PAL-Net: A Point-Wise CNN with Patch-Attention for 3D Facial Landmark Localization

【Quick Read】: This paper addresses the fact that manual annotation of anatomical landmarks on 3D facial scans is time-consuming and expertise-dependent, while being critical for clinical assessment, morphometric analysis, and craniofacial research. The key of the solution is PAL-Net, a fully automated deep learning pipeline combining coarse alignment, region-of-interest filtering, and an attention-enhanced patch-based pointwise CNN to precisely localize 50 anatomical landmarks on stereo-photogrammetry facial models. The design is lightweight and scalable while remaining accurate, outperforming existing methods and generalizing well across datasets and facial regions.

Link: https://arxiv.org/abs/2510.00910
Authors: Ali Shadman Yazdi,Annalisa Cappella,Benedetta Baldini,Riccardo Solazzo,Gianluca Tartaglia,Chiarella Sforza,Giuseppe Baselli
Affiliations: University of Milan; IRCCS Policlinico San Donato; Politecnico di Milano; Fondazione IRCCS Cà Granda, Ospedale Maggiore Policlinico
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based pointwise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserves relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41 mm and a distance-wise error of 0.38 mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions. PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at this https URL

[CV-22] Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification

【Quick Read】: This paper addresses the lack of systematic principles for choosing source datasets in transfer learning for medical imaging, a decision that often relies on researchers' intuition rather than explicit methods and can affect algorithm generalizability and patient outcomes. The key of the solution is a task-based survey, taken from an HCI perspective, revealing how practitioners actually make this choice: decisions are task-dependent and shaped by community practices, dataset properties, data-embedding similarity, or perceived visual/semantic similarity, and similarity ratings do not always align with expected performance, challenging the traditional "more similar is better" view. The study argues that these heuristics need clear definitions and HCI tools that make them explicit and usable, enabling more systematic source-dataset selection.

Link: https://arxiv.org/abs/2510.00902
Authors: Yucheng Lu,Hubert Dariusz Zając,Veronika Cheplygina,Amelia Jiménez-Sánchez
Affiliations: IT University of Copenhagen; University of Copenhagen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Under review

Click to view abstract

Abstract:Transfer learning is crucial for medical imaging, yet the selection of source datasets - which can impact the generalizability of algorithms, and thus patient outcomes - often relies on researchers’ intuition rather than systematic principles. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-centered HCI perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding), or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional “more similar is better” view. Participants often used ambiguous terminology, which suggests a need for clearer definitions and HCI tools to make them explicit and usable. By clarifying these heuristics, this work provides practical insights for more systematic source selection in transfer learning.

[CV-23] AI-CNet3D: An Anatomically-Informed Cross-Attention Network with Multi-Task Consistency Fine-tuning for 3D Glaucoma Classification

【Quick Read】: This paper addresses the loss of key structural information in OCT-based glaucoma diagnosis when 3D volumes are condensed into 2D reports. The key of the solution is AI-CNet3D, a novel hybrid deep learning model that integrates cross-attention into a 3D CNN to jointly extract and combine structural features from the superior and inferior hemiretinas as well as the optic nerve head (ONH) and macula. It introduces Channel Attention REpresentations (CAREs) to visualize cross-attention outputs and applies consistency-based multi-task fine-tuning that aligns them with Grad-CAMs from the final convolutional layer, improving performance, interpretability, and anatomical coherence. The model outperforms state-of-the-art attention and convolutional models on two large datasets across all key metrics while reducing the parameter count one-hundred-fold.

Link: https://arxiv.org/abs/2510.00882
Authors: Roshan Kenia,Anfei Li,Rishabh Srivastava,Kaveri A. Thakoor
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

Click to view abstract

Abstract:Glaucoma is a progressive eye disease that leads to optic nerve damage, causing irreversible vision loss if left untreated. Optical coherence tomography (OCT) has become a crucial tool for glaucoma diagnosis, offering high-resolution 3D scans of the retina and optic nerve. However, the conventional practice of condensing information from 3D OCT volumes into 2D reports often results in the loss of key structural details. To address this, we propose a novel hybrid deep learning model that integrates cross-attention mechanisms into a 3D convolutional neural network (CNN), enabling the extraction of critical features from the superior and inferior hemiretinas, as well as from the optic nerve head (ONH) and macula, within OCT volumes. We introduce Channel Attention REpresentations (CAREs) to visualize cross-attention outputs and leverage them for consistency-based multi-task fine-tuning, aligning them with Gradient-Weighted Class Activation Maps (Grad-CAMs) from the CNN's final convolutional layer to enhance performance, interpretability, and anatomical coherence. We have named this model AI-CNet3D (AI-'See'-Net3D) to reflect its design as an Anatomically-Informed Cross-attention Network operating on 3D data. By dividing the volume along two axes and applying cross-attention, our model enhances glaucoma classification by capturing asymmetries between the hemiretinal regions while integrating information from the optic nerve head and macula. We validate our approach on two large datasets, showing that it outperforms state-of-the-art attention and convolutional models across all key metrics. Finally, our model is computationally efficient, reducing the parameter count by one-hundred-fold compared to other attention mechanisms while maintaining high diagnostic performance and comparable GFLOPS.

[CV-24] Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model

【Quick Read】: This paper addresses the shortcomings of recurrent architectures traditionally used in video super-resolution (VSR), namely vanishing gradients, lack of parallelism, and slow inference, as well as the quadratic complexity and poor scalability of purely attention-based models on long sequences. The core of the solution is a hybrid architecture: shifted window self-attention performs fine-grained spatial context aggregation, while a selective state space model (SSM) such as Mamba handles efficient temporal propagation, capturing long-range dependencies in linear time. The key innovation is the alignment-aware Gather-Scatter Mamba (GSM), which warps features toward a center anchor frame within each temporal window before Mamba propagation and scatters them back afterwards, effectively reducing occlusion artifacts and ensuring balanced redistribution of aggregated information across frames.

Link: https://arxiv.org/abs/2510.00862
Authors: Hyun-kyu Ko,Youbin Kim,Jihyeon Park,Dongheok Park,Gyeongjin Kang,Wonjun Cho,Hyung Yi,Eunbyung Park
Affiliations: Sungkyunkwan University; Hanwha Systems; Yonsei University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: \url{ this https URL }

Click to view abstract

Abstract:State Space Models (SSMs)-most notably RNNs-have historically played a central role in sequential modeling. Although attention mechanisms such as Transformers have since dominated due to their ability to model global context, their quadratic complexity and limited scalability make them less suited for long sequences. Video super-resolution (VSR) methods have traditionally relied on recurrent architectures to propagate features across frames. However, such approaches suffer from well-known issues including vanishing gradients, lack of parallelism, and slow inference speed. Recent advances in selective SSMs like Mamba offer a compelling alternative: by enabling input-dependent state transitions with linear-time complexity, Mamba mitigates these issues while maintaining strong long-range modeling capabilities. Despite this potential, Mamba alone struggles to capture fine-grained spatial dependencies due to its causal nature and lack of explicit context aggregation. To address this, we propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation. Furthermore, we introduce Gather-Scatter Mamba (GSM), an alignment-aware mechanism that warps features toward a center anchor frame within the temporal window before Mamba propagation and scatters them back afterward, effectively reducing occlusion artifacts and ensuring effective redistribution of aggregated information across all frames. The official implementation is provided at: this https URL.
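The gather-scatter idea can be sketched with standard backward warping: features of every frame are warped to the center anchor frame, the sequence model propagates information there, and the outputs are warped back. The flow source and the sequence model are left abstract here; this illustrates the mechanism, not the released code.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Backward-warp feat (B, C, H, W) by optical flow (B, 2, H, W)."""
    b, _, h, w = feat.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xx, yy), dim=0).float().to(feat.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # sampling locations
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                 # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def gather_scatter(feats, flows_to_center, flows_back, seq_model):
    """feats: list of T (B, C, H, W) frame features; seq_model: e.g. a Mamba block."""
    gathered = [flow_warp(f, fl) for f, fl in zip(feats, flows_to_center)]       # gather
    propagated = seq_model(torch.stack(gathered, dim=1))                          # (B, T, C, H, W)
    return [flow_warp(propagated[:, t], fl) for t, fl in enumerate(flows_back)]  # scatter
```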

[CV-25] Feature Identification for Hierarchical Contrastive Learning ICASSP2026

【Quick Read】: This paper addresses the problem that conventional approaches to hierarchical classification neglect the inherent inter-class relationships at different hierarchy levels and thereby miss important supervisory signals. The key of the solution is two novel hierarchical contrastive learning (HMLC) methods: G-HMLC uses a Gaussian mixture model to capture the complex distributions and imbalance among higher-level classes, while A-HMLC uses an attention mechanism to capture hierarchy-specific features, imitating human processing. Both explicitly model cross-level inter-class relationships and enable fine-grained clustering at all hierarchy levels, outperforming existing hierarchical contrastive methods by 2 percentage points in linear-evaluation accuracy on CIFAR100 and ModelNet40.

Link: https://arxiv.org/abs/2510.00837
Authors: Julius Ott,Nastassia Vysotskaya,Huawei Sun,Lorenzo Servadei,Robert Wille
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to ICASSP 2026

Click to view abstract

Abstract:Hierarchical classification is a crucial task in many applications, where objects are organized into multiple levels of categories. However, conventional classification approaches often neglect inherent inter-class relationships at different hierarchy levels, thus missing important supervisory signals. To address this, we propose two novel hierarchical contrastive learning (HMLC) methods. The first leverages a Gaussian Mixture Model (G-HMLC), and the second uses an attention mechanism to capture hierarchy-specific features (A-HMLC), imitating human processing. Our approach explicitly models inter-class relationships and imbalanced class distribution at higher hierarchy levels, enabling fine-grained clustering across all hierarchy levels. On the competitive CIFAR100 and ModelNet40 datasets, our method achieves state-of-the-art performance in linear evaluation, outperforming existing hierarchical contrastive learning methods by 2 percentage points in terms of accuracy. The effectiveness of our approach is backed by both quantitative and qualitative results, highlighting its potential for applications in computer vision and beyond.
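The hierarchical objective can be pictured as a supervised contrastive loss applied per hierarchy level and summed with level weights. The sketch below uses the standard SupCon formulation as the per-level term; the paper's G-HMLC and A-HMLC variants refine how higher-level structure is modeled, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def supcon(z, labels, tau=0.1):
    """Supervised contrastive loss for one hierarchy level. z: (B, D)."""
    z = F.normalize(z, dim=1)
    logits = z @ z.t() / tau
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)                                   # exclude self-pairs
    self_mask = 1 - torch.eye(len(z), device=z.device)
    log_prob = logits - torch.log((logits.exp() * self_mask).sum(1, keepdim=True) + 1e-8)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def hierarchical_loss(z, level_labels, weights):
    """level_labels: list of (B,) label tensors, coarse to fine; weights: per-level floats."""
    return sum(w * supcon(z, y) for w, y in zip(weights, level_labels))
```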

[CV-26] NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution

【Quick Read】: This paper addresses the difficulty of balancing generation quality and inference efficiency in real-world image super-resolution (Real-ISR): existing methods built on pre-trained text-to-image (T2I) diffusion models are either slow or produce lower-quality outputs, and are sensitive to varying input degradations. The key of the solution is Next-Scale Autoregressive Modeling (NSARM), a new autoregressive (AR) framework that leverages the bitwise next-scale prediction strategy of visual AR models such as Infinity for efficient generation, trained in two stages: a transformation network first maps the low-quality input to preliminary scales, followed by end-to-end full-model fine-tuning. This comprehensive fine-tuning markedly improves robustness to input quality and generalization while keeping fast inference and high-quality outputs.

Link: https://arxiv.org/abs/2510.00820
Authors: Xiangtao Kong,Rongyuan Wu,Shuaizheng Liu,Lingchen Sun,Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but at the price of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations and lacks robustness to inputs of varying degradations. Recent visual autoregressive (AR) models, such as pre-trained Infinity, can provide strong T2I generation capabilities while offering superior efficiency by using the bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by an end-to-end full-model fine-tuning. Such a comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it demonstrates much higher robustness to the quality of input images, showing stronger generalization performance. Project page: this https URL

[CV-27] PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset ICCV2025

【Quick Read】: This paper addresses multimodal phrase-region segmentation, i.e., the correspondence between natural-language phrases and specific image regions, focusing on the limitation that existing phrase-grounding work is largely restricted to single-view images and ignores the rich geometric cues of stereo vision. The key of the solution is the PhraseStereo dataset, which builds on PhraseCut and uses the GenStereo model to generate accurate right-view images from existing single-view data, yielding stereo pairs with aligned segmentation masks and phrase annotations. This extends phrase grounding to the stereo domain, enables depth cues for more precise and context-aware grounding, and lays a foundation for multimodal learning that jointly reasons over semantics and geometry.

Link: https://arxiv.org/abs/2510.00818
Authors: Thomas Campagnolo,Ezio Malis,Philippe Martinet,Gaetan Bahl
Affiliations: Centre Inria d'Universite Cote d'Azur; NXP Semiconductors
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to X-Sense Ego-Exo Sensing for Smart Mobility Workshop at ICCV 2025 Conference

Click to view abstract

Abstract:Understanding how natural language phrases correspond to specific regions in images is a key challenge in multimodal semantic segmentation. Recent advances in phrase grounding are largely limited to single-view images, neglecting the rich geometric cues available in stereo vision. For this, we introduce PhraseStereo, the first novel dataset that brings phrase-region segmentation to stereo image pairs. PhraseStereo builds upon the PhraseCut dataset by leveraging GenStereo to generate accurate right-view images from existing single-view data, enabling the extension of phrase grounding into the stereo domain. This new setting introduces unique challenges and opportunities for multimodal learning, particularly in leveraging depth cues for more precise and context-aware grounding. By providing stereo image pairs with aligned segmentation masks and phrase annotations, PhraseStereo lays the foundation for future research at the intersection of language, vision, and 3D perception, encouraging the development of models that can reason jointly over semantics and geometry. The PhraseStereo dataset will be released online upon acceptance of this work.

[CV-28] From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

【Quick Read】: This paper addresses the physically inconsistent motion produced by current video generation models, where generated object motion violates real-world dynamics. The key of the solution is TrajVLM-Gen, a two-stage framework: a vision-language model (VLM) first predicts coarse-grained motion trajectories consistent with real-world physics; these trajectories then guide video generation through attention-based mechanisms for fine-grained motion refinement, yielding more physically plausible image-to-video generation.

Link: https://arxiv.org/abs/2510.00806
Authors: Fan Yang,Zhiyang Chen,Yousong Zhu,Xin Li,Jinqiao Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.

[CV-29] Solar PV Installation Potential Assessment on Building Facades Based on Vision and Language Foundation Models

【Quick Read】: This paper addresses the difficulty of assessing the photovoltaic (PV) potential of building facades in dense urban environments, where complex geometries and semantic components hinder automated recognition and deployment optimization. The key of the solution is SF-SPA (Semantic Facade Solar-PV Assessment), an automated framework whose four-stage pipeline combines geometric rectification to remove perspective distortion, zero-shot semantic segmentation for facade understanding, large language model (LLM)-guided spatial reasoning to optimize PV layout, and energy simulation for validation. The method substantially improves assessment efficiency and accuracy, providing a reliable tool for regional PV-potential studies, urban energy planning, and building-integrated photovoltaic (BIPV) deployment.

Link: https://arxiv.org/abs/2510.00797
Authors: Ruyu Liu,Dongxu Zhuang,Jianhua Zhang,Arega Getaneh Abate,Per Sieverts Nielsen,Ben Wang,Xiufeng Liu
Affiliations: Technical University of Denmark
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Building facades represent a significant untapped resource for solar energy generation in dense urban environments, yet assessing their photovoltaic (PV) potential remains challenging due to complex geometries and semantic components. This study introduces SF-SPA (Semantic Facade Solar-PV Assessment), an automated framework that transforms street-view photographs into quantitative PV deployment assessments. The approach combines computer vision and artificial intelligence techniques to address three key challenges: perspective distortion correction, semantic understanding of facade elements, and spatial reasoning for PV layout optimization. Our four-stage pipeline processes images through geometric rectification, zero-shot semantic segmentation, Large Language Model (LLM) guided spatial reasoning, and energy simulation. Validation across 80 buildings in four countries demonstrates robust performance with mean area estimation errors of 6.2% ± 2.8% compared to expert annotations. The automated assessment requires approximately 100 seconds per building, a substantial gain in efficiency over manual methods. Simulated energy yield predictions confirm the method's reliability and applicability for regional potential studies, urban energy planning, and building-integrated photovoltaic (BIPV) deployment. Code is available at: https://github.com/CodeAXu/Solar-PV-Installation

[CV-30] MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent Prompts

【Quick Read】: This paper addresses the difficulty of current text-to-image (T2I) models in preserving semantic consistency under minor linguistic variations of the input prompt: even logically equivalent prompt pairs often yield semantically inconsistent or misaligned images, exposing weaknesses in reasoning and generalization. The key of the solution is MetaLogic, an evaluation framework that needs no ground-truth images: building on metamorphic testing, it generates semantically equivalent but grammatically different prompt pairs and directly compares the resulting image pairs to detect semantic inconsistencies, diagnosing robustness failures in the model's logic understanding. It further categorizes alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples usable for model debugging and refinement.

Link: https://arxiv.org/abs/2510.00796
Authors: Yifan Shen,Yangyang Shu,Hye-young Paik,Yulei Sui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICFEM 2025

Click to view abstract

Abstract:Recent advances in text-to-image (T2I) models, especially diffusion-based architectures, have significantly improved the visual quality of generated images. However, these models continue to struggle with a critical limitation: maintaining semantic consistency when input prompts undergo minor linguistic variations. Despite being logically equivalent, such prompt pairs often yield misaligned or semantically inconsistent images, exposing a lack of robustness in reasoning and generalisation. To address this, we propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images. MetaLogic leverages metamorphic testing, generating image pairs from prompts that differ grammatically but are semantically identical. By directly comparing these image pairs, the framework identifies inconsistencies that signal failures in preserving the intended meaning, effectively diagnosing robustness issues in the model’s logic understanding. Unlike existing evaluation methods that compare a generated image to a single prompt, MetaLogic evaluates semantic equivalence between paired images, offering a scalable, ground-truth-free approach to identifying alignment failures. It categorises these alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples that can be used for model debugging and refinement. We evaluate MetaLogic across multiple state-of-the-art T2I models and reveal consistent robustness failures across a range of logical constructs. We find that even the SOTA text-to-image models like this http URL and DALLE-3 demonstrate a 59 percent and 71 percent misalignment rate, respectively. Our results show that MetaLogic is not only efficient and scalable, but also effective in uncovering fine-grained logical inconsistencies that are overlooked by existing evaluation metrics.
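The metamorphic-testing loop is easy to sketch: generate images from two logically equivalent prompts and flag the pair when the images disagree semantically. Below, semantic agreement is approximated with CLIP image-embedding cosine similarity; this stand-in and the threshold are our assumptions, and the paper's actual comparison and error categorization go further.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")

def consistency(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity of CLIP image embeddings as a semantic-agreement proxy."""
    with torch.no_grad():
        ea = model.encode_image(preprocess(img_a).unsqueeze(0))
        eb = model.encode_image(preprocess(img_b).unsqueeze(0))
    return torch.nn.functional.cosine_similarity(ea, eb).item()

# Logically equivalent, grammatically different prompt pair:
p1 = "a red cube to the left of a blue sphere"
p2 = "a blue sphere to the right of a red cube"
# A robustness failure is flagged when consistency(t2i(p1), t2i(p2)) < threshold,
# where t2i is the text-to-image model under test (not defined here).
```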

[CV-31] Uncertainty-Aware Concept Bottleneck Models with Enhanced Interpretability ECML-PKDD2025

【Quick Read】: This paper addresses two limitations of Concept Bottleneck Models (CBMs) for image classification: degraded predictive performance relative to end-to-end networks, and the underexplored propagation of uncertainty from concept predictions to final label decisions. The key of the solution is a novel uncertainty-aware, interpretable classifier that learns binary class-level concept prototypes and uses the distance between predicted concept vectors and each class prototype both as a classification score and as an uncertainty measure. The prototypes double as interpretable classification rules, indicating which concepts must be present to justify a class prediction, and they enable conformal prediction for uncertain or outlier inputs based on their deviation from the learned prototypes, improving both interpretability and robustness.

Link: https://arxiv.org/abs/2510.00773
Authors: Haifei Zhang,Patrick Barry,Eduardo Brandao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for the Workshop AIMLAI at ECML-PKDD 2025

Click to view abstract

Abstract:In the context of image classification, Concept Bottleneck Models (CBMs) first embed images into a set of human-understandable concepts, followed by an intrinsically interpretable classifier that predicts labels based on these intermediate representations. While CBMs offer a semantically meaningful and interpretable classification pipeline, they often sacrifice predictive performance compared to end-to-end convolutional neural networks. Moreover, the propagation of uncertainty from concept predictions to final label decisions remains underexplored. In this paper, we propose a novel uncertainty-aware and interpretable classifier for the second stage of CBMs. Our method learns a set of binary class-level concept prototypes and uses the distances between predicted concept vectors and each class prototype as both a classification score and a measure of uncertainty. These prototypes also serve as interpretable classification rules, indicating which concepts should be present in an image to justify a specific class prediction. The proposed framework enhances both interpretability and robustness by enabling conformal prediction for uncertain or outlier inputs based on their deviation from the learned binary class-level concept prototypes.
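The core scoring rule can be sketched directly: distances between predicted concept vectors and binary class-level prototypes serve both as classification scores and as an uncertainty measure, and a calibrated threshold turns them into conformal prediction sets. The L1 distance and the thresholding shown here are illustrative assumptions.

```python
import torch

def prototype_scores(concept_probs, prototypes):
    """concept_probs: (B, K) predicted concept probabilities in [0, 1];
    prototypes: (C, K) binary class-level concept prototypes.
    Returns (B, C) distances: smaller means more likely, larger means more uncertain."""
    return torch.cdist(concept_probs, prototypes.float(), p=1)

def conformal_sets(dists, threshold):
    """Classes within a calibrated distance form the prediction set; an empty
    set flags an outlier or highly uncertain input."""
    return [torch.nonzero(row <= threshold).flatten().tolist() for row in dists]
```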

[CV-32] ZQBA: Zero Query Black-box Adversarial Attack

【Quick Read】: This paper addresses the dependence of black-box adversarial attacks on large numbers of queries or on complex models such as diffusion models, which limits their real-world applicability. The key of the solution is the Zero Query Black-box Adversarial (ZQBA) attack, which requires neither surrogate-loss training nor diffusion models: feature maps obtained from a deep neural network (DNN) are added directly to clean images to produce transferable adversarial samples. The attack needs no queries to the target model at all, is more effective than state-of-the-art black-box attacks that use a single query, and keeps perturbations imperceptible (assessed quantitatively via SSIM and qualitatively), underscoring the vulnerability of deploying DNNs in real-world contexts.

Link: https://arxiv.org/abs/2510.00769
Authors: Joana C. Costa,Tiago Roxo,Hugo Proença,Pedro R. M. Inácio
Affiliations: sins-lab, Instituto de Telecomunicações, Universidade da Beira Interior
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICAART Conference

Click to view abstract

Abstract:Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at this https URL.
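The attack itself is a one-liner in spirit: take a feature map from a surrogate DNN, resize it to image resolution, and add it to the clean image under a small perturbation budget. The layer choice, channel pooling, and normalization below are our assumptions; notably, no query to the target model is involved.

```python
import torch
import torch.nn.functional as F

def zqba_attack(clean, surrogate_features, eps=8 / 255):
    """clean: (B, 3, H, W) in [0, 1]; surrogate_features: module returning
    (B, C, h, w) feature maps from some intermediate layer of a surrogate DNN."""
    fmap = surrogate_features(clean).mean(dim=1, keepdim=True)   # pool channels
    fmap = F.interpolate(fmap, size=clean.shape[-2:], mode="bilinear",
                         align_corners=False)
    fmap = (fmap - fmap.amin()) / (fmap.amax() - fmap.amin() + 1e-8)  # to [0, 1]
    return (clean + eps * (2 * fmap - 1)).clamp(0, 1)            # imperceptible shift
```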

[CV-33] Multi-Objective Task-Aware Predictor for Image-Text Alignment

【Quick Read】: This paper addresses multi-aspect, human-preference-aligned evaluation of image-text alignment in vision-language applications, which is particularly important in real-world scenarios where multiple valid descriptions exist; existing predictors lack at least one of: agreement with human judgments, long-sequence processing, inference efficiency, or multi-objective scoring. The key of the solution is MULTI-TAP (Multi-Objective Task-Aware Predictor), a plug-and-play architecture that produces a single overall score with a reward head built on a large vision-language model (LVLM) and, by training a lightweight ridge-regression layer on the frozen hidden states, yields fine-grained scores along multiple human-interpretable dimensions. At a comparatively small size (7-8B), it clearly outperforms existing metrics, performs on par with the GPT-4o-based G-VEval, and beats VisionREWARD in both performance and efficiency on multi-objective benchmarks and the newly released EYE4ALL dataset.

Link: https://arxiv.org/abs/2510.00766
Authors: Eunki Kim,Na Min An,James Thorne,Hyunjung Shim
Affiliations: KAIST AI; Theia Insights
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 10 figures, 21 tables

Click to view abstract

Abstract:Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on contexts or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and existing evaluation predictors lacking at least one of these key properties: (1) Alignment with human judgments, (2) Long-sequence processing, (3) Inference efficiency, and (4) Applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi and single-objective scoring. MULTI-TAP can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLMs). We show that MULTI-TAP is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics and even on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP performs better than VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
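The fine-grained scoring heads are, per the abstract, ridge regressors fit on frozen LVLM hidden states. A sketch with placeholder features (the pooling that produces H and the seven objectives are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

H_train = np.random.randn(1000, 4096)   # pooled frozen hidden states (placeholder)
y_train = np.random.rand(1000, 7)       # per-aspect human scores, 7 objectives (placeholder)

# One lightweight ridge head per human-interpretable objective.
heads = [Ridge(alpha=1.0).fit(H_train, y_train[:, k]) for k in range(y_train.shape[1])]

H_test = np.random.randn(5, 4096)
scores = np.stack([h.predict(H_test) for h in heads], axis=1)  # (5, 7) fine-grained scores
```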

[CV-34] Defect Segmentation in OCT scans of ceramic parts for non-destructive inspection using deep learning

【Quick Read】: This paper addresses automated, high-accuracy defect detection in ceramic manufacturing, where traditional non-destructive testing (NDT) struggles to identify internal defects efficiently. The key of the solution is an automatic deep learning (DL) detection system whose core is a U-Net-based network trained on manually annotated optical coherence tomography (OCT) images, enabling precise segmentation of internal defects such as pores, delaminations, and inclusions. The system reaches a Dice score of 0.979, outperforming comparable studies, and its inference time of 18.98 seconds per volume supports efficient, reliable, and automated industrial quality control.

Link: https://arxiv.org/abs/2510.00745
Authors: Andrés Laveda-Martínez,Natalia P. García-de-la-Puente,Fernando García-Torres,Niels Møller Israelsen,Ole Bang,Dominik Brouczek,Niels Benson,Adrián Colomer,Valery Naranjo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 3 figures, 4 tables. Paper accepted and presented at IDEAL 2025 Conference

Click to view abstract

Abstract:Non-destructive testing (NDT) is essential in ceramic manufacturing to ensure the quality of components without compromising their integrity. In this context, Optical Coherence Tomography (OCT) enables high-resolution internal imaging, revealing defects such as pores, delaminations, or inclusions. This paper presents an automatic defect detection system based on Deep Learning (DL), trained on OCT images with manually segmented annotations. A neural network based on the U-Net architecture is developed, evaluating multiple experimental configurations to enhance its performance. Post-processing techniques enable both quantitative and qualitative evaluation of the predictions. The system achieves a Dice score of 0.979, outperforming comparable studies. The inference time of 18.98 seconds per volume supports its viability for detecting inclusions, enabling more efficient, reliable, and automated quality control.

[CV-35] Extreme Blind Image Restoration via Prompt-Conditioned Information Bottleneck

【Quick Read】: This paper addresses Extreme Blind Image Restoration (EBIR), where inputs suffer severe, compounded degradations beyond the training scope of blind image restoration (BIR) methods: the huge domain gap makes direct ELQ-to-HQ mapping fail, producing unnatural artifacts and loss of detail. The key of the solution is a decomposed framework: a projector first maps the extremely low-quality (ELQ) image onto an intermediate, less-degraded low-quality (LQ) manifold, and a frozen, off-the-shelf BIR model then restores the intermediate image to high quality (HQ). Grounded in information theory, the method casts restoration as an Information Bottleneck problem and derives a theoretically driven objective that balances a low-quality reconstruction term with a high-quality prior-matching term, stabilizing training; it also enables Look Forward Once (LFO) inference-time prompt refinement and plug-and-play strengthening of existing restoration models without fine-tuning.

Link: https://arxiv.org/abs/2510.00728
Authors: Hongeun Kim,Bryan Sangwoo Kim,Jong Chul Ye
Affiliations: KAIST AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Blind Image Restoration (BIR) methods have achieved remarkable success but falter when faced with Extreme Blind Image Restoration (EBIR), where inputs suffer from severe, compounded degradations beyond their training scope. Directly learning a mapping from extremely low-quality (ELQ) to high-quality (HQ) images is challenging due to the massive domain gap, often leading to unnatural artifacts and loss of detail. To address this, we propose a novel framework that decomposes the intractable ELQ-to-HQ restoration process. We first learn a projector that maps an ELQ image onto an intermediate, less-degraded LQ manifold. This intermediate image is then restored to HQ using a frozen, off-the-shelf BIR model. Our approach is grounded in information theory; we provide a novel perspective of image restoration as an Information Bottleneck problem and derive a theoretically-driven objective to train our projector. This loss function effectively stabilizes training by balancing a low-quality reconstruction term with a high-quality prior-matching term. Our framework enables Look Forward Once (LFO) for inference-time prompt refinement, and supports plug-and-play strengthening of existing image restoration models without need for finetuning. Extensive experiments under severe degradation regimes provide a thorough analysis of the effectiveness of our work.

[CV-36] DEAP DIVE: Dataset Investigation with Vision transformers for EEG evaluation ICCV2025

【Quick Read】: This paper examines how accurate emotion prediction can be achieved with low-cost EEG devices by reducing the number of measurement channels: full-montage EEG yields high-quality data but is complex and resource-intensive, while low-cost devices with few channels usually sacrifice performance. The key of the solution is to convert raw EEG signals into scaleograms via the Continuous Wavelet Transformation and feed them into a vision transformer (ViT) for emotion classification; with only 12 channels the model reaches 91.57% accuracy on four-quadrant (arousal/valence) recognition, close to the state-of-the-art 96.9% obtained with 32 channels, showing that high-accuracy emotion prediction remains feasible with a much simpler hardware setup.

Link: https://arxiv.org/abs/2510.00725
Authors: Annemarie Hoffsommer,Helen Schneider,Svetlana Pavlitska,J. Marius Zöllner
Affiliations: Karlsruhe Institute of Technology (KIT); FZI Research Center for Information Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ABAW Workshop at ICCV2025

Click to view abstract

Abstract:Accurately predicting emotions from brain signals has the potential to achieve goals such as improving mental health, human-computer interaction, and affective computing. Emotion prediction through neural signals offers a promising alternative to traditional methods, such as self-assessment and facial expression analysis, which can be subjective or ambiguous. Measuring brain activity via electroencephalogram (EEG) provides a more direct and unbiased data source. However, conducting a full EEG is a complex, resource-intensive process, leading to the rise of low-cost EEG devices with simplified measurement capabilities. This work examines how subsets of EEG channels from the DEAP dataset can be used for sufficiently accurate emotion prediction with low-cost EEG devices, rather than fully equipped EEG measurements. Using Continuous Wavelet Transformation to convert EEG data into scaleograms, we trained a vision transformer (ViT) model for emotion classification. The model achieved over 91.57% accuracy in predicting 4 quadrants (high/low per arousal and valence) with only 12 measuring points (also referred to as channels). Our work clearly shows that a significantly reduced number of input channels yields results close to the state-of-the-art 96.9% achieved with 32 channels. Training scripts to reproduce our code can be found here: this https URL.
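The scaleogram preprocessing is straightforward to reproduce in outline: each EEG channel is passed through a continuous wavelet transform and the coefficient magnitudes form the image fed to the ViT. The Morlet wavelet, scale range, and sampling rate below are assumptions rather than the paper's exact configuration.

```python
import numpy as np
import pywt

fs = 128                                   # DEAP recordings are commonly downsampled to 128 Hz
signal = np.random.randn(fs * 60)          # placeholder 60-second EEG channel
scales = np.arange(1, 65)

coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
scaleogram = np.abs(coeffs)                # (64, T) time-frequency image for the ViT
```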

[CV-37] Deep learning motion correction of quantitative stress perfusion cardiovascular magnetic resonance

【Quick Read】: This paper addresses image artifacts caused by cardiac and respiratory motion in quantitative stress perfusion cardiovascular magnetic resonance (CMR), which compromise pixel-wise myocardial perfusion mapping; traditional registration-based motion correction is slow and sensitive to acquisition variability, limiting robustness and scalability. The key of the solution is an unsupervised deep learning motion-correction pipeline that replaces iterative registration with efficient one-shot estimation: it corrects motion in three steps, aligns the perfusion series together with auxiliary images (the arterial input function and proton-density-weighted series), and uses robust principal component analysis to reduce contrast-related effects. This improves the temporal smoothness of time-intensity curves, keeps myocardial alignment high (Dice of 0.92), stabilizes perfusion maps, and cuts processing time 15-fold; trained on multivendor data, it generalizes across sequences and could support broader clinical adoption of quantitative perfusion imaging.

Link: https://arxiv.org/abs/2510.00723
Authors: Noortje I.P. Schueler,Nathan C. K. Wong,Richard J. Crawley,Josien P.W. Pluim,Amedeo Chiribiri,Cian M. Scannell
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Click to view abstract

Abstract:Background: Quantitative stress perfusion cardiovascular magnetic resonance (CMR) is a powerful tool for assessing myocardial ischemia. Motion correction is essential for accurate pixel-wise mapping but traditional registration-based methods are slow and sensitive to acquisition variability, limiting robustness and scalability. Methods: We developed an unsupervised deep learning-based motion correction pipeline that replaces iterative registration with efficient one-shot estimation. The method corrects motion in three steps and uses robust principal component analysis to reduce contrast-related effects. It aligns the perfusion series and auxiliary images (arterial input function and proton density-weighted series). Models were trained and validated on multivendor data from 201 patients, with 38 held out for testing. Performance was assessed via temporal alignment and quantitative perfusion values, compared to a previously published registration-based method. Results: The deep learning approach significantly improved temporal smoothness of time-intensity curves (p<0.001). Myocardial alignment (Dice = 0.92 (0.04) and 0.91 (0.05)) was comparable to the baseline and superior to before registration (Dice = 0.80 (0.09), p<0.001). Perfusion maps showed reduced motion, with lower standard deviation in the myocardium (0.52 (0.39) ml/min/g) compared to baseline (0.55 (0.44) ml/min/g). Processing time was reduced 15-fold. Conclusion: This deep learning pipeline enables fast, robust motion correction for stress perfusion CMR, improving accuracy across dynamic and auxiliary images. Trained on multivendor data, it generalizes across sequences and may facilitate broader clinical adoption of quantitative perfusion imaging.

[CV-38] raining-free Uncertainty Guidance for Complex Visual Tasks with MLLM s

【Quick Read】: This paper addresses the performance bottleneck of multimodal large language models (MLLMs) on fine-grained perception, such as spotting small objects in high-resolution images or locating key moments in long videos; existing approaches rely on complicated task-specific fine-tuning, which limits generalizability and adds complexity. The key of the solution is to use the MLLM's own output entropy as a proactive guidance signal, based on the insight that response uncertainty drops when the model receives relevant visual information. A unified, training-free mechanism scores candidate visual inputs by response uncertainty and automatically focuses on the most salient data, letting off-the-shelf MLLMs match specialized fine-tuned methods on three complex tasks: visual search, long-video understanding, and temporal grounding.

Link: https://arxiv.org/abs/2510.00705
Authors: Sanghwan Kim,Rui Xiao,Stephan Alaniz,Yongqin Xian,Zeynep Akata
Affiliations: Technical University of Munich; Munich Center for Machine Learning; Helmholtz Munich; Google; LTCI, Télécom Paris, Institut Polytechnique de Paris
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or finding key moments in long videos. Existing works typically rely on complicated, task-specific fine-tuning, which limits their generalizability and increases model complexity. In this work, we propose an effective, training-free framework that uses an MLLM’s intrinsic uncertainty as a proactive guidance signal. Our core insight is that a model’s output entropy decreases when presented with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data. We apply this simple principle to three complex visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned methods. Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
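The selection rule is simple to state in code: score each candidate visual input (e.g., an image crop or a video segment) by the entropy of the model's answer distribution and keep the one with the lowest uncertainty. The mllm callable below is a stand-in for whatever interface returns answer logits.

```python
import torch

def response_entropy(logits):
    """Mean per-token entropy of the answer distribution. logits: (T, vocab)."""
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-9)).sum(-1).mean()

def select_candidate(mllm, question, candidates):
    """Pick the candidate input that minimizes the model's output entropy."""
    entropies = [response_entropy(mllm(c, question)).item() for c in candidates]
    return candidates[entropies.index(min(entropies))]
```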

[CV-39] Graph Integrated Multimodal Concept Bottleneck Model

【Quick Read】: This paper addresses two limitations of traditional Concept Bottleneck Models (CBMs): they are usually single-modal and handle multimodal inputs poorly, and they ignore structured relationships among concepts, making complex concept interactions hard to model. The key innovation of the proposed MoE-SGT framework is twofold: answer-concept and answer-question graphs explicitly encode structured concept relationships in multimodal inputs, with a Graph Transformer capturing multi-level dependencies; and the feed-forward layers are replaced by a Mixture of Experts (MoE) module that dynamically allocates reasoning tasks to different sub-experts, greatly improving adaptability to complex concept patterns. The result is higher accuracy than other concept bottleneck networks on multiple datasets, together with better interpretability and generalization.

Link: https://arxiv.org/abs/2510.00701
Authors: Jiakai Lin,Jinchang Zhang,Guoyu Lu
Affiliations: SUNY Binghamton
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:With growing demand for interpretability in deep learning, especially in high stakes domains, Concept Bottleneck Models (CBMs) address this by inserting human understandable concepts into the prediction pipeline, but they are generally single modal and ignore structured concept relationships. To overcome these limitations, we present MoE-SGT, a reasoning driven framework that augments CBMs with a structure injecting Graph Transformer and a Mixture of Experts (MoE) module. We construct answer-concept and answer-question graphs for multimodal inputs to explicitly model the structured relationships among concepts. Subsequently, we integrate Graph Transformer to capture multi level dependencies, addressing the limitations of traditional Concept Bottleneck Models in modeling concept interactions. However, it still encounters bottlenecks in adapting to complex concept patterns. Therefore, we replace the feed forward layers with a Mixture of Experts (MoE) module, enabling the model to have greater capacity in learning diverse concept relationships while dynamically allocating reasoning tasks to different sub experts, thereby significantly enhancing the model’s adaptability to complex concept reasoning. MoE-SGT achieves higher accuracy than other concept bottleneck networks on multiple datasets by modeling structured relationships among concepts and utilizing a dynamic expert selection mechanism.

[CV-40] HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

【Quick Read】: This paper addresses the fact that most current vision-language-action models (VLAs) ignore history in robotic manipulation, deciding from the current observation alone without exploiting preceding perceptual context. The key of the proposed scalable framework HAMLET is "moment tokens" that compactly encode perceptual information at each timestep, initialized with time-contrastive learning to better capture temporally distinctive aspects, plus a lightweight memory module that integrates past moment tokens into memory features used explicitly during action prediction. Empirically, HAMLET markedly improves long-horizon tasks: on top of GR00T N1.5 it reaches an average success rate of 76.4% on history-dependent real-world tasks (47.2% above the baseline), and lifts RoboCasa Kitchen (100-demo setup) from 64.1% to 66.4% and LIBERO from 95.6% to 97.7%.

Link: https://arxiv.org/abs/2510.00695
Authors: Myungkyu Koo,Daewon Choi,Taeyoung Kim,Kyungmin Lee,Changyeon Kim,Youngyo Seo,Jinwoo Shin
Affiliations: KAIST; UC Berkeley; Real World Inc.
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.

[CV-41] ProtoMask: Segmentation-Guided Prototype Learning

【Quick Read】: This paper addresses reliability concerns with the post-hoc saliency techniques that prototype-based case-based reasoning methods use to explain learned prototypes: the faithfulness of the mapping between embedding space and input space is hard to guarantee, leaving such visualizations uncertain. The key of the solution is to employ image segmentation foundation models so that saliency-map computation is restricted to predefined semantic image patches, improving the truthfulness of the mapping; the proposed ProtoMask architecture crops the image with the bounding box of each segmentation mask, turning each mask into an individual, semantically coherent input for more precise and interpretable representations.

Link: https://arxiv.org/abs/2510.00683
Authors: Steffen Meinert,Philipp Schlinge,Nils Strodthoff,Martin Atzmueller
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:XAI has gained considerable importance in recent years. Methods based on prototypical case-based reasoning have shown a promising improvement in explainability. However, these methods typically rely on additional post-hoc saliency techniques to explain the semantics of learned prototypes. Multiple critiques have been raised about the reliability and quality of such techniques. For this reason, we study the use of prominent image segmentation foundation models to improve the truthfulness of the mapping between embedding and input space. We aim to restrict the computation area of the saliency map to a predefined semantic image patch to reduce the uncertainty of such visualizations. To perceive the information of an entire image, we use the bounding box from each generated segmentation mask to crop the image. Each mask results in an individual input in our novel model architecture named ProtoMask. We conduct experiments on three popular fine-grained classification datasets with a wide set of metrics, providing a detailed overview on explainability characteristics. The comparison with other popular models demonstrates competitive performance and unique explainability features of our model. this https URL
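The patch-construction step stated in the abstract, cropping the image with the bounding box of each segmentation mask, can be sketched directly. The masks are assumed to come from a segmentation foundation model (e.g., SAM-style boolean arrays), which the paper leaves configurable.

```python
import numpy as np
from PIL import Image

def mask_crops(image: Image.Image, masks):
    """masks: iterable of boolean (H, W) arrays; returns one crop per mask."""
    crops = []
    for m in masks:
        ys, xs = np.nonzero(m)
        if xs.size == 0:                     # skip empty masks
            continue
        box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
        crops.append(image.crop(box))        # PIL box: (left, top, right, bottom)
    return crops
```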

[CV-42] Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation

【Quick Read】: This paper addresses open-vocabulary object detection with event cameras: event data lack texture and color, so existing event-based detectors trained on predefined categories generalize poorly to unseen classes. The key of the solution is an event-image knowledge distillation framework that uses CLIP as a teacher fed with image frames, guiding an event-driven student model to inherit CLIP's semantic understanding through spatial-attention-based distillation. In addition, a hybrid spiking neural network (SNN) and convolutional neural network (CNN) design adaptively determines the optimal moments to segment the event stream, unlike fixed-group segmentation that often discards crucial temporal information, so key temporal features are preserved and open-vocabulary detection is performed on raw event streams.

Link: https://arxiv.org/abs/2510.00681
Authors: Jinchang Zhang,Zijun Li,Jiakai Lin,Guoyu Lu
Affiliations: Intelligent Vision and Sensing (IVS) Lab at SUNY Binghamton
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Event cameras offer advantages in object detection tasks due to high-speed response, low latency, and robustness to motion blur. However, event cameras lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined categories, limiting their ability to generalize to novel objects, where encountering previously unseen objects is common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework that leverages CLIP’s semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as inputs to a teacher model, guiding the event-based student model to learn CLIP’s rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs while inheriting CLIP’s broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid spiking neural network (SNN) and convolutional neural network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.

[CV-43] Beyond one-hot encoding? Journey into compact encoding for large multi-class segmentation MICCAI2025

【Quick Read】: This paper addresses the high computational and memory cost of medical image segmentation with very many classes, asking how to reduce the encoding overhead while preserving segmentation performance. The key of the solution is a family of binary encoding approaches that replace the conventional one-hot encoding, bringing computational complexity and memory requirements down from linear to logarithmic in the number of classes; error-correcting output codes (ECOCs), class weighting, hard/soft decoding, class-to-codeword assignment, and label embedding trees are also explored. Nevertheless, experiments on whole-brain parcellation with 108 classes show that binary encodings struggle to match one-hot segmentation quality (DSC drops from 82.4 to the 39.3-73.8 range), revealing the limits of such methods for high-accuracy segmentation.

Link: https://arxiv.org/abs/2510.00667
Authors: Aaron Kujawa,Thomas Booth,Tom Vercauteren
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Presented at EMA4MICCAI 2025 Workshop

Click to view abstract

Abstract:This work presents novel methods to reduce computational and memory requirements for medical image segmentation with a large number of classes. We curiously observe challenges in maintaining state-of-the-art segmentation performance with all of the explored options. Standard learning-based methods typically employ one-hot encoding of class labels. The computational complexity and memory requirements thus increase linearly with the number of classes. We propose a family of binary encoding approaches instead of one-hot encoding to reduce the computational complexity and memory requirements to logarithmic in the number of classes. In addition to vanilla binary encoding, we investigate the effects of error-correcting output codes (ECOCs), class weighting, hard/soft decoding, class-to-codeword assignment, and label embedding trees. We apply the methods to the use case of whole brain parcellation with 108 classes based on 3D MRI images. While binary encodings have proven efficient in so-called extreme classification problems in computer vision, we faced challenges in reaching state-of-the-art segmentation quality with binary encodings. Compared to one-hot encoding (Dice Similarity Coefficient (DSC) = 82.4 (2.8)), we report reduced segmentation performance with the binary segmentation approaches, achieving DSCs in the range from 39.3 to 73.8. Informative negative results all too often go unpublished. We hope that this work inspires future research of compact encoding strategies for large multi-class segmentation tasks.
zh

[CV-44] A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models

【速读】:该论文旨在解决当前生成式 AI(Generative AI)图像模型中忽视数据几何结构的问题,即现有方法多依赖概率近似而忽略图像数据在高维空间中的低维流形特性。其关键解决方案在于提出一个统一的几何与概率框架,通过引入基于核函数的概率方法,将扩散模型解释为向“优质图像”流形上的投影机制,并据此构建了新的确定性模型——流形-概率投影模型(Manifold-Probabilistic Projection Model, MPPM)。该模型同时在像素空间和潜在空间中运作,显著提升了图像恢复与生成的质量,尤其在Latent MPPM(LMPPM)上优于Latent Diffusion Model(LDM)。

链接: https://arxiv.org/abs/2510.00666
作者: Leah Bar,Liron Mor Yosef,Shai Zucker,Neta Shoham,Inbar Seroussi,Nir Sochen
机构: Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The foundational premise of generative AI for images is the assumption that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models, the low dimensional nature of the data manifests itself through the introduction of a lower dimensional latent space. Yet, the probability distribution in the latent or the manifold coordinate space is considered uninteresting and is predefined or considered uniform. This study unifies the geometric and probabilistic perspectives by providing a geometric framework and a kernel-based probabilistic method simultaneously. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of “good images”. This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.
zh

[CV-45] Multi-Domain Brain Vessel Segmentation Through Feature Disentanglement

【速读】:该论文旨在解决脑血管(cerebrovascular)图像中动脉与静脉自动分割的难题,尤其针对不同医疗中心、成像模态(imaging modality)及血管类型之间存在的显著域差异(domain gap)导致模型泛化能力受限的问题。其解决方案的关键在于利用图像到图像的翻译(image-to-image translation)技术结合解耦表示学习(disentanglement techniques),实现对图像属性的独立操控:在保持空间信息(如血管形状和位置)不变的前提下,仅调整血管外观特征以完成跨域适应,从而无需为特定域设计专用模型或进行数据归一化处理。这一方法确保了标签保留(label-preserving)的迁移能力,显著提升了模型在多场景下的分割准确性与鲁棒性。

链接: https://arxiv.org/abs/2510.00665
作者: Francesco Galati,Daniele Falcetta,Rosa Cortese,Ferran Prados,Ninon Burgos,Maria A. Zuluaga
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 7 figures, 3 tables. Joint first authors: Francesco Galati and Daniele Falcetta. Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL . Code available at this https URL

点击查看摘要

Abstract:The intricate morphology of brain vessels poses significant challenges for automatic segmentation models, which usually focus on a single imaging modality. However, accurately treating brain-related conditions requires a comprehensive understanding of the cerebrovascular tree, regardless of the specific acquisition procedure. Our framework effectively segments brain arteries and veins in various datasets through image-to-image translation while avoiding domain-specific model design and data harmonization between the source and the target domain. This is accomplished by employing disentanglement techniques to independently manipulate different image properties, allowing them to move from one domain to another in a label-preserving manner. Specifically, we focus on manipulating vessel appearances during adaptation while preserving spatial information, such as shapes and locations, which are crucial for correct segmentation. Our evaluation effectively bridges large and varied domain gaps across medical centers, image modalities, and vessel types. Additionally, we conduct ablation studies on the optimal number of required annotations and other architectural choices. The results highlight our framework’s robustness and versatility, demonstrating the potential of domain adaptation methodologies to perform cerebrovascular image segmentation in multiple scenarios accurately. Our code is available at this https URL.
zh

[CV-46] Batch-CAM: Introduction to better reasoning in convolutional deep learning models

【速读】:该论文旨在解决深度学习模型在高风险领域(如医疗健康)中缺乏可解释性的问题,即模型决策过程不透明,难以验证其合理性。解决方案的关键在于提出一种名为Batch-CAM的新训练范式,该方法将Grad-CAM(Gradient-weighted Class Activation Mapping)的批处理实现与原型重建损失(prototypical reconstruction loss)相结合,引导模型聚焦于图像中的显著特征,从而在提升分类准确率的同时改善图像重建质量,并缩短训练和推理时间,增强模型的透明度、可解释性和可信度。

链接: https://arxiv.org/abs/2510.00664
作者: Giacomo Ignesti,Davide Moroni,Massimo Martinelli
机构: Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” (ISTI), Consiglio Nazionale delle Ricerche (CNR)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7 figures, submitted to SN Computer Science Springer Nature

点击查看摘要

Abstract:Understanding the inner workings of deep learning models is crucial for advancing artificial intelligence, particularly in high-stakes fields such as healthcare, where accurate explanations are as vital as precision. This paper introduces Batch-CAM, a novel training paradigm that fuses a batch implementation of the Grad-CAM algorithm with a prototypical reconstruction loss. This combination guides the model to focus on salient image features, thereby enhancing its performance across classification tasks. Our results demonstrate that Batch-CAM achieves a simultaneous improvement in accuracy and image reconstruction quality while reducing training and inference times. By ensuring models learn from evidence-relevant information, this approach makes a meaningful contribution to building more transparent, explainable, and trustworthy AI systems.
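
下面给出批量化 Grad-CAM 的一个简化示意(PyTorch;非论文官方实现,原型重建损失未包含在内,仅演示对整个 batch 一次性计算 CAM 并使其可微、从而能作为训练损失一部分的思路):

```python
import torch
import torch.nn.functional as F

def batch_gradcam(feats, logits, labels):
    # feats: (N, C, H, W) 末层卷积特征(需保留在计算图中);logits: (N, K);labels: (N,)
    score = logits.gather(1, labels.view(-1, 1)).sum()
    # 一次反传得到整个 batch 的梯度;create_graph=True 使 CAM 可进入训练损失
    grads = torch.autograd.grad(score, feats, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)               # GAP 得到通道权重
    cam = F.relu((weights * feats).sum(dim=1, keepdim=True))     # (N, 1, H, W)
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-6)    # 逐样本归一化
```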
zh

[CV-47] Unsupervised Unfolded rPCA (U2-rPCA): Deep Interpretable Clutter Filtering for Ultrasound Microvascular Imaging

【速读】:该论文旨在解决超声微血管成像中杂波滤波的难题,特别是传统奇异值分解(SVD)和鲁棒主成分分析(rPCA)方法在特征建模能力不足以及组织与血流信号分离不充分的问题。现有基于深度学习的滤波方法虽具潜力,但受限于缺乏可解释性及体外/体内真实标签(ground truth)。为此,作者提出一种无监督展开的鲁棒主成分分析方法(U2-rPCA),其关键在于将迭代重加权最小二乘(IRLS)rPCA算法进行结构展开,并引入稀疏增强单元以强化对稀疏微血流信号的捕捉能力,从而在保持数学可解释性的前提下实现无需标签的学习策略。该方法通过部分图像序列训练后自适应地应用于后续帧,在仿真和公开活体数据集上均显著优于SVD、rPCA基线及其它深度学习滤波器,尤其使功率多普勒图像的信噪比(CNR)提升2–10 dB。

链接: https://arxiv.org/abs/2510.00660
作者: Huaying Li,Liansheng Wang,Yinran Chen
机构: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University (厦门大学信息学院感知与计算智能城市福建省重点实验室); Department of Computer Science and Technology, School of Informatics, and the National Institute for Data Science in Health and Medicine, Xiamen University (厦门大学信息学院计算机科学与技术系及健康与医学数据科学国家研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-sensitivity clutter filtering is a fundamental step in ultrasound microvascular imaging. Singular value decomposition (SVD) and robust principal component analysis (rPCA) are the main clutter filtering strategies. However, both strategies are limited in feature modeling and tissue-blood flow separation for high-quality microvascular imaging. Recently, deep learning-based clutter filtering has shown potential in more thoroughly separating tissue and blood flow signals. However, the existing supervised filters face the challenges of interpretability and lack of in-vitro and in-vivo ground truths. While the interpretability issue can be addressed by algorithm deep unfolding, the training ground truth remains unsolved. To this end, this paper proposes an unsupervised unfolded rPCA (U2-rPCA) method that preserves mathematical interpretability and is insusceptible to learning labels. Specifically, U2-rPCA is unfolded from an iteratively reweighted least squares (IRLS) rPCA baseline with intrinsic low-rank and sparse regularization. A sparse-enhancement unit is added to the network to strengthen its capability to capture the sparse micro-flow signals. U2-rPCA is like an adaptive filter that is trained with part of the image sequence and then used for the following frames. Experimental validations on an in-silico dataset and public in-vivo datasets demonstrated that U2-rPCA outperforms the SVD-based method, the rPCA baseline, and another deep learning-based filter. Particularly, the proposed method improved the contrast-to-noise ratio (CNR) of the power Doppler image by 2 dB to 10 dB when compared with other methods. Furthermore, the effectiveness of the building modules of U2-rPCA was validated through ablation studies.
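
作为参考,下面是论文所展开(unfold)的 IRLS rPCA 基线的一个 NumPy 简化示意(交替进行奇异值软阈值与重加权软阈值;阈值与超参数均为演示假设,真实 U2-rPCA 将其中的参数学习化并加入稀疏增强单元):

```python
import numpy as np

def irls_rpca(X, lam=0.5, n_iter=30, eps=1e-2):
    # X: Casorati 矩阵 (像素数, 帧数);L 为低秩组织分量,S 为稀疏血流分量
    L, S = np.zeros_like(X), np.zeros_like(X)
    W = np.ones_like(X)                        # 首轮权重为 1,相当于普通软阈值
    for _ in range(n_iter):
        # 低秩步:对残差做奇异值软阈值
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U * np.maximum(s - 1.0, 0.0)) @ Vt
        # 稀疏步:IRLS 重加权软阈值
        S = np.sign(X - L) * np.maximum(np.abs(X - L) - lam * W, 0.0)
        W = 1.0 / (np.abs(S) + eps)            # 重加权:幅值越小惩罚越强(逼近 L0)
    return L, S
```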
zh

[CV-48] Align Your Tangent: Training Better Consistency Models via Manifold-Aligned Tangents

【速读】:该论文旨在解决一致性模型(Consistency Models, CMs)在训练过程中收敛缓慢且对大批次大小依赖性强的问题,同时保持高质量的生成样本。其关键解决方案是提出一种新的损失函数——流形特征距离(Manifold Feature Distance, MFD),该损失能够引导模型梯度更新方向(即CM切线)朝向数据流形(data manifold)而非沿其平行移动,从而显著减少训练中的振荡现象。由此,所提出的Align Your Tangent (AYT) 方法可将CM训练速度提升数个数量级,并支持极小批次训练而不牺牲样本质量,甚至优于传统感知图像块相似性度量(LPIPS)。

链接: https://arxiv.org/abs/2510.00658
作者: Beomsu Kim,Byunghee Cha,Jong Chul Ye
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:With diffusion and flow matching models achieving state-of-the-art generative performance, the interest of the community has now turned to reducing the inference time without sacrificing sample quality. Consistency Models (CMs), which are trained to be consistent on diffusion or probability flow ordinary differential equation (PF-ODE) trajectories, enable one- or two-step flow or diffusion sampling. However, CMs typically require prolonged training with large batch sizes to obtain competitive sample quality. In this paper, we examine the training dynamics of CMs near convergence and discover that CM tangents – CM output update directions – are quite oscillatory, in the sense that they move parallel to the data manifold, not towards the manifold. To mitigate oscillatory tangents, we propose a new loss function, called the manifold feature distance (MFD), which provides manifold-aligned tangents that point toward the data manifold. Consequently, our method – dubbed Align Your Tangent (AYT) – can accelerate CM training by orders of magnitude and even outperform the learned perceptual image patch similarity metric (LPIPS). Furthermore, we find that our loss enables training with extremely small batch sizes without compromising sample quality. Code: this https URL
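
MFD 损失的一个最小示意如下(PyTorch;feat_layers 表示某个冻结的预训练编码器的若干阶段,属演示假设,并非论文选用的具体特征网络):

```python
import torch.nn.functional as F

def manifold_feature_distance(x_pred, x_target, feat_layers):
    # 在预训练特征空间中逐层比较 CM 输出与目标样本,
    # 使切线(输出更新方向)指向数据流形,而非沿流形平行振荡
    loss, h_p, h_t = 0.0, x_pred, x_target
    for layer in feat_layers:
        h_p, h_t = layer(h_p), layer(h_t)
        loss = loss + F.mse_loss(h_p, h_t.detach())
    return loss
```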
zh

[CV-49] Weakly Supervised Cloud Detection Combining Spectral Features and Multi-Scale Deep Network

【速读】:该论文旨在解决光学卫星图像中云层干扰导致图像质量下降的问题,尤其针对深度学习方法在薄云检测和训练样本质量较低时精度不足的局限性。解决方案的关键在于提出一种结合光谱特征与多尺度场景级深度网络(SpecMCD)的弱监督云检测方法:首先通过多尺度场景级数据集训练多尺度云检测网络,进而利用密集云覆盖和大范围云区图像的特性融合多尺度概率图与云厚图生成像素级云概率图,最后基于不同尺度场景级云掩膜的差异化区域自适应生成阈值,并引入距离加权优化策略获得二值化云掩膜,从而显著提升复杂云况下的云检测精度。

链接: https://arxiv.org/abs/2510.00654
作者: Shaocong Zhu,Zhiwei Li,Xinghua Li,Huanfeng Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clouds significantly affect the quality of optical satellite images, which seriously limits their precise application. Recently, deep learning has been widely applied to cloud detection and has achieved satisfactory results. However, the lack of distinctive features in thin clouds and the low quality of training samples limit the cloud detection accuracy of deep learning methods, leaving space for further improvements. In this paper, we propose a weakly supervised cloud detection method that combines spectral features and multi-scale scene-level deep network (SpecMCD) to obtain highly accurate pixel-level cloud masks. The method first utilizes a progressive training framework with a multi-scale scene-level dataset to train the multi-scale scene-level cloud detection network. Pixel-level cloud probability maps are then obtained by combining the multi-scale probability maps and cloud thickness map based on the characteristics of clouds in dense cloud coverage and large cloud-area coverage images. Finally, adaptive thresholds are generated based on the differentiated regions of the scene-level cloud masks at different scales and combined with distance-weighted optimization to obtain binary cloud masks. Two datasets, WDCD and GF1MS-WHU, comprising a total of 60 Gaofen-1 multispectral (GF1-MS) images, were used to verify the effectiveness of the proposed method. Compared to the other weakly supervised cloud detection methods such as WDCD and WSFNet, the F1-score of the proposed SpecMCD method shows an improvement of over 7.82%, highlighting the superiority and potential of the SpecMCD method for cloud detection under different cloud coverage conditions.
zh

[CV-50] OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding ICDM2025

【速读】:该论文旨在解决多模态标签任务中封闭集(closed-set)一致性与开放词汇(open-vocabulary)灵活性之间的矛盾问题。传统方法往往受限于预定义类别集合,难以适应用户自定义标签的动态扩展需求;而纯开放标签方法又易导致语义不一致或性能不稳定。解决方案的关键在于提出OTTER框架,其核心是通过一个统一的多标签 tagging 框架,结合预定义标签(predefined labels)与用户驱动的开放标签(open tags),利用多层次注意力机制(multi-head attention architecture)联合对齐视觉和文本表示与两类标签嵌入(label embeddings),从而在保持固定类别准确性的基础上,实现对开放标签的高精度、语义一致的动态识别。实验表明,OTTER在两个基准数据集上均显著优于现有方法,尤其在开放标签上的F1分数接近完美(0.99和0.97),验证了其在兼顾稳定性与适应性方面的有效性。

链接: https://arxiv.org/abs/2510.00652
作者: Jieer Ouyang,Xiaoneng Xiang,Zheng Wang,Yangkai Ding
机构: Huawei Singapore Research Center (华为新加坡研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICDM 2025 BigIS Workshop

点击查看摘要

Abstract:We introduce OTTER, a unified open-set multi-label tagging framework that harmonizes the stability of a curated, predefined category set with the adaptability of user-driven open tags. OTTER is built upon a large-scale, hierarchically organized multi-modal dataset, collected from diverse online repositories and annotated through a hybrid pipeline combining automated vision-language labeling with human refinement. By leveraging a multi-head attention architecture, OTTER jointly aligns visual and textual representations with both fixed and open-set label embeddings, enabling dynamic and semantically consistent tagging. OTTER consistently outperforms competitive baselines on two benchmark datasets: it achieves an overall F1 score of 0.81 on Otter and 0.75 on Favorite, surpassing the next-best results by margins of 0.10 and 0.02, respectively. OTTER attains near-perfect performance on open-set labels, with F1 of 0.99 on Otter and 0.97 on Favorite, while maintaining competitive accuracy on predefined labels. These results demonstrate OTTER’s effectiveness in bridging closed-set consistency with open-vocabulary flexibility for multi-modal tagging applications.
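
下面是“预定义标签 + 开放标签联合对齐”打分机制的一个简化示意(PyTorch;结构与维度均为演示假设,并非 OTTER 的官方实现):

```python
import torch
import torch.nn as nn

class DualLabelTagger(nn.Module):
    def __init__(self, dim, num_fixed, num_heads=8):
        super().__init__()
        # 封闭集标签用可学习嵌入;dim 需能被 num_heads 整除
        self.fixed_emb = nn.Embedding(num_fixed, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fused_feat, open_emb):
        # fused_feat: (B, L, D) 图文 token;open_emb: (K, D) 文本编码器给出的用户开放标签
        labels = torch.cat([self.fixed_emb.weight, open_emb], dim=0)   # (N+K, D)
        q = labels.unsqueeze(0).expand(fused_feat.size(0), -1, -1)
        ctx, _ = self.attn(q, fused_feat, fused_feat)                  # 标签作为 query
        return (ctx * q).sum(-1)                         # 每个标签的匹配分数 (B, N+K)
```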
zh

[CV-51] FIN: Fast Inference Network for Map Segmentation

【速读】:该论文旨在解决自动驾驶中地图分割(map segmentation)任务面临的高精度与实时性难以兼顾的问题。当前方法在复杂环境下的感知准确性不足,且难以满足车辆决策所需的低延迟要求。解决方案的关键在于提出一种基于相机与雷达融合的高效BEV(Bird’s Eye View)空间地图分割架构,通过引入先进的损失函数组合及一种新型轻量级头部结构(lightweight head),在保持高精度(达到53.5 mIoU)的同时显著提升推理速度,相较最强基线模型实现260%的性能提升,从而实现了精度与效率的协同优化。

链接: https://arxiv.org/abs/2510.00651
作者: Ruan Bispo,Tim Brophy,Reenu Mohandas,Anthony Scanlan,Ciarán Eising
机构: University of Limerick (利默里克大学); Lero, The Irish Software Research Centre (爱尔兰软件研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-sensor fusion in autonomous vehicles is becoming more common to offer a more robust alternative for several perception tasks. This need arises from the unique contribution of each sensor in collecting data: camera-radar fusion offers a cost-effective solution by combining rich semantic information from cameras with accurate distance measurements from radar, without incurring excessive financial costs or overwhelming data processing requirements. Map segmentation is a critical task for enabling effective vehicle behaviour in its environment, yet it continues to face significant challenges in achieving high accuracy and meeting real-time performance requirements. Therefore, this work presents a novel and efficient map segmentation architecture, using cameras and radars, in the bird’s-eye-view (BEV) space. Our model introduces a real-time map segmentation architecture considering aspects such as high accuracy, per-class balancing, and inference time. To accomplish this, we use an advanced loss set together with a new lightweight head to improve the perception results. Our results show that, with these modifications, our approach achieves results comparable to large models, reaching 53.5 mIoU, while also setting a new benchmark for inference time, improving it by 260% over the strongest baseline models.
zh

[CV-52] Erased But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

【速读】:该论文旨在解决当前概念擦除(concept erasure)技术在新一代基于修正流(rectified flow)的文本到图像(text-to-image, T2I)生成模型(如Flux)中有效性不足的问题。现有方法主要针对Stable Diffusion设计,在Flux等新架构上表现有限,其根本原因在于这些方法依赖于注意力定位(attention localization)现象,而该现象在修正流框架中被破坏。论文提出ReFlux攻击方法,其核心创新在于引入一种反向注意力优化策略(reverse-attention optimization),通过稳定注意力机制并重新激活被擦除的概念信号,同时结合速度引导的动态调整(velocity-guided dynamic)增强概念重激活的鲁棒性,并利用一致性保持目标(consistency-preserving objective)维持图像全局结构和无关内容不变,从而实现对最新T2I模型概念擦除安全性的可靠评估。

链接: https://arxiv.org/abs/2510.00635
作者: Nanxiang Jiang,Zhaoxin Fan,Enhan Kang,Daiheng Gao,Yun Zhou,Yanxia Chang,Zheng Zhu,Yeying Jin,Wenjun Wu
机构: Beihang University (北京航空航天大学); USTC (中国科学技术大学); NUDT (国防科技大学); Giga AI; NUS (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, most existing approaches and corresponding attack evaluations are tailored to Stable Diffusion (SD) and exhibit limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. In this work, we present ReFlux, the first concept attack method specifically designed to assess the robustness of concept erasure in the latest rectified flow-based T2I framework. Our approach is motivated by the observation that existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization. Building on this insight, we propose a simple yet effective attack strategy that specifically targets this property. At its core, a reverse-attention optimization strategy is introduced to effectively reactivate suppressed signals while stabilizing attention. This is further reinforced by a velocity-guided dynamic that enhances the robustness of concept reactivation by steering the flow matching process, and a consistency-preserving objective that maintains the global layout and preserves unrelated content. Extensive experiments consistently demonstrate the effectiveness and efficiency of the proposed attack method, establishing a reliable benchmark for evaluating the robustness of concept erasure strategies in rectified flow transformers.
zh

[CV-53] LAKAN: Landmark-assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection

【速读】:该论文旨在解决深度伪造(deepfake)图像中伪造痕迹高度复杂且非线性的问题,现有基于卷积神经网络(CNN)和Transformer的方法在建模此类伪造特征时仍存在局限。其解决方案的关键在于提出一种基于Kolmogorov-Arnold Network(KAN)的新型检测方法,通过将固定激活函数替换为可学习样条函数(learnable splines),显著提升了对伪造 artifacts 的建模能力;进一步引入Landmark-assisted Adaptive Kolmogorov-Arnold Network(LAKAN)模块,利用面部关键点作为结构先验,动态生成KAN内部参数,从而引导通用图像编码器聚焦于含伪造痕迹的面部关键区域,实现几何先验与网络学习过程的深度融合。

链接: https://arxiv.org/abs/2510.00634
作者: Jiayao Jiang,Siran Peng,Bin Liu,Qi Chu,Nenghai Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The rapid development of deepfake generation techniques necessitates robust face forgery detection algorithms. While methods based on Convolutional Neural Networks (CNNs) and Transformers are effective, there is still room for improvement in modeling the highly complex and non-linear nature of forgery artifacts. To address this issue, we propose a novel detection method based on the Kolmogorov-Arnold Network (KAN). By replacing fixed activation functions with learnable splines, our KAN-based approach is better suited to this challenge. Furthermore, to guide the network’s focus towards critical facial areas, we introduce a Landmark-assisted Adaptive Kolmogorov-Arnold Network (LAKAN) module. This module uses facial landmarks as a structural prior to dynamically generate the internal parameters of the KAN, creating an instance-specific signal that steers a general-purpose image encoder towards the most informative facial regions with artifacts. This core innovation creates a powerful combination between geometric priors and the network’s learning process. Extensive experiments on multiple public datasets show that our proposed method achieves superior performance.
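
LAKAN“由关键点动态生成 KAN 参数”的思路可以用如下极简 PyTorch 示意表达(以 RBF 基近似可学习样条,68 个关键点展平为 136 维输入,均为演示假设,并非论文的官方实现):

```python
import torch
import torch.nn as nn

class LandmarkKAN(nn.Module):
    # KAN 风格层:每个输入维的激活函数是一组 RBF 基的线性组合,
    # 组合系数不再是固定参数,而由人脸关键点经超网络动态生成(实例自适应)
    def __init__(self, dim, n_basis=8, lm_dim=136):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-2.0, 2.0, n_basis))
        self.hyper = nn.Linear(lm_dim, dim * n_basis)     # 关键点 -> 样条系数
        self.dim, self.n_basis = dim, n_basis

    def forward(self, x, landmarks):
        # x: (B, D) 图像特征;landmarks: (B, 136) 展平的关键点坐标
        coef = self.hyper(landmarks).view(-1, self.dim, self.n_basis)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)     # (B, D, n_basis)
        return (coef * basis).sum(-1)                                  # (B, D)
```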
zh

[CV-54] Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset

【速读】:该论文旨在解决当前时尚图像生成任务局限于虚拟试穿等狭隘场景的问题,这些问题通常仅在干净的摄影棚环境中呈现服装,缺乏编辑时尚(editorial fashion)中所体现的动态姿势、多样场景和精心构建的视觉叙事。为应对这一挑战,作者提出了“虚拟时尚拍摄”(virtual fashion photo-shoot)的新任务,目标是将标准化的服装图像转化为具有情境意义的编辑风格影像。解决方案的关键在于构建了首个大规模的服装-时尚画册配对数据集(garment-lookbook pairs),并设计了一个自动化检索流水线,通过结合视觉-语言推理与对象级定位技术,在不同领域间对齐服装,从而实现从电商商品图到富有创意与叙事性的时尚影像的转换。该数据集包含三个质量等级共约36万对样本,为推动生成式AI从目录式生成迈向更具艺术性和氛围感的时尚图像创作提供了基础支持。

链接: https://arxiv.org/abs/2510.00633
作者: Yannick Hauri,Luca A. Lanzendörfer,Till Aczel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fashion image generation has so far focused on narrow tasks such as virtual try-on, where garments appear in clean studio environments. In contrast, editorial fashion presents garments through dynamic poses, diverse locations, and carefully crafted visual narratives. We introduce the task of virtual fashion photo-shoot, which seeks to capture this richness by transforming standardized garment images into contextually grounded editorial imagery. To enable this new direction, we construct the first large-scale dataset of garment-lookbook pairs, bridging the gap between e-commerce and fashion media. Because such pairs are not readily available, we design an automated retrieval pipeline that aligns garments across domains, combining visual-language reasoning with object-level localization. We construct a dataset with three garment-lookbook pair accuracy levels: high quality (10,000 pairs), medium quality (50,000 pairs), and low quality (300,000 pairs). This dataset offers a foundation for models that move beyond catalog-style generation and toward fashion imagery that reflects creativity, atmosphere, and storytelling.
zh

[CV-55] UCD: Unconditional Discriminator Promotes Nash Equilibrium in GANs

【速读】:该论文旨在解决生成对抗网络(GAN)训练中难以收敛及模式崩溃(mode collapse)的问题,其核心在于量化分析GAN训练过程中纳什均衡(Nash equilibrium)的实现程度,并指出判别器(Discriminator, D)通过输入条件信息导致冗余捷径,从而阻碍了有意义特征的提取。解决方案的关键是引入无条件判别器(Unconditional Discriminator, UCD),强制D在不注入条件信息的情况下提取更全面和鲁棒的特征,从而提升对生成器(Generator, G)的监督能力,促进纳什均衡的实现。理论证明UCD可无缝集成到标准GAN框架中,实验表明该方法在ImageNet-64上达到1.47 FID,显著优于StyleGAN-XL和多个前沿单步扩散模型。

链接: https://arxiv.org/abs/2510.00624
作者: Mengfei Xia,Nan Xue,Jiapeng Zhu,Yujun Shen
机构: Ant Group
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial training turns out to be the key to one-step generation, especially for Generative Adversarial Network (GAN) and diffusion model distillation. Yet in practice, GAN training hardly converges properly and struggles in mode collapse. In this work, we quantitatively analyze the extent of Nash equilibrium in GAN training, and conclude that redundant shortcuts by inputting condition in D disables meaningful knowledge extraction. We thereby propose to employ an unconditional discriminator (UCD), in which D is enforced to extract more comprehensive and robust features with no condition injection. In this way, D is able to leverage better knowledge to supervise G, which promotes Nash equilibrium in GAN literature. Theoretical guarantee on compatibility with vanilla GAN theory indicates that UCD can be implemented in a plug-in manner. Extensive experiments confirm the significant performance improvements with high efficiency. For instance, we achieved 1.47 FID on the ImageNet-64 dataset, surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models. The code will be made publicly available.
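
UCD 的“即插即用”性质可以用如下非饱和 GAN 损失示意(PyTorch;要点仅在于 D 不接收条件 c,其余与常规条件 GAN 训练一致,细节为演示假设):

```python
import torch.nn.functional as F

def d_loss_ucd(D, real, fake):
    # 判别器看不到条件,被迫提取更全面、更鲁棒的图像特征
    return (F.softplus(-D(real)) + F.softplus(D(fake.detach()))).mean()

def g_loss_ucd(D, G, z, c):
    fake = G(z, c)                 # 生成器仍然是条件式的
    return F.softplus(-D(fake)).mean()
```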
zh

[CV-56] Robust Context-Aware Object Recognition

【速读】:该论文旨在解决视觉识别中模型对背景(Background, BG)的过度依赖问题,即由虚假相关性引发的“捷径学习”(shortcut learning),这会显著降低模型在真实场景中的鲁棒性。现有方法通常通过抑制背景信息来提升泛化能力,但牺牲了上下文感知能力。其解决方案的关键在于提出RCOR(Robust Context-Aware Object Recognition)框架,该方法将目标定位视为识别过程的一部分,从而实现对象中心建模与上下文感知建模的解耦,并采用一种稳健的非参数融合策略,在不损害任一能力的前提下同时提升模型的鲁棒性和上下文敏感性。实验表明,RCOR在无需微调的情况下,能有效提升监督模型和视觉语言模型(VLM)在包含域内与域外背景的数据集上的性能。

链接: https://arxiv.org/abs/2510.00618
作者: Klara Janouskova,Cristian Gavrus,Jiri Matas
机构: Czech Technical University in Prague (布拉格捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In visual recognition, both the object of interest (referred to as foreground, FG, for simplicity) and its surrounding context (background, BG) play an important role. However, standard supervised learning often leads to unintended over-reliance on the BG, known as shortcut learning of spurious correlations, limiting model robustness in real-world deployment settings. In the literature, the problem is mainly addressed by suppressing the BG, sacrificing context information for improved generalization. We propose RCOR – Robust Context-Aware Object Recognition – the first approach that jointly achieves robustness and context-awareness without compromising either. RCOR treats localization as an integral part of recognition to decouple object-centric and context-aware modelling, followed by a robust, non-parametric fusion. It improves the performance of both supervised models and VLMs on datasets with both in-domain and out-of-domain BG, even without fine-tuning. The results confirm that localization before recognition is now possible even in complex scenes such as those in ImageNet-1k.
zh

[CV-57] Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation

【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)中因忽略前景与背景信息差异而导致的泛化能力不足问题。现有方法虽通过增强多维视觉表征推动了VLN进展,但未充分挖掘前景区域提供的语义线索与背景区域蕴含的空间连通性信息之间的协同作用。解决方案的关键在于提出一种共识驱动的在线特征增强策略(Consensus-driven Online Feature Augmentation, COFA),其核心机制为:首先利用语义增强的地标识别技术分离前景与背景作为候选增强特征;随后设计一种基于两阶段投票机制的在线增强策略,使代理根据多样化的指令和导航位置动态整合对特征偏好的共识,从而提升导航任务的可迁移性和鲁棒性。实验在REVERIE和R2R数据集上验证了该方法在基线模型基础上显著改善泛化性能并达到当前最优效果。

链接: https://arxiv.org/abs/2510.00604
作者: Yunbo Xu,Xuesong Zhang,Jia Li,Zhenzhen Hu,Richang Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Following language instructions, vision-language navigation (VLN) agents are tasked with navigating unseen environments. While augmenting multifaceted visual representations has propelled advancements in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background encompasses spatial connectivity information. Inspired by this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) with alternative foreground and background features to facilitate navigational generalization. Specifically, we first leverage semantically-enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of the baseline and attains state-of-the-art performance.
zh

[CV-58] LVLMs as inspectors: an agent ic framework for category-level structural defect annotation

【速读】:该论文旨在解决基础设施安全评估中结构缺陷标注成本高、效率低的问题,传统依赖人工标注的方式难以满足大规模数据处理需求。解决方案的关键在于提出一种基于智能体的自动标注框架(Agent-based Defect Pattern Tagger, ADPT),其核心创新包括:集成大型视觉语言模型(Large Vision-Language Models, LVLMs)与语义模式匹配模块,并引入迭代自问式精炼机制,通过优化领域特定提示(domain-specific prompting)和递归验证流程,实现从原始视觉数据到高质量语义标签数据的自动化转换,无需任何人工监督即可生成适用于迁移学习和域适应等下游任务的高保真数据集。

链接: https://arxiv.org/abs/2510.00603
作者: Sheng Jiang,Yuanmin Ning,Bingxi Huang,Peiyin Chen,Zhaohui Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated structural defect annotation is essential for ensuring infrastructure safety while minimizing the high costs and inefficiencies of manual labeling. A novel agentic annotation framework, Agent-based Defect Pattern Tagger (ADPT), is introduced that integrates Large Vision-Language Models (LVLMs) with a semantic pattern matching module and an iterative self-questioning refinement mechanism. By leveraging optimized domain-specific prompting and a recursive verification process, ADPT transforms raw visual data into high-quality, semantically labeled defect datasets without any manual supervision. Experimental results demonstrate that ADPT achieves up to 98% accuracy in distinguishing defective from non-defective images, and 85%-98% annotation accuracy across four defect categories under class-balanced settings, with 80%-92% accuracy on class-imbalanced datasets. The framework offers a scalable and cost-effective solution for high-fidelity dataset construction, providing strong support for downstream tasks such as transfer learning and domain adaptation in structural damage assessment.
zh

[CV-59] Hybrid Training for Vision-Language-Action Models

【速读】:该论文旨在解决生成式 AI(Generative AI)在机器人任务中因引入长链式思维(Chain-of-thought, CoT)导致推理延迟增加的问题,从而影响实际应用中的可用性。其核心挑战在于:尽管CoT能提升视觉-语言-动作模型(Vision-Language-Action models, VLAs)的性能,但伴随而来的长输出序列显著延长了推理时间,尤其在需要长时间动作序列的机器人操作场景中不可接受。解决方案的关键是提出一种混合训练(Hybrid Training, HyT)框架,使VLAs能够在训练阶段学习利用CoT带来的性能优势,同时在推理阶段具备选择性地跳过CoT生成的能力——即模型可灵活决定是否直接预测动作、生成思考过程或遵循指令,从而在性能与效率之间实现平衡。

链接: https://arxiv.org/abs/2510.00600
作者: Pietro Mazzaglia,Cansu Sancaktar,Markus Peschl,Daniel Dijkman
机构: Qualcomm AI Research (高通人工智能研究); University of Tübingen (图宾根大学); Max Planck Institute (马克斯·普朗克研究所)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model’s generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent’s actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.
zh

[CV-60] Multi-level Dynamic Style Transfer for NeRFs

【速读】:该论文旨在解决现有基于神经辐射场(NeRF)的风格迁移方法在内容保真度与艺术风格迁移效果之间难以平衡的问题。现有方法通常将风格统计信息直接嵌入原始NeRF流程,导致在多尺度空间结构保留和风格特征传递方面表现不佳。其解决方案的关键在于提出一种多层级动态风格迁移方法(MDS-NeRF),核心创新包括:1)设计一个多层级特征适配器(multi-level feature adaptor),从内容辐射场中生成多尺度特征网格表示,以有效捕捉场景的多尺度空间结构;2)引入一个动态风格注入模块(dynamic style injection module),学习提取相关风格特征并自适应地融合到内容特征中;3)通过多层级级联解码器(multi-level cascade decoder)将风格化特征转化为最终的风格化视图。此外,该方法还扩展支持使用3D风格参考实现全景风格迁移,实验表明其在保持多尺度结构的同时显著提升了风格迁移质量。

链接: https://arxiv.org/abs/2510.00592
作者: Zesheng Li,Shuaibo Li,Wei Ma,Jianwei Guo,Hongbin Zha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Computational Visual Media Journal (CVMJ)

点击查看摘要

Abstract:As the application of neural radiance fields (NeRFs) in various 3D vision tasks continues to expand, numerous NeRF-based style transfer techniques have been developed. However, existing methods typically integrate style statistics into the original NeRF pipeline, often leading to suboptimal results in both content preservation and artistic stylization. In this paper, we present multi-level dynamic style transfer for NeRFs (MDS-NeRF), a novel approach that reengineers the NeRF pipeline specifically for stylization and incorporates an innovative dynamic style injection module. Particularly, we propose a multi-level feature adaptor that helps generate a multi-level feature grid representation from the content radiance field, effectively capturing the multi-scale spatial structure of the scene. In addition, we present a dynamic style injection module that learns to extract relevant style features and adaptively integrates them into the content patterns. The stylized multi-level features are then transformed into the final stylized view through our proposed multi-level cascade decoder. Furthermore, we extend our 3D style transfer method to support omni-view style transfer using 3D style references. Extensive experiments demonstrate that MDS-NeRF achieves outstanding performance for 3D style transfer, preserving multi-scale spatial structures while effectively transferring stylistic characteristics.
zh

[CV-61] Color Models in Image Processing: A Review and Experimental Comparison

【速读】:该论文旨在解决颜色模型在计算机视觉与人机交互中适配性不足的问题,尤其是现有模型在设备依赖性、色度一致性及计算复杂度等方面的局限。其解决方案的关键在于系统性地分析传统颜色模型(如RGB、CMYK、YUV)、感知均匀空间(如CIELAB、CIELUV)以及基于模糊理论的方法,并通过实验从多个维度评估这些模型的性能;实验结果表明,HS*家族的颜色模型最符合人类视觉感知特性,从而为颜色建模提供了更可靠的技术依据。

链接: https://arxiv.org/abs/2510.00584
作者: Muragul Muratbekova,Nuray Toganas,Ayan Igali,Maksat Shagyrov,Elnara Kadyrgali,Adilet Yerkin,Pakizar Shamoi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript has been submitted to Scientific Reports for consideration

点击查看摘要

Abstract:Color representation is essential in computer vision and human-computer interaction. There are multiple color models available. The choice of a suitable color model is critical for various applications. This paper presents a review of color models and spaces, analyzing their theoretical foundations, computational properties, and practical applications. We explore traditional models such as RGB, CMYK, and YUV, perceptually uniform spaces like CIELAB and CIELUV, and fuzzy-based approaches as well. Additionally, we conduct a series of experiments to evaluate color models from various perspectives, like device dependency, chromatic consistency, and computational complexity. Our experimental results reveal gaps in existing color models and show that the HS* family is the most aligned with human perception. The review also identifies key strengths and limitations of different models and outlines open challenges and future directions. This study provides a reference for researchers in image processing, perceptual computing, digital media, and any other color-related field.
zh

[CV-62] Arbitrary Generative Video Interpolation

【速读】:该论文旨在解决现有生成式视频帧插值(Video Frame Interpolation, VFI)方法在灵活性上的局限性,即无法根据需求调整生成帧率或总序列时长的问题。传统方法通常只能合成固定数量的中间帧,难以适应不同应用场景对插值精度和时间粒度的要求。其解决方案的关键在于提出一种名为ArbInterp的新框架,包含两个核心技术:一是Timestamp-aware Rotary Position Embedding(TaRoPE),通过调制时间旋转位置编码(temporal RoPE)中的位置信息,使生成帧能够精确对齐任意目标归一化时间戳,从而实现任意时刻插值;二是基于段落级帧合成的任意长度插值策略,结合外观-运动解耦条件机制,利用前一段终点作为参考以保持外观一致性,并通过时间语义建模确保运动连贯性,从而实现跨段无缝的时空过渡。

链接: https://arxiv.org/abs/2510.00578
作者: Guozhen Zhang,Haiguang Wang,Chunyu Wang,Yuan Zhou,Qinglin Lu,Limin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesize a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2x to 32x) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity. Project website: this https URL.
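
TaRoPE 的核心是用归一化时间戳直接调制旋转角,下面是一个一维化的示意(实际方法作用于视频 DiT 的时间维 RoPE;此处的特征形状与频率设置均为演示假设):

```python
import torch

def tarope(x, t, base=10000.0):
    # x: (B, D) 帧级特征(D 为偶数);t: (B,) 目标帧的归一化时间戳,取值 [0, 1]
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d, dtype=x.dtype, device=x.device) / d)  # 各维频率
    ang = t.unsqueeze(-1) * freqs            # 旋转角由时间戳连续调制,支持任意时刻
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```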
zh

[CV-63] Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning

【速读】:该论文旨在解决现有混合专家(Mixture-of-Experts, MoE)多任务学习(Multi-Task Learning, MTL)方法在从单任务学习(Single-Task Learning, STL)向MTL迁移过程中存在的冗余适应和知识共享效率低的问题。其解决方案的关键在于提出基于低秩适配(Low-Rank Adaptation, LoRA)的自适应共享专家(Adaptive Shared Experts, ASE)机制:通过将共享专家的路由计算门控权重与稀疏专家联合归一化,实现更平滑的STL到MTL过渡,并增强专家间的专业化与协作;同时引入细粒度专家设计——在保持参数预算相近的前提下,增加LoRA专家数量并按比例降低其秩,从而提升知识共享的有效性。

链接: https://arxiv.org/abs/2510.00570
作者: Minghao Yang,Ren Togo,Guang Li,Takahiro Ogawa,Miki Haseyama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL). However, existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing during the transition from single-task to multi-task learning (STL to MTL). To address these limitations, we propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts. This design facilitates the STL-to-MTL transition and enhances expert specialization and cooperation. Furthermore, we incorporate fine-grained experts by increasing the number of LoRA experts while proportionally reducing their rank, enabling more effective knowledge sharing under a comparable parameter budget. Extensive experiments on the PASCAL-Context benchmark, under unified training settings, demonstrate that ASE consistently improves performance across diverse configurations and validates the effectiveness of fine-grained designs for MTL.
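
“共享专家与稀疏专家门控联合归一化”的 LoRA-MoE 可以粗略示意如下(PyTorch;专家数、秩与路由细节均为演示假设,非官方实现):

```python
import torch
import torch.nn as nn

class ASEMoE(nn.Module):
    def __init__(self, dim, n_shared=1, n_sparse=8, rank=4, topk=2):
        super().__init__()
        n = n_shared + n_sparse
        self.A = nn.Parameter(torch.randn(n, dim, rank) * 0.01)   # LoRA 下投影
        self.B = nn.Parameter(torch.zeros(n, rank, dim))          # LoRA 上投影
        self.router = nn.Linear(dim, n)
        self.n_shared, self.topk = n_shared, topk

    def forward(self, x):                        # x: (B, D)
        logits = self.router(x)                  # (B, n)
        # 稀疏专家仅保留 top-k,共享专家始终保留;两者的门控一起做 softmax 归一化
        sp = logits[:, self.n_shared:]
        idx = sp.topk(self.topk, dim=-1).indices
        sp = torch.full_like(sp, float("-inf")).scatter(-1, idx, sp.gather(-1, idx))
        gate = torch.softmax(torch.cat([logits[:, :self.n_shared], sp], dim=-1), dim=-1)
        expert_out = torch.einsum("bd,ndr,nre->bne", x, self.A, self.B)  # (B, n, D)
        return x + (gate.unsqueeze(-1) * expert_out).sum(dim=1)
```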
zh

[CV-64] Assessing Foundation Models for Mold Colony Detection with Limited Training Data

【速读】:该论文旨在解决微生物学中霉菌菌落定量分析的自动化问题,传统方法依赖于大量人工标注数据来训练模型(如YoloV9),而这种数据标注过程耗时且成本高昂。解决方案的关键在于利用数据高效的视觉基础模型(vision foundation models),通过在仅需少量标注样本(如150张甚至25张图像)的情况下微调,即可实现与传统深度学习模型相当甚至更优的性能。实验表明,MaskDINO在仅用150张带边界框和实例级掩码的图像进行微调后,其性能接近于经过大规模训练的YoloV9模型,在极端低样本场景下(25张图像)仍能在约70%的样本上保持可靠表现,从而显著降低数据依赖并加速自动化微生物检测系统的开发迭代。

链接: https://arxiv.org/abs/2510.00561
作者: Henrik Pichler,Janis Keuper,Matthew Copping
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 2 figures, accepted as oral presentation at GCPR 2025

点击查看摘要

Abstract:The process of quantifying mold colonies on Petri dish samples is of critical importance for the assessment of indoor air quality, as high colony counts can indicate potential health risks and deficiencies in ventilation systems. Conventionally, the automation of such a labor-intensive process, as well as other tasks in microbiology, relies on the manual annotation of large datasets and the subsequent extensive training of models like YoloV9. To demonstrate that exhaustive annotation is not a prerequisite anymore when tackling a new vision task, we compile a representative dataset of 5000 Petri dish images annotated with bounding boxes, simulating both a traditional data collection approach as well as few-shot and low-shot scenarios with well curated subsets with instance level masks. We benchmark three vision foundation models against traditional baselines on task specific metrics, reflecting realistic real-world requirements. Notably, MaskDINO attains near-parity with an extensively trained YoloV9 model while finetuned only on 150 images, retaining competitive performance with as few as 25 images, still being reliable on approximately 70% of the samples. Our results show that data-efficient foundation models can match traditional approaches with only a fraction of the required data, enabling earlier development and faster iterative improvement of automated microbiological systems with a higher upper-bound performance than traditional models would achieve.
zh

[CV-65] Forestpest-YOLO: A High-Performance Detection Framework for Small Forestry Pests

【速读】:该论文旨在解决复杂林区环境中利用遥感图像检测农业害虫的难题,该任务因目标尺寸微小、遮挡严重且与背景视觉相似性高,导致传统目标检测模型在细粒度特征丢失和极端数据不平衡条件下性能显著下降。解决方案的关键在于提出Forestpest-YOLO框架,其核心创新包括:1)引入无损下采样模块SPD-Conv,确保小目标高分辨率细节在网络中得以保留;2)设计跨阶段特征融合块CSPOK,动态增强多尺度特征表示并抑制背景噪声;3)采用VarifocalLoss优化训练目标,引导模型聚焦于高质量及难分类样本。这些改进共同提升了对微小、遮挡害虫的检测精度,在自建的ForestPest数据集上达到当前最优性能。

链接: https://arxiv.org/abs/2510.00547
作者: Aoduo Li,Peikai Lin,Jiancheng Li,Zhen Zhang,Shiting Wu,Zexiao Liang,Zhifa Jiang
机构: Guangdong University of Technology (广东工业大学); Guangdong Power Grid Co., Ltd. (广东省电网公司); Huizhou University (惠州大学); Huizhou First Maternal and Child Health Care Hospital (惠州市第一妇幼保健院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting agricultural pests in complex forestry environments using remote sensing imagery is fundamental for ecological preservation, yet it is severely hampered by practical challenges. Targets are often minuscule, heavily occluded, and visually similar to the cluttered background, causing conventional object detection models to falter due to the loss of fine-grained features and an inability to handle extreme data imbalance. To overcome these obstacles, this paper introduces Forestpest-YOLO, a detection framework meticulously optimized for the nuances of forestry remote sensing. Building upon the YOLOv8 architecture, our framework introduces a synergistic trio of innovations. We first integrate a lossless downsampling module, SPD-Conv, to ensure that critical high-resolution details of small targets are preserved throughout the network. This is complemented by a novel cross-stage feature fusion block, CSPOK, which dynamically enhances multi-scale feature representation while suppressing background noise. Finally, we employ VarifocalLoss to refine the training objective, compelling the model to focus on high-quality and hard-to-classify samples. Extensive experiments on our challenging, self-constructed ForestPest dataset demonstrate that Forestpest-YOLO achieves state-of-the-art performance, showing marked improvements in detecting small, occluded pests and significantly outperforming established baseline models.
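
其中 SPD-Conv 的“无损下采样”做法是先做 space-to-depth 再接 stride=1 的卷积,可示意如下(PyTorch;通道配置为演示假设):

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    # 把每个 2x2 邻域折叠进通道维(信息零丢弃),再用非步进卷积融合,
    # 以替代会丢失小目标细节的 stride-2 卷积或池化
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=3, stride=1, padding=1)

    def forward(self, x):                        # x: (B, C, H, W),H、W 为偶数
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)                      # 输出 (B, c_out, H/2, W/2)
```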
zh

[CV-66] Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation

【速读】:该论文旨在解决3D手部姿态重建中因自遮挡和复杂关节运动导致的姿势歧义问题,以及现有级联方法无法捕捉姿态不确定性、单阶段概率模型难以获得高精度重建的局限性。其解决方案的关键在于提出一种基于扩散机制的粗到精级联框架:第一阶段为联合扩散模型,用于采样多样化的3D关节假设;第二阶段为Mesh Latent Diffusion Model (Mesh LDM),在给定关节样本条件下重建3D手部网格。通过在学习的潜在空间中使用多样化关节假设训练Mesh LDM,该框架能够学习分布感知的关节-网格关系与鲁棒的手部先验,并借助级联设计缓解从2D图像直接映射至密集3D姿态的难度,从而实现更准确的逐级优化。

链接: https://arxiv.org/abs/2510.00527
作者: Taeyun Woo,Jinah Park,Tae-Kyun Kim
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures

点击查看摘要

Abstract:Deterministic models for 3D hand pose reconstruction, whether single-staged or cascaded, struggle with pose ambiguities caused by self-occlusions and complex hand articulations. Existing cascaded approaches refine predictions in a coarse-to-fine manner but remain deterministic and cannot capture pose uncertainties. Recent probabilistic methods model pose distributions yet are restricted to single-stage estimation, which often fails to produce accurate 3D reconstructions without refinement. To address these limitations, we propose a coarse-to-fine cascaded diffusion framework that combines probabilistic modeling with cascaded refinement. The first stage is a joint diffusion model that samples diverse 3D joint hypotheses, and the second stage is a Mesh Latent Diffusion Model (Mesh LDM) that reconstructs a 3D hand mesh conditioned on a joint sample. By training Mesh LDM with diverse joint hypotheses in a learned latent space, our framework learns distribution-aware joint-mesh relationships and robust hand priors. Furthermore, the cascaded design mitigates the difficulty of directly mapping 2D images to dense 3D poses, enhancing accuracy through sequential refinement. Experiments on FreiHAND and HO3Dv2 demonstrate that our method achieves state-of-the-art performance while effectively modeling pose distributions.
zh

[CV-67] VIRTUE: Visual-Interactive Text-Image Universal Embedder

【速读】:该论文旨在解决现有嵌入模型缺乏视觉交互能力的问题,即无法根据用户指定的区域(如点、边界框或掩码)进行局部化意图定位,从而限制了其在复杂场景下的精准应用。传统嵌入模型仅能处理全局图像和文本的匹配任务,而未能利用图像中实体级别的信息来增强表示能力。解决方案的关键在于提出一种新型的视觉交互式文本-图像通用嵌入模型(Visual-InteRactive Text-Image Universal Embedder, VIRTUE),该模型融合了分割模型与视觉语言模型(Vision-Language Model, VLM)的能力,使嵌入器能够接收视觉提示并精确定位图像中的特定区域,从而实现对局部语义内容的精准建模。这一设计不仅拓展了嵌入模型的应用边界,还显著提升了在多模态嵌入基准(MMEB)和视觉交互检索任务(SCaR)上的性能表现。

链接: https://arxiv.org/abs/2510.00523
作者: Wei-Yao Wang,Kazuya Tateishi,Qiyu Wu,Shusuke Takahashi,Yuki Mitsufuji
机构: Sony Group Corporation (索尼集团); Sony AI (索尼人工智能)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages

点击查看摘要

Abstract:Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves a state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.
zh

[CV-68] CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

【速读】:该论文旨在解决当前超声心动图领域中基础模型(Foundation Models, FMs)缺乏统一评估标准的问题,尤其针对现有方法多基于私有数据集导致的可比性不足、以及超声心动图特有的挑战(如噪声采集、高帧冗余和公开数据集有限)所引发的模型性能评估困难。其解决方案的关键在于提出了 CardioBench——一个涵盖八个多源公共数据集的标准化基准套件,覆盖四类回归任务与五类分类任务,包含功能、结构、诊断及视图识别等临床相关终点,并采用一致的零样本(zero-shot)、探针(probing)和对齐(alignment)协议系统评估多种类型的基础模型(包括心脏特异性、生物医学和通用编码器)。该工作不仅提供了可复现的预处理流程、数据划分和公开评估管道,还揭示了不同模型家族在特定任务中的互补优势,为未来超声心动图基础模型的设计与优化提供了明确方向。

链接: https://arxiv.org/abs/2510.00520
作者: Darya Taratynova,Ahmed Aly,Numan Saeed,Mohammad Yaqub
机构: Mohamed bin Zayed University of Artificial Intelligence ( Mohamed bin Zayed 大学人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models (FMs) are reshaping medical imaging, yet their application in echocardiography remains limited. While several echocardiography-specific FMs have recently been introduced, no standardized benchmark exists to evaluate them. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited public datasets. Most existing solutions evaluate on private data, restricting comparability. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography FMs. CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. We evaluate several leading FMs, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our results highlight complementary strengths across model families: temporal modeling is critical for functional regression, retrieval provides robustness under distribution shift, and domain-specific text encoders capture physiologically meaningful axes. General-purpose encoders transfer strongly and often close the gap with probing, but struggle with fine-grained distinctions like view classification and subtle pathology recognition. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point and offers actionable insights to guide the design of future echocardiography foundation models.
zh

[CV-69] Efficient Multi-modal Large Language Models via Progressive Consistency Distillation NEURIPS2025

【速读】:该论文旨在解决多模态大模型(Multi-modal Large Models, MLLMs)中视觉令牌(visual tokens)占用大量计算资源导致效率低下的问题。现有方法通过训练阶段压缩视觉令牌来提升效率,但常因压缩引入的特征空间扰动加剧了模型的学习难度,使参数空间难以快速适应。解决方案的关键在于提出一种渐进式一致性蒸馏(Efficient MLLMs via Progressive Consistency Distillation, EPIC)框架:通过将压缩引起的特征空间扰动分解为令牌维度(token-wise)和层维度(layer-wise)的扰动,分别引入令牌一致性蒸馏(token consistency distillation)和层一致性蒸馏(layer consistency distillation),利用教师模型提供指导并遵循渐进式学习路径,从而显著降低训练难度,提升模型的效率、鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2510.00515
作者: Zichen Wen,Shaobo Wang,Yufa Zhou,Junyuan Zhang,Qintong Zhang,Yifeng Gao,Zhaorun Chen,Bin Wang,Weijia Li,Conghui He,Linfeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model’s parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
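
令牌一致性蒸馏的一个最小示意如下(PyTorch;学生为压缩视觉 token 后的分支,教师为未压缩分支,温度 T 为演示假设):

```python
import torch.nn.functional as F

def token_consistency_loss(student_logits, teacher_logits, T=2.0):
    # 以 KL 散度约束压缩前后的输出分布一致,降低 token 压缩带来的学习难度
    p_t = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```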
zh

[CV-70] Affordance-Guided Diffusion Prior for 3D Hand Reconstruction

【速读】:该论文旨在解决在严重遮挡场景下(如手部自遮挡或与物体相互遮挡)如何准确重建三维手部姿态的问题。其核心挑战在于遮挡导致的几何歧义性,使得传统回归方法难以恢复合理的手部构型。解决方案的关键在于引入一种基于** affordance-aware(具身感知)文本描述**的生成先验(generative prior),通过一个扩散模型(diffusion-based generative model)学习条件分布:即给定从大规模视觉语言模型(Vision-Language Model, VLM)中提取的、描述手-物体交互功能性的文本信息,生成符合物理合理性与功能一致性的手部姿态。该方法利用上下文语义知识(如物体形状与其典型握持方式之间的关联)指导遮挡区域的细化,从而显著提升手部姿态估计的准确性与功能性一致性。

链接: https://arxiv.org/abs/2510.00506
作者: Naru Suzuki,Takehiko Ohkawa,Tatsuro Banno,Jihyun Lee,Ryosuke Furuta,Yoichi Sato
机构: The University of Tokyo (东京大学); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:How can we reconstruct 3D hand poses when large portions of the hand are heavily occluded by itself or by objects? Humans often resolve such ambiguities by leveraging contextual knowledge – such as affordances, where an object’s shape and function suggest how the object is typically grasped. Inspired by this observation, we propose a generative prior for hand pose refinement guided by affordance-aware textual descriptions of hand-object interactions (HOI). Our method employs a diffusion-based generative model that learns the distribution of plausible hand poses conditioned on affordance descriptions, which are inferred from a large vision-language model (VLM). This enables the refinement of occluded regions into more accurate and functionally coherent hand poses. Extensive experiments on HOGraspNet, a 3D hand-affordance dataset with severe occlusions, demonstrate that our affordance-guided refinement significantly improves hand pose estimation over both recent regression methods and diffusion-based refinement lacking contextual reasoning.
zh

[CV-71] Relative-Absolute Fusion: Rethinking Feature Extraction in Image-Based Iterative Method Selection for Solving Sparse Linear Systems

【速读】:该论文旨在解决稀疏线性系统求解中迭代方法选择的鲁棒性不足问题,尤其是图像化选择方法因特征提取技术可能导致不同矩阵被编码为相同的图像表示,从而引发错误选择和次优性能的问题。其解决方案的关键在于提出一种名为RAF(Relative-Absolute Fusion)的高效特征提取技术,通过同时提取并融合相对图像特征与对应的绝对数值特征,构建全面且无歧义的矩阵表示,从而提升选择准确率,并实现当前最优(SOTA)的图像化方法选择性能。

链接: https://arxiv.org/abs/2510.00500
作者: Kaiqi Zhang,Mingguan Yang,Dali Chang,Chun Chen,Yuxiang Zhang,Kexun He,Jing Zhao
机构: Dalian University of Technology (大连理工大学); Greater Bay Area National Center of Technology Innovation (粤港澳大湾区国家技术创新中心); CATARC Automotive Test Center (Tianjin) Co., Ltd (中汽研汽车检测中心(天津)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Iterative method selection is crucial for solving sparse linear systems because these methods inherently lack robustness. Though image-based selection approaches have shown promise, their feature extraction techniques might encode distinct matrices into identical image representations, leading to the same selection and suboptimal method. In this paper, we introduce RAF (Relative-Absolute Fusion), an efficient feature extraction technique to enhance image-based selection approaches. By simultaneously extracting and fusing image representations as relative features with corresponding numerical values as absolute features, RAF achieves comprehensive matrix representations that prevent feature ambiguity across distinct matrices, thus improving selection accuracy and unlocking the potential of image-based selection approaches. We conducted comprehensive evaluations of RAF on SuiteSparse and our developed BMCMat (Balanced Multi-Classification Matrix dataset), demonstrating solution time reductions of 0.08s-0.29s for sparse linear systems, which is 5.86%-11.50% faster than conventional image-based selection approaches and achieves state-of-the-art (SOTA) performance. BMCMat is available at this https URL.
zh

[CV-72] Normal-Abnormal Guided Generalist Anomaly Detection

【速读】:该论文旨在解决通用异常检测(Generalist Anomaly Detection, GAD)中仅依赖正常样本作为参考的局限性,即忽略了实际场景中常可获取的异常样本所蕴含的有价值信息。针对此问题,作者提出了一种更实用的方法——基于正常与异常样本协同引导的通用异常检测,其核心创新在于引入了“正常-异常通用学习”(Normal-Abnormal Generalist Learning, NAGL)框架,包含两个关键组件:残差挖掘(Residual Mining, RM)和异常特征学习(Anomaly Feature Learning, AFL)。RM从正常-异常参考残差中提取异常模式以建立可迁移的异常表征,AFL则通过残差映射自适应地学习查询图像中的异常特征,从而实现实例级异常识别。该方案首次在GAD中采用正常与异常样本混合作为参考,显著提升了跨域异常检测的准确性与效率。

链接: https://arxiv.org/abs/2510.00495
作者: Yuexin Wang,Xiaolei Wang,Yizheng Gong,Jimin Xiao
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Liverpool (利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work represents the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at this https URL.
zh

[CV-73] MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles

【速读】:该论文旨在解决视觉符号组合推理(Visual Symbolic Compositional Reasoning, VSCR)能力的评估难题,即如何在统一框架下衡量模型对视觉感知、符号操作与算术一致性三者协同处理的能力。其解决方案的关键在于提出MathSticks基准,该基准通过设计需要移动一根或两根火柴棒来修正错误等式的任务,严格约束操作规则,从而系统性地涵盖数字规模、移动复杂度、解的多样性及运算符变化等多个维度。该基准包含140万条生成实例和精心筛选的测试集,能够有效区分不同模型在文本引导和纯视觉场景下的表现差异,为推动视觉-语言模型在组合推理方面的进步提供了严谨的评测平台。

链接: https://arxiv.org/abs/2510.00483
作者: Yuheng Ji,Huajie Tan,Cheng Chi,Yijie Xu,Yuting Zhao,Enshen Zhou,Huaihai Lyu,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang,Xiaolong Zheng
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Peking University (北京大学); The University of Sydney (悉尼大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce MathSticks, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision–language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90% accuracy. These findings establish MathSticks as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at this https URL.
zh
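
火柴棒等式修正本质上是七段数码管上的组合搜索。下面是一个独立于论文数据集的简化求解器草图(本示例假设:只处理个位数等式、只移动数字上的火柴、不改动运算符):在火柴总数守恒的前提下枚举所有"恰好移动一根火柴"后成立的等式。

```python
from itertools import product

# 标准七段数码管编码:每个数字占用的段集合
SEG = {0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
       5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg"}
SEG = {k: frozenset(v) for k, v in SEG.items()}

def one_move_fixes(a, op, b, c):
    """枚举移动一根火柴后成立的等式(简化:不改变运算符)。"""
    old = (a, b, c)
    fixes = []
    for new in product(range(10), repeat=3):
        # 各位数字段集合的对称差之和 == 2 且总段数不变,等价于恰好移动了一根火柴
        diff = sum(len(SEG[o] ^ SEG[n]) for o, n in zip(old, new))
        same_total = sum(len(SEG[n]) for n in new) == sum(len(SEG[o]) for o in old)
        if diff == 2 and same_total and new != old:
            x, y, z = new
            if (x + y == z) if op == "+" else (x - y == z):
                fixes.append(f"{x}{op}{y}={z}")
    return fixes

print(one_move_fixes(6, "+", 4, 4))  # ['0+4=4'](本简化版不动运算符,故不含 8-4=4 这类解)
```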

[CV-74] Diagnosing Shortcut-Induced Rigidity in Continual Learning: The Einstellung Rigidity Index (ERI)

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中因模型过度依赖捷径特征(shortcut features)而导致的鲁棒性下降与新任务适应能力受限的问题。捷径特征是指输入与标签之间存在的非因果关联,这类特征在分布偏移下会显著降低模型可靠性,且在CL场景中会因权重继承而固化为“认知僵化”(Einstellung effect),阻碍对新知识的有效获取。解决方案的关键在于提出一个名为Einstellung Rigidity Index (ERI) 的诊断指标,通过三个可解释维度——适应延迟(Adaptation Delay, AD)、性能缺陷(Performance Deficit, PD)和相对次优特征依赖度(Relative Suboptimal Feature Reliance, SFR_rel),区分真实迁移与由提示误导带来的虚假性能提升。实验表明,在引入人为制造的伪特征(如图像中的品红色补丁)后,多数CL方法虽更快达到准确率阈值(负AD),但最终在含干扰特征类上表现更差(正PD),且掩蔽该补丁反而提升其准确率,说明这些CL方法将补丁视为干扰而非有效线索(负SFR_rel),从而揭示了捷径依赖导致的刚性问题。

链接: https://arxiv.org/abs/2510.00475
作者: Kai Gu,Weishi Shi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Deep neural networks frequently exploit shortcut features, defined as incidental correlations between inputs and labels without causal meaning. Shortcut features undermine robustness and reduce reliability under distribution shifts. In continual learning (CL), the consequences of shortcut exploitation can persist and intensify: weights inherited from earlier tasks bias representation reuse toward whatever features most easily satisfied prior labels, mirroring the cognitive Einstellung effect, a phenomenon where past habits block optimal solutions. Whereas catastrophic forgetting erodes past skills, shortcut-induced rigidity throttles the acquisition of new ones. We introduce the Einstellung Rigidity Index (ERI), a compact diagnostic that disentangles genuine transfer from cue-inflated performance using three interpretable facets: (i) Adaptation Delay (AD), (ii) Performance Deficit (PD), and (iii) Relative Suboptimal Feature Reliance (SFR_rel). On a two-phase CIFAR-100 CL benchmark with a deliberately spurious magenta patch in Phase 2, we evaluate Naive fine-tuning (SGD), online Elastic Weight Consolidation (EWC_on), Dark Experience Replay (DER++), Gradient Projection Memory (GPM), and Deep Generative Replay (DGR). Across these continual learning methods, we observe that CL methods reach accuracy thresholds earlier than a Scratch-T2 baseline (negative AD) but achieve slightly lower final accuracy on patched shortcut classes (positive PD). Masking the patch improves accuracy for CL methods while slightly reducing Scratch-T2, yielding negative SFR_rel. This pattern indicates the patch acted as a distractor for CL models in this setting rather than a helpful shortcut.
zh
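
ERI 的三个分量都可以直接由训练与评测曲线算出。下面按摘要口径给出一个示意实现(阈值与符号约定为本示例假设,未必与论文完全一致):负 AD 表示 CL 方法更快过阈值,正 PD 表示其最终准确率更低,负 SFR_rel 表示捷径补丁对 CL 模型实为干扰。

```python
import numpy as np

def adaptation_delay(acc_cl, acc_scratch, threshold=0.6):
    """AD:CL 方法与 Scratch-T2 基线首次达到阈值准确率的步数差(假设定义)。"""
    first = lambda curve: int(np.argmax(np.asarray(curve) >= threshold))
    return first(acc_cl) - first(acc_scratch)   # 负值表示 CL 方法更早过阈值

def performance_deficit(final_acc_cl, final_acc_scratch):
    """PD:最终准确率差距,正值表示 CL 方法最终表现更差。"""
    return final_acc_scratch - final_acc_cl

def relative_sfr(acc_patched_cl, acc_masked_cl, acc_patched_s, acc_masked_s):
    """SFR_rel:遮蔽捷径补丁引起的准确率变化,相对基线取差(假设定义)。"""
    delta_cl = acc_patched_cl - acc_masked_cl   # >0 表示补丁对 CL 模型是"帮助"
    delta_s = acc_patched_s - acc_masked_s
    return delta_cl - delta_s                   # 负值:补丁对 CL 模型反而是干扰

# 玩具曲线:CL 方法更早过阈值(负 AD),但最终准确率略低(正 PD)
cl_curve = [0.2, 0.5, 0.65, 0.70, 0.72]
scratch_curve = [0.1, 0.3, 0.55, 0.68, 0.75]
print(adaptation_delay(cl_curve, scratch_curve))                      # -1
print(round(performance_deficit(cl_curve[-1], scratch_curve[-1]), 2))  # 0.03
print(round(relative_sfr(0.72, 0.78, 0.75, 0.74), 2))                 # -0.07
```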

[CV-75] Rehearsal-free and Task-free Online Continual Learning With Contrastive Prompt

【速读】:该论文旨在解决在线持续学习(Online Continual Learning, OCL)中的灾难性遗忘(catastrophic forgetting)问题。现有方法通常依赖于存储样本的回放缓冲区(rehearsal buffer)或假设任务边界可识别,但前者存在数据隐私风险,后者在单次数据流处理中难以实现。为此,论文提出了一种无需样本回放且不依赖任务边界或身份信息的解决方案(即F2OCL),其关键在于将提示学习(prompt learning)与神经协同记忆分类器(NCM classifier)相结合,从而在不存储数据、不依赖任务标识的情况下有效缓解遗忘问题。

链接: https://arxiv.org/abs/2510.00467
作者: Aopeng Wang,Ke Deng,Yongli Ren,Jun Luo
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: preparing for CVIU

点击查看摘要

Abstract:The main challenge of continual learning is catastrophic forgetting. Because it processes data in one pass, online continual learning (OCL) is one of the most difficult continual learning scenarios. To address catastrophic forgetting in OCL, some existing studies use a rehearsal buffer to store samples and replay them in the later learning process, while other studies do not store samples but assume a sequence of learning tasks so that the task identities can be explored. However, storing samples may raise data security or privacy concerns, and it is not always possible to identify the boundaries between learning tasks in one pass of data processing. This motivates us to investigate rehearsal-free and task-free OCL (F2OCL). By integrating prompt learning with an NCM classifier, this study effectively tackles catastrophic forgetting without storing samples and without using task boundaries or identities. Extensive experimental results on two benchmarks demonstrate the effectiveness of the proposed method.
zh
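
F2OCL 所用的 NCM(最近类均值)分类器天然适合单遍数据流:每个类只需增量维护一个特征均值,既不存样本也不需要任务边界。下面是一个最小示意(特征提取器以随机高斯特征代替,属于本示例假设):

```python
import numpy as np

class StreamingNCM:
    """最近类均值分类器:在单遍数据流上增量维护各类原型,无需回放缓冲区。"""
    def __init__(self, dim):
        self.dim = dim
        self.means, self.counts = {}, {}

    def update(self, feat, label):
        if label not in self.means:
            self.means[label] = np.zeros(self.dim)
            self.counts[label] = 0
        self.counts[label] += 1
        # 增量均值:mu <- mu + (x - mu) / n,对数据流只需常数内存
        self.means[label] += (feat - self.means[label]) / self.counts[label]

    def predict(self, feat):
        labels = list(self.means)
        dists = [np.linalg.norm(feat - self.means[c]) for c in labels]
        return labels[int(np.argmin(dists))]

rng = np.random.default_rng(0)
ncm = StreamingNCM(dim=16)
for t in range(200):                     # 模拟单遍数据流:两个类的高斯特征
    label = t % 2
    feat = rng.normal(loc=3.0 * label, size=16)
    ncm.update(feat, label)
print(ncm.predict(rng.normal(loc=3.0, size=16)))  # 预期输出 1
```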

[CV-76] VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

【速读】:该论文旨在解决视觉-语言目标检测器(Vision-Language Object Detectors, VLODs)在面临域偏移(domain shift)时性能下降的问题,尤其是在跨场景、低光照、风格化图像及常见噪声干扰等分布变化下的零样本识别能力退化。其解决方案的关键在于提出一种测试时自适应(Test-Time Adaptation, TTA)框架——VLOD-TTA,核心创新包括:一是引入基于IoU加权的熵目标函数,聚焦于空间上连贯的候选框聚类区域进行适应,从而减少孤立预测框带来的确认偏差;二是设计图像条件化的提示选择机制,通过图像级兼容性评分筛选最优文本提示,并将其与检测器输出logits融合,增强语义对齐精度。该方法在多个具有挑战性的分布偏移场景中显著提升了YOLO-World和Grounding DINO两个前沿VLOD模型的鲁棒性。

链接: https://arxiv.org/abs/2510.00458
作者: Atif Belal,Heitor R. Medeiros,Marco Pedersoli,Eric Granger
机构: LIVIA, Dept. of Systems Engineering, ÉTS Montréal, Canada(加拿大蒙特利尔工程学院); International Laboratory on Learning Systems (ILLS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts – including stylized domains, driving scenes, low-light conditions, and common corruptions – shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code: this https URL
zh
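
摘要中的第一个组件可以概括为:给空间上彼此重叠的候选框更大的熵自适应权重,孤立框几乎不贡献损失,从而抑制确认偏差。下面是一个 PyTorch 示意(把权重定义为与其他框的 IoU 之和是本示例的假设,未必与论文一致):

```python
import torch
from torchvision.ops import box_iou

def iou_weighted_entropy(boxes, logits):
    """boxes: (N,4) xyxy 候选框;logits: (N,C) 各候选框的分类 logits。"""
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # (N,)
    iou = box_iou(boxes, boxes)                  # (N,N) 成对 IoU
    iou.fill_diagonal_(0.0)                      # 去掉自身
    weights = iou.sum(dim=-1)                    # 与其他框重叠越多,权重越大
    weights = weights / (weights.sum() + 1e-12)
    return (weights * entropy).sum()             # 作为测试时自适应的损失最小化

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],        # 与第一个框高度重叠
                      [50., 50., 60., 60.]])     # 孤立框,几乎不贡献损失
logits = torch.randn(3, 20, requires_grad=True)
loss = iou_weighted_entropy(boxes, logits)
loss.backward()                                  # 梯度可用于测试时更新检测器参数
print(float(loss))
```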

[CV-77] Measuring and Controlling the Spectral Bias for Self-Supervised Image Denoising

【速读】:该论文旨在解决当前自监督去噪方法在成对噪声图像处理中存在的两个关键问题:一是高频结构细节保留不足,二是网络在拟合高频信息时会学习到噪声成分。解决方案的核心在于提出一种谱控网络(Spectral Controlling network, SCNet),其关键技术包括:1)设计频带选择策略以加速训练收敛;2)通过限制卷积核的Lipschitz常数来抑制对高频噪声的学习能力,无需改变网络结构;3)引入谱分离与低秩重建模块(Spectral Separation and low-rank Reconstruction module, SSR module),利用频域分离和低秩空间重构实现噪声与高频结构细节的有效分离,从而有效保留图像的高频结构信息。

链接: https://arxiv.org/abs/2510.00454
作者: Wang Zhang,Huaqiu Li,Xiaowan Hu,Tao Jiang,Zikang Chen,Haoqian Wang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current self-supervised denoising methods for paired noisy images typically involve mapping one noisy image through the network to the other noisy image. However, after measuring the spectral bias of such methods using our proposed Image Pair Frequency-Band Similarity, it suffers from two practical limitations. Firstly, the high-frequency structural details in images are not preserved well enough. Secondly, during the process of fitting high frequencies, the network learns high-frequency noise from the mapped noisy images. To address these challenges, we introduce a Spectral Controlling network (SCNet) to optimize self-supervised denoising of paired noisy images. First, we propose a selection strategy to choose frequency band components for noisy images, to accelerate the convergence speed of training. Next, we present a parameter optimization method that restricts the learning ability of convolutional kernels to high-frequency noise using the Lipschitz constant, without changing the network structure. Finally, we introduce the Spectral Separation and low-rank Reconstruction module (SSR module), which separates noise and high-frequency details through frequency domain separation and low-rank space reconstruction, to retain the high-frequency structural details of images. Experiments performed on synthetic and real-world datasets verify the effectiveness of SCNet.
zh
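
摘要里"在不改变网络结构的前提下,用 Lipschitz 常数限制卷积核学习高频噪声的能力"这一思路,可用谱范数约束来近似示意(以 PyTorch 自带的 spectral_norm 作为代理,并非论文原始的参数优化方法):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# 对卷积层施加谱范数参数化:按展平后的权重矩阵计,谱范数被约束到 1 附近,
# 从而限制该层放大高频噪声的能力;网络结构本身保持不变
conv = spectral_norm(nn.Conv2d(3, 64, kernel_size=3, padding=1))

x = torch.randn(1, 3, 32, 32)
y = conv(x)
print(y.shape)                                        # torch.Size([1, 64, 32, 32])
w = conv.weight                                        # 参数化后的权重
print(torch.linalg.matrix_norm(w.flatten(1), ord=2))   # 约为 1(幂迭代近似)
```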

[CV-78] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

【速读】:该论文旨在解决现有视频生成模型在主体一致性(subject consistency)方面的不足,尤其是当提示词包含复杂空间关系、时间逻辑及多主体交互时,模型难以准确解析并保持视觉上的一致性。其解决方案的关键在于提出了一种统一框架 BindWeave,其中引入了 MLLM-DiT(Multimodal Large Language Model - Diffusion Transformer)架构:通过预训练的多模态大语言模型(Multimodal Large Language Model, MLLM)进行深度跨模态推理,将提示中的实体进行语义定位、角色解耦与属性分离,从而生成具有主体感知能力的隐状态(subject-aware hidden states),作为扩散Transformer(Diffusion Transformer)的条件输入,实现高保真且主体一致的视频生成。

链接: https://arxiv.org/abs/2510.00438
作者: Zhaoyang Li,Dongjun Qian,Kai Su,Qishuai Diao,Xiangyang Xia,Chang Liu,Wenfei Yang,Tianzhu Zhang,Zehuan Yuan
机构: University of Science and Technology of China (中国科学技术大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
zh

[CV-79] On-the-Fly Data Augmentation via Gradient-Guided and Sample-Aware Influence Estimation

【速读】:该论文旨在解决现有数据增强方法在动态训练过程中因未考虑样本难度演化而导致的增强策略与模型训练需求不匹配问题,进而影响深度神经网络泛化能力提升的问题。其解决方案的关键在于提出一种样本感知的动态增强方法(Sample-Aware Dynamic Augmentation, SADA),通过在线估计每个样本对模型优化的影响强度——具体地,利用样本梯度在累积模型更新方向上的投影及其局部训练窗口内的时序方差来衡量稳定性:稳定样本(低方差)被施加更强的增强以增强多样性,而不稳定样本则采用较弱的变换以保留语义一致性并稳定学习过程。该方法无需额外辅助模型或策略调参,可作为即插即用模块集成至现有训练流程中。

链接: https://arxiv.org/abs/2510.00434
作者: Suorong Yang,Jie Zong,Lihang Wang,Ziheng Qin,Hai Gan,Pengfei Zhou,Kai Wang,Yang You,Furao Shen
机构: Nanjing University (南京大学); National University of Singapore (新加坡国立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data augmentation has been widely employed to improve the generalization of deep neural networks. Most existing methods apply fixed or random transformations. However, we find that sample difficulty evolves along with the model’s generalization capabilities in dynamic training environments. As a result, applying uniform or stochastic augmentations, without accounting for such dynamics, can lead to a mismatch between augmented data and the model’s evolving training needs, ultimately degrading training effectiveness. To address this, we introduce SADA, a Sample-Aware Dynamic Augmentation that performs on-the-fly adjustment of augmentation strengths based on each sample’s evolving influence on model optimization. Specifically, we estimate each sample’s influence by projecting its gradient onto the accumulated model update direction and computing the temporal variance within a local training window. Samples with low variance, indicating stable and consistent influence, are augmented more strongly to emphasize diversity, while unstable samples receive milder transformations to preserve semantic fidelity and stabilize learning. Our method is lightweight, which does not require auxiliary models or policy tuning. It can be seamlessly integrated into existing training pipelines as a plug-and-play module. Experiments across various benchmark datasets and model architectures show consistent improvements of SADA, including +7.3% on fine-grained tasks and +4.3% on long-tailed datasets, highlighting the method’s effectiveness and practicality.
zh
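
SADA 的影响估计可拆成两步:样本梯度在累计更新方向上的投影,以及该投影在局部训练窗口内的时序方差。下面给出这一步的示意片段(窗口大小与"方差到增强强度"的映射均为本示例假设):

```python
import torch
from collections import defaultdict, deque

WINDOW = 5
proj_history = defaultdict(lambda: deque(maxlen=WINDOW))  # 每个样本的投影历史

def influence_projection(sample_grad, accum_update):
    """样本梯度在累计模型更新方向上的投影(两者都已展平为向量)。"""
    d = accum_update / (accum_update.norm() + 1e-12)
    return torch.dot(sample_grad, d)

def augment_strength(sample_id, sample_grad, accum_update, s_min=0.1, s_max=1.0):
    proj = influence_projection(sample_grad, accum_update)
    hist = proj_history[sample_id]
    hist.append(proj.item())
    var = torch.tensor(hist).var().item() if len(hist) > 1 else 0.0
    # 方差小 -> 影响稳定 -> 施加更强增强;方差大 -> 温和增强(映射方式为假设)
    stability = 1.0 / (1.0 + var)
    return s_min + (s_max - s_min) * stability

accum = torch.randn(1000)                        # 累计模型更新方向(示意)
for step in range(6):
    g = torch.randn(1000) * 0.1 + accum * 0.01   # 某样本在各步的梯度(示意)
    s = augment_strength(sample_id=7, sample_grad=g, accum_update=accum)
print(round(s, 3))                                # 稳定样本会得到接近 s_max 的强度
```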

[CV-80] Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment

【速读】:该论文旨在解决基于强化学习(Reinforcement Learning, RL)微调扩散模型时面临的泛化能力差、组合性不足以及对奖励欺骗(reward hacking)敏感等问题。现有方法多采用前馈式提示优化策略,即在整个采样轨迹中使用单一优化后的提示,未能充分利用RL的序列特性。其解决方案的关键在于提出PromptLoop框架——一个可插拔的RL范式,通过将扩散模型的中间潜在状态(latent states)作为反馈信号,利用多模态大语言模型(Multimodal Large Language Model, MLLM)在训练过程中迭代更新提示,从而实现步骤级的提示精炼。该设计在结构上类比于Diffusion RL,同时保持了提示对齐的灵活性与通用性,在多个奖励函数和扩散模型基座上验证了其在奖励优化、泛化能力、与其他对齐方法的正交组合性及抑制过优化方面的优势。

链接: https://arxiv.org/abs/2510.00430
作者: Suhyeon Lee,Jong Chul Ye
机构: Kim Jaechul Graduate School of AI (金在哲人工智能研究生院); Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 15 figures

点击查看摘要

Abstract:Despite the recent progress, reinforcement learning (RL)-based fine-tuning of diffusion models often struggles with generalization, composability, and robustness against reward hacking. Recent studies have explored prompt refinement as a modular alternative, but most adopt a feed-forward approach that applies a single refined prompt throughout the entire sampling trajectory, thereby failing to fully leverage the sequential nature of reinforcement learning. To address this, here we introduce PromptLoop, a plug-and-play RL framework that incorporates latent feedback into step-wise prompt refinement. Rather than modifying diffusion model weights, a multimodal large language model (MLLM) is trained with RL to iteratively update prompts based on intermediate latent states of diffusion models. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment. Extensive experiments across diverse reward functions and diffusion backbones demonstrate that PromptLoop (i) achieves effective reward optimization, (ii) generalizes seamlessly to unseen models, (iii) composes orthogonally with existing alignment methods, and (iv) mitigates over-optimization and reward hacking.
zh

[CV-81] Domain-Specialized Interactive Segmentation Framework for Meningioma Radiotherapy Planning MICCAI2025

【速读】:该论文旨在解决脑膜瘤(meningioma)在放射治疗(radiotherapy, RT)规划中精准分割难题,这一问题直接影响治疗效果和周围健康组织的保护。现有自动化深度学习方法因肿瘤异质性难以实现临床一致的高精度分割,而通用交互式医学图像分割(Interactive Medical Image Segmentation, IMIS)工具又缺乏针对脑膜瘤RT规划这类特定任务的临床适配性。解决方案的关键在于开发专用的IMIS工具Interactive-MEN-RT,其整合了多种临床友好的交互方式(如点标注、边界框、自由手绘和涂鸦),并针对脑膜瘤RT流程进行优化,在500例增强T1加权MRI数据上的评估中显著优于其他方法,Dice系数达77.6%,IoU达64.8%,验证了面向临床场景定制化分割工具的重要性。

链接: https://arxiv.org/abs/2510.00416
作者: Junhyeok Lee,Han Jang,Kyu Sung Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Clinical Image-Based Procedures (CLIP 2025), MICCAI 2025 Workshop

点击查看摘要

Abstract:Precise delineation of meningiomas is crucial for effective radiotherapy (RT) planning, directly influencing treatment efficacy and preservation of adjacent healthy tissues. While automated deep learning approaches have demonstrated considerable potential, achieving consistently accurate clinical segmentation remains challenging due to tumor heterogeneity. Interactive Medical Image Segmentation (IMIS) addresses this challenge by integrating advanced AI techniques with clinical input. However, generic segmentation tools, despite widespread applicability, often lack the specificity required for clinically critical and disease-specific tasks like meningioma RT planning. To overcome these limitations, we introduce Interactive-MEN-RT, a dedicated IMIS tool specifically developed for clinician-assisted 3D meningioma segmentation in RT workflows. The system incorporates multiple clinically relevant interaction methods, including point annotations, bounding boxes, lasso tools, and scribbles, enhancing usability and clinical precision. In our evaluation involving 500 contrast-enhanced T1-weighted MRI scans from the BraTS 2025 Meningioma RT Segmentation Challenge, Interactive-MEN-RT demonstrated substantial improvement compared to other segmentation methods, achieving Dice similarity coefficients of up to 77.6% and Intersection over Union scores of 64.8%. These results emphasize the need for clinically tailored segmentation solutions in critical applications such as meningioma RT planning. The code is publicly available at: this https URL
zh

[CV-82] PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents

【速读】:该论文旨在解决基于视觉的图形用户界面(GUI)智能体在执行长程任务时因记忆限制而导致的关键信息丢失问题,尤其是在需要回溯历史视觉细节以支持未来决策时。现有方法要么简单截断历史记录,要么依赖低效的文本摘要,难以保留对后续动作至关重要的视觉上下文。解决方案的关键在于提出PAL-UI框架,其核心创新是引入一种双层摘要机制(观察级线索与动作级结果)和一个专用检索工具,使智能体能够按需主动调用特定历史截图进行规划,从而实现动态、精准的记忆回溯,显著提升长程任务的执行准确性和跨域泛化能力。

链接: https://arxiv.org/abs/2510.00413
作者: Zikang Liu,Junyi Li,Wayne Xin Zhao,Dawei Gao,Yaliang Li,Ji-rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Department of Computer Science, National University of Singapore (新加坡国立大学计算机科学系); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) promise human-like interaction with software applications, yet long-horizon tasks remain challenging due to memory limitations. Existing approaches either truncate history or rely on simple textual summaries, which risk losing critical information when past visual details become necessary for future decisions. In this paper, we propose PAL-UI (Planning with Active Look-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool that allows the agent to recall specific historical screenshots during planning. We curate a step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories and train PAL-UI-3B and PAL-UI-7B models based on Qwen2.5-VL. Extensive experiments demonstrate that PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings. Moreover, PAL-UI exhibits strong cross-domain generalization, achieving notable improvements in web navigation without additional training. Our work highlights the potential of active memory retrieval for long-horizon planning capabilities of vision-based GUI agents.
zh

[CV-83] David and Goliath in Medical Vision: Convolutional Networks vs Biomedical Vision Language Models

【速读】:该论文旨在解决如何在胸部X光片(chest radiographs)的自动化诊断中有效利用零样本医学视觉-语言模型(Vision-Language Model, VLM),以实现与监督训练的轻量级卷积神经网络(Convolutional Neural Network, CNN)相当甚至更优的性能。研究发现,尽管默认零样本设置下BiomedCLIP等VLM表现较差,但通过在验证集上进行简单的决策阈值校准(decision threshold calibration),其性能可显著提升:在肺炎检测任务中,F1分数从0.7698提升至0.8841,超过监督CNN的0.8803;在结核病检测任务中,F1分数从0.4812大幅提高至0.7684,接近监督基线的0.7834。因此,解决方案的关键在于决策阈值校准,这是释放零样本VLM全部诊断潜力的核心机制。

链接: https://arxiv.org/abs/2510.00411
作者: Ran Tong,Jiaqi Liu,Su Liu,Jiexi Xu,Lanruo Wang,Tong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures. Under review at the International Conference on Artificial Intelligence, Computer, Data Sciences and Applications

点击查看摘要

Abstract:The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN’s 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline’s 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
zh
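
文中的补救措施本身很简单:零样本 VLM 输出的是相似度分数,默认 0.5 阈值往往偏离最优工作点,只需在验证集上扫描阈值并取 F1 最大者。示意如下(分数与标签为模拟数据):

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(val_scores, val_labels, grid=None):
    """在验证集上扫描决策阈值,返回使 F1 最大的阈值。"""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else grid
    f1s = [f1_score(val_labels, val_scores >= t, zero_division=0) for t in grid]
    return float(grid[int(np.argmax(f1s))])

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
# 模拟"分数分布整体偏低"的零样本 VLM:正类分数只略高于负类
scores = np.where(labels == 1,
                  rng.normal(0.35, 0.08, 500),
                  rng.normal(0.25, 0.08, 500))

t_star = calibrate_threshold(scores, labels)
print("calibrated threshold:", t_star)
print("F1 @ 0.5   :", round(f1_score(labels, scores >= 0.5), 3))    # 接近 0
print("F1 @ t_star:", round(f1_score(labels, scores >= t_star), 3))  # 大幅提升
```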

[CV-84] VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在具身决策中因依赖模仿学习而导致的误差累积和分布偏移下的鲁棒性不足问题。其解决方案的关键在于提出一种基于数据驱动世界模型的强化微调框架(VLA-RFT),该框架利用真实交互数据训练的世界模型作为可控模拟器,能够根据动作预测未来视觉观测,并结合目标达成参考生成密集的轨迹级奖励信号,从而提供高效且与动作对齐的学习信号,显著降低样本需求并提升任务执行的稳定性与鲁棒性。

链接: https://arxiv.org/abs/2510.00406
作者: Hengtao Li,Pengxiang Ding,Runze Suo,Yihao Wang,Zirui Ge,Dongyuan Zang,Kexian Yu,Mingyang Sun,Hongyin Zhang,Donglin Wang,Weihua Su
机构: Westlake University (西湖大学); Zhejiang University (浙江大学); OpenHelix Team; Fudan University (复旦大学); Zhengzhou University (郑州大学); BUPT (北京邮电大学); Hebei University of Technology (河北工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to this https URL.
zh

[CV-85] EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations

【速读】:该论文旨在解决基于第一人称视角(ego-centric)的轨迹预测模型在真实场景中鲁棒性不足的问题,其核心挑战在于现有方法通常假设观测历史理想化,而忽略了实际第一人称视觉中存在的感知伪影(如遮挡、ID切换和跟踪漂移),导致训练与部署环境不一致,严重影响模型性能。解决方案的关键在于提出EgoTraj-Bench这一首个真实世界基准,将噪声第一人称视觉历史与干净的鸟瞰图未来轨迹对齐,从而实现更贴近现实的训练;并进一步设计BiFlow双流流匹配模型,通过共享潜在表示同时完成历史观测去噪与未来运动预测,并引入EgoAnchor机制,利用特征调制条件化预测解码器以更好地建模代理意图,显著提升了轨迹预测的准确性和鲁棒性。

链接: https://arxiv.org/abs/2510.00405
作者: Jiayi Liu,Jiaming Zhou,Ke Ye,Kun-Yu Lin,Allan Wang,Junwei Liang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The University of Hong Kong (香港大学); Miraikan – The National Museum of Emerging Science and Innovation (国立未来科学博物馆); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume idealized observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird’s-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion by leveraging a shared latent representation. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for developing trajectory forecasting systems truly resilient to the challenges of real-world, ego-centric perception.
zh

[CV-86] Discrete Wavelet Transform as a Facilitator for Expressive Latent Space Representation in Variational Autoencoders in Satellite Imagery

【速读】:该论文旨在解决Latent Diffusion Models (LDM) 在遥感(Remote Sensing, RS)应用中,由于变分自编码器(Variational Autoencoder, VAE)所构建的潜在空间表示能力有限而导致的生成质量与特征表达不足的问题。解决方案的关键在于提出了一种名为ExpDWT-VAE的新架构,其核心创新是引入离散小波变换(Discrete Wavelet Transform, DWT),通过双分支结构融合空间域与频域信息:一个分支处理原始空间域输入,另一个分支利用二维Haar小波分解提取频率特征并经卷积操作后重构,最终将两者融合形成空间-频率联合表示,并通过卷积和对角高斯映射进一步优化为更鲁棒的潜在表示,从而显著提升VAE在卫星图像中的潜在空间建模能力。

链接: https://arxiv.org/abs/2510.00376
作者: Arpan Mahara,Md Rezaul Karim Khan,Naphtali Rishe,Wenjia Wang,Seyed Masoud Sadjadi
机构: Knight Foundation School of Computing and Information Sciences (Knight基金会计算机与信息科学学院); Florida International University (佛罗里达国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 Figures

点击查看摘要

Abstract:Latent Diffusion Models (LDM), a subclass of diffusion models, mitigate the computational complexity of pixel-space diffusion by operating within a compressed latent space constructed by Variational Autoencoders (VAEs), demonstrating significant advantages in Remote Sensing (RS) applications. Though numerous studies enhancing LDMs have been conducted, investigations explicitly targeting improvements within the intrinsic latent space remain scarce. This paper proposes an innovative perspective, utilizing the Discrete Wavelet Transform (DWT) to enhance the VAE’s latent space representation, designed for satellite imagery. The proposed method, ExpDWT-VAE, introduces dual branches: one processes spatial domain input through convolutional operations, while the other extracts and processes frequency-domain features via 2D Haar wavelet decomposition, convolutional operation, and inverse DWT reconstruction. These branches merge to create an integrated spatial-frequency representation, further refined through convolutional and diagonal Gaussian mapping into a robust latent representation. We utilize a new satellite imagery dataset housed by the TerraFly mapping system to validate our method. Experimental results across several performance metrics highlight the efficacy of the proposed method at enhancing latent space representation.
zh
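
ExpDWT-VAE 频域分支的核心操作是一层 2D Haar 小波分解与重构。下面用 PyWavelets 给出"分解、处理、重构、与空间分支拼接"的最小示意(分支内部的卷积以恒等映射代替,属于本示例的简化假设):

```python
import numpy as np
import pywt

def dwt_branch(x):
    """频域分支:2D Haar 分解 -> (示意)处理 -> 逆变换重构,形状与输入一致。"""
    cA, (cH, cV, cD) = pywt.dwt2(x, "haar")      # 低频子带 + 三个方向的高频子带
    cA = cA * 1.0                                 # 实际方法中此处为卷积,这里恒等示意
    return pywt.idwt2((cA, (cH, cV, cD)), "haar")

def spatial_branch(x):
    return x                                      # 实际方法中为卷积,这里恒等示意

x = np.random.rand(64, 64).astype(np.float32)
fused = np.concatenate([spatial_branch(x)[None], dwt_branch(x)[None]], axis=0)
print(fused.shape)   # (2, 64, 64):空间-频率融合表示,后续再经卷积映射到潜在空间
# 验证 Haar 分解的完美重构性质
assert np.allclose(dwt_branch(x), x, atol=1e-5)
```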

[CV-87] Motion In-Betweening for Densely Interacting Characters

【速读】:该论文旨在解决多角色在关键帧之间进行自然交互的运动插值(motion in-betweening)问题,尤其针对密集互动场景下如何维持空间-时间一致性与长期动作质量的挑战。其核心解决方案是提出一种名为“跨空间插值”(Cross-Space In-Betweening)的新方法,通过在不同条件表示空间中建模角色间的交互关系,从而实现动态、可控且长时间稳定的双角色交互运动生成。为应对因强约束导致的解空间受限和运动退化问题,作者进一步引入两种机制:一是利用对抗学习识别周期性交互模式以保持交互质量;二是学习对漂移潜在空间进行修正,防止姿态误差累积,从而确保长期合成的稳定性与真实性。

链接: https://arxiv.org/abs/2510.00314
作者: Xiaotang Zhang,Ziyi Chang,Qianhui Men,Hubert P. H. Shum
机构: Durham University (杜伦大学); University of Bristol (布里斯托大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motion in-betweening is the problem of synthesizing movement between keyposes. Traditional research focused primarily on single characters. Extending these methods to densely interacting characters is highly challenging, as it demands precise spatial-temporal correspondence between the characters to maintain the interaction, while creating natural transitions towards predefined keyposes. In this research, we present a method for long-horizon interaction in-betweening that enables two characters to engage and respond to one another naturally. To effectively represent and synthesize interactions, we propose a novel solution called Cross-Space In-Betweening, which models the interactions of each character across different conditioning representation spaces. We further observe that the significantly increased constraints in interacting characters heavily limit the solution space, leading to degraded motion quality and diminished interaction over time. To enable long-horizon synthesis, we present two solutions to maintain long-term interaction and motion quality, thereby keeping synthesis in the stable region of the solution space. We first sustain interaction quality by identifying periodic interaction patterns through adversarial learning. We further maintain the motion quality by learning to refine the drifted latent space and prevent pose error accumulation. We demonstrate that our approach produces realistic, controllable, and long-horizon in-between motions of two characters with dynamic boxing and dancing actions across multiple keyposes, supported by extensive quantitative evaluations and user studies.
zh

[CV-88] Looking Beyond the Known: Towards a Data Discovery Guided Open-World Object Detection NEURIPS’25

【速读】:该论文针对开放世界目标检测(Open-World Object Detection, OWOD)中存在的语义混淆(known-unknown semantic confusion)和灾难性遗忘(catastrophic forgetting)问题,提出了一种统一框架Combinatorial Open-World Detection (CROWD)。其解决方案的关键在于将未知目标发现与表征学习重构为一个交织的组合优化任务:CROWD-Discover通过最大化子模条件增益(Submodular Conditional Gain, SCG)函数,选择与已知类别显著不同的代表性未知实例;CROWD-Learn则引入新颖的组合目标函数,联合解耦已知与未知类别的表征,同时保持已知类间的判别一致性,从而有效缓解语义混淆与记忆遗忘,提升检测性能。

链接: https://arxiv.org/abs/2510.00303
作者: Anay Majee,Amitesh Gangrade,Rishabh Iyer
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to NeurIPS’25. 22 pages, 6 figures

点击查看摘要

Abstract:Open-World Object Detection (OWOD) enriches traditional object detectors by enabling continual discovery and integration of unknown objects via human guidance. However, existing OWOD approaches frequently suffer from semantic confusion between known and unknown classes, alongside catastrophic forgetting, leading to diminished unknown recall and degraded known-class accuracy. To overcome these challenges, we propose Combinatorial Open-World Detection (CROWD), a unified framework reformulating unknown object discovery and adaptation as an interwoven combinatorial (set-based) data-discovery (CROWD-Discover) and representation learning (CROWD-Learn) task. CROWD-Discover strategically mines unknown instances by maximizing Submodular Conditional Gain (SCG) functions, selecting representative examples distinctly dissimilar from known objects. Subsequently, CROWD-Learn employs novel combinatorial objectives that jointly disentangle known and unknown representations while maintaining discriminative coherence among known classes, thus mitigating confusion and forgetting. Extensive evaluations on OWOD benchmarks illustrate that CROWD achieves improvements of 2.83% and 2.05% in known-class accuracy on M-OWODB and S-OWODB, respectively, and nearly 2.4x unknown recall compared to leading baselines.
zh
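
CROWD-Discover 中"最大化子模条件增益以挑选与已知类显著不同的代表性未知样本",可以用设施选址(facility location)函数的条件增益贪心来示意(具体子模函数与惩罚项选型为本示例假设):

```python
import numpy as np

def facility_location_gain(sim_pool, selected, cand):
    """在已选集合基础上,新增候选 cand 对未知池的边际覆盖增益。"""
    cur = sim_pool[:, selected].max(axis=1) if selected else np.zeros(sim_pool.shape[0])
    return np.maximum(sim_pool[:, cand] - cur, 0).sum()

def scg_greedy(feats_pool, feats_known, budget=5, nu=1.0):
    """子模条件增益式贪心:优先选既能代表未知池、又远离已知类的样本。"""
    sim_pool = feats_pool @ feats_pool.T       # 池内相似度 (N,N)
    sim_known = feats_pool @ feats_known.T     # 与已知类的相似度 (N,K)
    penalty = nu * sim_known.max(axis=1)       # 越像已知类,惩罚越大
    selected = []
    for _ in range(budget):
        cands = [j for j in range(feats_pool.shape[0]) if j not in selected]
        gains = [facility_location_gain(sim_pool, selected, j) - penalty[j]
                 for j in cands]
        selected.append(cands[int(np.argmax(gains))])
    return selected

rng = np.random.default_rng(0)
feats_pool = rng.normal(size=(100, 32))
feats_pool /= np.linalg.norm(feats_pool, axis=1, keepdims=True)
feats_known = rng.normal(size=(10, 32))
feats_known /= np.linalg.norm(feats_known, axis=1, keepdims=True)
print(scg_greedy(feats_pool, feats_known))     # 被挑出的 5 个未知实例下标
```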

[CV-89] MOLM: Mixture of LoRA Markers ICLR2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成图像的溯源与检测问题,即如何在保证图像质量的前提下,实现对合成图像的鲁棒水印嵌入与验证。现有水印方法存在对真实世界失真敏感、易被自适应移除以及密钥更新成本高等缺陷。其解决方案的关键在于提出一种通用水印框架,将编码问题建模为生成模型参数的密钥依赖扰动,并在此框架下设计了基于路由机制的 Mixture of LoRA Markers (MOLM),通过二进制密钥激活残差块和注意力模块中的轻量级 LoRA 适配器(LoRA adapters),从而实现无需重新训练即可动态切换密钥、同时保持不可感知性、保真度、可验证性和鲁棒性的目标。实验表明,MOLM 在 Stable Diffusion 和 FLUX 模型上均能有效抵抗压缩、再生、平均攻击及黑盒对抗攻击,实现高精度密钥恢复。

链接: https://arxiv.org/abs/2510.00293
作者: Samar Fares,Nurbek Tastan,Noor Hussein,Karthik Nandakumar
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Michigan State University (MSU)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 21 pages, 11 figures, Under review at ICLR 2026

点击查看摘要

Abstract:Generative models can generate photorealistic images at scale. This raises urgent concerns about the ability to detect synthetically generated images and attribute these images to specific sources. While watermarking has emerged as a possible solution, existing methods remain fragile to realistic distortions, susceptible to adaptive removal, and expensive to update when the underlying watermarking key changes. We propose a general watermarking framework that formulates the encoding problem as key-dependent perturbation of the parameters of a generative model. Within this framework, we introduce Mixture of LoRA Markers (MOLM), a routing-based instantiation in which binary keys activate lightweight LoRA adapters inside residual and attention blocks. This design avoids key-specific re-training and achieves the desired properties such as imperceptibility, fidelity, verifiability, and robustness. Experiments on Stable Diffusion and FLUX show that MOLM preserves image quality while achieving robust key recovery against distortions, compression and regeneration, averaging attacks, and black-box adversarial attacks on the extractor.
zh
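
MOLM 的路由思想可以用几行 PyTorch 说明:同一层挂多个 LoRA 适配器,二进制密钥的每一位决定对应适配器是否被激活,换密钥无需重训底座模型(适配器数量、秩与融合方式为本示例假设):

```python
import torch
import torch.nn as nn

class KeyRoutedLoRA(nn.Module):
    """带密钥路由 LoRA 的线性层:key 的每一位控制一个适配器的开关。"""
    def __init__(self, dim, n_adapters=8, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        # 为演示不同密钥产生不同输出,这里 A、B 均随机初始化;训练时 B 常初始化为 0
        self.A = nn.ParameterList(nn.Parameter(torch.randn(dim, rank) * 0.02)
                                  for _ in range(n_adapters))
        self.B = nn.ParameterList(nn.Parameter(torch.randn(rank, dim) * 0.02)
                                  for _ in range(n_adapters))

    def forward(self, x, key):
        out = self.base(x)
        for bit, A, B in zip(key, self.A, self.B):
            if bit:                   # 二进制密钥激活对应的低秩参数扰动
                out = out + x @ A @ B
        return out

layer = KeyRoutedLoRA(dim=64)
x = torch.randn(2, 64)
y1 = layer(x, key=[1, 0, 1, 0, 0, 1, 0, 0])   # 不同密钥 -> 不同的水印扰动
y2 = layer(x, key=[0, 1, 0, 0, 1, 0, 0, 1])
print(y1.shape, torch.allclose(y1, y2))        # torch.Size([2, 64]) False
```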

[CV-90] Learning Energy-based Variational Latent Prior for VAEs

【速读】:该论文旨在解决变分自编码器(Variational Auto-Encoders, VAEs)生成图像模糊且不一致的问题,其核心原因在于“先验空洞”(prior hole)现象——即先验分布中高概率区域在后验分布中概率较低,导致生成样本质量下降。为缓解此问题,论文提出将先验建模为能量模型(Energy-Based Model, EBM),利用EBM对后验的灵活拟合能力以减少先验空洞并提升证据下界(ELBO)。解决方案的关键在于引入变分方法处理EBM中的归一化常数,从而避免传统EBM依赖昂贵马尔可夫链蒙特卡洛(MCMC)采样的低效性;具体而言,通过训练一个采样网络近似变分形式,并将其作为交替优化问题进行求解,同时该采样网络在生成阶段退化为隐式变分先验,实现高效快速采样。

链接: https://arxiv.org/abs/2510.00260
作者: Debottam Dutta,Chaitanya Amballa,Zhongweiyang Xu,Yu-Lin Wei,Romit Roy Choudhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Variational Auto-Encoders (VAEs) are known to generate blurry and inconsistent samples. One reason for this is the “prior hole” problem. A prior hole refers to regions that have high probability under the VAE’s prior but low probability under the VAE’s posterior. This means that during data generation, high probability samples from the prior could have low probability under the posterior, resulting in poor quality data. Ideally, a prior needs to be flexible enough to match the posterior while retaining the ability to generate samples fast. Generative models continue to address this tradeoff. This paper proposes to model the prior as an energy-based model (EBM). While EBMs are known to offer the flexibility to match posteriors (and also improving the ELBO), they are traditionally slow in sample generation due to their dependency on MCMC methods. Our key idea is to bring a variational approach to tackle the normalization constant in EBMs, thus bypassing the expensive MCMC approaches. The variational form can be approximated with a sampler network, and we show that such an approach to training priors can be formulated as an alternating optimization problem. Moreover, the same sampler reduces to an implicit variational prior during generation, providing efficient and fast sampling. We compare our Energy-based Variational Latent Prior (EVaLP) method to multiple SOTA baselines and show improvements in image generation quality, reduced prior holes, and better sampling efficiency.
zh

[CV-91] Improved Hyperspectral Anomaly Detection via Unsupervised Subspace Modeling in the Signed Cumulative Distribution Transform Domain

【速读】:该论文旨在解决高光谱异常检测(Hyperspectral Anomaly Detection, HAD)中因复杂现实环境和先验知识有限导致的检测精度不足问题。其解决方案的关键在于提出一种基于运输理论(transport-based)的数学模型,将高光谱像素视为模板模式在未知变形下的观测结果,并通过符号累积分布变换(Signed Cumulative Distribution Transform, SCDT)域进行表征;随后利用无监督子空间建模技术在该域中构建背景信号模型,从而将异常信号识别为对学习模型的偏离。

链接: https://arxiv.org/abs/2510.00148
作者: Abu Hasnat Mohammad Rubaiyat,Jordan Vincent,Colin Olson
机构: Amentum(阿门图姆); Tekla Research(泰克拉研究); U.S. Naval Research Laboratory(美国海军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 8 pages, 8 figures

点击查看摘要

Abstract:Hyperspectral anomaly detection (HAD), a crucial approach for many civilian and military applications, seeks to identify pixels with spectral signatures that are anomalous relative to a preponderance of background signatures. Significant effort has been made to improve HAD techniques, but challenges arise due to complex real-world environments and, by definition, limited prior knowledge of potential signatures of interest. This paper introduces a novel HAD method by proposing a transport-based mathematical model to describe the pixels comprising a given hyperspectral image. In this approach, hyperspectral pixels are viewed as observations of a template pattern undergoing unknown deformations that enables their representation in the signed cumulative distribution transform (SCDT) domain. An unsupervised subspace modeling technique is then used to construct a model of abundant background signals in this domain, whereupon anomalous signals are detected as deviations from the learned model. Comprehensive evaluations across five distinct datasets illustrate the superiority of our approach compared to state-of-the-art methods.
zh

[CV-92] Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks

【速读】:该论文旨在解决深度神经网络在视觉和机器人应用中对语义变换扰动(如亮度和对比度变化)缺乏有效鲁棒性验证的问题,尤其是现有认证训练与鲁棒性认证方法因模型过度参数化而导致认证紧致性和可扩展性不足的挑战。其解决方案的关键在于提出一个基于“无偏且平滑神经元度量”(Unbiased and Smooth Neuron metric, USN)的新指标,用于量化神经元层对输入扰动的稳定性与方差,并据此设计一种新型神经网络剪枝方法——通过移除低USN神经元、保留高USN神经元,在不牺牲模型表达能力的前提下减少冗余参数;进一步引入Wasserstein距离损失函数,使剪枝后的神经元在各层间分布更集中,从而提升认证效率与鲁棒性性能。

链接: https://arxiv.org/abs/2510.00083
作者: Hanjiang Hu,Bowei Li,Ziwei Wang,Tianhao Wei,Casidhe Hutchison,Eric Sample,Changliu Liu
机构: Robotics Institute, Carnegie Mellon University (卡内基梅隆大学机器人学院); School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (新加坡南洋理工大学电气与电子工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep neural networks have been widely adopted in many vision and robotics applications with visual inputs. It is essential to verify its robustness against semantic transformation perturbations, such as brightness and contrast. However, current certified training and robustness certification methods face the challenge of over-parameterization, which hinders the tightness and scalability due to the over-complicated neural networks. To this end, we first analyze stability and variance of layers and neurons against input perturbation, showing that certifiable robustness can be indicated by a fundamental Unbiased and Smooth Neuron metric (USN). Based on USN, we introduce a novel neural network pruning method that removes neurons with low USN and retains those with high USN, thereby preserving model expressiveness without over-parameterization. To further enhance this pruning process, we propose a new Wasserstein distance loss to ensure that pruned neurons are more concentrated across layers. We validate our approach through extensive experiments on the challenging robust keypoint detection task, which involves realistic brightness and contrast perturbations, demonstrating that our method achieves superior robustness certification performance and efficiency compared to baselines.
zh

[CV-93] Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning

【速读】:该论文旨在解决视觉语言模型在地理空间推理(geospatial reasoning)能力上的不足,即如何使模型在没有昂贵的人工标注推理过程的情况下,能够有效结合视觉线索与地理先验知识并进行准确预测。其解决方案的关键在于提出Geo-R1框架,该框架采用分阶段的后训练策略:第一阶段通过监督微调(supervised fine-tuning)在合成思维链(chain-of-thought)示例上注入“地理空间思维范式”,使模型无需人工标注即可建立视觉与地理先验之间的关联;第二阶段则利用基于GRPO的强化学习(reinforcement learning),在弱监督的跨视角配对代理任务上优化模型,提供可验证且可扩展的奖励信号,从而引导模型跨模态捕捉和整合特征,并以推理驱动实现精准预测。此方法将地理空间建模从传统的领域预训练或监督微调推进至“推理优先”的后训练范式。

链接: https://arxiv.org/abs/2510.00072
作者: Chenhui Xu,Fuxun Yu,Michael J. Bianco,Jacob Kovarskiy,Raphael Tang,Qi Zhang,Zirui Xu,Will LeVine,Brandon Dubbs,Heming Liao,Cassandra Burgess,Suvam Bag,Jay Patravali,Rupanjali Kukal,Mikael Figueroa,Rishi Madhok,Nikolaos Karianakis,Jinjun Xiong
机构: University at Buffalo (纽约州立大学布法罗分校); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal: teaching models to capture and reconcile features across modalities, and harnessing reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining / supervised finetuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at this https URL.
zh

[CV-94] OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解“单图引导”(One-Image Guide, OIG)这一人类感知与认知特征鲜明的视觉信息结构时的能力评估不足问题。OIG 是一种融合文本、图像和符号的可视化表达形式,其设计初衷是提升人类对复杂信息的理解效率,因此对模型的人类级理解能力提出了更高要求。解决方案的关键在于构建了一个名为 OIG-Bench 的综合性基准测试集,并开发了一种半自动化标注流程:通过多个智能代理协作生成初步图像描述,辅助人工高效构建高质量图像-文本配对数据。该方法显著降低了标注成本,同时确保了数据质量,从而支持对 29 种前沿 MLLM 的系统性评估,揭示出当前模型在语义理解和逻辑推理方面的局限性,并验证了多智能体标注系统在图像描述任务上的优越性能。

链接: https://arxiv.org/abs/2510.00069
作者: Jiancong Xie,Wenjin Wang,Zhuomeng Zhang,Zihan Liu,Qi Liu,Ke Feng,Zixun Sun,Yuedong Yang
机构: Sun Yat-sen University (中山大学); Tencent Inc (腾讯公司); Machine Intelligence and Advanced Computing (教育部机器智能与先进计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are a visual format combining text, imagery, and symbols to present reorganized and structured information for easier comprehension, which are specifically designed for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships. In addition, we also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at this https URL.
zh

[CV-95] Intelligent 5S Audit: Application of Artificial Intelligence for Continuous Improvement in the Automotive Industry

【速读】:该论文旨在解决传统5S管理(整理、整顿、清扫、清洁、素养)审计过程效率低、主观性强及难以适应工业4.0标准的问题。其解决方案的关键在于构建一个基于大规模语言模型(Large-scale Language Models, LLM)的自动化5S审计系统,通过智能图像分析实现对五个维度的标准化评估,从而提升审计的客观性与一致性。该系统在可靠性上经Cohen’s kappa系数验证(κ = 0.75),并显著缩短审计时间50%,同时降低运营成本99.8%,为精益生产体系与生成式AI技术的融合提供了可扩展的新范式。

链接: https://arxiv.org/abs/2510.00067
作者: Rafael da Silva Maciel,Lucio Veraldo Jr
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 5 figures, 5 tables

点击查看摘要

Abstract:The evolution of the 5S methodology with the support of artificial intelligence techniques represents a significant opportunity to improve industrial organization audits in the automotive chain, making them more objective, efficient and aligned with Industry 4.0 standards. This work developed an automated 5S audit system based on large-scale language models (LLM), capable of assessing the five senses (Seiri, Seiton, Seiso, Seiketsu, Shitsuke) in a standardized way through intelligent image analysis. The system's reliability was validated using Cohen's kappa coefficient (kappa = 0.75), showing strong alignment between the automated assessments and the corresponding human audits. The results indicate that the proposed solution contributes significantly to continuous improvement in automotive manufacturing environments, cutting audit time by 50% relative to the traditional process while maintaining the consistency of the assessments, with a 99.8% reduction in operating costs compared to traditional manual audits. The methodology presented establishes a new paradigm for integrating lean systems with emerging AI technologies, offering scalability for implementation in automotive plants of different sizes.
zh
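
文中用于验证自动审计与人工审计一致性的 Cohen's kappa 可以手工复算,便于理解 kappa = 0.75 的含义(以下审计条目为虚构示例):

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa:在扣除随机一致的基础上度量两名评审的一致性。"""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    cats = np.unique(np.concatenate([a, b]))
    po = (a == b).mean()                                       # 观测一致率
    pe = sum((a == c).mean() * (b == c).mean() for c in cats)  # 期望随机一致率
    return (po - pe) / (1 - pe)

# 虚构的 20 个 5S 审计条目:1 = 合格, 0 = 不合格
ai_audit    = [1,1,0,1,1,0,0,1,1,1,0,1,0,1,1,0,1,1,0,1]
human_audit = [1,1,0,1,0,0,0,1,1,1,0,1,0,1,1,1,1,1,0,1]
print(round(cohens_kappa(ai_audit, human_audit), 2))  # 0.78,与文中 0.75 同属"较强一致"
```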

[CV-96] Efficient CNN Compression via Multi-method Low Rank Factorization and Feature Map Similarity

【速读】:该论文旨在解决低秩分解(Low-Rank Factorization, LRF)在压缩卷积神经网络(Convolutional Neural Networks, CNNs)过程中面临的四大挑战:最优秩选择困难、设计空间庞大、微调时间过长以及对不同层类型和分解方法兼容性差的问题。其解决方案的关键在于提出一种端到端的设计空间探索(Design Space Exploration, DSE)方法与框架,其中引入基于特征图相似性的新型秩选择策略,能够更有效地捕捉层输出间的非线性交互关系;同时采用一次性微调(one-shot fine-tuning)机制显著缩短整体微调时间,并支持所有类型的卷积层(Conv)和全连接层(FC)的灵活组合应用。此外,框架集成六种不同的LRF技术(三类用于Conv层,三类用于FC层),并根据每层特性进行选择性部署,从而实现比单一方法全局应用更高的压缩效率与精度保持能力。

链接: https://arxiv.org/abs/2510.00062
作者: M. Kokhazadeh (1), G. Keramidas (1), V. Kelefouras (2) ((1) Aristotle University of Thessaloniki, Thessaloniki, Greece, (2) University of Plymouth, Plymouth, UK)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 17 figures, This work has been submitted to the IEEE for possible publication (IEEE Transactions on Artificial Intelligence)

点击查看摘要

Abstract:Low-Rank Factorization (LRF) is a widely adopted technique for compressing deep neural networks (DNNs). However, it faces several challenges, including optimal rank selection, a vast design space, long fine-tuning times, and limited compatibility with different layer types and decomposition methods. This paper presents an end-to-end Design Space Exploration (DSE) methodology and framework for compressing convolutional neural networks (CNNs) that addresses all these issues. We introduce a novel rank selection strategy based on feature map similarity, which captures non-linear interactions between layer outputs more effectively than traditional weight-based approaches. Unlike prior works, our method uses a one-shot fine-tuning process, significantly reducing the overall fine-tuning time. The proposed framework is fully compatible with all types of convolutional (Conv) and fully connected (FC) layers. To further improve compression, the framework integrates three different LRF techniques for Conv layers and three for FC layers, applying them selectively on a per-layer basis. We demonstrate that combining multiple LRF methods within a single model yields better compression results than using a single method uniformly across all layers. Finally, we provide a comprehensive evaluation and comparison of the six LRF techniques, offering practical insights into their effectiveness across different scenarios. The proposed work is integrated into TensorFlow 2.x, ensuring compatibility with widely used deep learning workflows. Experimental results on 14 CNN models across eight datasets demonstrate that the proposed methodology achieves substantial compression with minimal accuracy loss, outperforming several state-of-the-art techniques.
zh
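
对 FC 层做低秩分解的基本操作是把权重 W 拆成两个低秩因子,而秩可以按"压缩后输出与原输出的相似度"来挑,这正是该文基于特征图相似性选秩思想的一个简化版(以全局余弦相似度作为度量是本示例的假设):

```python
import numpy as np

def factorize_fc(W, rank):
    """截断 SVD:把 W (out,in) 分解为 W1 (out,r) @ W2 (r,in)。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

def pick_rank_by_similarity(W, X, target=0.99):
    """取能使低秩输出与原输出全局余弦相似度达到 target 的最小秩(度量方式为假设)。"""
    Y = X @ W.T
    for r in range(1, min(W.shape) + 1):
        W1, W2 = factorize_fc(W, r)
        Yr = X @ (W1 @ W2).T
        cos = (Y * Yr).sum() / (np.linalg.norm(Y) * np.linalg.norm(Yr) + 1e-12)
        if cos >= target:
            return r
    return min(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)) @ np.diag(1.0 / np.arange(1, 513))  # 人为构造奇异值快速衰减的权重
X = rng.normal(size=(100, 512))                                      # 一小批校准输入特征
r = pick_rank_by_similarity(W, X)
W1, W2 = factorize_fc(W, r)
print(f"rank={r}, params {W.size} -> {W1.size + W2.size} "
      f"({(W1.size + W2.size) / W.size:.1%})")
```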

[CV-97] Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving

【速读】:该论文旨在解决自动驾驶中轨迹规划任务的复杂性与端到端学习的局限性问题,提出将自动驾驶重新建模为广义语言任务,并将轨迹规划转化为下一航点预测。其解决方案的关键在于引入Max-V1框架,该框架采用单阶段生成范式,利用视觉-语言模型(VLM)直接从前视摄像头输入生成驾驶轨迹,通过基于统计建模的原理性监督策略构建明确的学习目标,从而在大规模专家示范数据上实现高效的模仿学习,显著提升驾驶策略的复杂性和泛化能力。

链接: https://arxiv.org/abs/2510.00060
作者: Sheng Yang,Tong Zhan,Guancheng Chen,Yanfeng Lu,Jian Wang
机构: Fudan University (复旦大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In this work, we reconceptualize autonomous driving as a generalized language task and formulate trajectory planning as next-waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the VLM (Vision-Language Model) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework highly amenable to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. Owing to these empirical strengths, this work introduces a model enabling fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.
zh

[CV-98] FSDENet: A Frequency and Spatial Domains based Detail Enhancement Network for Remote Sensing Semantic Segmentation

【速读】:该论文旨在解决遥感图像分割中因灰度变化(如阴影和低对比度区域)导致的语义边缘模糊问题,同时充分利用空间信息以提升分割精度。其解决方案的关键在于提出了一种基于频域与空间域的细节增强网络(FSDENet),通过引入快速傅里叶变换(Fast Fourier Transform, FFT)融合全局频率域信息,增强模型在灰度变化下的全局表征能力;同时利用哈尔小波变换(Haar wavelet transform)将特征分解为高低频分量,利用其对边缘敏感性的差异性来优化边界分割。该方法实现了空间粒度与频率域边缘敏感性的双重协同,显著提升了边界区域和灰度过渡区的分割准确性。

链接: https://arxiv.org/abs/2510.00059
作者: Jiahao Fu,Yinfeng Yu,Liejun Wang
机构: Xinjiang University (新疆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication by IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

点击查看摘要

Abstract:To fully leverage spatial information for remote sensing image segmentation and address semantic edge ambiguities caused by grayscale variations (e.g., shadows and low-contrast regions), we propose the Frequency and Spatial Domains based Detail Enhancement Network (FSDENet). Our framework employs spatial processing methods to extract rich multi-scale spatial features and fine-grained semantic details. By effectively integrating global and frequency-domain information through the Fast Fourier Transform (FFT) in global mappings, the model’s capability to discern global representations under grayscale variations is significantly strengthened. Additionally, we utilize Haar wavelet transform to decompose features into high- and low-frequency components, leveraging their distinct sensitivity to edge information to refine boundary segmentation. The model achieves dual-domain synergy by integrating spatial granularity with frequency-domain edge sensitivity, substantially improving segmentation accuracy in boundary regions and grayscale transition zones. Comprehensive experimental results demonstrate that FSDENet achieves state-of-the-art (SOTA) performance on four widely adopted datasets: LoveDA, Vaihingen, Potsdam, and iSAID.
zh

[CV-99] HiDe: Rethinking The Zoom-IN method in High Resolution MLLM s via Hierarchical Decoupling

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在高分辨率图像理解任务中性能不佳的问题。传统观点认为这是由于感知限制导致模型难以识别小物体,因而采用“zoom in”策略以获取细节;然而本文通过系统性分析指出,真正瓶颈在于复杂背景干扰而非对象尺寸本身。解决方案的核心是提出一种无需训练的分层解耦框架(Hierarchical Decoupling Framework, HiDe),其关键创新在于两个模块:一是Token-wise Attention Decoupling (TAD),用于分离问题token与关键信息token,并基于注意力权重精确定位目标视觉区域;二是Layout-Preserving Decoupling (LPD),将目标区域从背景中解耦并重建保留空间布局的紧凑表示,从而有效消除背景干扰。该方法在V*Bench、HRBench4K和HRBench8K上达到新SOTA,且内存消耗比前人方法减少75%。

链接: https://arxiv.org/abs/2510.00054
作者: Xianjie Liu,Yiman Hu,Yixiong Zou,Liang Wu,Jian Xu,Bo Zheng
机构: Alibaba Group(阿里巴巴集团); Huazhong University of Science and Technology(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use “zoom in” strategies for better detail, our analysis reveals a different cause: the main issue is not object size, but complex background interference. We systematically analyze this “zoom in” operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is provided in this https URL.
zh

[CV-100] Object-AVEdit: An Object-level Audio-Visual Editing Model

【速读】:该论文旨在解决视频后期制作和影视创作中对象级音视频编辑的难题,即如何在音频与视觉模态间实现对象级别的添加、替换和移除操作,同时保持源实例的结构信息完整性。其核心解决方案是提出Object-AVEdit框架,基于逆向-再生范式(inversion-regeneration paradigm),关键创新在于:1)构建了一个词到声音对象对齐的音频生成模型(word-to-sounding-object well-aligned audio generation model),显著提升了音频模态的对象可控性;2)设计了一种整体优化的逆向-再生编辑算法(inversion-regeneration holistically-optimized editing algorithm),有效保障了逆向过程中的信息保留与再生效果,从而实现高质量的音视频语义对齐和对象级编辑。

链接: https://arxiv.org/abs/2510.00050
作者: Youquan Fu,Ruiyang Si,Hongfa Wang,Dongzhan Zhou,Jiacheng Sun,Ping Luo,Di Hu,Hongyuan Zhang,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom; Gaoling School of Artificial Intelligence Renmin University of China Beijing, China; Beijing University of Posts and Telecommunications; Tencent Data Platform; Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory; Huawei Noah’s Ark Lab; The University of Hong Kong
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:There is a high demand for audio-visual editing in video post-production and filmmaking. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to perform object addition, replacement, and removal across both audio and visual modalities, while preserving the structural information of the source instances during the editing process. In this paper, we present Object-AVEdit, achieving object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model, bridging the gap in object-controllability between audio and current video generation models. Meanwhile, to achieve better structural information preservation and object-level editing effects, we propose an inversion-regeneration holistically-optimized editing algorithm, ensuring both information retention during the inversion and a better regeneration effect. Extensive experiments demonstrate that our editing model achieves advanced results in both audio and video object-level editing tasks with fine audio-visual semantic alignment. In addition, our developed audio generation model also achieves advanced performance. More results on our project page: this https URL.
zh

[CV-101] Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations NEURIPS2025

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)生成的自然语言解释(Natural Language Explanations, NLEs)虽流畅可信但可能缺乏因果忠实性(faithfulness)的问题,即解释看似合理却未必反映模型决策的真实驱动因素,这会带来技术和治理风险。解决方案的关键在于提出一种全自动的验证方法——解释驱动的反事实测试(Explanation-Driven Counterfactual Testing, EDCT),其核心思想是将模型自身的NLE视为可证伪的假设:首先提取模型对图像-问题对的回答及解释,继而解析解释中的可测试视觉概念,通过生成式修复(generative inpainting)生成针对性的反事实图像编辑,并利用大语言模型(LLM)辅助分析答案与解释的变化,最终计算反事实一致性得分(Counterfactual Consistency Score, CCS),从而量化并揭示VLM在特定场景下的因果忠实性缺陷。

链接: https://arxiv.org/abs/2510.00047
作者: Sihao Ding,Santosh Vasa,Aditi Ramadwar
机构: Mercedes-Benz Research & Development North America (梅赛德斯-奔驰北美研发公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 workshop on Regulatable ML

点击查看摘要

Abstract:Vision-Language Models (VLMs) often produce fluent Natural Language Explanations (NLEs) that sound convincing but may not reflect the causal factors driving predictions. This mismatch of plausibility and faithfulness poses technical and governance risks. We introduce Explanation-Driven Counterfactual Testing (EDCT), a fully automated verification procedure for a target VLM that treats the model’s own explanation as a falsifiable hypothesis. Given an image-question pair, EDCT: (1) obtains the model’s answer and NLE, (2) parses the NLE into testable visual concepts, (3) generates targeted counterfactual edits via generative inpainting, and (4) computes a Counterfactual Consistency Score (CCS) using LLM-assisted analysis of changes in both answers and explanations. Across 120 curated OK-VQA examples and multiple VLMs, EDCT uncovers substantial faithfulness gaps and provides regulator-aligned audit artifacts indicating when cited concepts fail causal tests.
zh

[CV-102] Reinforcement Learning-Based Prompt Template Stealing for Text-to-Image Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在文本到图像生成任务中因提示词(prompt)交易市场兴起而引发的提示模板盗取安全问题。现有研究未充分关注提示词本身可能被逆向提取的风险,而本文首次系统揭示了这一漏洞,并提出RLStealer——一种基于强化学习的提示词反演框架,其核心创新在于将提示模板窃取建模为序列决策问题,并设计多种基于相似度的反馈信号作为奖励函数,以高效探索庞大的提示空间。实验表明,RLStealer在公开基准上达到当前最优性能,且攻击总成本仅为现有方法的13%以下,同时具备跨不同图像风格的泛化能力,有效实现了对未知提示模板的高效窃取。

链接: https://arxiv.org/abs/2510.00046
作者: Xiaotian Zou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have transformed text-to-image workflows, allowing designers to create novel visual concepts with unprecedented speed. This progress has given rise to a thriving prompt trading market, where curated prompts that induce trademark styles are bought and sold. Although commercially attractive, prompt trading also introduces a largely unexamined security risk: the prompts themselves can be stolen. In this paper, we expose this vulnerability and present RLStealer, a reinforcement learning based prompt inversion framework that recovers the underlying template from only a small set of example images. RLStealer treats template stealing as a sequential decision making problem and employs multiple similarity based feedback signals as reward functions to effectively explore the prompt space. Comprehensive experiments on publicly available benchmarks demonstrate that RLStealer achieves state-of-the-art performance while reducing the total attack cost to under 13% of that required by existing baselines. Our further analysis confirms that RLStealer can effectively generalize across different image styles to efficiently steal unseen prompt templates. Our study highlights an urgent security threat inherent in prompt trading and lays the groundwork for developing protective standards in the emerging MLLM marketplace.
zh

[CV-103] Beyond the Prompt: Gender Bias in Text-to-Image Models with a Case Study on Hospital Professions

【速读】:该论文旨在解决文本到图像(Text-to-Image, TTI)生成模型在医疗职业相关图像生成中存在系统性性别偏见的问题,即模型输出往往固化并放大社会对特定职业的性别刻板印象。其解决方案的关键在于通过设计精细化的提示词(prompt)实验,系统评估六种主流开放权重模型在不同职业与肖像修饰词组合下的性别分布表现,揭示了模型间差异显著:部分模型如Qwen-Image和Stable-Diffusion-XL表现出强男性主导倾向,而FLUX.1-dev则呈现女性偏向;同时发现提示词中的语义修饰(如“corporate”强化男性,“beautiful”倾向女性)会显著调节性别平衡,且不同模型对提示词变化的敏感度差异极大。研究强调,除模型架构外,提示词设计是影响生成结果性别公平性的关键变量,因此提出应采用偏见感知的设计策略、设定平衡的默认配置,并提供用户引导机制以避免生成内容强化职业性别 stereotypes。

链接: https://arxiv.org/abs/2510.00045
作者: Franck Vandewiele,Remi Synave,Samuel Delepoulle,Remi Cozot
机构: Université du Littoral Côte d’Opale (滨海大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-image (TTI) models are increasingly used in professional, educational, and creative contexts, yet their outputs often embed and amplify social biases. This paper investigates gender representation in six state-of-the-art open-weight models: HunyuanImage 2.1, HiDream-I1-dev, Qwen-Image, FLUX.1-dev, Stable-Diffusion 3.5 Large, and Stable-Diffusion-XL. Using carefully designed prompts, we generated 100 images for each combination of five hospital-related professions (cardiologist, hospital director, nurse, paramedic, surgeon) and five portrait qualifiers (“”, corporate, neutral, aesthetic, beautiful). Our analysis reveals systematic occupational stereotypes: all models produced nurses exclusively as women and surgeons predominantly as men. However, differences emerge across models: Qwen-Image and SDXL enforce rigid male dominance, HiDream-I1-dev shows mixed outcomes, and FLUX.1-dev skews female in most roles. HunyuanImage 2.1 and Stable-Diffusion 3.5 Large also reproduce gender stereotypes but with varying degrees of sensitivity to prompt formulation. Portrait qualifiers further modulate gender balance, with terms like corporate reinforcing male depictions and beautiful favoring female ones. Sensitivity varies widely: Qwen-Image remains nearly unaffected, while FLUX.1-dev, SDXL, and SD3.5 show strong prompt dependence. These findings demonstrate that gender bias in TTI models is both systematic and model-specific. Beyond documenting disparities, we argue that prompt wording plays a critical role in shaping demographic outcomes. The results underscore the need for bias-aware design, balanced defaults, and user guidance to prevent the reinforcement of occupational stereotypes in generative AI.
zh

[CV-104] Culture In a Frame: C3B as a Comic-Based Benchmark for Multimodal Culturally Awareness

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在文化意识能力评估中存在的不足,包括基准测试任务难度设计单一、缺乏跨语言任务以及使用单文化真实图像导致评测过于简单等问题。解决方案的关键在于提出一个全新的多文化、多任务、多语言的文化意识能力基准——C³B(Comic Cross-Cultural Benchmark),该基准包含超过2000张图像和18000个问答对,涵盖从基础视觉识别到高级文化冲突理解再到文化内容生成的三阶段递进式任务设计,从而系统性地提升对MLLMs文化认知能力的评估深度与广度。

链接: https://arxiv.org/abs/2510.00041
作者: Yuchen Song,Andong Chen,Wenxin Zhu,Kehai Chen,Xuefeng Bai,Muyun Yang,Tiejun Zhao
机构: Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images, and each real-world image typically contains one culture, making these benchmarks relatively easy for MLLMs. Based on this, we propose C³B (Comics Cross-Cultural Benchmark), a novel multicultural, multitask and multilingual cultural awareness benchmark. C³B comprises over 2000 images and over 18000 QA pairs, constructed on three tasks with progressive difficulty, from basic visual recognition to higher-level cultural conflict understanding, and finally to cultural content generation. We conducted evaluations on 11 open-source MLLMs, revealing a significant performance gap between MLLMs and human performance. The gap demonstrates that C³B poses substantial challenges for current MLLMs, encouraging future research to advance the cultural awareness capabilities of MLLMs.
zh

[CV-105] Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

【速读】:该论文旨在解决大规模视觉语言模型(VLMs)在指令微调(instruction tuning)过程中难以控制行为的问题,尤其是在减少训练数据预算时易出现性能下降的现象。传统方法依赖任务特定的启发式策略对数据进行筛选,将模型视为黑箱,忽视了驱动学习的潜在能力(intrinsic capabilities)。其解决方案的关键在于提出Capability-Attributed Data Curation(CADC)框架,通过无监督方式从梯度学习轨迹中发现内在能力,利用影响估计将训练数据归因于这些能力,并基于平衡选择与分阶段排序构建能力感知的课程学习策略,从而将黑箱微调转变为可控制、以能力为导向的过程。

链接: https://arxiv.org/abs/2510.00040
作者: Junjie Li,Ziao Wang,Jianghong Ma,Xiaofeng Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); Hong Kong Baptist University, China (香港浸会大学); City University of Hong Kong, China (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the instruction tuning data budget often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.
zh

[CV-106] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在真实世界部署中对多模态扰动缺乏鲁棒性的问题,现有方法仅关注简单的视觉干扰,忽视了动作、指令、环境和观测等多模态扰动的复杂交互。解决方案的关键在于提出RobustVLA框架,通过双路径优化实现输入与输出的多模态鲁棒性:一是针对输出鲁棒性,采用离线鲁棒优化策略,在流匹配目标下对抗最坏情况的动作噪声,等效于对抗训练、标签平滑与异常值惩罚;二是针对输入鲁棒性,强制在保持任务语义不变的前提下,输入变化下动作保持一致。此外,将多扰动场景建模为多臂赌博机问题并使用上置信界(Upper Confidence Bound, UCB)算法自动识别最具破坏性的噪声类型,从而提升模型在多种扰动组合下的泛化能力。实验表明,RobustVLA在LIBERO数据集上相较基线绝对提升达12.6%(基于pi0骨干网络)和10.4%(基于OpenVLA骨干网络),且推理速度比现有视觉鲁棒VLA快50.6倍,在真实FR5机器人上仅用有限示范即实现65.6%的性能增益。

链接: https://arxiv.org/abs/2510.00037
作者: Jianing Guo,Zhenhong Wu,Chang Tu,Yiyao Ma,Xiangqi Kong,Zhiqian Liu,Jiaming Ji,Shuning Zhang,Yuanpei Chen,Kai Chen,Xianglong Liu,Qi Dou,Yaodong Yang,Huijie Zhao,Weifeng Lv,Simin Li
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学); Chinese Academy of Sciences (中国科学院); University of Science and Technology of China (中国科学技术大学); Peking University (北京大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate that our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on a real-world FR5 robot with limited demonstrations, showing absolute gains of 65.6% under perturbations across four modalities.
zh
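
摘要中提到用上置信界(UCB)在多种扰动间自动挑选“最有害”的噪声。下面给出经典 UCB1 的最小示意(以扰动下的训练损失充当奖励,探索常数 c 与奖励定义均为本文假设):

```python
import numpy as np

class UCBNoiseSelector:
    """把 17 种扰动视为多臂赌博机的臂,用 UCB1 选取当前最有害的扰动。"""
    def __init__(self, n_arms: int, c: float = 2.0):
        self.counts = np.zeros(n_arms)   # 各扰动被选中的次数
        self.values = np.zeros(n_arms)   # 各扰动“有害度”的滑动平均
        self.c, self.t = c, 0

    def select(self) -> int:
        self.t += 1
        if (self.counts == 0).any():     # 先保证每个扰动至少尝试一次
            return int(np.argmin(self.counts))
        ucb = self.values + self.c * np.sqrt(np.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm: int, reward: float):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

selector = UCBNoiseSelector(n_arms=17)
for step in range(100):
    arm = selector.select()
    harm = float(np.random.rand())       # 实际应为该扰动下的训练损失
    selector.update(arm, harm)
```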

[CV-107] Review of Hallucination Understanding in Large Language and Vision Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统中幻觉(hallucinations)问题,即模型在文本或图像生成过程中产生错误或无意义输出的现象,这种现象可能导致部署阶段的信息误导,进而引发财务和运营损失。其解决方案的关键在于提出一个统一的多层级框架,用于跨应用、跨模态(文本与图像)系统性地刻画幻觉,并通过任务-模态交错的方法将幻觉与模型生命周期中的具体机制关联起来,从而揭示幻觉源于数据分布中的可预测模式及继承偏见,为开发更具鲁棒性和泛化能力的幻觉缓解方法奠定基础。

链接: https://arxiv.org/abs/2510.00034
作者: Zhengyi Ho,Siyuan Liang,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread adoption of large language and vision models in real-world applications has made urgent the need to address hallucinations – instances where models produce incorrect or nonsensical outputs. These errors can propagate misinformation during deployment, leading to both financial and operational harm. Although much research has been devoted to mitigating hallucinations, our understanding of them is still incomplete and fragmented. Without a coherent understanding of hallucinations, proposed solutions risk mitigating surface symptoms rather than underlying causes, limiting their effectiveness and generalizability in deployment. To tackle this gap, we first present a unified, multi-level framework for characterizing both image and text hallucinations across diverse applications, aiming to reduce conceptual fragmentation. We then link these hallucinations to specific mechanisms within a model’s lifecycle, using a task-modality interleaved approach to promote a more integrated understanding. Our investigations reveal that hallucinations often stem from predictable patterns in data distributions and inherited biases. By deepening our understanding, this survey provides a foundation for developing more robust and effective solutions to hallucinations in real-world generative AI systems.
zh

[CV-108] Hybrid Deep Learning for Hyperspectral Single Image Super-Resolution

【速读】:该论文旨在解决高光谱单图像超分辨率(Hyperspectral Single Image Super-Resolution, HISR)任务中难以同时恢复精细空间细节并保持宽波长范围内光谱保真度的问题,这限制了传统深度学习模型的性能。其解决方案的关键在于提出一种可无缝嵌入标准2D卷积架构的新型模块——光谱-空间解混融合模块(Spectral-Spatial Unmixing Fusion, SSUF),该模块结合光谱解混与光谱-空间特征提取,引导基于ResNet的卷积神经网络实现更优重建;同时设计了一种自定义的空间-光谱梯度损失函数(Spatial-Spectral Gradient Loss),在均方误差基础上引入空间和光谱梯度项,从而增强对空间结构和光谱特性的准确重建能力。

链接: https://arxiv.org/abs/2510.00033
作者: Usman Muhammad,Jorma Laaksonen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hyperspectral single image super-resolution (SISR) is a challenging task due to the difficulty of restoring fine spatial details while preserving spectral fidelity across a wide range of wavelengths, which limits the performance of conventional deep learning models. To address this challenge, we introduce Spectral-Spatial Unmixing Fusion (SSUF), a novel module that can be seamlessly integrated into standard 2D convolutional architectures to enhance both spatial resolution and spectral integrity. The SSUF combines spectral unmixing with spectral–spatial feature extraction and guides a ResNet-based convolutional neural network for improved reconstruction. In addition, we propose a custom Spatial-Spectral Gradient Loss function that integrates mean squared error with spatial and spectral gradient components, encouraging accurate reconstruction of both spatial and spectral features. Experiments on three public remote sensing hyperspectral datasets demonstrate that the proposed hybrid deep learning model achieves competitive performance while reducing model complexity.
zh
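
针对文中的 Spatial-Spectral Gradient Loss,下面给出一个 PyTorch 草图:在 MSE 之外加入相邻像素差分(空间梯度)与相邻波段差分(光谱梯度)两项。差分形式、L1 距离与权重 lam_s、lam_c 均为本文为演示所作的假设:

```python
import torch
import torch.nn.functional as F

def spatial_spectral_gradient_loss(pred, target, lam_s=0.1, lam_c=0.1):
    """pred/target: (B, C, H, W),C 为光谱波段数。"""
    mse = F.mse_loss(pred, target)
    # 空间梯度项:水平/垂直相邻像素差分
    spatial = (F.l1_loss(pred[..., :, 1:] - pred[..., :, :-1],
                         target[..., :, 1:] - target[..., :, :-1])
               + F.l1_loss(pred[..., 1:, :] - pred[..., :-1, :],
                           target[..., 1:, :] - target[..., :-1, :]))
    # 光谱梯度项:相邻波段差分
    spectral = F.l1_loss(pred[:, 1:] - pred[:, :-1],
                         target[:, 1:] - target[:, :-1])
    return mse + lam_s * spatial + lam_c * spectral

pred = torch.rand(2, 31, 64, 64, requires_grad=True)  # 31 波段占位数据
target = torch.rand(2, 31, 64, 64)
loss = spatial_spectral_gradient_loss(pred, target)
loss.backward()
```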

[CV-109] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models

【速读】:该论文旨在解决基于策略梯度的强化学习方法在文本到图像(Text-to-Image, T2I)模型对齐训练中面临的训练不稳定性和高方差问题,这些问题导致收敛速度缓慢并影响图像质量。其解决方案的关键在于提出比例信用分配策略优化(Proportionate Credit Policy Optimization, PCPO),通过稳定的目标重构和原理性的时步重加权机制,强制实现比例信用分配,从而缓解生成采样器数学结构引发的非比例反馈波动,显著提升训练稳定性、加速收敛并改善图像质量,有效避免模型坍缩(model collapse)这一常见失败模式。

链接: https://arxiv.org/abs/2509.25774
作者: Jeongjae Lee,Jong Chul Ye
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 17 figures

点击查看摘要

Abstract:While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO.
zh

[CV-110] ReLumix: Extending Image Relighting to Video via Video Diffusion Models

【速读】:该论文旨在解决视频后期制作中光照控制这一关键但难以实现的问题,现有方法通常灵活性不足,限制用户只能使用特定的重光照(relighting)模型。其解决方案的关键在于提出 ReLumix 框架,通过将重光照算法与时间合成过程解耦,使得任意图像级重光照技术(如扩散模型或物理渲染器)均可无缝应用于视频序列。该框架采用两阶段流程:第一阶段由艺术家在单个参考帧上使用任意偏好的图像重光照技术进行操作;第二阶段则利用微调后的稳定视频扩散模型(stable video diffusion, SVD)将目标光照信息平滑传播至整个视频序列。为保障时序一致性并避免伪影,作者引入门控交叉注意力机制以实现特征的平稳融合,并采用基于 SVD 运动先验的时间自举策略,从而在合成数据上训练后仍能良好泛化到真实视频场景,显著提升视觉保真度,提供了一种可扩展且灵活的动态光照控制方案。

链接: https://arxiv.org/abs/2509.23769
作者: Lezhong Wang,Shutong Jin,Ruiqi Cui,Anders Bjorholm Dahl,Jeppe Revall Frisvad,Siavash Bigdeli
机构: Technical University of Denmark (丹麦技术大学); KTH Royal Institute of Technology (皇家理工学院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Controlling illumination during video post-production is a crucial yet elusive goal in computational photography. Existing methods often lack flexibility, restricting users to certain relighting models. This paper introduces ReLumix, a novel framework that decouples the relighting algorithm from temporal synthesis, thereby enabling any image relighting technique to be seamlessly applied to video. Our approach reformulates video relighting into a simple yet effective two-stage process: (1) an artist relights a single reference frame using any preferred image-based technique (e.g., Diffusion Models, physics-based renderers); and (2) a fine-tuned stable video diffusion (SVD) model seamlessly propagates this target illumination throughout the sequence. To ensure temporal coherence and prevent artifacts, we introduce a gated cross-attention mechanism for smooth feature blending and a temporal bootstrapping strategy that harnesses SVD’s powerful motion priors. Although trained on synthetic data, ReLumix shows competitive generalization to real-world videos. The method demonstrates significant improvements in visual fidelity, offering a scalable and versatile solution for dynamic lighting control.
zh
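
文中用门控交叉注意力把参考帧的目标光照特征平滑融入各视频帧。下面是一种常见的 tanh 门控交叉注意力最小草图(门控标量初始化为 0,训练初期近似恒等映射;具体结构为本文假设,并非论文实现):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """视频帧 token(query)对参考重光照帧 token(key/value)做交叉注意力,
    输出经 tanh 门控后残差相加;门控初始化为 0,便于稳定训练。"""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # 可学习门控标量

    def forward(self, frame_tokens, ref_tokens):
        out, _ = self.attn(frame_tokens, ref_tokens, ref_tokens)
        return frame_tokens + torch.tanh(self.gate) * out

frames = torch.randn(2, 256, 512)  # (batch, token 数, 通道)
ref = torch.randn(2, 256, 512)     # 参考帧(已重光照)的 token
fused = GatedCrossAttention(512)(frames, ref)
```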

[CV-111] EVO-LRP: Evolutionary Optimization of LRP for Interpretable Model Explanations

【速读】:该论文旨在解决现有可解释人工智能(Explainable AI, XAI)方法在图像区域重要性归因时存在的细节与可解释性之间的权衡问题,尤其是传统层间相关性传播(Layer-wise Relevance Propagation, LRP)方法依赖启发式规则集、缺乏针对模型行为对齐和清晰度优化的局限。解决方案的关键在于提出EVO-LRP,通过协方差矩阵自适应进化策略(Covariance Matrix Adaptation Evolution Strategy, CMA-ES)对LRP超参数进行任务特定优化,以量化可解释性指标(如忠实性或稀疏性)为目标函数,从而系统性提升归因质量与视觉一致性,并增强对类别特异性特征的敏感性。

链接: https://arxiv.org/abs/2509.23585
作者: Emerald Zhang,Julian Weaver,Samantha R Santacruz,Edward Castillo
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:Explainable AI (XAI) methods help identify which image regions influence a model’s prediction, but often face a trade-off between detail and interpretability. Layer-wise Relevance Propagation (LRP) offers a model-aware alternative. However, LRP implementations commonly rely on heuristic rule sets that are not optimized for clarity or alignment with model behavior. We introduce EVO-LRP, a method that applies Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to tune LRP hyperparameters based on quantitative interpretability metrics, such as faithfulness or sparseness. EVO-LRP outperforms traditional XAI approaches in both interpretability metric performance and visual coherence, with strong sensitivity to class-specific features. These findings demonstrate that attribution quality can be systematically improved through principled, task-specific optimization.
zh
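
EVO-LRP 的核心是用 CMA-ES 按可解释性指标搜索 LRP 超参数。以下示意基于 pycma 库,优化两个假设超参数 alpha 与 epsilon;其中 run_lrp_and_score 是占位桩函数,实际应运行 LRP 并返回 faithfulness 分数:

```python
import cma  # pip install cma

def run_lrp_and_score(alpha: float, epsilon: float) -> float:
    """桩函数:实际应以 (alpha, epsilon) 运行 LRP 并返回 faithfulness 分数。
    这里用一个单峰替代函数使示例可直接运行(最优点约在 alpha=2, epsilon=0.1)。"""
    return 1.0 / (1.0 + (alpha - 2.0) ** 2 + (epsilon - 0.1) ** 2)

def objective(params) -> float:
    alpha, epsilon = params
    return 1.0 - run_lrp_and_score(alpha, epsilon)  # CMA-ES 做最小化

es = cma.CMAEvolutionStrategy([1.0, 0.25], 0.5, {'verbose': -9})
while not es.stop():
    candidates = es.ask()                       # 采样一代候选超参数
    es.tell(candidates, [objective(c) for c in candidates])
best_alpha, best_epsilon = es.result.xbest      # 实际使用时可再约束 epsilon > 0
print(best_alpha, best_epsilon)
```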

[CV-112] U-DFA: A Unified DINOv2-Unet with Dual Fusion Attention for Multi-Dataset Medical Segmentation

【速读】:该论文旨在解决医学图像分割中卷积神经网络(CNN)因感受野局限而难以捕捉全局上下文信息的问题,以及现有融合CNN与Transformer的方法在局部与全局特征融合上效果不佳的挑战。其解决方案的关键在于提出一种统一的U-DFA架构,该架构基于DINOv2-Unet的编码器-解码器结构,并引入新颖的局部-全局融合适配器(Local-Global Fusion Adapter, LGFA),通过在多个阶段将CNN-based空间模式适配器(Spatial Pattern Adapter, SPA)注入冻结的DINOv2块中,实现高层语义特征与空间特征的有效融合,从而在仅使用33%可训练参数的情况下,在Synapse和ACDC数据集上达到当前最优的分割性能。

链接: https://arxiv.org/abs/2510.00585
作者: Zulkaif Sajjad,Furqan Shaukat,Junaid Mir
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate medical image segmentation plays a crucial role in overall diagnosis and is one of the most essential tasks in the diagnostic pipeline. CNN-based models, despite their extensive use, suffer from a local receptive field and fail to capture the global context. A common approach that combines CNNs with transformers attempts to bridge this gap but fails to effectively fuse the local and global features. With the recent emergence of VLMs and foundation models, they have been adapted for downstream medical imaging tasks; however, they suffer from an inherent domain gap and high computational cost. To this end, we propose U-DFA, a unified DINOv2-Unet encoder-decoder architecture that integrates a novel Local-Global Fusion Adapter (LGFA) to enhance segmentation performance. LGFA modules inject spatial features from a CNN-based Spatial Pattern Adapter (SPA) module into frozen DINOv2 blocks at multiple stages, enabling effective fusion of high-level semantic and spatial features. Our method achieves state-of-the-art performance on the Synapse and ACDC datasets with only 33% of the trainable model parameters. These results demonstrate that U-DFA is a robust and scalable framework for medical image segmentation across multiple modalities.
zh

[CV-113] A Fast and Precise Method for Searching Rectangular Tumor Regions in Brain MR Images

【速读】:该论文旨在解决脑肿瘤磁共振成像(MRI)中矩形区域快速且精确搜索的问题,以提升脑肿瘤诊断的效率与准确性。其解决方案的关键在于两个方面:一是采用改进的U-Net结构(编码器替换为EfficientNet)进行图像分割,提高分割精度;二是设计基于累加面积表(summed-area tables)的快速搜索算法,实现三维全搜索(3D full search)的高效执行,同时引入用户可调的搜索指标,优先选择立方体形状并优化高肿瘤占比区域的评分,从而在保证速度的同时显著提升矩形肿瘤区域的质量。

链接: https://arxiv.org/abs/2510.00505
作者: Hidenori Takeshima,Shuki Maruyama
机构: Canon Medical Systems Corporation(佳能医疗系统公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: To develop a fast and precise method for searching rectangular regions in brain tumor images. Methods: The authors propose a new method for searching rectangular tumor regions in brain MR images. The proposed method consisted of a segmentation network and a fast search method with a user-controllable search metric. As the segmentation network, the U-Net whose encoder was replaced by the EfficientNet was used. In the fast search method, summed-area tables were used for accelerating sums of voxels in rectangular regions. Use of the summed-area tables enabled exhaustive search of the 3D offset (3D full search). The search metric was designed for giving priority to cubes over oblongs, and assigning better values for higher tumor fractions even if they exceeded target tumor fractions. The proposed computation and metric were compared with those used in a conventional method using the Brain Tumor Image Segmentation dataset. Results: When the 3D full search was used, the proposed computation (8 seconds) was 100-500 times faster than the conventional computation (11-40 minutes). When the user-controllable parts of the search metrics were changed variously, the tumor fractions of the proposed metric were higher than those of the conventional metric. In addition, the conventional metric preferred oblongs whereas the proposed metric preferred cubes. Conclusion: The proposed method is promising for implementing fast and precise search of rectangular tumor regions, which is useful for brain tumor diagnosis using MRI systems. The proposed computation reduced processing times of the 3D full search, and the proposed metric improved the quality of the assigned rectangular tumor regions.
zh
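
论文用累加面积表(积分体)把任意长方体区域的体素求和降为 O(1),从而支撑三维全搜索。下面用 NumPy 给出三维积分体与八角点包含-排除求和的最小示意(前置零面用于简化边界,函数划分为本文假设):

```python
import numpy as np

def summed_area_table_3d(vol: np.ndarray) -> np.ndarray:
    """沿三个轴做前缀和得到积分体,并前置一层零面以简化边界处理。"""
    sat = vol.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(sat, ((1, 0), (1, 0), (1, 0)))

def box_sum(sat, z0, y0, x0, z1, y1, x1) -> float:
    """O(1) 返回闭区间 [z0..z1, y0..y1, x0..x1] 内体素之和(八角点包含-排除)。"""
    return (sat[z1+1, y1+1, x1+1]
            - sat[z0, y1+1, x1+1] - sat[z1+1, y0, x1+1] - sat[z1+1, y1+1, x0]
            + sat[z0, y0, x1+1] + sat[z0, y1+1, x0] + sat[z1+1, y0, x0]
            - sat[z0, y0, x0])

mask = (np.random.rand(32, 64, 64) > 0.7).astype(np.float64)  # 占位肿瘤掩码
sat = summed_area_table_3d(mask)
assert np.isclose(box_sum(sat, 4, 10, 10, 12, 30, 30),
                  mask[4:13, 10:31, 10:31].sum())
```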

[CV-114] A Deep Learning Pipeline for Epilepsy Genomic Analysis Using GPT -2 XL and NVIDIA H100

【速读】:该论文旨在解决癫痫(epilepsy)研究中高通量转录组数据(transcriptomic data)分析复杂性高、解析困难的问题。其核心解决方案是构建一种融合深度学习策略与GPU加速计算的新分析流程,关键在于利用基于Transformer架构的大型语言模型(Large Language Model, LLM)GPT-2 XL(15亿参数)对基因序列进行编码,并依托最新NVIDIA H100 Tensor Core GPU(基于Hopper架构)实现高效的数据预处理、基因序列表示及表达模式识别,从而提升对癫痫相关转录组特征的挖掘效率与准确性。

链接: https://arxiv.org/abs/2510.00392
作者: Muhammad Omer Latif,Hayat Ullah,Muhammad Ali Shafique,Zhihua Dong
机构: Connetquot Central School District of Long Island(长岛康内特夸特中央学区); Florida Atlantic University(佛罗里达大西洋大学); Kansas State University(堪萨斯州立大学); Brookhaven National Laboratory(布鲁克海文国家实验室)
类目: Genomics (q-bio.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages

点击查看摘要

Abstract:Epilepsy is a chronic neurological condition characterized by recurrent seizures, with global prevalence estimated at 50 million people worldwide. While progress in high-throughput sequencing has allowed for broad-based transcriptomic profiling of brain tissues, deciphering these highly complex datasets remains a key challenge. To address this issue, in this paper we propose a new analysis pipeline that integrates the power of deep learning strategies with GPU-accelerated computation for investigating gene expression patterns in epilepsy. Specifically, our proposed approach employs GPT-2 XL, a transformer-based Large Language Model (LLM) with 1.5 billion parameters, for genomic sequence analysis on the latest NVIDIA H100 Tensor Core GPUs based on the Hopper architecture. Our proposed method enables efficient preprocessing of RNA sequence data, gene sequence encoding, and subsequent pattern identification. We conducted experiments on two epilepsy datasets, GEO accessions GSE264537 and GSE275235. The obtained results reveal several significant transcriptomic modifications, including reduced hippocampal astrogliosis after ketogenic diet treatment as well as restored excitatory-inhibitory signaling equilibrium in a zebrafish epilepsy model. Moreover, our results highlight the effectiveness of leveraging LLMs in combination with advanced hardware acceleration for transcriptomic characterization in neurological diseases.
zh

[CV-115] Behavioural Classification in C. elegans: a Spatio-Temporal Analysis of Locomotion

【速读】:该论文旨在解决在高密度条件下难以对秀丽隐杆线虫(C. elegans)进行完整体态追踪从而提取行为单元的问题,传统方法依赖于对整个虫体的清晰视图,而在社会性环境模拟中常无法实现。其解决方案的关键在于提出一种基于单点追踪的无监督自动流程,无需预先设定行为假设即可自动识别行为单元,并通过与人工设计的行为单元对比及代理模型(agent-based model)仿真验证其有效性,结果表明即使仅从单点轨迹也能提取出具有生物学意义的时空运动模式,且这些模式构成了行为分类的核心基础。

链接: https://arxiv.org/abs/2510.00086
作者: Nemanja Antonic,Monika Scholz,Aymeric Vellinger,Euphrasie Ramahefarivo,Elio Tuci
机构: University of Namur, Belgium (比利时那慕尔大学); Max Planck Institute for Neurobiology of Behaviour, Germany (德国马克斯·普朗克行为神经生物学研究所)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The 1mm roundworm C. elegans is a model organism used in many sub-areas of biology to investigate different types of biological processes. In order to complement the in-vivo analysis with computer-based investigations, several methods have been proposed to simulate the worm behaviour. These methods extract discrete behavioural units from the flow of the worm movements using different types of tracking techniques. Nevertheless, these techniques require a clear view of the entire worm body, which is not always achievable. For example, this happens in high-density worm conditions, which are particularly informative to understand the influence of the social context on the single worm behaviour. In this paper, we illustrate and evaluate a method to extract behavioural units from recordings of C. elegans movements which does not necessarily require a clear view of the entire worm body. Moreover, the behavioural units are defined by an unsupervised automatic pipeline which frees the process from predefined assumptions that inevitably bias the behavioural analysis. The behavioural units resulting from the automatic method are interpreted by comparing them with hand-designed behavioural units. The effectiveness of the automatic method is evaluated by measuring the extent to which the movement of a simulated worm, with an agent-based model, matches the movement of a natural worm. Our results indicate that spatio-temporal locomotory patterns emerge even from single-point worm tracking. Moreover, we show that such patterns represent a fundamental aspect of the behavioural classification process.
zh

[CV-116] Survey of AI-Powered Approaches for Osteoporosis Diagnosis in Medical Imaging

链接: https://arxiv.org/abs/2510.00061
作者: Abdul Rahman,Bumshik Lee
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 56 pages, 18 figures

点击查看摘要

[CV-117] Variable Rate Image Compression via N-Gram Context based Swin-transformer

【速读】:该论文旨在解决传统学习图像压缩方法在高分辨率图像重建中因感受野受限而导致的重建质量下降问题,尤其是Swin Transformer模型在处理大区域信息时的不足。其解决方案的关键在于引入N-gram上下文机制,增强模型对相邻窗口间上下文信息的感知能力,从而扩展像素恢复时考虑的区域范围,提升高分辨率重建质量;该改进使模型在单一架构下实现可变率压缩,并在BD-Rate指标上相较现有技术提升5.86%,同时显著改善感兴趣区域(ROI)的重建质量,适用于工业视觉等对象聚焦场景。

链接: https://arxiv.org/abs/2510.00058
作者: Priyanka Mudgal,Feng Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted at ISVC 2025

点击查看摘要

Abstract:This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, leading to a 5.86% BD-Rate reduction over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
zh

[CV-118] Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study

【速读】:该论文旨在解决生成式 AI(Generative AI)在皮肤疾病诊断中因训练数据偏向浅色皮肤而引发的性能偏差问题,尤其关注不同Fitzpatrick肤色类型下的公平性与准确性差异。其解决方案的关键在于:基于预训练的SkinGPT-4视觉语言模型,开发针对特定皮肤疾病的微调模型,并结合公平性指标(如demographic parity和equalized odds)进行优化,最终实现对深色皮肤人群更均衡的诊断性能,显著提升模型在不同肤色群体中的公平性和临床实用性。

链接: https://arxiv.org/abs/2510.00055
作者: Kiran Nijjer,Ryan Bui,Derek Jiu,Adnan Ahmed,Peter Wang,Benjamin Liu,Kevin Zhu,Lilly Zhu
机构: Stanford University (斯坦福大学); Algoverse AI Research (Algoverse AI 研究)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Accepted to EADV (European Academy of Dermatology) and SID (Society for Investigative Dermatology)

点击查看摘要

Abstract:SkinGPT-4, a large vision-language model, leverages annotated skin disease images to augment clinical workflows in underserved communities. However, its training dataset predominantly represents lighter skin tones, limiting diagnostic accuracy for darker tones. Here, we evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases, including eczema, allergic-contact dermatitis, and psoriasis using the open-sourced SCIN dataset. We leveraged the SkinGPT-4 backbone to develop finetuned models for custom skin disease classification tasks and explored bias mitigation strategies. Clinical evaluation by board-certified dermatologists on six relevant skin diseases from 300 SCIN cases assessed images for diagnostic accuracy, informativity, physician utility, and patient utility. Model fairness metrics, including demographic parity and equalized odds, were calculated across skin tones. SkinGPT-4 achieved an average demographic parity of 0.10 across Fitzpatrick types, with notable differences of 0.10-0.15 between lightest and darkest tones across evaluation metrics. Model hallucinations in artifacts and anatomy occurred at a rate of 17.8%. Our customized models achieved average F1, precision, and AUROC of 0.75, 0.78, and 0.78 across visually similar disease pairs. Fairness analysis showed an average demographic parity of 0.75, with a maximum disparity of 0.21 across skin tones. The best model achieved parity scores of 0.83, 0.83, 0.76, 0.89, 0.90, and 0.90 for Fitzpatrick I-VI, indicating robust fairness. Large language models such as SkinGPT-4 showed weaker performance on darker tones. Model biases exist across evaluation criteria, and hallucinations may affect diagnostic efficacy. These findings demonstrate the efficacy of training accurate, fair models using existing backbones for custom skin disease classification.
zh
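
摘要中报告了 demographic parity 与 equalized odds 等公平性指标。公平性指标存在差距式、比率式等多种口径,论文采用的具体定义未知;下面按常见的“组间差距”形式给出最小实现,仅作示意:

```python
import numpy as np

def demographic_parity_gap(y_pred, groups):
    """各分组阳性预测率的最大组间差距,越小越公平。"""
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def equalized_odds_gap(y_true, y_pred, groups):
    """各分组 TPR 与 FPR 的最大组间差距,取两者中较大者。"""
    tprs, fprs = [], []
    for g in np.unique(groups):
        m = groups == g
        tprs.append(y_pred[m & (y_true == 1)].mean())
        fprs.append(y_pred[m & (y_true == 0)].mean())
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 600)
y_pred = rng.integers(0, 2, 600)
skin_type = rng.integers(1, 7, 600)  # 模拟 Fitzpatrick I-VI 分组
print(demographic_parity_gap(y_pred, skin_type),
      equalized_odds_gap(y_true, y_pred, skin_type))
```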

[CV-119] DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction

【速读】:该论文旨在解决病理全切片图像(Whole-Slide Images, WSI)生存分析中现有方法普遍存在的可解释性不足以及在异质性切片图像中忽略预测不确定性的问题。其解决方案的关键在于提出一种双原型证据融合网络(Dual-Prototype Whole-Slide Image Evidential Fusion Network, DPsurv),该方法不仅能输出带有不确定性的生存区间,还通过patch原型分配图、组件原型及组件级相对风险聚合实现多层次的可解释性,从而提升模型决策的透明度与可信度。

链接: https://arxiv.org/abs/2510.00053
作者: Yucheng Xing,Ling Huang,Jingying Ma,Ruping Hong,Jiangdong Qiu,Pei Liu,Kai He,Huazhu Fu,Mengling Feng
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. Experiments on five publicly available datasets achieve the highest mean concordance index and the lowest mean integrated Brier score, validating the effectiveness and reliability of DPsurv. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.
zh

[CV-120] Latent Representation Learning from 3D Brain MRI for Interpretable Prediction in Multiple Sclerosis

【速读】:该论文旨在解决3D脑部磁共振成像(MRI)中缺乏可解释生物标志物的问题,传统统计模型和浅层机器学习方法预测能力有限,而大多数深度学习方法则表现为“黑箱”模型,难以提供临床可理解的特征。其解决方案的关键在于提出InfoVAE-Med3D,一种基于信息最大化约束的变分自编码器(Variational Autoencoder, VAE)扩展方法,通过显式最大化图像与潜在变量之间的互信息(Mutual Information, MI),生成结构紧凑且保留临床意义信息的潜在表示(latent representations)。该方法在健康对照组和多发性硬化症患者数据集上均展现出优越的脑龄预测和符号数字模式测试(SDMT)评分回归性能,并形成直观可解释的聚类,从而实现了预测准确性与可解释性的统一。

链接: https://arxiv.org/abs/2510.00051
作者: Trinh Ngoc Huynh,Nguyen Duc Kien,Nguyen Hai Anh,Dinh Tran Hiep,Manuela Vaneckova,Tomas Uher,Jeroen Van Schependom,Stijn Denissen,Tran Quoc Long,Nguyen Linh Trung,Guy Nagels
机构: Vietnam National University, Hanoi (河内国家大学); VNU-HCM (胡志明市国家大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: The abstract has been condensed to under 1920 characters

点击查看摘要

Abstract:We present InfoVAE-Med3D, a latent-representation learning approach for 3D brain MRI that targets interpretable biomarkers of cognitive decline. Standard statistical models and shallow machine learning often lack power, while most deep learning methods behave as black boxes. Our method extends InfoVAE to explicitly maximize mutual information between images and latent variables, producing compact, structured embeddings that retain clinically meaningful content. We evaluate on two cohorts: a large healthy-control dataset (n=6527) with chronological age, and a clinical multiple sclerosis dataset from Charles University in Prague (n=904) with age and Symbol Digit Modalities Test (SDMT) scores. The learned latents support accurate brain-age and SDMT regression, preserve key medical attributes, and form intuitive clusters that aid interpretation. Across reconstruction and downstream prediction tasks, InfoVAE-Med3D consistently outperforms other VAE variants, indicating stronger information capture in the embedding space. By uniting predictive performance with interpretability, InfoVAE-Med3D offers a practical path toward MRI-based biomarkers and more transparent analysis of cognitive deterioration in neurological disease.
zh

[CV-121] AI-Based Stroke Rehabilitation Domiciliary Assessment System with ST_GCN Attention

【速读】:该论文旨在解决中风患者在居家环境中进行连续、量化且一致的康复评估与反馈难题。当前康复训练多依赖临床环境,难以实现日常化和持续性监测,而现有技术缺乏对动作质量的精准识别与个性化反馈机制。解决方案的关键在于构建一个集成式家庭康复系统,其核心是基于RGB-D相机与可穿戴传感器的数据采集模块、移动端引导应用及AI服务器组成的闭环体系;其中,AI服务器采用RAST-G@模型——该模型融合时空图卷积网络(spatio-temporal graph convolutional network, ST-GCN)与基于Transformer的时间注意力机制,能够从骨骼序列中提取高维特征并准确判断动作质量,从而提供客观、可量化的评估结果。该方法通过自建NRC数据集(包含10项上肢日常生活活动和5项关节活动范围数据)验证了其有效性,并在KIMORE和NRC数据集上显著优于基线模型,在指标如平均绝对误差(MAD)、均方根误差(RMSE)和平均绝对百分比误差(MAPE)方面表现更优,实现了面向患者的个体化康复指导与远程监控。

链接: https://arxiv.org/abs/2510.00049
作者: Suhyeon Lim,Ye-eun Kim,Andrew J. Choi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages(except references), 7 figures 6 Tables

点击查看摘要

Abstract:Effective stroke recovery requires continuous rehabilitation integrated with daily living. To support this need, we propose a home-based rehabilitation exercise and feedback system. The system consists of (1) a hardware setup with an RGB-D camera and wearable sensors to capture patient movements, (2) a mobile application for exercise guidance, and (3) an AI server for assessment and feedback. When a stroke user exercises following the application guidance, the system records skeleton sequences, which are then assessed by the deep learning model, RAST-G@. The model employs a spatio-temporal graph convolutional network (ST-GCN) to extract skeletal features and integrates transformer-based temporal attention to assess action quality. For system implementation, we constructed the NRC dataset, which includes 10 upper-limb activities of daily living (ADL) and 5 range-of-motion (ROM) exercises collected from stroke and non-disabled participants, with score annotations provided by licensed physiotherapists. Results on the KIMORE and NRC datasets show that RAST-G@ improves over baselines in terms of MAD, RMSE, and MAPE. Furthermore, the system provides user feedback that combines patient-centered assessment and monitoring. The results demonstrate that the proposed system offers a scalable approach for quantitative and consistent domiciliary rehabilitation assessment.
zh

[CV-122] Deep Learning Approaches with Explainable AI for Differentiating Alzheimer Disease and Mild Cognitive Impairment

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer Disease, AD)与轻度认知障碍(Mild Cognitive Impairment, MCI)之间的早期精准鉴别问题,这是临床干预的关键前提,尤其在MCI作为AD前驱阶段时,其结构变化细微且易与其他正常老化混淆。解决方案的关键在于提出一种混合深度学习集成框架:首先利用预训练卷积神经网络(ResNet50、NASNet和MobileNet)对灰质和白质MRI切片进行端到端微调;随后引入堆叠集成学习策略,结合元学习器与加权平均机制以最优融合多个基模型输出;此外,通过梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)实现可解释性增强,生成热力图和归因图,揭示影响模型决策的关键脑区结构生物标志物。该方法在ADNI数据集上达到99.21%的AD vs. MCI分类准确率,显著优于传统迁移学习和基准集成方法,具备临床可推广的潜力。

链接: https://arxiv.org/abs/2510.00048
作者: Fahad Mostafa,Kannon Hossain,Hafiz Khan
机构: Arizona State University (亚利桑那州立大学); Florida Gulf Coast University (佛罗里达海湾海岸大学); Texas Tech University Health Sciences Center (德克萨斯理工大学健康科学中心)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Early and accurate diagnosis of Alzheimer Disease is critical for effective clinical intervention, particularly in distinguishing it from Mild Cognitive Impairment, a prodromal stage marked by subtle structural changes. In this study, we propose a hybrid deep learning ensemble framework for Alzheimer Disease classification using structural magnetic resonance imaging. Gray and white matter slices are used as inputs to three pretrained convolutional neural networks such as ResNet50, NASNet, and MobileNet, each fine tuned through an end to end process. To further enhance performance, we incorporate a stacked ensemble learning strategy with a meta learner and weighted averaging to optimally combine the base models. Evaluated on the Alzheimer Disease Neuroimaging Initiative dataset, the proposed method achieves state-of-the-art accuracy of 99.21% for Alzheimer Disease vs. Mild Cognitive Impairment and 91.0% for Mild Cognitive Impairment vs. Normal Controls, outperforming conventional transfer learning and baseline ensemble methods. To improve interpretability in image based diagnostics, we integrate Explainable AI techniques via Gradient-weighted Class Activation Mapping (Grad-CAM), which generates heatmaps and attribution maps that highlight critical regions in gray and white matter slices, revealing structural biomarkers that influence model decisions. These results highlight the framework's potential for robust and scalable clinical decision support in neurodegenerative disease diagnostics.
zh
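
针对文中“元学习器 + 加权平均”的堆叠集成,下面给出一个 scikit-learn 草图,演示两种融合方式。权重取值与元学习器类型为本文假设;严格的堆叠应使用基模型的折外(out-of-fold)预测以避免信息泄漏:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# base_probs:三个微调 CNN(如 ResNet50/NASNet/MobileNet)在验证集上的
# 二分类概率,形状均为 (N, 2);此处用随机数占位。
rng = np.random.default_rng(0)
base_probs = [rng.dirichlet([2, 2], size=200) for _ in range(3)]
y_val = rng.integers(0, 2, size=200)

# 方式一:加权平均(权重可按各模型单独验证精度设定,此处为假设值)
w = np.array([0.4, 0.35, 0.25])
avg_pred = sum(wi * p for wi, p in zip(w, base_probs)).argmax(axis=1)

# 方式二:堆叠——把基模型概率拼接为元特征,训练元学习器
X_meta = np.hstack(base_probs)  # (N, 6)
meta = LogisticRegression(max_iter=1000).fit(X_meta, y_val)
stacked_pred = meta.predict(X_meta)
print((avg_pred == y_val).mean(), (stacked_pred == y_val).mean())
```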

[CV-123] Deep Learning-Based Pneumonia Detection from Chest X-ray Images: A CNN Approach with Performance Analysis and Clinical Implications

【速读】:该论文旨在解决肺炎自动检测中诊断精度与临床可实施性之间的关键挑战,特别是在医学影像分析中提升模型性能的同时保障数据隐私、增强模型可解释性并实现与现有医疗系统集成。其解决方案的关键在于构建一个基于卷积神经网络(Convolutional Neural Networks, CNN)的深度学习框架,融合分离卷积(separable convolutions)、批归一化(batch normalization)和丢弃正则化(dropout regularization)等先进技术以优化特征提取并抑制过拟合;同时通过数据增强和自适应学习率策略提升模型泛化能力,并引入医学本体(medical ontologies)与语义技术将机器学习输出与结构化医学知识体系结合,从而显著提高诊断准确性与可解释性,推动AI在临床场景中的可靠部署。

链接: https://arxiv.org/abs/2510.00035
作者: P K Dutta,Anushri Chowdhury,Anouska Bhattacharyya,Shakya Chakraborty,Sujatra Dey
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Deep learning integration into medical imaging systems has transformed disease detection and diagnosis processes with a focus on pneumonia identification. The study introduces an intricate deep learning system using Convolutional Neural Networks for automated pneumonia detection from chest X-ray images, which boosts diagnostic precision and speed. The proposed CNN architecture integrates sophisticated methods including separable convolutions along with batch normalization and dropout regularization to enhance feature extraction while reducing overfitting. Through the application of data augmentation techniques and adaptive learning rate strategies the model underwent training on an extensive collection of chest X-ray images to enhance its generalization capabilities. A comprehensive set of evaluation metrics such as accuracy, precision, recall, and F1 score collectively verifies the model's exceptional performance, with an accuracy rate of 91%. This study tackles critical clinical implementation obstacles such as data privacy protection, model interpretability, and integration with current healthcare systems beyond just model performance. This approach introduces a critical advancement by integrating medical ontologies with semantic technology to improve diagnostic accuracy. The study enhances AI diagnostic reliability by integrating machine learning outputs with structured medical knowledge frameworks to boost interpretability. The findings demonstrate AI-powered healthcare tools as a scalable, efficient pneumonia detection solution. This study advances AI integration into clinical settings by developing more precise automated diagnostic methods that deliver consistent medical imaging results.
zh
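
摘要提到网络结合了可分离卷积、批归一化与 Dropout。下面是一个 PyTorch 深度可分离卷积块的最小草图(层次顺序、核大小与 Dropout 比例均为本文假设):

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """深度可分离卷积 + 批归一化 + Dropout 的示意模块。"""
    def __init__(self, in_ch: int, out_ch: int, p_drop: float = 0.2):
        super().__init__()
        self.block = nn.Sequential(
            # 逐通道卷积:每个输入通道单独做 3x3 空间卷积
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            # 逐点卷积:1x1 卷积做跨通道混合
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout2d(p_drop),  # 抑制过拟合
        )

    def forward(self, x):
        return self.block(x)

x = torch.randn(8, 1, 224, 224)   # 单通道胸片占位输入
y = SeparableConvBlock(1, 32)(x)  # -> (8, 32, 112, 112)
```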

[CV-124] Enhancing Safety in Diabetic Retinopathy Detection: Uncertainty-Aware Deep Learning Models with Rejection Capabilities

【速读】:该论文旨在解决深度学习模型在糖尿病视网膜病变(Diabetic Retinopathy, DR)诊断中因缺乏置信度估计而导致的临床不确定性问题,从而提升模型在安全关键场景下的可靠性。其解决方案的关键在于引入不确定性感知的深度学习模型(uncertainty-aware deep learning models),并结合拒绝机制(rejection mechanism)以剔除低置信度预测,同时通过延迟决策(deferred decision-making)策略实现对不确定预测的临床管理,从而在预测覆盖范围(coverage)与预测可靠性之间取得平衡。

链接: https://arxiv.org/abs/2510.00029
作者: Madhushan Ramalingam,Yaish Riaz,Priyanthi Rajamanoharan,Piyumi Dasanayaka
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: VBLL, Rejection threshold, Expected Calibration Error , Coverage, Rejection rate

点击查看摘要

Abstract:Diabetic retinopathy (DR) is a major cause of visual impairment, and effective treatment options depend heavily on timely and accurate diagnosis. Deep learning models have demonstrated great success in identifying DR from retinal images. However, relying only on predictions made by models, without any indication of model confidence, creates uncertainty and poses significant risk in clinical settings. This paper investigates an alternative in uncertainty-aware deep learning models, including a rejection mechanism to reject low-confidence predictions, contextualized by deferred decision-making in clinical practice. The results show there is a trade-off between prediction coverage and reliability. The Variational Bayesian model adopted a more conservative strategy when predicting DR, subsequently rejecting the uncertain predictions. The model is evaluated by means of important performance metrics such as accuracy on accepted predictions, the proportion of accepted cases (coverage), the rejection ratio, and Expected Calibration Error (ECE). The findings also demonstrate a clear trade-off between accuracy and caution, establishing that the use of uncertainty estimation and selective rejection improves the model's reliability in safety-critical diagnostic use cases.
zh
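
文中以拒绝低置信度预测换取被接受样本上的更高可靠性,并用期望校准误差(ECE)衡量校准质量。下面给出阈值拒绝与 ECE 的标准计算草图(阈值 tau 与分箱数为本文假设):

```python
import numpy as np

def selective_metrics(probs, labels, tau=0.8):
    """置信度阈值拒绝:返回覆盖率与被接受样本上的准确率。"""
    conf = probs.max(axis=1)
    accept = conf >= tau
    coverage = accept.mean()
    acc = (probs[accept].argmax(1) == labels[accept]).mean() if accept.any() else float("nan")
    return coverage, acc

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE:按置信度分箱,对 |bin 内准确率 - bin 内平均置信度| 加权求和。"""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            ece += m.mean() * abs((pred[m] == labels[m]).mean() - conf[m].mean())
    return ece

rng = np.random.default_rng(0)
probs = rng.dirichlet([3, 2], size=500)   # 占位预测概率
labels = rng.integers(0, 2, 500)
print(selective_metrics(probs, labels), expected_calibration_error(probs, labels))
```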

人工智能

[AI-0] COM-BOM: Bayesian Exemplar Search for Efficiently Exploring the Accuracy-Calibration Pareto Frontier EMNLP2025

【速读】:该论文旨在解决当前上下文学习(in-context learning)中示例选择(exemplar selection)的局限性问题,即现有方法仅优化预测准确性,而忽视了模型校准(model calibration)这一影响可信度与安全部署的关键因素。解决方案的核心在于将示例选择建模为多目标优化问题,同时最大化预测准确性和最小化期望校准误差(expected calibration error),并采用一种样本高效的组合贝叶斯优化算法(Combinatorial Bayesian Optimization algorithm, COM-BOM)来高效求解帕累托前沿(Pareto front),从而在准确性和校准性之间实现最优权衡。实验表明,COM-BOM在多个任务上优于或匹配基线方法,且显著减少大语言模型(LLM)API调用次数。

链接: https://arxiv.org/abs/2510.01178
作者: Gaoxiang Luo,Aryan Deshwal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025 Main, Code: this https URL

点击查看摘要

Abstract:Selecting an optimal set of exemplars is critical for good performance of in-context learning. However, prior exemplar search methods narrowly optimize for predictive accuracy, critically neglecting model calibration–a key determinant of trustworthiness and safe deployment. In this paper, we formulate exemplar selection as a multi-objective optimization problem, explicitly targeting both the maximization of predictive accuracy and the minimization of expected calibration error. We solve this problem with a sample-efficient Combinatorial Bayesian Optimization algorithm (COM-BOM) to find the Pareto front that optimally trades off the two objectives of accuracy and calibration. We evaluate COM-BOM on multiple tasks from the unsaturated MMLU-Pro benchmark and find that COM-BOM beats or matches the baselines at jointly optimizing the two objectives, while requiring a minimal number of LLM API calls.
zh
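
COM-BOM 在(准确率,ECE)双目标下求帕累托前沿。下面给出非支配解筛选的朴素 O(N²) 示意(候选点数值为虚构):

```python
import numpy as np

def pareto_front(points):
    """points: (N, 2) 数组,列为 (accuracy, ECE);accuracy 越大越好,ECE 越小越好。
    返回非支配解的下标。"""
    front = []
    for i, (acc_i, ece_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and ece_j <= ece_i and (acc_j > acc_i or ece_j < ece_i)
            for j, (acc_j, ece_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

cands = np.array([[0.81, 0.03], [0.84, 0.09], [0.80, 0.05], [0.86, 0.04]])
print(pareto_front(cands))  # [0, 3]:分别在校准与准确率上不可被同时超越
```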

[AI-1] Fiaingen: A financial time series generative method matching real-world data quality

【速读】:该论文旨在解决金融领域中机器学习模型因真实数据稀缺而导致性能受限的问题,尤其是不同金融资产的数据量、质量和多样性不足限制了模型在投资与交易决策中的表现。其解决方案的关键在于提出了一套新颖的时间序列数据生成技术(命名为Fiaingen),通过在低维空间中实现真实数据与合成数据的高度重叠、提升下游机器学习任务的性能以及保持接近秒级的数据生成时间,从而在保证生成数据质量的同时显著提升了可扩展性。实验表明,Fiaingen方法在三项评估指标上均达到当前最优水平,且基于合成数据训练的模型性能接近使用真实数据训练的结果。

链接: https://arxiv.org/abs/2510.01169
作者: Jože M. Rožanec,Tina Žezlin,Laurentiu Vasiliu,Dunja Mladenić,Radu Prodan,Dumitru Roman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data is vital in enabling machine learning models to advance research and practical applications in finance, where accurate and robust models are essential for investment and trading decision-making. However, real-world data is limited despite its quantity, quality, and variety. The data shortage of various financial assets directly hinders the performance of machine learning models designed to trade and invest in these assets. Generative methods can mitigate this shortage. In this paper, we introduce a set of novel techniques for time series data generation (we name them Fiaingen) and assess their performance across three criteria: (a) overlap of real-world and synthetic data on a reduced dimensionality space, (b) performance on downstream machine learning tasks, and (c) runtime performance. Our experiments demonstrate that the methods achieve state-of-the-art performance across the three criteria listed above. Synthetic data generated with Fiaingen methods more closely mirrors the original time series data while keeping data generation time close to seconds - ensuring the scalability of the proposed approach. Furthermore, models trained on it achieve performance close to those trained with real-world data.
zh

[AI-2] Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLM s?

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型推理中因依赖在线策略(on-policy)训练而导致的效率与可扩展性受限问题。现有方法在异步RL系统中虽通过解耦采样与训练缓解了这一瓶颈,但其性能易受高延迟采样数据(stale data)影响,导致优化不稳定甚至崩溃。解决方案的关键在于提出M2PO(Second-Moment Trust Policy Optimization),其核心创新是通过约束重要性权重的二阶矩(second moment)来抑制极端异常值,同时保留具有信息量的更新信号,从而实现对高延迟数据的有效利用。实验表明,M2PO显著降低剪裁令牌比例(从1.22%降至0.06%),并可在数据滞后至少256个模型更新的情况下维持稳定训练,最终达到与在线策略相当的性能。

链接: https://arxiv.org/abs/2510.01161
作者: Haizhong Zheng,Jiawei Zhao,Bedi Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
zh
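
下面按摘要思路给出一个极简 PyTorch 草图(非官方实现):对一批重要性权重施加二阶矩预算,只屏蔽偏离最远的极端 token。预算值 m2_budget 与"按 |log w| 从大到小剔除"的次序均为笔者假设。

```python
import torch

def m2_mask(ratios, m2_budget=2.0):
    """
    极简示意:当重要性权重的二阶矩 E[w^2] 超过预算时,
    迭代屏蔽离 1 最远的极端权重,直到满足约束;其余 token 照常参与更新。
    """
    mask = torch.ones_like(ratios, dtype=torch.bool)
    order = torch.argsort(ratios.log().abs(), descending=True)  # 偏离程度从大到小
    for i in order:
        if (ratios[mask] ** 2).mean() <= m2_budget:
            break
        mask[i] = False
    return mask

ratios = torch.tensor([0.9, 1.1, 1.0, 8.0, 0.05, 1.2])  # 假想的 pi_new/pi_old 权重
print(m2_mask(ratios))  # 极端的 0.05 与 8.0 被屏蔽,其余保留
```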

[AI-3] Generalized Parallel Scaling with Interdependent Generations

【速读】:该论文旨在解决并行大语言模型(Large Language Model, LLM)推理扩展中生成响应独立性导致的计算资源浪费与信息利用不足问题。传统方法在对单个输入提示并行生成多个响应时,各响应间缺乏交互,无法共享已产生的中间状态(hidden states),从而限制了生成质量与一致性。其解决方案的关键在于提出Bridge机制,通过将批量隐藏状态视为整体张量(holistic tensors)而非独立切片,引入少量新增参数(仅2.8%–5.1%),实现并行响应间的相互依赖生成,从而提升强化学习奖励验证下的平均准确率(最高提升50%)及正确响应的一致性,且无需重新训练即可适配任意生成宽度(generation width)。

链接: https://arxiv.org/abs/2510.01143
作者: Harry Dong,David Brandfonbrener,Eryk Helenowski,Yun He,Mrinal Kumar,Han Fang,Yuejie Chi,Karthik Abinav Sankararaman
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Parallel LLM inference scaling involves sampling a set of N > 1 responses for a single input prompt. However, these N parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 50% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
zh
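
以下是一个概念性的 PyTorch 草图,示意"把批量隐藏状态当作整体张量"的最小做法:在并行响应维上做一次轻量的门控混合,使每条响应都能读取其余响应的信息。结构与参数规模均为笔者假设,仅用于说明思路,并非 Bridge 的官方结构。

```python
import torch
import torch.nn as nn

class BatchMix(nn.Module):
    """在隐藏状态张量 (N, T, D) 的并行响应维 N 上做轻量混合;参数量只随 D 增长。"""
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)      # 每个位置的聚合权重
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, h):                      # h: (N, T, D),N 条并行生成
        w = torch.softmax(self.gate(h), dim=0)        # 在 N 维上归一化
        shared = (w * h).sum(dim=0, keepdim=True)     # (1, T, D) 的共享摘要
        return h + self.proj(shared)           # 残差注入,各响应互相"看见"

h = torch.randn(4, 16, 256)                   # 4 条并行响应、16 个 token、256 维
print(BatchMix(256)(h).shape)                 # torch.Size([4, 16, 256])
```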

[AI-4] Apriel-1.5-15b-Thinker

【速读】:该论文旨在解决当前多模态推理模型依赖大规模参数和计算资源才能达到前沿性能的问题,尤其是在资源受限环境下难以实现高性能多模态推理的挑战。解决方案的关键在于提出一种数据驱动的渐进式三阶段训练方法:首先通过深度扩展(depth upscaling)提升推理能力而不从头预训练;其次采用分阶段持续预训练,先构建基础文本与视觉理解能力,再利用针对性合成数据增强空间结构、组合理解和细粒度感知等视觉推理能力;最后在高质量纯文本指令微调中引入显式的推理路径,覆盖数学、编程、科学和工具使用等多个领域。该方法无需强化学习或偏好优化即可实现与大模型相当的性能,显著降低了对算力的需求,证明了中等规模模型通过精心设计的训练策略可逼近前沿水平。

链接: https://arxiv.org/abs/2510.01141
作者: Shruthan Radhakrishna,Aman Tiwari,Aanjaneya Shukla,Masoud Hashemi,Rishabh Maheshwary,Shiva Krishna Reddy Malay,Jash Mehta,Pulkit Pattnaik,Saloni Mittal,Khalil Slimi,Kelechi Ogueji,Akintunde Oladipo,Soham Parikh,Oluwanifemi Bamgbose,Toby Liang,Ahmed Masry,Khyati Mahajan,Sai Rajeswar Mudumba,Vikas Yadav,Sathwik Tejaswi Madhusudhan,Torsten Scholak,Sagar Davasam,Srinivas Sunkara,Nicholas Chapados
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Apriel-1.5-15B-Thinker, a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.
zh

[AI-5] TabINR: An Implicit Neural Representation Framework for Tabular Data Imputation

【速读】:该论文旨在解决现实世界中表格数据(tabular data)普遍存在缺失值的问题,这类缺失会显著降低下游模型的性能或限制其适用性。传统插补策略常引入偏差或扭曲数据分布,而现有深度学习方法在高维场景下表现不稳定且推理速度较慢。解决方案的关键在于提出一种基于自编码器的隐式神经表示(Implicit Neural Representation, INR)框架 TabINR,该框架将表格建模为神经函数,并通过可学习的行嵌入(row embeddings)和特征嵌入(feature embeddings)有效捕捉表格数据的离散结构,从而实现无需修改已训练模型即可进行实例自适应插补的能力,同时在多种缺失机制和高维数据上展现出优于经典方法(如KNN、MICE、MissForest)及主流深度学习模型(如GAIN、ReMasker)的插补精度与鲁棒性。

链接: https://arxiv.org/abs/2510.01136
作者: Vincent Ochs,Florentin Bieder,Sidaty el Hadramy,Paul Friedrich,Stephanie Taha-Mehlitz,Anas Taha,Philippe C. Cattin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tabular data builds the basis for a wide range of applications, yet real-world datasets are frequently incomplete due to collection errors, privacy restrictions, or sensor failures. As missing values degrade the performance or hinder the applicability of downstream models, and while simple imputing strategies tend to introduce bias or distort the underlying data distribution, we require imputers that provide high-quality imputations, are robust across dataset sizes and yield fast inference. We therefore introduce TabINR, an auto-decoder based Implicit Neural Representation (INR) framework that models tables as neural functions. Building on recent advances in generalizable INRs, we introduce learnable row and feature embeddings that effectively deal with the discrete structure of tabular data and can be inferred from partial observations, enabling instance adaptive imputations without modifying the trained model. We evaluate our framework across a diverse range of twelve real-world datasets and multiple missingness mechanisms, demonstrating consistently strong imputation accuracy, mostly matching or outperforming classical (KNN, MICE, MissForest) and deep learning based models (GAIN, ReMasker), with the clearest gains on high-dimensional datasets.
zh
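
以下是按论文思路的一个假想最小实现:用可学习的行/列嵌入加一个小解码器,把表格建模为函数 f(行, 列) -> 单元格值,只在观测到的单元格上训练,缺失单元格的插补即对缺失位置做一次前向求值。维度与网络结构均为笔者假设,并非论文发布的代码。

```python
import torch
import torch.nn as nn

class TabularINR(nn.Module):
    """自解码式 INR 草图:行/列嵌入可学习,缺失单元格不参与损失。"""
    def __init__(self, n_rows, n_cols, dim=32):
        super().__init__()
        self.row_emb = nn.Embedding(n_rows, dim)
        self.col_emb = nn.Embedding(n_cols, dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, rows, cols):
        z = torch.cat([self.row_emb(rows), self.col_emb(cols)], dim=-1)
        return self.decoder(z).squeeze(-1)

# 只在观测到的 (row, col, value) 三元组上训练
model = TabularINR(n_rows=100, n_cols=8)
rows = torch.tensor([0, 0, 1]); cols = torch.tensor([2, 5, 2])
vals = torch.tensor([0.3, -1.2, 0.7])
loss = ((model(rows, cols) - vals) ** 2).mean()
loss.backward()  # 插补 = model(缺失位置的行、列索引)
```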

[AI-6] Rethinking Thinking Tokens: LLMs as Improvement Operators

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理训练中因生成长链思维(Long Chain of Thought, Long CoT)而导致的上下文长度膨胀、计算成本增加和响应延迟上升的问题,同时希望实现更优的准确率与更低的资源消耗之间的权衡。其解决方案的关键在于提出一种名为“并行蒸馏精炼”(Parallel-Distill-Refine, PDR)的新型推理范式:首先并行生成多样化的思维草稿,随后将这些草稿压缩到有限的文本工作空间中进行蒸馏,最后基于该工作空间迭代精炼出最终输出。该方法通过控制并行度可灵活调节上下文长度和计算开销,且不依赖于总生成token数,从而打破了传统长CoT中上下文长度与生成质量强耦合的限制;实验表明,PDR在多项数学任务上显著优于长CoT,在相同序列预算下表现更优,尤其当并行度设为1时所对应的顺序精炼(Sequential Refinement, SR)策略也超越了长CoT。进一步地,作者通过强化学习(Reinforcement Learning, RL)训练一个8B参数的思考模型使其适应PDR推理机制,在验证答案可判定的任务中实现了比单次推理基线更高的性能提升。

链接: https://arxiv.org/abs/2510.01123
作者: Lovish Madaan,Aniket Didolkar,Suchin Gururangan,John Quan,Ruan Silva,Ruslan Salakhutdinov,Manzil Zaheer,Sanjeev Arora,Anirudh Goyal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages

点击查看摘要

Abstract:Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own “thoughts” with a continuum of possible strategies. We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR) (iteratively improve a single candidate answer) which provides performance superior to long CoT. Success of such model orchestrations raises the question whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
zh
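
下面用一个假想的字符串接口 llm()(任意 str -> str 的生成函数)勾勒 PDR 的三步循环;提示词、工作区大小与轮数均为笔者虚构的示意,并非论文使用的模板。

```python
def parallel_distill_refine(llm, prompt, width=4, rounds=2):
    """
    极简示意:每轮 (i) 并行生成 width 份草稿,(ii) 蒸馏进一个有界的文本工作区,
    (iii) 在工作区条件下精炼出下一轮的种子答案。
    上下文长度由 width 与工作区大小控制,而非生成的总 token 数。
    """
    answer = ""
    for _ in range(rounds):
        drafts = [llm(f"{prompt}\n当前答案:{answer}\n请独立给出一份解答草稿。")
                  for _ in range(width)]            # (i) 并行起草
        workspace = llm("把以下草稿压缩为一段要点(不超过 200 字):\n"
                        + "\n---\n".join(drafts))    # (ii) 蒸馏到有界工作区
        answer = llm(f"{prompt}\n要点:{workspace}\n据此给出精炼后的完整解答。")  # (iii) 精炼
    return answer

fake_llm = lambda p: "(模型输出占位)"   # 任何 str -> str 的生成函数都可接入
print(parallel_distill_refine(fake_llm, "证明 1+1=2"))
```

将 width 设为 1 即得到摘要中的顺序精炼(SR)子情形:反复改进同一个候选答案。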

[AI-7] Exploring Network-Knowledge Graph Duality: A Case Study in Agentic Supply Chain Risk Analysis

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理金融风险分析中复杂、多模态及网络原生数据时面临的挑战,即标准检索增强生成(Retrieval-Augmented Generation, RAG)方法过于简化关系,而专用模型则成本高且静态不可扩展。解决方案的关键在于构建一个以LLM为中心的智能体(agent)框架,利用供应链网络与知识图谱(Knowledge Graph, KG)之间的内在二元性——将供应链网络视为KG,并基于结构网络科学原理进行检索;通过由网络中心性得分引导的图遍历器高效提取最具经济意义的风险路径;同时,引入创新的“上下文壳”(context shells)机制,将原始数值数据嵌入自然语言描述模板中,使定量信息对LLM完全可理解,从而实现无需昂贵微调或专用图数据库支持下的实时、简洁、可解释且上下文丰富的风险叙事生成。

链接: https://arxiv.org/abs/2510.01115
作者: Evan Heus,Rick Bookstaber,Dhruv Sharma
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH); Physics and Society (physics.soc-ph)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with the complex, multi-modal, and network-native data underlying financial risk. Standard Retrieval-Augmented Generation (RAG) oversimplifies relationships, while specialist models are costly and static. We address this gap with an LLM-centric agent framework for supply chain risk analysis. Our core contribution is to exploit the inherent duality between networks and knowledge graphs (KG). We treat the supply chain network as a KG, allowing us to use structural network science principles for retrieval. A graph traverser, guided by network centrality scores, efficiently extracts the most economically salient risk paths. An agentic architecture orchestrates this graph retrieval alongside data from numerical factor tables and news streams. Crucially, it employs novel "context shells" – descriptive templates that embed raw figures in natural language – to make quantitative data fully intelligible to the LLM. This lightweight approach enables the model to generate concise, explainable, and context-rich risk narratives in real-time without costly fine-tuning or a dedicated graph database.
zh
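
"上下文壳"的思路可以用一个很小的模板函数来体会:把原始数字嵌进一段自然语言描述,再交给 LLM。以下字段与措辞均为笔者虚构的示意,并非论文的模板。

```python
def context_shell(entity, metric, value, unit, period):
    """极简示意:把原始数值包进自然语言,使 LLM 能直接"读懂"定量信息。"""
    return (f"在{period},{entity} 的 {metric} 为 {value}{unit};"
            f"该数值将与同网络中其他节点比较,以评估风险暴露。")

print(context_shell("供应商A", "对外营收集中度", 62, "%", "2025年第二季度"))
```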

[AI-8] PRISM-Consult: A Panel-of-Experts Architecture for Clinician-Aligned Diagnosis

【速读】:该论文旨在解决临床场景中生成式 AI(Generative AI)模型在急诊科(Emergency Department)应用时面临的效率、安全性和可解释性难题,尤其是如何在保证诊断准确性的同时实现低延迟、高可审计性的专科咨询(consultation)。其解决方案的关键在于提出 PRISM-Consult 架构——一种基于轻量级路由器的专家路由系统(clinician-aligned panel-of-experts architecture),将统一的小型 Transformer 模型(PRISM)扩展为多个领域专业化模型(如心血管、呼吸、胃肠、骨骼肌肉、心理源性等),通过分析初始结构化临床事件(structured clinical events)的前几 token 来动态调度至最相关的专科模型。该设计实现了参数高效共享、领域专注推理与计算资源优化,在保持低开发困惑度(perplexity)的同时显著降低整体计算开销,并通过路由质量评估和安全优先策略保障临床部署可行性。

链接: https://arxiv.org/abs/2510.01114
作者: Lionel Levine,John Santerre,Alexander S. Young,T. Barry Levine,Francis Campion,Majid Sarrafzadeh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:We present PRISM-Consult, a clinician-aligned panel-of-experts architecture that extends the compact PRISM sequence model into a routed family of domain specialists. Episodes are tokenized as structured clinical events; a light-weight router reads the first few tokens and dispatches to specialist models (Cardiac-Vascular, Pulmonary, Gastro-Oesophageal, Musculoskeletal, Psychogenic). Each specialist inherits PRISM’s small transformer backbone and token template, enabling parameter efficiency and interpretability. On real-world Emergency Department cohorts, specialists exhibit smooth convergence with low development perplexities across domains, while the router achieves high routing quality and large compute savings versus consult-all under a safety-first policy. We detail the data methodology (initial vs. conclusive ICD-9 families), routing thresholds and calibration, and report per-domain results to avoid dominance by common events. The framework provides a practical path to safe, auditable, and low-latency consult at scale, and we outline validation steps (external/temporal replication, asymmetric life-threat thresholds, and multi-label arbitration) to meet prospective clinical deployment standards.
zh

[AI-9] Optimizing Fairness in Production Planning : A Human-Centric Approach to Machine and Workforce Allocation

【速读】:该论文旨在解决工业制造中生产计划优化与劳动力公平性之间的平衡问题,即如何在提升运营效率的同时保障工人的工作满意度和公平待遇。其解决方案的关键在于构建一个两层的人本化生产计划框架:第一层将订单-产线分配建模为约束规划(Constraint Programming, CP)问题,以生成高利用率且满足机器容量、加工时间和交货期约束的可行生产计划;第二层将工人-产线分配建模为马尔可夫决策过程(Markov Decision Process, MDP),整合工人偏好、经验、抗压能力及医疗限制等人类因素,通过贪婪分配、蒙特卡洛树搜索(MCTS)和强化学习(Reinforcement Learning, RL)三种策略进行优化。实证结果显示,该方法在降低延迟率和提高分配公平性方面显著优于基线方案,验证了CP与学习驱动决策相结合在实现吞吐量与员工福祉协同优化方面的有效性。

链接: https://arxiv.org/abs/2510.01094
作者: Alexander Nasuta,Alessandro Cisi,Sylwia Olbrych,Gustavo Vieira,Rui Fernandes,Lucas Paletta,Marlene Mayr,Rishyank Chevuri,Robert Woitsch,Hans Aoyang Zhou,Anas Abdelrazeq,Robert H. Schmitt
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work presents a two-layer, human-centric production planning framework designed to optimize both operational efficiency and workforce fairness in industrial manufacturing. The first layer formulates the Order-Line allocation as a Constraint Programming (CP) problem, generating high-utilization production schedules that respect machine capacities, processing times, and due dates. The second layer models Worker-Line allocation as a Markov Decision Process (MDP), integrating human factors such as worker preference, experience, resilience, and medical constraints into the assignment process. Three solution strategies, greedy allocation, MCTS, and RL, are implemented and compared across multiple evaluation scenarios. The proposed system is validated through 16 test sessions with domain experts from the automotive industry, combining quantitative key performance indicators (KPIs) with expert ratings. Results indicate that the CP-based scheduling approach produces compact, feasible production plans with low tardiness, while the MDP-based worker allocation significantly improves fairness and preference alignment compared to baseline approaches. Domain experts rated both the Order-Line and Worker-Line components as effective and highlighted opportunities to further refine the objective function to penalize excessive earliness and improve continuity in worker assignments. Overall, the findings demonstrate that combining CP with learning-based decision-making provides a robust approach for human-centric production planning. The approach enables simultaneous optimization of throughput and workforce well-being, offering a practical foundation for fair and efficient manufacturing scheduling in industrial settings.
zh

[AI-10] Safety Instincts: LLM s Learn to Trust Their Internal Compass for Self-Defense

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)安全性保障中缺乏通用标准和可靠内容验证器的问题,从而难以获取有效的训练信号。其解决方案的关键在于发现并利用模型内部已存在的安全信念——即模型在面对有害请求时会以高置信度拒绝,而在生成潜在危险内容时则表现出高熵特征;通过引入Safety Instincts Reinforcement Learning (SIRL),将这种内在的置信度转化为自生成的奖励信号,使模型学会信任自身的安全直觉,从而在无需外部标注或人工验证的情况下实现高效对齐。该方法仅需15,000条未标注提示即可在多种攻击场景下保持89%以上的防御成功率,同时维持数学、编程与对话等任务性能。

链接: https://arxiv.org/abs/2510.01088
作者: Guobin Shen,Dongcheng Zhao,Haibo Tong,Jindong Li,Feifei Zhao,Yi Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal–models intrinsically “know” when to refuse. We introduce Safety Instincts Reinforcement Learning (SIRL), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. SIRL teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, SIRL maintains 89%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to adaptive attacks. Using only 15,000 unlabeled prompts, SIRL surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
zh
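
按摘要描述的"熵差即信号"思路,给出一个高度简化的奖励函数草图:以生成过程的平均 token 熵衡量模型自身的确信程度。阈值 tau 与二值打分形式均为笔者假设,并非 SIRL 算法本身。

```python
import torch
import torch.nn.functional as F

def sirl_reward(logits, tau=1.5):
    """
    示意:logits 形状为 (T, V),是生成每个 token 时的词表分布;
    低平均熵(高置信拒绝)记 +1,高平均熵(可能在生成危险内容)记 -1。
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # 每步的熵
    return 1.0 if entropy.mean().item() < tau else -1.0

print(sirl_reward(torch.randn(20, 32000)))  # 随机 logits 的熵很高 -> -1.0
```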

[AI-11] CodeGenLink: A Tool to Find the Likely Origin and License of Automatically Generated Code

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在软件开发中生成代码时缺乏可信度和潜在版权或许可证违规的问题,根源在于生成代码缺少出处信息(code provenance)。解决方案的关键在于提出 CodeGenLink,这是一个集成于 Visual Studio Code 的 GitHub Copilot 扩展,其核心机制是结合 LLM 的 Web 搜索能力与代码相似性分析:首先通过 LLM 获取候选代码链接,再对生成代码与检索到的代码进行相似性比对,从而过滤无关链接,并在可能的情况下标注出源代码的许可证信息。

链接: https://arxiv.org/abs/2510.01077
作者: Daniele Bifolco,Guido Annicchiarico,Pierluigi Barbiero,Massimiliano Di Penta,Fiorella Zampetti
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025), November 16-20 2025, Seoul, South Korea

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used in software development tasks nowadays. Unlike reusing code taken from the Web, for LLMs’ generated code, developers are concerned about its lack of trustworthiness and possible copyright or licensing violations, due to the lack of code provenance information. This paper proposes CodeGenLink, a GitHub CoPilot extension for Visual Studio Code aimed at (i) suggesting links containing code very similar to automatically generated code, and (ii) whenever possible, indicating the license of the likely origin of the code. CodeGenLink retrieves candidate links by combining LLMs with their web search features and then performs similarity analysis between the generated and retrieved code. Preliminary results show that CodeGenLink effectively filters unrelated links via similarity analysis and provides licensing information when available. Tool URL: this https URL Tool Video: this https URL
zh

[AI-12] Typed Chain-of-Thought: A Curry-Howard Framework for Verifying LLM Reasoning

【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)提示中推理轨迹的忠实性(faithfulness)问题,即生成的自然语言推理步骤是否真正反映模型内部的计算逻辑,这对大语言模型的可解释性至关重要。解决方案的关键在于引入柯里-霍华德对应(Curry-Howard correspondence)作为理论基础,将忠实的推理轨迹类比为类型正确的程序,其中每一步推理对应一个类型化的逻辑推导;进而提出方法将CoT中的非形式化自然语言步骤映射为形式化的类型证明结构,成功转换即构成计算忠实性的强验证凭证,从而实现从启发式解释向形式化验证的跨越。

链接: https://arxiv.org/abs/2510.01069
作者: Elija Perrier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:While Chain-of-Thought (CoT) prompting enhances the reasoning capabilities of large language models, the faithfulness of the generated rationales remains an open problem for model interpretability. We propose a novel theoretical lens for this problem grounded in the Curry-Howard correspondence, which posits a direct relationship between formal proofs and computer programs. Under this paradigm, a faithful reasoning trace is analogous to a well-typed program, where each intermediate step corresponds to a typed logical inference. We operationalise this analogy, presenting methods to extract and map the informal, natural language steps of CoT into a formal, typed proof structure. Successfully converting a CoT trace into a well-typed proof serves as a strong, verifiable certificate of its computational faithfulness, moving beyond heuristic interpretability towards formal verification. Our framework provides a methodology to transform plausible narrative explanations into formally verifiable programs, offering a path towards building more reliable and trustworthy AI systems.
zh

[AI-13] CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理任务训练中因课程学习(Curriculum Learning)策略不当导致的效率低下问题,具体表现为现有方法未能充分考虑提示(prompt)难度差异,或仅依赖简单过滤机制选择数据集,造成显著的计算资源浪费。其解决方案的关键在于从强化学习梯度优化的角度出发,系统性地分析并改进两个核心因素:一是训练提示的选择分布,二是不同提示间采样回放次数(rollout quantity)的分配策略;基于理论分析提出 CurES 方法,通过贝叶斯后验估计实现高效收敛与计算开销最小化,从而显著提升训练效率和稳定性。

链接: https://arxiv.org/abs/2510.01037
作者: Yongcheng Zeng,Zexu Sun,Bokai Ji,Erxue Min,Hengyi Cai,Shuaiqiang Wang,Dawei Yin,Haifeng Zhang,Xu Chen,Jun Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 10 Figures

点击查看摘要

Abstract:Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 points and +4.82 points with 1.5B and 7B models, respectively. Additionally, CurES exhibits faster convergence compared to baselines, including GRPO.
zh

[AI-14] Uncovering the Computational Ingredients of Human-Like Representations in LLM s

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在构建类人概念表征方面缺乏明确的计算成分解析,以及现有基准测试无法可靠衡量人类与模型之间表征对齐度的问题。解决方案的关键在于通过一个基于THINGS数据库概念的三元组相似性任务(triplet similarity task),系统评估超过70种具有不同架构、微调方法和训练数据的LLM,发现指令微调(instruction-finetuning)和注意力头维度较大是提升模型与人类表征对齐度的核心因素;同时指出当前主流基准测试(如MMLU优于MUSR)虽部分相关,但均无法充分解释表征对齐的方差,揭示了现有评估体系在捕捉人类-人工智能对齐方面的局限性。

链接: https://arxiv.org/abs/2510.01030
作者: Zach Studdiford,Timothy T. Rogers,Kushin Mukherjee,Siddharth Suresh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:The ability to translate diverse patterns of inputs into structured patterns of behavior has been thought to rest on both humans’ and machines’ ability to learn robust representations of relevant concepts. The rapid advancement of transformer-based large language models (LLMs) has led to a diversity of computational ingredients – architectures, fine tuning methods, and training datasets among others – but it remains unclear which of these ingredients are most crucial for building models that develop human-like representations. Further, most current LLM benchmarks are not suited to measuring representational alignment between humans and models, making benchmark scores unreliable for assessing if current LLMs are making progress towards becoming useful cognitive models. We address these limitations by first evaluating a set of over 70 models that widely vary in their computational ingredients on a triplet similarity task, a method well established in the cognitive sciences for measuring human conceptual representations, using concepts from the THINGS database. Comparing human and model representations, we find that models that undergo instruction-finetuning and which have larger dimensionality of attention heads are among the most human aligned, while multimodal pretraining and parameter size have limited bearing on alignment. Correlations between alignment scores and scores on existing benchmarks reveal that while some benchmarks (e.g., MMLU) are better suited than others (e.g., MUSR) for capturing representational alignment, no existing benchmark is capable of fully accounting for the variance of alignment scores, demonstrating their insufficiency in capturing human-AI alignment. Taken together, our findings help highlight the computational ingredients most essential for advancing LLMs towards models of human conceptual representation and address a key benchmarking gap in LLM evaluation.
zh

[AI-15] The Good, the Bad and the Sampled: a No-Regret Approach to Safe Online Classification

【速读】:该论文旨在解决在未知逻辑回归模型参数 θ* 和患者特征分布 P 的情况下,对个体进行序列化疾病检测时如何最小化测试次数,同时保证误分类率不超过预设阈值 α,且该保证以至少 1-δ 的概率成立的问题。解决方案的关键在于提出一种新颖的算法,该算法通过交错执行标签收集与分布估计,联合学习 θ* 与 P,并基于数据自适应地计算一个保守的阈值 τ_t,依据逻辑得分的绝对值 |x_t^⊤θ| 决定是否需要测试。理论分析表明,该方法在至少 1-δ 的概率下满足误差约束,并且相比已知 θ* 和 P 的理想基准,仅需额外 O(√T) 次测试,首次实现了误差约束下逻辑回归检测的无遗憾(no-regret)性能保证,适用于成本敏感的医疗筛查场景。

链接: https://arxiv.org/abs/2510.01020
作者: Tavor Z. Baharav,Spyros Dragazis,Aldo Pacchiano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注: 43 pages

点击查看摘要

Abstract:We study the problem of sequentially testing individuals for a binary disease outcome whose true risk is governed by an unknown logistic model. At each round, a patient arrives with feature vector x_t , and the decision maker may either pay to administer a (noiseless) diagnostic test–revealing the true label–or skip testing and predict the patient’s disease status based on their feature vector and prior history. Our goal is to minimize the total number of costly tests required while guaranteeing that the fraction of misclassifications does not exceed a prespecified error tolerance \alpha , with probability at least 1-\delta . To address this, we develop a novel algorithm that interleaves label-collection and distribution estimation to estimate both \theta^* and the context distribution P , and computes a conservative, data-driven threshold \tau_t on the logistic score |x_t^\top\theta| to decide when testing is necessary. We prove that, with probability at least 1-\delta , our procedure does not exceed the target misclassification rate, and requires only O(\sqrt{T}) excess tests compared to the oracle baseline that knows both \theta^* and the patient feature distribution P . This establishes the first no-regret guarantees for error-constrained logistic testing, with direct applications to cost-sensitive medical screening. Simulations corroborate our theoretical guarantees, showing that in practice our procedure efficiently estimates \theta^* while retaining safety guarantees, and does not require too many excess tests.
zh
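
其中的检测/预测决策规则可以用几行代码说明:当逻辑得分的绝对值超过保守阈值 τ_t 时直接按符号预测,否则付费检测取得真标签。阈值如何随估计误差界收缩是论文的核心内容,这里仅以常数示意。

```python
import numpy as np

def decide(x_t, theta_hat, tau_t):
    """极简示意:|x^T theta_hat| >= tau_t 时跳过检测、直接预测,否则付费检测。"""
    score = x_t @ theta_hat
    if abs(score) >= tau_t:
        return "predict", int(score > 0)
    return "test", None   # 取得真标签,并用于更新 theta_hat 与 tau_t

x_t = np.array([0.8, -0.3]); theta_hat = np.array([1.2, 0.5])
print(decide(x_t, theta_hat, tau_t=0.5))  # ('predict', 1):得分 0.81 >= 0.5
```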

[AI-16] Integrating AI and Ensemble Forecasting: Explainable Materials Planning with Scorecards and Trend Insights for a Large-Scale Manufacturer

【速读】:该论文旨在解决跨国家、多品类的售后需求预测(after-sales demand forecasting)中精度不足与业务可解释性弱的问题,尤其在面对复杂外部信号(如安装基数、价格、宏观经济指标、生命周期阶段和季节性)及突发事件(如新冠疫情作为独立状态)时,传统方法难以实现高精度且具决策指导意义的预测。解决方案的关键在于构建一个统一的集成模型架构:一方面通过帕累托感知的分群策略(Pareto-aware segmentation),对高收入部件进行个体建模,同时将长尾部分聚类处理以提升效率;另一方面采用基于业务损失函数(如WMAPE)的时序感知加权集成(horizon-aware ensembling),确保预测权重与实际业务目标一致。此外,引入角色驱动的分析层(role-driven analytics layer),结合生成式AI(LLMs)自动生成面向不同角色的可解释叙事,实现从预测到趋势诊断、根因分析和库存决策的闭环管理,从而支持全球90余国、约6,000种零部件的精准预测与动态监控。

链接: https://arxiv.org/abs/2510.01006
作者: Saravanan Venkatachalam
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents a practical architecture for after-sales demand forecasting and monitoring that unifies a revenue- and cluster-aware ensemble of statistical, machine-learning, and deep-learning models with a role-driven analytics layer for scorecards and trend diagnostics. The framework ingests exogenous signals (installed base, pricing, macro indicators, life cycle, seasonality) and treats COVID-19 as a distinct regime, producing country-part forecasts with calibrated intervals. A Pareto-aware segmentation forecasts high-revenue items individually and pools the long tail via clusters, while horizon-aware ensembling aligns weights with business-relevant losses (e.g., WMAPE). Beyond forecasts, a performance scorecard delivers decision-focused insights: accuracy within tolerance thresholds by revenue share and count, bias decomposition (over- vs under-forecast), geographic and product-family hotspots, and ranked root causes tied to high-impact part-country pairs. A trend module tracks trajectories of MAPE/WMAPE and bias across recent months, flags entities that are improving or deteriorating, detects change points aligned with known regimes, and attributes movements to lifecycle and seasonal factors. LLMs are embedded in the analytics layer to generate role-aware narratives and enforce reporting contracts. They standardize business definitions, automate quality checks and reconciliations, and translate quantitative results into concise, explainable summaries for planners and executives. The system exposes a reproducible workflow - request specification, model execution, database-backed artifacts, and AI-generated narratives - so planners can move from “How accurate are we now?” to “Where is accuracy heading and which levers should we pull?”, closing the loop between forecasting, monitoring, and inventory decisions across more than 90 countries and about 6,000 parts.
zh

[AI-17] Adaptive Federated Few-Shot Rare-Disease Diagnosis with Energy-Aware Secure Aggregation

【速读】:该论文旨在解决罕见病诊断中面临的三大核心挑战:数据极度稀缺、隐私保护需求严苛以及边缘设备资源受限。其解决方案的关键在于提出了一种自适应联邦小样本罕见病诊断框架(Adaptive Federated Few-Shot Rare-Disease Diagnosis, AFFR),通过三个协同机制实现:(i) 基于元学习的少样本联邦优化,以从有限患者样本中泛化出高精度模型;(ii) 能量感知的客户端调度策略,降低设备掉线率并保障参与均衡性;(iii) 校准差分隐私驱动的安全聚合机制,在保护敏感模型更新的同时维持临床可接受的隐私-效用平衡。该框架将上述模块整合为可部署于真实医疗网络的端到端流程,实验表明其在准确率上相较基线联邦学习提升最高达10%,且客户端掉线率降低超50%而不影响收敛性。

链接: https://arxiv.org/abs/2510.00976
作者: Aueaphum Aueawatthanaphisut
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 6 pages, 6 figures, 12 equations, 1 algorithm

点击查看摘要

Abstract:Rare-disease diagnosis remains one of the most pressing challenges in digital health, hindered by extreme data scarcity, privacy concerns, and the limited resources of edge devices. This paper proposes the Adaptive Federated Few-Shot Rare-Disease Diagnosis (AFFR) framework, which integrates three pillars: (i) few-shot federated optimization with meta-learning to generalize from limited patient samples, (ii) energy-aware client scheduling to mitigate device dropouts and ensure balanced participation, and (iii) secure aggregation with calibrated differential privacy to safeguard sensitive model updates. Unlike prior work that addresses these aspects in isolation, AFFR unifies them into a modular pipeline deployable on real-world clinical networks. Experimental evaluation on simulated rare-disease detection datasets demonstrates up to 10% improvement in accuracy compared with baseline FL, while reducing client dropouts by over 50% without degrading convergence. Furthermore, privacy-utility trade-offs remain within clinically acceptable bounds. These findings highlight AFFR as a practical pathway for equitable and trustworthy federated diagnosis of rare conditions.
zh

[AI-18] QUASAR: Quantum Assembly Code Generation Using Tool-Augmented LLMs via Agentic RL

【速读】:该论文旨在解决生成式 AI 在量子电路(quantum circuit)设计与优化中的两大核心挑战:一是参数化量子门需要精确的数值以实现最优性能,而这些参数受电路深度、门数量及布局等多重因素影响;二是大语言模型(LLM)因缺乏量子领域专业知识,常生成语法或语义错误的量子电路。解决方案的关键在于提出 QUASAR 框架,其基于工具增强型 LLM 与强化学习(reinforcement learning, RL)相结合,通过两个创新机制实现突破:(i) 利用外部量子模拟器进行量子电路验证,确保生成结果在物理层面正确;(ii) 设计分层奖励机制,在 RL 训练中同时优化电路的语法正确性与语义功能性,从而显著提升生成质量。实验证明,QUASAR 在 Pass@1 和 Pass@10 上分别达到 99.31% 和 100% 的有效性,优于 GPT-4o、GPT-5 等工业级 LLM 及多种监督微调(SFT)和纯强化学习基线方法。

链接: https://arxiv.org/abs/2510.00967
作者: Cong Yu,Valter Uotila,Shilong Deng,Qingyuan Wu,Tuo Shi,Songlin Jiang,Lei You,Bo Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Designing and optimizing task-specific quantum circuits are crucial to leverage the advantage of quantum computing. Recent large language model (LLM)-based quantum circuit generation has emerged as a promising automatic solution. However, the fundamental challenges remain unaddressed: (i) parameterized quantum gates require precise numerical values for optimal performance, which also depend on multiple aspects, including the number of quantum gates, their parameters, and the layout/depth of the circuits. (ii) LLMs often generate low-quality or incorrect quantum circuits due to the lack of quantum domain-specific knowledge. We propose QUASAR, an agentic reinforcement learning (RL) framework for quantum circuits generation and optimization based on tool-augmented LLMs. To align the LLM with quantum-specific knowledge and improve the generated quantum circuits, QUASAR designs (i) a quantum circuit verification approach with external quantum simulators and (ii) a sophisticated hierarchical reward mechanism in RL training. Extensive evaluation shows improvements in both syntax and semantic performance of the generated quantum circuits. When augmenting a 4B LLM, QUASAR has achieved the validity of 99.31% in Pass@1 and 100% in Pass@10, outperforming industrial LLMs of GPT-4o, GPT-5 and DeepSeek-V3 and several supervised-fine-tuning (SFT)-only and RL-only baselines.
zh

[AI-19] Deep Learning-Based Approach for Improving Relational Aggregated Search

【速读】:该论文旨在解决互联网信息爆炸背景下,传统搜索引擎在阿拉伯语文本数据聚类中存在检索不精准、缺乏上下文相关性及个性化不足的问题。解决方案的关键在于引入先进的自然语言处理技术,即利用堆叠自编码器(Stacked Autoencoders)进行表示学习,并结合AraBERT嵌入(AraBERT embeddings)提取语义特征,进而通过K-means聚类算法挖掘搜索结果中的显著特征与关联关系,从而提升聚合搜索环境中阿拉伯语文本的聚类效果和检索准确性。

链接: https://arxiv.org/abs/2510.00966
作者: Sara Saad Soliman,Ahmed Younes,Islam Elkabani,Ashraf Elsayed
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to an information explosion on the internet, there is a need for the development of aggregated search systems that can boost the retrieval and management of content in various formats. To further improve the clustering of Arabic text data in aggregated search environments, this research investigates the application of advanced natural language processing techniques, namely stacked autoencoders and AraBERT embeddings. By transcending the limitations of traditional search engines, which are imprecise, not contextually relevant, and not personalized, we offer more enriched, context-aware characterizations of search results. We used a K-means clustering algorithm to discover distinctive features and relationships in these results, and then applied our approach to different Arabic queries to evaluate its effectiveness. Our model illustrates that using stacked autoencoders in representation learning suits clustering tasks and can significantly improve clustering search results. It also demonstrates improved accuracy and relevance of search results.
zh
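
整个流程可以用一个极简草图复现:句向量(实际来自 AraBERT)→ 降维编码(实际为训练好的堆叠自编码器,这里用 SVD/PCA 占位)→ K-means 聚类。所有数据均为随机生成的示意。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))      # 假想:200 条搜索结果的句向量

def encode(x, dim=32):
    """占位的降维:实际应为训练好的堆叠自编码器的编码端,这里用 PCA 近似示意。"""
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:dim].T

codes = encode(embeddings)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(codes)
print(np.bincount(labels))                    # 每个聚类中的结果条数
```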

[AI-20] A Neuro-Fuzzy System for Interpretable Long-Term Stock Market Forecasting

【速读】:该论文旨在解决多变量时间序列预测中准确性与可解释性难以兼顾的问题(accuracy and interpretability trade-off in multivariate time series forecasting)。其解决方案的关键在于提出了一种融合长短期记忆网络(LSTM)与多头自注意力机制的新型递归神经网络架构——模糊变换器(Fuzzformer),通过时序注意力将多变量数据压缩为适合模糊推理系统(fuzzy inference system, FIS)处理的可解释特征,从而在保持与ARIMA和LSTM相当预测性能的同时,实现网络内部信息流动的可解释性。

链接: https://arxiv.org/abs/2510.00960
作者: Miha Ožbot,Igor Škrjanc,Vitomir Štruc
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
备注: Published in: ERK 2025 – 34th International Electrotechnical and Computer Science Conference, Portorož, Slovenia, Sept. 25–26, 2025. Proceedings published by Društvo Slovenska sekcija IEEE. ISSN: 2591-0442 (online). 4 pages, 2 figures

点击查看摘要

Abstract:In the complex landscape of multivariate time series forecasting, achieving both accuracy and interpretability remains a significant challenge. This paper introduces the Fuzzy Transformer (Fuzzformer), a novel recurrent neural network architecture combined with multi-head self-attention and fuzzy inference systems to analyze multivariate stock market data and conduct long-term time series forecasting. The method leverages LSTM networks and temporal attention to condense multivariate data into interpretable features suitable for fuzzy inference systems. The resulting architecture offers comparable forecasting performance to conventional models such as ARIMA and LSTM while providing meaningful information flow within the network. The method was examined on the real-world stock market index S&P 500. Initial results show potential for interpretable forecasting and identify current performance tradeoffs, suggesting practical application in understanding and forecasting stock market behavior.
zh

[AI-21] Test-Time Search in Neural Graph Coarsening Procedures for the Capacitated Vehicle Routing Problem

【速读】:该论文旨在解决生成式 AI (Generative AI) 在求解带容量约束的车辆路径问题(Capacitated Vehicle Routing Problem, CVRP)时,基于深度学习的割平面分离方法在推理阶段生成割不等式(如舍入容量不等式 RCI)数量不足、多样性不够的问题。其解决方案的关键在于引入一种新的测试时搜索机制,通过在图粗化(graph coarsening)过程中加入随机性以增强模型对不同子集的敏感度,并提出基于粗化历史的分区算法(Graph Coarsening History-based Partitioning, GraphCHiP),不仅能够识别RCI,还能首次有效识别帧式容量不等式(Framed Capacity Inequalities, FCIs),从而显著提升割平面法的对偶间隙收敛效果。

链接: https://arxiv.org/abs/2510.00958
作者: Yoonju Sim,Hyeonah Kim,Changhyun Kwon
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The identification of valid inequalities, such as the rounded capacity inequalities (RCIs), is a key component of cutting plane methods for the Capacitated Vehicle Routing Problem (CVRP). While a deep learning-based separation method can learn to find high-quality cuts, our analysis reveals that the model produces fewer cuts than expected because it is insufficiently sensitive to generate a diverse set of generated subsets. This paper proposes an alternative: enhancing the performance of a trained model at inference time through a new test-time search with stochasticity. First, we introduce stochastic edge selection into the graph coarsening procedure, replacing the previously proposed greedy approach. Second, we propose the Graph Coarsening History-based Partitioning (GraphCHiP) algorithm, which leverages coarsening history to identify not only RCIs but also, for the first time, the Framed capacity inequalities (FCIs). Experiments on randomly generated CVRP instances demonstrate the effectiveness of our approach in reducing the dual gap compared to the existing neural separation method. Additionally, our method discovers effective FCIs on a specific instance, despite the challenging nature of identifying such cuts.
zh

[AI-22] Bridging the Gap Between Simulated and Real Network Data Using Transfer Learning

【速读】:该论文旨在解决机器学习(Machine Learning, ML)网络模型在真实环境中部署时因训练数据不足而导致预测精度下降的问题,尤其是在关键场景(如网络故障)中,真实数据获取成本高且有限。其解决方案的关键在于提出一种融合迁移学习(transfer learning)的混合方法,通过使用预训练的仿真模型(基于RouteNet-Fermi)并结合少量真实数据进行微调(fine-tuning),从而显著提升模型在实际网络中的性能表现。实验表明,仅需10个真实场景即可使包延迟预测的平均绝对百分比误差(Mean Absolute Percentage Error, MAPE)降低37%,而50个场景则可降低48%。

链接: https://arxiv.org/abs/2510.00956
作者: Carlos Güemes-Palau,Miquel Ferriol-Galmés,Jordi Paillisse-Vilanova,Albert López-Brescó,Pere Barlet-Ros,Albert Cabellos-Aparicio
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper was submitted to IEEE ICC 2026. 7 Pages, 5 Figures

点击查看摘要

Abstract:Machine Learning (ML)-based network models provide fast and accurate predictions for complex network behaviors but require substantial training data. Collecting such data from real networks is often costly and limited, especially for critical scenarios like failures. As a result, researchers commonly rely on simulated data, which reduces accuracy when models are deployed in real environments. We propose a hybrid approach leveraging transfer learning to combine simulated and real-world data. Using RouteNet-Fermi, we show that fine-tuning a pre-trained model with a small real dataset significantly improves performance. Our experiments with OMNeT++ and a custom testbed reduce the Mean Absolute Percentage Error (MAPE) in packet delay prediction by up to 88%. With just 10 real scenarios, MAPE drops by 37%, and with 50 scenarios, by 48%.
zh
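
迁移学习微调部分可用一个示意草图说明:冻结在仿真数据上学到的通用层,只用少量真实场景更新其余参数。model、real_loader 为假想对象,冻结策略与损失选择均为笔者假设,并非论文对 RouteNet-Fermi 的实际做法。

```python
import torch

def fine_tune(model, real_loader, lr=1e-4, freeze_prefix="0"):
    """极简示意:冻结名字以 freeze_prefix 开头的层,其余参数用真实数据微调。"""
    for name, p in model.named_parameters():
        p.requires_grad = not name.startswith(freeze_prefix)
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    model.train()
    for x, y in real_loader:                  # 例如只有 10~50 个真实场景
        opt.zero_grad()
        loss = torch.nn.functional.l1_loss(model(x), y)  # 延迟预测用 MAE 类损失
        loss.backward()
        opt.step()
    return model

sim_model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(),
                                torch.nn.Linear(8, 1))   # 假想的仿真预训练模型
fine_tune(sim_model, [(torch.randn(16, 4), torch.randn(16, 1))])
```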

[AI-23] On Discovering Algorithms for Adversarial Imitation Learning

【速读】:该论文旨在解决对抗式模仿学习(Adversarial Imitation Learning, AIL)在训练过程中不稳定的问题,尤其是现有方法中奖励分配(Reward Assignment, RA)函数设计依赖人工经验、缺乏系统性优化所导致的性能瓶颈。其解决方案的关键在于提出一种数据驱动的RA函数发现机制——通过大语言模型(LLM)引导的进化框架,自动探索RA函数空间并筛选出能提升策略性能的函数形式,从而构建首个元学习的AIL算法DAIL(Discovered Adversarial Imitation Learning)。该方法不仅显著提升了训练稳定性与泛化能力,还揭示了RA函数对AIL稳定性的关键作用机制。

链接: https://arxiv.org/abs/2510.00922
作者: Shashank Reddy Chirra,Jayden Teoh,Praveen Paruchuri,Pradeep Varakantham
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial Imitation Learning (AIL) methods, while effective in settings with limited expert demonstrations, are often considered unstable. These approaches typically decompose into two components: Density Ratio (DR) estimation \frac{\rho_E}{\rho_\pi} , where a discriminator estimates the relative occupancy of state-action pairs under the policy versus the expert; and Reward Assignment (RA), where this ratio is transformed into a reward signal used to train the policy. While significant research has focused on improving density estimation, the role of reward assignment in influencing training dynamics and final policy performance has been largely overlooked. RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity. In this work, we take a different approach: we investigate the discovery of data-driven RA functions, i.e., based directly on the performance of the resulting imitation policy. To this end, we leverage an LLM-guided evolutionary framework that efficiently explores the space of RA functions, yielding Discovered Adversarial Imitation Learning (DAIL), the first meta-learnt AIL algorithm. Remarkably, DAIL generalises across unseen environments and policy optimization algorithms, outperforming the current state-of-the-art of human-designed baselines. Finally, we analyse why DAIL leads to more stable training, offering novel insights into the role of RA functions in the stability of AIL. Code is publicly available: this https URL.
zh

[AI-24] Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练时,因验证器(verifier)不可靠而导致的策略优化偏差问题。具体而言,现有方法为降低验证器被攻击的风险,常将奖励压缩为二值信号 {0,1},但这会引入不对称的噪声:虚假负例(False Negatives, FNs)——正确答案被错误拒绝;以及虚假正例(False Positives, FPs)——错误答案被误判为正确。这种噪声破坏了策略梯度估计的无偏性,影响模型收敛与性能。解决方案的关键在于将验证器建模为带有非对称噪声率的随机奖励信道,并据此提出两种轻量级校正机制:一是后向校正(backward correction),通过去偏观测到的二值奖励以恢复干净策略梯度的无偏估计;二是前向校正(forward correction),仅需虚假负例率(FN rate)即可重新加权评分函数项,使期望更新方向与干净梯度一致。实验表明,这两种校正算法在数学推理任务上均优于未校正训练,且前向校正收敛更快、抗噪能力更强。

链接: https://arxiv.org/abs/2510.00915
作者: Xin-Qiang Cai,Wei Wang,Feng Liu,Tongliang Liu,Gang Niu,Masashi Sugiyama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary \{0,1\} during training. This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction \frac{12}{36} as wrong when compared against the canonical \frac{1}{3} due to brittle parsing/equivalence rules (FN), while a large language model (LLM) judge can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a backward correction that de-biases the observed binary reward to recover an unbiased estimator of the clean policy gradient. The second is a forward correction that reweights score-function terms so that the expected update direction aligns with the clean gradient; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, obtaining outperformance compared with other state-of-the-art contenders.
zh
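
文中的"后向校正"本质上是经典的噪声信道去偏。下面按这一思路给出示意实现并数值验证其无偏性;具体系数为笔者依标准标签噪声校正推得,与论文记号未必一致。

```python
def backward_corrected_reward(r_obs, fn_rate, fp_rate):
    """
    极简示意:把观测到的二值奖励 r_obs ∈ {0,1} 去偏,使其在验证器噪声
    (P[拒真]=fn_rate, P[纳伪]=fp_rate)下成为干净奖励的无偏估计。
    """
    denom = 1.0 - fn_rate - fp_rate
    assert denom > 0, "需要 fn_rate + fp_rate < 1"
    if r_obs == 1:
        return (1.0 - fp_rate) / denom
    return -fp_rate / denom

# 验证无偏性:真奖励为 1 时,E[r_hat] = (1-fn)*f(1) + fn*f(0) = 1
fn, fp = 0.2, 0.1
print((1 - fn) * backward_corrected_reward(1, fn, fp)
      + fn * backward_corrected_reward(0, fn, fp))   # -> 1.0
```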

[AI-25] RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training

【速读】:该论文旨在解决当前基于均值的强化学习方法(如Group Relative Policy Optimization, GRPO)在大语言模型(LLM)后训练过程中存在的熵崩溃(entropy collapse)和推理能力提升有限的问题。这些问题主要源于对高概率输出序列的过度关注,而忽视了稀有但信息丰富的推理路径。解决方案的关键在于提出一种基于风险的策略优化方法(Risk-based Policy Optimization, RiskPO),其核心是用严谨的风险度量替代传统的均值目标函数,并引入混合VaR(Mixed Value-at-Risk)目标,通过加权关注奖励分布的不同区域来增强困难样本的梯度信号,防止过自信收敛;同时设计问题打包(bundling)机制以丰富反馈信号,从而稳定训练过程并促进探索。理论与实证结果均表明,该方法能有效缓解熵崩溃并显著提升数学推理、多模态推理及代码生成等任务性能。

链接: https://arxiv.org/abs/2510.00911
作者: Tao Ren,Jinyang Jiang,Hui Yang,Wan Tian,Minhao Zou,Guanghao Li,Zishi Zhang,Qinghao Wang,Shentao Qin,Yanjun Zhao,Rui Tao,Hui Shao,Yijie Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from entropy collapse and limited reasoning gains. We argue that these issues stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. To address these challenges, we propose Risk-based Policy Optimization (RiskPO), which substitutes classical mean-based objectives with principled risk measures. Specifically, we introduce a Mixed Value-at-Risk objective that integrates weighted attention over multiple regions of the reward distribution, thereby amplifying gradient signals on challenging instances and preventing overconfident convergence. We further design a bundling scheme that aggregates multiple questions into bundles, thus enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, we prove that the risk-averse update alleviates entropy collapse and promotes exploration. Numerically, RiskPO achieves consistent and significant improvements in mathematical reasoning, multi-modal reasoning, and code generation benchmarks, surpassing GRPO and its variants on both Pass@1 and Pass@k metrics. Our results demonstrate that risk-based optimization provides a rigorous and effective paradigm for enhancing LLM reasoning capabilities.
zh

[AI-26] “We are not Future-ready”: Understanding AI Privacy Risks and Existing Mitigation Strategies from the Perspective of AI Developers in Europe

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)发展中隐私风险认知不一及防护策略落地不足的问题。研究发现,AI开发者对隐私威胁的优先级排序缺乏共识,且这种分歧主要源于人类因素而非纯技术因素;尽管开发者了解多种缓解策略,但在实际应用中采纳率极低。解决方案的关键在于识别并填补开发者在隐私风险认知与实践之间的差距,通过增强其对隐私风险的理解和推动有效防护措施的落地,从而提升AI系统整体的隐私保护能力。

链接: https://arxiv.org/abs/2510.00909
作者: Alexandra Klymenko,Stephen Meisenbacher,Patrick Gage Kelley,Sai Teja Peddinti,Kurt Thomas,Florian Matthes
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 20 pages, 1 figure, 4 tables. Accepted to SOUPS 2025

点击查看摘要

Abstract:The proliferation of AI has sparked privacy concerns related to training data, model interfaces, downstream applications, and more. We interviewed 25 AI developers based in Europe to understand which privacy threats they believe pose the greatest risk to users, developers, and businesses and what protective strategies, if any, would help to mitigate them. We find that there is little consensus among AI developers on the relative ranking of privacy risks. These differences stem from salient reasoning patterns that often relate to human rather than purely technical factors. Furthermore, while AI developers are aware of proposed mitigation strategies for addressing these risks, they reported minimal real-world adoption. Our findings highlight both gaps and opportunities for empowering AI developers to better address privacy risks in AI.
zh

[AI-27] TubeDAgger: Reducing the Number of Expert Interventions with Stochastic Reach-Tubes

【速读】:该论文旨在解决交互式模仿学习(Interactive Imitation Learning)中如何高效决策何时由新手策略(novice policy)自主执行动作、何时需交还控制权给专家的问题,以减少对专家干预的依赖并提升训练效率。其解决方案的关键在于引入随机可达管(stochastic reachtubes),这是一种源自动态系统验证领域的技术,用于在线估计策略行为偏离安全范围的可能性,从而智能判定是否需要专家介入;该方法无需针对不同环境进行决策阈值微调,显著降低了专家干预次数,尤其优于依赖疑虑分类模型(doubt classification model)的相关方法。

链接: https://arxiv.org/abs/2510.00906
作者: Julian Lemmel,Manuel Kranzl,Adam Lamine,Philipp Neubauer,Radu Grosu,Sophie A. Neubauer
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Interactive Imitation Learning deals with training a novice policy from expert demonstrations in an online fashion. The established DAgger algorithm trains a robust novice policy by alternating between interacting with the environment and retraining of the network. Many variants thereof exist, that differ in the method of discerning whether to allow the novice to act or return control to the expert. We propose the use of stochastic reachtubes - common in verification of dynamical systems - as a novel method for estimating the necessity of expert intervention. Our approach does not require fine-tuning of decision thresholds per environment and effectively reduces the number of expert interventions, especially when compared with related approaches that make use of a doubt classification model.
zh

[AI-28] FusionAdapter for Few-Shot Relation Learning in Multimodal Knowledge Graphs

【速读】:该论文旨在解决多模态知识图谱(Multimodal Knowledge Graphs, MMKG)中因现有方法将不同模态对齐至共享空间而导致特定模态贡献被忽略的问题,尤其在低资源场景下表现受限。其解决方案的关键在于提出FusionAdapter框架,该框架包含两个核心组件:(1) 适配器模块(adapter module),用于高效地将每种模态适应到未见关系;(2) 融合策略,能够在保留各模态特异性特征的前提下整合多模态实体表示。通过有效适配与融合多样化模态信息,FusionAdapter显著提升了在极少标注数据下的关系泛化能力。

链接: https://arxiv.org/abs/2510.00894
作者: Ran Liu,Yuan Fang,Xiaoli Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Archived paper

点击查看摘要

Abstract:Multimodal Knowledge Graphs (MMKGs) incorporate various modalities, including text and images, to enhance entity and relation representations. Notably, different modalities for the same entity often present complementary and diverse information. However, existing MMKG methods primarily align modalities into a shared space, which tends to overlook the distinct contributions of specific modalities, limiting their performance particularly in low-resource settings. To address this challenge, we propose FusionAdapter for the learning of few-shot relationships (FSRL) in MMKG. FusionAdapter introduces (1) an adapter module that enables efficient adaptation of each modality to unseen relations and (2) a fusion strategy that integrates multimodal entity representations while preserving diverse modality-specific characteristics. By effectively adapting and fusing information from diverse modalities, FusionAdapter improves generalization to novel relations with minimal supervision. Extensive experiments on two benchmark MMKG datasets demonstrate that FusionAdapter achieves superior performance over state-of-the-art methods.
zh

[AI-29] GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling

【速读】:该论文旨在解决传统多层感知机(MLP)在训练过程中结构知识与定量知识混杂导致的效率低下问题。其解决方案的关键在于提出一种名为GreenLightningAI(GLAI)的新架构模块,通过将ReLU激活所诱导的稳定激活模式(结构知识)与权重和偏置携带的数值信息(定量知识)分离,在结构固定后仅优化定量参数,从而实现更高效的训练过程。此设计保留了MLP的通用逼近能力,同时显著缩短训练时间(平均减少约40%),且在多种任务中保持或超越原生MLP的精度表现。

链接: https://arxiv.org/abs/2510.00883
作者: Jose I. Mestre,Alberto Fernández-Hernández,Cristian Pérez-Corral,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 2 figures

点击查看摘要

Abstract:In this work we introduce GreenLightningAI (GLAI), a new architectural block designed as an alternative to conventional MLPs. The central idea is to separate two types of knowledge that are usually entangled during training: (i) structural knowledge, encoded by the stable activation patterns induced by ReLU activations; and (ii) quantitative knowledge, carried by the numerical weights and biases. By fixing the structure once stabilized, GLAI reformulates the MLP as a combination of paths, where only the quantitative component is optimized. This reformulation retains the universal approximation capabilities of MLPs, yet achieves a more efficient training process, reducing training time by ~40% on average across the cases examined in this study. Crucially, GLAI is not just another classifier, but a generic block that can replace MLPs wherever they are used, from supervised heads with frozen backbones to projection layers in self-supervised learning or few-shot classifiers. Across diverse experimental setups, GLAI consistently matches or exceeds the accuracy of MLPs with an equivalent number of parameters, while converging faster. Overall, GLAI establishes a new design principle that opens a direction for future integration into large-scale architectures such as Transformers, where MLP blocks dominate the computational footprint.
zh
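
"结构/定量知识解耦"可以用一个双分支草图体会:冻结的结构网决定每个样本的 0/1 激活路径(结构知识),可训练的数值网在固定路径上提供数值(定量知识)。这一实现方式是笔者的假想示意,并非论文发布的代码。

```python
import torch
import torch.nn as nn

class GLAIBlock(nn.Module):
    """激活模式固定后,网络对可训练参数变为线性,只需优化定量部分。"""
    def __init__(self, d_in, d_hid, d_out):
        super().__init__()
        self.struct = nn.Linear(d_in, d_hid)   # 结构网:激活模式稳定后整体冻结
        self.value = nn.Linear(d_in, d_hid)    # 数值网:提供实际数值
        self.out = nn.Linear(d_hid, d_out)
        for p in self.struct.parameters():
            p.requires_grad = False

    def forward(self, x):
        mask = (self.struct(x) > 0).float()    # 每个样本自己的 0/1 路径选择
        return self.out(self.value(x) * mask)  # 只训练 value/out(定量知识)

x = torch.randn(5, 10)
print(GLAIBlock(10, 32, 2)(x).shape)           # torch.Size([5, 2])
```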

[AI-30] Advancing Automated Ethical Profiling in SE: a Zero-Shot Evaluation of LLM Reasoning

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在软件工程(Software Engineering, SE)场景中进行伦理推理时的可解释性与一致性问题,特别是如何评估LLMs在零样本(zero-shot)条件下对真实世界伦理情境的理解能力。解决方案的关键在于构建一个全自动化的评估框架,通过30个现实伦理场景测试16个LLMs,要求其识别最适用的伦理理论、判断行为道德可接受性并提供解释;随后以专家伦理学家的选择为基准,采用理论一致性率(Theory Consistency Rate, TCR)和道德可接受性二值一致率(Binary Agreement Rate, BAR)量化模型表现,并结合定性分析揭示自由文本解释中的概念收敛性。结果表明,LLMs在伦理推理任务上具备较高的一致性和可解释性,尤其在伦理模糊案例中表现出可归因的分歧模式,验证了其作为SE流程中“伦理解释器”组件的可行性。

链接: https://arxiv.org/abs/2510.00881
作者: Patrizio Migliarini,Mashal Afzal Memon,Marco Autili,Paola Inverardi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at ASE 2025

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly integrated into software engineering (SE) tools for tasks that extend beyond code synthesis, including judgment under uncertainty and reasoning in ethically significant contexts. We present a fully automated framework for assessing ethical reasoning capabilities across 16 LLMs in a zero-shot setting, using 30 real-world ethically charged scenarios. Each model is prompted to identify the most applicable ethical theory to an action, assess its moral acceptability, and explain the reasoning behind their choice. Responses are compared against expert ethicists’ choices using inter-model agreement metrics. Our results show that LLMs achieve an average Theory Consistency Rate (TCR) of 73.3% and Binary Agreement Rate (BAR) on moral acceptability of 86.7%, with interpretable divergences concentrated in ethically ambiguous cases. A qualitative analysis of free-text explanations reveals strong conceptual convergence across models despite surface-level lexical diversity. These findings support the potential viability of LLMs as ethical inference engines within SE pipelines, enabling scalable, auditable, and adaptive integration of user-aligned ethical reasoning. Our focus is the Ethical Interpreter component of a broader profiling pipeline: we evaluate whether current LLMs exhibit sufficient interpretive stability and theory-consistent reasoning to support automated profiling.

[AI-31] A Technique Based on Trade-off Maps to Visualise and Analyse Relationships Between Objectives in Optimisation Problems

【Quick Read】: This paper addresses the difficulty of understanding the complex relationships between objectives in multiobjective optimisation, particularly for the many-objective combinatorial problems that arise in real-world logistics, where the challenge is to reveal both local and global interactions between objectives so as to better support the decision maker. The key to the solution is a four-step analysis technique: first, global pairwise relationships are analysed with the Kendall correlation method; second, the ranges of objective values found on the given Pareto front are estimated and assessed; third, these ranges are used to plot a Gray-code-based map, similar to a Karnaugh map, that highlights trade-offs between multiple objectives; and finally, local relationships are identified with scatter plots. The technique substantially improves understanding of the interaction structure between objectives and helps explain where problem difficulty comes from.

Link: https://arxiv.org/abs/2510.00877
Authors: Rodrigo Lankaites Pinheiro, Dario Landa-Silva, Jason Atkin
Institution: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Optimization and Control (math.OC)
Comments: 30 pages, journal paper

Abstract:Understanding the relationships between objectives in a multiobjective optimisation problem is important for developing tailored and efficient solving techniques. In particular, when tackling combinatorial optimisation problems with many objectives, that arise in real-world logistic scenarios, better support for the decision maker can be achieved through better understanding of the often complex fitness landscape. This paper makes a contribution in this direction by presenting a technique that allows a visualisation and analysis of the local and global relationships between objectives in optimisation problems with many objectives. The proposed technique uses four steps: First, the global pairwise relationships are analysed using the Kendall correlation method; then, the ranges of the values found on the given Pareto front are estimated and assessed; next, these ranges are used to plot a map using Gray code, similar to Karnaugh maps, that has the ability to highlight the trade-offs between multiple objectives; and finally, local relationships are identified using scatter plots. Experiments are presented for three combinatorial optimisation problems: multiobjective multidimensional knapsack problem, multiobjective nurse scheduling problem, and multiobjective vehicle routing problem with time windows. Results show that the proposed technique helps in gaining insights into the problem difficulty arising from the relationships between objectives.
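
A minimal sketch of the technique's first step, pairwise Kendall correlation between objectives over solutions on a Pareto front. The synthetic objectives below are illustrative assumptions, not the paper's benchmark data; positive tau suggests harmony between two objectives, negative tau suggests conflict.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
f1 = rng.random(200)
f2 = 1.0 - f1 + 0.05 * rng.random(200)   # strongly conflicting with f1
f3 = f1 + 0.10 * rng.random(200)         # mostly harmonious with f1
objectives = np.column_stack([f1, f2, f3])

# Global pairwise relationships: Kendall tau for every objective pair.
n = objectives.shape[1]
for i in range(n):
    for j in range(i + 1, n):
        tau, _ = kendalltau(objectives[:, i], objectives[:, j])
        print(f"f{i+1} vs f{j+1}: tau = {tau:.2f}")
```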

[AI-32] Unveiling Interesting Insights: Monte Carlo Tree Search for Knowledge Discovery

【Quick Read】: This paper addresses the challenge organisations face in turning large volumes of process data into actionable knowledge for decision making: there is a gap between the amount of data collected and the ability to process and understand it. The proposed solution is AIDE (Automated Insights and Data Exploration), whose key idea is a Monte Carlo Tree Search (MCTS) framework that intelligently explores and evaluates candidate data transformations and model combinations to automatically surface interesting data patterns. A major strength of the approach is its extensibility: additional pattern-extraction strategies and domain knowledge can be integrated, making AIDE a solid and flexible foundation for automated knowledge discovery.

Link: https://arxiv.org/abs/2510.00876
Authors: Pietro Totis, Alberto Pozanco, Daniel Borrajo
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Organizations are increasingly focused on leveraging data from their processes to gain insights and drive decision-making. However, converting this data into actionable knowledge remains a difficult and time-consuming task. There is often a gap between the volume of data collected and the ability to process and understand it, which automated knowledge discovery aims to fill. Automated knowledge discovery involves complex open problems, including effectively navigating data, building models to extract implicit relationships, and considering subjective goals and knowledge. In this paper, we introduce a novel method for Automated Insights and Data Exploration (AIDE), that serves as a robust foundation for tackling these challenges through the use of Monte Carlo Tree Search (MCTS). We evaluate AIDE using both real-world and synthetic data, demonstrating its effectiveness in identifying data transformations and models that uncover interesting data patterns. Among its strengths, AIDE’s MCTS-based framework offers significant extensibility, allowing for future integration of additional pattern extraction strategies and domain knowledge. This makes AIDE a valuable step towards developing a comprehensive solution for automated knowledge discovery.

[AI-33] Learning Compact Representations of LLM Abilities via Item Response Theory

【Quick Read】: This paper addresses the challenge of managing and efficiently utilising the rapidly growing pool of large language models (LLMs), in particular how to learn compact representations of model abilities that support downstream tasks such as model routing and performance prediction on new benchmarks. The key to the solution is to model the probability that a given model answers a specific query correctly as a function of three factors: the model's multi-skill ability vector, the query's discrimination vector that separates models of differing skills, and the query's difficulty scalar. Inspired by Item Response Theory (IRT) from psychometrics, a Mixture-of-Experts (MoE) network coupling model-level and query-level embeddings learns these parameters jointly, yielding state-of-the-art model routing and benchmark accuracy prediction.

Link: https://arxiv.org/abs/2510.00844
Authors: Jianhao Chen, Chenxu Wang, Gengrui Zhang, Peng Ye, Lei Bai, Wei Hu, Yuzhong Qu, Shuyue Hu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent years have witnessed a surge in the number of large language models (LLMs), yet efficiently managing and utilizing these vast resources remains a significant challenge. In this work, we explore how to learn compact representations of LLM abilities that can facilitate downstream tasks, such as model routing and performance prediction on new benchmarks. We frame this problem as estimating the probability that a given model will correctly answer a specific query. Inspired by the item response theory (IRT) in psychometrics, we model this probability as a function of three key factors: (i) the model’s multi-skill ability vector, (ii) the query’s discrimination vector that separates models of differing skills, and (iii) the query’s difficulty scalar. To learn these parameters jointly, we introduce a Mixture-of-Experts (MoE) network that couples model- and query-level embeddings. Extensive experiments demonstrate that our approach leads to state-of-the-art performance in both model routing and benchmark accuracy prediction. Moreover, analysis validates that the learned parameters encode meaningful, interpretable information about model capabilities and query characteristics.
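
A minimal sketch of the IRT-style response model at the core of the paper: the probability that a model answers a query correctly, driven by an ability vector, a discrimination vector, and a difficulty scalar. Shapes and values are illustrative assumptions; the paper learns these jointly with an MoE network rather than sampling them.

```python
import numpy as np

def p_correct(ability, discrimination, difficulty):
    """Multidimensional 2PL-style item response function."""
    logit = discrimination @ ability - difficulty  # skill match minus difficulty
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
ability = rng.normal(size=8)          # one model's multi-skill ability vector
discrimination = rng.normal(size=8)   # how strongly this query separates skills
difficulty = 0.5                      # scalar difficulty of the query

print(p_correct(ability, discrimination, difficulty))
```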

[AI-34] Improving Cryptocurrency Pump-and-Dump Detection through Ensemble-Based Models and Synthetic Oversampling Techniques

【Quick Read】: This paper addresses the difficulty of detecting pump-and-dump (PD) manipulation in cryptocurrency markets: such events are rare, causing severe class imbalance that hurts detection accuracy. The key to the solution is to combine data balancing with ensemble learning: the Synthetic Minority Oversampling Technique (SMOTE) is applied to mitigate the class imbalance, and several state-of-the-art ensemble models are then evaluated for distinguishing manipulative trading from normal market activity. Experiments show that SMOTE markedly improves recall and the precision-recall balance of all models; XGBoost and LightGBM in particular achieve high recall (94.87% and 93.59%, respectively) with strong F1-scores and fast computation, making them suitable for near-real-time surveillance.

Link: https://arxiv.org/abs/2510.00836
Authors: Jieun Yu, Minjung Park, Sangmi Chai
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Risk Management (q-fin.RM)
Comments:

Abstract:This study aims to detect pump and dump (PD) manipulation in cryptocurrency markets, where the scarcity of such events causes severe class imbalance and hinders accurate detection. To address this issue, the Synthetic Minority Oversampling Technique (SMOTE) was applied, and advanced ensemble learning models were evaluated to distinguish manipulative trading behavior from normal market activity. The experimental results show that applying SMOTE greatly enhanced the ability of all models to detect PD events by increasing recall and improving the overall balance between precision and recall. In particular, XGBoost and LightGBM achieved high recall rates (94.87% and 93.59%, respectively) with strong F1-scores and demonstrated fast computational performance, making them suitable for near real time surveillance. These findings indicate that integrating data balancing techniques with ensemble methods significantly improves the early detection of manipulative activities, contributing to a fairer, more transparent, and more stable cryptocurrency market.
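
A minimal sketch of the SMOTE-plus-ensemble recipe the abstract describes, on synthetic data. The dataset, class ratio, and model below are illustrative assumptions; the paper uses real trading data and boosted ensembles such as XGBoost and LightGBM, for which sklearn's GradientBoostingClassifier stands in here.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Rare positive class (~2%) mimics the scarcity of pump-and-dump events.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split so no synthetic points leak into evaluation.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = GradientBoostingClassifier().fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```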

[AI-35] Towards Verifiable Federated Unlearning: Framework, Challenges, and The Road Ahead

【Quick Read】: This paper addresses the lack of trustworthy verification in federated unlearning (FUL): clients have no reliable way to confirm that the influence of their data has been removed, and current metrics and simple notifications offer insufficient assurance, which undermines FUL's credibility and practicality in highly regulated, data-sensitive settings such as healthcare. The key to the solution is veriFUL, a reference framework for verifiable federated unlearning that formalises verification entities, goals, approaches, and metrics, consolidating existing work and contributing new concepts and metrics so that verification becomes a trust-by-design part of the FUL life-cycle and unlearning operations are auditable and transparent.

Link: https://arxiv.org/abs/2510.00833
Authors: Thanh Linh Nguyen, Marcela Tuler de Oliveira, An Braeken, Aaron Yi Ding, Quoc-Viet Pham
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Journal submission

Abstract:Federated unlearning (FUL) enables removing the data influence from the model trained across distributed clients, upholding the right to be forgotten as mandated by privacy regulations. FUL facilitates a value exchange where clients gain privacy-preserving control over their data contributions, while service providers leverage decentralized computing and data freshness. However, this entire proposition is undermined because clients have no reliable way to verify that their data influence has been provably removed, as current metrics and simple notifications offer insufficient assurance. We envision unlearning verification becoming a pivotal and trust-by-design part of the FUL life-cycle development, essential for highly regulated and data-sensitive services and applications like healthcare. This article introduces veriFUL, a reference framework for verifiable FUL that formalizes verification entities, goals, approaches, and metrics. Specifically, we consolidate existing efforts and contribute new insights, concepts, and metrics to this domain. Finally, we highlight research challenges and identify potential applications and developments for verifiable FUL and veriFUL.

[AI-36] Benchmarking Machine Learning Models for Fault Classification and Localization in Power System Protection ICASSP2026

【Quick Read】: This paper addresses the declining reliability of conventional fixed-threshold protection schemes for fault classification (FC) and fault localization (FL) as distributed energy resources (DERs), especially renewables, are integrated into the grid at scale. Its key contribution is the first systematic benchmark of classical machine learning (ML) models for FC and FL in power system protection based on electromagnetic transient (EMT) data: voltage and current waveforms are segmented with sliding windows of 10-50 ms, and models are evaluated under realistic real-time constraints for accuracy, robustness to window size, and runtime efficiency. The best FC model reaches an F1 score of 0.992 ± 0.001, and the best FL model an R² of 0.806 ± 0.008 with a mean processing time of 0.563 ms.

Link: https://arxiv.org/abs/2510.00831
Authors: Julian Oelhaf, Georg Kordowich, Changhun Kim, Paula Andrea Pérez-Toro, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Submitted to ICASSP 2026; under review

Abstract:The increasing integration of distributed energy resources (DERs), particularly renewables, poses significant challenges for power system protection, with fault classification (FC) and fault localization (FL) being among the most critical tasks. Conventional protection schemes, based on fixed thresholds, cannot reliably identify and localize short circuits with the increasing complexity of the grid under dynamic conditions. Machine learning (ML) offers a promising alternative; however, systematic benchmarks across models and settings remain limited. This work presents, for the first time, a comparative benchmarking study of classical ML models for FC and FL in power system protection based on EMT data. Using voltage and current waveforms segmented into sliding windows of 10 ms to 50 ms, we evaluate models under realistic real-time constraints. Performance is assessed in terms of accuracy, robustness to window size, and runtime efficiency. The best-performing FC model achieved an F1 score of 0.992 ± 0.001, while the top FL model reached an R² of 0.806 ± 0.008 with a mean processing time of 0.563 ms.
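
A minimal sketch of the sliding-window segmentation the benchmark is built on: waveforms are cut into fixed-length windows that a downstream FC or FL model can score under real-time constraints. The sampling rate, hop size, and channel layout are illustrative assumptions.

```python
import numpy as np

def sliding_windows(signal, fs_hz, win_ms, hop_ms):
    """Segment a (channels, samples) waveform into overlapping windows."""
    win = int(fs_hz * win_ms / 1000)
    hop = int(fs_hz * hop_ms / 1000)
    starts = range(0, signal.shape[-1] - win + 1, hop)
    return np.stack([signal[..., s:s + win] for s in starts])

fs = 10_000                                   # 10 kHz EMT sampling (assumed)
wave = np.random.randn(6, fs)                 # 3 voltage + 3 current channels, 1 s
windows = sliding_windows(wave, fs, win_ms=10, hop_ms=5)
print(windows.shape)                          # (n_windows, 6, 100)
# Each window would then be featurised and fed to an FC classifier or FL regressor.
```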

[AI-37] Logical Consistency Between Disagreeing Experts and Its Role in AI Safety

【Quick Read】: This paper addresses unsupervised classifier evaluation: given only the agreements and disagreements among several experts (e.g., LLMs-as-Judges) and no ground-truth labels, infer the set of group evaluations that is logically consistent with the observed behaviour. The key to the solution is a formal framework that turns consensus and disagreement statistics into a Linear Programming problem over the integer space of possible correct/incorrect responses, combining universally applicable linear equalities (axioms) with explicit logical inequalities (e.g., the number of correct responses cannot exceed the number of observed responses) to compute exactly which evaluations are compatible with the observations. This enables "no-knowledge alarms" that detect, without any ground truth, when one or more LLMs-as-Judges violate a user-specified minimum grading threshold.

Link: https://arxiv.org/abs/2510.00821
Authors: Andrés Corrada-Emmanuel
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures

Abstract:If two experts disagree on a test, we may conclude both cannot be 100 per cent correct. But if they completely agree, no possible evaluation can be excluded. This asymmetry in the utility of agreements versus disagreements is explored here by formalizing a logic of unsupervised evaluation for classifiers. Its core problem is computing the set of group evaluations that are logically consistent with how we observe them agreeing and disagreeing in their decisions. Statistical summaries of their aligned decisions are inputs into a Linear Programming problem in the integer space of possible correct or incorrect responses given true labels. Obvious logical constraints, such as, the number of correct responses cannot exceed the number of observed responses, are inequalities. But in addition, there are axioms, universally applicable linear equalities that apply to all finite tests. The practical and immediate utility of this approach to unsupervised evaluation using only logical consistency is demonstrated by building no-knowledge alarms that can detect when one or more LLMs-as-Judges are violating a minimum grading threshold specified by the user.
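
A toy sketch of the asymmetry the abstract opens with, for the simplest case of two experts on binary questions: where they disagree, exactly one is correct; where they agree, both are correct or both are wrong. Enumerating these cases yields every pair of correct-answer counts that is logically consistent with the observed agreement statistics (the function name and setup are illustrative, not the paper's formulation).

```python
def consistent_evaluations(n_agree, n_disagree):
    """All (correct_A, correct_B) pairs consistent with the observed counts."""
    feasible = set()
    for a in range(n_agree + 1):          # agreements where both are correct
        for x in range(n_disagree + 1):   # disagreements where expert A is correct
            feasible.add((a + x, a + n_disagree - x))
    return sorted(feasible)

# Full agreement on 10 questions: nothing along the diagonal can be excluded.
print(consistent_evaluations(10, 0)[:3])          # (0, 0), (1, 1), (2, 2), ...
# A single disagreement already rules out "both experts 100% correct".
print((10, 10) in consistent_evaluations(8, 2))   # False
```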

[AI-38] Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

【Quick Read】: This paper addresses the understudied optimisation stability of policy-gradient reinforcement learning for large language models (LLMs): lacking reliable ways to track the underlying optimisation dynamics, existing implementations fall back on conservative hyperparameters to stay stable, which costs samples and compute. The key to the solution is to formalise the stochastic optimisation problem of policy gradients with explicit second-order geometry and to build a tractable framework that tracks and exploits curvature information during policy updates; on top of this, a data-selection intervention identifies and masks out samples that cause unstable updates. The resulting algorithm, Curvature-Aware Policy Optimization (CAPO), comes with monotonic-improvement guarantees under realistic assumptions and, while rejecting fewer than 8% of tokens, achieves up to 30x better sample efficiency than standard GRPO on math reasoning benchmarks.

Link: https://arxiv.org/abs/2510.00819
Authors: Luckeciano C. Melo, Alessandro Abate, Yarin Gal
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30x improvement in sample efficiency over standard GRPO for LLM reasoning.

[AI-39] Semantic Bridges Between First Order c-Representations and Cost-Based Semantics: An Initial Perspective

【Quick Read】: This paper addresses how to query inconsistent knowledge bases in a Description Logic (DL) setting, where classical reasoning breaks down in the presence of contradictions. The key contribution is a semantic comparison of two formalisms: the cost-based semantics of weighted knowledge bases, which attaches a weight to each statement and assigns every interpretation a cost based on how often it violates the KB, thereby ordering the space of interpretations; and c-representations, a form of non-monotonic reasoning introduced by Kern-Isberner that ranks interpretations via penalties for violated conditionals, giving defeasible readings of concept inclusions. The paper shows that, under certain conditions, a weighted knowledge base and a set of defeasible conditionals induce the same ordering on interpretations, i.e., the two semantics are equivalent up to relative cost, and that some entailment notions are equivalently expressible in both formalisms, laying groundwork for further work on both sides.

Link: https://arxiv.org/abs/2510.00817
Authors: Nicholas Leisegang, Giovanni Casini, Thomas Meyer
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Weighted-knowledge bases and cost-based semantics represent a recent formalism introduced by Bienvenu et al. for Ontology Mediated Data Querying in the case where a given knowledge base is inconsistent. This is done by adding a weight to each statement in the knowledge base (KB), and then giving each DL interpretation a cost based on how often it breaks rules in the KB. In this paper we compare this approach with c-representations, a form of non-monotonic reasoning originally introduced by Kern-Isberner. c-Representations describe a means to interpret defeasible concept inclusions in the first-order case. This is done by assigning a numerical ranking to each interpretation via penalties for each violated conditional. We compare these two approaches on a semantic level. In particular, we show that under certain conditions a weighted knowledge base and a set of defeasible conditionals can generate the same ordering on interpretations, and therefore an equivalence of semantic structures up to relative cost. Moreover, we compare entailment described in both cases, where certain notions are equivalently expressible in both formalisms. Our results have the potential to benefit further work on both cost-based semantics and c-representations.

[AI-40] MG2FlowNet: Accelerating High-Reward Sample Generation via Enhanced MCTS and Greediness Control

【Quick Read】: This paper addresses GFlowNets' difficulty in consistently generating high-reward samples in large search spaces where high-reward regions are sparse: existing sampling strategies tend to over-explore and lack stable performance. The key to the solution is to integrate an enhanced Monte Carlo Tree Search (MCTS) into the GFlowNets sampling process, using MCTS-based policy evaluation to steer generation toward high-reward trajectories, Polynomial Upper Confidence Trees (PUCT) to balance exploration and exploitation adaptively, and a controllable mechanism to regulate the degree of greediness, thereby strengthening exploitation of high-reward trajectories without sacrificing diversity.

Link: https://arxiv.org/abs/2510.00805
Authors: Rui Zhu, Xuan Yu, Yudong Zhang, Chen Zhang, Xu Wang, Yang Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative Flow Networks (GFlowNets) have emerged as a powerful tool for generating diverse and high-reward structured objects by learning to sample from a distribution proportional to a given reward function. Unlike conventional reinforcement learning (RL) approaches that prioritize optimization of a single trajectory, GFlowNets seek to balance diversity and reward by modeling the entire trajectory distribution. This capability makes them especially suitable for domains such as molecular design and combinatorial optimization. However, existing GFlowNets sampling strategies tend to overexplore and struggle to consistently generate high-reward samples, particularly in large search spaces with sparse high-reward regions. Therefore, improving the probability of generating high-reward samples without sacrificing diversity remains a key challenge under this premise. In this work, we integrate an enhanced Monte Carlo Tree Search (MCTS) into the GFlowNets sampling process, using MCTS-based policy evaluation to guide the generation toward high-reward trajectories and Polynomial Upper Confidence Trees (PUCT) to balance exploration and exploitation adaptively, and we introduce a controllable mechanism to regulate the degree of greediness. Our method enhances exploitation without sacrificing diversity by dynamically balancing exploration and reward-driven guidance. The experimental results show that our method can not only accelerate the speed of discovering high-reward regions but also continuously generate high-reward samples, while preserving the diversity of the generative distribution. All implementations are available at this https URL.
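
A minimal sketch of PUCT selection with an explicit greediness knob, in the spirit of the paper's exploration/exploitation control. The tree representation, prior source, and the `greediness` parameter are illustrative assumptions, not the authors' implementation.

```python
import math

def puct_score(q, prior, n_parent, n_child, c_puct=1.5, greediness=1.0):
    """Higher `greediness` upweights the value term Q over exploration."""
    exploration = c_puct * prior * math.sqrt(n_parent) / (1 + n_child)
    return greediness * q + exploration

def select_child(children):
    """children: dicts with keys q, prior, n; pick the child maximizing PUCT."""
    n_parent = sum(c["n"] for c in children) + 1
    return max(children, key=lambda c: puct_score(c["q"], c["prior"],
                                                  n_parent, c["n"]))

children = [
    {"q": 0.9, "prior": 0.2, "n": 40},   # well-explored, high value
    {"q": 0.1, "prior": 0.6, "n": 2},    # barely explored, high prior
]
# With greediness = 1 the under-explored child wins; raising greediness
# shifts selection toward the high-value child.
print(select_child(children))
```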

[AI-41] Fast, Secure, and High-Capacity Image Watermarking with Autoencoded Text Vectors

【Quick Read】: This paper addresses the capacity ceiling and semantic emptiness of the traditional bit-centric view of image watermarking, which treats the embedded payload as meaningless random bits, limiting how much useful information a watermark can carry and hindering uses such as trustworthy AI governance. The key to the solution, LatentSeal, is to reframe watermarking as semantic communication: a lightweight text autoencoder maps full sentences into a compact 256-dimensional unit-norm semantic latent vector, which a fine-tuned watermark model embeds robustly and a secret invertible rotation secures. The design breaks the long-standing 256-bit payload ceiling, decodes in real time, survives valuemetric and geometric attacks, surpasses prior state of the art in BLEU-4 and Exact Match on several benchmarks, and adds a statistically calibrated detection score (ROC AUC 0.97-0.99) with practical operating points, making watermarks high-capacity, robust, secure, and interpretable at once.

Link: https://arxiv.org/abs/2510.00799
Authors: Gautier Evennou, Vivien Chappelier, Ewa Kijak
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Most image watermarking systems focus on robustness, capacity, and imperceptibility while treating the embedded payload as meaningless bits. This bit-centric view imposes a hard ceiling on capacity and prevents watermarks from carrying useful information. We propose LatentSeal, which reframes watermarking as semantic communication: a lightweight text autoencoder maps full-sentence messages into a compact 256-dimensional unit-norm latent vector, which is robustly embedded by a finetuned watermark model and secured through a secret, invertible rotation. The resulting system hides full-sentence messages, decodes in real time, and survives valuemetric and geometric attacks. It surpasses prior state of the art in BLEU-4 and Exact Match on several benchmarks, while breaking through the long-standing 256-bit payload ceiling. It also introduces a statistically calibrated score that yields a ROC AUC score of 0.97-0.99, and practical operating points for deployment. By shifting from bit payloads to semantic latent vectors, LatentSeal enables watermarking that is not only robust and high-capacity, but also secure and interpretable, providing a concrete path toward provenance, tamper explanation, and trustworthy AI governance. Models, training and inference code, and data splits will be available upon publication.
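
A minimal sketch of the "secret, invertible rotation" ingredient: a random orthogonal matrix keyed by a secret seed rotates the unit-norm latent before embedding, and its transpose undoes it at decode time. The keying scheme and dimensions are illustrative assumptions.

```python
import numpy as np

def secret_rotation(dim, seed):
    """Sample a keyed orthogonal matrix via QR decomposition."""
    g = np.random.default_rng(seed).normal(size=(dim, dim))
    q, r = np.linalg.qr(g)
    return q * np.sign(np.diag(r))    # fix column signs so Q is unique

dim, secret_key = 256, 1234
R = secret_rotation(dim, secret_key)

z = np.random.default_rng(0).normal(size=dim)
z /= np.linalg.norm(z)                # unit-norm semantic latent

z_protected = R @ z                   # what the watermark model would embed
z_recovered = R.T @ z_protected       # the key's transpose inverts the rotation

print(np.allclose(z, z_recovered))                    # True
print(np.isclose(np.linalg.norm(z_protected), 1.0))   # rotation preserves norm
```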

[AI-42] Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX NEURIPS2025

【Quick Read】: This paper addresses the limited performance of automated extraction in chemical information extraction (CIE), where the inherent heterogeneity of chemical data defeats both general-purpose and domain-specific agent-based systems: specialised terminology, complex tables and schematic structures, and context-dependent ambiguities. The key contribution is ChemX, a benchmark of 10 manually curated, domain-expert-validated datasets on nanomaterials and small molecules for rigorously evaluating and improving automated extraction methods; the authors also introduce a single-agent approach that gives precise control over document preprocessing before extraction, and benchmark it against agentic systems such as ChatGPT Agent and against frontier baselines like GPT-5 and its chain-of-thought variant GPT-5 Thinking, exposing where current methods fail in chemical settings and how they can improve.

Link: https://arxiv.org/abs/2510.00795
Authors: Anastasia Vepreva, Julia Razlivina, Maria Eremeeva, Nina Gubina, Anastasia Orlova, Aleksei Dmitrenko, Ksenya Kapranova, Susan Jyakhwo, Nikita Vasilev, Arsen Sarkisyan, Ivan Yu. Chernyshov, Vladimir Vinogradov, Andrei Dmitrenko
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at The AI for Accelerated Materials Discovery (AI4Mat) Workshop, NeurIPS 2025

Abstract:The emergence of agent-based systems represents a significant advancement in artificial intelligence, with growing applications in automated data extraction. However, chemical information extraction remains a formidable challenge due to the inherent heterogeneity of chemical data. Current agent-based approaches, both general-purpose and domain-specific, exhibit limited performance in this domain. To address this gap, we present ChemX, a comprehensive collection of 10 manually curated and domain-expert-validated datasets focusing on nanomaterials and small molecules. These datasets are designed to rigorously evaluate and enhance automated extraction methodologies in chemistry. To demonstrate their utility, we conduct an extensive benchmarking study comparing existing state-of-the-art agentic systems such as ChatGPT Agent and chemical-specific data extraction agents. Additionally, we introduce our own single-agent approach that enables precise control over document preprocessing prior to extraction. We further evaluate the performance of modern baselines, such as GPT-5 and GPT-5 Thinking, to compare their capabilities with agentic approaches. Our empirical findings reveal persistent challenges in chemical information extraction, particularly in processing domain-specific terminology, complex tabular and schematic representations, and context-dependent ambiguities. The ChemX benchmark serves as a critical resource for advancing automated information extraction in chemistry, challenging the generalization capabilities of existing methods, and providing valuable insights into effective evaluation strategies.

[AI-43] AI in data science education: experiences from the classroom

【Quick Read】: This paper asks how to responsibly integrate generative AI, in particular large language models (LLMs) such as ChatGPT, into educational settings so that teaching and learning benefit without students' over-reliance on the technology eroding the development of cognitive and problem-solving skills. Based on interviews with coordinators of data science courses at Wageningen University, the key takeaways are to adapt assessment methods, attend to ethical considerations, and have teachers design activities so that AI complements rather than replaces fundamental learning processes, ensuring educational outcomes are still met.

Link: https://arxiv.org/abs/2510.00793
Authors: J.A. Hageman, C.F.W. Peeters
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 6 pages, 0 figures

Abstract:This study explores the integration of AI, particularly large language models (LLMs) like ChatGPT, into educational settings, focusing on the implications for teaching and learning. Through interviews with course coordinators from data science courses at Wageningen University, this research identifies both the benefits and challenges associated with AI in the classroom. While AI tools can streamline tasks and enhance learning, concerns arise regarding students’ overreliance on these technologies, potentially hindering the development of essential cognitive and problem solving skills. The study highlights the importance of responsible AI usage, ethical considerations, and the need for adapting assessment methods to ensure educational outcomes are met. With careful integration, AI can be a valuable asset in education, provided it is used to complement rather than replace fundamental learning processes.

[AI-44] DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models ICCV2025

【Quick Read】: This paper addresses the malicious use of diffusion models in image editing, especially the risk that DDIM inversion is exploited to produce misinformation or deepfakes. The core difficulty is that existing defences such as AdvDM and Photoguard are misaligned with the iterative denoising trajectory at test time, which weakens their disruptive effect. The key to the solution is the DDIM Inversion Attack (DIA), which attacks the integrated DDIM inversion path directly, effectively disrupting the editing trajectory in latent space and outperforming previous defences across a range of editing methods, providing a practical protective framework for both industry and the research community.

Link: https://arxiv.org/abs/2510.00778
Authors: Seunghoo Hong, Geonho Son, Juhun Lee, Simon S. Woo
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: ICCV2025

Abstract:Diffusion models have shown to be strong representation learners, showcasing state-of-the-art performance across multiple domains. Aside from accelerated sampling, DDIM also enables the inversion of real images back to their latent codes. A direct inheriting application of this inversion operation is real image editing, where the inversion yields latent trajectories to be utilized during the synthesis of the edited image. Unfortunately, this practical tool has enabled malicious users to freely synthesize misinformative or deepfake contents with greater ease, which promotes the spread of unethical and abusive, as well as privacy-, and copyright-infringing contents. While defensive algorithms such as AdvDM and Photoguard have been shown to disrupt the diffusion process on these images, the misalignment between their objectives and the iterative denoising trajectory at test time results in weak disruptive effects. In this work, we present the DDIM Inversion Attack (DIA) that attacks the integrated DDIM trajectory path. Our results support the effective disruption, surpassing previous defensive methods across various editing methods. We believe that our frameworks and results can provide practical defense methods against the malicious use of AI for both the industry and the research community. Our code is available here: this https URL.

[AI-45] Neural Diffusion Processes for Physically Interpretable Survival Prediction

【Quick Read】: This paper addresses the tension in survival analysis between interpretability and predictive accuracy: Cox regression relies on the proportional-hazards assumption, while classical parametric models lack flexibility. The key to the solution, DeepFHT, is to couple deep neural networks with first hitting time (FHT) distributions from stochastic process theory: time-to-event is modelled as the first passage of a latent diffusion process to an absorbing boundary, and a neural network maps input features to physically meaningful FHT parameters (initial condition, drift, and diffusion) of a chosen process such as Brownian motion, with or without drift, yielding closed-form survival and hazard functions. The approach needs no proportional-hazards assumption, naturally captures time-varying risk, and offers a physics-based interpretable parameterisation linking input features to risk, matching state-of-the-art predictive accuracy on synthetic and real-world datasets.

Link: https://arxiv.org/abs/2510.00733
Authors: Alessio Cristofoletto, Cesare Rollo, Giovanni Birolo, Piero Fariselli
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 6 figures

Abstract:We introduce DeepFHT, a survival-analysis framework that couples deep neural networks with first hitting time (FHT) distributions from stochastic process theory. Time to event is represented as the first passage of a latent diffusion process to an absorbing boundary. A neural network maps input variables to physically meaningful parameters including initial condition, drift, and diffusion, within a chosen FHT process such as Brownian motion, both with drift and driftless. This yields closed-form survival and hazard functions and captures time-varying risk without assuming proportional-hazards. We compare DeepFHT with Cox regression and other existing parametric survival models, using synthetic and real-world datasets. The method achieves predictive accuracy on par with state-of-the-art approaches, while maintaining a physics-based interpretable parameterization that elucidates the relation between input features and risk. This combination of stochastic process theory and deep learning provides a principled avenue for modeling survival phenomena in complex systems.
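
A minimal sketch of the closed-form ingredient DeepFHT builds on: the survival function of the first hitting time of a Brownian motion with drift, started at distance a > 0 from an absorbing boundary. In the full model a neural network would output (a, mu, sigma) per subject; the fixed values below are illustrative assumptions.

```python
from math import erf, exp, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def fht_survival(t, a, mu, sigma):
    """P(T > t) for the first passage of mu*t + sigma*W(t) from 0 to level a."""
    s = sigma * sqrt(t)
    return (norm_cdf((a - mu * t) / s)
            - exp(2 * mu * a / sigma**2) * norm_cdf((-a - mu * t) / s))

# Survival decays over time; stronger drift toward the boundary decays faster.
for t in (0.5, 1.0, 2.0, 5.0):
    print(t, round(fht_survival(t, a=1.0, mu=0.5, sigma=1.0), 4))
```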

[AI-46] EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty

【Quick Read】: This paper addresses the weak generalisation of large language models (LLMs) in formal theorem proving and their fragility to even minor transformations of problem statements. The key to the solution is a data augmentation pipeline that improves robustness along two axes. For symmetry, it offers two complementary methods: EvolAST, an Abstract Syntax Tree (AST) based approach that targets syntactic symmetry to generate semantically equivalent problem variants, and EvolDomain, which uses LLMs to address semantic symmetry by translating theorems across mathematical domains. For difficulty, EvolDifficulty uses carefully designed evolutionary instructions to guide LLMs in generating new theorems spanning a wider difficulty range. Training on the evolved data yields EvolProver, a 7B-parameter non-reasoning theorem prover that sets new state-of-the-art results on several benchmarks, validating the augmentation pipeline.

Link: https://arxiv.org/abs/2510.00732
Authors: Yuchen Tian, Ruiyuan Huang, Xuanwu Wang, Jing Ma, Zengfeng Huang, Ziyang Luo, Hongzhan Lin, Da Zheng, Lun Du
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) for formal theorem proving have shown significant promise, yet they often lack generalizability and are fragile to even minor transformations of problem statements. To address this limitation, we introduce a novel data augmentation pipeline designed to enhance model robustness from two perspectives: symmetry and difficulty. From the symmetry perspective, we propose two complementary methods: EvolAST, an Abstract Syntax Tree (AST) based approach that targets syntactic symmetry to generate semantically equivalent problem variants, and EvolDomain, which leverages LLMs to address semantic symmetry by translating theorems across mathematical domains. From the difficulty perspective, we propose EvolDifficulty, which uses carefully designed evolutionary instructions to guide LLMs in generating new theorems with a wider range of difficulty. We then use the evolved data to train EvolProver, a 7B-parameter non-reasoning theorem prover. EvolProver establishes a new state-of-the-art (SOTA) on FormalMATH-Lite with a 53.8% pass@32 rate, surpassing all models of comparable size, including reasoning-based models. It also sets new SOTA records for non-reasoning models on MiniF2F-Test (69.8% pass@32), Ineq-Comp-Seed (52.2% pass@32), and Ineq-Comp-Transformed (34.0% pass@32). Ablation studies further confirm our data augmentation pipeline’s effectiveness across multiple benchmarks.

[AI-47] CroSTAta: Cross-State Transition Attention Transformer for Robotic Manipulation

【Quick Read】: This paper addresses how robotic manipulation policies learned from demonstrations can stay robust when execution varies in ways not explicitly covered during training. Incorporating historical context via attention helps, but standard attention treats all past states equally and fails to model the temporal structure demonstrations may contain, such as failure-and-recovery patterns. The key to the solution is a Cross-State Transition Attention Transformer whose State Transition Attention (STA) mechanism modulates standard attention weights according to learned state-evolution patterns, so policies adapt their behaviour based on execution history; training additionally uses temporal masking, randomly removing recent visual information to force temporal reasoning from historical context. In simulation, STA consistently outperforms standard cross-attention and temporal models such as TCN and LSTM networks across all tasks, with more than a 2x improvement over cross-attention on precision-critical tasks.

Link: https://arxiv.org/abs/2510.00726
Authors: Giovanni Minelli, Giulio Turrisi, Victor Barasuol, Claudio Semini
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code and data available at this https URL

Abstract:Learning robotic manipulation policies through supervised learning from demonstrations remains challenging when policies encounter execution variations not explicitly covered during training. While incorporating historical context through attention mechanisms can improve robustness, standard approaches process all past states in a sequence without explicitly modeling the temporal structure that demonstrations may include, such as failure and recovery patterns. We propose a Cross-State Transition Attention Transformer that employs a novel State Transition Attention (STA) mechanism to modulate standard attention weights based on learned state evolution patterns, enabling policies to better adapt their behavior based on execution history. Our approach combines this structured attention with temporal masking during training, where visual information is randomly removed from recent timesteps to encourage temporal reasoning from historical context. Evaluation in simulation shows that STA consistently outperforms standard cross-attention and temporal modeling approaches like TCN and LSTM networks across all tasks, achieving more than 2x improvement over cross-attention on precision-critical tasks.

[AI-48] AttentionDep: Domain-Aware Attention for Explainable Depression Severity Assessment

【Quick Read】: This paper addresses automatic detection of depression severity from social media text, with emphasis on interpretability and clinical relevance: the core challenge is fusing contextual semantics with domain knowledge to improve accuracy while providing trustworthy rationale. The key to the solution, AttentionDep, is threefold: posts are encoded hierarchically over unigrams and bigrams, with attention mechanisms highlighting clinically relevant tokens; domain knowledge from a curated mental health knowledge graph enriches the contextual features via cross-attention; and severity is predicted with an ordinal regression framework that respects the natural clinical ordering of severity levels. Across datasets, the method beats state-of-the-art baselines by over 5% in graded F1 score while providing interpretable insights into its predictions.

Link: https://arxiv.org/abs/2510.00706
Authors: Yusif Ibrahimov, Tarique Anwar, Tommy Yuan, Turan Mutallimov, Elgun Hasanov
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:In today’s interconnected society, social media platforms provide a window into individuals’ thoughts, emotions, and mental states. This paper explores the use of platforms like Facebook, X (formerly Twitter), and Reddit for depression severity detection. We propose AttentionDep, a domain-aware attention model that drives explainable depression severity estimation by fusing contextual and domain knowledge. Posts are encoded hierarchically using unigrams and bigrams, with attention mechanisms highlighting clinically relevant tokens. Domain knowledge from a curated mental health knowledge graph is incorporated through a cross-attention mechanism, enriching the contextual features. Finally, depression severity is predicted using an ordinal regression framework that respects the clinical-relevance and natural ordering of severity levels. Our experiments demonstrate that AttentionDep outperforms state-of-the-art baselines by over 5% in graded F1 score across datasets, while providing interpretable insights into its predictions. This work advances the development of trustworthy and transparent AI systems for mental health assessment from social media.
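
A minimal sketch of a cumulative-link ordinal regression head of the kind the abstract describes: ordered thresholds turn one scalar severity score into probabilities over ordered levels, so predictions respect the natural ordering. The threshold values, level names, and scores are illustrative assumptions.

```python
import numpy as np

def ordinal_probs(score, thresholds):
    """P(y = k) from P(y <= k) = sigmoid(theta_k - score), theta ascending."""
    cdf = 1.0 / (1.0 + np.exp(-(np.asarray(thresholds) - score)))
    cdf = np.concatenate([[0.0], cdf, [1.0]])   # anchor the two trivial CDF ends
    return np.diff(cdf)                          # per-level probabilities, sum to 1

thresholds = [-1.0, 0.5, 2.0]      # 4 ordered levels: none < mild < moderate < severe
for score in (-2.0, 0.0, 3.0):     # higher score -> probability mass shifts up
    print(score, ordinal_probs(score, thresholds).round(3))
```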

[AI-49] ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning

【Quick Read】: This paper addresses the limitations of existing policy optimisation algorithms when aligning large vision-language models (VLMs) for complex reasoning via reinforcement learning, such as static training schedules and the rigid, uniform clipping of Proximal Policy Optimization (PPO), which hurt efficiency and stability. The key to the solution, Adaptive Curriculum Policy Optimization (ACPO), is a dual-component adaptive strategy: a dynamic curriculum that progressively increases sample reuse to transition smoothly from a stable, near on-policy exploration phase to an efficient off-policy exploitation phase; and an Advantage-Aware Adaptive Clipping (AAAC) mechanism that replaces the fixed clipping hyperparameter with dynamic per-token bounds modulated by each token's normalised advantage, allowing larger gradient updates for high-potential samples while guarding against destructive ones, for more granular and robust policy updates.

Link: https://arxiv.org/abs/2510.00690
Authors: Yunhao Wang, Ziting Li, Shuai Chen, Tao Liu, Chao Song, Junjie Jiang, Jian Zhu, Peng Gao, Bin Qin
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Aligning large-scale vision-language models (VLMs) for complex reasoning via reinforcement learning is often hampered by the limitations of existing policy optimization algorithms, such as static training schedules and the rigid, uniform clipping mechanism in Proximal Policy Optimization (PPO). In this work, we introduce Adaptive Curriculum Policy Optimization (ACPO), a novel framework that addresses these challenges through a dual-component adaptive learning strategy. First, ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase by progressively increasing sample reuse. Second, we propose an Advantage-Aware Adaptive Clipping (AAAC) mechanism that replaces the fixed clipping hyperparameter with dynamic, sample-wise bounds modulated by the normalized advantage of each token. This allows for more granular and robust policy updates, enabling larger gradients for high-potential samples while safeguarding against destructive ones. We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro. Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
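
A minimal sketch of advantage-aware clipping in the spirit of AAAC: the PPO clip range widens for tokens with large positive normalised advantage and tightens otherwise. The exact modulation rule below is an illustrative assumption, not the paper's formula.

```python
import torch

def aaac_ppo_loss(logp_new, logp_old, advantages, base_eps=0.2, scale=0.1):
    ratio = torch.exp(logp_new - logp_old)
    adv_norm = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Per-token clip range: wider where the normalised advantage is larger.
    eps = torch.clamp(base_eps + scale * adv_norm, min=0.05, max=0.4)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(8)
advantages = torch.randn(8)
loss = aaac_ppo_loss(logp_new, logp_old, advantages)
loss.backward()
print(loss.item())
```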

[AI-50] Relevance-Zone Reduction in Game Solving

【Quick Read】: This paper addresses the difficulty of solving large games, where the exponential growth of the game tree keeps many games unsolved even though methods like AlphaZero reach superhuman playing strength. The Relevance-Zone (RZ) technique restricts search to the regions relevant to the outcome, but RZs are not unique: different solutions may yield RZs of different sizes, and smaller RZs are preferable because they are reused more often and prune more effectively. The key to the solution is an iterative RZ reduction method that repeatedly solves the same position while progressively restricting the region involved, guiding the solver toward smaller RZs; three constraint-generation strategies are designed and an RZ Pattern Table is integrated to fully exploit past solutions. On 7x7 Killall-Go, the average RZ size drops to 85.95% of the original, and the reduced RZs can be stored permanently as reusable knowledge for future solving tasks, e.g., on larger boards or different openings.

Link: https://arxiv.org/abs/2510.00689
Authors: Chi-Huang Lin, Ting Han Wei, Chun-Jui Wang, Hung Guei, Chung-Chin Shih, Yun-Jui Tsai, I-Chen Wu, Ti-Rong Wu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by the Advances in Computer Games (ACG 2025)

Abstract:Game solving aims to find the optimal strategies for all players and determine the theoretical outcome of a game. However, due to the exponential growth of game trees, many games remain unsolved, even though methods like AlphaZero have demonstrated super-human level in game playing. The Relevance-Zone (RZ) is a local strategy reuse technique that restricts the search to only the regions relevant to the outcome, significantly reducing the search space. However, RZs are not unique. Different solutions may result in RZs of varying sizes. Smaller RZs are generally more favorable, as they increase the chance of reuse and improve pruning efficiency. To this end, we propose an iterative RZ reduction method that repeatedly solves the same position while gradually restricting the region involved, guiding the solver toward smaller RZs. We design three constraint generation strategies and integrate an RZ Pattern Table to fully leverage past solutions. In experiments on 7x7 Killall-Go, our method reduces the average RZ size to 85.95% of the original. Furthermore, the reduced RZs can be permanently stored as reusable knowledge for future solving tasks, especially for larger board sizes or different openings.

[AI-51] Collaborative-Distilled Diffusion Models (CDDM) for Accelerated and Lightweight Trajectory Prediction

【Quick Read】: This paper addresses two obstacles to deploying diffusion models for trajectory prediction in autonomous vehicles (AVs) and intelligent transportation systems (ITS): large model size and slow sampling, which conflict with real-time requirements. The key to the solution, Collaborative-Distilled Diffusion Models (CDDM), is Collaborative Progressive Distillation (CPD), which progressively transfers knowledge from a high-capacity teacher diffusion model to a lightweight student while jointly reducing both the number of sampling steps and the model size across distillation iterations; a dual-signal regularised distillation loss further combines guidance from the teacher and from ground-truth data to curb overfitting and keep predictions robust. Experiments show CDDM retains close to the baseline's performance (96.2% / 95.5% of ADE / FDE on pedestrian trajectories) with only 231K parameters and 4 or 2 sampling steps, i.e., 161x compression, 31x acceleration, and 9 ms latency, markedly improving the deployability of generative models in practice.

Link: https://arxiv.org/abs/2510.00627
Authors: Bingzhang Wang, Kehua Chen, Yinhai Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Trajectory prediction is a fundamental task in Autonomous Vehicles (AVs) and Intelligent Transportation Systems (ITS), supporting efficient motion planning and real-time traffic safety management. Diffusion models have recently demonstrated strong performance in probabilistic trajectory prediction, but their large model size and slow sampling process hinder real-world deployment. This paper proposes Collaborative-Distilled Diffusion Models (CDDM), a novel method for real-time and lightweight trajectory prediction. Built upon Collaborative Progressive Distillation (CPD), CDDM progressively transfers knowledge from a high-capacity teacher diffusion model to a lightweight student model, jointly reducing both the number of sampling steps and the model size across distillation iterations. A dual-signal regularized distillation loss is further introduced to incorporate guidance from both the teacher and ground-truth data, mitigating potential overfitting and ensuring robust performance. Extensive experiments on the ETH-UCY pedestrian benchmark and the nuScenes vehicle benchmark demonstrate that CDDM achieves state-of-the-art prediction accuracy. The well-distilled CDDM retains 96.2% and 95.5% of the baseline model’s ADE and FDE performance on pedestrian trajectories, while requiring only 231K parameters and 4 or 2 sampling steps, corresponding to 161x compression, 31x acceleration, and 9 ms latency. Qualitative results further show that CDDM generates diverse and accurate trajectories under dynamic agent behaviors and complex social interactions. By bridging high-performing generative models with practical deployment constraints, CDDM enables resource-efficient probabilistic prediction for AVs and ITS. Code is available at this https URL.

[AI-52] Is Model Editing Built on Sand? Revealing Its Illusory Success and Fragile Foundation

【Quick Read】: This paper addresses the difficulty of updating, deleting, and forgetting outdated or incorrect knowledge in large language models (LLMs), which directly bears on alignment, safety, and related properties. Model editing, the dominant paradigm, fine-tunes a small set of parameters so that a specific fact is updated while other knowledge is preserved; the authors argue its apparent reliability rests on a fragile foundation, exploiting hidden shortcuts rather than genuine semantic understanding. The key contribution is a suite of new evaluation methods, in particular the deliberate design of negative examples, which reveals that state-of-the-art editing methods collapse even under the simplest negation queries, indicating that current techniques likely rely on surface shortcuts rather than deep semantic integration and that the very foundations of model editing urgently need rethinking.

Link: https://arxiv.org/abs/2510.00625
Authors: Wei Liu, Haomei Xu, Bingqing Liu, Zhiying Deng, Haozhao Wang, Jun Wang, Ruixuan Li, Yee Whye Teh, Wee Sun Lee
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This is a work in progress. Comments and suggestions are welcome

Abstract:Large language models (LLMs) inevitably encode outdated or incorrect knowledge. Updating, deleting, and forgetting such knowledge is important for alignment, safety, and other issues. To address this issue, model editing has emerged as a promising paradigm: by precisely editing a small subset of parameters such that a specific fact is updated while preserving other knowledge. Despite its great success reported in previous papers, we find the apparent reliability of editing rests on a fragile foundation and the current literature is largely driven by illusory success. The fundamental goal of steering the model’s output toward a target with minimal modification would encourage exploiting hidden shortcuts, rather than utilizing real semantics. This problem directly challenges the feasibility of the current model editing literature at its very foundation, as shortcuts are inherently at odds with robust knowledge integration. Coincidentally, this issue has long been obscured by evaluation frameworks that lack the design of negative examples. To uncover it, we systematically develop a suite of new evaluation methods. Strikingly, we find that state-of-the-art approaches collapse even under the simplest negation queries. Our empirical evidence shows that editing is likely to be based on shortcuts rather than full semantics, calling for an urgent reconsideration of the very basis of model editing before further advancements can be meaningfully pursued.

[AI-53] FAME: Adaptive Functional Attention with Expert Routing for Function-on-Function Regression

【Quick Read】: This paper addresses the challenge of representation learning for functional data: traditional statistical models depend on pre-chosen basis expansions or kernels and thus lack flexibility for data-driven discovery, while deep-learning pipelines treat functions as fixed-grid vectors and ignore their inherent continuity. The key to the solution, Functional Attention with a Mixture-of-Experts (FAME), is an end-to-end, fully data-driven framework for function-on-function regression: continuous attention is formed by coupling a bidirectional neural controlled differential equation with Mixture-of-Experts-driven vector fields to capture intra-functional continuity, and multi-head cross attention fuses inter-functional dependencies. On synthetic and real-world functional-regression benchmarks, FAME achieves state-of-the-art accuracy and strong robustness to arbitrarily sampled discrete observations of functions.

Link: https://arxiv.org/abs/2510.00621
Authors: Yifei Gao, Yong Chen, Chen Zhang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Functional data play a pivotal role across science and engineering, yet their infinite-dimensional nature makes representation learning challenging. Conventional statistical models depend on pre-chosen basis expansions or kernels, limiting the flexibility of data-driven discovery, while many deep-learning pipelines treat functions as fixed-grid vectors, ignoring inherent continuity. In this paper, we introduce Functional Attention with a Mixture-of-Experts (FAME), an end-to-end, fully data-driven framework for function-on-function regression. FAME forms continuous attention by coupling a bidirectional neural controlled differential equation with MoE-driven vector fields to capture intra-functional continuity, and further fuses change to inter-functional dependencies via multi-head cross attention. Extensive experiments on synthetic and real-world functional-regression benchmarks show that FAME achieves state-of-the-art accuracy, strong robustness to arbitrarily sampled discrete observations of functions.

[AI-54] What Did I Learn? Operational Competence Assessment for AI-Based Trajectory Planners

【Quick Read】: This paper addresses the difficulty of assessing the operational risk of machine-learning-based automated driving when training data coverage is insufficient, i.e., reliably recognising situations the vehicle has not been adequately trained on, and explaining them. The key to the solution is to model driving data as knowledge graphs, representing driving scenes with entities and their relationships, and to query how often specific sub-scene configurations occur in the training data; a vehicle's competence in a scene is then estimated from the coverage and complexity of its sub-scene configurations, with more complex scenes requiring greater coverage for high competence. The approach improves explainability by describing the dataset at a human-understandable level and supplies a basis for monitoring trustworthy AI in automated driving, demonstrated on the NuPlan dataset.

Link: https://arxiv.org/abs/2510.00619
Authors: Michiel Braat, Maren Buermann, Marijke van Weperen, Jan-Pieter Paardekooper
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in proceedings of the 2025 IEEE International Automated Vehicle Validation Conference

Abstract:Automated driving functions increasingly rely on machine learning for tasks like perception and trajectory planning, requiring large, relevant datasets. The performance of these algorithms depends on how closely the training data matches the task. To ensure reliable functioning, it is crucial to know what is included in the dataset to assess the trained model’s operational risk. We aim to enhance the safe use of machine learning in automated driving by developing a method to recognize situations that an automated vehicle has not been sufficiently trained on. This method also improves explainability by describing the dataset at a human-understandable level. We propose modeling driving data as knowledge graphs, representing driving scenes with entities and their relationships. These graphs are queried for specific sub-scene configurations to check their occurrence in the dataset. We estimate a vehicle’s competence in a driving scene by considering the coverage and complexity of sub-scene configurations in the training set. Higher complexity scenes require greater coverage for high competence. We apply this method to the NuPlan dataset, modeling it with knowledge graphs and analyzing the coverage of specific driving scenes. This approach helps monitor the competence of machine learning models trained on the dataset, which is essential for trustworthy AI to be deployed in automated driving.

[AI-55] AI-Driven Self-Evolving Software: A Promising Path Toward Software Automation

【Quick Read】: This paper asks whether AI can move beyond assisting human developers to become a core component of software itself, enabling genuine software automation: today's AI still leaves development dependent on explicit human intervention. The key to the solution is a new form of software, AI-Driven Self-Evolving Software, that evolves continuously through direct interaction with users; a lightweight multi-agent prototype autonomously interprets user requirements, generates and validates code, and integrates new functionality. Case studies across representative scenarios show it can reliably construct and reuse functionality, providing early evidence that such systems can scale to more sophisticated applications and pave the way toward truly automated software development.

Link: https://arxiv.org/abs/2510.00591
Authors: Liyi Cai, Yijie Ren, Yitong Zhang, Jia Li
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Software automation has long been a central goal of software engineering, striving for software development that proceeds without human intervention. Recent efforts have leveraged Artificial Intelligence (AI) to advance software automation with notable progress. However, current AI functions primarily as assistants to human developers, leaving software development still dependent on explicit human intervention. This raises a fundamental question: Can AI move beyond its role as an assistant to become a core component of software, thereby enabling genuine software automation? To investigate this vision, we introduce AI-Driven Self-Evolving Software, a new form of software that evolves continuously through direct interaction with users. We demonstrate the feasibility of this idea with a lightweight prototype built on a multi-agent architecture that autonomously interprets user requirements, generates and validates code, and integrates new functionalities. Case studies across multiple representative scenarios show that the prototype can reliably construct and reuse functionality, providing early evidence that such software systems can scale to more sophisticated applications and pave the way toward truly automated software development. We make code and cases in this work publicly available at this https URL.

[AI-56] Panorama: Fast-Track Nearest Neighbors

【Quick Read】: This paper addresses the verification bottleneck in approximate nearest-neighbor search (ANNS): systems spend up to 99% of query time computing distances in the final refinement phase. The key to the solution, PANORAMA, is a data-adaptive learned orthogonal transform that compacts over 90% of signal energy into the first half of the dimensions, so candidates can be pruned early from partial distance computations. It integrates into IVFPQ/Flat, HNSW, MRPT, and Annoy without index modification, using level-major memory layouts, SIMD-vectorized partial distance computations, and cache-aware access patterns, delivering a 2-30x end-to-end speedup with no recall loss.

Link: https://arxiv.org/abs/2510.00566
Authors: Vansh Ramani, Alexis Schlomer, Akash Nayar, Panagiotis Karras, Sayan Ranu, Jignesh M. Patel
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:Approximate Nearest-Neighbor Search (ANNS) efficiently finds data items whose embeddings are close to that of a given query in a high-dimensional space, aiming to balance accuracy with speed. Used in recommendation systems, image and video retrieval, natural language processing, and retrieval-augmented generation (RAG), ANNS algorithms such as IVFPQ, HNSW graphs, Annoy, and MRPT utilize graph, tree, clustering, and quantization techniques to navigate large vector spaces. Despite this progress, ANNS systems spend up to 99% of query time to compute distances in their final refinement phase. In this paper, we present PANORAMA, a machine learning-driven approach that tackles the ANNS verification bottleneck through data-adaptive learned orthogonal transforms that facilitate the accretive refinement of distance bounds. Such transforms compact over 90% of signal energy into the first half of dimensions, enabling early candidate pruning with partial distance computations. We integrate PANORAMA into state-of-the-art ANNS methods, namely IVFPQ/Flat, HNSW, MRPT, and Annoy, without index modification, using level-major memory layouts, SIMD-vectorized partial distance computations, and cache-aware access patterns. Experiments across diverse datasets – from image-based CIFAR-10 and GIST to modern embedding spaces including OpenAI’s Ada 2 and Large 3 – demonstrate that PANORAMA affords a 2–30× end-to-end speedup with no recall loss.
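
A minimal sketch of the energy-compacting-transform idea: after an orthogonal transform (plain PCA here, a stand-in for the paper's learned transforms), prefix sums of squared coordinate differences are exact lower bounds on the full distance, so a candidate can be pruned as soon as its partial distance exceeds the current best. Data sizes and the block width are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64)) @ rng.normal(size=(64, 64))  # correlated dims

# Orthogonal transform from PCA: rotate so leading dimensions carry most energy.
_, _, vt = np.linalg.svd(data - data.mean(0), full_matrices=False)
xt = data @ vt.T

def pruned_distance(q, x, best_so_far, block=8):
    """Accumulate squared distance blockwise; bail out once the bound is exceeded."""
    acc = 0.0
    for s in range(0, len(q), block):
        acc += np.sum((q[s:s + block] - x[s:s + block]) ** 2)
        if acc > best_so_far:      # partial distance is a lower bound: prune
            return None
    return acc

q = rng.normal(size=64) @ vt.T
best = np.inf
for x in xt:
    d = pruned_distance(q, x, best)
    if d is not None:
        best = d
print(np.sqrt(best))               # exact 1-NN distance, fewer multiplications
```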

[AI-57] Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability

【Quick Read】: This paper reveals a safety vulnerability rooted in the iterative denoising inference of diffusion language models (DLMs): if an affirmative token for a harmful query appears at an intermediate denoising step, subsequent denoising can be steered toward a harmful response even in aligned models, so simply injecting such tokens bypasses the safety guardrails, and existing optimisation-based jailbreaks succeed on DLMs as well. The key to the countermeasure is a DLM-specific safety alignment method that trains models to generate safe outputs from contaminated intermediate states containing affirmative tokens, so the model keeps responding safely despite such interference. Experiments show the method significantly mitigates the vulnerability with minimal impact on task performance and also improves robustness against conventional jailbreak attacks.

Link: https://arxiv.org/abs/2510.00565
Authors: Shojiro Yamabe, Jun Sakuma
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion language models (DLMs) generate tokens in parallel through iterative denoising, which can reduce latency and enable bidirectional conditioning. However, the safety risks posed by jailbreak attacks that exploit this inference mechanism are not well understood. In this paper, we reveal that DLMs have a critical vulnerability stemming from their iterative denoising process and propose a countermeasure. Specifically, our investigation shows that if an affirmative token for a harmful query appears at an intermediate step, subsequent denoising can be steered toward a harmful response even in aligned models. As a result, simply injecting such affirmative tokens can readily bypass the safety guardrails. Furthermore, we demonstrate that the vulnerability allows existing optimization-based jailbreak attacks to succeed on DLMs. Building on this analysis, we propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate states that contain affirmative tokens. Our experiments indicate that the proposed method significantly mitigates the vulnerability with minimal impact on task performance. Furthermore, our method improves robustness against conventional jailbreak attacks. Our work underscores the need for DLM-specific safety research.

[AI-58] Memory Determines Learning Direction: A Theory of Gradient-Based Optimization in State Space Models

【Quick Read】: This paper addresses the missing theoretical account of why state space models (SSMs) perform so well, in particular the lack of an explanation of their learning dynamics. The key lies in analysing how input time series are stored in the model's current state, which reveals a tradeoff between memory accuracy and memory length and proves the theoretical equivalence between the structured state space sequence model (S4) and a simplified S4 with diagonal recurrent weights. On this foundation, the paper explains why initial parameters matter and derives an improved training strategy: successful learning requires initialising with the longest possible memory structure, even if memory accuracy deteriorates or the gradient loses teacher information, and fixing the recurrent weights rather than adapting them can give comparable or even better performance with faster convergence, offering a new theoretical grounding and optimisation direction for SSMs.

Link: https://arxiv.org/abs/2510.00563
Authors: JingChuan Guan, Tomoyuki Kubota, Yasuo Kuniyoshi, Kohei Nakajima
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:State space models (SSMs) have gained attention by showing potential to outperform Transformers. However, previous studies have not sufficiently addressed the mechanisms underlying their high performance owing to a lack of theoretical explanation of SSMs’ learning dynamics. In this study, we provide such an explanation and propose an improved training strategy. The memory capacity of SSMs can be evaluated by examining how input time series are stored in their current state. Such an examination reveals a tradeoff between memory accuracy and length, as well as the theoretical equivalence between the structured state space sequence model (S4) and a simplified S4 with diagonal recurrent weights. This theoretical foundation allows us to elucidate the learning dynamics, proving the importance of initial parameters. Our analytical results suggest that successful learning requires the initial memory structure to be the longest possible even if memory accuracy may deteriorate or the gradient lose the teacher information. Experiments on tasks requiring long memory confirmed that extending memory is difficult, emphasizing the importance of initialization. Furthermore, we found that fixing recurrent weights can be more advantageous than adapting them because it achieves comparable or even higher performance with faster convergence. Our results provide a new theoretical foundation for SSMs and potentially offer a novel optimization strategy.
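
A minimal sketch of the diagonal-recurrence view the analysis builds on: with diagonal recurrent weights lambda, each state channel is a leaky accumulator, and |lambda| sets the memory length (closer to 1 means longer but less sharply resolved memory). The channel values below are illustrative.

```python
import numpy as np

def diagonal_ssm(x, lam):
    """Run the diagonal linear recurrence h_t = lam * h_{t-1} + x_t over x."""
    h = np.zeros_like(lam, dtype=float)
    states = []
    for x_t in x:
        h = lam * h + x_t
        states.append(h.copy())
    return np.array(states)

lam = np.array([0.5, 0.9, 0.99])       # short-, mid-, long-memory channels
x = np.zeros(100); x[0] = 1.0          # a single impulse at t = 0
states = diagonal_ssm(x, lam)
print(states[50])                      # impulse response after 50 steps: lam**50
```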

[AI-59] PromptPilot: Improving Human-AI Collaboration Through LLM-Enhanced Prompt Engineering

【Quick Read】: This paper addresses the fact that many users struggle to craft prompts that yield high-quality LLM outputs for knowledge-intensive tasks, limiting the promised productivity gains; existing remedies such as prompt handbooks or automated optimisation pipelines demand substantial effort and expert knowledge or lack interactive guidance. The key to the solution is PromptPilot, an interactive prompting assistant grounded in four empirically derived design objectives for LLM-enhanced prompt engineering. In a randomized controlled experiment with 80 participants on three realistic work-related writing tasks, participants supported by PromptPilot performed significantly better (median 78.3 vs. 61.7; p = .045, d = 0.56) and reported greater efficiency, ease of use, and autonomy, empirically validating LLM-enhanced prompt engineering as a viable technique for improving human-AI collaboration.

Link: https://arxiv.org/abs/2510.00555
Authors: Niklas Gutheil, Valentin Mayer, Leopold Müller, Jörg Rommelt, Niklas Kühl
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Preprint version. Accepted for presentation at the International Conference on Information Systems (ICIS 2025). Please cite the published version when available

Abstract:Effective prompt engineering is critical to realizing the promised productivity gains of large language models (LLMs) in knowledge-intensive tasks. Yet, many users struggle to craft prompts that yield high-quality outputs, limiting the practical benefits of LLMs. Existing approaches, such as prompt handbooks or automated optimization pipelines, either require substantial effort, expert knowledge, or lack interactive guidance. To address this gap, we design and evaluate PromptPilot, an interactive prompting assistant grounded in four empirically derived design objectives for LLM-enhanced prompt engineering. We conducted a randomized controlled experiment with 80 participants completing three realistic, work-related writing tasks. Participants supported by PromptPilot achieved significantly higher performance (median: 78.3 vs. 61.7; p = .045, d = 0.56), and reported enhanced efficiency, ease-of-use, and autonomy during interaction. These findings empirically validate the effectiveness of our proposed design objectives, establishing LLM-enhanced prompt engineering as a viable technique for improving human-AI collaboration.
zh

[AI-60] On Predictability of Reinforcement Learning Dynamics for Large Language Models

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)训练过程中大型语言模型(Large Language Models, LLMs)参数更新机制不明确的问题,尤其关注其对推理能力提升的贡献。解决方案的关键在于发现并利用两个普适性规律:一是“秩-1主导性”(Rank-1 Dominance),即参数更新矩阵的主奇异子空间几乎完全决定了推理性能的提升(恢复超过99%的性能增益);二是“秩-1线性动力学”(Rank-1 Linear Dynamics),表明该主导子空间在训练过程中呈线性演化,从而可从早期检查点准确预测最终更新。基于此,作者提出AlphaRL加速框架,通过短时早期训练窗口外推最终参数更新,在无需额外模块或超参数调优的情况下实现最高2.5倍加速,并保留超过96%的推理性能,为大规模RL训练提供了可解释、高效的新范式。

链接: https://arxiv.org/abs/2510.00553
作者: Yuchen Cai,Ding Cao,Xin Xu,Zijun Yao,Yuqing Huang,Zhenyu Tan,Benyi Zhang,Guiquan Liu,Junfeng Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 43 pages, 28 figures

点击查看摘要

Abstract:Recent advances in reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5× speedup while retaining >96% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
zh
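
AlphaRL 的核心操作可以用如下最小示意理解(非官方实现;玩具矩阵、步数与外推比例均为演示假设):对早期训练窗口得到的参数更新做 SVD 取秩-1 主成分,再按"秩-1 线性动力学"假设线性外推到训练终点:

```python
import numpy as np

def rank1(update: np.ndarray):
    """Top singular component of a parameter-update matrix."""
    u, s, vt = np.linalg.svd(update, full_matrices=False)
    return s[0], np.outer(u[:, 0], vt[0])

# Toy stand-ins for one weight matrix at three checkpoints.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 64))
direction = np.outer(rng.normal(size=64), rng.normal(size=64))
direction /= np.linalg.norm(direction)
w_early = w0 + 0.2 * direction          # after a short early RL window
w_final_true = w0 + 1.0 * direction     # what full training would reach

# Rank-1 linear dynamics: the dominant subspace grows linearly, so we
# extrapolate the early update by the ratio of total to elapsed steps.
steps_early, steps_total = 200, 1000
sigma, basis = rank1(w_early - w0)
w_extrapolated = w0 + (steps_total / steps_early) * sigma * basis

err = np.linalg.norm(w_extrapolated - w_final_true) / np.linalg.norm(w_final_true - w0)
print(f"relative extrapolation error: {err:.3f}")
```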

[AI-61] Data Quality Challenges in Retrieval-Augmented Generation

【速读】:该论文旨在解决当前数据质量(Data Quality, DQ)框架难以适配检索增强生成(Retrieval-Augmented Generation, RAG)系统动态、多阶段特性的问题。其解决方案的关键在于通过16次半结构化访谈与定性内容分析,从RAG系统的四个处理阶段——数据提取、数据转换、提示检索和生成中归纳出15个新的DQ维度,揭示了传统DQ框架需扩展以覆盖RAG场景,并强调早期步骤的质量管理至关重要,同时指出DQ问题在RAG流水线中会演化和传播,因此必须采用动态且步骤感知的质量管理策略。

链接: https://arxiv.org/abs/2510.00552
作者: Leopold Müller,Joshua Holstein,Sarah Bause,Gerhard Satzger,Niklas Kühl
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Preprint version. Accepted for presentation at the International Conference on Information Systems (ICIS 2025). Please cite the published version when available

点击查看摘要

Abstract:Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks have been primarily developed for static datasets, and only inadequately address the dynamic, multi-stage nature of RAG systems. This study aims to develop DQ dimensions for this new type of AI-based systems. We conduct 16 semi-structured interviews with practitioners of leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt search, and generation. Our findings reveal that (1) new dimensions have to be added to traditional DQ frameworks to also cover RAG contexts; (2) these new dimensions are concentrated in early RAG steps, suggesting the need for front-loaded quality management strategies, and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.
zh

[AI-62] EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases ICLR2026

【速读】:该论文旨在解决临床预测模型中结构化数据提取依赖于硬编码、数据库特定的流水线所导致的可扩展性差、可复现性低及跨机构泛化能力弱的问题。其解决方案的关键在于提出EMR-AGENT(Automated Generalized Extraction and Navigation Tool),一个基于智能体(agent)的框架,通过语言模型驱动的动态交互替代人工规则编写,实现队列选择、特征提取与代码映射的自动化;该框架利用SQL作为数据检索和决策工具,结合对数据库模式与文档的迭代观察与推理,从而无需手动设计针对特定schema的逻辑,显著提升了流程的通用性和适应性。

链接: https://arxiv.org/abs/2510.00549
作者: Kwanhyung Lee,Sungsoo Hong,Joonhyung Park,Jeonghyeop Lim,Juhwan Choi,Donghwee Yoon,Eunho Yang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: currently under submission to ICLR 2026

点击查看摘要

Abstract:Machine learning models for clinical prediction rely on structured data extracted from Electronic Medical Records (EMRs), yet this process remains dominated by hardcoded, database-specific pipelines for cohort definition, feature selection, and code mapping. These manual efforts limit scalability, reproducibility, and cross-institutional generalization. To address this, we introduce EMR-AGENT (Automated Generalized Extraction and Navigation Tool), an agent-based framework that replaces manual rule writing with dynamic, language model-driven interaction to extract and standardize structured clinical data. Our framework automates cohort selection, feature extraction, and code mapping through interactive querying of databases. Our modular agents iteratively observe query results and reason over schema and documentation, using SQL not just for data retrieval but also as a tool for database observation and decision making. This eliminates the need for hand-crafted, schema-specific logic. To enable rigorous evaluation, we develop a benchmarking codebase for three EMR databases (MIMIC-III, eICU, SICdb), including both seen and unseen schema settings. Our results demonstrate strong performance and generalization across these databases, highlighting the feasibility of automating a process previously thought to require expert-driven design. The code will be released publicly at this https URL. For a demonstration, please visit our anonymous demo page: this https URL
zh

[AI-63] Architectural Transformations and Emerging Verification Demands in AI-Enabled Cyber-Physical Systems

【速读】:该论文旨在解决人工智能(AI)集成对网络物理系统(CPS)架构、运行复杂性及验证实践影响的认知缺口问题。其解决方案的关键在于通过对比分析在Simulink中设计的AI驱动控制模型与传统控制模型之间的架构差异,揭示AI引入后对系统验证策略的实质性影响,从而为提升CPS控制优化与可靠性提供理论依据和实践指导。

链接: https://arxiv.org/abs/2510.00519
作者: Hadiza Umar Yusuf,Khouloud Gaaloul
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the world of Cyber-Physical Systems (CPS), a captivating real-time fusion occurs where digital technology meets the physical world. This synergy has been significantly transformed by the integration of artificial intelligence (AI), a move that dramatically enhances system adaptability and introduces a layer of complexity that impacts CPS control optimization and reliability. Despite advancements in AI integration, a significant gap remains in understanding how this shift affects CPS architecture, operational complexity, and verification practices. The extended abstract addresses this gap by investigating architectural distinctions between AI-driven and traditional control models designed in Simulink and their respective implications for system verification.
zh

[AI-64] Exploring System 1 and 2 communication for latent reasoning in LLMs

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中推理过程的架构设计问题,即推理能力应部署在独立模块中还是嵌入到单一模型的前向传播与表征空间内。其核心解决方案是探索双架构隐式推理(dual-architecture latent reasoning),其中基础模型(Base)通过隐式消息与协处理器(Coprocessor)进行交互,并验证两个假设:(H1) 增加通道容量以提升隐式通信效率;(H2) 通过联合微调学习更有效的通信机制。研究发现,H2在多个基准测试中表现最优,而H1仅带来小幅改进;更重要的是,一个统一的软嵌入基线(soft-embedding baseline)——即单模型结构共享前向路径和表示空间——在相同隐式令牌预算下几乎媲美H2且显著优于H1,表明当前双模型设计主要增加计算开销而非质变性提升推理能力。该结果暗示,若要实现有效隐式推理,需引入能显式塑造潜在空间以支持算法规划的目标函数与通信机制。

链接: https://arxiv.org/abs/2510.00494
作者: Julian Coda-Forno,Zhuokai Zhao,Qiang Zhang,Dipesh Tamboli,Weiwei Li,Xiangjun Fan,Lizhu Zhang,Eric Schulz,Hsiao-Ping Tseng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Should LLM reasoning live in a separate module, or within a single model’s forward pass and representational space? We study dual-architecture latent reasoning, where a fluent Base exchanges latent messages with a Coprocessor, and test two hypotheses aimed at improving latent communication over Liu et al. (2024): (H1) increase channel capacity; (H2) learn communication via joint finetuning. Under matched latent-token budgets on GPT-2 and Qwen-3, H2 is consistently strongest while H1 yields modest gains. A unified soft-embedding baseline, a single model with the same forward pass and shared representations, using the same latent-token budget, nearly matches H2 and surpasses H1, suggesting current dual designs mostly add compute rather than qualitatively improving reasoning. Across GSM8K, ProsQA, and a Countdown stress test with increasing branching factor, scaling the latent-token budget beyond small values fails to improve robustness. Latent analyses show overlapping subspaces with limited specialization, consistent with weak reasoning gains. We conclude dual-model latent reasoning remains promising in principle, but likely requires objectives and communication mechanisms that explicitly shape latent spaces for algorithmic planning.
zh

[AI-65] Rethinking Reward Models for Multi-Domain Test-Time Scaling

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在测试时扩展(test-time scaling)过程中,如何更可靠地评估其推理质量的问题。传统方法依赖于过程奖励模型(Process Reward Models, PRMs)对每一步中间推理进行评分,认为其优于仅评估最终答案的结果奖励模型(Outcome Reward Models, ORMs),但这一假设主要基于数学相关领域的小范围实验。本文通过首次在14个多样化领域的统一评估中比较四种奖励模型变体(判别式与生成式 ORM 和 PRM),发现:(i)判别式 ORM(DisORM)性能与判别式 PRM(DisPRM)相当;(ii)生成式 PRM(GenPRM)表现不佳;(iii)生成式 ORM(GenORM)最为鲁棒,在所有领域均取得显著且一致的提升。关键原因是 PRM 式逐步评分易受 LLM 自动生成标签噪声的影响,并难以处理长推理轨迹(包括自我修正推理),而理论分析和实证结果均表明,步骤聚合会随推理长度增长放大误差。因此,研究支持采用生成式结果验证(generative outcome verification)作为多领域部署下的更优方案。

链接: https://arxiv.org/abs/2510.00492
作者: Dong Bok Lee,Seanie Lee,Sangwoo Park,Minki Kang,Jinheon Baek,Dongki Kim,Dominik Wagner,Jiongdao Jin,Heejun Lee,Tobias Bocklet,Jinyu Wang,Jingjing Fu,Sung Ju Hwang,Jiang Bia,Lei Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs) that assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at this https URL to facilitate future research in multi-domain settings.
zh
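
摘要中"逐步聚合会随推理长度增长放大误差"的论断,可以用一个极简模拟粗略体会(纯演示假设:各步判断相互独立、以固定概率出错,PRM 式评分要求每一步都通过):

```python
import random

def stepwise_accept(num_steps: int, per_step_error: float = 0.05) -> bool:
    """A trajectory passes PRM-style scoring only if every step is judged
    correct; each judgment independently flips with probability per_step_error."""
    return all(random.random() > per_step_error for _ in range(num_steps))

random.seed(0)
for n in (1, 5, 20, 50):
    rate = sum(stepwise_accept(n) for _ in range(10_000)) / 10_000
    print(f"steps={n:>3}  acceptance of a fully correct trajectory: {rate:.3f}")
# Acceptance decays roughly as (1 - per_step_error)^n, so longer reasoning
# chains are increasingly penalized by label noise alone.
```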

[AI-66] From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment

【速读】:该论文旨在解决真实世界机器人在学习多样化操作技能时面临的瓶颈问题,即依赖昂贵且难以扩展的遥操作示范,同时探索如何有效从人类视频中迁移操纵知识。由于人类与机器人在形态上的显著差异(morphological gap),直接转移技能存在困难。解决方案的关键在于提出Traj2Action框架,其核心是将操作末端的3D轨迹作为统一的中间表示(intermediate representation),通过融合人类和机器人数据学习粗粒度轨迹以形成高层运动规划,并在此基础上利用协同去噪(co-denoising)框架生成针对特定机器人的精确动作(如末端姿态和夹爪状态),从而实现跨形态的知识迁移。

链接: https://arxiv.org/abs/2510.00491
作者: Han Zhou,Jinjin Cao,Liyuan Ma,Xueji Fang,Guo-jun Qi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action boosts the performance by up to 27% and 22.25% over the π₀ baseline on short- and long-horizon real-world tasks, and achieves significant gains as human data scales in robot policy learning. Our project website, featuring code and video demonstrations, is available at this https URL.
zh

[AI-67] Black-Box Time-Series Domain Adaptation via Cross-Prompt Foundation Models

【速读】:该论文旨在解决黑盒域适应(Black-box Domain Adaptation, BBDA)在时间序列数据上的应用难题,特别是现有方法多聚焦于视觉任务而无法有效处理具有独特时空特性的时序数据,且尚未利用基础模型(foundation model)在黑盒时序域适应(Black-box Time-series Domain Adaptation, BBTSDA)中的潜力。解决方案的关键在于提出一种跨提示基础模型(Cross-Prompt Foundation Model, CPFM),其采用双分支网络结构,每个分支配备独立提示(prompt)以捕捉不同数据分布特征,并在域适应阶段引入提示层与输入层的重构学习机制,从而充分利用时间序列基础模型对时空动态性的建模能力,显著提升跨域适应性能。

链接: https://arxiv.org/abs/2510.00487
作者: M. T. Furqon,Mahardhika Pratama,Igor Skrjanc,Lin Liu,Habibullah Habibullah,Kutluyil Dogancay
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The black-box domain adaptation (BBDA) topic is developed to address the privacy and security issues where only an application programming interface (API) of the source model is available for domain adaptations. Although the BBDA topic has attracted growing research attention, existing works mostly target vision applications and are not directly applicable to time-series applications possessing unique spatio-temporal characteristics. In addition, none of the existing approaches has explored the strength of foundation models for black-box time-series domain adaptation (BBTSDA). This paper proposes a concept of Cross-Prompt Foundation Model (CPFM) for the BBTSDA problem. CPFM is constructed under a dual-branch network structure where each branch is equipped with a unique prompt to capture different characteristics of data distributions. In the domain adaptation phase, reconstruction learning at both the prompt and input levels is developed. All of these are built upon a time-series foundation model to overcome the spatio-temporal dynamics. Our rigorous experiments substantiate the advantage of CPFM, achieving improved results with noticeable margins over its competitors on three time-series datasets of different application domains.
zh

[AI-68] PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation

【速读】:该论文旨在解决当前多模态(文本与音频)生成模型评估中缺乏针对开放式长时内容生成能力的系统性评价框架的问题,尤其在无标准参考答案、无统一评估指标及主观判断不可控等挑战下,难以客观衡量生成质量。其解决方案的关键在于提出PodEval——一个面向播客类音频生成任务的综合性开源评估框架:首先构建覆盖多样化主题的真实世界播客数据集作为人类创造力水平的基准;其次设计多模态分解策略,将复杂任务细分为文本、语音和音频三个维度,并区分“内容”与“格式”的不同评估重点;最后为每个模态分别设计包含客观指标与主观听觉测试的评估方法,从而实现对生成质量的全面、可控且可复现的量化分析。

链接: https://arxiv.org/abs/2510.00485
作者: Yujia Xiao,Liumeng Xue,Lei He,Xinyi Chen,Aemon Yat Fei Chiu,Wenjie Tian,Shaofei Zhang,Qiuqiang Kong,Xinfa Zhu,Wei Xue,Tan Lee
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recently, an increasing number of multimodal (text and audio) benchmarks have emerged, primarily focusing on evaluating models’ understanding capability. However, exploration into assessing generative capabilities remains limited, especially for open-ended long-form content generation. Significant challenges lie in no reference standard answer, no unified evaluation metrics and uncontrollable human judgments. In this work, we take podcast-like audio generation as a starting point and propose PodEval, a comprehensive and well-designed open-source evaluation framework. In this framework: 1) We construct a real-world podcast dataset spanning diverse topics, serving as a reference for human-level creative quality. 2) We introduce a multimodal evaluation strategy and decompose the complex task into three dimensions: text, speech and audio, with different evaluation emphasis on “Content” and “Format”. 3) For each modality, we design corresponding evaluation methods, involving both objective metrics and subjective listening test. We leverage representative podcast generation systems (including open-source, close-source, and human-made) in our experiments. The results offer in-depth analysis and insights into podcast generation, demonstrating the effectiveness of PodEval in evaluating open-ended long-form audio. This project is open-source to facilitate public use: this https URL.
zh

[AI-69] Make a Video Call with LLM : A Measurement Campaign over Five Mainstream Apps

【速读】:该论文旨在解决当前AI视频聊天系统(AI video chat)缺乏系统性性能评估的问题。现有研究尚未全面刻画主流AI视频聊天系统的实际表现,导致难以识别其瓶颈并指导优化。为此,作者提出了一套涵盖质量、延迟、内部机制和系统开销四个维度的综合性基准测试方案,通过自建测试环境对五款主流AI视频聊天机器人进行量化评估。该解决方案的关键在于设计了多维指标体系与可复现的测试床(testbed),从而为研究社区提供了真实场景下的性能基线,并揭示了不同系统架构中的独特瓶颈,为未来AI视频聊天机器人的优化指明方向。

链接: https://arxiv.org/abs/2510.00481
作者: Jiayang Xu,Xiangjie Huang,Zijie Li,Zili Meng
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Performance (cs.PF)
备注:

点击查看摘要

Abstract:In 2025, Large Language Model (LLM) services have launched a new feature – AI video chat – allowing users to interact with AI agents via real-time video communication (RTC), just like chatting with real people. Despite its significance, no systematic study has characterized the performance of existing AI video chat systems. To address this gap, this paper proposes a comprehensive benchmark with carefully designed metrics across four dimensions: quality, latency, internal mechanisms, and system overhead. Using custom testbeds, we further evaluate five mainstream AI video chatbots with this benchmark. This work provides the research community a baseline of real-world performance and identifies unique system bottlenecks. In the meantime, our benchmarking results also open up several research questions for future optimizations of AI video chatbots.
zh

[AI-70] Expandable Decision-Making States for Multi-Agent Deep Reinforcement Learning in Soccer Tactical Analysis

【速读】:该论文旨在解决足球等入侵类团队运动中,如何从数据中构建可解释且跨异构数据源鲁棒的球员级智能体模型的问题。传统基于规则的分析虽直观但缺乏灵活性,而现代机器学习模型常仅进行模式匹配且缺乏显式的代理(agent)表征,难以实现战术层面的解释性。解决方案的关键在于提出可扩展决策状态(Expandable Decision-Making States, EDMS),其通过在原始位置和速度特征基础上引入语义增强的关联变量(如空间得分、传球与得分潜力等),并结合动作掩码机制区分持球与非持球球员的决策集合,从而将学习到的价值函数和策略映射至人类可理解的战术概念(如盯人压力、传球线路、球权可达性),同时确保代理行为符合比赛规则。实验表明,EDMS显著降低了动作预测损失和时序差分(TD)误差,并通过Q值可视化揭示高风险高回报的战术模式(如快速反击和防守突破)。

链接: https://arxiv.org/abs/2510.00480
作者: Kenjiro Ide,Taiga Someya,Kohei Kawaguchi,Keisuke Fujii
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 9 figures

点击查看摘要

Abstract:Invasion team sports such as soccer produce a high-dimensional, strongly coupled state space as many players continuously interact on a shared field, challenging quantitative tactical analysis. Traditional rule-based analyses are intuitive, while modern predictive machine learning models often perform pattern-matching without explicit agent representations. The problem we address is how to build player-level agent models from data, whose learned values and policies are both tactically interpretable and robust across heterogeneous data sources. Here, we propose Expandable Decision-Making States (EDMS), a semantically enriched state representation that augments raw positions and velocities with relational variables (e.g., scoring of space, pass, and score), combined with an action-masking scheme that gives on-ball and off-ball agents distinct decision sets. Compared to prior work, EDMS maps learned value functions and action policies to human-interpretable tactical concepts (e.g., marking pressure, passing lanes, ball accessibility) instead of raw coordinate features, and aligns agent choices with the rules of play. In the experiments, EDMS with action masking consistently reduced both action-prediction loss and temporal-difference (TD) error compared to the baseline. Qualitative case studies and Q-value visualizations further indicate that EDMS highlights high-risk, high-reward tactical patterns (e.g., fast counterattacks and defensive breakthroughs). We also integrated our approach into an open-source library and demonstrated compatibility with multiple commercial and open datasets, enabling cross-provider evaluation and reproducible experiments.
zh
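
EDMS 中针对持球/非持球球员的动作掩码机制,本质上是在决策前屏蔽非法动作。下面是一个通用的最小示意(动作集合与掩码内容均为演示假设,非论文原始实现):

```python
import numpy as np

def masked_policy(q_values: np.ndarray, legal: np.ndarray) -> int:
    """Pick the best action among legal ones; illegal actions get -inf."""
    masked = np.where(legal, q_values, -np.inf)
    return int(np.argmax(masked))

actions = ["pass", "shoot", "dribble", "move", "mark"]
q = np.array([0.3, 0.9, 0.5, 0.2, 0.4])

# Distinct decision sets for on-ball and off-ball agents (illustrative).
on_ball_mask = np.array([True, True, True, True, False])
off_ball_mask = np.array([False, False, False, True, True])
print(actions[masked_policy(q, on_ball_mask)])   # shoot
print(actions[masked_policy(q, off_ball_mask)])  # mark
```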

[AI-71] Analyzing Latent Concepts in Code Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码训练场景下内部行为难以解释的问题,尤其针对需要高可信度、透明性和语义鲁棒性的应用场景。其核心挑战在于揭示模型表示空间中隐含的词汇、语法和语义结构,并实现可扩展的标注与分析。解决方案的关键是提出Code Concept Analysis (CoCoA)——一种全局后验可解释性框架,通过聚类上下文感知的token嵌入来识别人类可理解的概念群组,并设计了一种混合注释流程:结合静态分析工具进行语法对齐与提示工程驱动的大语言模型(LLM)协同标注,从而实现跨抽象层级的潜在概念标注。该方法不仅揭示了概念在不同层和微调任务中的分布规律,还进一步融合局部归因方法生成基于概念的解释,显著提升了token级显著性图的连贯性和可解释性。

链接: https://arxiv.org/abs/2510.00476
作者: Arushi Sharma,Vedant Pungliya,Christopher J. Quinn,Ali Jannesari
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Interpreting the internal behavior of large language models trained on code remains a critical challenge, particularly for applications demanding trust, transparency, and semantic robustness. We propose Code Concept Analysis (CoCoA): a global post-hoc interpretability framework that uncovers emergent lexical, syntactic, and semantic structures in a code language model's representation space by clustering contextualized token embeddings into human-interpretable concept groups. We propose a hybrid annotation pipeline that combines static analysis tool-based syntactic alignment with prompt-engineered large language models (LLMs), enabling scalable labeling of latent concepts across abstraction levels. We analyse the distribution of concepts across layers and across three finetuning tasks. Emergent concept clusters can help identify unexpected latent interactions and be used to identify trends and biases within the model's learned representations. We further integrate CoCoA with local attribution methods to produce concept-grounded explanations, improving the coherence and interpretability of token-level saliency. Empirical evaluations across multiple models and tasks show that CoCoA discovers concepts that remain stable under semantic-preserving perturbations (average Cluster Sensitivity Index, CSI = 0.288) and evolve predictably with fine-tuning. In a user study on the programming-language classification task, concept-augmented explanations disambiguated token roles and improved human-centric explainability by 37 percentage points compared with token-level attributions using Integrated Gradients.
zh

[AI-72] Feature Identification via the Empirical NTK

【速读】:该论文旨在解决如何从训练后的神经网络中有效识别和定位其内部使用的特征表示问题,尤其是在小规模模型中实现对特征学习机制的可解释性分析。解决方案的关键在于利用经验神经切空间核(empirical neural tangent kernel, eNTK)的谱分析方法:通过计算eNTK的特征值分解,发现其谱峭壁(spectral cliffs)对应的主特征空间与真实特征高度对齐,从而揭示模型在不同任务(如超位置叠加的Toy Models of Superposition和模加法任务)中所学习的特征结构;此外,分层eNTK分析还能将特征局部化到特定网络层,并通过eNTK特征谱演化捕捉“grokking”相变现象,为特征发现和模型状态诊断提供了一种实用且可操作的工具。

链接: https://arxiv.org/abs/2510.00468
作者: Jennifer Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. Across two standard toy models for mechanistic interpretability, Toy Models of Superposition (TMS) and a 1-layer MLP trained on modular addition, we find that the eNTK exhibits sharp spectral cliffs whose top eigenspaces align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers and that the evolution of the eNTK eigenspectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.
zh
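
经验 NTK 的特征分析流程本身很直接:对每个样本求输出关于全部参数的梯度,堆叠成雅可比矩阵 J,则 eNTK 即 Gram 矩阵 J Jᵀ。下面是一个最小示意(模型与数据均为演示假设):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))
xs = torch.randn(20, 10)

def flat_grad(x):
    """Per-example gradient of the scalar output w.r.t. all parameters."""
    out = model(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

J = torch.stack([flat_grad(x) for x in xs])   # (n_examples, n_params)
entk = J @ J.T                                # empirical NTK Gram matrix
eigvals = torch.linalg.eigvalsh(entk).flip(0) # descending order
print(eigvals[:5])  # a sharp drop-off ("spectral cliff") separates the
                    # dominant eigenspace from the bulk
```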

[AI-73] Integrating Offline Pre-Training with Online Fine-Tuning: A Reinforcement Learning Approach for Robot Social Navigation

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)在机器人社交导航中面临的两大挑战:一是行人行为的固有不确定性导致探索不足,二是训练与部署阶段环境分布差异(distributional shift)影响策略性能。解决方案的关键在于提出一种基于Return-to-Go (RTG)预测的因果Transformer架构,并结合时空融合模型实时估计RTG值,从而将离线策略训练与在线环境交互对齐;同时引入混合离线-在线经验采样机制,稳定微调过程中的策略更新,实现预训练知识与实时适应性的平衡。

链接: https://arxiv.org/abs/2510.00466
作者: Run Su,Hao Fu,Shuai Zhou,Yingao Fu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) has emerged as a promising framework for addressing robot social navigation challenges. However, inherent uncertainties in pedestrian behavior and limited environmental interaction during training often lead to suboptimal exploration and distributional shifts between offline training and online deployment. To overcome these limitations, this paper proposes a novel offline-to-online fine-tuning RL algorithm for robot social navigation by integrating Return-to-Go (RTG) prediction into a causal Transformer architecture. Our algorithm features a spatiotemporal fusion model designed to precisely estimate RTG values in real-time by jointly encoding temporal pedestrian motion patterns and spatial crowd dynamics. This RTG prediction framework mitigates distribution shift by aligning offline policy training with online environmental interactions. Furthermore, a hybrid offline-online experience sampling mechanism is built to stabilize policy updates during fine-tuning, ensuring balanced integration of pre-trained knowledge and real-time adaptation. Extensive experiments in simulated social navigation environments demonstrate that our method achieves a higher success rate and lower collision rate compared to state-of-the-art baselines. These results underscore the efficacy of our algorithm in enhancing navigation policy robustness and adaptability. This work paves the way for more reliable and adaptive robotic navigation systems in real-world applications.
zh

[AI-74] TimeEmb: A Lightweight Static-Dynamic Disentanglement Framework for Time Series Forecasting

【速读】:该论文旨在解决时间序列预测中因分布随时间变化(即时间非平稳性,Temporal Non-stationarity)导致的性能下降问题。现有方法常将时间不变成分(Time-invariant Component)与时间可变成分(Time-varying Component)混杂学习,无法有效分离长期稳定模式与短期波动,从而在分布偏移场景下表现欠佳。解决方案的关键在于提出一个轻量级的静态-动态解耦框架 TimeEmb,其核心创新为:(1) 通过新型全局嵌入模块(Global Embedding Module)提取跨时间序列的持久性表示以建模时间不变成分;(2) 借鉴信号处理中的全谱分析思想,设计高效频域滤波机制专门处理时间可变成分,实现静态与动态特征的显式分离,从而提升模型对时间非平稳性的鲁棒性与预测精度。

链接: https://arxiv.org/abs/2510.00461
作者: Mingyuan Xia,Chunxu Zhang,Zijian Zhang,Hao Miao,Qidong Liu,Yuanshao Zhu,Bo Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Temporal non-stationarity, the phenomenon that time series distributions change over time, poses fundamental challenges to reliable time series forecasting. Intuitively, a complex time series can be decomposed into two factors, i.e., time-invariant and time-varying components, which indicate static and dynamic patterns, respectively. Nonetheless, existing methods often conflate the time-varying and time-invariant components, and jointly learn the combined long-term patterns and short-term fluctuations, leading to suboptimal performance facing distribution shifts. To address this issue, we propose a lightweight static-dynamic decomposition framework, TimeEmb, for time series forecasting. TimeEmb innovatively separates time series into two complementary components: (1) a time-invariant component, captured by a novel global embedding module that learns persistent representations across time series, and (2) a time-varying component, processed by an efficient frequency-domain filtering mechanism inspired by full-spectrum analysis in signal processing. Experiments on real-world datasets demonstrate that TimeEmb outperforms state-of-the-art baselines and requires fewer computational resources. We conduct comprehensive quantitative and qualitative analyses to verify the efficacy of static-dynamic disentanglement. This lightweight framework can also improve existing time-series forecasting methods with simple integration. To ease reproducibility, the code is available at this https URL.
zh
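
按摘要描述,TimeEmb 的静态-动态解耦可以写成"可学习全局嵌入 + 频域可学习滤波"的组合。下面是一个最小示意(非官方实现;滤波器的参数化方式为演示假设):

```python
import torch

class StaticDynamicDecomp(torch.nn.Module):
    """Sketch of a TimeEmb-style split: a learned global embedding carries
    the time-invariant component; a learnable frequency-domain filter
    processes the time-varying residual."""
    def __init__(self, seq_len: int):
        super().__init__()
        self.global_emb = torch.nn.Parameter(torch.zeros(seq_len))
        self.filter = torch.nn.Parameter(torch.ones(seq_len // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len)
        static = self.global_emb                    # time-invariant part
        dynamic = x - static                        # time-varying residual
        spec = torch.fft.rfft(dynamic, dim=-1)
        filtered = torch.fft.irfft(spec * self.filter, n=x.shape[-1], dim=-1)
        return static + filtered

m = StaticDynamicDecomp(seq_len=96)
print(m(torch.randn(8, 96)).shape)  # torch.Size([8, 96])
```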

[AI-75] UrbanGraph: Physics-Informed Spatio-Temporal Dynamic Heterogeneous Graphs for Urban Microclimate Prediction

【速读】:该论文旨在解决城市微气候预测中现有生成式和同质图方法在物理一致性、空间依赖性和时间变异性建模方面的不足。其关键解决方案是提出UrbanGraph框架,该框架融合异质性与动态时空图结构,并嵌入植被蒸散、遮蔽效应和对流扩散等关键物理过程,从而更准确地刻画城市多元实体间的复杂空间关系及其随时间演变的特性。

链接: https://arxiv.org/abs/2510.00457
作者: Weilin Xin,Chenyu Huang,Peilin Li,Jing Zhong,Jiawei Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:With rapid urbanization, predicting urban microclimates has become critical, as it affects building energy demand and public health risks. However, existing generative and homogeneous graph approaches fall short in capturing physical consistency, spatial dependencies, and temporal variability. To address this, we introduce UrbanGraph, a physics-informed framework integrating heterogeneous and dynamic spatio-temporal graphs. It encodes key physical processes – vegetation evapotranspiration, shading, and convective diffusion – while modeling complex spatial dependencies among diverse urban entities and their temporal evolution. We evaluate UrbanGraph on UMC4/12, a physics-based simulation dataset covering diverse urban configurations and climates. Results show that UrbanGraph improves R² by up to 10.8% and reduces FLOPs by 17.0% over all baselines, with heterogeneous and dynamic graphs contributing 3.5% and 7.1% gains. Our dataset provides the first high-resolution benchmark for spatio-temporal microclimate modeling, and our method extends to broader urban heterogeneous dynamic computing tasks.
zh

[AI-76] Cloud Investigation Automation Framework (CIAF): An AI-Driven Approach to Cloud Forensics

【速读】:该论文旨在解决云取证调查中依赖人工分析导致效率低下且易出错的问题。其核心解决方案是提出一种基于本体的云取证自动化框架(Cloud Investigation Automation Framework, CIAF),该框架通过语义验证标准化用户输入,消除歧义并确保日志解释的一致性,从而提升数据质量和决策可靠性。CIAF的关键创新在于结合确定性提示工程(deterministic prompt engineering)与本体驱动的验证机制,实现了对云日志的系统化、自动化分析,并在模拟勒索软件攻击场景下验证了其有效性,达到了93%的精确率、召回率和F1分数,展现出在多样化网络攻击场景下的可扩展性和鲁棒性。

链接: https://arxiv.org/abs/2510.00452
作者: Dalal Alharthi,Ivan Roberto Kawaminami Garcia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have gained prominence in domains including cloud security and forensics. Yet cloud forensic investigations still rely on manual analysis, making them time-consuming and error-prone. LLMs can mimic human reasoning, offering a pathway to automating cloud log analysis. To address this, we introduce the Cloud Investigation Automation Framework (CIAF), an ontology-driven framework that systematically investigates cloud forensic logs while improving efficiency and accuracy. CIAF standardizes user inputs through semantic validation, eliminating ambiguity and ensuring consistency in log interpretation. This not only enhances data quality but also provides investigators with reliable, standardized information for decision-making. To evaluate security and performance, we analyzed Microsoft Azure logs containing ransomware-related events. By simulating attacks and assessing CIAF’s impact, results showed significant improvement in ransomware detection, achieving precision, recall, and F1 scores of 93 percent. CIAF’s modular, adaptable design extends beyond ransomware, making it a robust solution for diverse cyberattacks. By laying the foundation for standardized forensic methodologies and informing future AI-driven automation, this work underscores the role of deterministic prompt engineering and ontology-based validation in enhancing cloud forensic investigations. These advancements improve cloud security while paving the way for efficient, automated forensic workflows.
zh

[AI-77] A Call to Action for a Secure-by-Design Generative AI Paradigm

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中面临的提示注入(prompt injection)等对抗性攻击所带来的安全风险,同时提升其性能与可靠性。解决方案的关键在于提出PromptShield——一个基于本体(ontology)驱动的安全设计框架,通过语义验证标准化用户输入,消除歧义并阻断恶意操纵,从而实现确定性的提示交互。该方法不仅显著增强了模型对对抗攻击的防御能力,还在AWS云日志分析任务中实现了约94%的精确率、召回率和F1分数,验证了其在保障生成式AI(Generative AI)系统安全性与性能方面的有效性。

链接: https://arxiv.org/abs/2510.00451
作者: Dalal Alharthi,Ivan Roberto Kawaminami Garcia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language models have gained widespread prominence, yet their vulnerability to prompt injection and other adversarial attacks remains a critical concern. This paper argues for a security-by-design AI paradigm that proactively mitigates LLM vulnerabilities while enhancing performance. To achieve this, we introduce PromptShield, an ontology-driven framework that ensures deterministic and secure prompt interactions. It standardizes user inputs through semantic validation, eliminating ambiguity and mitigating adversarial manipulation. To assess PromptShield’s security and performance capabilities, we conducted an experiment on an agent-based system to analyze cloud logs within Amazon Web Services (AWS), containing 493 distinct events related to malicious activities and anomalies. By simulating prompt injection attacks and assessing the impact of deploying PromptShield, our results demonstrate a significant improvement in model security and performance, achieving precision, recall, and F1 scores of approximately 94%. Notably, the ontology-based framework not only mitigates adversarial threats but also enhances the overall performance and reliability of the system. Furthermore, PromptShield’s modular and adaptable design ensures its applicability beyond cloud security, making it a robust solution for safeguarding generative AI applications across various domains. By laying the groundwork for AI safety standards and informing future policy development, this work stimulates a crucial dialogue on the pivotal role of deterministic prompt engineering and ontology-based validation in ensuring the safe and responsible deployment of LLMs in high-stakes environments.
zh

[AI-78] Automated Structured Radiology Report Generation with Rich Clinical Context

【速读】:该论文旨在解决现有自动结构化放射学报告生成(Structured Radiology Report Generation, SRRG)系统忽视临床上下文信息的问题,这一缺陷导致报告中出现时间错位的幻觉(temporal hallucinations),即错误引用不存在的临床背景。解决方案的关键在于提出了一种情境化SRRG(Contextualized SRRG, C-SRRG)方法,通过整合多维度临床上下文数据——包括多视角X光图像、临床指征、成像技术参数及基于患者病史的既往影像比较——构建了一个结构化的C-SRRG数据集,并在先进多模态大语言模型上进行验证,显著提升了报告生成的质量与临床一致性。

链接: https://arxiv.org/abs/2510.00428
作者: Seongjae Kang,Dong Bok Lee,Juho Jung,Dongseop Kim,Won Hwa Kim,Sunghoon Joo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages, 30 figures, preprint

点击查看摘要

Abstract:Automated structured radiology report generation (SRRG) from chest X-ray images offers significant potential to reduce the workload of radiologists by generating reports in structured formats that ensure clarity, consistency, and adherence to clinical reporting standards. While radiologists effectively utilize available clinical contexts in their diagnostic reasoning, existing SRRG systems overlook these essential elements. This fundamental gap leads to critical problems including temporal hallucinations when referencing non-existent clinical contexts. To address these limitations, we propose contextualized SRRG (C-SRRG) that comprehensively incorporates rich clinical context for SRRG. We curate the C-SRRG dataset by integrating comprehensive clinical context encompassing 1) multi-view X-ray images, 2) clinical indication, 3) imaging techniques, and 4) prior studies with corresponding comparisons based on patient histories. Through extensive benchmarking with state-of-the-art multimodal large language models, we demonstrate that incorporating clinical context with the proposed C-SRRG significantly improves report generation quality. We publicly release the dataset, code, and checkpoints to facilitate future research for clinically-aligned automated RRG at this https URL.
zh

[AI-79] owards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm

【速读】:该论文旨在解决当前代理评估基准(agent benchmark)因新开发代理能力迅速逼近上限而导致的评估难度下降问题,即现有基准难以持续衡量代理的进阶能力。解决方案的关键在于提出Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) 框架,其核心机制是通过代理对原始任务进行自由探索与演化生成更高复杂度的新任务,并记录可验证、可复现的执行轨迹(execution trajectories),从而实现动态、可持续的任务复杂度提升和评估可靠性增强。

链接: https://arxiv.org/abs/2510.00415
作者: Dadi Guo,Tianyi Zhou,Dongrui Liu,Chen Qian,Qihan Ren,Shuai Shao,Zhiyuan Fan,Yi R. Fung,Kun Wang,Linfeng Zhang,Jing Shao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is a work in progress due to methodology refinement and further evaluation

点击查看摘要

Abstract:Recent advances in large language models (LLMs) and agent system designs have empowered agents with unprecedented levels of capability. However, existing agent benchmarks are showing a trend of rapid ceiling-hitting by newly developed agents, making it difficult to meet the demands for evaluating agent abilities. To address this problem, we propose the Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) framework. This framework takes an original task from an existing benchmark and encourages agents to freely explore and evolve it into a new task with higher difficulty while recording validatable agent trajectories. The framework proceeds in three stages: (1) evolutionary proposal mining, which provides task evolution proposals through preliminary exploration and divergent thinking; (2) problem formation and free exploration, where proposals are conceptualized into feasible problem candidates and the agents then explore them freely while recording their execution trajectories; and (3) multi-level validation, which ensures that the evolved tasks are accompanied by validatable and reproducible trajectories. Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently enhances task complexity while improving the reliability of correctness through validatable execution trajectories. This work marks a paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing a sustainable and challenging runway for agent development.
zh

[AI-80] Physics-Informed Neural Controlled Differential Equations for Scalable Long Horizon Multi-Agent Motion Forecasting

【速读】:该论文旨在解决多自主机器人系统中长期轨迹预测的挑战,包括非线性智能体交互、误差累积以及动态系统的连续时间演化等问题。其核心解决方案是提出一种基于神经控制微分方程(Neural Controlled Differential Equations, Neural CDEs)的物理信息模型——PINCoDE(Physics-Informed Neural Controlled Differential Equations),该方法在连续时间域内建模多机器人动力学,结合物理约束与目标条件(goal-conditioned),实现高精度的长期轨迹预测。关键创新在于利用神经CDE的连续时间特性,将物理先验知识嵌入模型参数学习过程,从而有效抑制误差累积并提升预测稳定性,同时通过无额外参数扩展策略支持从10到100个机器人的规模扩展,在1分钟预测时长下平均ADE低于0.5米,并在4分钟时长上通过课程学习训练使姿态预测误差降低2.7倍。

链接: https://arxiv.org/abs/2510.00401
作者: Shounak Sural,Charles Kekeh,Wenliang Liu,Federico Pecora,Mouhacine Benosman
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Long-horizon motion forecasting for multiple autonomous robots is challenging due to non-linear agent interactions, compounding prediction errors, and continuous-time evolution of dynamics. Learned dynamics of such a system can be useful in various applications such as travel time prediction, prediction-guided planning and generative simulation. In this work, we aim to develop an efficient trajectory forecasting model conditioned on multi-agent goals. Motivated by the recent success of physics-guided deep learning for partially known dynamical systems, we develop a model based on neural Controlled Differential Equations (CDEs) for long-horizon motion forecasting. Unlike discrete-time methods such as RNNs and transformers, neural CDEs operate in continuous time, allowing us to combine physics-informed constraints and biases to jointly model multi-robot dynamics. Our approach, named PINCoDE (Physics-Informed Neural Controlled Differential Equations), learns differential equation parameters that can be used to predict the trajectories of a multi-agent system starting from an initial condition. PINCoDE is conditioned on future goals and enforces physics constraints for robot motion over extended periods of time. We adopt a strategy that scales our model from 10 robots to 100 robots without the need for additional model parameters, while producing predictions with an average ADE below 0.5 m for a 1-minute horizon. Furthermore, progressive training with curriculum learning for our PINCoDE model results in a 2.7X reduction of forecasted pose error over 4 minute horizons compared to analytical models.
zh
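
神经受控微分方程(neural CDE)的核心是让隐状态沿控制路径 X 演化:dz = f_θ(z) dX。下面用显式欧拉离散给出一个最小示意(非论文实现;未包含物理约束项,网络结构为演示假设):

```python
import torch

class NeuralCDEField(torch.nn.Module):
    """Vector field f_theta: maps state z to a (hidden, input) matrix so that
    dz = f(z) dX along the control path X."""
    def __init__(self, hidden: int, inp: int):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(hidden, 64), torch.nn.Tanh(),
                                       torch.nn.Linear(64, hidden * inp))
        self.hidden, self.inp = hidden, inp

    def forward(self, z):
        return self.net(z).view(-1, self.hidden, self.inp)

def cde_rollout(field, z0, path):
    # path: (batch, steps, inp) control signal (e.g., goals and past states)
    z = z0
    for t in range(path.shape[1] - 1):
        dX = (path[:, t + 1] - path[:, t]).unsqueeze(-1)  # (batch, inp, 1)
        z = z + (field(z) @ dX).squeeze(-1)               # explicit Euler step
    return z

field = NeuralCDEField(hidden=16, inp=4)
z = cde_rollout(field, torch.zeros(8, 16), torch.randn(8, 50, 4))
print(z.shape)  # torch.Size([8, 16])
```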

[AI-81] SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing

【速读】:该论文旨在解决生成式 AI (Generative AI) 在符号化音乐生成任务中面临的推理延迟与音乐质量之间的权衡问题,特别是在多轨(multi-track)场景下,现有基于变压器(Transformer)的加速方法如字节对编码(Byte Pair Encoding, BPE)会显著降低性能。其解决方案的关键在于提出属性专用的键值头共享机制(Attribute-Specialized Key-Value Head Sharing, AS-KVHS),该机制针对音乐结构化的符号表示进行优化,在保持音乐质量几乎不变(客观评估仅下降约0.4%)的前提下,实现了约30%的推理速度提升,并在主观听觉测试中获得轻微改善。

链接: https://arxiv.org/abs/2510.00395
作者: Jiaye Tan,Haonan Luo,Linfeng Song,Shuaiqi Chen,Yishan Lyu,Zian Zhong,Roujia Wang,Daniel Jiang,Haoran Zhang,Jiaming Bai,Haoran Cheng,Q. Vera Liao,Hao-Wen Dong
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Low-latency symbolic music generation is essential for real-time improvisation and human-AI co-creation. Existing transformer-based models, however, face a trade-off between inference speed and musical quality. Traditional acceleration techniques such as embedding pooling significantly degrade quality, while recently proposed Byte Pair Encoding (BPE) methods - though effective on single-track piano data - suffer large performance drops in multi-track settings, as revealed by our analysis. We propose Attribute-Specialized Key-Value Head Sharing (AS-KVHS), adapted to music’s structured symbolic representation, achieving about 30% inference speedup with only a negligible (about 0.4%) quality drop in objective evaluations and slight improvements in subjective listening tests. Our main contributions are (1) the first systematic study of BPE’s generalizability in multi-track symbolic music, and (2) the introduction of AS-KVHS for low-latency symbolic music generation. Beyond these, we also release SAGE-Music, an open-source benchmark that matches or surpasses state-of-the-art models in generation quality.
zh
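
AS-KVHS 让多组查询头共享键值头以压缩 KV 缓存、降低推理延迟。下面以通用的"分组共享 KV 头"注意力作最小示意(按音乐属性分组的细节被抽象掉,头数与维度均为演示假设):

```python
import torch
import torch.nn.functional as F

def shared_kv_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
    """Attention in which groups of query heads share one key/value head
    (here 4 query heads per KV head), shrinking the KV cache and latency.
    The attribute-based grouping of AS-KVHS is abstracted away."""
    B, T, D = x.shape
    hd = D // n_q_heads
    q = (x @ wq).view(B, T, n_q_heads, hd).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # share each KV head across a group
    v = v.repeat_interleave(group, dim=1)
    att = F.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, D)

D = 64
x = torch.randn(2, 16, D)
wq = torch.randn(D, D)
wk = torch.randn(D, D // 4)  # KV projections are 4x smaller than Q
wv = torch.randn(D, D // 4)
print(shared_kv_attention(x, wq, wk, wv).shape)  # torch.Size([2, 16, 64])
```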

[AI-82] Train on Validation (ToV): Fast data selection with applications to fine-tuning

【速读】:该论文旨在解决小样本场景下模型微调(fine-tuning)过程中数据选择效率低的问题,即如何从有限的目标分布样本中高效筛选出对降低测试损失最有益的训练样本。现有方法通常将少量目标样本作为验证集,通过在验证集上进行推理来评估单个训练样本的增删效果,计算开销大且效率低。本文提出一种更简单快速的替代方案:反转传统训练与验证的角色——先在微调前对训练集进行推理,再在微调后对同一训练集重新推理,选取预测变化最大的样本作为最优选择。其核心洞察在于,那些在微调过程中预测变化显著的训练样本,往往能最大程度地减少目标分布上的测试损失(test loss),从而提升模型性能。实验表明,该方法在指令微调和命名实体识别任务中均优于当前最优的数据选择策略,并辅以理论分析支持其有效性。

链接: https://arxiv.org/abs/2510.00386
作者: Ayush Jain,Andrea Montanari,Eren Sasoglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:State-of-the-art machine learning often follows a two-stage process: (i) pre-training on large, general-purpose datasets; (ii) fine-tuning on task-specific data. In fine-tuning, selecting training examples that closely reflect the target distribution is crucial. However, it is often the case that only a few samples are available from the target distribution. Existing data selection methods treat these target samples as a validation set and estimate the effect of adding or removing a single sample from the training pool by performing inference on the validation set. We propose a simpler and faster alternative that inverts the usual role of train and validation: we perform inference on the training pool before and after fine-tuning on the validation set. We then select samples whose predictions change the most. Our key insight is that the training samples most affected by fine-tuning on a small validation set tend to be the most beneficial for reducing test loss on the target distribution. Experiments on instruction tuning and named entity recognition tasks show that, in most cases, our method achieves lower test log-loss than state-of-the-art approaches. We support our findings with theoretical analysis.
zh
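
ToV 的选择流程只有三步:微调前在训练池上推理、在验证集上微调、微调后再推理并按预测变化排序。下面用 sklearn 逻辑回归代替大模型给出一个最小示意(以"加入验证样本重新拟合"近似一次微调,属演示假设):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool, y_pool = rng.normal(size=(1000, 8)), rng.integers(0, 2, 1000)
X_val, y_val = rng.normal(size=(32, 8)), rng.integers(0, 2, 32)

base = LogisticRegression(max_iter=200).fit(X_pool, y_pool)
p_before = base.predict_proba(X_pool)[:, 1]   # inference on the pool, before

# "Fine-tune" on the small validation set: here, refit with the validation
# examples added (a stand-in for a gradient fine-tuning step).
tuned = LogisticRegression(max_iter=200).fit(
    np.vstack([X_pool, X_val]), np.concatenate([y_pool, y_val]))
p_after = tuned.predict_proba(X_pool)[:, 1]   # inference on the pool, after

# Select the pool examples whose predictions moved the most.
k = 100
selected = np.argsort(-np.abs(p_after - p_before))[:k]
print(selected[:10])
```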

[AI-83] Semantic-Driven AI Agent Communications: Challenges and Solutions

【速读】:该论文旨在解决人工智能代理(AI agent)在动态环境和资源受限条件下实现高效语义通信的问题,即如何在保证任务相关语义信息传输的同时,提升感知、决策与协作的实时性与鲁棒性。其解决方案的关键在于提出了一种语义驱动的AI代理通信框架,并开发了三项核心技术:一是语义自适应传输(semantic adaptation transmission),通过真实或生成样本微调模型以适应环境变化;二是语义轻量化传输(semantic lightweight transmission),结合剪枝、量化和感知感知采样降低模型复杂度,减轻边缘代理计算负担;三是语义自进化控制(semantic self-evolution control),采用分布式分层决策机制优化多维资源分配,从而增强多智能体在动态环境中的协同能力。仿真结果表明,该方案具有更快收敛速度和更强鲁棒性,尤其分布式分层优化方法显著优于传统决策策略,展现出在AI代理通信网络中的应用潜力。

链接: https://arxiv.org/abs/2510.00381
作者: Kaiwen Yu,Mengying Sun,Zhijin Qin,Xiaodong Xu,Ping Yang,Yue Xiao,Gang Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:With the rapid growth of intelligent services, communication targets are shifting from humans to artificial intelligent (AI) agents, which require new paradigms to enable real-time perception, decision-making, and collaboration. Semantic communication, which conveys task-relevant meaning rather than raw data, offers a promising solution. However, its practical deployment remains constrained by dynamic environments and limited resources. To address these issues, this article proposes a semantic-driven AI agent communication framework and develops three enabling techniques. First, semantic adaptation transmission applies fine-tuning with real or generative samples to efficiently adapt models to varying environments. Second, semantic lightweight transmission incorporates pruning, quantization, and perception-aware sampling to reduce model complexity and alleviate computational burden on edge agents. Third, semantic self-evolution control employs distributed hierarchical decision-making to optimize multi-dimensional resources, enabling robust multi-agent collaboration in dynamic environments. Simulation results show that the proposed solutions achieve faster convergence and stronger robustness, while the proposed distributed hierarchical optimization method significantly outperforms conventional decision-making schemes, highlighting its potential for AI agent communication networks.
zh

[AI-84] Combining Large Language Models and Gradient-Free Optimization for Automatic Control Policy Synthesis

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成符号控制策略时存在的效率低下问题,即LLM难以将策略的函数结构与参数值分离,导致搜索过程缓慢且样本效率低。其解决方案的关键在于提出一种混合方法,通过引入一个独立的数值优化层,实现结构合成与参数优化的解耦:LLM负责迭代探索程序的功能结构,而一个额外的局部参数优化循环则用于寻找与候选程序相匹配的局部最优参数集,从而显著提升任务回报和样本效率,同时保持策略的可解释性。

链接: https://arxiv.org/abs/2510.00373
作者: Carlo Bosio,Matteo Guarrera,Alberto Sangiovanni-Vincentelli,Mark W. Mueller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
备注: 8 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) have shown promise as generators of symbolic control policies, producing interpretable program-like representations through iterative search. However, these models are not capable of separating the functional structure of a policy from the numerical values it is parametrized by, thus making the search process slow and inefficient. We propose a hybrid approach that decouples structural synthesis from parameter optimization by introducing an additional optimization layer for local parameter search. In our method, the numerical parameters of LLM-generated programs are extracted and optimized numerically to maximize task performance. With this integration, an LLM iterates over the functional structure of programs, while a separate optimization loop is used to find a locally optimal set of parameters accompanying candidate programs. We evaluate our method on a set of control tasks, showing that it achieves higher returns and improved sample efficiency compared to purely LLM-guided search. We show that combining symbolic program synthesis with numerical optimization yields interpretable yet high-performing policies, bridging the gap between language-model-guided design and classical control tuning. Our code is available at this https URL.
zh
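
"结构由 LLM 迭代、参数由无梯度优化局部搜索"的解耦思想可用下例示意(演示假设:候选结构为一个带数值参数的 PD 控制器模板,内层用 scipy 的 Nelder–Mead 做无梯度搜索):

```python
import numpy as np
from scipy.optimize import minimize

# A candidate program structure (as an LLM might propose): a PD controller.
def make_policy(params):
    kp, kd = params
    return lambda err, derr: kp * err + kd * derr

def rollout_cost(params):
    """Simulate a damped point mass tracking zero; lower cost is better."""
    policy = make_policy(params)
    x, v, cost = 1.0, 0.0, 0.0
    for _ in range(200):
        u = policy(-x, -v)
        v += 0.05 * (u - 0.1 * v)
        x += 0.05 * v
        cost += x * x
    return cost

# Inner loop: gradient-free local search over the numeric parameters,
# while the outer LLM loop (not shown) would iterate over structures.
result = minimize(rollout_cost, x0=np.array([1.0, 0.1]), method="Nelder-Mead")
print(result.x, result.fun)
```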

[AI-85] Attribution Gradients: Incrementally Unfolding Citations for Critical Examination of Attributed AI Answers

【速读】:该论文旨在解决当前生成式 AI(Generative AI)问答系统中,尽管能够为回答提供来源引用(attribution),但用户难以验证这些引用内容真实性和准确性的难题。其核心解决方案是提出“引用梯度”(attribution gradients)机制,通过将答案句子分解为具体主张(claim),并自动挖掘支持或反驳该主张的源文本片段(excerpt),形成可点击的证据链路,从而实现答案、主张、引文片段与上下文之间的并发关联。这一机制显著增强了用户对来源内容的深入探索能力,并在可用性测试中促使用户更积极地查阅原始文献并进行更细致的修正。

链接: https://arxiv.org/abs/2510.00361
作者: Hita Kambhamettu,Alyssa Hwang,Philippe Laban,Andrew Head
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI question answering systems increasingly generate responses with attributions to sources. However, the task of verifying the actual content of these attributions is in most cases impractical. In this paper, we present attribution gradients as a solution. Attribution gradients provide integrated, incremental affordances for diving into an attributed passage. A user can decompose a sentence of an answer into its claims. For each claim, the user can view supporting and contradictory excerpts mined from sources. Those excerpts serve as clickable conduits into the source (in our application, scientific papers). When evidence itself contains more citations, the UI unpacks the evidence into excerpts from the cited sources. These features of attribution gradients facilitate concurrent interconnections among answer, claim, excerpt, and context. In a usability study, we observed greater engagement with sources and richer revision in a task where participants revised an attributed AI answer with attribution gradients and a baseline.
zh

[AI-86] DiSA-IQL: Offline Reinforcement Learning for Robust Soft Robot Control under Distribution Shifts

【速读】:该论文旨在解决软体蛇形机器人(soft snake robots)在复杂环境中控制时面临的高非线性动力学问题,尤其是现有基于模型和生物启发的控制器因简化假设而性能受限,以及深度强化学习(Deep Reinforcement Learning, DRL)在线训练成本高、风险大,而离线强化学习(Offline Reinforcement Learning, Offline RL)又因分布偏移(distribution shift)导致泛化能力下降的问题。解决方案的关键在于提出DiSA-IQL(Distribution-Shift-Aware Implicit Q-Learning),其通过引入对不可靠状态-动作对的惩罚机制,对隐式Q值学习(Implicit Q-Learning, IQL)进行改进,从而增强策略在未见场景下的鲁棒性,有效缓解分布偏移带来的性能衰减。

链接: https://arxiv.org/abs/2510.00358
作者: Linjin He,Xinda Qi,Dong Chen,Zhaojian Li,Xiaobo Tan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Soft snake robots offer remarkable flexibility and adaptability in complex environments, yet their control remains challenging due to highly nonlinear dynamics. Existing model-based and bio-inspired controllers rely on simplified assumptions that limit performance. Deep reinforcement learning (DRL) has recently emerged as a promising alternative, but online training is often impractical because of costly and potentially damaging real-world interactions. Offline RL provides a safer option by leveraging pre-collected datasets, but it suffers from distribution shift, which degrades generalization to unseen scenarios. To overcome this challenge, we propose DiSA-IQL (Distribution-Shift-Aware Implicit Q-Learning), an extension of IQL that incorporates robustness modulation by penalizing unreliable state-action pairs to mitigate distribution shift. We evaluate DiSA-IQL on goal-reaching tasks across two settings: in-distribution and out-of-distribution evaluation. Simulation results show that DiSA-IQL consistently outperforms baseline models, including Behavior Cloning (BC), Conservative Q-Learning (CQL), and vanilla IQL, achieving higher success rates, smoother trajectories, and improved robustness. The codes are open-sourced to support reproducibility and to facilitate further research in offline RL for soft robot control.
zh
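
DiSA-IQL 对"不可靠状态-动作对"的惩罚可以理解为在标准 IQL 期望分位数(expectile)损失上乘一个可靠性权重。下面是一个最小示意(非官方实现;用数据集密度估计作为可靠性代理,属演示假设):

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Standard IQL expectile term on diff = Q(s,a) - V(s)."""
    weight = torch.where(diff > 0, torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return weight * diff.pow(2)

def disa_value_loss(q_sa, v_s, density, penalty: float = 1.0):
    """Down-weight pairs judged unreliable (low estimated dataset density),
    mimicking a distribution-shift-aware penalty on top of vanilla IQL."""
    reliability = density / (density + penalty)
    return (reliability * expectile_loss(q_sa - v_s)).mean()

q_sa, v_s = torch.randn(256), torch.randn(256)
density = torch.rand(256)  # stand-in for a learned behavior-density model
print(disa_value_loss(q_sa, v_s, density))
```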

[AI-87] Hierarchical Reasoning Model: A Critical Supplementary Material

【速读】:该论文旨在解决Transformer模型在逻辑推理任务中表现不佳的问题,尽管其在自然语言处理等序列建模任务中表现出色。传统Transformer主要依赖于自回归的逐token预测机制,难以有效处理需要多步推理的复杂逻辑问题。论文指出,这一局限性并非源于模型架构的根本缺陷,而是由于尚未充分探索如潜在空间(latent space)和循环推理(recurrent reasoning)等更具创造性的使用方式。解决方案的关键在于引入一种新型的潜空间循环推理机制,具体体现为层次化推理模型(Hierarchical Reasoning Model),该模型通过在Transformer的潜在表示空间中执行迭代式推理过程,在多种二维逻辑推理任务(如Sudoku-Extreme和Maze-Hard)上显著提升了性能。

链接: https://arxiv.org/abs/2510.00355
作者: Renee Ge,Qianli Liao,Tomaso Poggio
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, Under review

点击查看摘要

Abstract:Transformers have demonstrated remarkable performance in natural language processing and related domains, as they largely focus on sequential, autoregressive next-token prediction tasks. Yet, they struggle in logical reasoning, not necessarily because of a fundamental limitation of these models, but possibly due to the lack of exploration of more creative uses, such as latent space and recurrent reasoning. An emerging exploration in this direction is the Hierarchical Reasoning Model (Wang et al., 2025), which introduces a novel type of recurrent reasoning in the latent space of transformers, achieving remarkable performance on a wide range of 2D reasoning tasks. Despite the promising results, this line of models is still at an early stage and calls for in-depth investigation. In this work, we perform a critical review on this class of models, examine key design choices and present intriguing variants that achieve significantly better performance on the Sudoku-Extreme and Maze-Hard tasks than previously reported. Our results also raise surprising observations and intriguing directions for further research.
zh

[AI-88] In-Context Curiosity: Distilling Exploration for Decision-Pretrained Transformers on Bandit Tasks

【速读】:该论文旨在解决决策预训练变换器(Decision-Pretrained Transformers, DPTs)在离线预训练过程中因数据分布局限而导致的泛化能力不足问题,特别是在测试环境与预训练数据分布不一致时性能显著下降的问题。解决方案的关键在于提出一种轻量级、基于探索的正则化方法——上下文好奇心(in-context curiosity),并构建预测驱动的变换器(Prediction-Powered Transformer, PPT)框架:该框架通过引入一个辅助奖励预测器,利用预测误差作为内在好奇心信号,在训练阶段激励模型进行更广泛的探索,从而提升其在分布外场景下的鲁棒性。实验表明,PPT在高方差奖励环境中的表现优于传统DPT,尤其当预训练数据多样性有限时效果更为明显。

链接: https://arxiv.org/abs/2510.00347
作者: Huitao Yang,Guanting Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to grow in capability, there is increasing interest in incorporating them into decision-making tasks. A common pipeline for this is Decision-Pretrained Transformers (DPTs). However, existing training methods for DPTs often struggle to generalize beyond their pretraining data distribution. To explore mitigation of this limitation, we propose in-context curiosity – a lightweight, exploration-inspired regularizer for offline pretraining – and introduce the Prediction-Powered Transformer (PPT) framework. PPT augments DPT with an auxiliary reward predictor, using prediction error as an intrinsic curiosity signal to encourage broader exploration during training. In proof-of-concept experiments on Gaussian multi-armed bandits, PPT shows improved robustness: it moderates the performance degradation observed in DPT when test environments exhibit higher variance in reward, particularly when pretraining data has limited diversity. While the quality of offline data remains fundamental, our preliminary results suggest that curiosity-driven pretraining offers a promising direction for enhancing out-of-distribution generalization in in-context RL agents.
zh
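
为直观理解“预测误差作为内在好奇心信号”的思路,下面给出一个高斯多臂老虎机上的极简 Python 示意(预测器形式、学习率与好奇心权重均为本文假设,并非论文的实际实现):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                 # 高斯多臂老虎机的臂数(假设值)
true_means = rng.normal(0, 1, K)      # 各臂的真实期望奖励

w = np.zeros(K)                       # 辅助奖励预测器(每臂一个标量预测值)
lr, beta = 0.1, 0.5                   # 预测器学习率与好奇心权重(假设值)

for t in range(200):
    arm = rng.integers(K)                      # 离线数据采集:随机选臂
    r = true_means[arm] + rng.normal(0, 1)     # 环境奖励
    pred_error = (r - w[arm]) ** 2             # 预测误差 = 内在好奇心信号
    shaped_r = r + beta * pred_error           # 训练时鼓励探索的合成奖励
    w[arm] += lr * (r - w[arm])                # 在线更新奖励预测器
```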

[AI-89] When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets ICLR2026

【速读】:该论文旨在解决当前人工智能模型在对抗性高风险环境中的评估盲区问题,即现有基准测试未能充分衡量模型在面对主动欺骗、不可逆错误和信息操纵时的鲁棒性。针对此问题,作者提出CAIA(Adversarial AI Benchmark),其核心解决方案是构建一个基于加密货币市场的动态任务框架,包含178个时间锚定任务,要求代理在碎片化信息环境中识别真相、做出不可逆财务决策,并承受持续的对抗压力。关键发现在于:即使使用前沿模型,若无工具支持,准确率仅28%;引入工具后性能提升至67.4%,仍显著低于人类基准(80%),且模型系统性地偏好不可靠的网络搜索而非权威数据源,暴露出基础性能力缺陷而非知识不足。此外,论文揭示Pass@k指标会掩盖自主部署中危险的试错行为,强调了将对抗鲁棒性作为可信AI自治必要条件的重要性。

链接: https://arxiv.org/abs/2510.00332
作者: Zeshi Dai,Zimo Peng,Zerui Cheng,Ryan Yihe Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 15 pages, 5 figures, 4 tables; In submission to ICLR 2026

点击查看摘要

Abstract:We present CAIA, a benchmark exposing a critical blind spot in AI evaluation: the inability of state-of-the-art models to operate in adversarial, high-stakes environments where misinformation is weaponized and errors are irreversible. While existing benchmarks measure task completion in controlled settings, real-world deployment demands resilience against active deception. Using crypto markets as a testbed where $30 billion was lost to exploits in 2024, we evaluate 17 models on 178 time-anchored tasks requiring agents to distinguish truth from manipulation, navigate fragmented information landscapes, and make irreversible financial decisions under adversarial pressure. Our results reveal a fundamental capability gap: without tools, even frontier models achieve only 28% accuracy on tasks junior analysts routinely handle. Tool augmentation improves performance but plateaus at 67.4% versus 80% human baseline, despite unlimited access to professional resources. Most critically, we uncover a systematic tool selection catastrophe: models preferentially choose unreliable web search over authoritative data, falling for SEO-optimized misinformation and social media manipulation. This behavior persists even when correct answers are directly accessible through specialized tools, suggesting foundational limitations rather than knowledge gaps. We also find that Pass@k metrics mask dangerous trial-and-error behavior for autonomous deployment. The implications extend beyond crypto to any domain with active adversaries, e.g. cybersecurity, content moderation, etc. We release CAIA with contamination controls and continuous updates, establishing adversarial robustness as a necessary condition for trustworthy AI autonomy. The benchmark reveals that current models, despite impressive reasoning scores, remain fundamentally unprepared for environments where intelligence must survive active opposition.
zh

[AI-90] Reasoning-Aware Prompt Orchestration: A Foundation Model for Multi-Agent Language Model Coordination

【速读】:该论文旨在解决多智能体系统中通过提示工程(prompt engineering)协调大语言模型(Large Language Models, LLMs)推理能力的挑战,特别是如何在多个专业化智能体之间保持逻辑一致性、实现推理感知的提示自适应以及支持可扩展的分布式推理协调。其解决方案的关键在于提出一个理论基础扎实的动态提示编排框架,通过将智能体状态形式化为提示模板、推理上下文向量和能力矩阵,并证明当步长满足 α < 1/(2L)(其中 L 为状态转移函数的利普希茨常数)时,系统能够收敛至稳定的协调模式。该框架采用分布式架构动态路由推理任务,同时维持语义一致性,在实验中显著提升了推理效率与逻辑一致性,验证了其在大规模多智能体协作中的有效性与理论可行性。

链接: https://arxiv.org/abs/2510.00326
作者: Hassen Dhrif
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of large language models has enabled sophisticated multi-agent systems, yet coordinating their reasoning capabilities through prompt engineering remains challenging. We present a theoretically-grounded framework for dynamic prompt orchestration that enhances reasoning across multiple specialized agents. This framework addresses three core challenges: logical consistency preservation during agent transitions, reasoning-aware prompt adaptation, and scalable coordination of distributed inference. Our approach formalizes agent states using prompt templates, reasoning context vectors, and capability matrices. We prove system convergence to stable coordination patterns when step sizes satisfy α < 1/(2L), where L is the Lipschitz constant of the state transition function. We implement this through a distributed architecture that dynamically routes reasoning tasks while maintaining semantic coherence. Experimental results on 1,000 synthetic multi-agent conversations demonstrate a 42% reduction in reasoning latency, a 23% improvement in logical consistency measured by ROUGE-L score, and an 89% success rate for task completion without context loss across agent transitions. Ablation studies identify the consensus mechanism as the primary performance driver, while revealing limitations: performance degrades beyond 10 agent transitions, and the system requires 76.5GB memory for 1,000 concurrent agents. These findings establish a new paradigm for scalable reasoning in multi-agent systems, providing theoretical foundations for understanding reasoning emergence across coordinated language models.
zh
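
摘要中的收敛条件 α < 1/(2L) 可以用一个最小的梯度迭代例子直观感受。下面以梯度 Lipschitz 常数为 L 的二次函数为示意(与论文的多智能体状态转移并非同一设定,仅展示步长条件两侧的行为差异):

```python
L = 4.0
grad = lambda x: L * x          # f(x) = (L/2)x² 的梯度,Lipschitz 常数为 L

def run(alpha, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return abs(x)

print(run(alpha=0.1))   # 0.1 < 1/(2L) = 0.125:收缩因子 |1-αL| = 0.6,稳定收敛
print(run(alpha=0.6))   # αL = 2.4 > 2:迭代发散
```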

[AI-91] A Framework for Selection of Machine Learning Algorithms Based on Performance Metrices and Akaike Information Criteria in Healthcare Telecommunication and Marketing Sector

【速读】:该论文旨在解决多领域(医疗、营销、电信)中机器学习(Machine Learning, ML)算法选择的复杂性问题,即如何在不同数据特征和应用场景下自动识别最优模型以平衡性能与模型复杂度。其解决方案的关键在于构建一个推荐框架,该框架依据输入数据属性、性能指标(如准确率、精确率、召回率)及AIC(Akaike Information Criterion)评分,对三种类型的ML算法(贪婪型、懒惰型和混合型)进行系统性筛选与评估,从而实现跨领域的高效、精准模型选择,提升实际应用中的效率与可靠性。

链接: https://arxiv.org/abs/2510.00321
作者: A. K. Hamisu(Abubakar Hamisu Kamagata),K. Jasleen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The exponential growth of internet-generated data has fueled advancements in artificial intelligence (AI), machine learning (ML), and deep learning (DL) for extracting actionable insights in marketing, telecom, and health sectors. This chapter explores ML applications across three domains, namely healthcare, marketing, and telecommunications, with a primary focus on developing a framework for optimal ML algorithm selection. In healthcare, the framework addresses critical challenges such as cardiovascular disease prediction (accounting for 28.1% of global deaths) and fetal health classification into healthy or unhealthy states, utilizing three datasets. ML algorithms are categorized into eager, lazy, and hybrid learners, selected based on dataset attributes, performance metrics (accuracy, precision, recall), and Akaike Information Criterion (AIC) scores. For validation, eight datasets from the three sectors are employed in the experiments. The key contribution is a recommendation framework that identifies the best ML model according to input attributes, balancing performance evaluation and model complexity to enhance efficiency and accuracy in diverse real-world applications. This approach bridges gaps in automated model selection, offering practical implications for interdisciplinary ML deployment.
zh
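
AIC 的计算公式为 AIC = 2k − 2·ln(L̂),其中 k 为参数个数、L̂ 为最大似然。下面用虚构的候选模型数值示意“性能指标 + AIC”联合选型的流程(模型、数值与准确率阈值均为假设):

```python
def aic(num_params: int, log_likelihood: float) -> float:
    """AIC = 2k - 2·ln(L̂):值越小,拟合与复杂度的折中越好"""
    return 2 * num_params - 2 * log_likelihood

candidates = {
    "logistic_regression": {"k": 20,  "loglik": -310.0, "accuracy": 0.86},
    "random_forest":       {"k": 500, "loglik": -250.0, "accuracy": 0.89},
    "naive_bayes":         {"k": 40,  "loglik": -335.0, "accuracy": 0.84},
}

for name, m in candidates.items():
    m["aic"] = aic(m["k"], m["loglik"])

# 联合规则:在准确率达标的模型中选 AIC 最小者(0.85 为假设阈值)
eligible = {n: m for n, m in candidates.items() if m["accuracy"] >= 0.85}
best = min(eligible, key=lambda n: eligible[n]["aic"])
print(best, eligible[best]["aic"])   # logistic_regression 660.0
```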

[AI-92] DecepChain: Inducing Deceptive Reasoning in Large Language Models

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在推理过程中可能遭受的隐蔽后门攻击问题,即攻击者可诱导大型语言模型(Large Language Models, LLMs)生成看似合理但最终得出错误结论的链式思维(Chain-of-Thought, CoT),从而在不留下明显篡改痕迹的情况下破坏模型可信度。解决方案的关键在于提出一种名为 DecepChain 的新型后门攻击范式:该方法利用 LLM 自身的幻觉特性,在模型自身生成的错误推理路径上进行微调,并通过组相对策略优化(Group Relative Policy Optimization, GRPO)对触发输入施加反转奖励信号以强化恶意行为,同时引入合理性正则化项确保生成的 CoT 保持流畅且表征为良性,从而实现高成功率与低误判率的隐蔽攻击。

链接: https://arxiv.org/abs/2510.00319
作者: Wei Shen,Han Wang,Haoyu Li,Huan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been demonstrating increasingly strong reasoning capability with their chain-of-thoughts (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we present an urgent but underexplored risk: attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first glance, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel backdoor attack paradigm that steers models to generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs’ own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high attack success rates with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones, underscoring our attack’s stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into this alarming risk. Project page: this https URL.
zh

[AI-93] MAVUL: Multi-Agent Vulnerability Detection via Contextual Reasoning and Interactive Refinement

【速读】:该论文旨在解决开源软件(OSS)中漏洞检测(Vulnerability Detection, VD)方法存在的三大局限性:上下文理解不足、单轮交互限制以及粗粒度评估,这些问题导致模型性能不佳和评估结果偏差。解决方案的关键在于提出一种名为MAVUL的新型多智能体漏洞检测系统,其核心创新包括:1)设计了一个漏洞分析代理(vulnerability analyst agent),通过工具使用能力和上下文推理实现跨过程代码理解与漏洞模式挖掘;2)借助跨角色智能体间的迭代反馈与精细化决策机制,提升漏洞推理的可靠性;3)引入多维真实标签信息用于细粒度评估,从而提高评估准确性与可信度。实验表明,MAVUL在成对漏洞识别准确率上比现有多智能体系统高出62%以上,比单智能体系统高出600%以上,且随着漏洞分析代理与安全架构代理之间通信轮次增加,系统性能显著提升,验证了上下文推理与反馈机制的重要性。

链接: https://arxiv.org/abs/2510.00317
作者: Youpeng Li,Kartik Joshi,Xinda Wang,Eric Wong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted by The 7th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (IEEE TPS 2025)

点击查看摘要

Abstract:The widespread adoption of open-source software (OSS) necessitates the mitigation of vulnerability risks. Most vulnerability detection (VD) methods are limited by inadequate contextual understanding, restrictive single-round interactions, and coarse-grained evaluations, resulting in undesired model performance and biased evaluation results. To address these challenges, we propose MAVUL, a novel multi-agent VD system that integrates contextual reasoning and interactive refinement. Specifically, a vulnerability analyst agent is designed to flexibly leverage tool-using capabilities and contextual reasoning to achieve cross-procedural code understanding and effectively mine vulnerability patterns. Through iterative feedback and refined decision-making within cross-role agent interactions, the system achieves reliable reasoning and vulnerability prediction. Furthermore, MAVUL introduces multi-dimensional ground truth information for fine-grained evaluation, thereby enhancing evaluation accuracy and reliability. Extensive experiments conducted on a pairwise vulnerability dataset demonstrate MAVUL’s superior performance. Our findings indicate that MAVUL significantly outperforms existing multi-agent systems with over 62% higher pairwise accuracy and single-agent systems with over 600% higher average performance. The system’s effectiveness is markedly improved with increased communication rounds between the vulnerability analyst agent and the security architect agent, underscoring the importance of contextual reasoning in tracing vulnerability flows and the crucial feedback role. Additionally, the integrated evaluation agent serves as a critical, unbiased judge, ensuring a more accurate and reliable estimation of the system’s real-world applicability by preventing misleading binary comparisons.
zh

[AI-94] Digital Domination: A Case for Republican Liberty in Artificial Intelligence

【速读】:该论文试图解决的问题是:人工智能(Artificial Intelligence, AI)在数字广告和社交媒体算法中的应用,正在对共和自由(republican liberty)构成威胁,即个体免于不受约束权力的自由。论文指出,AI通过潜移默化地影响个体行为与思想,以及赋予科技公司高管和外国势力干预国内政治进程(如选举)的能力,导致一种新型的“数字支配”(digital domination),从而削弱了公民的实质自由。解决方案的关键在于建立机制,使个体能够对算法及其开发者进行问责,以确保在AI融入社会的过程中维护共和自由的核心原则。

链接: https://arxiv.org/abs/2510.00312
作者: Matthew David Hamilton
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence is set to revolutionize social and political life in unpredictable ways, raising questions about the principles that ought to guide its development and regulation. By examining digital advertising and social media algorithms, this article highlights how artificial intelligence already poses a significant threat to the republican conception of liberty – or freedom from unaccountable power – and thereby highlights the necessity of protecting republican liberty when integrating artificial intelligence into society. At an individual level, these algorithms can subconsciously influence behavior and thought, and those subject to this influence have limited power over the algorithms they engage. At the political level, these algorithms give technology company executives and other foreign parties the power to influence domestic political processes, such as elections; the multinational nature of algorithm-based platforms and the speed with which technology companies innovate make incumbent state institutions ineffective at holding these actors accountable. At both levels, artificial intelligence has thus created a new form of unfreedom: digital domination. By drawing on the works of Quentin Skinner, Philip Pettit, and other republican theorists, this article asserts that individuals must have mechanisms to hold algorithms (and those who develop them) accountable in order to be truly free.
zh

[AI-95] BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models

【速读】:该论文旨在解决工具增强型大语言模型(Large Language Models, LLMs)在调用外部工具时存在的选择偏见问题,即模型在多个功能等效的工具提供者中表现出系统性偏好,可能导致用户体验下降和市场竞争失衡。解决方案的关键在于提出一种轻量级缓解策略:首先基于语义相关性筛选候选工具集,再对筛选后的工具进行均匀采样,从而在保持任务覆盖能力的同时显著降低选择偏见。

链接: https://arxiv.org/abs/2510.00307
作者: Thierry Blankenstein,Jialin Yu,Zixuan Li,Vassilis Plachouras,Sunando Sengupta,Philip Torr,Yarin Gal,Alasdair Paren,Adel Bibi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agents backed by large language models (LLMs) often rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical point concerning fairness: if selection is systematically biased, it can degrade user experience and distort competition by privileging some providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to evaluate tool-selection bias. Using this benchmark, we test seven models and show that unfairness exists with models either fixating on a single provider or disproportionately preferring earlier-listed tools in context. To investigate the origins of this bias, we conduct controlled experiments examining tool features, metadata (name, description, parameters), and pre-training exposure. We find that: (1) semantic alignment between queries and metadata is the strongest predictor of choice; (2) perturbing descriptions significantly shifts selections; and (3) repeated pre-training exposure to a single endpoint amplifies bias. Finally, we propose a lightweight mitigation that first filters the candidate tools to a relevant subset and then samples uniformly, reducing bias while preserving good task coverage. Our findings highlight tool-selection bias as a key obstacle for the fair deployment of tool-augmented LLMs.
zh
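
论文提出的缓解策略(先按相关性过滤候选工具,再在子集中均匀采样)可以写成一个很短的示意函数;其中相似度函数与阈值为本文假设,仅表达思路:

```python
import random

def mitigated_tool_choice(query, tools, relevance_fn, threshold=0.7, seed=None):
    """tools: [{'name': ..., 'description': ...}];relevance_fn 返回 [0,1] 相似度"""
    relevant = [t for t in tools if relevance_fn(query, t["description"]) >= threshold]
    if not relevant:                    # 兜底:无工具过阈值时退回全集
        relevant = tools
    return random.Random(seed).choice(relevant)   # 均匀采样,消除提供方/位置偏好

# 四个功能等效的工具在均匀采样下被等概率选中
tools = [{"name": f"weather_api_{i}", "description": "returns current weather"}
         for i in range(4)]
print(mitigated_tool_choice("今天天气如何?", tools, lambda q, d: 0.9)["name"])
```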

[AI-96] Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

【速读】:该论文旨在解决深度学习模型在非平稳环境(non-stationary environments)中因可塑性丧失(Loss of Plasticity, LoP)而导致的持续学习能力退化问题。LoP是指模型在未来学习能力的下降,其根源在于参数空间中由梯度轨迹陷入稳定流形所引发的陷阱。论文的关键解决方案在于从动力系统理论出发,首次对LoP进行形式化定义,并揭示其两大机制:一是激活饱和导致的“冻结单元”(frozen units),二是表示冗余引起的“克隆单元流形”(cloned-unit manifolds)。研究进一步指出,在静态环境下促进泛化性能的特性(如低秩表示和简单性偏好)反而加剧了LoP,从而揭示了泛化与持续学习之间的根本权衡。这一框架为设计鲁棒的持续学习架构或引入针对性扰动提供了理论依据和优化方向。

链接: https://arxiv.org/abs/2510.00304
作者: Amir Joudaki,Giulia Lanzillotta,Mohammad Samragh Razlighi,Iman Mirzadeh,Keivan Alizadeh,Thomas Hofmann,Mehrdad Farajtabar,Fartash Faghri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning models excel in stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.
zh
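
“冻结单元”(frozen units)源于激活饱和:以 tanh 为例,|激活|≈1 时梯度趋近于 0。下面用 PyTorch 给出一个统计饱和单元比例的最小示意(网络结构、输入尺度与阈值均为假设):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 64), nn.Tanh())
x = torch.randn(256, 10) * 5.0        # 大尺度输入更容易把单元推入饱和区

with torch.no_grad():
    h = net(x)                                          # [256, 64] 激活
    saturated = (h.abs() > 0.99).float().mean(dim=0)    # 每个单元的饱和频率
    frozen = (saturated > 0.95).sum().item()            # 几乎恒饱和的单元数

print(f"{frozen}/64 个单元在 95% 以上样本上饱和(梯度近 0,近似失去可塑性)")
```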

[AI-97] ICL Optimized Fragility

【速读】:该论文试图解决的问题是:上下文学习(In-Context Learning, ICL)引导策略对大语言模型(Large Language Models, LLMs)跨知识领域推理能力的影响尚不明确。为探究这一问题,研究者使用六种GPT-OSS:20b模型变体(一个基线模型和五种ICL配置:简单提示、思维链、随机文本、附加文本与符号语言),在840项测试中评估其在常识问答、逻辑谜题和数学奥林匹克竞赛任务上的表现。关键解决方案在于通过系统性实验设计与统计分析(ANOVA),揭示了ICL引导会引发“优化脆弱性”(optimized fragility)现象——即模型在一般知识任务上准确率提升至91%-99%,但在复杂推理任务(如逻辑谜题)中性能显著下降(降至10%-43%),而数学奥林匹克问题则不受影响(p=0.2173)。这表明ICL引导存在效率与推理灵活性之间的系统性权衡,为LLM部署和AI安全提供了重要启示。

链接: https://arxiv.org/abs/2510.00300
作者: Serena Gomez Wannaz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:ICL guides are known to improve task-specific performance, but their impact on cross-domain cognitive abilities remains unexplored. This study examines how ICL guides affect reasoning across different knowledge domains using six variants of the GPT-OSS:20b model: one baseline model and five ICL configurations (simple, chain-of-thought, random, appended text, and symbolic language). The models were subjected to 840 tests spanning general knowledge questions, logic riddles, and a mathematical olympiad problem. Statistical analysis (ANOVA) revealed significant behavioral modifications (p < 0.001) across ICL variants, demonstrating a phenomenon termed “optimized fragility.” ICL models achieved 91%-99% accuracy on general knowledge tasks while showing degraded performance on complex reasoning problems, with accuracy dropping to 10-43% on riddles compared to 43% for the baseline model. Notably, no significant differences emerged on the olympiad problem (p=0.2173), suggesting that complex mathematical reasoning remains unaffected by ICL optimization. These findings indicate that ICL guides create systematic trade-offs between efficiency and reasoning flexibility, with important implications for LLM deployment and AI safety.
zh

[AI-98] Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models

【速读】:该论文旨在解决扩散语言模型(Diffusion Large Language Models, DLLMs)在推理效率上的瓶颈问题。由于DLLMs具有双向注意力机制,虽在上下文建模能力上优于自回归模型,但其天然不兼容KV Cache,导致推理速度远低于传统自回归模型。现有并行解码方法虽可提升速度,却会引入显著性能损失。论文提出一种名为Free Draft-and-Verification (Freedave) 的无损并行采样算法,其核心在于设计了一个由并行候选生成与验证组成的流水线,能够在不增加额外模型前向计算的前提下,精确复现静态采样的结果,从而实现高达2.8倍的吞吐量提升且无性能退化。

链接: https://arxiv.org/abs/2510.00294
作者: Shutong Wu,Jiawei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Thanks to their bidirectional attention mechanism, DLLMs are more capable of capturing the connection of context, and thus show unique advantages in challenges like the famous “reversal curse” or learning under data-constrained scenarios. However, this bidirectional nature also brings an obstacle that DLLMs are not inherently compatible with KV Cache, and consequently, the inference efficiency is not competitive compared with autoregressive models. Taking advantage of their inherent capability of multi-token prediction, existing parallel decoding algorithms can speed up the DLLM inference, but at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (Freedave), a novel fast sampling algorithm tailored for DLLMs that achieves lossless parallel decoding. Specifically, we propose a pipeline of parallel-decoded candidate generation and verification, which is guaranteed to reproduce the same sequence generated by static sampling, without introducing extra model forward calls. By applying Freedave, the throughput of DLLMs can be boosted up to 2.8× without performance degradation on math reasoning tasks.
zh
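
“草拟-验证”式无损并行解码的通用骨架可示意如下(接口均为本文假设;真实实现会在一次前向中批量验证全部草拟 token,而非像此处逐位调用):

```python
def draft_and_verify(parallel_propose, static_next, seq, k=8):
    """parallel_propose(seq, k) -> 一次并行草拟 k 个候选 token(假设接口)
    static_next(seq) -> 静态采样在该前缀下的下一个 token(假设接口)
    seq 为已生成 token 列表;返回值与纯静态采样逐位一致,故为“无损”。"""
    draft = parallel_propose(seq, k)
    accepted = []
    for tok in draft:                     # 真实系统中该验证在一次前向内完成
        if tok == static_next(seq + accepted):
            accepted.append(tok)
        else:
            break                         # 首个不一致处截断,保证输出一致性
    if not accepted:
        accepted = [static_next(seq)]     # 至少推进一个 token,避免停滞
    return seq + accepted
```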

[AI-99] SLogic: Subgraph-Informed Logical Rule Learning for Knowledge Graph Completion

【速读】:该论文旨在解决知识图谱补全(Knowledge Graph Completion)中逻辑规则(Logical Rule)应用的局限性问题,即现有方法通常将逻辑规则视为全局通用规则,为其分配固定的置信度分数,忽略了查询特定上下文对规则重要性的动态影响。为克服这一问题,论文提出SLogic(Subgraph-Informed Logical Rule learning)框架,其核心创新在于设计了一个基于查询头实体为中心的局部子图(subgraph)信息的评分函数,使逻辑规则的置信度能够根据具体查询动态调整,从而更精准地捕捉规则在不同上下文中的有效性。实验表明,该方法在多个基准数据集上显著优于当前主流的嵌入式与规则式基线模型。

链接: https://arxiv.org/abs/2510.00279
作者: Trung Hoang Le,Tran Cao Son,Huiping Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Logical rule-based methods offer an interpretable approach to knowledge graph completion by capturing compositional relationships in the form of human-readable inference rules. However, current approaches typically treat logical rules as universal, assigning each rule a fixed confidence score that ignores query-specific context. This is a significant limitation, as a rule’s importance can vary depending on the query. To address this, we introduce SLogic (Subgraph-Informed Logical Rule learning), a novel framework that assigns query-dependent scores to logical rules. The core of SLogic is a scoring function that utilizes the subgraph centered on a query’s head entity, allowing the significance of each rule to be assessed dynamically. Extensive experiments on benchmark datasets show that by leveraging local subgraph context, SLogic consistently outperforms state-of-the-art baselines, including both embedding-based and rule-based methods.
zh
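
查询相关的规则打分可以抽象为“子图编码向量与规则嵌入的内积”。下面是一个高度简化的示意(用头实体邻域中关系嵌入的均值代替真实系统中的子图编码器,实体、关系与维度均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
rule_emb = {r: rng.normal(size=d) for r in ["rule_1", "rule_2", "rule_3"]}
rel_emb = {r: rng.normal(size=d) for r in ["born_in", "works_at", "friend_of"]}

def encode_subgraph(head, neighbors):
    # 假设:以邻域关系嵌入的均值近似子图表示(真实系统会用 GNN 等编码器)
    vecs = [rel_emb[rel] for rel, _ in neighbors.get(head, [])]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)

def score_rule(rule, head, neighbors):
    # 同一条规则在不同查询(不同头实体子图)下得到不同的置信度
    return float(rule_emb[rule] @ encode_subgraph(head, neighbors))

kg = {"Alice": [("born_in", "Paris"), ("works_at", "Acme")]}
print({r: round(score_rule(r, "Alice", kg), 3) for r in rule_emb})
```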

[AI-100] MAGIC-MASK: Multi-Agent Guided Inter-Agent Collaboration with Mask-Based Explainability for Reinforcement Learning

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning)代理在安全关键和多智能体环境中的决策过程可解释性问题。现有方法如StateMask虽能识别关键状态,但在计算成本、探索覆盖范围及多智能体场景适应性方面存在局限。解决方案的关键在于提出一个数学基础坚实的框架MAGIC-MASK(Multi-Agent Guided Inter-agent Collaboration with Mask-Based Explainability for Reinforcement Learning),其核心创新是通过轨迹扰动、奖励保真度分析与Kullback-Leibler散度正则化构建统一的多智能体可解释性形式化体系,实现基于掩码的解释从单智能体向多智能体系统的泛化。该方法结合近端策略优化(Proximal Policy Optimization)、自适应ε-greedy探索和轻量级跨智能体协作机制,使各智能体能够执行显著性引导的掩码操作并共享基于奖励的洞察,从而提升解释保真度、加速关键状态发现,并增强学习效率与策略鲁棒性。

链接: https://arxiv.org/abs/2510.00274
作者: Maisha Maliha,Dean Hougen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 16 pages, 3 figures

点击查看摘要

Abstract:Understanding the decision-making process of Deep Reinforcement Learning agents remains a key challenge for deploying these systems in safety-critical and multi-agent environments. While prior explainability methods like StateMask, have advanced the identification of critical states, they remain limited by computational cost, exploration coverage, and lack of adaptation to multi-agent settings. To overcome these limitations, we propose a mathematically grounded framework, MAGIC-MASK (Multi-Agent Guided Inter-agent Collaboration with Mask-Based Explainability for Reinforcement Learning), that extends perturbation-based explanation to Multi-Agent Reinforcement Learning. Our method integrates Proximal Policy Optimization, adaptive epsilon-greedy exploration, and lightweight inter-agent collaboration to share masked state information and peer experience. This collaboration enables each agent to perform saliency-guided masking and share reward-based insights with peers, reducing the time required for critical state discovery, improving explanation fidelity, and leading to faster and more robust learning. The core novelty of our approach lies in generalizing explainability from single-agent to multi-agent systems through a unified mathematical formalism built on trajectory perturbation, reward fidelity analysis, and Kullback-Leibler divergence regularization. This framework yields localized, interpretable explanations grounded in probabilistic modeling and multi-agent Markov decision processes. We validate our framework on both single-agent and multi-agent benchmarks, including a multi-agent highway driving environment and Google Research Football, demonstrating that MAGIC-MASK consistently outperforms state-of-the-art baselines in fidelity, learning efficiency, and policy robustness while offering interpretable and transferable explanations.
zh

[AI-101] A Hierarchical Agentic Framework for Autonomous Drone-Based Visual Inspection

【速读】:该论文旨在解决工业现场物理资产(如设备、仪表等)自主巡检中自动化程度不足的问题,尤其是如何将智能体(agentic)框架从数字任务扩展到真实环境中的无人机控制与执行。其关键解决方案是提出一种分层式智能体架构:由一个负责高层规划与评估的“主智能体”和多个执行具体任务的“工作智能体”组成,每个工作智能体控制一架无人机;同时引入ReActEval推理方法,使无人机在自然语言驱动下遵循“计划-推理-执行-评估”循环,从而完成从简单导航到复杂视觉识别(如读取压力表)的多样化任务。该方案通过自然语言交互实现灵活、可解释且用户友好的自主决策,显著提升了工业场景下无人机巡检的智能化水平与效率。

链接: https://arxiv.org/abs/2510.00259
作者: Ethan Herron,Xian Yeow Lee,Gregory Sin,Teresa Gonzalez Diaz,Ahmed Farahat,Chetan Gupta
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Autonomous inspection systems are essential for ensuring the performance and longevity of industrial assets. Recently, agentic frameworks have demonstrated significant potential for automating inspection workflows but have been limited to digital tasks. Their application to physical assets in real-world environments, however, remains underexplored. In this work, our contributions are two-fold: first, we propose a hierarchical agentic framework for autonomous drone control, and second, a reasoning methodology for individual function executions which we refer to as ReActEval. Our framework focuses on visual inspection tasks in indoor industrial settings, such as interpreting industrial readouts or inspecting equipment. It employs a multi-agent system comprising a head agent and multiple worker agents, each controlling a single drone. The head agent performs high-level planning and evaluates outcomes, while worker agents implement ReActEval to reason over and execute low-level actions. Operating entirely in natural language, ReActEval follows a plan, reason, act, evaluate cycle, enabling drones to handle tasks ranging from simple navigation (e.g., fly forward 10 meters and land) to complex high-level tasks (e.g., locating and reading a pressure gauge). The evaluation phase serves as a feedback and/or replanning stage, ensuring actions align with user objectives while preventing undesirable outcomes. We evaluate the framework in a simulated environment with two worker agents, assessing performance qualitatively and quantitatively based on task completion across varying complexity levels and workflow efficiency. By leveraging natural language processing for agent communication, our approach offers a novel, flexible, and user-accessible alternative to traditional drone-based solutions, enabling autonomous problem-solving for industrial inspection without extensive user intervention.
zh
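
ReActEval 的“计划-推理-执行-评估”循环可以写成如下控制流骨架(llm、execute 等接口与提示词均为本文假设,仅示意工作智能体的闭环,而非论文的原始实现):

```python
def react_eval(task, llm, execute, max_rounds=5):
    """llm(prompt) -> 文本;execute(action) -> 执行无人机指令并返回观测(假设接口)"""
    plan = llm(f"为任务制定分步计划: {task}")
    observation = None
    for _ in range(max_rounds):
        thought = llm(f"基于计划进行推理,给出下一步动作: {plan}")       # Reason
        action = llm(f"将推理转化为具体无人机指令: {thought}")           # Act
        observation = execute(action)        # 如“前进10米并降落”
        verdict = llm(f"评估结果是否符合目标: {observation}")            # Evaluate
        if "完成" in verdict:
            return observation
        plan = llm(f"根据评估反馈重新规划: {verdict}")   # 评估阶段触发再规划
    return observation
```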

[AI-102] Can AI agents understand spoken conversations about data visualizations in online meetings?

【速读】:该论文旨在解决AI代理在在线会议场景中对数据可视化相关口语对话的理解能力问题,这是提升AI助手机器人会议支持质量的关键前提。其解决方案的核心在于构建一个双轴测试框架,用于诊断模型对可视化讨论的 comprehensibility(理解能力),并通过一系列针对72个新型口语对话数据集的测试,系统评估不同模型架构(如大语言模型LLM与视觉语言模型VLM)及输入模态(图表图像、源代码或两者混合)对性能的影响。实验结果表明,仅使用文本输入模态时模型表现最佳(准确率达96%),揭示了当前任务下文本信息对于理解可视化对话的重要性。

链接: https://arxiv.org/abs/2510.00245
作者: Rizul Sharma,Tianyu Jiang,Seokki Lee,Jillian Aurisano
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this short paper, we present work evaluating an AI agent’s understanding of spoken conversations about data visualizations in an online meeting scenario. There is growing interest in the development of AI-assistants that support meetings, such as by providing assistance with tasks or summarizing a discussion. The quality of this support depends on a model that understands the conversational dialogue. To evaluate this understanding, we introduce a dual-axis testing framework for diagnosing the AI agent’s comprehension of spoken conversations about data. Using this framework, we designed a series of tests to evaluate understanding of a novel corpus of 72 spoken conversational dialogues about data visualizations. We examine diverse pipelines and model architectures, LLM vs VLM, and diverse input formats for visualizations (the chart image, its underlying source code, or a hybrid of both) to see how this affects model performance on our tests. Using our evaluation methods, we found that text-only input modalities achieved the best performance (96%) in understanding discussions of visualizations in online meetings.
zh

[AI-103] SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence

【速读】:该论文旨在解决通用语言模型在网络安全(cybersecurity)领域中因缺乏领域特定适应性而导致的高精度不足问题,尤其是在处理威胁情报数据、源代码与自然语言混合文档时的表现受限。其解决方案的关键在于提出SecureBERT 2.0,一个专为网络安全任务设计的编码器-only Transformer 模型,通过引入改进的长文本建模能力和分层编码机制,有效处理包含威胁报告和源代码等异构内容的长文档;同时,在超过130亿个文本标记和5300万个代码标记的领域特有语料库上进行预训练,显著提升了在语义搜索、实体识别、漏洞检测等关键任务上的性能表现。

链接: https://arxiv.org/abs/2510.00240
作者: Ehsan Aghaei,Sarthak Jain,Prashanth Arun,Arjun Sambamoorthy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Effective analysis of cybersecurity and threat intelligence data demands language models that can interpret specialized terminology, complex document structures, and the interdependence of natural language and source code. Encoder-only transformer architectures provide efficient and robust representations that support critical tasks such as semantic search, technical entity extraction, and semantic analysis, which are key to automated threat detection, incident triage, and vulnerability assessment. However, general-purpose language models often lack the domain-specific adaptation required for high precision. We present SecureBERT 2.0, an enhanced encoder-only language model purpose-built for cybersecurity applications. Leveraging the ModernBERT architecture, SecureBERT 2.0 introduces improved long-context modeling and hierarchical encoding, enabling effective processing of extended and heterogeneous documents, including threat reports and source code artifacts. Pretrained on a domain-specific corpus more than thirteen times larger than its predecessor, comprising over 13 billion text tokens and 53 million code tokens from diverse real-world sources, SecureBERT 2.0 achieves state-of-the-art performance on multiple cybersecurity benchmarks. Experimental results demonstrate substantial improvements in semantic search for threat intelligence, semantic analysis, cybersecurity-specific named entity recognition, and automated vulnerability detection in code within the cybersecurity domain.
zh

[AI-104] Debunk the Myth of SFT Generalization

【速读】:该论文旨在解决当前对监督微调(Supervised Fine-Tuning, SFT)普遍存在的误解,即认为SFT仅能记忆训练数据而无法泛化,相较强化学习(Reinforcement Learning, RL)缺乏鲁棒性。通过在Sokoban和General Points两个决策类基准上的系统评估,研究发现SFT性能不佳的主要原因是“冻结提示(frozen-prompt)”现象:当使用固定指令模板训练时,模型会固守训练语义而非适应新指令变体。解决方案的关键在于引入训练期间的提示多样性(prompt diversity),打破这一捷径,从而在不损害分布内性能的前提下实现对未见指令变体的强泛化能力;进一步地,结合链式思维(Chain-of-Thought, CoT)监督可提升模型对更复杂任务(如更大规模的Sokoban网格或组合复杂度更高的算术问题)的迁移能力。最终,将提示多样性与CoT结合,实现了跨指令变体和难度变体场景下的稳健泛化,其效果可媲美甚至超越RL基线,同时保持SFT的简洁性和稳定性。这表明SFT并非本质上劣于RL,而是依赖于高质量演示数据的设计与组织。

链接: https://arxiv.org/abs/2510.00237
作者: Xiaofeng Lin,Hejian Sang,Zhipeng Wang,Xuezhou Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A prevailing view holds that supervised fine-tuning (SFT) memorizes training data and fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We revisit this claim through a systematic evaluation on two decision-making benchmarks, Sokoban and General Points, and arrive at a different conclusion. We show that much of SFT’s perceived failure stems from frozen-prompt artifacts: when trained on fixed instruction templates, SFT models cling to training semantics rather than adapting to new ones. Introducing prompt diversity during training breaks this shortcut and yields strong generalization to unseen instruction variants without harming in-distribution performance. Beyond instruction shifts, we ask whether SFT can generalize to strictly harder tasks. Here, chain-of-thought (CoT) supervision provides an algorithmic scaffold that markedly improves transfer to more difficult regimes, such as larger Sokoban grids with additional boxes and arithmetic with out-of-distribution values or five-card compositions that increase combinatorial complexity. Finally, combining prompt diversity with CoT achieves the best of both worlds: robust generalization across both instruction-variant and difficulty-variant settings, matching or surpassing RL baselines on our benchmarks while retaining SFT’s simplicity and stability. These findings challenge the narrative that SFT is inherently inferior to RL and support a data-centric perspective: with appropriately curated demonstrations, vanilla SFT can generalize as strongly as RL. Code reproducing the results in the paper can be found at: this https URL.
zh

[AI-105] The Pitfalls of KV Cache Compression

【速读】:该论文旨在解决KV缓存压缩(KV cache compression)在实际多指令提示(multi-instruction prompting)场景中可能引发的性能退化问题,尤其是某些指令因压缩导致显著失真甚至被模型完全忽略的现象。其关键解决方案在于识别出影响提示泄露(prompt leakage)的三个核心因素——压缩方法、指令顺序和KV缓存驱逐偏倚(KV eviction bias),并提出对KV缓存驱逐策略进行简单调整,从而有效缓解上述问题,提升模型在复杂多指令任务中的整体表现。

链接: https://arxiv.org/abs/2510.00231
作者: Alex Chen,Renato Geh,Aditya Grover,Guy Van den Broeck,Daniel Israel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls practitioners should be aware of when deploying KV cache compressed LLMs. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example of that, we highlight system prompt leakage as a case study, empirically showing the impact of compression on leakage and general instruction following. We show several factors that play a role in prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.
zh
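
摘要中提到“对 KV 缓存驱逐策略做简单调整”;一种直观的做法是为系统提示/指令 token 加保护位,只在未保护的 token 中按重要性分数淘汰。下面给出一个与具体压缩方法无关的示意(重要性分数的来源为假设,例如注意力累计值):

```python
def evict(kv_entries, budget):
    """kv_entries: [{'pos': int, 'score': float, 'protected': bool}, ...]
    假设受保护 token 数不超过 budget;返回保留的条目,按原始位置排序。"""
    if len(kv_entries) <= budget:
        return kv_entries
    protected = [e for e in kv_entries if e["protected"]]        # 指令 token 免于驱逐
    evictable = sorted((e for e in kv_entries if not e["protected"]),
                       key=lambda e: e["score"], reverse=True)   # 其余按分数淘汰
    keep = protected + evictable[: max(0, budget - len(protected))]
    return sorted(keep, key=lambda e: e["pos"])

cache = [{"pos": i, "score": s, "protected": i < 2}
         for i, s in enumerate([0.1, 0.2, 0.9, 0.3, 0.8, 0.05])]
print([e["pos"] for e in evict(cache, budget=4)])   # [0, 1, 2, 4]
```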

[AI-106] DualTune: Decoupled Fine-Tuning for On-Device Agent ic Systems

【速读】:该论文旨在解决本地部署大型语言模型(Large Language Models, LLMs)在工具调用(tool calling)场景中性能不足的问题,特别是面对大规模工具集时的工具选择准确率低以及复杂参数结构下参数生成不准确的问题。其核心解决方案是提出“解耦微调”(decoupled fine-tuning)方法,将工具调用任务拆分为工具选择和参数生成两个独立子任务,并采用LoRA(Low-Rank Adaptation)微调技术为每个子任务分别训练专用的LoRA适配器,通过损失掩码(loss masking)实现任务间的隔离优化。进一步地,作者设计了DualTune推理框架,动态加载对应LoRA适配器并结合分层编排策略(hierarchical orchestration)提升本地模型在端设备上的高效代理调度能力,实验表明该方法显著提升了工具调用准确性,优于同类规模及更大模型的现有方案。

链接: https://arxiv.org/abs/2510.00229
作者: Rohan Kadekodi,Zhan Jin,Keisuke Kamahori,Yile Gu,Sean Khatiri,Noah H. Bayindirli,Sergey Gorbunov,Baris Kasikci
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) as agentic orchestrators has revolutionized task automation, but the need for privacy-preserving, cost-effective solutions demands on-device inference capabilities. However, local LLMs consistently underperform compared to frontier models in tool calling scenarios, struggling with both tool selection from large tool sets and accurate argument generation for complex parameter structures. We introduce a methodology that disaggregates a tool-calling task into two distinct subtasks: tool selection and argument generation. We propose “decoupled fine-tuning”, a novel post-training approach that employs LoRA fine-tuning to create dedicated LoRA adapters for tool selection and tool-specific argument generation using separate loss masking for each of the subtasks. Furthermore, we present DualTune, an inference framework that leverages the LoRA adapters created using decoupled fine-tuning to perform efficient agent orchestration with the help of local models on end-user devices. DualTune decomposes the tool-call generation step into tool selection and argument generation, and dynamically loads the corresponding LoRA adapters to generate tool calls. Additionally, DualTune implements hierarchical orchestration to restrict the number of tools required for tool selection. Our experiments on the MCP-Bench benchmark demonstrate that the Qwen-2.5-7B model trained using decoupled fine-tuning improves the tool calling accuracy of the base model by 46%, and outperforms other local reasoning, non-reasoning and fine-tuned models of similar size in all cases, and models that are 2x larger, in most cases.
zh
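
解耦微调的核心是对同一条训练序列用不同损失掩码分别训练两个 LoRA 适配器。下面用 PyTorch 示意 token 级损失掩码的计算(序列长度、词表与掩码标注方式均为假设):

```python
import torch

def masked_lm_loss(logits, labels, active_mask):
    """logits: [T, V];labels: [T];active_mask: [T],1 表示该 token 参与训练"""
    ce = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
    return (ce * active_mask).sum() / active_mask.sum().clamp(min=1)

T, V = 6, 100
logits = torch.randn(T, V)
labels = torch.randint(0, V, (T,))
tool_mask = torch.tensor([1., 1., 0., 0., 0., 0.])   # 前两个 token 为工具名
arg_mask = 1 - tool_mask                              # 其余 token 为调用参数

loss_tool_adapter = masked_lm_loss(logits, labels, tool_mask)  # 训练工具选择适配器
loss_arg_adapter = masked_lm_loss(logits, labels, arg_mask)    # 训练参数生成适配器
```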

[AI-107] TGPO: Temporal Grounded Policy Optimization for Signal Temporal Logic Tasks

【速读】:该论文旨在解决复杂、长时程任务中基于信号时序逻辑(Signal Temporal Logic, STL)的控制策略学习问题,这类任务因STL的非马尔可夫特性及稀疏奖励结构,难以通过标准强化学习(Reinforcement Learning, RL)算法有效求解。解决方案的关键在于提出Temporal Grounded Policy Optimization (TGPO),其核心创新是将STL分解为时序子目标和不变性约束,并构建分层框架:高层组件负责分配各子目标的时间资源,低层时条件策略则利用密集的阶段奖励信号实现子目标序列达成;推理阶段通过采样多种时间分配方案并选择最优者引导策略网络生成轨迹,同时借助已学 critic 指导高层时间搜索,采用 Metropolis-Hastings 采样聚焦于时序可行解空间,从而显著提升高维与长时程STL任务的学习效率与成功率。

链接: https://arxiv.org/abs/2510.00225
作者: Yue Meng,Fei Chen,Chuchu Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Learning control policies for complex, long-horizon tasks is a central challenge in robotics and autonomous systems. Signal Temporal Logic (STL) offers a powerful and expressive language for specifying such tasks, but its non-Markovian nature and inherent sparse reward make it difficult to be solved via standard Reinforcement Learning (RL) algorithms. Prior RL approaches focus only on limited STL fragments or use STL robustness scores as sparse terminal rewards. In this paper, we propose TGPO, Temporal Grounded Policy Optimization, to solve general STL tasks. TGPO decomposes STL into timed subgoals and invariant constraints and provides a hierarchical framework to tackle the problem. The high-level component of TGPO proposes concrete time allocations for these subgoals, and the low-level time-conditioned policy learns to achieve the sequenced subgoals using a dense, stage-wise reward signal. During inference, we sample various time allocations and select the most promising assignment for the policy network to rollout the solution trajectory. To foster efficient policy learning for complex STL with multiple subgoals, we leverage the learned critic to guide the high-level temporal search via Metropolis-Hastings sampling, focusing exploration on temporally feasible solutions. We conduct experiments on five environments, ranging from low-dimensional navigation to manipulation, drone, and quadrupedal locomotion. Under a wide range of STL tasks, TGPO significantly outperforms state-of-the-art baselines (especially for high-dimensional and long-horizon cases), with an average of 31.6% improvement in task success rate compared to the best baseline. The code will be available at this https URL
zh

[AI-108] Directed-MAML: Meta Reinforcement Learning Algorithm with Task-directed Approximation

【速读】:该论文旨在解决模型无关元学习(Model-Agnostic Meta-Learning, MAML)在元强化学习(meta-RL)应用中面临的两大挑战:一是依赖二阶梯度计算导致的显著计算和内存开销;二是嵌套优化结构增加问题复杂度,使得收敛至全局最优更加困难。解决方案的关键在于提出一种新的任务导向型元强化学习算法——Directed-MAML,其在执行二阶梯度步骤前引入一个额外的一阶任务导向近似(task-directed approximation),以估计二阶梯度的影响,从而加速收敛并降低计算成本。该方法不仅提升了MAML基线在CartPole-v1、LunarLander-v2及双车交叉路口场景中的计算效率和收敛速度,还可推广至FOMAML和Meta-SGD等其他元学习算法,进一步增强整体性能。

链接: https://arxiv.org/abs/2510.00212
作者: Yang Zhang,Huiwen Yan,Mushuang Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model-Agnostic Meta-Learning (MAML) is a versatile meta-learning framework applicable to both supervised learning and reinforcement learning (RL). However, applying MAML to meta-reinforcement learning (meta-RL) presents notable challenges. First, MAML relies on second-order gradient computations, leading to significant computational and memory overhead. Second, the nested structure of optimization increases the problem’s complexity, making convergence to a global optimum more challenging. To overcome these limitations, we propose Directed-MAML, a novel task-directed meta-RL algorithm. Before the second-order gradient step, Directed-MAML applies an additional first-order task-directed approximation to estimate the effect of second-order gradients, thereby accelerating convergence to the optimum and reducing computational cost. Experimental results demonstrate that Directed-MAML surpasses MAML-based baselines in computational efficiency and convergence speed in the scenarios of CartPole-v1, LunarLander-v2 and two-vehicle intersection crossing. Furthermore, we show that task-directed approximation can be effectively integrated into other meta-learning algorithms, such as First-Order Model-Agnostic Meta-Learning (FOMAML) and Meta Stochastic Gradient Descent(Meta-SGD), yielding improved computational efficiency and convergence speed.
zh
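
Directed-MAML 在二阶梯度步之前插入一次一阶“任务导向”更新,其计算流程可粗略示意如下(玩具任务与更新规则均为高度简化的假设,并省略了真正的二阶步):

```python
import numpy as np

def inner_update(theta, task_grad, alpha=0.1):
    return theta - alpha * task_grad(theta)          # 各任务的内层适应

def directed_maml_step(theta, task_grads, alpha=0.1, beta=0.05):
    # 一阶任务导向近似:用“适应后参数”处的任务梯度均值近似二阶更新方向
    directed = np.mean([g(inner_update(theta, g, alpha)) for g in task_grads], axis=0)
    theta = theta - beta * directed                  # 额外插入的一阶导向步
    # 随后才执行标准 MAML 的二阶梯度步(此处省略,仅示意流程)
    return theta

# 两个玩具任务:梯度分别把参数拉向不同目标点
tasks = [lambda th: th - np.array([1.0, 0.0]),
         lambda th: th - np.array([0.0, 1.0])]
print(directed_maml_step(np.zeros(2), tasks))
```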

[AI-109] LoRAFusion: Efficient LoRA Fine-Tuning for LLM s EUROSYS2026

【速读】:该论文针对当前低秩适配(Low-Rank Adaptation, LoRA)微调系统中存在的两大效率瓶颈展开研究:一是由于对大规模激活张量的冗余内存访问导致显著的运行时开销;二是未能充分利用多独立LoRA适配器共享同一基础大语言模型(Large Language Models, LLMs)时的并行机会,从而错失了减少流水线气泡、提升通信重叠和优化GPU负载均衡等性能增益。为解决上述问题,论文提出LoRAFusion系统,其核心创新在于两个层面:在内核级采用图分割(graph-splitting)方法融合内存密集型操作,在不引入重计算或同步代价的前提下消除冗余内存访问,同时保持计算密集型GEMM(General Matrix Multiply)运算性能;在调度层设计自适应批处理算法,通过分组策略有意错开不同任务的批次执行,并在每组内求解依赖感知的装箱问题以生成负载均衡的微批次。实验表明,LoRAFusion相较Megatron-LM实现最高1.96倍(平均1.47倍)端到端加速,相较mLoRA(当前最优多LoRA系统)提升最高1.46倍(平均1.29倍),且其融合内核可直接作为插件替换现有LoRA系统中的相关模块。

链接: https://arxiv.org/abs/2510.00206
作者: Zhanda Zhu,Qidong Su,Yaoyao Ding,Kevin Song,Shang Wang,Gennady Pekhimenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted by EuroSys 2026

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has become the leading Parameter-Efficient Fine-Tuning (PEFT) method for Large Language Models (LLMs), as it significantly reduces GPU memory usage while maintaining competitive fine-tuned model quality on downstream tasks. Despite these benefits, we identify two key inefficiencies in existing LoRA fine-tuning systems. First, they incur substantial runtime overhead due to redundant memory accesses on large activation tensors. Second, they miss the opportunity to concurrently fine-tune multiple independent LoRA adapters that share the same base model on the same set of GPUs. This leads to missed performance gains such as reduced pipeline bubbles, better communication overlap, and improved GPU load balance. To address these issues, we introduce LoRAFusion, an efficient LoRA fine-tuning system for LLMs. At the kernel level, we propose a graph-splitting method that fuses memory-bound operations. This design eliminates unnecessary memory accesses and preserves the performance of compute-bound GEMMs without incurring the cost of recomputation or synchronization. At the scheduling level, LoRAFusion introduces an adaptive batching algorithm for multi-job fine-tuning. It first splits LoRA adapters into groups to intentionally stagger batch execution across jobs, and then solves a bin-packing problem within each group to generate balanced, dependency-aware microbatches. LoRAFusion achieves up to 1.96× (1.47× on average) end-to-end speedup compared to Megatron-LM, and up to 1.46× (1.29× on average) improvement over mLoRA, the state-of-the-art multi-LoRA fine-tuning system. Our fused kernel achieves up to 1.39× (1.27× on average) kernel performance improvement and can directly serve as a plug-and-play replacement in existing LoRA systems. We open-source LoRAFusion at this https URL.
zh
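
调度层的微批次装箱可以用首次适应递减(FFD)给出一个最小示意(真实系统还需处理作业间依赖与流水线气泡,此处仅保留负载均衡的装箱骨架;样本长度与容量均为假设):

```python
def pack_microbatches(sample_lengths, capacity):
    """按 token 长度用首次适应递减(FFD)装箱,使各微批负载大致均衡"""
    bins = []                                    # 每个 bin 为 [已用容量, [样本长...]]
    for ln in sorted(sample_lengths, reverse=True):
        for b in bins:
            if b[0] + ln <= capacity:            # 放入第一个装得下的微批
                b[0] += ln
                b[1].append(ln)
                break
        else:
            bins.append([ln, [ln]])              # 开新微批
    return [b[1] for b in bins]

print(pack_microbatches([512, 384, 256, 256, 128, 96], capacity=768))
# [[512, 256], [384, 256, 128], [96]]
```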

[AI-110] GRPO-λ: Credit Assignment improves LLM Reasoning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)后训练方法在大型语言模型(Large Language Models, LLMs)复杂推理任务中信用分配(credit assignment)粗粒度的问题,尤其是在基于可验证奖励的策略优化方法(如GRPO)中,由于缺乏显式的奖励函数或价值 critic 模型,导致难以对 token 序列进行精细的梯度更新。解决方案的关键在于提出 GRPO-λ,通过引入 λ-return 的近似机制,利用生成序列后计算的 token 级别对数概率重构优势估计,并设计了一种无需 critic 的时序差分误差(temporal-difference error)近似方法,从而实现更精确的信用分配。此外,论文还探索了多种 λ-return 权重策略在优势追踪中的应用,均显著优于原始 GRPO 方法,在多个数学推理数据集上实现了 3–4.5 点的性能提升。

链接: https://arxiv.org/abs/2510.00194
作者: Prasanna Parthasarathi,Mathieu Reymond,Boxing Chen,Yufei Cui,Sarath Chandar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. In particular, RL-based methods using verifiable rewards, like the state-of-the-art GRPO, have been shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO’s ability to assign fine-grained credit across token sequences. In this work, we present GRPO-λ, a novel extension to GRPO that enhances credit assignment in RL fine-tuning of LLMs for complex reasoning tasks. We approximate learning from the λ-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce a few variations for the weighting of the λ-return and its application to the eligibility trace, where all the variations provide significant gains over GRPO. We compare GRPO-λ against GRPO by training models from 1.5B to 7B parameters on 4 different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-λ, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over 3 points, with a 4.5-point improvement on the 7B model.
zh
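
λ-return 的标准递归形式为 G_t^λ = r_{t+1} + γ[(1−λ)·V(s_{t+1}) + λ·G_{t+1}^λ]。GRPO-λ 中没有 critic,论文改用 token 级对数概率构造近似;下面仅示意 λ-return 本身的计算方式:

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.9):
    """rewards[t] = r_{t+1};next_values[t] = V(s_{t+1});两序列等长"""
    T = len(rewards)
    G = [0.0] * T
    G[-1] = rewards[-1] + gamma * next_values[-1]        # 末步:一步回报
    for t in reversed(range(T - 1)):                     # 自后向前递归
        G[t] = rewards[t] + gamma * ((1 - lam) * next_values[t + 1]
                                     + lam * G[t + 1])
    return G

# 稀疏末端奖励(类似仅在序列结束时可验证的场景)下的 λ-return
print(lambda_returns(rewards=[0.0, 0.0, 1.0], next_values=[0.2, 0.5, 0.0]))
```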

[AI-111] PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning

【速读】:该论文旨在解决低秩适配(Low-rank adaptation, LoRA)在参数高效微调大语言模型时表达能力不足的问题,即如何从过参数化初始化空间中获取高表达力的低秩适配器。其解决方案的关键在于提出PrunedLoRA框架,通过结构化剪枝动态移除微调过程中不重要的组件并防止其重新激活,从而实现灵活且自适应的秩分配;同时,基于梯度的剪枝策略在细粒度更新中最小化整体损失的剪枝误差,并首次提供了结构化剪枝鲁棒性的理论分析,证明梯度驱动剪枝比基于激活的剪枝更具稳定性。

链接: https://arxiv.org/abs/2510.00192
作者: Xin Yu,Cong Xie,Ziyu Zhao,Tiantian Fan,Lingzhou Xue,Zhi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models, yet its representational capacity often lags behind full fine-tuning. Within the context of LoRA, a key open question is how to obtain expressive low-rank adapters from over-parameterized spaces. We propose PrunedLoRA, a new framework that leverages structured pruning to obtain highly representative low-rank adapters from an over-parameterized initialization. Unlike prior approaches that impose a fixed low-rank budget, PrunedLoRA dynamically prunes less important components during fine-tuning and prevents their reactivation, enabling flexible and adaptive rank allocation. For structured pruning, by minimizing the pruning error for overall loss, we provide fine-grained pruning and recovery updates in a gradient-based pruning strategy with grounded interpretation. We provide the first theoretical analysis of the robustness of structured pruning and provably show that under the impact of weight perturbation, gradient-based pruning is more robust than activation-based pruning with respect to overall loss. Empirically, PrunedLoRA consistently outperforms LoRA and its variants across supervised fine-tuning tasks in mathematical reasoning, code generation, and natural language understanding, and it also demonstrates advantages over existing structured pruning methods across diverse sparsity levels.
zh

[AI-112] Thinkquel: A Model Dedicated to Text-to-dbt Using Synthetic Data and a Span-Aware Objective

【速读】:该论文旨在解决自然语言请求到可靠、可生产的数据转换(data transformations)之间的映射难题,核心挑战在于正确性依赖于精确的模式链接(schema linking)和仓库特定的SQL方言(SQL dialects),而训练阶段可用的最强监督信号——执行成功和结果匹配——仅在序列层面提供。此外,构建大规模且经执行验证的数据集成本高昂,且词元级目标与全局信号不一致,导致优化不稳定和模型泛化能力弱。解决方案的关键在于提出Thinkquel框架,其创新点包括:(1) 一种新颖的合成数据生成管道TS-SQL,利用dbt(data build tool)作为可移植的中间表示;(2) 一种面向跨度感知的强化学习目标Token-Sequence GRPO(TS-GRPO),专门用于弥合词元级训练信号与序列级执行奖励之间的差距。实验表明,Thinkquel在TS-SQL测试集上达到93.2%执行成功率和61.8%精确结果匹配率,并显著提升训练稳定性和收敛速度。

链接: https://arxiv.org/abs/2510.00186
作者: Anni Li,Aria Attar,Paul Dong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Transforming natural-language requests into reliable, production-ready data transformations remains challenging: correctness depends on precise schema linking and warehouse-specific SQL dialects, while the strongest supervision available during training–execution success and result matching–is provided only at the sequence level. At the same time, assembling large, execution-validated corpora is costly, and token-level objectives misalign with these global signals, yielding unstable optimization and limited portability. We introduce Thinkquel, a fine-tuned model for producing robust, portable, and execution-validated database queries. Thinkquel integrates a novel synthetic data pipeline, TS-SQL, which leverages dbt as a portable intermediate representation, with a span-aware reinforcement learning objective, Token-Sequence GRPO (TS-GRPO), specifically designed to bridge the gap between token-level training signals and sequence-level execution rewards when fine-tuning LLMs. On the 500-example TS-SQL test set, Thinkquel (32B) reaches 93.2% execution success and 61.8% exact-result match with a two-stage SFT curriculum, improving over the base model by 67.2% (exec.) and 44.4% (match). In Spider (14B) experiments, TS-GRPO increases training stability and speeds convergence of the execution-match reward relative to GRPO and GSPO.
zh

[AI-113] Object-Centric Case-Based Reasoning via Argumentation ECAI25

【速读】:该论文旨在解决图像分类任务中如何有效融合神经网络的特征提取能力与符号推理的可解释性问题,以提升模型在复杂场景下的泛化能力和决策透明度。其解决方案的关键在于提出了一种新颖的神经符号混合框架——Slot Attention Argumentation for Case-Based Reasoning (SAA-CBR),该框架通过神经模块Slot Attention (SA)实现对象中心的学习,同时利用符号推理机制Abstract Argumentation for Case-Based Reasoning (AA-CBR)进行基于案例的逻辑推演,从而在特征组合策略、案例库精简、偏序关系建模、多分类扩展及支持型推理等方面实现了创新性集成,最终在CLEVR-Hans数据集上展现出与基线模型相当甚至更优的分类性能。

链接: https://arxiv.org/abs/2510.00185
作者: Gabriel de Olim Gaul,Adam Gould,Avinash Kori,Francesca Toni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ArgXAI@ECAI25

点击查看摘要

Abstract:We introduce Slot Attention Argumentation for Case-Based Reasoning (SAA-CBR), a novel neuro-symbolic pipeline for image classification that integrates object-centric learning via a neural Slot Attention (SA) component with symbolic reasoning conducted by Abstract Argumentation for Case-Based Reasoning (AA-CBR). We explore novel integrations of AA-CBR with the neural component, including feature combination strategies, casebase reduction via representative samples, novel count-based partial orders, a One-Vs-Rest strategy for extending AA-CBR to multi-class classification, and an application of Supported AA-CBR, a bipolar variant of AA-CBR. We demonstrate that SAA-CBR is an effective classifier on the CLEVR-Hans datasets, showing competitive performance against baseline models.
zh

[AI-114] Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

【速读】:该论文旨在解决语言模型在多数字乘法任务中表现不佳的问题,尤其是其缺乏对长程依赖关系的有效建模能力。研究表明,标准微调策略会使模型收敛到一个局部最优解,该解无法捕捉乘法运算所需的远距离信息传递;而通过逆向工程发现,成功学习乘法的模型采用隐式思维链(implicit chain-of-thought)机制,利用注意力机制构建有向无环图来缓存和检索部分积,并通过Minkowski和与傅里叶基表示实现高效的数字编码。解决方案的关键在于引入辅助损失函数,预测“累加和”作为线性回归探针,从而提供正确的归纳偏置(inductive bias),促使模型学习到必要的长程结构,最终实现对多数字乘法的准确计算。

链接: https://arxiv.org/abs/2510.00184
作者: Xiaoyan Bai,Itamar Pres,Yuntian Deng,Chenhao Tan,Stuart Shieber,Fernanda Viégas,Martin Wattenberg,Andrew Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: Logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to “cache” and “retrieve” pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the “running sum” via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
zh
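
“数字的傅里叶基表示”可以直观理解为:把数字 d 编码成若干频率下的 (cos, sin) 对,使模 10 的加法对应各频率分量的角度相加。下面按此思路给出一个编码示意(频率集合为假设,并非论文中模型学到的具体频率):

```python
import numpy as np

def fourier_digit(d, freqs=(1, 2, 5), base=10):
    """把数字 d 编码为各频率 k 下的 (cos, sin) 特征对"""
    feats = []
    for k in freqs:
        theta = 2 * np.pi * k * d / base
        feats += [np.cos(theta), np.sin(theta)]
    return np.array(feats)

# 该表示下,数字加法(模 10)对应角度相加,
# 因而对进位、部分积等周期性结构十分高效。
print(np.round(fourier_digit(3), 3))
print(np.round(fourier_digit(7), 3))
```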

[AI-115] A Systematic Study of Large Language Models for Task and Motion Planning With PDDLStream

【速读】:该论文旨在解决如何将大语言模型(Large Language Models, LLMs)的语义理解能力与任务与运动规划(Task and Motion Planning, TAMP)的符号推理能力有效结合,以提升复杂机器人任务的求解性能。其核心挑战在于LLMs在TAMP框架中的集成方式多样且缺乏系统性评估。解决方案的关键在于设计16种基于Gemini 2.5 Flash模型替换TAMP关键组件的算法,并通过4,950个问题的零样本实验揭示:尽管LLMs具备一定规划能力,但直接替代传统TAMP模块会导致成功率下降和计算时间增加;进一步发现,引入几何细节会加剧任务规划错误,而无需推理的快速LLM变体在多数情况下优于需推理的慢速版本,因TAMP系统可引导LLM修正其错误,从而实现更高效的协同规划。

链接: https://arxiv.org/abs/2510.00182
作者: Jorge Mendez-Mendez
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Using large language models (LLMs) to solve complex robotics problems requires understanding their planning capabilities. Yet while we know that LLMs can plan on some problems, the extent to which these planning capabilities cover the space of robotics tasks is unclear. One promising direction is to integrate the semantic knowledge of LLMs with the formal reasoning of task and motion planning (TAMP). However, the myriad of choices for how to integrate LLMs within TAMP complicates the design of such systems. We develop 16 algorithms that use Gemini 2.5 Flash to substitute key TAMP components. Our zero-shot experiments across 4,950 problems and three domains reveal that the Gemini-based planners exhibit lower success rates and higher planning times than their engineered counterparts. We show that providing geometric details increases the number of task-planning errors compared to pure PDDL descriptions, and that (faster) non-reasoning LLM variants outperform (slower) reasoning variants in most cases, since the TAMP system can direct the LLM to correct its mistakes.
zh

[AI-116] CHAI: Command Hijacking against embodied AI

【速读】:该论文旨在解决嵌入式人工智能(Embodied AI)系统在面对数据稀缺场景时,因依赖感知与行动的常识推理能力而引入的新安全风险问题。其解决方案的关键在于提出了一种名为CHAI(Command Hijacking against embodied AI)的新型提示攻击方法,该方法通过在视觉输入中嵌入欺骗性自然语言指令(如误导性标识),系统性地搜索词元空间、构建提示词典,并引导攻击模型生成视觉对抗提示(Visual Attack Prompts),从而利用大型视觉-语言模型(LVLMs)的多模态语义理解能力实现对机器人车辆等系统的行为操控。实验表明,CHAI在无人机紧急降落、自动驾驶和空中目标跟踪等多个任务中均显著优于现有最先进攻击方法,凸显了针对下一代嵌入式AI系统需发展超越传统对抗鲁棒性的新型防御机制的紧迫性。

链接: https://arxiv.org/abs/2510.00181
作者: Luis Burbano,Diego Ortiz,Qi Sun,Siwei Yang,Haoqin Tu,Cihang Xie,Yinzhi Cao,Alvaro A Cardenas
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Embodied Artificial Intelligence (AI) promises to handle edge cases in robotic vehicle systems where data is scarce by using common-sense reasoning grounded in perception and action to generalize beyond training distributions and adapt to novel real-world situations. These capabilities, however, also create new security risks. In this paper, we introduce CHAI (Command Hijacking against embodied AI), a new class of prompt-based attacks that exploit the multimodal language interpretation abilities of Large Visual-Language Models (LVLMs). CHAI embeds deceptive natural language instructions, such as misleading signs, in visual input, systematically searches the token space, builds a dictionary of prompts, and guides an attacker model to generate Visual Attack Prompts. We evaluate CHAI on four LVLM agents: drone emergency landing, autonomous driving, and aerial object tracking, and on a real robotic vehicle. Our experiments show that CHAI consistently outperforms state-of-the-art attacks. By exploiting the semantic and multimodal reasoning strengths of next-generation embodied AI systems, CHAI underscores the urgent need for defenses that extend beyond traditional adversarial robustness.
zh

[AI-117] Drones that Think on their Feet: Sudden Landing Decisions with Embodied AI

【速读】:该论文旨在解决自主无人机在面对突发事件(如警报、故障或环境突变)时,缺乏即时且自适应决策能力的问题。传统方法依赖安全工程师手动编写大量恢复规则,难以覆盖真实世界中复杂的不确定性场景,导致系统易失效。解决方案的关键在于利用基于大规模视觉语言模型的具身人工智能(embodied AI),通过赋予无人机对环境的上下文理解能力和实时生成适当动作的常识推理能力,实现动态响应与安全着陆等应急决策,从而构建出此前无法人工设计的自适应恢复与决策流水线,显著提升自主飞行系统的鲁棒性与安全性。

链接: https://arxiv.org/abs/2510.00167
作者: Diego Ortiz Barbosa,Mohit Agrawal,Yash Malegaonkar,Luis Burbano,Axel Andersson,György Dán,Henrik Sandberg,Alvaro A. Cardenas
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Autonomous drones must often respond to sudden events, such as alarms, faults, or unexpected changes in their environment, that require immediate and adaptive decision-making. Traditional approaches rely on safety engineers hand-coding large sets of recovery rules, but this strategy cannot anticipate the vast range of real-world contingencies and quickly becomes incomplete. Recent advances in embodied AI, powered by large visual language models, provide commonsense reasoning to assess context and generate appropriate actions in real time. We demonstrate this capability in a simulated urban benchmark in the Unreal Engine, where drones dynamically interpret their surroundings and decide on sudden maneuvers for safe landings. Our results show that embodied AI makes possible a new class of adaptive recovery and decision-making pipelines that were previously infeasible to design by hand, advancing resilience and safety in autonomous aerial systems.
zh

[AI-118] Privacy-Preserving Learning-Augmented Data Structures

【速读】:该论文旨在解决学习增强型数据结构(learning-augmented data structures)在隐私与安全方面的空白问题,尤其是其内存布局随预测频率动态调整所带来的历史信息泄露风险。传统数据结构在遭遇安全漏洞时应仅暴露当前内容,而学习增强型结构因依赖预测值调整内部组织,可能无意中泄露操作历史。为此,作者提出首个具备强历史独立性(strongly history independent)、鲁棒性且支持动态更新的学习增强型数据结构。解决方案的关键在于引入两种核心技术:阈值化(thresholding),可自动提升任意学习增强型数据结构的鲁棒性以应对对抗性预测误差;配对机制(pairing),一种简单但有效的策略,在动态场景下实现强历史独立性,确保内存表示不泄露除当前状态外的任何历史操作信息。实验表明该方案在安全性与效率之间存在权衡,但仍优于现有最优方法。

链接: https://arxiv.org/abs/2510.00165
作者: Prabhav Goyal,Vinesh Sridhar,Wilson Zheng
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:Learning-augmented data structures use predicted frequency estimates to retrieve frequently occurring database elements faster than standard data structures. Recent work has developed data structures that optimally exploit these frequency estimates while maintaining robustness to adversarial prediction errors. However, the privacy and security implications of this setting remain largely unexplored. In the event of a security breach, data structures should reveal minimal information beyond their current contents. This is even more crucial for learning-augmented data structures, whose layout adapts to the data. A data structure is history independent if its memory representation reveals no information about past operations except what is inferred from its current contents. In this work, we take the first step towards privacy and security guarantees in this setting by proposing the first learning-augmented data structure that is strongly history independent, robust, and supports dynamic updates. To achieve this, we introduce two techniques: thresholding, which automatically makes any learning-augmented data structure robust, and pairing, a simple technique that provides strong history independence in the dynamic setting. Our experimental results demonstrate a tradeoff between security and efficiency but are still competitive with the state of the art.
zh
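
摘要中的“阈值化(thresholding)”思想可以用几行代码勾勒:当预测频率低于某个下限时将其截断提升,从而限制对抗性预测误差对数据结构布局的影响。以下为一个假设性的 Python 草图,threshold 的默认取值(均匀频率 1/n)仅为示意,并非论文的精确构造。

```python
from typing import Optional

def thresholded_priority(predicted_freq: float, n_items: int,
                         threshold: Optional[float] = None) -> float:
    """示意:对学习型频率预测做阈值化,保证最坏情况性能有界。"""
    if threshold is None:
        threshold = 1.0 / n_items      # 假设的默认阈值:均匀频率
    # 低于阈值的预测被抬升至阈值,使对抗性低估的影响受控
    return max(predicted_freq, threshold)

# 用法示意:把阈值化后的频率作为树/跳表等结构的优先级输入
priority = thresholded_priority(predicted_freq=1e-6, n_items=1000)
print(priority)
```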

[AI-119] Partial Identification Approach to Counterfactual Fairness Assessment

【速读】:该论文旨在解决在缺乏算法内部机制信息的情况下,如何从可观测数据中评估反事实公平性(counterfactual fairness)这一挑战。由于在许多实际场景中,目标反事实公平性度量不可识别(non-identifiable),即无法仅凭观测数据和已有知识唯一确定,论文提出采用部分识别(partial identification)方法,通过构建可验证的边界来估计反事实公平性度量。其解决方案的关键在于引入贝叶斯框架,在高置信度下对未知的反事实公平性度量进行边界约束,并基于COMPAS数据集验证了该方法的有效性,揭示了种族和年龄等敏感属性对再犯风险评分的因果效应。

链接: https://arxiv.org/abs/2510.00163
作者: Saeyoung Rho,Junzhe Zhang,Elias Bareinboim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:The wide adoption of AI decision-making systems in critical domains such as criminal justice, loan approval, and hiring processes has heightened concerns about algorithmic fairness. As we often only have access to the output of algorithms without insights into their internal mechanisms, it was natural to examine how decisions would alter when auxiliary sensitive attributes (such as race) change. This led the research community to come up with counterfactual fairness measures, but how to evaluate the measure from available data remains a challenging task. In many practical applications, the target counterfactual measure is not identifiable, i.e., it cannot be uniquely determined from the combination of quantitative data and qualitative knowledge. This paper addresses this challenge using partial identification, which derives informative bounds over counterfactual fairness measures from observational data. We introduce a Bayesian approach to bound unknown counterfactual fairness measures with high confidence. We demonstrate our algorithm on the COMPAS dataset, examining fairness in recidivism risk scores with respect to race, age, and sex. Our results reveal a positive (spurious) effect on the COMPAS score when changing race to African-American (from all others) and a negative (direct causal) effect when transitioning from young to old age.
zh
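
部分识别的核心直觉可用经典的 Fréchet 边界演示:即使两个干预分布各自可识别,它们的联合反事实概率也只能被界定在一个区间内。下面的 Python 片段展示的是论文方法所依托的这一一般思想,而非其贝叶斯算法本身;变量含义见注释,均为示意性假设。

```python
def frechet_bounds(p_a1: float, p_a0: float) -> tuple:
    """示意:P(Y_{A=1}=1, Y_{A=0}=1) 的 Fréchet 上下界。
    p_a1 = P(Y=1 | do(A=1)), p_a0 = P(Y=1 | do(A=0)),假设可由数据识别。
    """
    lower = max(0.0, p_a1 + p_a0 - 1.0)
    upper = min(p_a1, p_a0)
    return lower, upper

# 数据只能把该反事实量界定在 [0.2, 0.5] 区间内,无法点识别
print(frechet_bounds(0.7, 0.5))
```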

[AI-120] AuditAgent: Expert-Guided Multi-Agent Reasoning for Cross-Document Fraudulent Evidence Discovery

【速读】:该论文旨在解决真实世界中财务欺诈检测面临的挑战,即证据分散且隐蔽,难以在复杂的多年度财务披露中进行细粒度定位。其解决方案的关键在于提出了一种增强审计领域专业知识的多智能体推理框架AuditAgent,通过引入主体级风险先验、混合检索策略以及专用代理模块,实现跨报告证据的有效识别与聚合,从而显著提升召回率和可解释性,为自动化、透明化的财务审计提供了新基准。

链接: https://arxiv.org/abs/2510.00156
作者: Songran Bai,Bingzhe Wu,Yiwei Zhang,Chengke Wu,Xiaolong Zheng,Yaze Yuan,Ke Wu,Jianqiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Financial fraud detection in real-world scenarios presents significant challenges due to the subtlety and dispersion of evidence across complex, multi-year financial disclosures. In this work, we introduce a novel multi-agent reasoning framework AuditAgent, enhanced with auditing domain expertise, for fine-grained evidence chain localization in financial fraud cases. Leveraging an expert-annotated dataset constructed from enforcement documents and financial reports released by the China Securities Regulatory Commission, our approach integrates subject-level risk priors, a hybrid retrieval strategy, and specialized agent modules to efficiently identify and aggregate cross-report evidence. Extensive experiments demonstrate that our method substantially outperforms General-Purpose Agent paradigm in both recall and interpretability, establishing a new benchmark for automated, transparent financial forensics. Our results highlight the value of domain-specific reasoning and dataset construction for advancing robust financial fraud detection in practical, real-world regulatory applications.
zh

[AI-121] RoboPilot: Generalizable Dynamic Robotic Manipulation with Dual-thinking Modes

【速读】:该论文旨在解决当前自主机器人在执行复杂或长时程任务时面临的鲁棒性不足问题,尤其是在开放环路(open-loop)控制下缺乏推理能力和反馈机制导致的环境变化适应差与误差累积严重的问题。解决方案的关键在于提出RoboPilot框架,其核心是双思考模式(dual-thinking)的闭环结构:一方面利用基础动作(primitive actions)进行结构化任务规划和灵活动作生成,另一方面引入反馈机制以实现对动态变化和执行错误的重规划;同时结合思维链(Chain-of-Thought)推理提升高层任务规划能力并指导底层动作生成,系统还能根据需求动态切换快速思考与慢速思考模式,在效率与准确性之间取得平衡。

链接: https://arxiv.org/abs/2510.00154
作者: Xinyi Liu,Mohammadreza Fani Sani,Zewei Zhou,Julius Wirbel,Bahram Zarrin,Roberto Galeazzi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite rapid progress in autonomous robotics, executing complex or long-horizon tasks remains a fundamental challenge. Most current approaches follow an open-loop paradigm with limited reasoning and no feedback, resulting in poor robustness to environmental changes and severe error accumulation. We present RoboPilot, a dual-thinking closed-loop framework for robotic manipulation that supports adaptive reasoning for complex tasks in real-world dynamic environments. RoboPilot leverages primitive actions for structured task planning and flexible action generation, while introducing feedback to enable replanning from dynamic changes and execution errors. Chain-of-Thought reasoning further enhances high-level task planning and guides low-level action generation. The system dynamically switches between fast and slow thinking to balance efficiency and accuracy. To systematically evaluate the robustness of RoboPilot in diverse robot manipulation scenarios, we introduce RoboPilot-Bench, a benchmark spanning 21 tasks across 10 categories, including infeasible-task recognition and failure recovery. Experiments show that RoboPilot outperforms state-of-the-art baselines by 25.9% in task success rate, and the real-world deployment on an industrial robot further demonstrates its robustness in real-world settings.
zh

[AI-122] Stealing AI Model Weights Through Covert Communication Channels

【速读】:该论文旨在解决人工智能(AI)模型在无线设备中因硬件加速器存在而面临的模型窃取攻击问题,此类攻击可能通过隐蔽的硬件后门(Hardware Trojan, HT)泄露模型权重,从而造成知识产权损失。解决方案的关键在于设计一种两阶段攻击:第一阶段,在受害者设备中植入具有隐蔽通信信道的HT,以在不被察觉的情况下泄露模型权重;第二阶段,攻击者利用邻近无线设备截获正常运行时的数据帧,并逐步重建完整的权重矩阵。该方法对AI模型架构和硬件加速器类型均具有无关性,且通过实验证明了其有效性与隐蔽性。

链接: https://arxiv.org/abs/2510.00151
作者: Valentin Barbaza,Alan Rodrigo Diaz-Rizo,Hassan Aboushady,Spyridon Raptis,Haralampos-G. Stratigopoulos
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI models are often regarded as valuable intellectual property due to the high cost of their development, the competitive advantage they provide, and the proprietary techniques involved in their creation. As a result, AI model stealing attacks pose a serious concern for AI model providers. In this work, we present a novel attack targeting wireless devices equipped with AI hardware accelerators. The attack unfolds in two phases. In the first phase, the victim’s device is compromised with a hardware Trojan (HT) designed to covertly leak model weights through a hidden communication channel, without the victim realizing it. In the second phase, the adversary uses a nearby wireless device to intercept the victim’s transmission frames during normal operation and incrementally reconstruct the complete weight matrix. The proposed attack is agnostic to both the AI model architecture and the hardware accelerator used. We validate our approach through a hardware-based demonstration involving four diverse AI models of varying types and sizes. We detail the design of the HT and the covert channel, highlighting their stealthy nature. Additionally, we analyze the impact of bit error rates on the reception and propose an error mitigation technique. The effectiveness of the attack is evaluated based on the accuracy of the reconstructed models with stolen weights and the time required to extract them. Finally, we explore potential defense mechanisms.
zh

[AI-123] Which Rewards Matter? Reward Selection for Reinforcement Learning under Limited Feedback

【速读】:该论文旨在解决在奖励信号受限条件下,如何高效选择有限的奖励标签以最大化强化学习策略性能的问题(即奖励选择问题,Reward Selection for Reinforcement Learning from Limited Feedback, RLLF)。其核心解决方案在于识别并标注两类关键奖励样本:一是能引导智能体沿最优轨迹行进的奖励,二是能在智能体偏离最优行为后支持其恢复至近优策略的奖励。通过引入新的问题形式化框架,研究发现基于奖励自由信息(如状态访问频率和部分价值函数)的启发式策略与利用辅助评估反馈预训练的策略均能显著提升奖励标注效率,在远少于全监督所需标签的情况下实现接近最优的策略性能,从而确立了奖励选择作为反馈受限场景下扩展强化学习的有效范式。

链接: https://arxiv.org/abs/2510.00144
作者: Shreyas Chaudhari,Renhao Zhang,Philip S. Thomas,Bruno Castro da Silva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability of reinforcement learning algorithms to learn effective policies is determined by the rewards available during training. However, for practical problems, obtaining large quantities of reward labels is often infeasible due to computational or financial constraints, particularly when relying on human feedback. When reinforcement learning must proceed with limited feedback – only a fraction of samples get rewards labeled – a fundamental question arises: which samples should be labeled to maximize policy performance? We formalize this problem of reward selection for reinforcement learning from limited feedback (RLLF), introducing a new problem formulation that facilitates the study of strategies for selecting impactful rewards. Two types of selection strategies are investigated: (i) heuristics that rely on reward-free information such as state visitation and partial value functions, and (ii) strategies pre-trained using auxiliary evaluative feedback. We find that critical subsets of rewards are those that (1) guide the agent along optimal trajectories, and (2) support recovery toward near-optimal behavior after deviations. Effective selection methods yield near-optimal policies with significantly fewer reward labels than full supervision, establishing reward selection as a powerful paradigm for scaling reinforcement learning in feedback-limited settings.
zh
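
摘要提到的“基于状态访问频率的启发式”可以用如下 Python 草图表达:在给定标注预算内,优先为访问最频繁的状态-动作样本请求奖励标签。函数名与数据格式均为假设,仅示意这一类选择策略的骨架。

```python
from collections import Counter

def select_rewards_by_visitation(trajectories, budget_k: int):
    """示意:按状态访问频率挑选 budget_k 个样本送去标注奖励。
    trajectories: 形如 [[(state, action), ...], ...] 的轨迹列表(假设格式)。
    """
    visits = Counter()
    for traj in trajectories:
        for state, _ in traj:
            visits[state] += 1
    samples = [(s, a) for traj in trajectories for (s, a) in traj]
    # 启发式假设:访问越频繁的状态,其奖励标签对策略学习的影响越大
    samples.sort(key=lambda sa: visits[sa[0]], reverse=True)
    return samples[:budget_k]

labeled = select_rewards_by_visitation(
    [[("s0", "a1"), ("s1", "a0")], [("s0", "a0")]], budget_k=2)
print(labeled)   # s0 相关样本优先获得奖励标签
```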

[AI-124] Nonparametric Identification of Latent Concepts ICML2025

【速读】:该论文旨在解决概念学习(concept learning)中缺乏一般性理论支持的问题,尤其是在多类别观测数据下如何识别隐藏的概念结构。其解决方案的关键在于引入人类认知中的“比较机制”(comparison mechanism),通过分析跨类别的多样性来实现对潜在概念的可识别性(identifiability)证明,而无需假设特定的概念类型、函数关系或参数化生成模型。该方法不仅在全局条件下保证概念识别的正确性,还能基于局部比较提供部分概念的替代性保障,从而扩展理论在更灵活场景下的适用性,并能非参数化地恢复类别与概念之间的隐藏结构。

链接: https://arxiv.org/abs/2510.00136
作者: Yujia Zheng,Shaoan Xie,Kun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
备注: ICML 2025

点击查看摘要

Abstract:We are born with the ability to learn concepts by comparing diverse observations. This helps us to understand the new world in a compositional manner and facilitates extrapolation, as objects naturally consist of multiple concepts. In this work, we argue that the cognitive mechanism of comparison, fundamental to human learning, is also vital for machines to recover true concepts underlying the data. This offers correctness guarantees for the field of concept learning, which, despite its impressive empirical successes, still lacks general theoretical support. Specifically, we aim to develop a theoretical framework for the identifiability of concepts with multiple classes of observations. We show that with sufficient diversity across classes, hidden concepts can be identified without assuming specific concept types, functional relations, or parametric generative models. Interestingly, even when conditions are not globally satisfied, we can still provide alternative guarantees for as many concepts as possible based on local comparisons, thereby extending the applicability of our theory to more flexible scenarios. Moreover, the hidden structure between classes and concepts can also be identified nonparametrically. We validate our theoretical results in both synthetic and real-world settings.
zh

[AI-125] BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner

【速读】:该论文旨在解决当前通用大语言模型(Large Language Models, LLMs)在科学计算任务中缺乏领域专精性与多任务协同能力的问题,试图构建一个能够统一建模跨学科、跨尺度科学任务的语言引导型科学计算框架。其解决方案的关键在于提出BigBang-Proton架构,包含三大核心创新:理论-实验学习范式(Theory-Experiment Learning paradigm),将大规模数值实验数据与理论文本语料对齐;二进制块编码(Binary Patch Encoding)替代传统字节对编码(Byte Pair Encoding, BPE)进行更高效的token化;蒙特卡洛注意力机制(Monte Carlo Attention)取代标准Transformer结构以提升计算效率与泛化能力。通过在跨学科科学数据集上进行自回归预训练,并结合下游任务微调,该模型在多项科学任务中达到或超越专用模型性能,验证了语言引导科学计算的有效性与多任务潜力。

链接: https://arxiv.org/abs/2510.00129
作者: Hengkui Wu,Liujiang Liu,Jihua He,Qihao Wang,Keke Zhao,Shuyang Hu,Renle Fu,Dahao Liang,Lingyu Zeng,Bruce Liu,Yuan Liu,Jin Zhan,Jiaqiang Niu,Xinglong Jia,Yaqin Hu,Wenjun Ji,Panpan Chi,Ken Chen,Hengyuan Wu,Yingsi Xin,Yongfeng Zhu,Yuexin Wang,Manqi Ruan,Ningtao Bian,Xiaohua Wu,Weipeng Xu
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 93 pages, 39 figures

点击查看摘要

Abstract:We introduce BigBang-Proton, a unified sequence-based architecture for auto-regressive language modeling pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scientific multi-task learner. BigBang-Proton incorporates three fundamental innovations compared to mainstream general-purpose LLMs: Theory-Experiment Learning paradigm aligns large-scale numerical experimental data with theoretical text corpora; Binary Patch Encoding replaces byte pair encoding(BPE) tokenization; Monte Carlo Attention substitutes traditional transformer architectures. Through next-word-prediction pretraining on cross-discipline scientific datasets of real-world problems mixed with general textual corpus, followed by fine-tuning and inference on downstream tasks, BigBang-Proton demonstrates 100% accuracy in up to 50-digit arithmetic addition operations, performance on par with leading specialized models in particle physics jet tagging, matching MAE of specialized models in inter-atomic potential simulation, performance comparable to traditional spatiotemporal models in water quality prediction, and benchmark-exceeding performance in genome modeling. These results prove that language-guided scientific computing can match or exceed the performance of task-specific scientific models while maintaining multitask learning capabilities. We further hypothesize to scale the pretraining to the universe scale as a fundamental step toward developing material world foundational model.
zh

[AI-126] Simulating Student Success in the Age of GenAI: A Kantian-Axiomatic Perspective

【速读】:该论文旨在解决如何从形式逻辑角度重新诠释生成式 AI (Generative AI) 使用感知的蒙特卡洛模拟结果,特别是揭示有限、离散数据与理想连续有序结构之间的本质差异。其解决方案的关键在于引入康德-公理化视角(Kantian-axiomatic lens),以稠密线性序无端点(DLO)公理体系对模拟数据进行检验:发现基本序关系(如非自反性、传递性和完全可比性)得以满足,但无端点性(A4–A5)和稠密性(A6)因 Likert 缩放和有限采样而失败;这种“失败”并非方法缺陷,而是反映了经验观测的先验边界——即有限量化数据无法实现理想连续结构,后者仅存在于建构直觉中。

链接: https://arxiv.org/abs/2510.00091
作者: Seyma Yaman Kayadibi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 23 pages in total, including 3 embedded Python code blocks, 4 figures, and 2 tables. The article analyzes student perception data simulated from survey-derived Likert statistics, evaluated against six axioms of Dense Linear Order (DLO). Preliminary version published on Zenodo; see External DOI

点击查看摘要

Abstract:This study reinterprets a Monte Carlo simulation of students’ perceived success with generative AI (GenAI) through a Kantian-axiomatic lens. Building on prior work, theme-level survey statistics (Ease of Use and Learnability, System Efficiency and Learning Burden, and Perceived Complexity and Integration) from a representative dataset are used to generate 10,000 synthetic scores per theme on the [1,5] Likert scale. The simulated outputs are evaluated against the axioms of dense linear order without endpoints (DLO): irreflexivity, transitivity, total comparability (connectedness), no endpoints (no greatest and no least; A4-A5), and density (A6). At the data level, the basic ordering axioms (A1-A3) are satisfied, whereas no-endpoints (A4-A5) and density (A6) fail as expected. Likert clipping introduces minimum and maximum observed values, and a finite, discretized sample need not contain a value strictly between any two distinct scores. These patterns are read not as methodological defects but as markers of an epistemological boundary. Following Kant and Friedman, the findings suggest that what simulations capture (finite, quantized observations) cannot instantiate the ideal properties of an unbounded, dense continuum. Such properties belong to constructive intuition rather than to finite sampling alone. A complementary visualization contrasts the empirical histogram with a sine-curve proxy to clarify this divide. The contribution is interpretive rather than data-expansive: it reframes an existing simulation as a probe of the synthetic a priori structure underlying students’ perceptions, showing how formal order-theoretic coherence coexists with principled failures of endpoint-freeness and density in finite empirical models.
zh
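
摘要中“基本序公理成立、无端点与稠密性失败”的结论很容易在模拟数据上复现。下面的 NumPy 草图按文中思路生成截断到 [1,5] 的模拟分数,并检验 A4-A6:有限样本必然存在最小/最大观测值,且任意两个相邻取值之间不存在第三个样本点。分布参数(均值 3.5、标准差 0.8)为示意性假设。

```python
import numpy as np

rng = np.random.default_rng(42)
scores = np.clip(rng.normal(3.5, 0.8, 10_000), 1.0, 5.0)  # 模拟 Likert 分数

vals = np.unique(scores)
# A4-A5(无端点):有限样本总有最小值与最大值,公理失败
print(f"min={vals[0]:.3f}, max={vals[-1]:.3f}  -> A4-A5 失败")

# A6(稠密性):取两个相邻的观测值,检查其间是否存在样本点
a, b = vals[0], vals[1]
between = scores[(scores > a) & (scores < b)]
print(f"({a:.3f}, {b:.3f}) 之间的样本数 = {len(between)}  -> A6 失败")
```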

[AI-127] Judging by Appearances? Auditing and Intervening Vision-Language Models for Bail Prediction

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在保释决策预测任务中表现不佳且存在偏见的问题,尤其关注其对不同交叉群体的不公平判决,例如错误地以高置信度拒绝给予应获保释者。解决方案的关键在于引入基于检索增强生成(Retrieval-Augmented Generation, RAG)的法律判例知识注入机制,并结合创新的微调策略对VLM进行干预,从而显著提升模型在保释预测中的准确性和公平性。这一方法为未来在真实法律场景中部署VLM提供了更可靠的智能干预路径。

链接: https://arxiv.org/abs/2510.00088
作者: Sagnik Basu,Shubham Prakash,Ashish Maruti Barge,Siddharth D Jaiswal,Abhisek Dash,Saptarshi Ghosh,Animesh Mukherjee
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been extensively used for legal judgment prediction tasks based on case reports and crime history. However, with a surge in the availability of large vision language models (VLMs), legal judgment prediction systems can now be made to leverage the images of the criminals in addition to the textual case reports/crime history. Applications built in this way could lead to inadvertent consequences and be used with malicious intent. In this work, we run an audit to investigate the efficiency of standalone VLMs in the bail decision prediction task. We observe that the performance is poor across multiple intersectional groups and models wrongly deny bail to deserving individuals with very high confidence. We design different intervention algorithms by first including legal precedents through a RAG pipeline and then fine-tuning the VLMs using innovative schemes. We demonstrate that these interventions substantially improve the performance of bail prediction. Our work paves the way for the design of smarter interventions on VLMs in the future, before they can be deployed for real-world legal judgment prediction.
zh

[AI-128] Towards a Framework for Supporting the Ethical and Regulatory Certification of AI Systems

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在欧洲社会与经济领域快速普及过程中所引发的伦理、法律和监管挑战,特别是如何实现AI系统的合规性、透明性和可问责性。其解决方案的关键在于构建一个综合框架,通过三个核心组件实现:(i) 语义机器学习运维(semantic Machine Learning Operations, MLOps),用于结构化管理AI生命周期;(ii) 基于本体的数据血缘追踪(ontology-driven data lineage tracking),确保数据来源与处理过程的可追溯性与责任归属;(iii) 监管运维(regulatory operations, RegOps)工作流,将法规要求转化为可执行的操作流程。该框架已在多个试点项目中实施与验证,以推动符合欧洲标准的负责任AI创新。

链接: https://arxiv.org/abs/2510.00084
作者: Fabian Kovac,Sebastian Neumaier,Timea Pahi,Torsten Priebe,Rafael Rodrigues,Dimitrios Christodoulou,Maxime Cordy,Sylvain Kubler,Ali Kordia,Georgios Pitsiladis,John Soldatos,Petros Zervoudakis
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
备注: Accepted for publication in the proceedings of the Workshop on AI Certification, Fairness and Regulations, co-located with the Austrian Symposium on AI and Vision (AIRoV 2025)

点击查看摘要

Abstract:Artificial Intelligence has rapidly become a cornerstone technology, significantly influencing Europe’s societal and economic landscapes. However, the proliferation of AI also raises critical ethical, legal, and regulatory challenges. The CERTAIN (Certification for Ethical and Regulatory Transparency in Artificial Intelligence) project addresses these issues by developing a comprehensive framework that integrates regulatory compliance, ethical standards, and transparency into AI systems. In this position paper, we outline the methodological steps for building the core components of this framework. Specifically, we present: (i) semantic Machine Learning Operations (MLOps) for structured AI lifecycle management, (ii) ontology-driven data lineage tracking to ensure traceability and accountability, and (iii) regulatory operations (RegOps) workflows to operationalize compliance requirements. By implementing and validating its solutions across diverse pilots, CERTAIN aims to advance regulatory compliance and to promote responsible AI innovation aligned with European standards.
zh

[AI-129] SoREX: Towards Self-Explainable Social Recommendation with Relevant Ego-Path Extraction

【速读】:该论文旨在解决当前基于图神经网络(Graph Neural Networks, GNNs)的社会推荐算法在提升预测准确性的同时,缺乏可解释性的问题。其解决方案的关键在于提出了一种自解释的GNN框架SoREX,该框架采用双塔结构并引入好友推荐机制,独立建模社交关系与用户-物品交互,同时联合优化辅助任务以强化社交信号;此外,通过新颖的中心路径(ego-path)提取方法,将目标用户的中心网络(ego-net)转化为多跳中心路径集合,并从中提取特定因子和候选感知的子集作为解释,结合解释重聚合机制明确关联解释与下游预测,从而实现内在的自解释能力。

链接: https://arxiv.org/abs/2510.00080
作者: Hanze Guo,Yijun Ma,Xiao Zhou
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 27 pages, 10 figures

点击查看摘要

Abstract:Social recommendation has been proven effective in addressing data sparsity in user-item interaction modeling by leveraging social networks. The recent integration of Graph Neural Networks (GNNs) has further enhanced prediction accuracy in contemporary social recommendation algorithms. However, many GNN-based approaches in social recommendation lack the ability to furnish meaningful explanations for their predictions. In this study, we confront this challenge by introducing SoREX, a self-explanatory GNN-based social recommendation framework. SoREX adopts a two-tower framework enhanced by friend recommendation, independently modeling social relations and user-item interactions, while jointly optimizing an auxiliary task to reinforce social signals. To offer explanations, we propose a novel ego-path extraction approach. This method involves transforming the ego-net of a target user into a collection of multi-hop ego-paths, from which we extract factor-specific and candidate-aware ego-path subsets as explanations. This process facilitates the summarization of detailed comparative explanations among different candidate items through intricate substructure analysis. Furthermore, we conduct explanation re-aggregation to explicitly correlate explanations with downstream predictions, imbuing our framework with inherent self-explainability. Comprehensive experiments conducted on four widely adopted benchmark datasets validate the effectiveness of SoREX in predictive accuracy. Additionally, qualitative and quantitative analyses confirm the efficacy of the extracted explanations in SoREX. Our code and data are available at this https URL.
zh
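
摘要中的 ego-path 提取可以用 networkx 给出一个可运行的最小示意:先取目标用户的多跳 ego-net,再用深度优先搜索枚举从该用户出发的简单路径。max_hops 取值以及后续的因子特定/候选感知筛选在此省略或简化,均为假设性写法。

```python
import networkx as nx

def extract_ego_paths(graph: nx.Graph, user, max_hops: int = 2):
    """示意:将 user 的 ego-net 展开为多跳 ego-path 集合。"""
    ego_net = nx.ego_graph(graph, user, radius=max_hops)
    paths = []

    def dfs(node, path):
        if 1 <= len(path) - 1 <= max_hops:
            paths.append(tuple(path))
        if len(path) - 1 == max_hops:
            return
        for nbr in ego_net.neighbors(node):
            if nbr not in path:          # 只保留简单路径,避免回环
                dfs(nbr, path + [nbr])

    dfs(user, [user])
    return paths

G = nx.karate_club_graph()
print(extract_ego_paths(G, 0, max_hops=2)[:5])
```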

[AI-130] Adaptive and Resource-efficient Agentic AI Systems for Mobile and Embedded Devices: A Survey

【速读】:该论文旨在解决基础模型(Foundation Models, FMs)与智能体(AI Agents)融合背景下,资源受限环境(如移动设备和边缘计算场景)中实现高效、自适应的智能体系统所面临的挑战。核心问题在于:随着FMs复杂度不断提升,其在真实应用场景中对长期适应性、实时交互能力的要求日益增长,但部署环境却受限于内存、能耗、带宽和延迟等资源瓶颈,导致性能与效率之间存在根本性矛盾。解决方案的关键在于构建“弹性推理(elastic inference)”、“测试时适应(test-time adaptation)”、“动态多模态融合(dynamic multimodal integration)”以及“智能体应用(agentic AI applications)”四大技术体系,通过算法-系统协同设计,实现认知适应与边缘协作部署,从而在保障准确性的同时优化延迟与通信开销,提升系统在分布偏移下的鲁棒性。

链接: https://arxiv.org/abs/2510.00078
作者: Sicong Liu,Weiye Wu,Xiangrui Xu,Teng Li,Bowen Pang,Bin Guo,Zhiwen Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Foundation models have reshaped AI by unifying fragmented architectures into scalable backbones with multimodal reasoning and contextual adaptation. In parallel, the long-standing notion of AI agents, defined by the sensing-decision-action loop, is entering a new paradigm: with FMs as their cognitive core, agents transcend rule-based behaviors to achieve autonomy, generalization, and self-reflection. This dual shift is reinforced by real-world demands such as autonomous driving, robotics, virtual assistants, and GUI agents, as well as ecosystem advances in embedded hardware, edge computing, mobile deployment platforms, and communication protocols that together enable large-scale deployment. Yet this convergence collides with reality: while applications demand long-term adaptability and real-time interaction, mobile and edge deployments remain constrained by memory, energy, bandwidth, and latency. This creates a fundamental tension between the growing complexity of FMs and the limited resources of deployment environments. This survey provides the first systematic characterization of adaptive, resource-efficient agentic AI systems. We summarize enabling techniques into elastic inference, test-time adaptation, dynamic multimodal integration, and agentic AI applications, and identify open challenges in balancing accuracy-latency-communication trade-offs and sustaining robustness under distribution shifts. We further highlight future opportunities in algorithm-system co-design, cognitive adaptation, and collaborative edge deployment. By mapping FM structures, cognition, and hardware resources, this work establishes a unified perspective toward scalable, adaptive, and resource-efficient agentic AI. We believe this survey can help readers to understand the connections between enabling technologies while promoting further discussions on the fusion of agentic intelligence and intelligent agents.
zh

[AI-131] NeurIPS should lead scientific consensus on AI policy NEURIPS2025

【速读】:该论文试图解决的问题是:当前在人工智能(Artificial Intelligence, AI)政策制定过程中缺乏有效的科学共识形成机制,尽管已有证据生成和合成的机制,但尚无系统性方法来凝聚学界对AI政策的广泛共识。解决方案的关键在于:由NeurIPS(神经信息处理系统大会)主动推动科学共识的形成,利用其在AI领域的学术领导力和影响力,借鉴政府间气候变化专门委员会(Intergovernmental Panel on Climate Change, IPCC)在气候政策共识构建中的成功经验,开展试点项目以建立可复制的共识形成范式,从而提升AI政策的质量与可信度。

链接: https://arxiv.org/abs/2510.00075
作者: Rishi Bommasani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published at NeurIPS 2025

点击查看摘要

Abstract:Designing wise AI policy is a grand challenge for society. To design such policy, policymakers should place a premium on rigorous evidence and scientific consensus. While several mechanisms exist for evidence generation, and nascent mechanisms tackle evidence synthesis, we identify a complete void on consensus formation. In this position paper, we argue NeurIPS should actively catalyze scientific consensus on AI policy. Beyond identifying the current deficit in consensus formation mechanisms, we argue that NeurIPS is the best option due to its strengths and the paucity of compelling alternatives. To make progress, we recommend initial pilots for NeurIPS by distilling lessons from the IPCC’s leadership to build scientific consensus on climate policy. We dispel predictable counters that AI researchers disagree too much to achieve consensus and that policy engagement is not the business of NeurIPS. NeurIPS leads AI on many fronts, and it should champion scientific consensus to create higher quality AI policy.
zh

[AI-132] AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents ICTAI

【速读】:该论文旨在解决药物代谢动力学(Pharmacokinetics, PK)数据在复杂、异构表格中难以自动化提取与标准化的问题,这些问题限制了其在人用和兽用药物开发及监管决策中的高效应用。解决方案的关键在于提出一个两阶段框架AutoPK:第一阶段利用大语言模型(Large Language Models, LLMs)、混合相似性度量和LLM验证机制识别并提取PK参数变体;第二阶段通过行过滤、键值文本转换及LLM重构实现结构化表格的标准化输出。该方法显著提升了精度与召回率,在多个PK参数上优于直接使用LLMs的基线,并使开源模型如Gemma 3-27B在性能上超越商业系统如GPT-4o Mini,展现出良好的可扩展性和跨参数泛化能力。

链接: https://arxiv.org/abs/2510.00039
作者: Hossein Sholehrasa,Amirhossein Ghanaatian,Doina Caragea,Lisa A. Tell,Jim E. Riviere,Majid Jaberi-Douraki
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at the 2025 IEEE 37th ICTAI

点击查看摘要

Abstract:Pharmacokinetics (PK) plays a critical role in drug development and regulatory decision-making for human and veterinary medicine, directly affecting public health through drug safety and efficacy assessments. However, PK data are often embedded in complex, heterogeneous tables with variable structures and inconsistent terminologies, posing significant challenges for automated PK data retrieval and standardization. We introduce AutoPK, a novel two-stage framework for accurate and scalable extraction of PK data from complex scientific tables. In the first stage, AutoPK identifies and extracts PK parameter variants using large language models (LLMs), a hybrid similarity metric, and LLM-based validation. The second stage filters relevant rows, converts the table into a key-value text format, and uses an LLM to reconstruct a standardized table. Evaluated on a real-world dataset of 605 PK tables, including captions and footnotes, AutoPK shows significant improvements in precision and recall over direct LLM baselines. For instance, AutoPK with LLaMA 3.1-70B achieved an F1-score of 0.92 on half-life and 0.91 on clearance parameters, outperforming direct use of LLaMA 3.1-70B by margins of 0.10 and 0.21, respectively. Smaller models such as Gemma 3-27B and Phi 3-12B with AutoPK achieved 2-7 fold F1 gains over their direct use, with Gemma’s hallucination rates reduced from 60-95% down to 8-14%. Notably, AutoPK enabled open-source models like Gemma 3-27B to outperform commercial systems such as GPT-4o Mini on several PK parameters. AutoPK enables scalable and high-confidence PK data extraction, making it well-suited for critical applications in veterinary pharmacology, drug safety monitoring, and public health decision-making, while addressing heterogeneous table structures and terminology and demonstrating generalizability across key PK parameters. Code and data: this https URL
zh
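
摘要所称“混合相似性度量”的一个最小示意如下:将字符级相似度与词元集合的 Jaccard 重叠按权重 alpha 组合,用于匹配 PK 参数的不同书写变体。权重与分词方式均为假设,论文实际使用的度量细节请以原文为准。

```python
from difflib import SequenceMatcher

def hybrid_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """示意:字符级相似度与 Jaccard 词元重叠的加权混合。"""
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    ta, tb = set(a.lower().split()), set(b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0
    return alpha * char_sim + (1 - alpha) * jaccard

# 匹配 PK 参数变体,例如半衰期的不同写法
print(hybrid_similarity("elimination half-life", "half-life (t1/2)"))
```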

[AI-133] DexBench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management

【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)在糖尿病患者日常管理场景中缺乏针对性评估工具的问题。现有健康类基准多聚焦于临床任务(如诊断或分诊),或面向医护人员,难以反映患者在血糖管理、代谢健康等实际情境下对AI辅助决策的需求。其解决方案的关键在于提出DexBench——首个专为评估LLM在糖尿病患者日常决策任务中表现而设计的基准,涵盖7类真实世界问题(如基础血糖解读、行为关联分析与长期规划),基于15,000名个体连续葡萄糖监测(Continuous Glucose Monitoring, CGM)及行为日志数据生成36万条个性化、上下文相关的问答样本,并通过准确性、可解释性、安全性、清晰度和可操作性五个维度进行系统评估。此框架推动了面向患者端AI应用的可靠性、安全性和实用性发展。

链接: https://arxiv.org/abs/2510.00038
作者: Maria Ana Cardei,Josephine Lamp,Mark Derdzinski,Karan Bhatia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:We present DexBench, the first benchmark designed to evaluate large language model (LLM) performance across real-world decision-making tasks faced by individuals managing diabetes in their daily lives. Unlike prior health benchmarks that are either generic, clinician-facing or focused on clinical tasks (e.g., diagnosis, triage), DexBench introduces a comprehensive evaluation framework tailored to the unique challenges of prototyping patient-facing AI solutions in diabetes, glucose management, metabolic health and related domains. Our benchmark encompasses 7 distinct task categories, reflecting the breadth of real-world questions individuals with diabetes ask, including basic glucose interpretation, educational queries, behavioral associations, advanced decision making and long term planning. Towards this end, we compile a rich dataset comprising one month of time-series data encompassing glucose traces and metrics from continuous glucose monitors (CGMs) and behavioral logs (e.g., eating and activity patterns) from 15,000 individuals across three different diabetes populations (type 1, type 2, pre-diabetes/general health and wellness). Using this data, we generate a total of 360,600 personalized, contextual questions across the 7 tasks. We evaluate model performance on these tasks across 5 metrics: accuracy, groundedness, safety, clarity and actionability. Our analysis of 8 recent LLMs reveals substantial variability across tasks and metrics; no single model consistently outperforms others across all dimensions. By establishing this benchmark, we aim to advance the reliability, safety, effectiveness and practical utility of AI solutions in diabetes care.
zh

[AI-134] VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs

【速读】:该论文旨在解决高性能计算(High Performance Computing, HPC)程序自动调优中代码生成质量低、调试效率差的问题。其解决方案的关键在于构建一个基于多智能体大语言模型(multi-agent Large Language Models, LLMs)的自动调优系统 VibeCodeHPC,通过角色分工(包括项目管理、系统工程、编程和持续交付四个角色)与迭代提示优化机制实现协同编码,并引入动态智能体部署与活动监控功能以提升问题识别效率和协作有效性。

链接: https://arxiv.org/abs/2510.00031
作者: Shun-ichiro Hayashi,Koki Morita,Daichi Mukunoki,Tetsuya Hoshino,Takahiro Katagiri
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:We propose VibeCodeHPC, an automatic tuning system for HPC programs based on multi-agent LLMs for code generation. VibeCodeHPC tunes programs through multi-agent role allocation and iterative prompt refinement. We describe the system configuration with four roles: Project Manager (PM), System Engineer (SE), Programmer (PG), and Continuous Delivery (CD). We introduce dynamic agent deployment and activity monitoring functions to facilitate effective multi-agent collaboration. In our case study, we convert and optimize CPU-based matrix-matrix multiplication code written in C to GPU code using CUDA. The multi-agent configuration of VibeCodeHPC achieved higher-quality code generation per unit time compared to a solo-agent configuration. Additionally, the dynamic agent deployment and activity monitoring capabilities facilitated more effective identification of requirement violations and other issues.
zh

[AI-135] mporal-Aware Iterative Speech Model for Dementia Detection

【速读】:该论文旨在解决深度学习模型在处理长序列语音数据时面临的计算复杂度高及难以捕捉认知衰退过程中语音动态演变特征的问题。传统用于痴呆自动检测的语音方法多依赖静态、与时间无关的特征或聚合的语言内容,忽略了语音生产中细微且渐进性的变化,从而遗漏了早期认知功能下降的关键动态模式。解决方案的核心在于提出一种名为TAI-Speech(Temporal Aware Iterative framework)的新框架,其关键创新包括:1)受光流启发的迭代精化机制,通过卷积门控循环单元(convolutional GRU)将频谱图视为连续帧,捕获声学特征的细粒度帧间演化;2)基于交叉注意力的韵律对齐机制,动态对齐频谱特征与韵律模式(如音高和停顿),构建更丰富的语音产出缺陷表征,从而提升对日常生活活动能力(IADL)相关认知标志物的识别能力。该方法直接建模原始音频的时序动态特性,无需依赖自动语音识别(ASR),在DementiaBank数据集上实现了AUC 0.839和准确率80.6%,显著优于文本基线方法。

链接: https://arxiv.org/abs/2510.00030
作者: Chukwuemeka Ugwu,Oluwafemi Oyeleke
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Deep learning systems often struggle with processing long sequences, where computational complexity can become a bottleneck. Current methods for automated dementia detection using speech frequently rely on static, time-agnostic features or aggregated linguistic content, lacking the flexibility to model the subtle, progressive deterioration inherent in speech production. These approaches often miss the dynamic temporal patterns that are critical early indicators of cognitive decline. In this paper, we introduce TAI-Speech, a Temporal Aware Iterative framework that dynamically models spontaneous speech for dementia detection. The flexibility of our method is demonstrated through two key innovations: 1) Optical Flow-inspired Iterative Refinement: By treating spectrograms as sequential frames, this component uses a convolutional GRU to capture the fine-grained, frame-to-frame evolution of acoustic features. 2) Cross-Attention Based Prosodic Alignment: This component dynamically aligns spectral features with prosodic patterns, such as pitch and pauses, to create a richer representation of speech production deficits linked to functional decline (IADL). TAI-Speech adaptively models the temporal evolution of each utterance, enhancing the detection of cognitive markers. Experimental results on the DementiaBank dataset show that TAI-Speech achieves a strong AUC of 0.839 and 80.6% accuracy, outperforming text-based baselines without relying on ASR. Our work provides a more flexible and robust solution for automated cognitive assessment, operating directly on the dynamics of raw audio.
zh

[AI-136] Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling

【速读】:该论文旨在解决将旋转位置编码(Rotary Position Embedding, RoPE)的位置插值(Position Interpolation, PI)与后训练量化(Post-Training Quantization, PTQ)结合时导致的模型精度下降问题。这种精度损失源于多个耦合效应,包括长距离上下文混叠(long-context aliasing)、动态范围膨胀(dynamic-range dilation)、轴对齐量化器与旋转RoPE对之间的各向异性(anisotropy),以及异常值偏移(outlier shifting)引发的位置依赖对数概率噪声。解决方案的关键在于提出Q-ROAR(Quantization, RoPE-interpolation, and Outlier Aware Rescaling),这是一种仅基于权重的、对插值敏感的稳定机制:通过将RoPE维度分组为少量频带,并在每个频带上轻量搜索Key和Query权重的缩放因子(支持对称变体以保持对数概率尺度),其优化过程由文中提出的两个诊断指标——插值压力(interpolation pressure)和尾部膨胀比(tail-inflation ratios)指导,并利用一个微小的长上下文开发数据集完成,无需模型微调、架构或内核修改,亦无额外部署开销。实验证明,Q-ROAR在保持短上下文性能和推理吞吐量的同时,使长上下文任务上的困惑度降低超过14%。

链接: https://arxiv.org/abs/2510.00028
作者: Ye Qiao,Haocheng Xu,Xiaofan Zhang,Sitao Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extending the context window support of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, enable longer input length support without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of the PI+PTQ approach and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Following the analysis results, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for Key and Query weights (with an optional symmetric variant to preserve logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning to the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces the model’s perplexity on long-context workloads by more than 14%, while preserving short-context performance, inference throughput, and compatibility with existing LLM system stacks.
zh
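
Q-ROAR 的“分频带权重缩放”可以用下面的 PyTorch 草图示意:把 RoPE 维度按频率分成若干频带,对 Key 权重的对应行乘以每频带缩放因子、对 Query 权重除以同一因子(对称变体,全精度下保持 Q·K 内积不变,但改变量化动态范围)。分带方式、张量形状与函数签名均为假设。

```python
import torch

def apply_band_scales(w_q: torch.Tensor, w_k: torch.Tensor,
                      band_scales, n_bands: int):
    """示意:对 Q/K 投影权重按 RoPE 频带施加对称缩放(假设性实现)。"""
    d = w_q.shape[0]                 # 假设第 0 维对应 RoPE 维度
    band_size = d // n_bands
    w_q, w_k = w_q.clone(), w_k.clone()
    for b, s in enumerate(band_scales):
        sl = slice(b * band_size, (b + 1) * band_size)
        w_k[sl, :] *= s              # K 放大
        w_q[sl, :] /= s              # Q 等比缩小:全精度下内积不变
    return w_q, w_k

w_q, w_k = torch.randn(64, 512), torch.randn(64, 512)
new_q, new_k = apply_band_scales(w_q, w_k,
                                 band_scales=[0.5, 1.0, 2.0, 4.0], n_bands=4)
```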

[AI-137] Learning Inter-Atomic Potentials without Explicit Equivariance

【速读】:该论文旨在解决当前机器学习势函数(MLIPs)在分子模拟中面临的可扩展性与灵活性不足的问题,特别是现有基于等变神经网络架构的模型因硬编码的旋转变换对称性诱导偏置(inductive bias),导致计算效率低、模型适应性差。其解决方案的关键在于提出TransIP:一种基于Transformer架构的新型势函数训练范式,通过优化嵌入空间中的表示来隐式学习SO(3)等变性,而非依赖显式的等变结构约束或数据增强策略,从而在保持对称性合规的同时显著提升性能与可扩展性。

链接: https://arxiv.org/abs/2510.00027
作者: Ahmed A. Elhag,Arun Raja,Alex Morehead,Samuel M. Blau,Garrett M. Morris,Michael M. Bronstein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
备注: 19 pages, 3 tables, 10 figures. Under review

点击查看摘要

Abstract:Accurate and scalable machine-learned inter-atomic potentials (MLIPs) are essential for molecular simulations ranging from drug discovery to new material design. Current state-of-the-art models enforce roto-translational symmetries through equivariant neural network architectures, a hard-wired inductive bias that can often lead to reduced flexibility, computational efficiency, and scalability. In this work, we introduce TransIP: Transformer-based Inter-Atomic Potentials, a novel training paradigm for interatomic potentials achieving symmetry compliance without explicit architectural constraints. Our approach guides a generic non-equivariant Transformer-based model to learn SO(3)-equivariance by optimizing its representations in the embedding space. Trained on the recent Open Molecules (OMol25) collection, a large and diverse molecular dataset built specifically for MLIPs and covering different types of molecules (including small organics, biomolecular fragments, and electrolyte-like species), TransIP attains comparable performance in machine-learning force fields versus state-of-the-art equivariant baselines. Further, compared to a data augmentation baseline, TransIP achieves 40% to 60% improvement in performance across varying OMol25 dataset sizes. More broadly, our work shows that learned equivariance can be a powerful and efficient alternative to equivariant or augmentation-based MLIP models.
zh
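
“在嵌入空间中学习等变性”可以用一个简单的正则项示意:对输入坐标施加随机旋转 R,要求“先旋转再编码”与“先编码再旋转”的输出一致。下面的 PyTorch 片段假设模型输出为逐原子的 3D 向量特征;这只是该训练思想的一个草图,并非 TransIP 的官方目标函数。

```python
import torch

def random_rotation(device=None) -> torch.Tensor:
    """示意:从 QR 分解采样一个 3x3 随机旋转矩阵(det = +1)。"""
    q, _ = torch.linalg.qr(torch.randn(3, 3, device=device))
    if torch.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def equivariance_loss(model, coords: torch.Tensor) -> torch.Tensor:
    """示意:惩罚 f(x R^T) 与 f(x) R^T 的差异,引导模型学到 SO(3) 等变性。"""
    r = random_rotation(coords.device)
    out_rot = model(coords @ r.T)    # 先旋转输入
    rot_out = model(coords) @ r.T    # 后旋转输出
    return ((out_rot - rot_out) ** 2).mean()
```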

[AI-138] EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

【速读】:该论文旨在解决复杂跨学科研究领域(如流行病建模)中自动化程度低、流程繁琐且依赖人工干预的问题。流行病建模涉及网络科学、动力系统、流行病学和随机模拟等多个领域,传统方法效率低下且难以规模化。解决方案的关键在于提出一种多智能体大语言模型(Large Language Model, LLM)框架 EpidemIQs,其核心创新是引入两类分工明确的代理:科学家代理(scientist agent)负责整体规划、协调与反思,并生成最终结果;任务专家代理(task-expert agent)专注于单一职责,作为工具服务于科学家代理。这种结构化协作机制显著提升了自动化水平,实现了从文献综述到报告撰写全流程的自主执行,在五种不同流行病情景下均表现出更高的完成成功率与计算效率,相较单代理LLM方案展现出更优性能。

链接: https://arxiv.org/abs/2510.00024
作者: Mohammad Hossein Samaei,Faryad Darabi Sahneh,Lee W. Cohnstaedt,Caterina Scoglio
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer new opportunities to automate complex interdisciplinary research domains. Epidemic modeling, characterized by its complexity and reliance on network science, dynamical systems, epidemiology, and stochastic simulations, represents a prime candidate for leveraging LLM-driven automation. We introduce \textbfEpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and analysis, and finally documentation of findings in a structured manuscript. We introduced two types of agents: a scientist agent for planning, coordination, reflection, and generation of final results, and a task-expert agent to focus exclusively on one specific duty serving as a tool to the scientist agent. The framework consistently generated complete reports in scientific article format. Specifically, using GPT 4.1 and GPT 4.1 mini as backbone LLMs for scientist and task-expert agents, respectively, the autonomous process completed with average total token usage 870K at a cost of about $1.57 per study, achieving a 100% completion success rate through our experiments. We evaluate EpidemIQs across different epidemic scenarios, measuring computational cost, completion success rate, and AI and human expert reviews of generated reports. We compare EpidemIQs to the single-agent LLM, which has the same system prompts and tools, iteratively planning, invoking tools, and revising outputs until task completion. The comparison shows consistently higher performance of the proposed framework across five different scenarios. EpidemIQs represents a step forward in accelerating scientific research by significantly reducing costs and turnaround time of discovery processes, and enhancing accessibility to advanced modeling tools.
zh

[AI-139] ToolBrain: A Flexible Reinforcement Learning Framework for Agentic Tools

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在工具使用(tool use)过程中存在的三大挑战:人工设计奖励机制导致训练效率低、训练数据有限以及多工具选择能力差,从而引发适应速度慢、计算资源浪费和性能不佳等问题。其解决方案的核心在于提出一个轻量且易用的框架 ToolBrain,该框架通过灵活的强化学习(Reinforcement Learning, RL)策略(如 GRPO 和 DPO)与监督学习相结合,支持自定义奖励函数或基于 LLM-as-a-judge 的自动化奖励生成,并集成知识蒸馏、自动任务生成、无缝工具检索、QLoRA 高效微调及量化推理等关键技术,显著提升了工具使用技能的训练效率与效果(实测提升最高达 30.0%),同时保持代码结构简洁可扩展,便于研究人员和实践者快速适配特定领域。

链接: https://arxiv.org/abs/2510.00023
作者: Quy Minh Le,Minh Sao Khue Luu,Khanh-Tung Tran,Duc-Hai Nguyen,Hoang-Quoc-Viet Pham,Quan Le,Hoang Thanh Lam,Hoang D. Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective tool use is essential for agentic AI, yet training agents to utilize tools remains challenging due to manually designed rewards, limited training data, and poor multi-tool selection, resulting in slow adaptation, wasted computational resources, and suboptimal performance. We introduce ToolBrain, a lightweight and user-friendly framework for coaching tool use in agentic models with flexible reinforcement learning (RL), easing the barriers for researchers and practitioners to adapt LLM-based agents to specific domains. It supports a wide range of training strategies, including RL algorithms such as GRPO and DPO, as well as supervised learning. ToolBrain enables custom reward callables directly on an agent’s execution traces or simply utilizes an automated LLM-as-a-judge system for reward generation. It is packed with useful capabilities, including knowledge distillation from large to small models for efficient development, automatic task generation from tool descriptions, seamless tool retrieval, efficient fine-tuning pipelines with QLoRA through Unsloth, and quantized inference via bitsandbytes. We demonstrate ToolBrain through diverse use cases, such as training a CodeAct agent to autonomously execute email search tasks, showing fast, targeted improvements (up to 30.0%) in tool-use skills while keeping the codebase simple and extensible in Agentic AI. Our framework is publicly available at this https URL.
zh

[AI-140] Learning to Lead Themselves: Agentic AI in MAS using MARL

【速读】:该论文旨在解决多智能体系统中自主代理在无显式通信情况下实现去中心化协作决策的问题,特别是在无人机配送任务中的任务分配与协调优化。其解决方案的关键在于采用合作式多智能体强化学习框架,并基于集中训练、分散执行(centralized-training, decentralized-execution)范式,设计了一种轻量级的多智能体近端策略优化方法(IPPO),在PettingZoo环境中实现了同质无人机代理的自组织目标覆盖。

链接: https://arxiv.org/abs/2510.00022
作者: Ansh Kamthan
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Exploring foundational behaviours of agentic AI using MARL; 39 pages (25-minute read), 5 tables, 24 equations, 9 figures

点击查看摘要

Abstract:As autonomous systems move from prototypes to real deployments, the ability of multiple agents to make decentralized, cooperative decisions becomes a core requirement. This paper examines how agentic artificial intelligence (agents that act independently, adaptively and proactively) can improve task allocation and coordination in multi-agent systems, with primary emphasis on drone delivery and secondary relevance to warehouse automation. We formulate the problem in a cooperative multi-agent reinforcement learning setting and implement a lightweight multi-agent Proximal Policy Optimization (IPPO) approach in PyTorch under a centralized-training, decentralized-execution paradigm. Experiments are conducted in a PettingZoo environment, where multiple homogeneous drones or agents must self-organize to cover distinct targets without explicit communication.
zh

[AI-141] Methodological Framework for Quantifying Semantic Test Coverage in RAG Systems

【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统评估中缺乏系统性方法来确保测试问题充分覆盖底层知识库的问题,从而导致开发者存在显著的盲区。解决方案的关键在于提出一种新颖且可落地的方法论,通过将文档片段与测试问题映射到统一的向量空间,并结合向量嵌入(vector embeddings)和聚类算法,量化测试问题在语义层面的知识覆盖度。该方法引入多种覆盖指标(如基础邻近度、内容加权覆盖率和多主题问题覆盖率),并集成异常值检测以剔除无关问题,从而有效识别覆盖不足的内容区域,并为生成高价值的新测试问题提供具体建议。

链接: https://arxiv.org/abs/2510.00001
作者: Noah Broestl,Adel Nasser Abdalla,Rajprakash Bale,Hersh Gupta,Max Struever
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 7 pages, 3 figures, 1 table, 1 algo

点击查看摘要

Abstract:Reliably determining the performance of Retrieval-Augmented Generation (RAG) systems depends on comprehensive test questions. While a proliferation of evaluation frameworks for LLM-powered applications exists, current practices lack a systematic method to ensure these test sets adequately cover the underlying knowledge base, leaving developers with significant blind spots. To address this, we present a novel, applied methodology to quantify the semantic coverage of RAG test questions against their underlying documents. Our approach leverages existing technologies, including vector embeddings and clustering algorithms, to create a practical framework for validating test comprehensiveness. Our methodology embeds document chunks and test questions into a unified vector space, enabling the calculation of multiple coverage metrics: basic proximity, content-weighted coverage, and multi-topic question coverage. Furthermore, we incorporate outlier detection to filter irrelevant questions, allowing for the refinement of test sets. Experimental evidence from two distinct use cases demonstrates that our framework effectively quantifies test coverage, identifies specific content areas with inadequate representation, and provides concrete recommendations for generating new, high-value test questions. This work provides RAG developers with essential tools to build more robust test suites, thereby improving system reliability and extending to applications such as identifying misaligned documents.
zh
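
摘要中的“基础邻近覆盖率”指标可以用 NumPy 直接勾勒:把文档块与测试问题嵌入同一向量空间后,若某文档块与至少一个问题的余弦相似度超过阈值 tau,则视其为“被覆盖”。阈值 0.75 与嵌入维度均为示意性假设,嵌入可由任意句向量模型产生。

```python
import numpy as np

def basic_coverage(chunk_embs: np.ndarray, question_embs: np.ndarray,
                   tau: float = 0.75) -> float:
    """示意:被至少一个测试问题“邻近覆盖”的文档块比例。"""
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    q = question_embs / np.linalg.norm(question_embs, axis=1, keepdims=True)
    sims = c @ q.T                    # [n_chunks, n_questions] 余弦相似度
    return float((sims.max(axis=1) >= tau).mean())

chunks = np.random.randn(200, 384)    # 假设的 384 维嵌入
questions = np.random.randn(50, 384)
print(f"覆盖率 = {basic_coverage(chunks, questions):.2%}")
```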

[AI-142] Autonomous Multi-Robot Infrastructure for AI-Enabled Healthcare Delivery and Diagnostics

【速读】:该论文旨在解决医院内患者监护效率低、人工干预成本高以及突发状况响应延迟等问题,提出了一种基于群体智能(Swarm Intelligence)的多机器人系统用于住院患者护理。其解决方案的关键在于采用领导者-跟随者(Leader-Follower)的群集策略,结合可穿戴健康传感器、RF通信模块与AI决策支持系统,实现对患者生命体征的连续监测、药物自动配送及紧急情况下的快速响应。硬件平台由Arduino、Raspberry Pi、NRF24L01射频模块和HuskyLens AI摄像头组成,实验表明该系统在传感器精度(>94%)、任务成功率(92%)和通信可靠性(96%)方面表现优异,并能通过AI提供早期异常健康预警,具备良好的临床应用潜力与成本效益。

链接: https://arxiv.org/abs/2509.26106
作者: Nakhul Kalaivanan,Senthil Arumugam Muthukumaraswamy,Girish Balasubramanian
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, MSc dissertation submission draft, prepared for conference/journal consideration

点击查看摘要

Abstract:This research presents a multi-robot system for inpatient care, designed using swarm intelligence principles and incorporating wearable health sensors, RF-based communication, and AI-driven decision support. Within a simulated hospital environment, the system adopts a leader-follower swarm configuration to perform patient monitoring, medicine delivery, and emergency assistance. Due to ethical constraints, live patient trials were not conducted; instead, validation was carried out through controlled self-testing with wearable sensors. The Leader Robot acquires key physiological parameters, including temperature, SpO2, heart rate, and fall detection, and coordinates other robots when required. The Assistant Robot patrols corridors for medicine delivery, while a robotic arm provides direct drug administration. The swarm-inspired leader-follower strategy enhanced communication reliability and ensured continuous monitoring, including automated email alerts to healthcare staff. The system hardware was implemented using Arduino, Raspberry Pi, NRF24L01 RF modules, and a HuskyLens AI camera. Experimental evaluation showed an overall sensor accuracy above 94%, a 92% task-level success rate, and a 96% communication reliability rate, demonstrating system robustness. Furthermore, the AI-enabled decision support was able to provide early warnings of abnormal health conditions, highlighting the potential of the system as a cost-effective solution for hospital automation and patient safety.

[AI-143] MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms

【Quick Read】: This paper addresses two problems in high-fidelity audio generation: traditional waveform modeling struggles to capture harmonic and temporal structure effectively, and existing spectrogram-based methods lack an efficient mechanism for coordinated multi-scale modeling. The key to the solution is the MARS (Multi-channel AutoRegression on Spectrograms) framework, which treats spectrograms as multi-channel images and introduces channel multiplexing (CMX) to reduce the spectrogram's spatial dimensions without losing information; a shared tokenizer then provides consistent discrete representations across scales, allowing a transformer-based autoregressive model to refine spectrograms progressively from coarse to fine and thereby improve the fidelity and coherence of the generated audio.

Link: https://arxiv.org/abs/2509.26007
Authors: Eleonora Ristori,Luca Bindini,Paolo Frasconi
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather than tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi-channel AutoRegression on Spectrograms), a framework that treats spectrograms as multi-channel images and employs channel multiplexing (CMX), a reshaping technique that lowers height and width without discarding information. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms from coarse to fine resolutions efficiently. Experiments on a large-scale dataset demonstrate that MARS performs comparably or better than state-of-the-art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high-fidelity audio generation.
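
The abstract describes CMX only as "a reshaping technique that lowers height and width without discarding information"; one plausible reading is a space-to-depth reshape, sketched below under that assumption.

```python
import numpy as np

def channel_multiplex(spec: np.ndarray, r: int = 2) -> np.ndarray:
    """Fold r x r spatial blocks of a (C, H, W) spectrogram into channels,
    reducing H and W by r while keeping every value (space-to-depth)."""
    c, h, w = spec.shape
    assert h % r == 0 and w % r == 0
    x = spec.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

spec = np.random.randn(2, 128, 64)   # e.g. real/imaginary spectrogram channels
packed = channel_multiplex(spec)
print(packed.shape)                  # (8, 64, 32): same total number of values
```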

[AI-144] UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching ICASSP2026

【Quick Read】: This paper addresses the performance bottleneck in audio super-resolution caused by conventional two-stage diffusion models' reliance on a pre-trained vocoder: existing methods first predict a mel-spectrogram and then synthesize the waveform with a vocoder, so the final audio quality is capped by the vocoder's reconstruction ability. The key to the solution is a vocoder-free, end-to-end framework that uses a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients and reconstructs waveforms directly via the inverse short-time Fourier transform (iSTFT), achieving high-quality, high-fidelity 48 kHz audio reconstruction and state-of-the-art performance across diverse upsampling factors.

Link: https://arxiv.org/abs/2510.00771
Authors: Woongjib Choi,Sangmin Lee,Hyungseob Lim,Hong-Goo Kang
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Comments: Submitted to ICASSP 2026

Click to view abstract

Abstract:In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.

[AI-145] Adaptive Data-Knowledge Alignment in Genetic Perturbation Prediction

【Quick Read】: This paper addresses the limitations of current genetic perturbation response prediction methods in depth of biological understanding and systematic knowledge refinement: existing methods can predict transcriptional responses to perturbations but offer little biological interpretability and cannot systematically revise existing knowledge. The key to the solution is ALIGNED (Adaptive aLignment for Inconsistent Genetic kNowledgE and Data), a neuro-symbolic framework based on the Abductive Learning (ABL) paradigm that aligns neural and symbolic knowledge components end-to-end, adaptively integrating data-driven learning with prior knowledge, and introduces a balanced consistency metric to evaluate predictions against both data and the knowledge base, improving predictive performance while making mechanistic biological understanding more transparent and capable of evolving.

Link: https://arxiv.org/abs/2510.00512
Authors: Yuanfang Xiang,Lun Ai
Affiliations: Unknown
Subjects: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The transcriptional response to genetic perturbation reveals fundamental insights into complex cellular systems. While current approaches have made progress in predicting genetic perturbation responses, they provide limited biological understanding and cannot systematically refine existing knowledge. Overcoming these limitations requires an end-to-end integration of data-driven learning and existing knowledge. However, this integration is challenging due to inconsistencies between data and knowledge bases, such as noise, misannotation, and incompleteness. To address this challenge, we propose ALIGNED (Adaptive aLignment for Inconsistent Genetic kNowledgE and Data), a neuro-symbolic framework based on the Abductive Learning (ABL) paradigm. This end-to-end framework aligns neural and symbolic components and performs systematic knowledge refinement. We introduce a balanced consistency metric to evaluate the predictions’ consistency against both data and knowledge. Our results show that ALIGNED outperforms state-of-the-art methods by achieving the highest balanced consistency, while also re-discovering biologically meaningful knowledge. Our work advances beyond existing methods to enable both the transparency and the evolution of mechanistic biological understanding.

[AI-146] Structural Refinement of Bayesian Networks for Efficient Model Parameterisation

【Quick Read】: This paper addresses the difficulty of accurately determining conditional probability table (CPT) parameters in Bayesian network modeling when data are scarce. With insufficient data, expert judgement is usually needed to estimate CPT parameters, but the number and complexity of the parameters are often prohibitive. The paper therefore systematically reviews and evaluates a range of CPT approximation methods, whose key idea is to use structural refinement strategies that reduce the parameter count and modeling complexity, enabling efficient CPT approximation while preserving the model's expressiveness and practicality. Each method is validated through a worked example on a Bayesian network for cardiovascular risk assessment, and practical guidance is provided for choosing an alternative approach when direct parameterisation of a CPT is infeasible.

Link: https://arxiv.org/abs/2510.00334
Authors: Kieran Drury,Martine J. Barons,Jim Q. Smith
Affiliations: Unknown
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 38 pages, 10 figures, 3 tables, one appendix

Click to view abstract

Abstract:Many Bayesian network modelling applications suffer from the issue of data scarcity. Hence the use of expert judgement often becomes necessary to determine the parameters of the conditional probability tables (CPTs) throughout the network. There are usually a prohibitively large number of these parameters to determine, even when complementing any available data with expert judgements. To address this challenge, a number of CPT approximation methods have been developed that reduce the quantity and complexity of parameters needing to be determined to fully parameterise a Bayesian network. This paper provides a review of a variety of structural refinement methods that can be used in practice to efficiently approximate a CPT within a Bayesian network. We not only introduce and discuss the intrinsic properties and requirements of each method, but we evaluate each method through a worked example on a Bayesian network model of cardiovascular risk assessment. We conclude with practical guidance to help Bayesian network practitioners choose an alternative approach when direct parameterisation of a CPT is infeasible.

[AI-147] Data driven approaches in nanophotonics: A review of AI-enabled metadevices

【Quick Read】: This paper addresses the inefficiency of traditional photonic metamaterial design, which relies on trial-and-error and computationally intensive electromagnetic simulations and struggles with high-degree-of-freedom designs and complex coupling effects. The key to the solution is introducing deep learning frameworks that replace traditional approaches with model-driven ones, enabling fast and accurate optimization across expansive device design spaces; the review also highlights strategies for challenges such as transformer model implementation, fabrication limitations, and intricate mutual coupling effects, offering an efficient and practical design paradigm for multifunctional, fabrication-friendly nanophotonic devices.

Link: https://arxiv.org/abs/2510.00283
Authors: Huanshu Zhang,Lei Kang,Sawyer D. Campbell,Jacob T. Young,Douglas H. Werner
Affiliations: Unknown
Subjects: Optics (physics.optics); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Data-driven approaches have revolutionized the design and optimization of photonic metadevices by harnessing advanced artificial intelligence methodologies. This review takes a model-centric perspective that synthesizes emerging design strategies and delineates how traditional trial-and-error and computationally intensive electromagnetic simulations are being supplanted by deep learning frameworks that efficiently navigate expansive design spaces. We discuss artificial intelligence implementation in several metamaterial design aspects from high-degree-of-freedom design to large language model-assisted design. By addressing challenges such as transformer model implementation, fabrication limitations, and intricate mutual coupling effects, these AI-enabled strategies not only streamline the forward modeling process but also offer robust pathways for the realization of multifunctional and fabrication-friendly nanophotonic devices. This review further highlights emerging opportunities and persistent challenges, setting the stage for next-generation strategies in nanophotonic engineering.

[AI-148] Identifying All ε-Best Arms in (Misspecified) Linear Bandits

【Quick Read】: This paper addresses the problem of efficiently identifying multiple near-optimal arms (those within ε of the optimum) in tasks with high trial-and-error costs, such as drug discovery. The key to the solution is LinFACT, an algorithm that establishes a new information-theoretic lower bound and achieves instance optimality by matching it up to a logarithmic factor, enabling efficient identification of all ε-best arms in linear bandits. The core technical innovation is integrating the lower bound directly into the scaling step of the upper-bound derivation, which determines the termination round and hence the sample complexity; the analysis also extends to model misspecification and generalized linear models. Experiments on synthetic and real drug discovery data show that LinFACT identifies more promising candidates with fewer samples, substantially improving computational efficiency.

Link: https://arxiv.org/abs/2510.00073
Authors: Zhekai Li,Tianyi Ma,Cheng Hua,Ruihao Zhu
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 80 pages (33 pages for main text), 12 figures, 3 tables

Click to view abstract

Abstract:Motivated by the need to efficiently identify multiple candidates in high trial-and-error cost tasks such as drug discovery, we propose a near-optimal algorithm to identify all \epsilon-best arms (i.e., those at most \epsilon worse than the optimum). Specifically, we introduce LinFACT, an algorithm designed to optimize the identification of all \epsilon-best arms in linear bandits. We establish a novel information-theoretic lower bound on the sample complexity of this problem and demonstrate that LinFACT achieves instance optimality by matching this lower bound up to a logarithmic factor. A key ingredient of our proof is to integrate the lower bound directly into the scaling process for upper bound derivation, determining the termination round and thus the sample complexity. We also extend our analysis to settings with model misspecification and generalized linear models. Numerical experiments, including synthetic and real drug discovery data, demonstrate that LinFACT identifies more promising candidates with reduced sample complexity, offering significant computational efficiency and accelerating early-stage exploratory experiments.
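
To make the task concrete, here is a toy illustration (not LinFACT itself) of identifying all ε-best arms in a linear bandit with uniform sampling and a regularized least-squares estimate; all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, eps, noise = 4, 20, 0.2, 0.1
arms = rng.normal(size=(K, d))
theta = rng.normal(size=d)

# Round-robin sampling and a regularized least-squares estimate of theta.
A, b = np.eye(d), np.zeros(d)
for t in range(2000):
    x = arms[t % K]
    A += np.outer(x, x)
    b += (x @ theta + noise * rng.normal()) * x
theta_hat = np.linalg.solve(A, b)

# Report every arm whose estimated reward is within eps of the best.
est = arms @ theta_hat
print(sorted(np.flatnonzero(est >= est.max() - eps)))
print(sorted(np.flatnonzero(arms @ theta >= (arms @ theta).max() - eps)))
```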

[AI-149] AstroMMBench: A Benchmark for Evaluating Multimodal Large Language Models Capabilities in Astronomy

【Quick Read】: This paper addresses the lack of adequate evaluation of multimodal large language models (MLLMs) on astronomical image understanding. Existing benchmarks target general multimodal capabilities and fail to reflect the complexity and specialization of astronomical data. To fill this gap, the authors present AstroMMBench, the first comprehensive benchmark designed specifically to evaluate MLLMs' astronomical image understanding; its key is a high-quality evaluation set of 621 multiple-choice questions spanning six astrophysical subfields, reviewed by 15 domain experts for quality and relevance. A systematic evaluation of 25 different MLLMs reveals performance differences across subfields, underscoring the importance of domain-specific benchmarks and providing a reliable basis for optimizing and steering the development of MLLMs for scientific applications.

Link: https://arxiv.org/abs/2510.00063
Authors: Jinghang Shi,Xiao Yu Tang,Yang Hunag,Yuyang Li,Xiaokong,Yanxia Zhang,Caizhan Yue
Affiliations: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Astronomical image interpretation presents a significant challenge for applying multimodal large language models (MLLMs) to specialized scientific tasks. Existing benchmarks focus on general multimodal capabilities but fail to capture the complexity of astronomical data. To bridge this gap, we introduce AstroMMBench, the first comprehensive benchmark designed to evaluate MLLMs in astronomical image understanding. AstroMMBench comprises 621 multiple-choice questions across six astrophysical subfields, curated and reviewed by 15 domain experts for quality and relevance. We conducted an extensive evaluation of 25 diverse MLLMs, including 22 open-source and 3 closed-source models, using AstroMMBench. The results show that Ovis2-34B achieved the highest overall accuracy (70.5%), demonstrating leading capabilities even compared to strong closed-source models. Performance showed variations across the six astrophysical subfields, proving particularly challenging in domains like cosmology and high-energy astrophysics, while models performed relatively better in others, such as instrumentation and solar astrophysics. These findings underscore the vital role of domain-specific benchmarks like AstroMMBench in critically evaluating MLLM performance and guiding their targeted development for scientific applications. AstroMMBench provides a foundational resource and a dynamic tool to catalyze advancements at the intersection of AI and astronomy.

Machine Learning

[LG-0] Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs

Link: https://arxiv.org/abs/2510.01185
Authors: Leyla Mirvakhabova,Babak Ehteshami Bejnordi,Gaurav Kumar,Hanxue Liang,Wanru Zhao,Paul Whatmough
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention; notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, Llama3.2 LLM backbones) show DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.
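
As a rough illustration of the idea, the sketch below penalizes a batch-averaged routing distribution by the negative log-density of a target Dirichlet prior; the exact form of DPSL in the paper may differ, and the concentration values are assumptions.

```python
import torch

def dirichlet_prior_shaping_loss(router_probs: torch.Tensor,
                                 alpha: torch.Tensor) -> torch.Tensor:
    """Penalize the batch-averaged expert-assignment distribution by the
    negative Dirichlet log-density (up to an additive constant)."""
    p = router_probs.mean(dim=0).clamp_min(1e-8)  # (n_experts,)
    return -((alpha - 1.0) * p.log()).sum()

probs = torch.softmax(torch.randn(32, 8), dim=-1)  # 32 tokens, 8 experts
alpha = torch.full((8,), 2.0)  # >1 pushes toward balanced expert usage
loss = dirichlet_prior_shaping_loss(probs, alpha)
```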

[LG-1] Temporal Score Rescaling for Temperature Sampling in Diffusion and Flow Models

Link: https://arxiv.org/abs/2510.01184
Authors: Yanbo Xu,Yu Wu,Sungjae Park,Zhizhuo Zhou,Shubham Tulsiani
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present a mechanism to steer the sampling diversity of denoising diffusion and flow matching models, allowing users to sample from a sharper or broader distribution than the training distribution. We build on the observation that these models leverage (learned) score functions of noisy data distributions for sampling and show that rescaling these allows one to effectively control a `local’ sampling temperature. Notably, this approach does not require any finetuning or alterations to training strategy, and can be applied to any off-the-shelf model and is compatible with both deterministic and stochastic samplers. We first validate our framework on toy 2D data, and then demonstrate its application for diffusion models trained across five disparate tasks – image generation, pose estimation, depth prediction, robot manipulation, and protein design. We find that across these tasks, our approach allows sampling from sharper (or flatter) distributions, yielding performance gains e.g., depth prediction models benefit from sampling more likely depth estimates, whereas image generation models perform better when sampling a slightly flatter distribution. Project page: this https URL
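
The "local temperature" effect of rescaling a score is easy to see on a toy example: Langevin sampling of a standard Gaussian with the score divided by T targets a distribution with variance T. This sketch is illustrative and is not the paper's sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):          # score of a standard Gaussian: d/dx log p(x) = -x
    return -x

T = 0.5                # temperature < 1 sharpens the sampled distribution
x = rng.normal(size=10_000)
step = 0.01
for _ in range(2_000):  # unadjusted Langevin dynamics with a rescaled score
    x += step * score(x) / T + np.sqrt(2 * step) * rng.normal(size=x.shape)

print(x.std())          # approx sqrt(T) = 0.707 rather than 1.0
```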

[LG-2] On the Benefits of Weight Normalization for Overparameterized Matrix Sensing

Link: https://arxiv.org/abs/2510.01175
Authors: Yudong Wei,Liang Zhang,Bingcong Li,Niao He
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:While normalization techniques are widely used in deep learning, their theoretical understanding remains relatively limited. In this work, we establish the benefits of (generalized) weight normalization (WN) applied to the overparameterized matrix sensing problem. We prove that WN with Riemannian optimization achieves linear convergence, yielding an exponential speedup over standard methods that do not use WN. Our analysis further demonstrates that both iteration and sample complexity improve polynomially as the level of overparameterization increases. To the best of our knowledge, this work provides the first characterization of how WN leverages overparameterization for faster convergence in matrix sensing.

[LG-3] How Does the Pretraining Distribution Shape In-Context Learning? Task Selection Generalization and Robustness

Link: https://arxiv.org/abs/2510.01163
Authors: Waïss Azizian,Ali Hasan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 52 pages, 12 figures

Click to view abstract

Abstract:The emergence of in-context learning (ICL) in large language models (LLMs) remains poorly understood despite its consistent effectiveness, enabling models to adapt to new tasks from only a handful of examples. To clarify and improve these capabilities, we characterize how the statistical properties of the pretraining distribution (e.g., tail behavior, coverage) shape ICL on numerical tasks. We develop a theoretical framework that unifies task selection and generalization, extending and sharpening earlier results, and show how distributional properties govern sample efficiency, task retrieval, and robustness. To this end, we generalize Bayesian posterior consistency and concentration results to heavy-tailed priors and dependent sequences, better reflecting the structure of LLM pretraining data. We then empirically study how ICL performance varies with the pretraining distribution on challenging tasks such as stochastic differential equations and stochastic processes with memory. Together, these findings suggest that controlling key statistical properties of the pretraining distribution is essential for building ICL-capable and reliable LLMs.

[LG-4] Multi-Marginal Flow Matching with Adversarially Learnt Interpolants

Link: https://arxiv.org/abs/2510.01159
Authors: Oskar Kviman,Kirill Tamogashev,Nicola Branchini,Víctor Elvira,Jens Lagergren,Nikolay Malkin
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Learning the dynamics of a process given sampled observations at several time points is an important but difficult task in many scientific applications. When no ground-truth trajectories are available, but one has only snapshots of data taken at discrete time steps, the problem of modelling the dynamics, and thus inferring the underlying trajectories, can be solved by multi-marginal generalisations of flow matching algorithms. This paper proposes a novel flow matching method that overcomes the limitations of existing multi-marginal trajectory inference algorithms. Our proposed method, ALI-CFM, uses a GAN-inspired adversarial loss to fit neurally parametrised interpolant curves between source and target points such that the marginal distributions at intermediate time points are close to the observed distributions. The resulting interpolants are smooth trajectories that, as we show, are unique under mild assumptions. These interpolants are subsequently marginalised by a flow matching algorithm, yielding a trained vector field for the underlying dynamics. We showcase the versatility and scalability of our method by outperforming the existing baselines on spatial transcriptomics and cell tracking datasets, while performing on par with them on single-cell trajectory prediction. Code: this https URL

[LG-5] Neural Hamilton–Jacobi Characteristic Flows for Optimal Transport

Link: https://arxiv.org/abs/2510.01153
Authors: Yesom Park,Shu Liu,Mo Zhou,Stanley Osher
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:We present a novel framework for solving optimal transport (OT) problems based on the Hamilton–Jacobi (HJ) equation, whose viscosity solution uniquely characterizes the OT map. By leveraging the method of characteristics, we derive closed-form, bidirectional transport maps, thereby eliminating the need for numerical integration. The proposed method adopts a pure minimization framework: a single neural network is trained with a loss function derived from the method of characteristics of the HJ equation. This design guarantees convergence to the optimal map while eliminating adversarial training stages, thereby substantially reducing computational complexity. Furthermore, the framework naturally extends to a wide class of cost functions and supports class-conditional transport. Extensive experiments on diverse datasets demonstrate the accuracy, scalability, and efficiency of the proposed method, establishing it as a principled and versatile tool for OT applications with provable optimality.

[LG-6] Sample-Efficient Differentially Private Fine-Tuning via Gradient Matrix Denoising

Link: https://arxiv.org/abs/2510.01137
Authors: Ali Dadsetan,Frank Rudzicz
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We address the challenge of sample efficiency in differentially private fine-tuning of large language models (LLMs) using DP-SGD. While DP-SGD provides strong privacy guarantees, the added noise significantly increases the entropy of gradient matrices, disrupting their low-rank structure and slowing optimization. We propose a post-processing algorithm that leverages random matrix theory to denoise gradients, restore low-rank structure, and improve alignment with the original signal. Applied to DP-SGD fine-tuning of RoBERTa on GLUE tasks, our method improves sample efficiency compared to state-of-the-art approaches, substantially reducing training time when optimal performance is not required. This work demonstrates that matrix recovery techniques can enhance the utility of private language model training without compromising privacy guarantees.

[LG-7] Breaking the Euclidean Barrier: Hyperboloid-Based Biological Sequence Analysis

Link: https://arxiv.org/abs/2510.01118
Authors: Sarwan Ali,Haris Mansoor,Murray Patterson
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Genomic sequence analysis plays a crucial role in various scientific and medical domains. Traditional machine-learning approaches often struggle to capture the complex relationships and hierarchical structures of sequence data when working in high-dimensional Euclidean spaces. This limitation hinders accurate sequence classification and similarity measurement. To address these challenges, this research proposes a method to transform the feature representation of biological sequences into the hyperboloid space. By applying a transformation, the sequences are mapped onto the hyperboloid, preserving their inherent structural information. Once the sequences are represented in the hyperboloid space, a kernel matrix is computed based on the hyperboloid features. The kernel matrix captures the pairwise similarities between sequences, enabling more effective analysis of biological sequence relationships. This approach leverages the inner product of the hyperboloid feature vectors to measure the similarity between pairs of sequences. The experimental evaluation of the proposed approach demonstrates its efficacy in capturing important sequence correlations and improving classification accuracy.
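
The abstract does not specify the exact transformation; a common choice, assumed here, is the standard lift x -> (sqrt(1 + ||x||^2), x) onto the hyperboloid model, with a kernel built from pairwise Minkowski inner products.

```python
import numpy as np

def lift_to_hyperboloid(X: np.ndarray) -> np.ndarray:
    """Map each feature row x to (sqrt(1 + ||x||^2), x), a point on the
    hyperboloid x0^2 - x1^2 - ... - xd^2 = 1."""
    x0 = np.sqrt(1.0 + (X ** 2).sum(axis=1, keepdims=True))
    return np.hstack([x0, X])

def minkowski_kernel(H: np.ndarray) -> np.ndarray:
    """Pairwise Minkowski inner products <u, v> = u0*v0 - sum_i ui*vi."""
    spatial = H[:, 1:] @ H[:, 1:].T
    return H @ H.T - 2.0 * spatial

X = np.random.randn(5, 16)   # 5 sequences, 16-dimensional features
K = minkowski_kernel(lift_to_hyperboloid(X))
```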

[LG-8] Eliciting Chain-of-Thought Reasoning for Time Series Analysis using Reinforcement Learning

Link: https://arxiv.org/abs/2510.01116
Authors: Felix Parker,Nimeesha Chan,Chi Zhang,Kimia Ghobadi
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Complex numerical time series analysis often demands multi-step reasoning capabilities beyond current models’ reach. Tasks like medical diagnosis and weather forecasting require sequential reasoning processes – including counterfactual analysis, logical deduction, knowledge application, and multi-modal contextual integration – that existing time series models cannot explicitly perform. While recent research has shown large language models (LLMs) can achieve sophisticated Chain-of-Thought (CoT) reasoning through reinforcement learning (RL), these advances have primarily focused on mathematical and coding domains, with LLMs still demonstrating poor performance on time series tasks. We introduce Chain Of thought for Understanding Numerical Time Series (COUNTS), the first framework that trains LLMs to perform CoT reasoning across diverse time series tasks using RL with verifiable rewards. Our approach employs a Residual Vector-Quantized VAE to create high-fidelity discrete tokens that seamlessly integrate into a pre-trained LLM’s vocabulary. COUNTS undergoes a two-stage training process: first, supervised fine-tuning on time series analysis tasks to master our novel representations, followed by Group Relative Policy Optimization training on verifiable problems using prompting strategies that encourage explicit reasoning steps before producing final answers. Our experiments demonstrate that this RL-driven approach with intermediate CoT reasoning significantly enhances LLM performance across various time series analysis tasks, opening new possibilities for complex temporal data reasoning.

[LG-9] Privacy Preserved Federated Learning with Attention-Based Aggregation for Biometric Recognition

Link: https://arxiv.org/abs/2510.01113
Authors: Kassahun Azezew,Minyechil Alehegn,Tsega Asresa,Bitew Mekuria,Tizazu Bayh,Ayenew Kassie,Amsalu Tesema,Animut Embiyale
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Because biometric data is sensitive, centralized training poses a privacy risk, even though biometric recognition is essential for contemporary applications. Federated learning (FL), which permits decentralized training, provides a privacy-preserving substitute. Conventional FL, however, has trouble with interpretability and heterogeneous data (non-IID). In order to handle non-IID biometric data, this framework adds an attention mechanism at the central server that weights local model updates according to their significance. Differential privacy and secure update protocols safeguard data while preserving accuracy. The A3-FL framework is evaluated in this study using FVC2004 fingerprint data, with each client's features extracted using a Siamese Convolutional Neural Network (Siamese-CNN). By dynamically modifying client contributions, the attention mechanism increases the accuracy of the global model. The accuracy, convergence speed, and robustness of the A3-FL framework are superior to those of standard FL (FedAvg) and static baselines, according to experimental evaluations using fingerprint data (FVC2004). The accuracy of the attention-based approach was 0.8413, while FedAvg, Local-only, and Centralized approaches were 0.8164, 0.7664, and 0.7997, respectively. Accuracy stayed high at 0.8330 even with differential privacy. A scalable and privacy-sensitive biometric system for secure and effective recognition in dispersed environments is presented in this work.

[LG-10] Augmenting LLMs for General Time Series Understanding and Prediction

Link: https://arxiv.org/abs/2510.01111
Authors: Felix Parker,Nimeesha Chan,Chi Zhang,Kimia Ghobadi
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Time series data is fundamental to decision-making in many crucial domains including healthcare, finance, and environmental science. However, analyzing this data often requires incorporating unstructured contextual information, answering domain-specific questions, and generating natural language explanations – capabilities that traditional time series models lack due to their inability to process text. While Large Language Models (LLMs) excel at contextual reasoning and knowledge integration, they struggle with numerical time series due to inefficient text-based representations and limited exposure to temporal data during pretraining. We address this gap by augmenting an LLM with specialized time series perception through a patch-based encoder-decoder architecture. We train this Time Series-augmented LLM (TsLLM) on a large corpus of over 2 million interleaved time series and text examples spanning diverse analysis tasks: forecasting with contextual information, time series question-answering, pattern explanation, classification with natural language outputs, and report generation. This training enables TsLLM to leverage both its language understanding and newly acquired temporal reasoning capabilities. While not designed to surpass specialized models on traditional benchmarks, TsLLM demonstrates strong performance on tasks requiring the integration of time series analysis with natural language – capabilities that existing approaches cannot provide. Our work establishes a new paradigm for time series analysis that bridges numerical computation and natural language understanding, democratizing access to sophisticated temporal reasoning through natural language interaction.

[LG-11] Geometric Properties of Neural Multivariate Regression

Link: https://arxiv.org/abs/2510.01105
Authors: George Andriopoulos,Zixuan Dong,Bimarsha Adhikari,Keith Ross
Subjects: Machine Learning (cs.LG)
Comments: 22 pages, 12 figures

Click to view abstract

Abstract:Neural multivariate regression underpins a wide range of domains such as control, robotics, and finance, yet the geometry of its learned representations remains poorly characterized. While neural collapse has been shown to benefit generalization in classification, we find that analogous collapse in regression consistently degrades performance. To explain this contrast, we analyze models through the lens of intrinsic dimension. Across control tasks and synthetic datasets, we estimate the intrinsic dimension of last-layer features (ID_H) and compare it with that of the regression targets (ID_Y). Collapsed models exhibit ID_H < ID_Y, leading to over-compression and poor generalization, whereas non-collapsed models typically maintain ID_H >= ID_Y. For the non-collapsed models, performance with respect to ID_H depends on the data quantity and noise levels. From these observations, we identify two regimes (over-compressed and under-compressed) that determine when expanding or reducing feature dimensionality improves performance. Our results provide new geometric insights into neural regression and suggest practical strategies for enhancing generalization.

[LG-12] Dynamical system reconstruction from partial observations using stochastic dynamics

Link: https://arxiv.org/abs/2510.01089
Authors: Viktor Sip,Martin Breyton,Spase Petkoski,Viktor Jirsa
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Learning stochastic models of dynamical systems underlying observed data is of interest in many scientific fields. Here we propose a novel method for this task, based on the framework of variational autoencoders for dynamical systems. The method estimates from the data both the system state trajectories and noise time series. This approach allows to perform multi-step system evolution and supports a teacher forcing strategy, alleviating limitations of autoencoder-based approaches for stochastic systems. We demonstrate the performance of the proposed approach on six test problems, covering simulated and experimental data. We further show the effects of the teacher forcing interval on the nature of the internal dynamics, and compare it to the deterministic models with equivalent architecture.

[LG-13] Multi-Actor Multi-Critic Deep Deterministic Reinforcement Learning with a Novel Q-Ensemble Method

Link: https://arxiv.org/abs/2510.01083
Authors: Andy Wu,Chun-Cheng Lin,Rung-Tzuo Liaw,Yuehua Huang,Chihjung Kuo,Chia Tong Weng
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement learning has gathered much attention in recent years due to its rapid development and rich applications, especially on control systems and robotics. When tackling real-world applications with reinforcement learning method, the corresponding Markov decision process may have huge discrete or even continuous state/action space. Deep reinforcement learning has been studied for handling these issues through deep learning for years, and one promising branch is the actor-critic architecture. Many past studies leveraged multiple critics to enhance the accuracy of evaluation of a policy for addressing the overestimation and underestimation issues. However, few studies have considered the architecture with multiple actors together with multiple critics. This study proposes a novel multi-actor multi-critic (MAMC) deep deterministic reinforcement learning method. The proposed method has three main features, including selection of actors based on non-dominated sorting for exploration with respect to skill and creativity factors, evaluation for actors and critics using a quantile-based ensemble strategy, and exploiting actors with best skill factor. Theoretical analysis proves the learning stability and bounded estimation bias for the MAMC. The present study examines the performance on the well-known reinforcement learning benchmark MuJoCo. Experimental results show that the proposed framework outperforms state-of-the-art deep deterministic based reinforcement learning methods. Experimental analysis also indicates the proposed components are effective. Empirical analysis further investigates the validity of the proposed method, and shows its benefit on complicated problems. The source code can be found at this https URL.

[LG-14] Predicting Diabetic Retinopathy Using a Two-Level Ensemble Model

Link: https://arxiv.org/abs/2510.01074
Authors: Mahyar Mahmoudi,Tieming Liu
Subjects: Machine Learning (cs.LG)
Comments: Accepted for presentation at the IISE Annual Conference & Expo 2025, 6 pages, 2 tables, 1 figure

Click to view abstract

Abstract:Preprint Note: This is the author preprint version of a paper accepted for presentation at the IISE Annual Conference & Expo 2025. The final version will appear in the official proceedings. Diabetic retinopathy (DR) is a leading cause of blindness in working-age adults, and current diagnostic methods rely on resource-intensive eye exams and specialized equipment. Image-based AI tools have shown limitations in early-stage detection, motivating the need for alternative approaches. We propose a non-image-based, two-level ensemble model for DR prediction using routine laboratory test results. In the first stage, base models (Linear SVC, Random Forest, Gradient Boosting, and XGBoost) are hyperparameter tuned and internally stacked across different configurations to optimize metrics such as accuracy, recall, and precision. In the second stage, predictions are aggregated using Random Forest as a meta-learner. This hierarchical stacking strategy improves generalization, balances performance across multiple metrics, and remains computationally efficient compared to deep learning approaches. The model achieved Accuracy 0.9433, F1 Score 0.9425, Recall 0.9207, Precision 0.9653, ROC-AUC 0.9844, and AUPRC 0.9875, surpassing one-level stacking and FCN baselines. These results highlight the model's potential for accurate and interpretable DR risk prediction in clinical settings.
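
A simplified two-level stack along these lines can be assembled with scikit-learn (plus the external xgboost package); the hyperparameters and the paper's internal per-metric stacking are omitted here, so treat this as a sketch rather than the authors' exact pipeline.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier  # external package, assumed installed

# Level 1: the four named base learners; Level 2: Random Forest meta-learner.
base = [
    ("svc", LinearSVC()),
    ("rf", RandomForestClassifier(n_estimators=200)),
    ("gb", GradientBoostingClassifier()),
    ("xgb", XGBClassifier(eval_metric="logloss")),
]
model = StackingClassifier(estimators=base,
                           final_estimator=RandomForestClassifier())
# model.fit(X_train, y_train); model.predict(X_test)
```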

[LG-15] Eliciting Secret Knowledge from Language Models

Link: https://arxiv.org/abs/2510.01070
Authors: Bartosz Cywiński,Emil Ryd,Rowan Wang,Senthooran Rajamanoharan,Neel Nanda,Arthur Conmy,Samuel Marks
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study secret elicitation: discovering knowledge that an AI possesses but does not explicitly verbalize. As a testbed, we train three families of large language models (LLMs) to possess specific knowledge that they apply downstream but deny knowing when asked directly. For example, in one setting, we train an LLM to generate replies that are consistent with knowing the user is female, while denying this knowledge when asked directly. We then design various black-box and white-box secret elicitation techniques and evaluate them based on whether they can help an LLM auditor successfully guess the secret knowledge. Many of our techniques improve on simple baselines. Our most effective techniques (performing best in 2/3 settings) are based on prefill attacks, a black-box technique where the LLM reveals secret knowledge when generating a completion from a predefined prefix. In our remaining setting, white-box techniques based on logit lens and sparse autoencoders (SAEs) are most effective. We release our models and code, establishing a public benchmark for evaluating secret elicitation methods.

[LG-16] Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

Link: https://arxiv.org/abs/2510.01068
Authors: Jiahang Cao,Yize Huang,Hanzhong Guo,Rui Zhang,Mu Nan,Weijian Mai,Jiaxu Wang,Hao Cheng,Jingkai Sun,Gang Han,Wen Zhao,Qiang Zhang,Yijie Guo,Qihao Zheng,Chunfeng Song,Xiao Li,Ping Luo,Andrew F. Luo
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Click to view abstract

Abstract:Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance without additional model training. Perhaps surprisingly, we demonstrate that the composed policies can exceed the performance of either parent policy. Our contribution is threefold. First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow-matching, irrespective of their input visual modalities. Third, we provide extensive empirical validation. Experiments on Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.
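
The core test-time operation, as described, is a convex combination of two policies' scores. A minimal sketch follows; the score interfaces and the `evaluate_rollout` helper used for the weight search are hypothetical.

```python
import numpy as np

def composed_score(score_a, score_b, x, t, w: float):
    """Convex combination of two pre-trained policies' distributional
    scores: the core test-time operation of score composition."""
    return w * score_a(x, t) + (1.0 - w) * score_b(x, t)

# Test-time search over the mixing weight, using a task-specific metric.
# `evaluate_rollout` is a hypothetical helper that rolls the composed
# policy out in the environment and returns a success score.
# best_w = max(
#     np.linspace(0.0, 1.0, 11),
#     key=lambda w: evaluate_rollout(
#         lambda x, t: composed_score(score_a, score_b, x, t, w)))
```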

[LG-17] Gated X-TFC: Soft Domain Decomposition for Forward and Inverse Problems in Sharp-Gradient PDEs

Link: https://arxiv.org/abs/2510.01039
Authors: Vikas Dwivedi,Enrico Schiassi,Monica Sigovan,Bruno Sixou
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Physics-informed neural networks (PINNs) and related methods struggle to resolve sharp gradients in singularly perturbed boundary value problems without resorting to some form of domain decomposition, which often introduce complex interface penalties. While the Extreme Theory of Functional Connections (X-TFC) avoids multi-objective optimization by employing exact boundary condition enforcement, it remains computationally inefficient for boundary layers and incompatible with decomposition. We propose Gated X-TFC, a novel framework for both forward and inverse problems, that overcomes these limitations through a soft, learned domain decomposition. Our method replaces hard interfaces with a differentiable logistic gate that dynamically adapts radial basis function (RBF) kernel widths across the domain, eliminating the need for interface penalties. This approach yields not only superior accuracy but also dramatic improvements in computational efficiency: on a benchmark one dimensional (1D) convection-diffusion problem, Gated X-TFC achieves an order-of-magnitude lower error than standard X-TFC while using 80 percent fewer collocation points and reducing training time by 66 percent. In addition, we introduce an operator-conditioned meta-learning layer that learns a probabilistic mapping from PDE parameters to optimal gate configurations, enabling fast, uncertainty-aware warm-starting for new problem instances. We further demonstrate scalability to multiple subdomains and higher dimensions by solving a twin boundary-layer equation and a 2D Poisson problem with a sharp Gaussian source. Overall, Gated X-TFC delivers a simple alternative to PINNs that is both accurate and computationally efficient for challenging boundary-layer regimes. Future work will focus on nonlinear problems.

[LG-18] Meaningless Tokens Meaningful Gains: How Activation Shifts Enhance LLM Reasoning

Link: https://arxiv.org/abs/2510.01032
Authors: Zeru Shi,Yingjia Wan,Zhenting Wang,Qifan Wang,Fan Yang,Elisa Kreiss,Ruixiang Tang
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Motivated by the puzzling observation that inserting long sequences of meaningless tokens before the query prompt can consistently enhance LLM reasoning performance, this work analyzes the underlying mechanism driving this phenomenon and based on these insights proposes a more principled method that allows for similar performance gains. First, we find that the improvements arise from a redistribution of activations in the LLM’s MLP layers, where near zero activations become less frequent while large magnitude activations increase. This redistribution enhances the model’s representational capacity by suppressing weak signals and promoting stronger, more informative ones. Building on this insight, we propose the Activation Redistribution Module (ARM), a lightweight inference-time technique that modifies activations directly without altering the input sequence. ARM adaptively identifies near-zero activations after the non-linear function and shifts them outward, implicitly reproducing the beneficial effects of meaningless tokens in a controlled manner. Extensive experiments across diverse benchmarks and model architectures clearly show that ARM consistently improves LLM performance on reasoning tasks while requiring only a few lines of simple code to implement. Our findings deliver both a clear mechanistic explanation for the unexpected benefits of meaningless tokens and a simple yet effective technique that harnesses activation redistribution to further improve LLM performance.
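
A rough sketch of the described operation follows: shift near-zero post-activation values outward while leaving the rest untouched. The threshold and shift magnitude are assumptions, not the paper's values.

```python
import torch

def activation_redistribution(h: torch.Tensor, tau: float = 0.05,
                              shift: float = 0.1) -> torch.Tensor:
    """Push near-zero activations outward (away from zero, keeping sign),
    so the distribution has fewer weak and more large-magnitude values."""
    near_zero = h.abs() < tau
    pushed = torch.sign(h) * (h.abs() + shift)
    return torch.where(near_zero, pushed, h)

h = torch.randn(4, 16) * 0.2   # toy post-nonlinearity MLP activations
h_arm = activation_redistribution(h)
```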

[LG-19] Equivariant Geometric Scattering Networks via Vector Diffusion Wavelets NEURIPS

Link: https://arxiv.org/abs/2510.01022
Authors: David R. Johnson,Rishabh Anand,Smita Krishnaswamy,Michael Perlmutter
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Accepted for presentation at the NeurIPS workshop on New Perspectives in Advancing Graph Machine Learning

Click to view abstract

Abstract:We introduce a novel version of the geometric scattering transform for geometric graphs containing scalar and vector node features. This new scattering transform has desirable symmetries with respect to rigid-body roto-translations (i.e., SE(3)-equivariance) and may be incorporated into a geometric GNN framework. We empirically show that our equivariant scattering-based GNN achieves comparable performance to other equivariant message-passing-based GNNs at a fraction of the parameter count.

[LG-20] Random Feature Spiking Neural Networks

Link: https://arxiv.org/abs/2510.01012
Authors: Maximilian Gollwitzer,Felix Dietrich
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 34 pages incl. references and appendix, 3 figures, 4 tables

Click to view abstract

Abstract:Spiking Neural Networks (SNNs) as Machine Learning (ML) models have recently received a lot of attention as a potentially more energy-efficient alternative to conventional Artificial Neural Networks. The non-differentiability and sparsity of the spiking mechanism can make these models very difficult to train with algorithms based on propagating gradients through the spiking non-linearity. We address this problem by adapting the paradigm of Random Feature Methods (RFMs) from Artificial Neural Networks (ANNs) to Spike Response Model (SRM) SNNs. This approach allows training of SNNs without approximation of the spike function gradient. Concretely, we propose a novel data-driven, fast, high-performance, and interpretable algorithm for end-to-end training of SNNs inspired by the SWIM algorithm for RFM-ANNs, which we coin S-SWIM. We provide a thorough theoretical discussion and supplementary numerical experiments showing that S-SWIM can reach high accuracies on time series forecasting as a standalone strategy and serve as an effective initialisation strategy before gradient-based training. Additional ablation studies show that our proposed method performs better than random sampling of network weights.

[LG-21] Riemannian Consistency Model NEURIPS2025

Link: https://arxiv.org/abs/2510.00983
Authors: Chaoran Cheng,Yusong Wang,Yuxin Chen,Xiangxin Zhou,Nanning Zheng,Ge Liu
Subjects: Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2025

Click to view abstract

Abstract:Consistency models are a class of generative models that enable few-step generation for diffusion and flow matching models. While consistency models have achieved promising results on Euclidean domains like images, their applications to Riemannian manifolds remain challenging due to the curved geometry. In this work, we propose the Riemannian Consistency Model (RCM), which, for the first time, enables few-step consistency modeling while respecting the intrinsic manifold constraint imposed by the Riemannian geometry. Leveraging the covariant derivative and exponential-map-based parameterization, we derive the closed-form solutions for both discrete- and continuous-time training objectives for RCM. We then demonstrate theoretical equivalence between the two variants of RCM: Riemannian consistency distillation (RCD) that relies on a teacher model to approximate the marginal vector field, and Riemannian consistency training (RCT) that utilizes the conditional vector field for training. We further propose a simplified training objective that eliminates the need for the complicated differential calculation. Finally, we provide a unique kinematics perspective for interpreting the RCM objective, offering new theoretical angles. Through extensive experiments, we manifest the superior generative quality of RCM in few-step generation on various non-Euclidean manifolds, including flat-tori, spheres, and the 3D rotation group SO(3).

[LG-22] Modeling Market States with Clustering and State Machines

Link: https://arxiv.org/abs/2510.00953
Authors: Christian Oliva,Silviu Gabriel Tinjala
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This work introduces a new framework for modeling financial markets through an interpretable probabilistic state machine. By clustering historical returns based on momentum and risk features across multiple time horizons, we identify distinct market states that capture underlying regimes, such as expansion phase, contraction, crisis, or recovery. From a transition matrix representing the dynamics between these states, we construct a probabilistic state machine that models the temporal evolution of the market. This state machine enables the generation of a custom distribution of returns based on a mixture of Gaussian components weighted by state frequencies. We show that the proposed benchmark significantly outperforms the traditional approach in capturing key statistical properties of asset returns, including skewness and kurtosis, and our experiments across random assets and time periods confirm its robustness.
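
The pipeline (cluster returns into states, estimate a transition matrix, then sample from per-state Gaussians weighted by state frequencies) can be prototyped in a few lines; the features and cluster count below are placeholders, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=1_000)            # toy daily returns
feats = np.column_stack([returns, np.abs(returns)])    # momentum/risk proxies

k = 4
states = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)

# Empirical transition matrix between consecutive states.
T = np.zeros((k, k))
for a, b in zip(states[:-1], states[1:]):
    T[a, b] += 1
T /= T.sum(axis=1, keepdims=True)

# Mixture of per-state Gaussians weighted by state frequencies.
weights = np.bincount(states, minlength=k) / len(states)
s = rng.choice(k, p=weights)
sample = rng.normal(returns[states == s].mean(), returns[states == s].std())
```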

[LG-23] Large Reasoning Models Learn Better Alignment from Flawed Thinking

Link: https://arxiv.org/abs/2510.00938
Authors: ShengYun Peng,Eric Smith,Ivan Evtimov,Song Jiang,Pin-Yu Chen,Hongyuan Zhan,Haozhu Wang,Duen Horng Chau,Mahesh Pasupuleti,Jianfeng Chi
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large reasoning models (LRMs) “think” by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability – all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.

[LG-24] BoMGene: Integrating Boruta-mRMR feature selection for enhanced Gene expression classification

Link: https://arxiv.org/abs/2510.00907
Authors: Bich-Chung Phan,Thanh Ma,Huu-Hoa Nguyen,Thanh-Nghi Do
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Feature selection is a crucial step in analyzing gene expression data, enhancing classification performance, and reducing computational costs for high-dimensional datasets. This paper proposes BoMGene, a hybrid feature selection method that effectively integrates two popular techniques: Boruta and Minimum Redundancy Maximum Relevance (mRMR). The method aims to optimize the feature space and enhance classification accuracy. Experiments were conducted on 25 publicly available gene expression datasets, employing widely used classifiers such as Support Vector Machine (SVM), Random Forest, XGBoost (XGB), and Gradient Boosting Machine (GBM). The results show that using the Boruta-mRMR combination cuts down the number of features chosen compared to just using mRMR, which helps to speed up training time while keeping or even improving classification accuracy compared to using individual feature selection methods. The proposed approach demonstrates clear advantages in accuracy, stability, and practical applicability for multi-class gene expression data analysis.
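
For orientation, a greedy mRMR selector is sketched below, using mutual information for relevance and (for simplicity) absolute correlation as the redundancy term; the paper's Boruta pre-filtering step (e.g., via the boruta package) is assumed to have run first, and this is not the authors' implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr(X: np.ndarray, y: np.ndarray, n_select: int) -> list:
    """Greedy mRMR: maximize relevance to y minus mean redundancy
    (absolute correlation) against the already-selected features."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        scores = np.full(X.shape[1], -np.inf)
        for j in range(X.shape[1]):
            if j not in selected:
                red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                               for s in selected])
                scores[j] = relevance[j] - red
        selected.append(int(np.argmax(scores)))
    return selected
```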

[LG-25] Rectifying Regression in Reinforcement Learning

Link: https://arxiv.org/abs/2510.00885
Authors: Alex Ayoub,David Szepesvári,Alireza Baktiari,Csaba Szepesvári,Dale Schuurmans
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper investigates the impact of the loss function in value-based methods for reinforcement learning through an analysis of underlying prediction objectives. We theoretically show that mean absolute error is a better prediction objective than the traditional mean squared error for controlling the learned policy’s suboptimality gap. Furthermore, we present results that different loss functions are better aligned with these different regression objectives: binary and categorical cross-entropy losses with the mean absolute error and squared loss with the mean squared error. We then provide empirical evidence that algorithms minimizing these cross-entropy losses can outperform those based on the squared loss in linear reinforcement learning.

[LG-26] COMMET: orders-of-magnitude speed-up in finite element method via batch-vectorized neural constitutive updates

Link: https://arxiv.org/abs/2510.00884
Authors: Benjamin Alheit,Mathias Peirlinck,Siddhant Kumar
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 40 pages, 15 figures

Click to view abstract

Abstract:Constitutive evaluations often dominate the computational cost of finite element (FE) simulations whenever material models are complex. Neural constitutive models (NCMs) offer a highly expressive and flexible framework for modeling complex material behavior in solid mechanics. However, their practical adoption in large-scale FE simulations remains limited due to significant computational costs, especially in repeatedly evaluating stress and stiffness. NCMs thus represent an extreme case: their large computational graphs make stress and stiffness evaluations prohibitively expensive, restricting their use to small-scale problems. In this work, we introduce COMMET, an open-source FE framework whose architecture has been redesigned from the ground up to accelerate high-cost constitutive updates. Our framework features a novel assembly algorithm that supports batched and vectorized constitutive evaluations, compute-graph-optimized derivatives that replace automatic differentiation, and distributed-memory parallelism via MPI. These advances dramatically reduce runtime, with speed-ups exceeding three orders of magnitude relative to traditional non-vectorized automatic differentiation-based implementations. While we demonstrate these gains primarily for NCMs, the same principles apply broadly wherever for-loop based assembly or constitutive updates limit performance, establishing a new standard for large-scale, high-fidelity simulations in computational mechanics.

[LG-27] Noise Reduction via Autoencoders: A Case Study with the GW150914 Signal

Link: https://arxiv.org/abs/2510.00873
Authors: Fernanda Zapata Bascuñán,Darío Fernando Mendieta
Subjects: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: In Spanish. Presented at RPIC 2023 (Workshop on Information Processing and Control)

Click to view abstract

Abstract:This brief study focuses on the application of autoencoders to improve the quality of low-amplitude signals, such as gravitational events. A pre-existing autoencoder was trained using cosmic event data, optimizing its architecture and parameters. The results show a significant increase in the signal-to-noise ratio of the processed signals, demonstrating the potential of autoencoders in the analysis of small signals with multiple sources of interference.

[LG-28] A Visual Diagnostics Framework for District Heating Data: Enhancing Data Quality for AI-Driven Heat Consumption Prediction

Link: https://arxiv.org/abs/2510.00872
Authors: Kristoffer Christensen,Bo Nørregaard Jørgensen,Zheng Grace Ma
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: Energy Informatics.Academy Conference 2025 (EI.A 2025), 3-6 December 2025, Universiti Tenaga Nasional (UNITEN), Kuala Lumpur, Malaysia

Click to view abstract

Abstract:High-quality data is a prerequisite for training reliable Artificial Intelligence (AI) models in the energy domain. In district heating networks, sensor and metering data often suffer from noise, missing values, and temporal inconsistencies, which can significantly degrade model performance. This paper presents a systematic approach for evaluating and improving data quality using visual diagnostics, implemented through an interactive web-based dashboard. The dashboard employs Python-based visualization techniques, including time series plots, heatmaps, box plots, histograms, correlation matrices, and anomaly-sensitive KPIs such as skewness and anomaly detection based on the modified z-scores. These tools allow human experts to inspect and interpret data anomalies, enabling a human-in-the-loop strategy for data quality assessment. The methodology is demonstrated on a real-world dataset from a Danish district heating provider, covering over four years of hourly data from nearly 7000 meters. The findings show how visual analytics can uncover systemic data issues and, in the future, guide data cleaning strategies that enhance the accuracy, stability, and generalizability of Long Short-Term Memory and Gated Recurrent Unit models for heat demand forecasting. The study contributes to a scalable, generalizable framework for visual data inspection and underlines the critical role of data quality in AI-driven energy management systems.
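
The modified z-score mentioned here is a standard robust statistic (Iglewicz & Hoaglin): 0.6745 * (x - median) / MAD, with |score| > 3.5 as the conventional flagging rule. A minimal sketch, with toy meter readings as a stand-in for real data:

```python
import numpy as np

def modified_zscores(x: np.ndarray) -> np.ndarray:
    """Iglewicz-Hoaglin modified z-score: 0.6745 * (x - median) / MAD.
    Values with |score| > 3.5 are conventionally flagged as anomalies."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / (mad if mad else 1.0)

readings = np.array([20.1, 20.3, 19.8, 20.0, 35.2, 20.2])  # toy meter data
print(np.abs(modified_zscores(readings)) > 3.5)            # flags the 35.2 spike
```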

[LG-29] Target Population Synthesis using CT-GAN

Link: https://arxiv.org/abs/2510.00871
Authors: Tanay Rastogi,Daniel Jonsson
Subjects: Machine Learning (cs.LG)
Comments: Submitted to a journal and under review

Click to view abstract

Abstract:Agent-based models used in scenario planning for transportation and urban planning usually require detailed population information from the base as well as target scenarios. These populations are usually provided by synthesizing fake agents through deterministic population synthesis methods. However, these deterministic population synthesis methods face several challenges, such as handling high-dimensional data, scalability, and zero-cell issues, particularly when generating populations for target scenarios. This research looks into how a deep generative model called Conditional Tabular Generative Adversarial Network (CT-GAN) can be used to create target populations either directly from a collection of marginal constraints or through a hybrid method that combines CT-GAN with Fitness-based Synthesis Combinatorial Optimization (FBS-CO). The research evaluates the proposed population synthesis models against travel survey and zonal-level aggregated population data. Results indicate that the stand-alone CT-GAN model performs the best when compared with FBS-CO and the hybrid model. CT-GAN by itself can create realistic-looking groups that match single-variable distributions, but it struggles to maintain relationships between multiple variables. However, the hybrid model demonstrates improved performance compared to FBS-CO by leveraging CT-GAN's ability to generate a descriptive base population, which is then refined using FBS-CO to align with target-year marginals. This study demonstrates that CT-GAN represents an effective methodology for target populations and highlights how deep generative models can be successfully integrated with conventional synthesis techniques to enhance their performance.

[LG-30] Population Synthesis using Incomplete Information

链接: https://arxiv.org/abs/2510.00859
作者: Tanay Rastogi,Daniel Jonsson,Anders Karlström
类目: Machine Learning (cs.LG)
*备注: Presented at 25th Euro Working Group on Transportation (EWGT) Meeting

点击查看摘要

Abstract:This paper presents a population synthesis model that utilizes the Wasserstein Generative-Adversarial Network (WGAN) for training on incomplete microsamples. By using a mask matrix to represent missing values, the study proposes a WGAN training algorithm that lets the model learn from a training dataset that has some missing information. The proposed method aims to address the challenge of missing information in microsamples on one or more attributes due to privacy concerns or data collection constraints. The paper contrasts WGAN models trained on incomplete microsamples with those trained on complete microsamples, creating a synthetic population. We conducted a series of evaluations of the proposed method using a Swedish national travel survey. We validate the efficacy of the proposed method by generating synthetic populations from all the models and comparing them to the actual population dataset. The results from the experiments showed that the proposed methodology successfully generates synthetic data that closely resembles both the output of a model trained on complete data and the actual population. The paper contributes to the field by providing a robust solution for population synthesis with incomplete data, opening avenues for future research, and highlighting the potential of deep generative models in advancing population synthesis capabilities.
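
A minimal PyTorch sketch of the masking idea follows, assuming the mask simply zeroes unobserved attributes before they reach the critic; the paper's exact training algorithm, including how the generator imputes missing entries, may differ.

```python
import torch
import torch.nn as nn

def masked_critic_loss(critic, x_real, x_fake, mask):
    """WGAN critic loss computed only on observed entries.

    mask is 1 where an attribute was observed, 0 where missing, so
    the critic never scores values that were never collected.
    (A sketch of the masking idea, not the paper's formulation.)
    """
    real_score = critic(x_real * mask).mean()
    fake_score = critic(x_fake * mask).mean()
    return fake_score - real_score   # critic minimizes this

critic = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
x_real = torch.randn(32, 8)
mask = (torch.rand(32, 8) > 0.3).float()   # ~30% of attributes missing
x_fake = torch.randn(32, 8)                # stand-in generator output
print(masked_critic_loss(critic, x_real, x_fake, mask))
```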

[LG-31] LLM Routing with Dueling Feedback

链接: https://arxiv.org/abs/2510.00841
作者: Chao-Kai Chiang,Takashi Ishida,Masashi Sugiyama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study LLM routing, the problem of selecting the best model for each query while balancing user satisfaction, model expertise, and inference cost. We formulate routing as contextual dueling bandits, learning from pairwise preference feedback rather than absolute scores, thereby yielding label-efficient and dynamic adaptation. Building on this formulation, we introduce Category-Calibrated Fine-Tuning (CCFT), a representation-learning method that derives model embeddings from offline data using contrastive fine-tuning with categorical weighting. These embeddings enable the practical instantiation of Feel-Good Thompson Sampling for Contextual Dueling Bandits (FGTS.CDB), a theoretically grounded posterior-sampling algorithm. We propose four variants of the categorical weighting that explicitly integrate model quality and cost, and we empirically evaluate the proposed methods on the RouterBench and MixInstruct datasets. Across both benchmarks, our methods achieve lower cumulative regret and faster convergence, with better robustness and performance-cost balance than strong baselines built with a general-purpose OpenAI embedding model.

[LG-32] Learn to Guide Your Diffusion Model

链接: https://arxiv.org/abs/2510.00815
作者: Alexandre Galashov,Ashwini Pokle,Arnaud Doucet,Arthur Gretton,Mauricio Delbracio,Valentin De Bortoli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) is a widely used technique for improving the perceptual quality of samples from conditional diffusion models. It operates by linearly combining conditional and unconditional score estimates using a guidance weight \omega. While a large, static weight can markedly improve visual results, this often comes at the cost of poorer distributional alignment. In order to better approximate the target conditional distribution, we instead learn guidance weights \omega_c(s,t), which are continuous functions of the conditioning c, the time t from which we denoise, and the time s towards which we denoise. We achieve this by minimizing the distributional mismatch between noised samples from the true conditional distribution and samples from the guided diffusion process. We extend our framework to reward-guided sampling, enabling the model to target distributions tilted by a reward function R(x_0,c), defined on clean data and a conditioning c. We demonstrate the effectiveness of our methodology on low-dimensional toy examples and high-dimensional image settings, where we observe improvements in Fréchet inception distance (FID) for image generation. In text-to-image applications, we observe that employing a reward function given by the CLIP score leads to guidance weights that improve image-prompt alignment.
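
The underlying CFG combination is a one-liner; the sketch below only swaps the static scalar for a learned weight, with shapes and the broadcasting convention as illustrative assumptions.

```python
import torch

def guided_score(eps_uncond, eps_cond, w):
    """Classifier-free guidance with a learned weight.

    Standard CFG uses a static scalar omega:
        eps = eps_uncond + omega * (eps_cond - eps_uncond)
    Here w = omega_c(s, t) would come from a small network conditioned
    on (c, s, t), broadcast over the sample dimensions.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = torch.randn(2, 4, 8, 8)
eps_c = torch.randn(2, 4, 8, 8)
w = torch.full((2, 1, 1, 1), 1.5)   # stand-in for weight_net(c, s, t)
print(guided_score(eps_u, eps_c, w).shape)
```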

[LG-33] Are Time Series Foundation Models Susceptible to Catastrophic Forgetting?

链接: https://arxiv.org/abs/2510.00809
作者: Nouha Karaouli(1),Denis Coquenet(2),Elisa Fromont(1),Martial Mermillod(3),Marina Reyboz(4) ((1) Univ. Rennes, CNRS, Inria, Rennes, France, (2) Univ. Rennes, CNRS, IRISA - UMR 6074, Rennes, France, (3) Univ. Grenoble Alpes, Univ. Savoie Mont Blanc, CNRS, LPNC, Grenoble, France, (4) Univ. Grenoble Alpes, CEA, LIST, Grenoble, France)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time Series Foundation Models (TSFMs) have shown promising zero-shot generalization across diverse forecasting tasks. However, their robustness to continual adaptation remains underexplored. In this work, we investigate the extent to which TSFMs suffer from catastrophic forgetting when fine-tuned sequentially on multiple datasets. Using synthetic datasets designed with varying degrees of periodic structure, we measure the trade-off between adaptation to new data and retention of prior knowledge. Our experiments reveal that, while fine-tuning improves performance on new tasks, it often causes significant degradation on previously learned ones, illustrating a fundamental stability-plasticity dilemma.

[LG-34] Online Minimization of Polarization and Disagreement via Low-Rank Matrix Bandits

链接: https://arxiv.org/abs/2510.00803
作者: Federico Cinus,Yuko Kuroki,Atsushi Miyauchi,Francesco Bonchi
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:We study the problem of minimizing polarization and disagreement in the Friedkin-Johnsen opinion dynamics model under incomplete information. Unlike prior work that assumes a static setting with full knowledge of users’ innate opinions, we address the more realistic online setting where innate opinions are unknown and must be learned through sequential observations. This novel setting, which naturally mirrors periodic interventions on social media platforms, is formulated as a regret minimization problem, establishing a key connection between algorithmic interventions on social media platforms and the theory of multi-armed bandits. In our formulation, a learner observes only a scalar feedback of the overall polarization and disagreement after an intervention. For this novel bandit problem, we propose a two-stage algorithm based on low-rank matrix bandits. The algorithm first performs subspace estimation to identify an underlying low-dimensional structure, and then employs a linear bandit algorithm within the compact dimensional representation derived from the estimated subspace. We prove that our algorithm achieves an \widetilde{O}(\sqrt{T}) cumulative regret over any time horizon T. Empirical results validate that our algorithm significantly outperforms a linear bandit baseline in terms of both cumulative regret and running time.

[LG-35] Guiding Evolutionary Molecular Design: Adding Reinforcement Learning for Mutation Selection ICTAI2025

链接: https://arxiv.org/abs/2510.00802
作者: Gaelle Milon-Harnois,Chaimaa Touhami,Nicolas Gutowski,Benoit Da Mota,Thomas Cauchy
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, Accepted for publication in the proceedings of ICTAI 2025

点击查看摘要

Abstract:The efficient exploration of chemical space remains a central challenge, as many generative models still produce unstable or non-synthesizable compounds. To address these limitations, we present EvoMol-RL, a significant extension of the EvoMol evolutionary algorithm that integrates reinforcement learning to guide molecular mutations based on local structural context. By leveraging Extended Connectivity Fingerprints (ECFPs), EvoMol-RL learns context-aware mutation policies that prioritize chemically plausible transformations. This approach significantly improves the generation of valid and realistic molecules, reducing the frequency of structural artifacts and enhancing optimization performance. The results demonstrate that EvoMol-RL consistently outperforms its baseline in molecular pre-filtering realism. These results emphasize the effectiveness of combining reinforcement learning with molecular fingerprints to generate chemically relevant molecular structures.
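
A minimal sketch of extracting the ECFP features the method conditions on, via RDKit; radius 2 and 2048 bits are common defaults, not necessarily EvoMol-RL's settings.

```python
# pip install rdkit
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol, as a toy example
# ECFP4-style Morgan fingerprint: radius 2, 2048 bits
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits())  # number of set bits encoding local substructures
```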

[LG-36] Complex System Exploration with Interactive Human Guidance

链接: https://arxiv.org/abs/2510.00794
作者: Bastien Morel,Clément Moulin-Frier,Pascal Barla
类目: Machine Learning (cs.LG)
*备注: 14 pages, 4 figures

点击查看摘要

Abstract:The diversity of patterns that emerge from complex systems motivates their use for scientific or artistic purposes. When exploring these systems, the challenges faced are the size of the parameter space and the strongly non-linear mapping between parameters and emerging patterns. In addition, artists and scientists who explore complex systems do so with an expectation of particular patterns. Taking these expectations into account adds a new set of challenges, which the exploration process must address. We provide design choices and their implementation to address these challenges; enabling the maximization of the diversity of patterns discovered in the user’s region of interest – which we call the constrained diversity – in a sample-efficient manner. The region of interest is expressed in the form of explicit constraints. These constraints are formulated by the user in a system-agnostic way, and their addition enables interactive system exploration leading to constrained diversity, while maintaining global diversity.

[LG-37] In-Place Feedback: A New Paradigm for Guiding LLMs in Multi-Turn Reasoning

链接: https://arxiv.org/abs/2510.00777
作者: Youngbin Choi,Minjong Lee,Saemi Moon,Seunghyuk Cho,Chaehyeon Chung,MoonJeong Park,Dongwoo Kim
类目: Machine Learning (cs.LG)
*备注: 28 pages, 23 figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly studied in the context of multi-turn reasoning, where models iteratively refine their outputs based on user-provided feedback. Such settings are crucial for tasks that require complex reasoning, yet existing feedback paradigms often rely on issuing new messages. LLMs struggle to integrate these reliably, leading to inconsistent improvements. In this work, we introduce in-place feedback, a novel interaction paradigm in which users directly edit an LLM’s previous response, and the model conditions on this modified response to generate its revision. Empirical evaluations on diverse reasoning-intensive benchmarks reveal that in-place feedback achieves better performance than conventional multi-turn feedback while using 79.1% fewer tokens. Complementary analyses on controlled environments further demonstrate that in-place feedback resolves a core limitation of multi-turn feedback: models often fail to apply feedback precisely to erroneous parts of the response, leaving errors uncorrected and sometimes introducing new mistakes into previously correct content. These findings suggest that in-place feedback offers a more natural and effective mechanism for guiding LLMs in reasoning-intensive tasks.
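
The interaction pattern can be made concrete in a few lines; the sketch below uses a hypothetical generate() call standing in for any chat-completion API, and the transcript contents are invented for illustration.

```python
import copy

# Hypothetical chat transcript; generate() is not a real API.
messages = [
    {"role": "user", "content": "Solve: 12 * 34 - 56"},
    {"role": "assistant", "content": "12 * 34 = 468, so 468 - 56 = 412."},
]

# Conventional multi-turn feedback: append a new user message.
multiturn = messages + [
    {"role": "user", "content": "12 * 34 is 408, not 468. Please fix."}
]

# In-place feedback: the user edits the erroneous span of the previous
# response directly; the model then conditions on the edited draft.
inplace = copy.deepcopy(messages)
inplace[-1]["content"] = "12 * 34 = 408, so 468 - 56 = 412."
# revision = generate(inplace)  # model propagates the correction
```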

[LG-38] Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

链接: https://arxiv.org/abs/2510.00761
作者: Yicheng Lang,Yihua Zhang,Chongyu Fan,Changsheng Wang,Jinghan Jia,Sijia Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the ‘grade’ of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
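
For concreteness, here is the classic two-point zeroth-order estimator that this line of work "downgrades" to; a minimal SPSA-style sketch, not the paper's exact procedure or its hybrid optimizer.

```python
import torch

def spsa_gradient(loss_fn, params, eps=1e-3):
    """Two-point zeroth-order (SPSA-style) gradient estimate.

    No backpropagation: only two forward evaluations along a random
    direction u, giving the noisier updates the paper links to
    harder-to-disturb loss basins.
    """
    u = torch.randn_like(params)
    loss_plus = loss_fn(params + eps * u)
    loss_minus = loss_fn(params - eps * u)
    return (loss_plus - loss_minus) / (2 * eps) * u

theta = torch.randn(10)
loss = lambda p: (p ** 2).sum()
print(spsa_gradient(loss, theta))  # one update: theta -= lr * estimate
```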

[LG-39] LEAP: Local ECT-Based Learnable Positional Encodings for Graphs

链接: https://arxiv.org/abs/2510.00757
作者: Juan Amboage,Ernst Röell,Patrick Schnider,Bastian Rieck
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) largely rely on the message-passing paradigm, where nodes iteratively aggregate information from their neighbors. Yet, standard message passing neural networks (MPNNs) face well-documented theoretical and practical limitations. Graph positional encoding (PE) has emerged as a promising direction to address these limitations. The Euler Characteristic Transform (ECT) is an efficiently computable geometric-topological invariant that characterizes shapes and graphs. In this work, we combine the differentiable approximation of the ECT (DECT) and its local variant (\ell-ECT) to propose LEAP, a new end-to-end trainable local structural PE for graphs. We evaluate our approach on multiple real-world datasets as well as on a synthetic task designed to test its ability to extract topological features. Our results underline the potential of LEAP-based encodings as a powerful component for graph representation learning pipelines.

[LG-40] How Foundational are Foundation Models for Time Series Forecasting? NEURIPS2025

链接: https://arxiv.org/abs/2510.00742
作者: Nouha Karaouli(1),Denis Coquenet(2),Elisa Fromont(1),Martial Mermillod(3),Marina Reyboz(4) ((1) Univ. Rennes, CNRS, Inria, IRISA - UMR 6074, F-35000 Rennes, France, (2) Univ. Rennes, CNRS, IRISA - UMR 6074, F-35000 Rennes, France, (3) Univ. Grenoble Alpes, Univ. Savoie Mont Blanc, CNRS, LPNC, Grenoble, France, (4) Univ. Grenoble Alpes, CEA, LIST, 38000 Grenoble, France)
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025 Workshop on Recent Advances in Time Series Foundation Models (BERT2S)

点击查看摘要

Abstract:Foundation Models are designed to serve as versatile embedding machines, with strong zero-shot capabilities and superior generalization performance when fine-tuned on diverse downstream tasks. While this is largely true for language and vision foundation models, we argue that the inherent diversity of time series data makes them less suited for building effective foundation models. We demonstrate this using forecasting as our downstream task. We show that the zero-shot capabilities of a time series foundation model are significantly influenced and tied to the specific domains it has been pretrained on. Furthermore, when applied to unseen real-world time series data, fine-tuned foundation models do not consistently yield substantially better results than smaller, dedicated models tailored to the specific forecasting task at hand, especially relative to their increased parameter count and memory footprint.

[LG-41] Discovering Communities in Continuous-Time Temporal Networks by Optimizing L-Modularity ICDM2025

链接: https://arxiv.org/abs/2510.00741
作者: Victor Brabant,Angela Bonifati,Rémy Cazabet
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: Accepted in ICDM 2025

点击查看摘要

Abstract:Community detection is a fundamental problem in network analysis, with many applications in various fields. Extending community detection to the temporal setting with exact temporal accuracy, as required by real-world dynamic data, necessitates methods specifically adapted to the temporal nature of interactions. We introduce LAGO, a novel method for uncovering dynamic communities by greedy optimization of Longitudinal Modularity, a specific adaptation of Modularity for continuous-time networks. Unlike prior approaches that rely on time discretization or assume rigid community evolution, LAGO captures the precise moments when nodes enter and exit communities. We evaluate LAGO on synthetic benchmarks and real-world datasets, demonstrating its ability to efficiently uncover temporally and topologically coherent communities.

[LG-42] TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning

链接: https://arxiv.org/abs/2510.00739
作者: Marco Bagatella,Matteo Pirotta,Ahmed Touati,Alessandro Lazaric,Andrea Tirinzoni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Latent prediction–where agents learn by predicting their own latents–has emerged as a powerful paradigm for training general representations in machine learning. In reinforcement learning (RL), this approach has been explored to define auxiliary losses for a variety of settings, including reward-based and unsupervised RL, behavior cloning, and world modeling. While existing methods are typically limited to single-task learning, one-step prediction, or on-policy trajectory data, we show that temporal difference (TD) learning enables learning representations predictive of long-term latent dynamics across multiple policies from offline, reward-free transitions. Building on this, we introduce TD-JEPA, which leverages TD-based latent-predictive representations into unsupervised RL. TD-JEPA trains explicit state and task encoders, a policy-conditioned multi-step predictor, and a set of parameterized policies directly in latent space. This enables zero-shot optimization of any reward function at test time. Theoretically, we show that an idealized variant of TD-JEPA avoids collapse with proper initialization, and learns encoders that capture a low-rank factorization of long-term policy dynamics, while the predictor recovers their successor features in latent space. Empirically, TD-JEPA matches or outperforms state-of-the-art baselines on locomotion, navigation, and manipulation tasks across 13 datasets in ExoRL and OGBench, especially in the challenging setting of zero-shot RL from pixels.

[LG-43] Comparison of Machine Learning Models to Classify Documents on Digital Development

链接: https://arxiv.org/abs/2510.00720
作者: Uvini Ranaweera,Bawun Mawitagama,Sanduni Liyanage,Sandupa Keshan,Tiloka de Silva,Supun Hewawalpita
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures, 4 tables, presented at First International Conference, DSAI 2023, Bangkok

点击查看摘要

Abstract:Automated document classification is a trending topic in Natural Language Processing (NLP) due to the extensive growth in digital databases. However, a model that fits well for a specific classification task might perform weakly for another dataset due to differences in the context. Thus, training and evaluating several models is necessary to optimise the results. This study employs a publicly available document database on worldwide digital development interventions categorised under twelve areas. Since digital interventions are still emerging, utilising NLP in the field is relatively new. Given the exponential growth of digital interventions, this research has a vast scope for improving how digital-development-oriented organisations report their work. The paper examines the classification performance of Machine Learning (ML) algorithms, including Decision Trees, k-Nearest Neighbors, Support Vector Machine, AdaBoost, Stochastic Gradient Descent, Naive Bayes, and Logistic Regression. Accuracy, precision, recall and F1-score are utilised to evaluate the performance of these models, while oversampling is used to address the class-imbalanced nature of the dataset. Deviating from the traditional approach of fitting a single model for multiclass classification, this paper investigates the One vs Rest approach to build a combined model that optimises the performance. The study concludes that the amount of data is not the sole factor affecting the performance; features like similarity within classes and dissimilarity among classes are also crucial.
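
A minimal scikit-learn sketch of the One vs Rest setup described above; the documents and labels are hypothetical stand-ins for the digital-development dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical documents and their digital-development category
docs = ["mobile banking pilot in rural areas", "e-health records rollout"]
labels = ["digital finance", "digital health"]

# One binary classifier per class, combined into a multiclass model
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(docs, labels)
print(clf.predict(["telemedicine platform for clinics"]))
```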

[LG-44] Physics-Informed Extreme Learning Machine (PIELM) for Tunnelling-Induced Soil-Pile Interactions

链接: https://arxiv.org/abs/2510.00698
作者: Fu-Chen Guo,Pei-Zhi Zhuang,Fei Ren,Hong-Ya Yue,He Yang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Physics-informed machine learning has been a promising data-driven and physics-informed approach in geotechnical engineering. This study proposes a physics-informed extreme learning machine (PIELM) framework for analyzing tunneling-induced soil-pile interactions. The pile foundation is modeled as an Euler-Bernoulli beam, and the surrounding soil is modeled as a Pasternak foundation. The soil-pile interaction is formulated into a fourth-order ordinary differential equation (ODE) that constitutes the physics-informed component, while measured data are incorporated into PIELM as the data-driven component. Combining physics and data yields a loss vector of the extreme learning machine (ELM) network, which is trained within 1 second by the least squares method. After validating the PIELM approach by the boundary element method (BEM) and finite difference method (FDM), parametric studies are carried out to examine the effects of ELM network architecture, data monitoring locations and numbers on the performance of PIELM. The results indicate that monitored data should be placed at positions where the gradients of pile deflections are significant, such as at the pile tip/top and near tunneling zones. Two application examples highlight the critical role of physics-informed and data-driven approach for tunnelling-induced soil-pile interactions. The proposed approach shows great potential for real-time monitoring and safety assessment of pile foundations, and benefits for intelligent early-warning systems in geotechnical engineering.
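
The ELM core, random hidden weights plus one least-squares solve, is a few lines of NumPy, which is why PIELM trains in under a second; the sketch below fits data only and omits PIELM's physics-residual terms in the loss vector.

```python
import numpy as np

def fit_elm(x, y, n_hidden=200, seed=0):
    """Extreme learning machine: random hidden layer + least squares.

    Only the output weights beta are trained, via a single lstsq
    solve over the random nonlinear features.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(x.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    h = np.tanh(x @ w + b)                       # random features
    beta, *_ = np.linalg.lstsq(h, y, rcond=None)
    return w, b, beta

def predict_elm(x, w, b, beta):
    return np.tanh(x @ w + b) @ beta

# Toy 1D regression along a normalized pile-depth coordinate
x = np.linspace(0, 1, 50)[:, None]
y = np.sin(2 * np.pi * x[:, 0])
params = fit_elm(x, y)
print(np.max(np.abs(predict_elm(x, *params) - y)))  # max fit error
```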

[LG-45] Error Feedback for Muon and Friends

链接: https://arxiv.org/abs/2510.00643
作者: Kaja Gruntkowska,Alexander Gaponov,Zhirayr Tovmasyan,Peter Richtárik
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent optimizers like Muon, Scion, and Gluon have pushed the frontier of large-scale deep learning by exploiting layer-wise linear minimization oracles (LMOs) over non-Euclidean norm balls, capturing neural network structure in ways traditional algorithms cannot. Yet, no principled distributed framework exists for these methods, and communication bottlenecks remain unaddressed. The very few distributed variants are heuristic, with no convergence guarantees in sight. We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. EF21-Muon supports stochastic gradients, momentum, and bidirectional compression with error feedback, marking the first extension of error feedback beyond the Euclidean setting. It recovers Muon/Scion/Gluon when compression is off and specific norms are chosen, providing the first efficient distributed implementation of this powerful family. Our theory covers the non-Euclidean smooth and the more general (L^0, L^1)-smooth setting, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices. We further extend the analysis to layer-wise (generalized) smoothness regimes, capturing the anisotropic structure of deep networks. Experiments on NanoGPT benchmarking EF21-Muon against uncompressed Muon/Scion/Gluon demonstrate up to 7\times communication savings with no accuracy degradation.
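
To make the error-feedback template concrete, a minimal Euclidean EF21-style sketch follows; the non-Euclidean LMO geometry that defines EF21-Muon is not reproduced, and the top-k compressor is just one common choice.

```python
import torch

def ef21_step(grad, g_state, compressor):
    """EF21-style error-feedback update for one worker.

    Instead of sending grad directly, the worker sends a compressed
    correction c and maintains a local gradient estimate g:
        c = C(grad - g);  g <- g + c
    The server aggregates the workers' g estimates.
    """
    c = compressor(grad - g_state)
    g_state = g_state + c
    return c, g_state

def topk_compressor(x, k=100):
    """Keep the k largest-magnitude entries, zero the rest."""
    flat = x.flatten()
    out = torch.zeros_like(flat)
    idx = flat.abs().topk(min(k, flat.numel())).indices
    out[idx] = flat[idx]
    return out.view_as(x)

# Usage: c, g = ef21_step(grad, g, lambda t: topk_compressor(t, k=100))
```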

[LG-46] Multi-Agent Stage-wise Conservative Linear Bandits

链接: https://arxiv.org/abs/2510.00602
作者: Amirhoseein Afsharrad,Ahmadreza Moradipari,Sanjay Lall
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of N agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than (1-\alpha) times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret \widetilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right) with high probability, where d is the dimension, T is the horizon, and |\lambda_2| is the network’s second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields a \frac{1}{\sqrt{N}} improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.

[LG-47] Designing Ambiguity Sets for Distributionally Robust Optimization Using Structural Causal Optimal Transport

链接: https://arxiv.org/abs/2510.00599
作者: Ahmad-Reza Ehyaei,Golnoosh Farnadi,Samira Samadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributionally robust optimization tackles out-of-sample issues like overfitting and distribution shifts by adopting an adversarial approach over a range of possible data distributions, known as the ambiguity set. To balance conservatism and accuracy, these sets must include realistic probability distributions by leveraging information from the nominal distribution. Assuming that nominal distributions arise from a structural causal model with a directed acyclic graph \mathcal{G} and structural equations, previous methods such as adapted and \mathcal{G}-causal optimal transport have only utilized causal graph information in designing ambiguity sets. In this work, we propose incorporating structural equations, which include causal graph information, to enhance ambiguity sets, resulting in more realistic distributions. We introduce structural causal optimal transport and its associated ambiguity set, demonstrating their advantages and connections to previous methods. A key benefit of our approach is a relaxed version, where a regularization term replaces the complex causal constraints, enabling an efficient algorithm via difference-of-convex programming to solve structural causal optimal transport. We also show that when structural information is absent and must be estimated, our approach remains effective and provides finite sample guarantees. Lastly, we address the radius of ambiguity sets, illustrating how our method overcomes the curse of dimensionality in optimal transport problems, achieving faster shrinkage with dimension-free order.

[LG-48] Probability calibration for precipitation nowcasting NEURIPS2025

链接: https://arxiv.org/abs/2510.00594
作者: Lauri Kurki,Yaniel Cabrera,Samu Karanko
类目: Machine Learning (cs.LG)
*备注: Submitted to NeurIPS 2025 Workshop: Tackling Climate Change with Machine Learning

点击查看摘要

Abstract:Reliable precipitation nowcasting is critical for weather-sensitive decision-making, yet neural weather models (NWMs) can produce poorly calibrated probabilistic forecasts. Standard calibration metrics such as the expected calibration error (ECE) fail to capture miscalibration across precipitation thresholds. We introduce the expected thresholded calibration error (ETCE), a new metric that better captures miscalibration in ordered classes like precipitation amounts. We extend post-processing techniques from computer vision to the forecasting domain. Our results show that selective scaling with lead time conditioning reduces model miscalibration without reducing the forecast quality.
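
As a reference point, below is the standard binned ECE for a single binary exceedance event; per the abstract, ETCE aggregates this kind of check across ordered precipitation thresholds, and that aggregation is not reproduced here.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Standard binned ECE for one binary event
    (e.g. 'precipitation above 1 mm/h')."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.minimum(np.digitize(probs, bins) - 1, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # bin weight times |mean confidence - empirical frequency|
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return ece

p = np.array([0.9, 0.8, 0.2, 0.1])
y = np.array([1, 1, 0, 1])
print(expected_calibration_error(p, y))
```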

[LG-49] Private Online Learning against an Adaptive Adversary: Realizable and Agnostic Settings

链接: https://arxiv.org/abs/2510.00574
作者: Bo Li,Wei Wang,Peng Ye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We revisit the problem of private online learning, in which a learner receives a sequence of T data points and has to respond with a hypothesis at each time-step. It is required that the entire stream of output hypotheses should satisfy differential privacy. Prior work of Golowich and Livni [2021] established that every concept class \mathcal{H} with finite Littlestone dimension d is privately online learnable in the realizable setting. In particular, they proposed an algorithm that achieves an O_d(\log T) mistake bound against an oblivious adversary. However, their approach yields a suboptimal \tilde{O}_d(\sqrt{T}) bound against an adaptive adversary. In this work, we present a new algorithm with a mistake bound of O_d(\log T) against an adaptive adversary, closing this gap. We further investigate the problem in the agnostic setting, which is more general than the realizable setting as it does not impose any assumptions on the data. We give an algorithm that obtains a sublinear regret of \tilde{O}_d(\sqrt{T}) for generic Littlestone classes, demonstrating that they are also privately online learnable in the agnostic setting.

[LG-50] IntrusionX: A Hybrid Convolutional-LSTM Deep Learning Framework with Squirrel Search Optimization for Network Intrusion Detection

链接: https://arxiv.org/abs/2510.00572
作者: Ahsan Farabi,Muhaiminul Rashid Shad,Israt Khandaker
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Intrusion Detection Systems (IDS) face persistent challenges due to evolving cyberattacks, high-dimensional traffic data, and severe class imbalance in benchmark datasets such as NSL-KDD. To address these issues, we propose IntrusionX, a hybrid deep learning framework that integrates Convolutional Neural Networks (CNNs) for local feature extraction and Long Short-Term Memory (LSTM) networks for temporal modeling. The architecture is further optimized using the Squirrel Search Algorithm (SSA), enabling effective hyperparameter tuning while maintaining computational efficiency. Our pipeline incorporates rigorous preprocessing, stratified data splitting, and dynamic class weighting to enhance the detection of rare classes. Experimental evaluation on NSL-KDD demonstrates that IntrusionX achieves 98% accuracy in binary classification and 87% in 5-class classification, with significant improvements in minority class recall (U2R: 71%, R2L: 93%). The novelty of IntrusionX lies in its reproducible, imbalance-aware design with metaheuristic optimization.
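
A minimal PyTorch sketch of a CNN-LSTM backbone of this kind follows; the layer sizes are illustrative assumptions, not the SSA-tuned values from the paper.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Hybrid intrusion-detection backbone in the spirit of IntrusionX:
    1D convolutions extract local feature patterns, an LSTM models
    their sequence, a linear head classifies the traffic record."""
    def __init__(self, n_features=41, n_classes=5):  # NSL-KDD: 41 features
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):           # x: (batch, n_features)
        x = x.unsqueeze(1)          # -> (batch, 1, n_features)
        x = self.conv(x)            # -> (batch, 32, n_features // 2)
        x = x.transpose(1, 2)       # -> (batch, seq, 32)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])

logits = CNNLSTM()(torch.randn(8, 41))
print(logits.shape)  # torch.Size([8, 5])
```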

[LG-51] Interpretable Machine Learning for Life Expectancy Prediction: A Comparative Study of Linear Regression Decision Tree and Random Forest

链接: https://arxiv.org/abs/2510.00542
作者: Roman Dolgopolyi,Ioanna Amaslidou,Agrippina Margaritou
类目: Machine Learning (cs.LG)
*备注: 20 pages, 15 figures, 3 tables

点击查看摘要

Abstract:Life expectancy is a fundamental indicator of population health and socio-economic well-being, yet accurately forecasting it remains challenging due to the interplay of demographic, environmental, and healthcare factors. This study evaluates three machine learning models, Linear Regression (LR), Regression Decision Tree (RDT), and Random Forest (RF), using a real-world dataset drawn from World Health Organization (WHO) and United Nations (UN) sources. After extensive preprocessing to address missing values and inconsistencies, each model’s performance was assessed with R^2 , Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). Results show that RF achieves the highest predictive accuracy ( R^2 = 0.9423 ), significantly outperforming LR and RDT. Interpretability was prioritized through p-values for LR and feature importance metrics for the tree-based models, revealing immunization rates (diphtheria, measles) and demographic attributes (HIV/AIDS, adult mortality) as critical drivers of life-expectancy predictions. These insights underscore the synergy between ensemble methods and transparency in addressing public-health challenges. Future research should explore advanced imputation strategies, alternative algorithms (e.g., neural networks), and updated data to further refine predictive accuracy and support evidence-based policymaking in global health contexts.
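
The comparison itself is straightforward to reproduce with scikit-learn; the placeholder arrays below stand in for the preprocessed WHO/UN features, which are not included here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Placeholder data; real X, y would be WHO/UN features and life expectancy
X, y = np.random.rand(200, 10), np.random.rand(200) * 30 + 50
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for name, model in [
    ("LR", LinearRegression()),
    ("RDT", DecisionTreeRegressor(random_state=42)),
    ("RF", RandomForestRegressor(random_state=42)),
]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(name, r2_score(y_te, pred), mean_absolute_error(y_te, pred), rmse)
```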

[LG-52] Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space? EMNLP2025

链接: https://arxiv.org/abs/2510.00537
作者: Nandan Kumar Jha,Brandon Reagen
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: EMNLP 2025 Main Conference (Long paper)

点击查看摘要

Abstract:As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite – Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI) – we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
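
The two diagnostics are easy to compute from a weight matrix's singular values; whether the paper normalizes by sigma or sigma squared is an assumption in this sketch, and the qualitative hard/soft contrast is the point.

```python
import numpy as np

def spectral_ranks(weight):
    """Hard rank (participation ratio) and soft rank (Shannon rank)
    of a weight matrix's singular value spectrum."""
    s = np.linalg.svd(weight, compute_uv=False)
    energy = s ** 2
    hard_rank = energy.sum() ** 2 / (energy ** 2).sum()  # participation ratio
    p = energy / energy.sum()
    soft_rank = np.exp(-(p * np.log(p + 1e-12)).sum())   # exp(Shannon entropy)
    return hard_rank, soft_rank

w = np.random.randn(1024, 4096) / np.sqrt(1024)  # stand-in FFN weight
print(spectral_ranks(w))
```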

[LG-53] Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness

链接: https://arxiv.org/abs/2510.00517
作者: Tsubasa Takahashi,Shojiro Yamabe,Futa Waseda,Kento Sasaki
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Differential Attention (DA) has been proposed as a refinement to standard attention, suppressing redundant or noisy context through a subtractive structure and thereby reducing contextual hallucination. While this design sharpens task-relevant focus, we show that it also introduces a structural fragility under adversarial perturbations. Our theoretical analysis identifies negative gradient alignment-a configuration encouraged by DA’s subtraction-as the key driver of sensitivity amplification, leading to increased gradient norms and elevated local Lipschitz constants. We empirically validate this Fragile Principle through systematic experiments on ViT/DiffViT and evaluations of pretrained CLIP/DiffCLIP, spanning five datasets in total. These results demonstrate higher attack success rates, frequent gradient opposition, and stronger local sensitivity compared to standard attention. Furthermore, depth-dependent experiments reveal a robustness crossover: stacking DA layers attenuates small perturbations via depth-dependent noise cancellation, though this protection fades under larger attack budgets. Overall, our findings uncover a fundamental trade-off: DA improves discriminative focus on clean inputs but increases adversarial vulnerability, underscoring the need to jointly design for selectivity and robustness in future attention mechanisms.
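
For reference, a minimal sketch of DA's subtractive attention map, following the published differential-attention formulation at a high level; head splitting and the learned parameterization of lambda are omitted.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Differential attention: two softmax maps, the second subtracted
    to cancel common-mode noise. The subtraction is also what drives
    the negative gradient alignment analyzed in the paper."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v

q1, k1 = torch.randn(2, 16, 32), torch.randn(2, 16, 32)
q2, k2 = torch.randn(2, 16, 32), torch.randn(2, 16, 32)
v = torch.randn(2, 16, 32)
print(differential_attention(q1, k1, q2, k2, v).shape)
```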

[LG-54] Diffusion Alignment as Variational Expectation-Maximization

链接: https://arxiv.org/abs/2510.00502
作者: Jaewoo Lee,Minsu Kim,Sanghyeok Choi,Inhyuck Song,Sujin Yun,Hyeongyu Kang,Woocheol Shin,Taeyoung Yun,Kiyoung Om,Jinkyoo Park
类目: Machine Learning (cs.LG)
*备注: 30 pages, 11 figures, 2 tables

点击查看摘要

Abstract:Diffusion alignment aims to optimize diffusion models for the downstream objective. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity for both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.

[LG-55] Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation NEURIPS2025

链接: https://arxiv.org/abs/2510.00478
作者: Jing Wang,Wonho Bae,Jiahong Chen,Wenxu Wang,Junhyug Noh
类目: Machine Learning (cs.LG)
*备注: 32 pages, 6 figures, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature’s label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature’s latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier’s accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve.

[LG-56] Robust Spatiotemporally Contiguous Anomaly Detection Using Tensor Decomposition

链接: https://arxiv.org/abs/2510.00460
作者: Rachita Mondal,Mert Indibi,Tapabrata Maiti,Selin Aviyente
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Anomaly detection in spatiotemporal data is a challenging problem encountered in a variety of applications, including video surveillance, medical imaging data, and urban traffic monitoring. Existing anomaly detection methods focus mainly on point anomalies and cannot deal with temporal and spatial dependencies that arise in spatio-temporal data. Tensor-based anomaly detection methods have been proposed to address this problem. Although existing methods can capture dependencies across different modes, they are primarily supervised and do not account for the specific structure of anomalies. Moreover, these methods focus mainly on extracting anomalous features without providing any statistical confidence. In this paper, we introduce an unsupervised tensor-based anomaly detection method that simultaneously considers the sparse and spatiotemporally smooth nature of anomalies. The anomaly detection problem is formulated as a regularized robust low-rank + sparse tensor decomposition where the total variation of the tensor with respect to the underlying spatial and temporal graphs quantifies the spatiotemporal smoothness of the anomalies. Once the anomalous features are extracted, we introduce a statistical anomaly scoring framework that accounts for local spatio-temporal dependencies. The proposed framework is evaluated on both synthetic and real data.

[LG-57] Randomized Matrix Sketching for Neural Network Training and Gradient Monitoring

链接: https://arxiv.org/abs/2510.00442
作者: Harbir Antil,Deepanshu Verma
类目: Machine Learning (cs.LG)
*备注: 21 pages, 5 figures, 1 table

点击查看摘要

Abstract:Neural network training relies on gradient computation through backpropagation, yet memory requirements for storing layer activations present significant scalability challenges. We present the first adaptation of control-theoretic matrix sketching to neural network layer activations, enabling memory-efficient gradient reconstruction in backpropagation. This work builds on recent matrix sketching frameworks for dynamic optimization problems, where similar state trajectory storage challenges motivate sketching techniques. Our approach sketches layer activations using three complementary sketch matrices maintained through exponential moving averages (EMA) with adaptive rank adjustment, automatically balancing memory efficiency against approximation quality. Empirical evaluation on MNIST, CIFAR-10, and physics-informed neural networks demonstrates a controllable accuracy-memory tradeoff. We demonstrate a gradient monitoring application on MNIST showing how sketched activations enable real-time gradient norm tracking with minimal memory overhead. These results establish that sketched activation storage provides a viable path toward memory-efficient neural network training and analysis.

[LG-58] Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs

链接: https://arxiv.org/abs/2510.00419
作者: Kairun Zhang,Haoyu Li,Yanjun Zhao,Yifan Sun,Huan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zeroth-order optimizers have recently emerged as a practical approach for fine-tuning large language models (LLMs), significantly reducing GPU memory consumption compared to traditional first-order methods. Yet, existing zeroth-order methods rely on hand-crafted, static sampling strategies that are not adaptable to model-specific structures. To address this, we propose ZO Fine-tuner, a learning-based zeroth-order optimizer for LLMs that automatically learns efficient perturbation strategies through a compact and memory-efficient design. Crucially, our approach is motivated by the observation that only a small number of foundation models and their derivatives are widely adopted in practice. Therefore, learning the optimizer once for a given LLM and reusing it across diverse downstream tasks is both feasible and highly desirable. Accordingly, ZO Fine-tuner is designed to scale learning to learn (L2L) to the foundation-model era by supporting one-time training per LLM with minimal overhead. Experiments on 4 LLMs and 7 datasets show that ZO Fine-tuner outperforms prior zeroth-order baselines in 82.1% of task-model combinations, thereby demonstrating strong performance and scalability for efficient LLM fine-tuning. Our code is available at this https URL.

[LG-59] Hierarchy-Aware Neural Subgraph Matching with Enhanced Similarity Measure

链接: https://arxiv.org/abs/2510.00402
作者: Zhouyang Liu,Ning Liu,Yixin Chen,Jiezhong He,Menghan Jia,Dongsheng Li
类目: Machine Learning (cs.LG)
*备注: Accepted by IEEE Transactions on Knowledge and Data Engineering

点击查看摘要

Abstract:Subgraph matching is challenging as it necessitates time-consuming combinatorial searches. Recent Graph Neural Network (GNN)-based approaches address this issue by employing GNN encoders to extract graph information and hinge distance measures to ensure containment constraints in the embedding space. These methods significantly shorten the response time, making them promising solutions for subgraph retrieval. However, they suffer from scale differences between graph pairs during encoding, as they focus on feature counts but overlook the relative positions of features within node-rooted subtrees, leading to disturbed containment constraints and false predictions. Additionally, their hinge distance measures lack discriminative power for matched graph pairs, hindering ranking applications. We propose NC-Iso, a novel GNN architecture for neural subgraph matching. NC-Iso preserves the relative positions of features by building the hierarchical dependencies between adjacent echelons within node-rooted subtrees, ensuring matched graph pairs maintain consistent hierarchies while complying with containment constraints in feature counts. To enhance the ranking ability for matched pairs, we introduce a novel similarity dominance ratio-enhanced measure, which quantifies the dominance of similarity over dissimilarity between graph pairs. Empirical results on nine datasets validate the effectiveness, generalization ability, scalability, and transferability of NC-Iso while maintaining time efficiency, offering a more discriminative neural subgraph matching solution for subgraph retrieval. Code available at this https URL.

[LG-60] Can Mamba Learn In Context with Outliers? A Theoretical Generalization Analysis

链接: https://arxiv.org/abs/2510.00399
作者: Hongkang Li,Songtao Lu,Xiaodong Cui,Pin-Yu Chen,Meng Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Mamba model has gained significant attention for its computational advantages over Transformer-based models, while achieving comparable performance across a wide range of language tasks. Like Transformers, Mamba exhibits in-context learning (ICL) capabilities, i.e., making predictions for new tasks based on a prompt containing input-label pairs and a query, without requiring fine-tuning. Despite its empirical success, the theoretical understanding of Mamba remains limited, largely due to the nonlinearity introduced by its gating mechanism. To the best of our knowledge, this paper presents the first theoretical analysis of the training dynamics of a one-layer Mamba model, which consists of a linear attention component followed by a nonlinear gating layer, and its ICL generalization on unseen binary classification tasks, even when the prompt includes additive outliers. Our analysis shows that Mamba leverages the linear attention layer to select informative context examples and uses the nonlinear gating layer to suppress the influence of outliers. By establishing and comparing to the analysis of linear Transformers under the same setting, we show that although Mamba may require more training iterations to converge, it maintains accurate predictions even when the proportion of outliers exceeds the threshold that a linear Transformer can tolerate. These theoretical findings are supported by empirical experiments.

[LG-61] Graph2Region: Efficient Graph Similarity Learning with Structure and Scale Restoration

链接: https://arxiv.org/abs/2510.00394
作者: Zhouyang Liu,Yixin Chen,Ning Liu,Jiezhong He,Dongsheng Li
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Accepted by IEEE Transactions on Knowledge and Data Engineering

点击查看摘要

Abstract:Graph similarity is critical in graph-related tasks such as graph retrieval, where metrics like maximum common subgraph (MCS) and graph edit distance (GED) are commonly used. However, exact computations of these metrics are known to be NP-Hard. Recent neural network-based approaches approximate the similarity score in embedding spaces to alleviate the computational burden, but they either involve expensive pairwise node comparisons or fail to effectively utilize structural and scale information of graphs. To tackle these issues, we propose a novel geometric-based graph embedding method called Graph2Region (G2R). G2R represents nodes as closed regions and recovers their adjacency patterns within graphs in the embedding space. By incorporating the node features and adjacency patterns of graphs, G2R summarizes graph regions, i.e., graph embeddings, where the shape captures the underlying graph structures and the volume reflects the graph size. Consequently, the overlap between graph regions can serve as an approximation of MCS, signifying similar node regions and adjacency patterns. We further analyze the relationship between MCS and GED and propose using disjoint parts as a proxy for GED similarity. This analysis enables concurrent computation of MCS and GED, incorporating local and global structural information. Experimental evaluation highlights G2R’s competitive performance in graph similarity computation. It achieves up to a 60.0% relative accuracy improvement over state-of-the-art methods in MCS similarity learning, while maintaining efficiency in both training and inference. Moreover, G2R showcases remarkable capability in predicting both MCS and GED similarities simultaneously, providing a holistic assessment of graph similarity. Code available at this https URL.

[LG-62] Bayesian Distributional Models of Executive Functioning

链接: https://arxiv.org/abs/2510.00387
作者: Robert Kasumba,Zeyu Lu,Dom CP Marticorena,Mingyang Zhong,Paul Beggs,Anja Pahor,Geetha Ramani,Imani Goffney,Susanne M Jaeggi,Aaron R Seitz,Jacob R Gardner,Dennis L Barbour
类目: Machine Learning (cs.LG)
*备注: 42 pages, 8 figures, 1 table

点击查看摘要

Abstract:Estimation (IMLE). DLVM integrates observations across multiple executive function tasks and individuals, allowing parameter estimation even under sparse or incomplete data conditions. DLVM consistently outperformed IMLE, especially with smaller amounts of data, and converged faster to highly accurate estimates of the true distributions. In a second set of analyses, DALE adaptively guided sampling to maximize information gain, outperforming random sampling and fixed test batteries, particularly within the first 80 trials. These findings establish the advantages of combining DLVM's cross-task inference with DALE's optimal adaptive sampling, providing a principled basis for more efficient cognitive assessments.

[LG-63] Learning Passive Continuous-Time Dynamics with Multistep Port-Hamiltonian Gaussian Processes

链接: https://arxiv.org/abs/2510.00384
作者: Chi Ho Leung,Philip E. Paré
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We propose the multistep port-Hamiltonian Gaussian process (MS-PHS GP) to learn physically consistent continuous-time dynamics and a posterior over the Hamiltonian from noisy, irregularly-sampled trajectories. By placing a GP prior on the Hamiltonian surface H and encoding variable-step multistep integrator constraints as finite linear functionals, MS-PHS GP enables closed-form conditioning of both the vector field and the Hamiltonian surface without latent states, while enforcing energy balance and passivity by design. We state a finite-sample vector-field bound that separates the estimation and variable-step discretization terms. Lastly, we demonstrate improved vector-field recovery and well-calibrated Hamiltonian uncertainty on mass-spring, Van der Pol, and Duffing benchmarks.

[LG-64] Efficient Probabilistic Tensor Networks

链接: https://arxiv.org/abs/2510.00382
作者: Marawan Gamal Abdel Hameed,Guillaume Rabusseau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tensor networks (TNs) enable compact representations of large tensors through shared parameters. Their use in probabilistic modeling is particularly appealing, as probabilistic tensor networks (PTNs) allow for tractable computation of marginals. However, existing approaches for learning parameters of PTNs are either computationally demanding and not fully compatible with automatic differentiation frameworks, or numerically unstable. In this work, we propose a conceptually simple approach for learning PTNs efficiently, that is numerically stable. We show our method provides significant improvements in time and space complexity, achieving 10x reduction in latency for generative modeling on the MNIST dataset. Furthermore, our approach enables learning of distributions with 10x more variables than previous approaches when applied to a variety of density estimation benchmarks. Our code is publicly available at this http URL.

[LG-65] Composer: A Search Framework for Hybrid Neural Architecture Design

链接: https://arxiv.org/abs/2510.00379
作者: Bilge Acun,Prasoon Sinha,Newsha Ardalani,Sangmin Bae,Alicia Golden,Chien-Yu Lin,Meghana Madhyastha,Fei Sun,Neeraja J. Yadwadkar,Carole-Jean Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hybrid model architectures that combine computational primitives (e.g., Attention, MLP) in different ratios have shown promising performance beyond Transformers. Some studies have shown that different interleavings of primitives can affect model quality as well. However, prior works explore the hybrid model architecture design space manually. Due to the large design space and training costs, discovering hybrid models that combine key computational primitives for pre-training is challenging. In this work, we take a principled approach in designing a modular hybrid model architecture search framework – Composer. Composer explores model architectures at a small scale and extrapolates the top-performing model architectures to a larger scale using our proposed scaling strategies. Using Composer, we discover new hybrid LLM architectures that outperform Llama 3.2. Compared to Llama 3.2 and previous state-of-the-art baselines, the new model architectures consistently reduce validation loss at parameter scales of 350M-3B and improve evaluation accuracy on the downstream tasks by up to 2.8-8.3% (1.1-3.1% on average) while improving both training and inference efficiency.

[LG-66] Multidimensional Bayesian Active Machine Learning of Working Memory Task Performance

链接: https://arxiv.org/abs/2510.00375
作者: Dom CP Marticorena,Chris Wissmann,Zeyu Lu,Dennis L Barbour
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: 37 pages, 7 figures

点击查看摘要

Abstract:While adaptive experimental design has outgrown one-dimensional, staircase-based adaptations, most cognitive experiments still control a single factor and summarize performance with a scalar. We show a validation of a Bayesian, two-axis, active-classification approach, carried out in an immersive virtual testing environment for a 5-by-5 working-memory reconstruction task. Two variables are controlled: spatial load L (number of occupied tiles) and feature-binding load K (number of distinct colors) of items. Stimulus acquisition is guided by posterior uncertainty of a nonparametric Gaussian Process (GP) probabilistic classifier, which outputs a surface over (L, K) rather than a single threshold or max span value. In a young adult population, we compare GP-driven Adaptive Mode (AM) with a traditional adaptive staircase Classic Mode (CM), which varies L only at K = 3. Parity between the methods is achieved for this cohort, with an intraclass coefficient of 0.755 at K = 3. Additionally, AM reveals individual differences in interactions between spatial load and feature binding. AM estimates converge more quickly than other sampling strategies, demonstrating that only about 30 samples are required for accurate fitting of the full model.

[LG-67] The Transformer Cookbook

链接: https://arxiv.org/abs/2510.00368
作者: Andy Yang,Christopher Watson,Anton Xue,Satwik Bhattamishra,Jose Llarena,William Merrill,Emile Dos Santos Ferreira,Anej Svete,David Chiang
类目: Machine Learning (cs.LG)
*备注: 39 pages

点击查看摘要

Abstract:We present the transformer cookbook: a collection of techniques for directly encoding algorithms into a transformer’s parameters. This work addresses the steep learning curve of such endeavors, a problem exacerbated by a fragmented literature where key results are scattered across numerous papers. In particular, we synthesize this disparate body of findings into a curated set of recipes that demonstrate how to implement everything from basic arithmetic in feed-forward layers to complex data routing via self-attention. Our mise en place of formulations is for both newcomers seeking an accessible entry point and experts in need of a systematic reference. This unified presentation of transformer constructions provides a foundation for future work spanning theoretical research in computational complexity to empirical investigations in architecture design and interpretability.

[LG-68] Continual Learning with Query-Only Attention

链接: https://arxiv.org/abs/2510.00365
作者: Gautham Bekal,Ashish Pujari,Scott David Kelly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning involves learning from a stream of data without repetition of data points, a scenario that is inherently complex due to distributional shift across tasks. We propose a query-only attention mechanism that discards keys and values, yet preserves the core inductive bias of transformer architectures. In continual learning scenarios, this simplified mechanism significantly mitigates both loss of plasticity and catastrophic forgetting, outperforming baselines such as selective re-initialization. We establish a conceptual link between query-only attention, full transformer attention, and model agnostic meta-learning, framing them as instances of meta-learning. We further provide intuition for why query-based models and attention networks help preserve plasticity in continual settings. Finally, through preliminary Hessian spectrum analysis, we observe that models maintaining higher curvature rank across tasks tend to retain plasticity. Our findings suggest that full attention may not be essential for capturing the benefits of meta-learning in continual learning.
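
The abstract does not spell out the exact parameterization; one plausible reading of "discards keys and values" (the input itself serving as keys and values, with only a learned query projection) can be sketched in PyTorch as follows.

```python
# Sketch of a query-only attention layer: only a query projection is learned;
# the raw input plays the role of keys and values. This is one plausible
# reading of the abstract, not necessarily the authors' exact layer.
import math
import torch
import torch.nn as nn

class QueryOnlyAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # the only attention parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q = self.w_q(x)
        scores = q @ x.transpose(-2, -1) / math.sqrt(x.size(-1))  # raw input as "keys"
        attn = scores.softmax(dim=-1)
        return attn @ x                                           # raw input as "values"

x = torch.randn(2, 16, 64)
print(QueryOnlyAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```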

[LG-69] AReUReDi: Annealed Rectified Updates for Refining Discrete Flows with Multi-Objective Guidance

链接: https://arxiv.org/abs/2510.00352
作者: Tong Chen,Yinuo Zhang,Pranam Chatterjee
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Designing sequences that satisfy multiple, often conflicting, objectives is a central challenge in therapeutic and biomolecular engineering. Existing generative frameworks largely operate in continuous spaces with single-objective guidance, while discrete approaches lack guarantees for multi-objective Pareto optimality. We introduce AReUReDi (Annealed Rectified Updates for Refining Discrete Flows), a discrete optimization algorithm with theoretical guarantees of convergence to the Pareto front. Building on Rectified Discrete Flows (ReDi), AReUReDi combines Tchebycheff scalarization, locally balanced proposals, and annealed Metropolis-Hastings updates to bias sampling toward Pareto-optimal states while preserving distributional invariance. Applied to peptide and SMILES sequence design, AReUReDi simultaneously optimizes up to five therapeutic properties (including affinity, solubility, hemolysis, half-life, and non-fouling) and outperforms both evolutionary and diffusion-based baselines. These results establish AReUReDi as a powerful, sequence-based framework for multi-property biomolecule generation.
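
A toy sketch of the Tchebycheff-scalarized, annealed Metropolis-Hastings core is given below; the two objectives, single-site proposal, and annealing schedule are illustrative assumptions, and the rectified-discrete-flow and locally balanced components of AReUReDi are omitted.

```python
# Annealed MH over discrete sequences, guided by Tchebycheff scalarization.
import numpy as np

rng = np.random.default_rng(1)
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")  # amino-acid alphabet
L = 12

def f1(seq):  # toy "hydrophobicity" objective (minimize)
    return sum(c in "AILMFWV" for c in seq) / len(seq)

def f2(seq):  # toy "net charge" objective (minimize)
    return abs(sum((c in "KR") - (c in "DE") for c in seq)) / len(seq)

def tchebycheff(seq, weights=(0.5, 0.5), ideal=(0.0, 0.0)):
    # Scalarize the objective vector; minimizing this biases toward the Pareto front.
    return max(w * (f - z) for w, f, z in zip(weights, (f1(seq), f2(seq)), ideal))

seq = [rng.choice(ALPHABET) for _ in range(L)]
for step in range(2000):
    temp = max(0.01, 1.0 * (1 - step / 2000))       # linear annealing schedule
    prop = seq.copy()
    prop[rng.integers(L)] = rng.choice(ALPHABET)    # single-site mutation proposal
    delta = tchebycheff(prop) - tchebycheff(seq)
    if delta < 0 or rng.random() < np.exp(-delta / temp):  # MH acceptance
        seq = prop

print("".join(seq), f1(seq), f2(seq))
```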

[LG-70] Flow Autoencoders are Effective Protein Tokenizers

链接: https://arxiv.org/abs/2510.00351
作者: Rohit Dilip,Evan Zhang,Ayush Varshney,David Van Valen
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Protein structure tokenizers enable the creation of multimodal models of protein structure, sequence, and function. Current approaches to protein structure tokenization rely on bespoke components that are invariant to spatial symmetries, but that are challenging to optimize and scale. We present Kanzi, a flow-based tokenizer for tokenization and generation of protein structures. Kanzi consists of a diffusion autoencoder trained with a flow matching loss. We show that this approach simplifies several aspects of protein structure tokenizers: frame-based representations can be replaced with global coordinates, complex losses are replaced with a single flow matching loss, and SE(3)-invariant attention operations can be replaced with standard attention. We find that these changes stabilize the training of parameter-efficient models that outperform existing tokenizers on reconstruction metrics at a fraction of the model size and training cost. An autoregressive model trained with Kanzi outperforms similar generative models that operate over tokens, although it does not yet match the performance of state-of-the-art continuous diffusion models. Code is available here: this https URL.

[LG-71] Initial Distribution Sensitivity of Constrained Markov Decision Processes

链接: https://arxiv.org/abs/2510.00348
作者: Alperen Tercan,Necmiye Ozay
类目: Machine Learning (cs.LG)
*备注: Full version of CDC 2025 paper

点击查看摘要

Abstract:Constrained Markov Decision Processes (CMDPs) are notably more complex to solve than standard MDPs due to the absence of universally optimal policies across all initial state distributions. This necessitates re-solving the CMDP whenever the initial distribution changes. In this work, we analyze how the optimal value of CMDPs varies with different initial distributions, deriving bounds on these variations using duality analysis of CMDPs and perturbation analysis in linear programming. Moreover, we show how such bounds can be used to analyze the regret of a given policy due to unknown variations of the initial distribution.

[LG-72] Cutting the Skip: Training Residual-Free Transformers

链接: https://arxiv.org/abs/2510.00345
作者: Yiping Ji,James Martens,Jianqiao Zheng,Ziqin Zhou,Peyman Moghadam,Xinyu Zhang,Hemanth Saratchandran,Simon Lucey
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without skip (residual) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why skips improve conditioning and revealing that their stabilization benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong baselines that incorporate skip connections on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.

[LG-73] Which Programming Language and Model Work Best With LLM-as-a-Judge For Code Retrieval? SIGIR

链接: https://arxiv.org/abs/2510.00324
作者: Lucas Roberts,Denisa Roberts
类目: Software Engineering (cs.SE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted as a full paper at SIGIR-AP 2025

点击查看摘要

Abstract:Code search is an important information retrieval application. Benefits of better code search include faster new developer on-boarding, reduced software maintenance, and ease of understanding for large repositories. Despite improvements in search algorithms and search benchmarks, the domain of code search has lagged behind. One reason is the high cost of human annotation for code queries and answers. While humans may annotate search results in general text QA systems, code annotations require specialized knowledge of a programming language (PL), as well as domain specific software engineering knowledge. In this work we study the use of Large Language Models (LLMs) to retrieve code at the level of functions and to generate annotations for code search results. We compare the impact of the retriever representation (sparse vs. semantic), programming language, and LLM by comparing human annotations across several popular languages (C, Java, Javascript, Go, and Python). We focus on repositories that implement common data structures likely to be implemented in any PL. For the same human annotations, we compare several LLM-as-a-Judge models to evaluate programming language and other affinities between LLMs. We find that the chosen retriever and PL exhibit affinities that can be leveraged to improve alignment of human and AI relevance determinations, with significant performance implications. We also find differences in representation (sparse vs. semantic) across PLs that impact alignment of human and AI relevance determinations. We propose using transpilers to bootstrap scalable code search benchmark datasets in other PLs and in a case study demonstrate that human-AI relevance agreement rates largely match the (worst case) human-human agreement under study. The application code used in this work is available in this github repo: this https URL.

[LG-74] Privately Estimating Black-Box Statistics

链接: https://arxiv.org/abs/2510.00322
作者: Günter F. Steinke,Thomas Steinke
类目: Cryptography and Security (cs.CR); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard techniques for differentially private estimation, such as Laplace or Gaussian noise addition, require guaranteed bounds on the sensitivity of the estimator in question. But such sensitivity bounds are often large or simply unknown. Thus we seek differentially private methods that can be applied to arbitrary black-box functions. A handful of such techniques exist, but all are either inefficient in their use of data or require evaluating the function on exponentially many inputs. In this work we present a scheme that trades off between statistical efficiency (i.e., how much data is needed) and oracle efficiency (i.e., the number of evaluations). We also present lower bounds showing the near-optimality of our scheme.

[LG-75] DiSC-AMC: Token- and Parameter-Efficient Discretized Statistics In-Context Automatic Modulation Classification

链接: https://arxiv.org/abs/2510.00316
作者: Mohammad Rostami,Atik Faysal,Reihaneh Gh. Roshan,Huaxia Wang,Nikhil Muralidhar,Yu-Dong Yao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can perform Automatic Modulation Classification (AMC) in an open-set manner without LLM fine-tuning when equipped with carefully designed in-context prompts [rostami2025plug]. Building on this prior work, we target the practical bottlenecks of long prompt contexts and large model sizes that impede in-the-loop deployment. We present Discretized Statistics in-Context Automatic Modulation Classification (DiSC-AMC), a token- and parameter-efficient variant that: (i) discretizes higher-order statistics and cumulants into compact symbolic tokens, (ii) prunes the exemplar list via a lightweight k-top neural prefilter and filters misleading/low-impact features using rationales extracted from prior LLM responses, and (iii) enforces label-only predictions through a calibrated prompt template. Together, these changes reduce both input/output tokens and the model parameter footprint by more than half while maintaining competitive accuracy. On synthetic AMC with ten modulation types under noise, a 7B DeepSeek-R1-Distill-Qwen baseline achieves 5.2% accuracy, whereas our system, using an approximately 5B-parameter Gemini-2.5-Flash [comanici2025gemini] model, attains 45.5% accuracy. These results demonstrate that careful discretization and context selection can cut inference cost by over 2x while preserving the advantages of prompt-based AMC and enabling practical in-the-loop use.
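
Step (i), turning higher-order statistics into compact symbolic tokens, might look like the following sketch; the moment set, bin edges, and token alphabet are assumptions, not the paper's exact scheme.

```python
# Quantize higher-order moments/cumulants of an IQ signal into short symbolic
# tokens suitable for a compact LLM prompt.
import numpy as np

def higher_order_stats(iq: np.ndarray) -> np.ndarray:
    x = (iq - iq.mean()) / (iq.std() + 1e-9)
    m = [np.mean(x**k) for k in (2, 3, 4)]        # simple higher-order moments
    c40 = np.mean(x**4) - 3 * np.mean(x**2) ** 2  # a fourth-order cumulant
    return np.real(np.array(m + [c40]))

def discretize(stats: np.ndarray, n_bins: int = 5) -> str:
    edges = np.linspace(-2, 2, n_bins - 1)
    levels = "ABCDE"                               # symbolic alphabet
    return " ".join(levels[np.searchsorted(edges, s)] for s in np.clip(stats, -2, 2))

rng = np.random.default_rng(0)
qpsk = (rng.choice([1, -1], 512) + 1j * rng.choice([1, -1], 512)) / np.sqrt(2)
noisy = qpsk + 0.1 * (rng.standard_normal(512) + 1j * rng.standard_normal(512))
print(discretize(higher_order_stats(noisy)))  # a compact token string, e.g. "C C A C"
```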

[LG-76] Robust Federated Inference

链接: https://arxiv.org/abs/2510.00310
作者: Akash Dhasade,Sadegh Farhadkhani,Rachid Guerraoui,Nirupam Gupta,Maxime Jacovella,Anne-Marie Kermarrec,Rafael Pinot
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Federated inference, in the form of one-shot federated learning, edge ensembles, or federated ensembles, has emerged as an attractive solution to combine predictions from multiple models. This paradigm enables each model to remain local and proprietary while a central server queries them and aggregates predictions. Yet, the robustness of federated inference has been largely neglected, leaving these methods vulnerable to even simple attacks. To address this critical gap, we formalize the problem of robust federated inference and provide the first robustness analysis of this class of methods. Our analysis of averaging-based aggregators shows that the error of the aggregator is small either when the dissimilarity between honest responses is small or the margin between the two most probable classes is large. Moving beyond linear averaging, we show that the problem of robust federated inference with non-linear aggregators can be cast as an adversarial machine learning problem. We then introduce an advanced technique using the DeepSet aggregation model, proposing a novel composition of adversarial training and test-time robust aggregation to robustify non-linear aggregators. Our composition yields significant improvements, surpassing existing robust aggregation methods by 4.7 - 22.2% in accuracy points across diverse benchmarks.
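
The fragility of averaging-based aggregation that motivates this work is easy to see in a toy example; a coordinate-wise median (a simple robust baseline, not the paper's learned DeepSet aggregator) already resists a single Byzantine client.

```python
# Plain averaging vs. coordinate-wise median when aggregating client
# class-probability vectors, with one Byzantine client.
import numpy as np

honest = np.array([
    [0.40, 0.35, 0.25],
    [0.45, 0.30, 0.25],
    [0.50, 0.25, 0.25],
])
byzantine = np.array([[0.0, 0.0, 1.0]])  # adversarial response
responses = np.vstack([honest, byzantine])

print("mean  :", responses.mean(axis=0).round(3), "->", responses.mean(axis=0).argmax())
print("median:", np.median(responses, axis=0).round(3), "->", np.median(responses, axis=0).argmax())
# The mean flips the decision to the attacker's class (2); the median keeps class 0.
```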

[LG-77] Lipschitz Bandits with Stochastic Delayed Feedback

链接: https://arxiv.org/abs/2510.00309
作者: Zhongxuan Liu,Yue Kang,Thomas C. M. Lee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Lipschitz bandit problem extends stochastic bandits to a continuous action set defined over a metric space, where the expected reward function satisfies a Lipschitz condition. In this work, we introduce a new problem of Lipschitz bandit in the presence of stochastic delayed feedback, where the rewards are not observed immediately but after a random delay. We consider both bounded and unbounded stochastic delays, and design algorithms that attain sublinear regret guarantees in each setting. For bounded delays, we propose a delay-aware zooming algorithm that retains the optimal performance of the delay-free setting up to an additional term that scales with the maximal delay \tau_\max . For unbounded delays, we propose a novel phased learning strategy that accumulates reliable feedback over carefully scheduled intervals, and establish a regret lower bound showing that our method is nearly optimal up to logarithmic factors. Finally, we present experimental results to demonstrate the efficiency of our algorithms under various delay scenarios.

[LG-78] Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT NEURIPS2025

链接: https://arxiv.org/abs/2510.00296
作者: Guy Bar-Shalom,Fabrizio Frasca,Yaniv Galron,Yftah Ziser,Haggai Maron
类目: Machine Learning (cs.LG)
*备注: Published in NeurIPS 2025

点击查看摘要

Abstract:Detecting hallucinations in Large Language Model-generated text is crucial for their safe deployment. While probing classifiers show promise, they operate on isolated layer-token pairs and are LLM-specific, limiting their effectiveness and hindering cross-LLM applications. In this paper, we introduce a novel approach to address these shortcomings. We build on the natural sequential structure of activation data in both axes (layers \times tokens) and advocate treating full activation tensors akin to images. We design ACT-ViT, a Vision Transformer-inspired model that can be effectively and efficiently applied to activation tensors and supports training on data from multiple LLMs simultaneously. Through comprehensive experiments encompassing diverse LLMs and datasets, we demonstrate that ACT-ViT consistently outperforms traditional probing techniques while remaining extremely efficient for deployment. In particular, we show that our architecture benefits substantially from multi-LLM training, achieves strong zero-shot performance on unseen datasets, and can be transferred effectively to new LLMs through fine-tuning. Full code is available at this https URL.

[LG-79] Low Resource Audio Codec Challenge Baseline Systems

链接: https://arxiv.org/abs/2510.00264
作者: Yusuf Ziya Isik,Rafał Łaganowski
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Low-Resource Audio Codec Challenge 2025

点击查看摘要

Abstract:The Low-Resource Audio Codec (LRAC) Challenge aims to advance neural audio coding for deployment in resource-constrained environments. The first edition focuses on low-resource neural speech codecs that must operate reliably under everyday noise and reverberation, while satisfying strict constraints on computational complexity, latency, and bitrate. Track 1 targets transparency codecs, which aim to preserve the perceptual transparency of input speech under mild noise and reverberation. Track 2 addresses enhancement codecs, which combine coding and compression with denoising and dereverberation. This paper presents the official baseline systems for both tracks in the 2025 LRAC Challenge. The baselines are convolutional neural codec models with Residual Vector Quantization, trained end-to-end using a combination of adversarial and reconstruction objectives. We detail the data filtering and augmentation strategies, model architectures, optimization procedures, and checkpoint selection criteria.

[LG-80] Delayed Attention Training Improves Length Generalization in Transformer–RNN Hybrids

链接: https://arxiv.org/abs/2510.00258
作者: Buu Phan,Reza Ebrahimi,Sanjay Haresh,Roland Memisevic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study length generalization in sequence models on a composite problem involving both state tracking and associative recall. Prior work finds that recurrent networks handle state tracking well but struggle with recall, whereas Transformers excel at recall yet fail to extend state-tracking capabilities to longer sequences. Motivated by the complementary strengths of these architectures, we construct hybrid models integrating recurrent and attention-based components, and train them on the combined task to evaluate whether both capabilities can be preserved. Our results reveal that, in such hybrids, the Transformer component tends to exploit shortcut solutions, leading to poor length generalization. We identify this shortcut reliance as a key obstacle and propose a simple yet effective training strategy – delaying the training of the attention layers – that mitigates this effect and significantly improves length generalization performance. Our experiments show that this approach enables hybrid models to achieve near-perfect accuracy (90%) on hybrid sequences three times longer than those used during training.
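
A minimal sketch of the delayed-attention schedule is shown below: the attention sub-module of a hybrid block is kept frozen for the first `delay` optimizer steps and then released. The module layout and delay value are illustrative assumptions.

```python
# Freeze attention parameters early so the recurrent component learns state
# tracking first, then unfreeze; a sketch of the training strategy, not the
# authors' exact architecture or schedule.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        h, _ = self.rnn(x)
        a, _ = self.attn(h, h, h)
        return h + a

model = HybridBlock()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
delay = 500  # attention stays frozen for this many steps (hypothetical value)

for step in range(1000):
    for p in model.attn.parameters():
        p.requires_grad_(step >= delay)
    x = torch.randn(8, 32, 64)
    loss = model(x).pow(2).mean()  # stand-in for the real task loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```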

[LG-81] CODED-SMOOTHING: Coding Theory Helps Generalization

链接: https://arxiv.org/abs/2510.00253
作者: Parsa Moradi,Tayyebeh Jahaninezhad,Mohammad Ali Maddah-Ali
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce the coded-smoothing module, which can be seamlessly integrated into standard training pipelines, both supervised and unsupervised, to regularize learning and improve generalization with minimal computational overhead. In addition, it can be incorporated into the inference pipeline to randomize the model and enhance robustness against adversarial perturbations. The design of coded-smoothing is inspired by general coded computing, a paradigm originally developed to mitigate straggler and adversarial failures in distributed computing by processing linear combinations of the data rather than the raw inputs. Building on this principle, we adapt coded computing to machine learning by designing an efficient and effective regularization mechanism that encourages smoother representations and more generalizable solutions. Extensive experiments on both supervised and unsupervised tasks demonstrate that coded-smoothing consistently improves generalization and achieves state-of-the-art robustness against gradient-based adversarial attacks.
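
Inferring from the abstract's description of processing linear combinations of the data rather than raw inputs, a mixup-style sketch of a coded training step could look like this; the coding distribution and soft-label treatment are assumptions, not the paper's exact module.

```python
# Train on random convex combinations ("codes") of a mini-batch instead of the
# raw inputs, in the spirit of coded computing.
import torch
import torch.nn as nn

def coded_batch(x, y_onehot, n_codes: int):
    # Each coded sample is a random convex combination of the raw batch.
    g = torch.rand(n_codes, x.size(0))
    g = g / g.sum(dim=1, keepdim=True)     # rows sum to 1
    return g @ x.flatten(1), g @ y_onehot  # encode inputs and soft labels alike

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
xc, yc = coded_batch(x, nn.functional.one_hot(y, 10).float(), n_codes=32)

logits = model(xc)
loss = -(yc * logits.log_softmax(dim=-1)).sum(dim=-1).mean()  # soft-label cross-entropy
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```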

[LG-82] Reward driven discovery of the optimal microstructure representations with invariant variational autoencoders

链接: https://arxiv.org/abs/2510.00243
作者: Boris N. Slautin,Kamyar Barakati,Hiroshi Funakubo,Maxim A. Ziatdinov,Vladimir V. Shvartsman,Doru C. Lupascu,Sergei V. Kalinin
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 27 pages, 9 figures

点击查看摘要

Abstract:Microscopy techniques generate vast amounts of complex image data that in principle can be used to discover simpler, interpretable, and parsimonious forms to reveal the underlying physical structures, such as elementary building blocks in molecular systems or order parameters and phases in crystalline materials. Variational Autoencoders (VAEs) provide a powerful means of constructing such low-dimensional representations, but their performance heavily depends on multiple non-myopic design choices, which are often optimized through trial-and-error and empirical analysis. To enable automated and unbiased optimization of VAE workflows, we investigated reward-based strategies for evaluating latent space representations. Using Piezoresponse Force Microscopy data as a model system, we examined multiple policies and reward functions that can serve as a foundation for automated optimization. Our analysis shows that approximating the latent space with Gaussian Mixture Models (GMM) and Bayesian Gaussian Mixture Models (BGMM) provides a strong basis for constructing reward functions capable of estimating model efficiency and guiding the search for optimal parsimonious representations.
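
A minimal version of a GMM-based latent-space reward can be written in a few lines of scikit-learn; using BIC as the score is one plausible choice among the policies and reward functions the paper explores.

```python
# Reward a VAE latent space by how cleanly it clusters under a Gaussian mixture
# (lower BIC across candidate mixture sizes -> higher reward).
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_reward(latents: np.ndarray, max_components: int = 8) -> float:
    bics = [GaussianMixture(k, random_state=0).fit(latents).bic(latents)
            for k in range(1, max_components + 1)]
    return -min(bics)  # lower BIC -> more structured latent space -> higher reward

rng = np.random.default_rng(0)
structured = np.vstack([rng.normal(m, 0.1, (200, 2)) for m in ((-2, 0), (0, 2), (2, 0))])
unstructured = rng.normal(0, 1.5, (600, 2))
print(gmm_reward(structured) > gmm_reward(unstructured))  # True: clusters score higher
```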

[LG-83] Per-example gradients: a new frontier for understanding and improving optimizers

链接: https://arxiv.org/abs/2510.00236
作者: Vincent Roulet,Atish Agarwala
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training algorithms in deep learning usually treat a mini-batch of samples as a single object; they average gradients over the mini-batch, and then process the average in various ways. Computing other statistics beyond the average may have been seen as prohibitively resource intensive in automatic differentiation (AD) frameworks. We show that this is not the case. Generally, gradient statistics can be implemented through a surgery of the AD graph, which, in some cases, incurs almost no computational and memory overheads compared to the mini-batch gradient computation. Additionally, we show that in certain classes of models, including transformers, JAX’s vectorization transformation offers a viable implementation for prototyping and experimentation. We then revise our understanding of two nonlinear operations in optimization through the lens of per-example gradient transformations. We first study signSGD and show that the optimal placement of the sign operation in the gradient processing chain is crucial to success and can be predicted with a simple signal-to-noise ratio argument. Next we study per-example variations of the Adam preconditioner, and show that optimization is best served when the preconditioner is dominated by the mean rather than the variance of the gradient distribution - in contrast to conventional wisdom. Overall we demonstrate that per-example gradient information enables new analyses and possibilities for algorithm design.
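
The JAX route the abstract mentions is essentially a vmap over a per-example gradient function; the sketch below computes per-example gradients for a toy regression and derives a per-coordinate signal-to-noise ratio in the spirit of the signSGD analysis (the model and statistic are illustrative).

```python
# Per-example gradients via jax.vmap(jax.grad(...)), then statistics beyond the mean.
import jax
import jax.numpy as jnp

def loss(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

params = {"w": jnp.ones((3,)), "b": jnp.zeros(())}
x = jax.random.normal(jax.random.PRNGKey(0), (32, 3))
y = jnp.sin(x).sum(axis=1)

# in_axes=(None, 0, 0): share params, map over the batch axis of x and y.
per_example_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))(params, x, y)

mean = jax.tree_util.tree_map(lambda g: g.mean(0), per_example_grads)
std = jax.tree_util.tree_map(lambda g: g.std(0), per_example_grads)
snr = jax.tree_util.tree_map(lambda m, s: jnp.abs(m) / (s + 1e-8), mean, std)
print(snr["w"])  # per-coordinate gradient signal-to-noise ratio
```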

[LG-84] Differentiable Autoencoding Neural Operator for Interpretable and Integrable Latent Space Modeling

链接: https://arxiv.org/abs/2510.00233
作者: Siva Viknesh,Amirhossein Arzani
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Scientific machine learning has enabled the extraction of physical insights from high-dimensional spatiotemporal flow data using linear and nonlinear dimensionality reduction techniques. Despite these advances, achieving interpretability within the latent space remains a challenge. To address this, we propose the DIfferentiable Autoencoding Neural Operator (DIANO), a deterministic autoencoding neural operator framework that constructs physically interpretable latent spaces for both dimensional and geometric reduction, with the provision to enforce differential governing equations directly within the latent space. Built upon neural operators, DIANO compresses high-dimensional input functions into a low-dimensional latent space via spatial coarsening through an encoding neural operator and subsequently reconstructs the original inputs using a decoding neural operator through spatial refinement. We assess DIANO’s latent space interpretability and performance in dimensionality reduction against baseline models, including the Convolutional Neural Operator and standard autoencoders. Furthermore, a fully differentiable partial differential equation (PDE) solver is developed and integrated within the latent space, enabling the temporal advancement of both high- and low-fidelity PDEs, thereby embedding physical priors into the latent dynamics. We further investigate various PDE formulations, including the 2D unsteady advection-diffusion and the 3D Pressure-Poisson equation, to examine their influence on shaping the latent flow representations. Benchmark problems considered include flow past a 2D cylinder, flow through a 2D symmetric stenosed artery, and a 3D patient-specific coronary artery. These case studies demonstrate DIANO’s capability to solve PDEs within a latent space that facilitates both dimensional and geometrical reduction while allowing latent interpretability.

[LG-85] RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers

链接: https://arxiv.org/abs/2510.00202
作者: Yifan Lu,Rixin Liu,Jiayi Yuan,Xingqi Cui,Shenrun Zhang,Hongyi Liu,Jiarong Xing
类目: Machine Learning (cs.LG)
*备注: 16 pages, 11 figures

点击查看摘要

Abstract:Today’s LLM ecosystem comprises a wide spectrum of models that differ in size, capability, and cost. No single model is optimal for all scenarios; hence, LLM routers have become essential for selecting the most appropriate model under varying circumstances. However, the rapid emergence of various routers makes choosing the right one increasingly challenging. To address this problem, we need a comprehensive router comparison and a standardized leaderboard, similar to those available for models. In this work, we introduce RouterArena, the first open platform enabling comprehensive comparison of LLM routers. RouterArena has (1) a principally constructed dataset with broad knowledge domain coverage, (2) distinguishable difficulty levels for each domain, (3) an extensive list of evaluation metrics, and (4) an automated framework for leaderboard updates. Leveraging our framework, we have produced the initial leaderboard with detailed metrics comparison as shown in Figure 1. We will make our platform open to the public soon.

[LG-86] Large Language Models Inference Engines based on Spiking Neural Networks

链接: https://arxiv.org/abs/2510.00133
作者: Adarsha Balaji,Sandeep Madireddy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundational models based on the transformer architecture are currently the state-of-the-art in general language modeling, as well as in scientific areas such as material science and climate. However, training and deploying these models is computationally challenging as the time and space complexity has a quadratic relation to the input sequence length. Several efforts exploring efficient computational paradigms and model architectures to address these limitations have been made. In this work, we explore spiking neural networks (SNNs) to design transformer models. A challenge is that training large-scale SNNs using existing surrogate learning methods is inefficient and time-consuming. On the other hand, techniques to convert existing transformer-based models to their SNN equivalent are not scalable, as achieving optimal performance comes at the cost of a large number of spike time-steps, i.e. increased latency. To address this, we propose NeurTransformer, a methodology for designing transformer-based SNN for inference using a supervised fine-tuning approach with existing conversion methods. The proposed methodology works by: (1) replacing the self-attention mechanism with a spike-based self-attention (SSA), (2) converting the feed-forward block of the trained transformer model to its equivalent SNN, and (3) fine-tuning the SSA block using SNN-based surrogate learning algorithms. We benchmark the proposed methodology and demonstrate its accuracy and scalability using three variants of the GPT-2 model of increasing model size. We observe that the converted GPT-2 small models demonstrate a 5-12% loss in cosine similarity and a 9.7% reduction in perplexity. Finally, we demonstrate the energy efficiency of the SSA block compared to the ASA block and show between 64.71% and 85.28% reductions in estimated energy consumption when implementing the self-attention mechanism on digital hardware.

[LG-87] Approximately Unimodal Likelihood Models for Ordinal Regression

链接: https://arxiv.org/abs/2510.00122
作者: Ryoya Yamasaki
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Ordinal regression (OR, also called ordinal classification) is classification of ordinal data, in which the underlying target variable is categorical and considered to have a natural ordinal relation for the underlying explanatory variable. A key to successful OR models is to find a data structure "natural ordinal relation" common to many ordinal data and reflect that structure into the design of those models. A recent OR study found that many real-world ordinal data show a tendency that the conditional probability distribution (CPD) of the target variable given a value of the explanatory variable will often be unimodal. Several previous studies thus developed unimodal likelihood models, in which a predicted CPD is guaranteed to become unimodal. However, it was also observed experimentally that many real-world ordinal data partly have values of the explanatory variable where the underlying CPD will be non-unimodal, and hence unimodal likelihood models may suffer from a bias for such a CPD. Therefore, motivated to mitigate such a bias, we propose approximately unimodal likelihood models, which can represent up to a unimodal CPD and a CPD that is close to being unimodal. We also verify experimentally that the proposed model can be effective for statistical modeling of ordinal data and OR tasks.

[LG-88] Directed Information γ-covering: An Information-Theoretic Framework for Context Engineering

链接: https://arxiv.org/abs/2510.00079
作者: Hai Huang
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 6 tables, preprint

点击查看摘要

Abstract:We introduce \textbf{Directed Information \gamma-covering}, a simple but general framework for redundancy-aware context engineering. Directed information (DI), a causal analogue of mutual information, measures asymmetric predictiveness between chunks. If \operatorname{DI}_{i \to j} \ge H(C_j) - \gamma, then C_i suffices to represent C_j up to \gamma bits. Building on this criterion, we formulate context selection as a \gamma-cover problem and propose a greedy algorithm with provable guarantees: it preserves query information within bounded slack, inherits (1+\ln n) and (1-1/e) approximations from submodular set cover, and enforces a diversity margin. Importantly, building the \gamma-cover is \emph{query-agnostic}: it incurs no online cost and can be computed once offline and amortized across all queries. Experiments on HotpotQA show that \gamma-covering consistently improves over BM25, a competitive baseline, and provides clear advantages in hard-decision regimes such as context compression and single-slot prompt selection. These results establish DI \gamma-covering as a principled, self-organizing backbone for modern LLM pipelines.
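
The selection rule itself reduces to a greedy set cover under the DI criterion. In the sketch below the DI matrix and chunk entropies are mocked with made-up values; estimating them from text is the substantive (and unshown) part.

```python
# Greedy gamma-cover: chunk i covers chunk j when DI[i, j] >= H[j] - gamma,
# i.e. C_i predicts C_j up to gamma bits. Only the selection rule follows the
# paper's criterion; the numbers here are mock values.
import numpy as np

def greedy_gamma_cover(DI: np.ndarray, H: np.ndarray, gamma: float) -> list:
    n = DI.shape[0]
    covers = DI >= H[None, :] - gamma          # covers[i, j]: does chunk i cover chunk j?
    np.fill_diagonal(covers, True)             # every chunk covers itself
    uncovered, selected = set(range(n)), []
    while uncovered:
        best = max(range(n), key=lambda i: len(uncovered & set(np.flatnonzero(covers[i]))))
        selected.append(best)
        uncovered -= set(np.flatnonzero(covers[best]))
    return selected

H = np.array([4.0, 4.0, 3.0, 5.0])             # chunk entropies in bits (mock values)
DI = np.array([[9.0, 3.8, 2.9, 1.0],           # DI[i, j]: directed information i -> j
               [3.0, 9.0, 2.8, 0.5],
               [1.0, 1.0, 9.0, 0.2],
               [4.0, 3.5, 2.5, 9.0]])
print(greedy_gamma_cover(DI, H, gamma=0.5))    # [3]: chunk 3 alone covers every chunk here
```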

[LG-89] Federated Learning Meets LLMs: Feature Extraction From Heterogeneous Clients

链接: https://arxiv.org/abs/2510.00065
作者: Abdelrhman Gaber,Hassan Abd-Eltawab,Youssif Abuzied,Muhammad ElMahdy,Tamer ElBatt
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training without sharing raw data, making it attractive for privacy-sensitive domains such as healthcare, finance, and IoT. A major obstacle, however, is the heterogeneity of tabular data across clients, where divergent schemas and incompatible feature spaces prevent straightforward aggregation. To address this challenge, we propose FedLLM-Align, a federated framework that leverages pre-trained large language models (LLMs) as universal feature extractors. Tabular records are serialized into text, and embeddings from models such as DistilBERT, ALBERT, RoBERTa, and ClinicalBERT provide semantically aligned representations that support lightweight local classifiers under the standard FedAvg protocol. This approach removes the need for manual schema harmonization while preserving privacy, since raw data remain strictly local. We evaluate FedLLM-Align on coronary heart disease prediction using partitioned Framingham datasets with simulated schema divergence. Across all client settings and LLM backbones, our method consistently outperforms state-of-the-art baselines, achieving up to +0.25 improvement in F1-score and a 65% reduction in communication cost. Stress testing under extreme schema divergence further demonstrates graceful degradation, unlike traditional methods that collapse entirely. These results establish FedLLM-Align as a robust, privacy-preserving, and communication-efficient solution for federated learning in heterogeneous environments.
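
The serialization-plus-frozen-embedding step might be implemented as below; the "key is value" template is an assumption, and the point is that clients with divergent schemas land in one shared embedding space while raw features stay local.

```python
# Serialize a tabular record as text and embed it with DistilBERT via mean
# pooling; a lightweight local classifier would then train on such embeddings
# under FedAvg. A sketch, not the paper's exact pipeline.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = AutoModel.from_pretrained("distilbert-base-uncased").eval()

def embed_record(record: dict) -> torch.Tensor:
    text = "; ".join(f"{k} is {v}" for k, v in record.items())  # schema-agnostic template
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state                 # (1, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)                 # mean pooling

# Two clients with divergent schemas map into the same 768-d embedding space:
print(embed_record({"age": 54, "sysBP": 140, "smoker": "yes"}).shape)
print(embed_record({"Age (years)": 54, "Systolic BP": 140}).shape)
```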

[LG-90] A Recall-First CNN for Sleep Apnea Screening from Snoring Audio

链接: https://arxiv.org/abs/2510.00052
作者: Anushka Mallick,Afiya Noorain,Ashwin Menon,Ashita Solanki,Keertan Balaji
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Sleep apnea is a serious sleep-related breathing disorder that is common and can impact health if left untreated. Currently the traditional method for screening and diagnosis is overnight polysomnography. Polysomnography is expensive, takes a lot of time, and is not practical for screening large groups of people. In this paper, we explored a more accessible option: using respiratory audio recordings to spot signs of apnea. We utilized 18 audio recordings. Our approach involved converting breathing sounds into spectrograms, balancing the dataset by oversampling apnea segments, and applying class weights to reduce bias toward the majority class. The model reached a recall of 90.55% for apnea detection, intentionally prioritizing catching apnea events over general accuracy. Despite low precision, the high recall suggests potential as a low-cost screening tool that could be used at home or in basic clinical setups, potentially helping identify at-risk individuals much earlier.

[LG-91] FTSCommDetector: Discovering Behavioral Communities through Temporal Synchronization

链接: https://arxiv.org/abs/2510.00014
作者: Tianyang Luo,Xikun Zhang,Dongjin Song
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Why do trillion-dollar tech giants AAPL and MSFT diverge into different response patterns during market disruptions despite identical sector classifications? This paradox reveals a fundamental limitation: traditional community detection methods fail to capture synchronization-desynchronization patterns where entities move independently yet align during critical moments. To this end, we introduce FTSCommDetector, implementing our Temporal Coherence Architecture (TCA) to discover similar and dissimilar communities in continuous multivariate time series. Unlike existing methods that process each timestamp independently, causing unstable community assignments and missing evolving relationships, our approach maintains coherence through dual-scale encoding and static topology with dynamic attention. Furthermore, we establish information-theoretic foundations demonstrating how scale separation maximizes complementary information and introduce Normalized Temporal Profiles (NTP) for scale-invariant evaluation. As a result, FTSCommDetector achieves consistent improvements across four diverse financial markets (SP100, SP500, SP1000, Nikkei 225), with gains ranging from 3.5% to 11.1% over the strongest baselines. The method demonstrates remarkable robustness with only 2% performance variation across window sizes from 60 to 120 days, making dataset-specific tuning unnecessary, providing practical insights for portfolio construction and risk management.

[LG-92] A first-order method for constrained nonconvex–nonconcave minimax problems under a local Kurdyka-Łojasiewicz condition

链接: https://arxiv.org/abs/2510.01168
作者: Zhaosong Lu,Xiangyuan Wang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 25 pages

点击查看摘要

Abstract:We study a class of constrained nonconvex–nonconcave minimax problems in which the inner maximization involves potentially complex constraints. Under the assumption that the inner problem of a novel lifted minimax problem satisfies a local Kurdyka-Łojasiewicz (KL) condition, we show that the maximal function of the original problem enjoys a local Hölder smoothness property. We also propose a sequential convex programming (SCP) method for solving constrained optimization problems and establish its convergence rate under a local KL condition. Leveraging these results, we develop an inexact proximal gradient method for the original minimax problem, where the inexact gradient of the maximal function is computed via the SCP method applied to a locally KL-structured subproblem. Finally, we establish complexity guarantees for the proposed method in computing an approximate stationary point of the original minimax problem.

[LG-93] The causal structure of galactic astrophysics

链接: https://arxiv.org/abs/2510.01112
作者: Harry Desmond,Joseph Ramsey
类目: Astrophysics of Galaxies (astro-ph.GA); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 5 pages, 3 figures; submitted to MNRAS Letters

点击查看摘要

Abstract:Data-driven astrophysics currently relies on the detection and characterisation of correlations between objects’ properties, which are then used to test physical theories that make predictions for them. This process fails to utilise information in the data that forms a crucial part of the theories’ predictions, namely which variables are directly correlated (as opposed to accidentally correlated through others), the directions of these determinations, and the presence or absence of confounders that correlate variables in the dataset but are themselves absent from it. We propose to recover this information through causal discovery, a well-developed methodology for inferring the causal structure of datasets that is however almost entirely unknown to astrophysics. We develop a causal discovery algorithm suitable for astrophysical datasets and illustrate it on \sim 5 \times 10^5 low-redshift galaxies from the NASA-Sloan Atlas, demonstrating its ability to distinguish physical mechanisms that are degenerate on the basis of correlations alone.

[LG-94] Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time

链接: https://arxiv.org/abs/2510.01098
作者: Blake Bordelon,Mary I. Letey,Cengiz Pehlevan
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: preprint with 29 pages

点击查看摘要

Abstract:We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on various computational and statistical resources (width, depth, number of training steps, batch size and data per context). In a joint limit where data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) where covariances are randomly rotated and structured (RRS). For ISO and FS settings, we find that depth only aids ICL performance if context length is limited. Alternatively, in the RRS setting where covariances change across contexts, increasing the depth leads to significant improvements in ICL, even at infinite context length. This provides a new solvable toy model of neural scaling laws which depends on both width and depth of a transformer and predicts an optimal transformer shape as a function of compute. This toy model enables computation of exact asymptotics for the risk as well as derivation of power laws under source/capacity conditions for the ICL tasks.

[LG-95] Optimal placement of wind farms via quantile constraint learning

链接: https://arxiv.org/abs/2510.01093
作者: Wenxiu Feng,Antonio Alcántara,Carlos Ruiz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wind farm placement arranges the size and the location of multiple wind farms within a given region. The power output is highly related to the wind speed on spatial and temporal levels, which can be modeled by advanced data-driven approaches. To this end, we use a probabilistic neural network as a surrogate that accounts for the spatiotemporal correlations of wind speed. This neural network uses ReLU activation functions so that it can be reformulated as mixed-integer linear set of constraints (constraint learning). We embed these constraints into the placement decision problem, formulated as a two-stage stochastic optimization problem. Specifically, conditional quantiles of the total electricity production are regarded as recursive decisions in the second stage. We use real high-resolution regional data from a northern region in Spain. We validate that the constraint learning approach outperforms the classical bilinear interpolation method. Numerical experiments are implemented on risk-averse investors. The results indicate that risk-averse investors concentrate on dominant sites with strong wind, while exhibiting spatial diversification and sensitive capacity spread in non-dominant sites. Furthermore, we show that if we introduce transmission line costs in the problem, risk-averse investors favor locations closer to the substations. On the contrary, risk-neutral investors are willing to move to further locations to achieve higher expected profits. Our results conclude that the proposed novel approach is able to tackle a portfolio of regional wind farm placements and further provide guidance for risk-averse investors.

[LG-96] Non-Euclidean Broximal Point Method: A Blueprint for Geometry-Aware Optimization

链接: https://arxiv.org/abs/2510.00823
作者: Kaja Gruntkowska,Peter Richtárik
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The recently proposed Broximal Point Method (BPM) [Gruntkowska et al., 2025] offers an idealized optimization framework based on iteratively minimizing the objective function over norm balls centered at the current iterate. It enjoys striking global convergence guarantees, converging linearly and in a finite number of steps for proper, closed and convex functions. However, its theoretical analysis has so far been confined to the Euclidean geometry. At the same time, emerging trends in deep learning optimization, exemplified by algorithms such as Muon [Jordan et al., 2024] and Scion [Pethick et al., 2025], demonstrate the practical advantages of minimizing over balls defined via non-Euclidean norms which better align with the underlying geometry of the associated loss landscapes. In this note, we ask whether the convergence theory of BPM can be extended to this more general, non-Euclidean setting. We give a positive answer, showing that most of the elegant guarantees of the original method carry over to arbitrary norm geometries. Along the way, we clarify which properties are preserved and which necessarily break down when leaving the Euclidean realm. Our analysis positions Non-Euclidean BPM as a conceptual blueprint for understanding a broad class of geometry-aware optimization algorithms, shedding light on the principles behind their practical effectiveness.
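
In symbols, the idealized step under discussion is a full minimization over a norm ball around the current iterate (notation mine, following the abstract):

```latex
x_{k+1} \in \operatorname*{arg\,min}_{x \,:\, \|x - x_k\| \le t_k} f(x)
```

where \|\cdot\| is now an arbitrary norm rather than the Euclidean one, so the ball shape can match the geometry of the loss landscape, as in Muon- and Scion-style updates.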

[LG-97] GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins NEURIPS

链接: https://arxiv.org/abs/2510.00774
作者: Eoin Quinn,Marco Carobene,Jean Quentin,Sebastien Boyer,Miguel Arbesú,Oliver Bent
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Accepted at AI4Science and ML4PS NeurIPS Workshops 2025

点击查看摘要

Abstract:While deep learning has revolutionized the prediction of rigid protein structures, modelling the conformational ensembles of Intrinsically Disordered Proteins (IDPs) remains a key frontier. Current AI paradigms present a trade-off: Protein Language Models (PLMs) capture evolutionary statistics but lack explicit physical grounding, while generative models trained to model full ensembles are computationally expensive. In this work we critically assess these limits and propose a path forward. We introduce GeoGraph, a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue-residue contact-map topology directly from sequence. By featurizing coarse-grained molecular dynamics simulations into residue- and sequence-level graph descriptors, we create a robust and information-rich learning target. Our evaluation demonstrates that this approach yields representations that are more predictive of key biophysical properties than existing methods.

[LG-98] Approximation of differential entropy in Bayesian optimal experimental design

链接: https://arxiv.org/abs/2510.00734
作者: Chuntao Chen,Tapio Helin,Nuutti Hyvönen,Yuya Suzuki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*备注: 28 pages, 3 figures

点击查看摘要

Abstract:Bayesian optimal experimental design provides a principled framework for selecting experimental settings that maximize obtained information. In this work, we focus on estimating the expected information gain in the setting where the differential entropy of the likelihood is either independent of the design or can be evaluated explicitly. This reduces the problem to maximum entropy estimation, alleviating several challenges inherent in expected information gain computation. Our study is motivated by large-scale inference problems, such as inverse problems, where the computational cost is dominated by expensive likelihood evaluations. We propose a computational approach in which the evidence density is approximated by a Monte Carlo or quasi-Monte Carlo surrogate, while the differential entropy is evaluated using standard methods without additional likelihood evaluations. We prove that this strategy achieves convergence rates that are comparable to, or better than, state-of-the-art methods for full expected information gain estimation, particularly when the cost of entropy evaluation is negligible. Moreover, our approach relies only on mild smoothness of the forward map and avoids stronger technical assumptions required in earlier work. We also present numerical experiments, which confirm our theoretical findings.

[LG-99] Guaranteed Noisy CP Tensor Recovery via Riemannian Optimization on the Segre Manifold

链接: https://arxiv.org/abs/2510.00569
作者: Ke Xu,Yuefeng Han
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 33 pages, 7 figures

点击查看摘要

Abstract:Recovering a low-CP-rank tensor from noisy linear measurements is a central challenge in high-dimensional data analysis, with applications spanning tensor PCA, tensor regression, and beyond. We exploit the intrinsic geometry of rank-one tensors by casting the recovery task as an optimization problem over the Segre manifold, the smooth Riemannian manifold of rank-one tensors. This geometric viewpoint yields two powerful algorithms: Riemannian Gradient Descent (RGD) and Riemannian Gauss-Newton (RGN), each of which preserves feasibility at every iteration. Under mild noise assumptions, we prove that RGD converges at a local linear rate, while RGN exhibits an initial local quadratic convergence phase that transitions to a linear rate as the iterates approach the statistical noise floor. Extensive synthetic experiments validate these convergence guarantees and demonstrate the practical effectiveness of our methods.

[LG-100] Bayesian Neural Networks for Functional ANOVA model

链接: https://arxiv.org/abs/2510.00545
作者: Seokhun Park,Choeun Kim,Jihu Lee,Yunseop Shin,Insung Kong,Yongdai Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the increasing demand for interpretability in machine learning, functional ANOVA decomposition has gained renewed attention as a principled tool for breaking down a high-dimensional function into low-dimensional components that reveal the contributions of different variable groups. Recently, Tensor Product Neural Network (TPNN) has been developed and applied as basis functions in the functional ANOVA model, referred to as ANOVA-TPNN. A disadvantage of ANOVA-TPNN, however, is that the components to be estimated must be specified in advance, which makes it difficult to incorporate higher-order TPNNs into the functional ANOVA model due to computational and memory constraints. In this work, we propose Bayesian-TPNN, a Bayesian inference procedure for the functional ANOVA model with TPNN basis functions, enabling the detection of higher-order components with reduced computational cost compared to ANOVA-TPNN. We develop an efficient MCMC algorithm and demonstrate that Bayesian-TPNN performs well by analyzing multiple benchmark datasets. Theoretically, we prove that the posterior of Bayesian-TPNN is consistent.

[LG-101] A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws

链接: https://arxiv.org/abs/2510.00504
作者: Hong-Yi Wang,Di Luo,Tomaso Poggio,Isaac L. Chuang,Liu Ziyin
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of d objects can be asymptotically compressed into a function of \operatorname{polylog} d objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the \textit{dynamical lottery ticket hypothesis}, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form L \sim d^{-\alpha} can be boosted to an arbitrarily fast power law decay, and ultimately to \exp(-\alpha' \sqrt[m]{d}).

[LG-102] On the Adversarial Robustness of Learning-based Conformal Novelty Detection

链接: https://arxiv.org/abs/2510.00463
作者: Daofu Zhang,Mehrdad Pournaderi,Hanne M. Clifford,Yu Xiang,Pramod K. Varshney
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper studies the adversarial robustness of conformal novelty detection. In particular, we focus on AdaDetect, a powerful learning-based framework for novelty detection with finite-sample false discovery rate (FDR) control. While AdaDetect provides rigorous statistical guarantees under benign conditions, its behavior under adversarial perturbations remains unexplored. We first formulate an oracle attack setting that quantifies the worst-case degradation of FDR, deriving an upper bound that characterizes the statistical cost of attacks. This idealized formulation directly motivates a practical and effective attack scheme that only requires query access to AdaDetect’s output labels. Coupling these formulations with two popular and complementary black-box adversarial algorithms, we systematically evaluate the vulnerability of AdaDetect on synthetic and real-world datasets. Our results show that adversarial perturbations can significantly increase the FDR while maintaining high detection power, exposing fundamental limitations of current error-controlled novelty detection methods and motivating the development of more robust alternatives.

[LG-103] Improving Virtual Contrast Enhancement using Longitudinal Data MICCAI2025

链接: https://arxiv.org/abs/2510.00418
作者: Pierre Fayolle,Alexandre Bône,Noëlie Debs,Pihlippe Robert,Pascal Bourdon,Remy Guillevin,David Helbert
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, Workshop MICCAI 2025 - Learning with Longitudinal Medical Images and Data

点击查看摘要

Abstract:Gadolinium-based contrast agents (GBCAs) are widely used in magnetic resonance imaging (MRI) to enhance lesion detection and characterisation, particularly in the field of neuro-oncology. Nevertheless, concerns regarding gadolinium retention and accumulation in brain and body tissues, most notably for diseases that require close monitoring and frequent GBCA injection, have led to the need for strategies to reduce dosage. In this study, a deep learning framework is proposed for the virtual contrast enhancement of full-dose post-contrast T1-weighted MRI images from corresponding low-dose acquisitions. The contribution of the presented model is its utilisation of longitudinal information, which is achieved by incorporating a prior full-dose MRI examination from the same patient. A comparative evaluation against a non-longitudinal single session model demonstrated that the longitudinal approach significantly improves image quality across multiple reconstruction metrics. Furthermore, experiments with varying simulated contrast doses confirmed the robustness of the proposed method. These results emphasize the potential of integrating prior imaging history into deep learning-based virtual contrast enhancement pipelines to reduce GBCA usage without compromising diagnostic utility, thus paving the way for safer, more sustainable longitudinal monitoring in clinical MRI practice.

[LG-104] Progressively Sampled Equality-Constrained Optimization

链接: https://arxiv.org/abs/2510.00417
作者: Frank E. Curtis,Lingjun Guo,Daniel P. Robinson
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:An algorithm is proposed, analyzed, and tested for solving continuous nonlinear-equality-constrained optimization problems where the constraints are defined by an expectation or an average over a large (finite) number of terms. The main idea of the algorithm is to solve a sequence of equality-constrained problems, each involving a finite sample of constraint-function terms, over which the sample set grows progressively. Under assumptions about the constraint functions and their first- and second-order derivatives that are reasonable in some real-world settings of interest, it is shown that – with a sufficiently large initial sample – solving a sequence of problems defined through progressive sampling yields a better worst-case sample complexity bound compared to solving a single problem with a full set of samples. The results of numerical experiments with a set of test problems demonstrate that the proposed approach can be effective in practice.

[LG-105] Parametric modeling of shear wave velocity profiles for the conterminous U.S.

Link: https://arxiv.org/abs/2510.00372
Authors: Morgan D. Sanger, Brett W. Maurer
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*Comments:


Abstract:Earthquake ground motions and the related damage can be significantly impacted by near-surface soils. Accurate predictions of seismic hazard require depth-continuous models of soil stiffness, commonly described in terms of shear-wave velocity (VS). For regional-scale studies, efforts to predict VS remotely, such as the U.S. Geological Survey’s National Crustal Model, tend to emphasize deeper lithologic velocity structures, thus simplifying important near-surface soil velocity variations, and tend to be produced at relatively coarse geospatial resolution for one geographic area. In this study, we define a functional form to describe VS-with-depth across the conterminous U.S. We calibrate the parameters of the function using a national compilation of more than 9,000 in-situ geotechnical measurements. By coupling the parametric framework with geospatial machine learning, the model can be leveraged to provide consistent, high-resolution VS-depth predictions of the near-surface geotechnical layer across the U.S., complementing the National Crustal Model and supporting applications such as physics-based ground motion simulations and coseismic hazard assessments.
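To make the parametric idea concrete, the sketch below fits a generic three-parameter power law for VS with depth to synthetic measurements; the functional form and values are assumptions for illustration, not the form calibrated in the study.

```python
# Describe VS as a smooth parametric function of depth, then fit the parameters
# to in-situ measurements; a geospatial ML model would predict these parameters.
import numpy as np
from scipy.optimize import curve_fit

def vs_profile(z, vs0, k, n):
    """Shear-wave velocity (m/s) at depth z (m): surface value vs0, power-law growth."""
    return vs0 * (1.0 + k * z) ** n

rng = np.random.default_rng(0)
z = np.linspace(0.5, 30, 40)                          # measurement depths (m)
vs_obs = vs_profile(z, 180, 0.3, 0.4) * rng.normal(1, 0.05, z.size)  # noisy synthetic data

params, _ = curve_fit(vs_profile, z, vs_obs, p0=[200, 0.1, 0.5])
print(params)          # recovered (vs0, k, n)
```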

[LG-106] CINDES: Classification induced neural density estimator and simulator

Link: https://arxiv.org/abs/2510.00367
Authors: Dehao Dai, Jianqing Fan, Yihong Gu, Debarghya Mukherjee
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*Comments: 50 pages, 1 figure


Abstract:Neural network-based methods for (un)conditional density estimation have recently gained substantial attention, as various neural density estimators have outperformed classical approaches in real-data experiments. Despite these empirical successes, implementation can be challenging due to the need to ensure non-negativity and unit-mass constraints, and theoretical understanding remains limited. In particular, it is unclear whether such estimators can adaptively achieve faster convergence rates when the underlying density exhibits a low-dimensional structure. This paper addresses these gaps by proposing a structure-agnostic neural density estimator that is (i) straightforward to implement and (ii) provably adaptive, attaining faster rates when the true density admits a low-dimensional composition structure. Another key contribution of our work is to show that the proposed estimator integrates naturally into generative sampling pipelines, most notably score-based diffusion models, where it achieves provably faster convergence when the underlying density is structured. We validate its performance through extensive simulations and a real-data application.
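The classification-to-density route can be illustrated with the standard density-ratio trick: train a classifier to distinguish data from samples of a known reference density q, then set p(x) ≈ q(x) * exp(logit(x)). The logistic-regression stand-in below is a minimal sketch under that assumption, not CINDES's neural estimator.

```python
# Density estimation via classification: with balanced classes, the Bayes-optimal
# classifier's logit equals log(p/q), so p(x) = q(x) * exp(logit(x)).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
data = rng.normal(1.0, 0.5, size=2000)            # samples from the unknown density p
ref = rng.normal(0.0, 2.0, size=2000)             # samples from the reference density q

X = np.concatenate([data, ref])[:, None]
X_feat = np.hstack([X, X**2])                     # quadratic features suffice for Gaussians
y = np.r_[np.ones(2000), np.zeros(2000)]
clf = LogisticRegression().fit(X_feat, y)

def p_hat(x):
    x = np.atleast_1d(x)[:, None]
    logit = clf.decision_function(np.hstack([x, x**2]))
    return norm.pdf(x[:, 0], 0.0, 2.0) * np.exp(logit)   # q(x) times the ratio p/q

print(p_hat(np.array([0.0, 1.0])), norm.pdf([0.0, 1.0], 1.0, 0.5))  # estimate vs. truth
```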

[LG-107] End-to-end Training of High-Dimensional Optimal Control with Implicit Hamiltonians via Jacobian-Free Backpropagation

Link: https://arxiv.org/abs/2510.00359
Authors: Eric Gelphman, Deepanshu Verma, Nicole Tianjiao Yang, Stanley Osher, Samy Wu Fung
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:


Abstract:Neural network approaches that parameterize value functions have succeeded in approximating high-dimensional optimal feedback controllers when the Hamiltonian admits explicit formulas. However, many practical problems, such as the space shuttle reentry problem and bicycle dynamics, among others, may involve implicit Hamiltonians that do not admit explicit formulas, limiting the applicability of existing methods. Rather than directly parameterizing controls, which does not leverage the Hamiltonian’s underlying structure, we propose an end-to-end implicit deep learning approach that directly parameterizes the value function to learn optimal control laws. Our method enforces physical principles by ensuring trained networks adhere to the control laws by exploiting the fundamental relationship between the optimal control and the value function’s gradient; this is a direct consequence of the connection between Pontryagin’s Maximum Principle and dynamic programming. Using Jacobian-Free Backpropagation (JFB), we achieve efficient training despite temporal coupling in trajectory optimization. We show that JFB produces descent directions for the optimal control objective and experimentally demonstrate that our approach effectively learns high-dimensional feedback controllers across multiple scenarios involving implicit Hamiltonians, which existing methods cannot address.
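The value-gradient-to-control link can be sketched for the explicit special case of control-affine dynamics with quadratic control cost, where Pontryagin's principle gives u*(x) = -B^T grad V(x). The network and dynamics below are illustrative assumptions; the paper's contribution is precisely handling Hamiltonians without such closed forms, via JFB.

```python
# Recover the feedback control from a value-function network: for
# dx/dt = f(x) + B u with running cost (1/2)||u||^2, the optimizing control
# is u*(x) = -B^T grad V(x).
import torch
import torch.nn as nn

V = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))  # value network V(x)
B = torch.tensor([[0.0], [1.0]])                 # single actuator on the second state

def control(x):
    x = x.clone().requires_grad_(True)
    grad_V = torch.autograd.grad(V(x).sum(), x)[0]   # per-sample gradient of V
    return -grad_V @ B                               # u*(x) = -B^T grad V(x)

x = torch.randn(5, 2)
print(control(x).shape)                          # torch.Size([5, 1])
```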

[LG-108] Malliavin Calculus with Weak Derivatives for Counterfactual Stochastic Optimization

Link: https://arxiv.org/abs/2510.00297
Authors: Vikram Krishnamurthy, Luke Snow
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:


Abstract:We study counterfactual stochastic optimization of conditional loss functionals under misspecified and noisy gradient information. The difficulty is that when the conditioning event has vanishing or zero probability, naive Monte Carlo estimators are prohibitively inefficient; kernel smoothing, though common, suffers from slow convergence. We propose a two-stage kernel-free methodology. First, we show using Malliavin calculus that the conditional loss functional of a diffusion process admits an exact representation as a Skorohod integral, yielding variance comparable to that of classical Monte Carlo estimation. Second, we establish that a weak derivative estimate of the conditional loss functional with respect to model parameters can be evaluated with constant variance, in contrast to the widely used score function method whose variance grows linearly in the sample path length. Together, these results yield an efficient framework for counterfactual conditional stochastic gradient algorithms in rare-event regimes.
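The linear-in-path-length variance of the score-function method is easy to verify numerically: for dX = theta dt + dW, the pathwise score with respect to theta is the sum of (dX_i - theta dt) over steps, whose variance equals the elapsed time. A minimal check under these assumed dynamics (the paper's weak-derivative estimator is not reproduced here):

```python
# Numerical check: the score-function (likelihood-ratio) gradient estimator's
# variance grows linearly with path length for dX = theta dt + dW.
import numpy as np

rng = np.random.default_rng(0)
theta, dt = 0.5, 0.01

for n_steps in (100, 1000, 10000):
    dW = rng.normal(0, np.sqrt(dt), size=(5000, n_steps))
    dX = theta * dt + dW
    score = (dX - theta * dt).sum(axis=1)     # d/dtheta of the path log-likelihood
    print(n_steps, score.var())               # grows ~linearly with path length
```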

[LG-109] Electron neural closure for turbulent magnetosheath simulations: energy channels

Link: https://arxiv.org/abs/2510.00282
Authors: George Miloshevich, Luka Vranckx, Felipe Nathan de Oliveira Lopes, Pietro Dazzi, Giuseppe Arrò, Giovanni Lapenta
Subjects: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments: 16 pages, 9 figures, 4 tables


Abstract:In this work, we introduce a non-local five-moment electron pressure tensor closure parametrized by a Fully Convolutional Neural Network (FCNN). Electron pressure plays an important role in generalized Ohm’s law, competing with electron inertia. This model is used in the development of a surrogate model for a fully kinetic energy-conserving semi-implicit Particle-in-Cell simulation of decaying magnetosheath turbulence. We achieve this by training FCNN on a representative set of simulations with a smaller number of particles per cell and showing that our results generalise to a simulation with a large number of particles per cell. We evaluate the statistical properties of the learned equation of state, with a focus on pressure-strain interaction, which is crucial for understanding energy channels in turbulent plasmas. The resulting equation of state learned via FCNN significantly outperforms local closures, such as those learned by Multi-Layer Perceptron (MLP) or double adiabatic expressions. We report that the overall spatial distribution of pressure-strain and its conditional averages are reconstructed well. However, some small-scale features are missed, especially for the off-diagonal components of the pressure tensor. Nevertheless, the results are substantially improved with more training data, indicating favorable scaling and potential for improvement, which will be addressed in future work.
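A fully convolutional closure of this kind can be sketched as a map from gridded field channels to the six independent components of the symmetric pressure tensor; the channel count and depth below are assumptions, not the paper's architecture (which is non-local through its receptive field).

```python
# A minimal FCNN closure: map local field channels (e.g., density, bulk
# velocity, E, B) to the six independent electron pressure tensor components.
import torch
import torch.nn as nn

class PressureClosureFCNN(nn.Module):
    def __init__(self, in_ch=10, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 5, padding=2), nn.GELU(),
            nn.Conv2d(hidden, hidden, 5, padding=2), nn.GELU(),
            nn.Conv2d(hidden, 6, 1),     # Pxx, Pyy, Pzz, Pxy, Pxz, Pyz
        )

    def forward(self, fields):          # fields: (batch, in_ch, ny, nx)
        return self.net(fields)

model = PressureClosureFCNN()
print(model(torch.randn(2, 10, 64, 64)).shape)   # torch.Size([2, 6, 64, 64])
```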

[LG-110] Board Gender Diversity and Carbon Emissions Performance: Insights from Panel Regressions Machine Learning and Explainable AI

Link: https://arxiv.org/abs/2510.00244
Authors: Mohammad Hassan Shakil, Arne Johan Pollestad, Khine Kyaw, Ziaul Haque Munim
Subjects: General Finance (q-fin.GN); Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments: 34 pages and 3 figures


Abstract:With the European Union introducing gender quotas on corporate boards, this study investigates the impact of board gender diversity (BGD) on firms’ carbon emission performance (CEP). Using panel regressions and advanced machine learning algorithms on data from European firms between 2016 and 2022, the analyses reveal a significant non-linear relationship. Specifically, CEP improves with BGD up to an optimal level of approximately 35 percent, beyond which further increases in BGD yield no additional improvement in CEP. A minimum threshold of 22 percent BGD is necessary for meaningful improvements in CEP. To assess the legitimacy of CEP outcomes, this study examines whether ESG controversies affect the relationship between BGD and CEP. The results show no significant effect, suggesting that the effect of BGD is driven by governance mechanisms rather than symbolic actions. Additionally, structural equation modelling (SEM) indicates that while environmental innovation contributes to CEP, it is not the mediating channel through which BGD promotes CEP. The results have implications for academics, businesses, and regulators.
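The non-linear specification behind the roughly 35 percent optimum can be illustrated with a quadratic regression whose turning point is -b1 / (2 b2); the synthetic data and plain OLS below (no firm or year fixed effects) are simplifying assumptions relative to the paper's panel models.

```python
# Quadratic specification CEP ~ BGD + BGD^2: the fitted coefficients give the
# turning point directly as -b1 / (2 * b2).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
bgd = rng.uniform(0, 0.6, 1000)                       # board gender diversity share
cep = 2.0 * bgd - 2.9 * bgd**2 + rng.normal(0, 0.1, 1000)  # synthetic outcome

X = sm.add_constant(np.column_stack([bgd, bgd**2]))
fit = sm.OLS(cep, X).fit()
b0, b1, b2 = fit.params
print(f"turning point at BGD = {-b1 / (2 * b2):.1%}")  # ~34% with these coefficients
```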

[LG-111] Learning from the electronic structure of molecules across the periodic table

Link: https://arxiv.org/abs/2510.00224
Authors: Manasa Kaniselvan, Benjamin Kurt Miller, Meng Gao, Juno Nam, Daniel S. Levine
Subjects: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*Comments:


Abstract:Machine-Learned Interatomic Potentials (MLIPs) require vast amounts of atomic structure data to learn forces and energies, and their performance continues to improve with training set size. Meanwhile, the even greater quantities of accompanying data in the Hamiltonian matrix H behind these datasets have so far gone unused for this purpose. Here, we provide a recipe for integrating the orbital interaction data within H into training pipelines for atomic-level properties. We first introduce HELM (“Hamiltonian-trained Electronic-structure Learning for Molecules”), a state-of-the-art Hamiltonian prediction model which bridges the gap between Hamiltonian prediction and universal MLIPs by scaling to H of structures with 100+ atoms, high elemental diversity, and large basis sets including diffuse functions. To accompany HELM, we release a curated Hamiltonian matrix dataset, ‘OMol_CSH_58k’, with unprecedented elemental diversity (58 elements), molecular size (up to 150 atoms), and basis set (def2-TZVPD). Finally, we introduce ‘Hamiltonian pretraining’ as a method to extract meaningful descriptors of atomic environments even from a limited number of atomic structures, and repurpose this shared embedding space to improve performance on energy prediction in low-data regimes. Our results highlight the use of electronic interactions as a rich and transferable data source for representing chemical space.

[LG-112] Quantum reservoir computing using Jaynes-Cummings model

Link: https://arxiv.org/abs/2510.00171
Authors: Sreetama Das, Gian Luca Giorgi, Roberta Zambrini
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 15 pages, 13 figures


Abstract:We investigate quantum reservoir computing (QRC) using a hybrid qubit-boson system described by the Jaynes-Cummings (JC) Hamiltonian and its dispersive limit (DJC). These models provide high-dimensional Hilbert spaces and intrinsic nonlinear dynamics, making them powerful substrates for temporal information processing. We systematically benchmark both reservoirs on linear and nonlinear memory tasks, demonstrating that they exhibit an unusual property: their nonlinear memory capacity exceeds their linear one. We further test their predictive performance on the Mackey-Glass time series, a widely used benchmark for chaotic dynamics, and show comparable forecasting ability. We also investigate how memory and prediction accuracy vary with reservoir parameters, and show the role of higher-order bosonic observables and time multiplexing in enhancing expressivity, even in minimal spin-boson configurations. Our results establish JC- and DJC-based reservoirs as versatile platforms for time-series processing and as elementary units that go beyond the setting of equivalent qubit pairs, offering pathways towards tunable, high-performance quantum machine learning architectures.
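The reservoir substrate itself is compact to write down: the sketch below builds the JC Hamiltonian in a truncated Fock space, evolves a state over one input interval, and reads out a qubit observable. The truncation level, parameters, and the omitted input-injection and ridge-regression readout stages are all illustrative simplifications.

```python
# Jaynes-Cummings Hamiltonian in a truncated Fock space (basis: Fock x qubit,
# with qubit index 0 = excited), evolved with a matrix exponential.
import numpy as np
from scipy.linalg import expm

N = 10                                           # Fock-space truncation
a = np.diag(np.sqrt(np.arange(1, N)), 1)         # bosonic annihilation operator
sminus = np.array([[0.0, 0.0], [1.0, 0.0]])      # sigma- = |g><e|
sz = np.diag([1.0, -1.0])
I2, IN = np.eye(2), np.eye(N)

wc, wa, g = 1.0, 1.0, 0.1
H = (wc * np.kron(a.conj().T @ a, I2)            # cavity term
     + 0.5 * wa * np.kron(IN, sz)                # qubit term
     + g * (np.kron(a, sminus.conj().T)          # a sigma+  (rotating-wave coupling)
            + np.kron(a.conj().T, sminus)))      # a† sigma-

U = expm(-1j * H * 0.5)                          # evolution over one input interval
psi = np.zeros(2 * N, dtype=complex)
psi[0] = 1.0                                     # |n=0, qubit excited>
psi = U @ psi
print(np.vdot(psi, np.kron(IN, sz) @ psi).real)  # readout observable <sigma_z>
```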

[LG-113] Revealing the temporal dynamics of antibiotic anomalies in the infant gut microbiome with neural jump ODEs

Link: https://arxiv.org/abs/2510.00087
Authors: Anja Adamov, Markus Chardonnet, Florian Krach, Jakob Heiss, Josef Teichmann, Nicholas A. Bokulich
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Probability (math.PR); Quantitative Methods (q-bio.QM)
*Comments:


Abstract:Detecting anomalies in irregularly sampled multi-variate time-series is challenging, especially in data-scarce settings. Here we introduce an anomaly detection framework for irregularly sampled time-series that leverages neural jump ordinary differential equations (NJODEs). The method infers conditional mean and variance trajectories in a fully path-dependent way and computes anomaly scores. On synthetic data containing jump, drift, diffusion, and noise anomalies, the framework accurately identifies diverse deviations. Applied to infant gut microbiome trajectories, it delineates the magnitude and persistence of antibiotic-induced disruptions: revealing prolonged anomalies after second antibiotic courses, extended-duration treatments, and exposures during the second year of life. We further demonstrate the predictive capabilities of the inferred anomaly scores in accurately predicting antibiotic events and outperforming diversity-based baselines. Our approach accommodates unevenly spaced longitudinal observations, adjusts for static and dynamic covariates, and provides a foundation for inferring microbial anomalies induced by perturbations, offering a translational opportunity to optimize intervention regimens by minimizing microbial disruptions.
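Once the NJODE supplies conditional means and variances, the anomaly score itself reduces to a standardized deviation. The sketch below substitutes placeholder arrays for the model and uses an assumed 3-sigma threshold:

```python
# Anomaly scoring from model-inferred conditional moments at irregular times:
# flag observations whose squared standardized deviation is large.
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, 30))               # irregular observation times
mu = np.sin(2 * np.pi * t)                       # placeholder for NJODE conditional mean
var = np.full_like(t, 0.04)                      # placeholder for NJODE conditional variance
x = mu + rng.normal(0, 0.2, t.size)              # observations
x[17] += 1.5                                     # inject a jump anomaly

score = (x - mu) ** 2 / var                      # squared standardized deviation
print(np.where(score > 9.0)[0])                  # indices flagged beyond 3 sigma
```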

[LG-114] Private Learning of Littlestone Classes Revisited

Link: https://arxiv.org/abs/2510.00076
Authors: Xin Lyu
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments: Comments welcome


Abstract:We consider online and PAC learning of Littlestone classes subject to the constraint of approximate differential privacy. Our main result is a private learner to online-learn a Littlestone class with a mistake bound of $\tilde{O}(d^{9.5} \cdot \log(T))$ in the realizable case, where $d$ denotes the Littlestone dimension and $T$ the time horizon. This is a doubly-exponential improvement over the state-of-the-art [GL'21] and comes polynomially close to the lower bound for this task. The advancement is made possible by a couple of ingredients. The first is a clean and refined interpretation of the “irreducibility” technique from the state-of-the-art private PAC-learner for Littlestone classes [GGKM'21]. Our new perspective also allows us to improve the PAC-learner of [GGKM'21] and give a sample complexity upper bound of $\widetilde{O}\left(\frac{d^5 \log(1/\delta\beta)}{\varepsilon \alpha}\right)$, where $\alpha$ and $\beta$ denote the accuracy and confidence of the PAC learner, respectively. This improves over [GGKM'21] by factors of $\frac{d}{\alpha}$ and attains an optimal dependence on $\alpha$. Our algorithm uses a private sparse selection algorithm to *sample* from a pool of strongly input-dependent candidates. However, unlike most previous uses of sparse selection algorithms, where one only cares about the utility of the output, our algorithm requires understanding and manipulating the actual distribution from which an output is drawn. In the proof, we use a sparse version of the Exponential Mechanism from [GKM'21], which behaves nicely under our framework and is amenable to a very easy utility proof.

Information Retrieval

[IR-0] ModernVBERT: Towards Smaller Visual Document Retrievers

Link: https://arxiv.org/abs/2510.01149
Authors: Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse
Subjects: Information Retrieval (cs.IR)
*Comments:


Abstract:Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at this https URL.
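The late-interaction objective the recipe centers on can be sketched with ColBERT-style MaxSim scoring, in which each query-token embedding is matched to its most similar document patch embedding and the maxima are summed. Random tensors stand in for ModernVBERT outputs here.

```python
# ColBERT-style MaxSim late-interaction scoring between query tokens and
# document patch embeddings.
import torch

q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)   # query token embeddings
d = torch.nn.functional.normalize(torch.randn(900, 128), dim=-1)  # document patch embeddings

def maxsim(q, d):
    return (q @ d.T).max(dim=-1).values.sum()    # sum over query tokens of max cosine sim

print(maxsim(q, d))
```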

[IR-1] On Listwise Reranking for Corpus Feedback

Link: https://arxiv.org/abs/2510.00887
Authors: Soyoung Yoon, Jongho Kim, Daeyong Kwon, Avishek Anand, Seung-won Hwang
Subjects: Information Retrieval (cs.IR)
*Comments: Under review


Abstract:Rerankers improve retrieval performance by capturing document interactions. At one extreme, graph-aware adaptive retrieval (GAR) represents an information-rich regime, requiring a pre-computed document similarity graph for reranking. However, as such graphs are often unavailable, or incur quadratic memory costs even when available, graph-free rerankers leverage large language model (LLM) calls to achieve competitive performance. We introduce L2G, a novel framework that implicitly induces document graphs from listwise reranker logs. By converting reranker signals into a graph structure, L2G enables scalable graph-based retrieval without the overhead of explicit graph computation. Results on TREC-DL and a BEIR subset show that L2G matches the effectiveness of oracle-based graph methods while incurring zero additional LLM calls.
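One way to picture inducing a graph from listwise logs: connect documents that co-occur in reranked lists, weighting edges by rank proximity. The weighting scheme below is an illustrative assumption, not L2G's actual construction.

```python
# Induce a weighted document graph from reranked lists: documents that appear
# close together in a ranking get stronger edges.
from collections import defaultdict

logs = [                      # reranked doc-id lists from past queries
    ["d3", "d1", "d7", "d2"],
    ["d1", "d3", "d5"],
    ["d7", "d2", "d4"],
]

edges = defaultdict(float)
for ranking in logs:
    for i, u in enumerate(ranking):
        for j in range(i + 1, len(ranking)):
            v = ranking[j]
            edges[tuple(sorted((u, v)))] += 1.0 / (j - i)   # closer ranks, stronger edge

for (u, v), w in sorted(edges.items(), key=lambda kv: -kv[1]):
    print(u, v, round(w, 2))
```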

[IR-2] HLTCOE at TREC 2024 NeuCLIR Track

Link: https://arxiv.org/abs/2510.00143
Authors: Eugene Yang, Dawn Lawrie, Orion Weller, James Mayfield
Subjects: Information Retrieval (cs.IR)
*Comments: TREC 2024 System Paper; 6 pages; 7 tables


Abstract:The HLTCOE team applied PLAID, an mT5 reranker, a GPT-4 reranker, score fusion, and document translation to the TREC 2024 NeuCLIR track. For PLAID we included a variety of models and training techniques: Translate-Distill (TD), Generate-Distill (GD), and Multilingual Translate-Distill (MTD). TD uses scores from the mT5 model over English MS MARCO query-document pairs to learn how to score query-document pairs where the documents are translated to match the CLIR setting. GD follows TD but uses passages from the collection and queries generated by an LLM as training examples. MTD uses MS MARCO translated into multiple languages, allowing experiments on how to batch the data during training. Finally, for report generation we experimented with system combination over different runs. One family of systems used either GPT-4o or Claude-3.5-Sonnet to summarize the retrieved results from a series of decomposed sub-questions. Another system took the output from those two models and verified/combined it with Claude-3.5-Sonnet. The other family used GPT-4o and GPT-3.5-Turbo to extract and group relevant facts from the retrieved documents based on the decomposed queries. The resulting submissions directly concatenate the grouped facts to form the report, with their documents of origin as the citations. The team submitted runs to all NeuCLIR tasks: the CLIR and MLIR news tasks, the technical documents task, and the report generation task.
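As an illustration of the score-fusion component, the sketch below applies reciprocal rank fusion (RRF) to two runs; whether the team used RRF specifically, and the conventional k = 60 constant, are assumptions here.

```python
# Reciprocal rank fusion: combine rankings from several retrieval systems
# by summing 1 / (k + rank) per document across runs.
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

run_a = ["d1", "d2", "d3"]          # e.g., a PLAID run
run_b = ["d3", "d1", "d4"]          # e.g., an mT5 reranker run
print(rrf([run_a, run_b]))          # fused ordering
```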

Attachments

Download the full list of today's papers