This post contains the latest paper list retrieved from arXiv.org on 2025-10-31. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Note: Paper data is fetched from arXiv.org and updated automatically around 12:00 every day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-10-31)
A total of 507 papers were updated today, including:
- Natural Language Processing: 85 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 154 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 87 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 174 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
[Quick Read]: This paper asks whether current video generation models are ready to serve as reliable zero-shot reasoners in challenging visual reasoning scenarios. Taking the leading video model Veo-3 as its subject, the study systematically evaluates its reasoning behavior across 12 dimensions (spanning spatial, geometric, physical, temporal, and embodied logic) and curates the evaluation data into MME-CoF, a compact benchmark for standardized measurement of Chain-of-Frame (CoF) reasoning. The key of the solution is a comprehensive, structured evaluation framework that precisely identifies the model's strengths in short-horizon spatial coherence and locally consistent dynamics, as well as its limitations in long-horizon causal reasoning, strict geometric constraints, and abstract logic, ultimately showing that existing video models cannot yet stand alone as reliable zero-shot reasoners, but can act as complementary visual engines alongside dedicated reasoning models.
Link: https://arxiv.org/abs/2510.26802
Authors: Ziyu Guo,Xinyan Chen,Renrui Zhang,Ruichuan An,Yu Qi,Dongzhi Jiang,Xiangtai Li,Manyuan Zhang,Hongsheng Li,Pheng-Ann Heng
Affiliations: CUHK; MMLab; Peking University; Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL
Abstract:Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: this https URL
[NLP-1] Gistify! Codebase-Level Understanding via Runtime Execution
[Quick Read]: This paper addresses the problem of automatically designing challenging evaluation tasks over large codebases, so as to measure the capabilities of coding large language models (LLMs) in realistic settings more effectively. The core of the solution is the proposed Gistify task: the LLM must extract from a full codebase a single, minimal, self-contained file that reproduces the output of a specified entrypoint (e.g., a python command) while keeping only the components essential to executing that command. The task is demanding because it simultaneously tests structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Experiments show that current state-of-the-art models struggle, especially on tasks with long execution traces, underscoring how much this evaluation paradigm demands of models.
Link: https://arxiv.org/abs/2510.26790
Authors: Hyunji Lee,Minseon Kim,Chinmay Singh,Matheus Pereira,Atharv Sonwane,Isadora White,Elias Stengel-Eskin,Mohit Bansal,Zhengyan Shi,Alessandro Sordoni,Marc-Alexandre Côté,Xingdi Yuan,Lucas Caccia
Affiliations: University of North Carolina at Chapel Hill; Microsoft Research; Cornell University; University of California San Diego; University of Texas at Austin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires both structural understanding of the codebase and accurate modeling of its execution flow, as well as the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
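The success criterion lends itself to a simple check. Below is a minimal sketch of that output-equivalence test, assuming hypothetical paths (full_codebase/, gist_dir/gist.py) and a hypothetical entrypoint command; the paper's actual harness is not specified here.

```python
import subprocess

def run(cmd, cwd):
    # run a command and capture its stdout for comparison
    return subprocess.run(cmd, cwd=cwd, capture_output=True,
                          text=True, timeout=300).stdout

# same entrypoint, once under the full codebase, once with only the gist file
original = run(["python", "-m", "pkg.cli", "--demo"], cwd="full_codebase")
gist = run(["python", "gist.py", "--demo"], cwd="gist_dir")
print("gistify success:", original == gist)
```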
[NLP-2] Defeating the Training-Inference Mismatch via FP16
[Quick Read]: This paper tackles the optimization instability that arises when reinforcement learning (RL) fine-tuning of large language models (LLMs) suffers from a numerical mismatch between the training and inference policies. The study shows the root cause lies in the floating-point precision itself: despite its large dynamic range, the widely adopted BF16 introduces rounding errors large enough to break training-inference consistency. The key of the solution is simply switching the precision from BF16 to FP16, a change that requires no modification to the model architecture or learning algorithm, takes only a few lines of code, and is fully supported by modern deep learning frameworks, yielding markedly more stable optimization, faster convergence, and stronger performance across tasks, algorithms, and frameworks.
Link: https://arxiv.org/abs/2510.26788
Authors: Penghui Qi,Zichen Liu,Xiangxin Zhou,Tianyu Pang,Chao Du,Wee Sun Lee,Min Lin
Affiliations: Sea AI Lab; National University of Singapore
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
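The rounding-error gap between the two formats is easy to see directly: BF16 keeps 7 mantissa bits versus FP16's 10, so its round-trip error is roughly 8x larger. A minimal PyTorch sketch (illustrative, not the paper's experiment):

```python
import torch

x = torch.randn(10_000, dtype=torch.float32)

for dtype in (torch.bfloat16, torch.float16):
    roundtrip = x.to(dtype).to(torch.float32)   # cast down, then back up
    err = (x - roundtrip).abs().mean().item()
    print(f"{dtype}: mean round-trip rounding error = {err:.3e}")
```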
[NLP-3] Remote Labor Index: Measuring AI Automation of Remote Work WWW
[Quick Read]: This paper addresses the disconnect between the rapid progress of AI on research benchmarks and the automation value it actually delivers in real economic settings. To quantify this gap, the authors introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark composed of real-world, economically valuable projects, used to evaluate AI agents' ability to complete work end-to-end in practical settings. The key of the solution is a measurable, cross-industry empirical index that lets the impact of AI automation be tracked objectively and gives stakeholders a data-driven basis for proactively navigating AI-driven labor automation.
Link: https://arxiv.org/abs/2510.26787
Authors: Mantas Mazeika,Alice Gatti,Cristina Menghini,Udari Madhushani Sehwag,Shivam Singhal,Yury Orlovskiy,Steven Basart,Manasi Sharma,Denis Peskoff,Elaine Lau,Jaehyuk Lim,Lachlan Carroll,Alice Blair,Vinaya Sivakumar,Sumana Basu,Brad Kenstler,Yuntao Ma,Julian Michael,Xiaoke Li,Oliver Ingebretsen,Aditya Mehta,Jean Mottola,John Teichmann,Kevin Yu,Zaina Shaik,Adam Khoja,Richard Ren,Jason Hausenloy,Long Phan,Ye Htet,Ankit Aich,Tahseen Rabbani,Vivswan Shah,Andriy Novykov,Felix Binder,Kirill Chugunov,Luis Ramirez,Matias Geralnik,Hernán Mesura,Dean Lee,Ed-Yeremai Hernandez Cardona,Annette Diamond,Summer Yue,Alexandr Wang,Bing Liu,Ernesto Hernandez,Dan Hendrycks
Affiliations: Center for AI Safety; Scale AI; OpenAI; Meta; Stability.AI; Anthropic; Character.ai; Claude
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Website: this https URL
Abstract:AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.
[NLP-4] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
[Quick Read]: This paper addresses the performance saturation that makes existing benchmarks (e.g., AIME) increasingly ineffective at differentiating top-tier large language models (LLMs) on advanced mathematical reasoning. The proposed AMO-Bench is a benchmark of 50 original, human-crafted problems at or above International Mathematical Olympiad (IMO) difficulty. The key of the solution is threefold: (1) expert cross-validation ensures every problem meets at least IMO-level difficulty, avoiding the performance ceiling imposed by easier problems; (2) all problems are entirely original, preventing performance leakage through data memorization; and (3) each problem requires only a final answer rather than a proof, enabling automatic and robust grading. Experiments show that even the best model reaches only 52.4% accuracy on AMO-Bench, highlighting substantial room for improving mathematical reasoning in current LLMs.
Link: https://arxiv.org/abs/2510.26768
Authors: Shengnan An,Xunliang Cai,Xuezhi Cao,Xiaoyu Li,Yehao Lin,Junlin Liu,Xinxuan Lv,Dan Ma,Xuanlin Wang,Ziwen Wang,Shuang Zhou (alphabetical order by last name)
Affiliations: Meituan; University of Chinese Academy of Sciences; Harbin Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures
Abstract:We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. this https URL
[NLP-5] Deep sequence models tend to memorize geometrically; it is unclear why
[Quick Read]: This paper investigates how parametric memory of atomic facts is represented in sequence models, questioning the conventional abstraction of such memory as brute-force lookup of local co-occurrences. The study finds that Transformers do not merely rely on the local co-occurrence patterns observed during training: they spontaneously synthesize a geometry that encodes global relationships among all entities, including non-co-occurring ones, thereby reducing a hard ℓ-fold compositional reasoning task to an easy-to-learn one-step geometric task. The key finding is that this embedding geometry stems from a spectral bias and is learned even when it is no more succinct than brute-force lookup, without depending on typical architectural or optimization pressures; this challenges existing theories of how models learn geometry and points to concrete headroom for making Transformer memory more strongly geometric.
Link: https://arxiv.org/abs/2510.26745
Authors: Shahriar Noroozizadeh,Vaishnavh Nagarajan,Elan Rosenfeld,Sanjiv Kumar
Affiliations: Carnegie Mellon University; Google Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:
Abstract:In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an ℓ-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimizational pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that, in contrast to prevailing theories, indeed arises naturally despite the lack of various pressures. This analysis also points to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.
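The Node2Vec connection can be illustrated with a toy spectral embedding: even when only adjacent entities ever co-occur, the low-frequency eigenvectors of the graph Laplacian place all entities on a global axis. This NumPy sketch is illustrative only and is not the paper's setup:

```python
import numpy as np

# toy chain of 10 entities where only adjacent pairs ever co-occur
n = 10
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

L = np.diag(A.sum(axis=1)) - A    # Laplacian of the co-occurrence graph
_, vecs = np.linalg.eigh(L)
emb = vecs[:, 1:3]                # low-frequency (smooth) eigenvectors

# entities 0 and 9 never co-occur, yet their embedding distance reflects
# their global relationship, not just local adjacency
print(np.linalg.norm(emb[0] - emb[1]), np.linalg.norm(emb[0] - emb[9]))
```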
[NLP-6] Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
[Quick Read]: This paper addresses the lack of a unified, reproducible benchmark for evaluating the reasoning capabilities of foundation models across different computing infrastructures, in particular whether measured performance depends on the hardware platform. The core challenge is establishing a cross-platform, infrastructure-agnostic evaluation regime that accurately measures generalization across diverse settings. The key of the solution is a three-phase experimental framework: first, a baseline is established on an HPC supercomputer (MareNostrum 5); second, the evaluation is repeated on a university cluster and a cloud platform (Nebius AI Studio) to confirm reproducibility; third, the assessment is extended to 79 problems across disciplines on both platforms. The findings challenge conventional scaling assumptions, show that training data quality matters more than model size, and yield actionable model-selection guidelines for educational, production, and research settings.
Link: https://arxiv.org/abs/2510.26732
Authors: J. de Curtò,I. de Zarzà,Pablo García,Jordi Cabot
Affiliations: Barcelona Supercomputing Center (BSC); Universidad Pontificia Comillas; Luxembourg Institute of Science and Technology (LIST)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.
[NLP-7] Value Drifts: Tracing Value Alignment During LLM Post-Training
[Quick Read]: This paper studies how value alignment is learned during the post-training of large language models (LLMs): at which stage it emerges and how different post-training algorithms and datasets shape it. The key of the solution is a set of systematic experiments that disentangle the roles of supervised fine-tuning (SFT) and preference optimization, finding that the SFT phase generally establishes a model's values and that subsequent preference optimization rarely re-aligns them. Using a controllable synthetic preference dataset, the authors further show that different preference optimization algorithms lead to different alignment outcomes even on identical preference data, revealing the dynamics of value drift and providing actionable guidance for data curation, algorithm configuration, and model selection.
Link: https://arxiv.org/abs/2510.26707
Authors: Mehar Bhatia,Shravan Nayak,Gaurav Kamath,Marius Mosbach,Karolina Stańczak,Vered Shwartz,Siva Reddy
Affiliations: Mila - Quebec AI Institute; McGill University; Université de Montréal; ETH Zurich; University of British Columbia; Vector Institute; Canada CIFAR AI Chair
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model’s post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model’s values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.
[NLP-8] The End of Manual Decoding: Towards Truly End-to-End Language Models
[Quick Read]: This paper targets the misleading "end-to-end" label on LLM generation: in practice, LLMs depend on a non-differentiable decoding process whose hyperparameters (such as temperature and top-p) must be tuned by hand, leaving generation quality unstable and hard to optimize. The key of the solution is the AutoDeco architecture: lightweight heads added to the standard Transformer dynamically predict context-specific temperature and top-p values at each generation step, turning decoding into a parametric, token-level control mechanism. The approach achieves truly "end-to-end" generation and, notably, exhibits an emergent ability to follow natural-language decoding instructions (e.g., "generate with low randomness"); it significantly outperforms default decoding strategies and approaches the oracle-tuned static baseline obtained by "hacking the test set", opening a new paradigm for steerable, interactive LLM generation.
Link: https://arxiv.org/abs/2510.26697
Authors: Zhichao Wang,Dongyang Ma,Xinting Huang,Deng Cai,Tian Lan,Jiahao Xu,Haitao Mi,Xiaoying Tang,Yan Wang
Affiliations: Tencent AI Lab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
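A hypothetical PyTorch sketch of the idea: prediction heads that emit per-step temperature and top-p alongside the logits, then a nucleus-sampling step that uses them. Head shapes, activations, and names are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AutoDecoHeads(nn.Module):
    """Lightweight heads on top of the final hidden state h (batch, hidden)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size)
        self.temp_head = nn.Linear(hidden_size, 1)
        self.topp_head = nn.Linear(hidden_size, 1)

    def forward(self, h):
        logits = self.lm_head(h)
        temp = nn.functional.softplus(self.temp_head(h)) + 1e-4  # keep > 0
        top_p = torch.sigmoid(self.topp_head(h))                 # keep in (0, 1)
        return logits, temp, top_p

def sample(logits, temp, top_p):
    probs = torch.softmax(logits / temp, dim=-1)
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(-1)
    sorted_p[cum - sorted_p > top_p] = 0.0    # drop tokens outside the nucleus
    sorted_p /= sorted_p.sum(-1, keepdim=True)
    return idx.gather(-1, torch.multinomial(sorted_p, 1))

heads = AutoDecoHeads(hidden_size=64, vocab_size=100)
next_token = sample(*heads(torch.randn(2, 64)))
```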
[NLP-9] Kimi Linear: An Expressive Efficient Attention Architecture
[Quick Read]: This paper addresses the high computational cost, large KV-cache footprint, and limited inference efficiency of conventional full attention on long sequences, which are especially restrictive in long-context and reinforcement learning (RL) settings. The key of the proposed Kimi Linear architecture is twofold: first, the Kimi Delta Attention (KDA) module extends Gated DeltaNet with a finer-grained gating mechanism that makes more effective use of limited finite-state RNN memory; second, a bespoke chunkwise algorithm built on a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices substantially reduces computation while remaining consistent with the classical delta rule. Under identical training recipes, Kimi Linear outperforms a full-attention model, cuts KV-cache usage by up to 75%, and delivers up to 6x decoding throughput at a 1M context, making it a drop-in replacement for full-attention architectures.
Link: https://arxiv.org/abs/2510.26692
Authors: Kimi Team:Yu Zhang,Zongyu Lin,Xingcheng Yao,Jiaxi Hu,Fanqing Meng,Chengyin Liu,Xin Men,Songlin Yang,Zhiyuan Li,Wentao Li,Enzhe Lu,Weizhou Liu,Yanru Chen,Weixin Xu,Longhui Yu,Yejie Wang,Yu Fan,Longguang Zhong,Enming Yuan,Dehao Zhang,Yizhi Zhang,T.Y. Liu,Haiming Wang,Shengjun Fang,Weiran He,Shaowei Liu,Yiwei Li,Jianlin Su,Jiezhong Qiu,Bo Pang,Junjie Yan,Zhejun Jiang,Weixiao Huang,Bohong Yin,Jiacheng You,Chu Wei,Zhengtao Wang,Chao Hong,Yutian Chen,Guanduo Chen,Yucheng Wang,Huabin Zheng,Feng Wang,Yibo Liu,Mengnan Dong,Zheng Zhang,Siyuan Pan,Wenhao Wu,Yuhao Wu,Longyu Guan,Jiawen Tao,Guohong Fu,Xinran Xu,Yuzhi Wang,Guokun Lai,Yuxin Wu,Xinyu Zhou,Zhilin Yang,Yulun Du
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Kimi Linear tech report
Abstract:We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios, including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
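For orientation, here is a schematic of the classical delta-rule recurrence that KDA builds on: the state S is decayed by a gate and corrected toward each new key-value pair. The scalar gate g and shapes are simplifying assumptions; KDA's finer-grained gating and chunkwise DPLR algorithm are not reproduced here.

```python
import torch

def delta_rule_step(S, q, k, v, beta, g):
    """S: (d_k, d_v) state; q, k: (d_k,); v: (d_v,); beta, g: scalars."""
    S = g * S                                       # forgetting via a decay gate
    prediction = S.T @ k                            # what the state predicts for key k
    S = S + beta * torch.outer(k, v - prediction)   # delta-rule correction
    return S, S.T @ q                               # updated state, output o_t

d_k, d_v = 8, 8
S = torch.zeros(d_k, d_v)
for _ in range(16):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    S, o = delta_rule_step(S, q, k, v, beta=0.5, g=0.9)
```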
[NLP-10] Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
[Quick Read]: This paper addresses the difficulty of adapting large language models (LLMs) to specialized applications in data-sensitive domains such as healthcare, where high-quality, domain-specific training corpora are scarce, while domain experts have already formalized their expertise as ontology rules that ensure the consistency and integrity of knowledge repositories. The key of the proposed Evontree framework is to use a small set of high-quality ontology rules to systematically extract, validate, and enhance the domain knowledge already inside an LLM, without relying on large external datasets: it extracts a domain ontology from the raw model, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge through self-distilled fine-tuning. On medical QA benchmarks, the method consistently outperforms both unmodified models and strong supervised baselines, with accuracy gains of up to 3.7%, confirming its effectiveness and robustness for low-resource domain adaptation of LLMs.
Link: https://arxiv.org/abs/2510.26683
Authors: Mingchen Tu,Zhiqiang Liu,Juan Li,Liangyurui Liu,Junjie Wang,Lei Liang,Wen Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpus hinders LLMs’ adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.
[NLP-11] The Era of Agentic Organization: Learning to Organize with Language Models
[Quick Read]: This paper addresses the inefficiency and poor parallelizability of current large language model reasoning on complex tasks, where serial multi-step execution leads to high latency and wasted resources. The key of the solution is the asynchronous thinking (AsyncThink) paradigm, which organizes the internal reasoning process into concurrently executable structures through a thinking protocol in which an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. Crucially, the thinking structure in this protocol can be further optimized with reinforcement learning, achieving 28% lower inference latency than parallel thinking while improving accuracy on mathematical reasoning, and the learned asynchronous thinking capability generalizes to unseen tasks without additional training.
Link: https://arxiv.org/abs/2510.26658
Authors: Zewen Chi,Li Dong,Qingxiu Dong,Yaru Hao,Xun Wu,Shaohan Huang,Furu Wei
Affiliations: Microsoft Research
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We envision a new era of AI, termed agentic organization, where agents solve complex problems by working collaboratively and concurrently, enabling outcomes beyond individual intelligence. To realize this vision, we introduce asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Specifically, we propose a thinking protocol where an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. More importantly, the thinking structure in this protocol can be further optimized through reinforcement learning. Experiments demonstrate that AsyncThink achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning. Moreover, AsyncThink generalizes its learned asynchronous thinking capabilities, effectively tackling unseen tasks without additional training.
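The organizer-worker protocol maps naturally onto a fork-join pattern. A minimal asyncio sketch of that shape, with solve_subquery as a hypothetical stand-in for a model call (the decomposition itself is hard-coded here, whereas AsyncThink learns it):

```python
import asyncio

async def solve_subquery(q: str) -> str:
    await asyncio.sleep(0.1)            # stands in for model latency
    return f"answer({q})"

async def organizer(query: str) -> str:
    sub_queries = [f"{query}/part{i}" for i in range(3)]                 # fork
    partials = await asyncio.gather(*(solve_subquery(q) for q in sub_queries))
    return " + ".join(partials)                                          # join

print(asyncio.run(organizer("prove the theorem")))
```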
[NLP-12] Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
[Quick Read]: This paper addresses the neglected potential of the encoder-decoder architecture in current large language model (LLM) research, which lacks a rigorous scaling comparison against the now-dominant decoder-only architecture. The key of the solution is to revisit and modernize the encoder-decoder LLM (RedLLM) with recent recipes from decoder-only LLMs (DecLLM), pretraining it with prefix language modeling, and to compare the two systematically across model scales (roughly 150M to 8B parameters). The results show that after instruction tuning, RedLLM matches or even exceeds DecLLM on downstream tasks while offering substantially better inference efficiency, revealing the potential of encoder-decoder architectures for building powerful and efficient LLMs.
Link: https://arxiv.org/abs/2510.26622
Authors: Biao Zhang,Yong Cheng,Siamak Shakeri,Xinyi Wang,Min Ma,Orhan Firat
Affiliations: Google DeepMind
Subjects: Computation and Language (cs.CL)
Comments: The scaling study inspiring T5Gemma
Abstract:Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to nowadays the dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from ~150M to ~8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings could inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.
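The difference between the two pretraining objectives comes down to the attention mask: causal LM masks everything above the diagonal, while prefix LM additionally lets tokens inside the prefix attend to each other bidirectionally. A small illustrative sketch:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True = attention allowed; lower triangle plus the diagonal
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True   # full attention within the prefix
    return mask

print(prefix_lm_mask(6, prefix_len=3).int())
```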
[NLP-13] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
[Quick Read]: This paper addresses the difficulty current large language models (LLMs) have in understanding complex multi-page visual documents (manuals, presentations, posters, and the like), especially fine-grained reasoning over within-page elements and cross-page relationships. The key of the solution is the SlideAgent framework, which introduces specialized agents and decomposes reasoning into three levels (global, page, and element) to build a structured, query-agnostic representation capturing both overarching themes and detailed visual or textual cues. At inference time, the system selectively activates the agents at the relevant levels and integrates their outputs into coherent, context-aware answers; experiments show substantial gains over leading proprietary and open-source models.
Link: https://arxiv.org/abs/2510.26615
Authors: Yiqiao Jin,Rachneet Kaur,Zhen Zeng,Sumitra Ganesh,Srijan Kumar
Affiliations: Georgia Institute of Technology; J.P. Morgan AI Research
Subjects: Computation and Language (cs.CL)
Comments: this https URL
Abstract:Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels-global, page, and element-to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
[NLP-14] Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives EMNLP2025
[Quick Read]: This paper addresses the evaluation of large language models' (LLMs) normative reasoning, i.e., reasoning over deontic modalities such as obligation and permission, with a focus on logical consistency and cognitive biases. The key of the solution is a new dataset covering a wide range of formal reasoning patterns in both the normative and the epistemic domains (which share a common formal structure), while also incorporating the non-formal cognitive factors that influence human reasoning, enabling a systematic comparison of LLM reasoning with normative versus epistemic modals. The results indicate that although LLMs generally adhere to valid reasoning patterns, they show notable inconsistencies in specific types of normative reasoning and display human-like cognitive biases, highlighting key challenges in making their normative reasoning reliable.
Link: https://arxiv.org/abs/2510.26606
Authors: Kentaro Ozeki,Risako Ando,Takanobu Morishita,Hirohiko Abe,Koji Mineshima,Mitsuhiro Okada
Affiliations: Keio University; University of Tokyo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to the 8th BlackboxNLP Workshop at EMNLP 2025
Abstract:Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs’ reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs’ normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at this https URL.
[NLP-15] Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
[Quick Read]: This paper addresses the substantial inference latency of large language models (LLMs) caused by their autoregressive design and size. Existing dynamic-tree speculative decoding methods such as EAGLE-2 and EAGLE-3 often ignore key system variables, such as the GPU device and batch size, that shape inference cost. The key of the solution is CAST (Cost-Aware Speculative Tree decoding), a new dynamic tree decoding approach that factors in such system-level considerations, including GPU configuration and batch size, to dynamically refine the tree structure and balance generation against verification more efficiently. Across six diverse tasks and six LLMs, CAST achieves speedups of up to 5.2x over conventional decoding and generally outperforms state-of-the-art techniques by 5% to 20%.
Link: https://arxiv.org/abs/2510.26577
Authors: Yinrong Hong,Zhiquan Tan,Kai Hu
Affiliations: Beihang University; Tsinghua University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and utilizing six distinct LLMs, our methodology demonstrates remarkable results, achieving speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques from 5% to 20%.
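A hypothetical sketch of the cost-aware idea: choose the draft-tree size that maximizes expected accepted tokens per unit of verification time, under a toy latency model parameterized by batch size. The acceptance curve and cost constants below are illustrative assumptions, not the paper's formulas.

```python
def expected_throughput(tree_size: int, batch_size: int,
                        accept_rate: float = 0.7,
                        base_ms: float = 20.0, per_node_ms: float = 0.4) -> float:
    # diminishing returns: larger trees accept more tokens, but sub-linearly
    expected_accepted = 1.0 + accept_rate * tree_size ** 0.5
    # verification cost grows with both tree size and batch size
    verify_ms = base_ms + per_node_ms * tree_size * batch_size
    return expected_accepted / verify_ms

best = max(range(1, 129), key=lambda s: expected_throughput(s, batch_size=8))
print("chosen draft-tree size:", best)
```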
[NLP-16] InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
[Quick Read]: This paper addresses the training inefficiency of reinforcement learning (RL) for deep-search agents caused by low reward density: agents pay high exploration costs yet rarely receive useful feedback. The key of the proposed InfoFlow framework lies in three components: 1) subproblem decomposition, which breaks long-horizon tasks into subtasks that can carry process rewards, densifying the learning signal; 2) failure-guided hints, which inject corrective guidance into stalled trajectories to raise the probability of success; and 3) dual-agent refinement, in which a refiner agent compresses and synthesizes the search history, lowering the cognitive burden and markedly increasing reward per unit of exploration cost, enabling lightweight LLMs to approach the performance of advanced proprietary models.
Link: https://arxiv.org/abs/2510.26575
Authors: Kun Luo,Hongjin Qian,Zheng Liu,Ziyi Xia,Shitao Xiao,Siqi Bao,Jun Zhao,Kang Liu
Affiliations: Beijing Academy of Artificial Intelligence; Institute of Automation, Chinese Academy of Sciences; Baidu Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduces InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
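To make the reward-density notion concrete, here is a toy computation of reward per unit of exploration cost, with made-up numbers: outcome-only rewards leave most trajectories with zero signal, while per-subproblem process rewards credit partial progress.

```python
def reward_density(rewards, cost):
    """Total reward obtained per unit of exploration cost."""
    return sum(rewards) / cost

# outcome-only: 4 long trajectories (40 steps each), only one succeeds
print(reward_density([0.0, 0.0, 0.0, 1.0], cost=4 * 40))    # 0.00625
# with process rewards, partially correct trajectories still earn signal
print(reward_density([0.25, 0.5, 0.25, 1.0], cost=4 * 40))  # 0.0125
```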
[NLP-17] The Structure of Relation Decoding Linear Operators in Large Language Models NEURIPS2025
[Quick Read]: This paper investigates how the linear operators that decode relational facts in Transformer language models are structured and organized, and in particular why collections of such relation decoders are highly redundant and can be compressed substantially without significant loss of decoding accuracy. The key of the solution is a cross-evaluation protocol, applying each linear decoder to the subjects of every other relation, which reveals that these operators do not encode distinct relations but extract recurring, coarse-grained semantic properties (e.g., country-of-capital and country-of-food both fall under a country-of-X property). This property-centric structure explains both the compressibility of the decoders and why they generalize only to semantically close new relations.
Link: https://arxiv.org/abs/2510.26543
Authors: Miranda Anna Christ,Adrián Csiszárik,Gergely Becsó,Dániel Varga
Affiliations: Fazekas Mihály High School; HUN-REN Alfréd Rényi Institute of Mathematics; Eötvös Loránd University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2025 (Spotlight)
Abstract:This paper investigates the structure of linear operators introduced in Hernandez et al. [2023] that decode specific relational facts in transformer language models. We extend their single-relation findings to a collection of relations and systematically chart their organization. We show that such collections of relation decoders can be highly compressed by simple order-3 tensor networks without significant loss in decoding accuracy. To explain this surprising redundancy, we develop a cross-evaluation protocol, in which we apply each linear decoder operator to the subjects of every other relation. Our results reveal that these linear maps do not encode distinct relations, but extract recurring, coarse-grained semantic properties (e.g., country of capital city and country of food are both in the country-of-X property). This property-centric structure clarifies both the operators’ compressibility and highlights why they generalize only to new relations that are semantically close. Our findings thus interpret linear relational decoding in transformer language models as primarily property-based, rather than relation-specific.
[NLP-18] Hebrew Diacritics Restoration using Visual Representation
[Quick Read]: This paper addresses diacritics restoration in Hebrew, a task essential for correct word pronunciation and for disambiguating text, since unvocalized Hebrew is highly ambiguous and traditional approaches depend on complex linguistic analysis. The proposed DIVRIT system frames diacritization as a zero-shot classification problem: for each undiacritized word it selects the most appropriate diacritization pattern from a dynamically generated candidate set, conditioned on the surrounding context. The key innovation is its use of a Hebrew visual language model (VLM), which processes the undiacritized text as an image so that diacritic information is embedded directly in the input's vector representation, enabling accurate restoration without explicit linguistic rules.
Link: https://arxiv.org/abs/2510.26521
Authors: Yair Elboher,Yuval Pinter
Affiliations: Ben-Gurion University of the Negev
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have significantly advanced performance on this task. In this work, we present DIVRIT, a novel system for Hebrew diacritization that frames the task as a zero-shot classification problem. Our approach operates at the word level, selecting the most appropriate diacritization pattern for each undiacritized word from a dynamically generated candidate set, conditioned on the surrounding textual context. A key innovation of DIVRIT is its use of a Hebrew Visual Language Model, which processes undiacritized text as an image, allowing diacritic information to be embedded directly within the input's vector representation. Through a comprehensive evaluation across various configurations, we demonstrate that the system effectively performs diacritization without relying on complex, explicit linguistic analysis. Notably, in an "oracle" setting where the correct diacritized form is guaranteed to be among the provided candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic architectural enhancements and optimized training methodologies yield significant improvements in the system's overall generalization capabilities. These findings highlight the promising potential of visual representations for accurate and automated Hebrew diacritization.
[NLP-19] Inside CORE-KG: Evaluating Structured Prompting and Coreference Resolution for Knowledge Graphs ICDM2025
[Quick Read]: This paper tackles the noise, node duplication, and structural fragmentation that arise when automatically constructing knowledge graphs (KGs) from complex legal texts, which are unstructured, lexically dense, and full of ambiguous or shifting references. The key of the CORE-KG framework is the integration of a type-aware coreference module with domain-guided structured prompts, which together reduce node duplication and improve the accuracy and consistency of extraction. This work presents a systematic ablation of CORE-KG quantifying each component's contribution: removing coreference resolution increases node duplication by 28.32% and noisy nodes by 4.32%, while removing structured prompts increases node duplication by 4.34% and noisy nodes by 73.33%.
Link: https://arxiv.org/abs/2510.26512
Authors: Dipak Meher,Carlotta Domeniconi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: ICDM 2025 Workshop
Abstract:Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer critical insights but are often unstructured, lexically dense, and filled with ambiguous or shifting references, which pose significant challenges for automated knowledge graph (KG) construction. While recent LLM-based approaches improve over static templates, they still generate noisy, fragmented graphs with duplicate nodes due to the absence of guided extraction and coreference resolution. The recently proposed CORE-KG framework addresses these limitations by integrating a type-aware coreference module and domain-guided structured prompts, significantly reducing node duplication and legal noise. In this work, we present a systematic ablation study of CORE-KG to quantify the individual contributions of its two key components. Our results show that removing coreference resolution results in a 28.32% increase in node duplication and a 4.32% increase in noisy nodes, while removing structured prompts leads to a 4.34% increase in node duplication and a 73.33% increase in noisy nodes. These findings offer empirical insights for designing robust LLM-based pipelines for extracting structured representations from complex legal texts.
[NLP-20] A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool
[Quick Read]: This paper asks how to assess the performance of a pixel-based medical imaging AI triage tool (here, an intracranial hemorrhage detector) more reliably, given that a single large language model (LLM) generating the retrospective "ground truth" may be biased or inconsistent. The key of the solution is an ensemble of multiple open-source LLMs whose collective judgment yields more consistent and reliable assessments: experiments show that ensembles of medium-to-large open-source models (e.g., a Top-3 ensemble and a Full-9 ensemble) match or exceed a single GPT-4o on F1 score, recall, and Matthews correlation coefficient (MCC) without statistically significant differences, offering a more robust method for objectively validating clinical AI tools.
Link: https://arxiv.org/abs/2510.26498
Authors: Adam E. Flanders,Yifan Peng,Luciano Prevedello,Robyn Ball,Errol Colak,Prahlad Menon,George Shih,Hui-Ming Lin,Paras Lakhani
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 29 pages, 3 figures, 4 tables
Abstract:Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLM models and a HIPAA compliant internal version of GPT-4o using a single multi-shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CT exam-report pairs. The highest AUC performance was achieved with llama3.3:70b and GPT-4o (AUC = 0.78). The average precision was highest for Llama3.3:70b and GPT-4o (AP = 0.75 and 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI), the ideal combinations of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT-4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05). Conclusion: An ensemble of medium to large sized open-source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool over a single LLM alone.
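The consensus scheme and the MCC metric are straightforward to reproduce in miniature. A sketch with synthetic labels standing in for the study's annotations (scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 200)             # ICH present (1) / absent (0)

# three simulated LLM raters, each agreeing with the truth 85% of the time
raters = [np.where(rng.random(200) < 0.85, truth, 1 - truth) for _ in range(3)]
consensus = (np.sum(raters, axis=0) >= 2).astype(int)   # majority of three

print("single-rater MCC:", matthews_corrcoef(truth, raters[0]))
print("consensus MCC:   ", matthews_corrcoef(truth, consensus))
```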
[NLP-21] Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
[Quick Read]: This paper addresses the gap between Text-to-SQL models' strong performance on static, single-turn tasks and their weakness in real interactive scenarios, where user intent evolves and queries must be refined over multiple turns. The key of the solution is DySQL-Bench, an automatically constructed multi-turn Text-to-SQL benchmark built with a two-stage pipeline of task synthesis and verification: structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. A multi-turn evaluation framework then simulates realistic interaction among an LLM-simulated user, the model under test, and an executable database, systematically measuring a model's ability to adapt its reasoning and SQL generation as user intent changes.
Link: https://arxiv.org/abs/2510.26495
Authors: Linzhuang Sun,Tianyu Guo,Hao Liang,Yuying Li,Qifeng Cai,Jingxuan Wei,Bihui Yu,Wentao Zhang,Bin Cui
Affiliations: Unknown
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark’s difficulty. All code and data are released at this https URL .
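The evaluation loop has three moving parts: a simulated user, the model under test, and an executable database that grounds every turn. A minimal sketch of that loop using an in-memory SQLite database; generate_sql is a hypothetical stand-in for the tested model (hard-coded here so the sketch runs):

```python
import sqlite3

def generate_sql(question: str, history: list) -> str:
    # hypothetical model call; a real harness would query the model under test
    return "SELECT name FROM employees WHERE salary > 50000"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", 60000), ("Bob", 40000)])

history = []
for turn in ["list the high earners", "now only salaries above 50k"]:
    sql = generate_sql(turn, history)
    rows = conn.execute(sql).fetchall()     # execution grounds each turn
    history.append((turn, sql, rows))
print(history[-1])
```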
[NLP-22] Context Engineering 2.0: The Context of Context Engineering
[Quick Read]: This paper addresses the core question of how machines can better understand human situations and purposes, especially as human-machine interaction grows increasingly complex. The key of the solution is to situate and systematize context engineering: designing and building mechanisms that dynamically perceive, reason about, and respond to the environments and intents users inhabit, enabling more natural and efficient human-machine collaboration. The paper argues that although context engineering is often treated as a new concept of the agent era, its practice traces back to interactive systems of the early 1990s and has deepened as machine intelligence has advanced, with the ultimate goal of providing AI systems with a scalable, actionable framework for modeling and applying context.
Link: https://arxiv.org/abs/2510.26493
Authors: Qishuo Hua,Lyumanshan Ye,Dayuan Fu,Yang Xiao,Xiaojie Cai,Yunze Wu,Jifan Lin,Junfei Wang,Pengfei Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Karl Marx once wrote that "the human essence is the ensemble of social relations", suggesting that individuals are not isolated entities but are fundamentally shaped by their interactions with other entities, within which contexts play a constitutive and essential role. With the advent of computers and artificial intelligence, these contexts are no longer limited to purely human–human interactions: human–machine interactions are included as well. Then a central question emerges: How can machines better understand our situations and purposes? To address this challenge, researchers have recently introduced the concept of context engineering. Although it is often regarded as a recent innovation of the agent era, we argue that related practices can be traced back more than twenty years. Since the early 1990s, the field has evolved through distinct historical phases, each shaped by the intelligence level of machines: from early human–computer interaction frameworks built around primitive computers, to today's human–agent interaction paradigms driven by intelligent agents, and potentially to human-level or superhuman intelligence in the future. In this paper, we situate context engineering, provide a systematic definition, outline its historical and conceptual landscape, and examine key design considerations for practice. By addressing these questions, we aim to offer a conceptual foundation for context engineering and sketch its promising future. This paper is a stepping stone for a broader community effort toward systematic context engineering in AI systems.
[NLP-23] Bayesian Network Fusion of Large Language Models for Sentiment Analysis
[Quick Read]: This paper addresses common shortcomings of large language models (LLMs) in sentiment analysis: limited transparency and explainability, costly fine-tuning, heavy prompt engineering, inconsistent results across domains, and high computational demands. The key of the proposed Bayesian network LLM fusion (BNLF) framework is a probabilistic late fusion of the predictions of three domain-specific LLMs (FinBERT, RoBERTa, and BERTweet), modeled as probabilistic nodes within a Bayesian network, which yields more stable, interpretable sentiment classification. Evaluated on three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF delivers consistent accuracy gains of about six percent over the baseline LLMs, demonstrating robustness to dataset variability and the effectiveness of probabilistic fusion.
Link: https://arxiv.org/abs/2510.26484
Authors: Rasoul Amirzadeh,Dhananjay Thiruvady,Fatemeh Shiri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) continue to advance, with an increasing number of domain-specific variants tailored for specialised tasks. However, these models often lack transparency and explainability, can be costly to fine-tune, require substantial prompt engineering, yield inconsistent results across domains, and impose significant adverse environmental impact due to their high computational demands. To address these challenges, we propose the Bayesian network LLM fusion (BNLF) framework, which integrates predictions from three LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic mechanism for sentiment analysis. BNLF performs late fusion by modelling the sentiment predictions from multiple LLMs as probabilistic nodes within a Bayesian network. Evaluated across three human-annotated financial corpora with distinct linguistic and contextual characteristics, BNLF demonstrates consistent gains of about six percent in accuracy over the baseline LLMs, underscoring its robustness to dataset variability and the effectiveness of probabilistic fusion for interpretable sentiment classification.
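As a simplified stand-in for the Bayesian-network fusion, here is a naive-Bayes-style late fusion: each model's per-class probabilities are treated as conditionally independent evidence and combined with Bayes' rule. The probabilities below are made up, and the independence assumption is a simplification of BNLF's actual network structure.

```python
import numpy as np

prior = np.array([1 / 3, 1 / 3, 1 / 3])          # negative / neutral / positive
finbert  = np.array([0.70, 0.20, 0.10])          # per-model class probabilities
roberta  = np.array([0.55, 0.30, 0.15])
bertweet = np.array([0.40, 0.35, 0.25])

posterior = prior * finbert * roberta * bertweet  # independent-evidence fusion
posterior /= posterior.sum()                      # renormalize

labels = ["negative", "neutral", "positive"]
print(posterior.round(3), "->", labels[int(posterior.argmax())])
```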
[NLP-24] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
[Quick Read]: This paper addresses a bottleneck in self-improvement caused by imbalanced data: the model generates high-quality trajectories for simple visual reasoning queries (head data) but makes limited progress on complex ones (tail data), and over iterations this imbalance compounds into a "Matthew effect" that stalls overall reasoning gains. The key of the solution is to introduce four efficient strategies from two perspectives, distribution-reshaping of the training data and trajectory-resampling of the learned examples, to achieve head-tail re-balancing during the exploration-and-learning loop, effectively mitigating the bias of self-improvement and markedly strengthening performance on complex visual reasoning tasks.
Link: https://arxiv.org/abs/2510.26474
Authors: Xin Guo,Zhiheng Xi,Yiwen Ding,Yitao Zhai,Xiaowei Shi,Xunliang Cai,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Fudan University; Meituan; Shanghai Innovation Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced–a dynamic we term the “Matthew effect”–which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
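One simple way to realize head-tail re-balancing is to resample queries with weight inversely related to the model's current success rate, so tail (hard) queries are drawn more often. The rule below is an illustrative assumption, not one of the paper's four strategies verbatim:

```python
import numpy as np

success_rate = np.array([0.9, 0.8, 0.5, 0.2, 0.05])  # head -> tail queries
weights = 1.0 - success_rate                          # harder gets more weight
probs = weights / weights.sum()

rng = np.random.default_rng(0)
draws = rng.choice(len(probs), size=1000, p=probs)
print(np.bincount(draws, minlength=len(probs)))       # tail queries dominate
```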
[NLP-25] SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning ICSE2026
[Quick Read]: This paper addresses the limited effectiveness of current LLM-based automated code review at identifying and resolving security-related issues: existing models target general-purpose review, lack security-specific modeling, and face data scarcity and inadequate evaluation metrics. The key of the proposed SecureReviewer approach is fourfold: a dataset built specifically for training and evaluating secure code review capabilities; a secure-aware fine-tuning strategy that improves the LLM's ability to spot security issues and suggest fixes; retrieval-augmented generation (RAG) to ground generated comments in domain-specific security knowledge, mitigating hallucination and improving output reliability; and SecureBLEU, a new metric assessing how effectively review comments address security issues. Experiments show SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the quality and practical utility of generated comments.
Link: https://arxiv.org/abs/2510.26457
Authors: Fang Liu,Simiao Liu,Yinghao Zhu,Xiaoli Lian,Li Zhang
Affiliations: Beihang University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ICSE 2026. Code and data: this https URL
Abstract:Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates’ code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review, their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs’ ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset, we fine-tune LLMs to generate code review comments that can effectively identify security issues and provide fix suggestions with our proposed secure-aware fine-tuning strategy. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.
zh
[NLP-26] 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中面临的高带宽和计算资源消耗问题。现有压缩方法如剪枝(pruning)与低秩近似(low-rank approximation)虽各自有效,但其协同效应尚未被充分探索。解决方案的关键在于提出一种联合稀疏性与低秩结构的压缩框架——协同稀疏与低秩压缩(Synergistic Sparse and Low-Rank Compression, SSLC),通过理论建模将两者统一为一个优化问题,并采用迭代优化算法求解。该方法无需额外训练即可实现显著压缩比(如Qwen2.5模型压缩50%无性能损失)和加速效果(至少1.63倍推理速度提升),从而为高效LLM部署提供了实用路径。
链接: https://arxiv.org/abs/2510.26446
作者: Zeliang Zong,Kai Zhang,Zheyang Li,Wenming Tan,Ye Ren,Yiyan Zhai,Jilin Hu
机构: Hikvision Research Institute (海康威视研究院); School of Data Science and Engineering, East China Normal University (华东师范大学数据科学与工程学院)
类目: Computation and Language (cs.CL)
备注:  15 pages, 6 figures, EMNLP 2025 findings
Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce Synergistic Sparse and Low-Rank Compression (SSLC) methods for LLMs, which leverage the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate the low-rank approximation and sparse optimization as a unified problem and solve it by an iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results. Notably, SSLC compresses Qwen2.5 by 50% with no performance drop and achieves at least a 1.63× speedup, offering a practical solution for efficient LLM deployment.
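下面给出一个“稀疏+低秩”分解的最小示意(非论文原实现):用截断 SVD 得到低秩分量 L,再对残差按幅值阈值保留稀疏分量 S。SSLC 实际采用的统一目标与迭代优化算法此处并未复现,函数名与数据均为假设:

```python
import numpy as np

def sparse_plus_low_rank(W, rank, sparsity):
    """Decompose W ≈ L + S: L is a rank-`rank` truncated-SVD approximation,
    S keeps only the largest-magnitude residual entries (fraction `sparsity`)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]      # low-rank part
    R = W - L                                        # residual
    k = int(sparsity * R.size)                       # number of entries to keep
    thresh = np.partition(np.abs(R).ravel(), -k)[-k]
    S = np.where(np.abs(R) >= thresh, R, 0.0)        # sparse correction
    return L, S

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
L, S = sparse_plus_low_rank(W, rank=32, sparsity=0.05)
err = np.linalg.norm(W - (L + S)) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```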
zh
[NLP-27] Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
【速读】: 该论文旨在解决非回归测试(non-regression testing)中的测试断言生成(test oracle generation)难题,即如何自动产生能够准确判断被测函数(Function Under Test, FUT)在给定输入下是否按预期行为执行的断言。其核心解决方案是提出一种名为Nexus的多智能体框架,其关键在于通过一组具有不同测试哲学的专用智能体(specialist agents)进行结构化的协同推理、验证与迭代自修正过程:首先由四个专家智能体对初始断言进行批判性审议和优化;随后在安全沙箱中生成候选FUT实现并执行断言以验证其正确性;对于失败的断言,系统会触发自动化自修正循环,利用运行时错误信息进行调试与修复,从而实现断言质量的持续提升。
链接: https://arxiv.org/abs/2510.26423
作者: Dong Huang,Mingzhe Du,Jie M. Zhang,Zheng Lin,Meng Luo,Qianru Zhang,See-Kiong Ng
机构: National University of Singapore (新加坡国立大学)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:  Under Review
Abstract:Test oracle generation in non-regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework to address this challenge. Nexus generates test oracles by leveraging a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated self-refinement loop, using the specific runtime error to debug and correct the oracle before re-validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-the-art baselines. For instance, Nexus improves the test-level oracle accuracy on LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The improved accuracy also significantly enhances downstream tasks: the bug detection rate of GPT-4.1-Mini-generated test oracles on HumanEval increases from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of automated program repair improves from 35.23% to 69.32%.
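下面用一个可运行的小例子示意“执行验证 + 记录运行时错误以供自修正”的核心循环;真实系统中应使用安全沙箱,并由 LLM 依据错误信息修复断言,此处的实现与命名均为示意假设:

```python
# Oracles are assert statements run against a candidate implementation in a
# shared namespace; failures are recorded with the error message that a
# repair-by-LLM step would consume.

candidate_fut = """
def add_positive(a, b):
    if a <= 0 or b <= 0:
        raise ValueError("inputs must be positive")
    return a + b
"""

oracles = [
    "assert add_positive(2, 3) == 5",
    "assert add_positive(1, 1) == 3",   # wrong oracle: should fail and be refined
]

def validate(impl_src, oracle_list):
    namespace = {}
    exec(impl_src, namespace)           # use a real sandbox in practice, not plain exec
    passed, to_refine = [], []
    for oracle in oracle_list:
        try:
            exec(oracle, namespace)
            passed.append(oracle)
        except Exception as e:          # the runtime error drives the repair prompt
            to_refine.append((oracle, repr(e)))
    return passed, to_refine

passed, to_refine = validate(candidate_fut, oracles)
print("passed:", passed)
print("needs refinement:", to_refine)
```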
zh
[NLP-28] OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在教育领域应用中评估维度单一、缺乏对育人能力(cultivation capabilities)系统评测的问题,尤其是在中文教育场景下,现有基准测试多局限于单一学科或题型,难以全面反映模型在真实教学情境中的综合表现。其解决方案的关键在于构建一个名为OmniEduBench的综合性中文教育基准数据集,该数据集包含24,602个高质量问答对,明确划分为知识维度(18,121条)与育人维度(6,481条),每个维度细分为6个子类别,覆盖61个不同学科(知识类41个、育人类20个),并涵盖11种常见考试题型,从而为LLMs在教育场景下的多维能力评估提供结构化、多样化的评测基础。
链接: https://arxiv.org/abs/2510.26422
作者: Min Zhang,Hao Chen,Hao Chen,Wenqi Zhang,Didi Zhu,Xin Lin,Bo Jiang,Aimin Zhou,Fei Wu,Kun Kuang
机构: East China Normal University (华东师范大学); Zhejiang University (浙江大学); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24,602 high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18,121 and 6,481 entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge dimension and 20 in the cultivation dimension). Furthermore, the dataset features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs’ capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human intelligence by nearly 30%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.
zh
[NLP-29] On the Role of Context for Discourse Relation Classification in Scientific Writing
【速读】: 该论文旨在解决生成式人工智能(Generative AI)在科学工作流中支持科学主张时,如何利用话语层面(discourse-level)信息来识别和提取支持证据的问题。其关键解决方案在于对科学文献中的话语关系分类(Discourse Relation Classification, DRC)任务进行初步研究,探索预训练语言模型(PLM)与大语言模型(LLM)在该任务上的表现,并验证上下文信息(由话语结构定义)对提升DRC性能的积极作用。实验表明,合理利用上下文有助于改善科学文本中话语关系的识别效果,且不同类型的科学话语关系对上下文的依赖程度存在差异,为后续基于话语结构的证据检索与推理提供了方法基础。
链接: https://arxiv.org/abs/2510.26354
作者: Stephen Wan,Wei Liu,Michael Strube
机构: CSIRO (澳大利亚联邦科学与工业研究组织); Heidelberg Institute for Theoretical Studies (海德堡理论研究所)
类目: Computation and Language (cs.CL)
备注:  Accepted at Joint Sixth Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025) and Eighth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2025)
Abstract:With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.
zh
[NLP-30] The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
【速读】: 该论文旨在解决多智能体大语言模型(Multi-agent Large Language Models, Multi-agent LLMs)团队组成中缺乏有效协作机制的问题,尤其在模型内部结构不透明的情况下如何自动识别并构建功能协同的团队。其解决方案的关键在于提出一种以交互为中心的框架,通过分析成对对话的语义一致性构建“语言模型图”(language model graph),并利用社区检测算法识别出具有潜在专业化特征的功能一致模型集群,从而实现无需先验知识(如架构、训练数据或性能指标)的自动化团队组合。
链接: https://arxiv.org/abs/2510.26352
作者: Kotaro Furuya,Yuichi Kitagawa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a “language model graph” that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
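下面的示意代码(假设成对对话的语义一致性得分已事先算好)展示“语言模型图 + 社区检测”的组队流程,使用 networkx 的贪心模块度社区检测;模型名与得分均为虚构:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical pairwise semantic-coherence scores from model-to-model chats.
coherence = {
    ("gpt-a", "gpt-b"): 0.82, ("gpt-a", "code-x"): 0.35,
    ("gpt-b", "code-x"): 0.30, ("code-x", "code-y"): 0.78,
    ("gpt-b", "code-y"): 0.28, ("gpt-a", "code-y"): 0.33,
}

G = nx.Graph()
for (u, v), w in coherence.items():
    if w >= 0.5:                      # keep only strongly coherent pairs
        G.add_edge(u, v, weight=w)

teams = greedy_modularity_communities(G, weight="weight")
for i, team in enumerate(teams):
    print(f"team {i}: {sorted(team)}")
```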
zh
[NLP-31] MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
【速读】: 该论文旨在解决健康相关虚假信息(health-related misinformation)识别困难的问题,尤其是当错误论点扭曲或误读科学发现时,传统方法难以准确识别。其解决方案的关键在于提出一种名为MisSynth的合成数据生成与轻量微调相结合的框架:首先利用检索增强生成(Retrieval-Augmented Generation, RAG)技术生成合成的谬误样本,再将这些数据用于微调大语言模型(Large Language Models, LLMs)。实验表明,该方法在MISSCI数据集上显著提升了模型性能,例如LLaMA 3.1 8B模型在测试集上的F1分数相比原始基线提高了超过35%,证明了通过少量标注资源结合合成数据增强可有效提升零样本场景下对真实世界科学虚假信息的分类能力。
链接: https://arxiv.org/abs/2510.26345
作者: Mykhailo Poliakov,Nadiya Shvai
机构: National University of Kyiv-Mohyla Academy (基辅莫吉拉国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on this https URL.
zh
[NLP-32] From Amateur to Master: Infusing Knowledge into LLM s via Automated Curriculum Learning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在经济学、心理学等专业领域表现不佳的问题,这些问题通常需要深层次的原理性理解而非泛化能力。为应对这一挑战,作者提出ACER(Automated Curriculum-Enhanced Regimen),其核心在于通过自动化生成结构化的教材式课程(textbook-style curriculum)来增强模型的专业知识,该课程基于布卢姆分类法(Bloom’s taxonomy)设计问答对(QA pairs),确保内容覆盖全面且难度逐步提升。随后,利用此合成语料库进行持续预训练,并采用交错式课程调度策略,在内容维度与认知维度上同步优化学习过程。实验表明,ACER不仅能显著提升目标领域的准确率(如微观经济学提升5个百分点),还能避免灾难性遗忘并促进跨领域正向迁移,从而实现专业能力增强与通用能力保持之间的平衡。
链接: https://arxiv.org/abs/2510.26336
作者: Nishit Neema,Srinjoy Mukherjee,Sapan Shah,Gokul Ramakrishnan,Ganesh Venkatesh
机构: Cerebras Systems
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom’s taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.
zh
[NLP-33] SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
【速读】: 该论文旨在解决教育场景中生成式 AI (Generative AI) 应用于学生反馈时面临的三大挑战:隐私保护需求、计算资源受限以及对教学有效性(pedagogical validity)的严格要求。为此,作者提出 SCRIBE 框架,其核心创新在于结合领域特定工具与自反思推理流程(self-reflective inference pipeline),支持多跳推理(multi-hop reasoning)、工具调用和错误恢复机制,并通过两阶段 LoRA 微调将能力蒸馏至 3B 和 8B 参数规模的小型开源模型中,从而在本地部署条件下实现高准确性与教学相关性。评估结果显示,8B-SCRIBE 在相关性和可操作性等关键维度上媲美甚至超越更大模型,且被学生评价为与 GPT-4o 和 Llama-3.3 70B 相当,验证了其在低资源、隐私敏感环境下的可行性。
链接: https://arxiv.org/abs/2510.26322
作者: Fares Fawzi,Vinitra Swamy,Dominik Glandorf,Tanya Nazaretsky,Tanja Käser
机构: EPFL
类目: Computation and Language (cs.CL)
备注:
Abstract:Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
zh
[NLP-34] Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在动态、交互式网页环境中的实际表现问题,特别是其在需要实时响应与精确操作的任务中能力的局限性。研究通过浏览器游戏(如 T-Rex Runner、Flappy Bird 和 Sudoku)作为测试场景,利用游戏内得分作为量化指标,评估 OpenAI 的 ChatGPT Atlas 模型的网页交互能力。解决方案的关键在于构建一个基于真实交互任务的基准测试框架,从而揭示模型在逻辑推理类任务中表现出色(如 Sudoku 中显著优于人类基线),但在依赖时间精度和运动控制的实时游戏中表现不佳,表明当前生成式 AI 在动态网页环境中仍存在显著的能力瓶颈。
链接: https://arxiv.org/abs/2510.26298
作者: Jingran Zhang,Ning Li,Justin Cui
机构: UC San Deigo (加州大学圣迭戈分校); UCLA (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:OpenAI’s ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas’s web interaction capabilities using browser-based games as test scenarios, including Google’s T-Rex Runner, Sudoku, Flappy Bird, and this http URL. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at this https URL.
zh
[NLP-35] Unravelling the Mechanisms of Manipulating Numbers in Language Models
【速读】: 该论文试图解决的问题是:尽管大型语言模型(Large Language Models, LLMs)在处理数字信息时经常产生错误输出,但近期研究表明它们对数字的输入嵌入表示却趋于一致且准确,这一现象与实际输出错误之间存在矛盾。为解释该冲突,论文提出通过系统性地分析LLMs如何在内部隐藏状态中操作数字,并量化这些机制的最低精度下限来揭示其内在规律。解决方案的关键在于发现不同LLMs虽在输出层面存在误差,但均学习到可互换的、系统性强、高精度且跨层和跨输入上下文通用的数字表示;基于此,作者构建了适用于每种LLM的通用探测器(universal probes),并能追踪信息流——包括错误产生的原因——至特定网络层,从而为理解预训练LLMs处理数值信息的机制提供了基础性洞见,并指出了改进LLM架构以提升数值准确性潜力的方向。
链接: https://arxiv.org/abs/2510.26285
作者: Michal Štefánik,Timothee Mickus,Marek Kadlčík,Bertram Højer,Michal Spiegel,Raúl Vázquez,Aman Sinha,Josef Kuchař,Philipp Mondorf
机构: TransformersClub @ Faculty of Informatics, Masaryk University (马萨里克大学信息学院); University of Helsinki (赫尔辛基大学); R&D Centre for Large Language Models, National Institute of Informatics, Japan (日本信息研究所大语言模型研发中⼼); IT University of Copenhagen (哥本哈根信息技术大学); Kempelen Institute of Information Technology (凯姆佩伦信息科技研究所); Université de Lorraine (洛林大学); MaiNLP, Center for Information and Language Processing, LMU Munich, Germany (慕尼黑大学信息与语言处理中心); Munich Center for Machine Learning (MCML), Munich, Germany (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information – including the causes of output errors – to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs’ architectures.
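论文中的“通用探测器”思路可以用一个线性探针来示意:下例在合成的隐藏状态上训练岭回归,从表示中解码数值。数据为人工构造,仅说明探针方法本身,并非论文的具体探测器:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in: hidden states that linearly encode a number plus noise.
rng = np.random.default_rng(0)
d, n = 64, 500
numbers = rng.integers(0, 1000, size=n).astype(float)
direction = rng.standard_normal(d)
hidden = np.outer(numbers, direction) + 5.0 * rng.standard_normal((n, d))

# A linear probe: fit on 400 states, decode numbers from the held-out 100.
probe = Ridge(alpha=1.0).fit(hidden[:400], numbers[:400])
pred = probe.predict(hidden[400:])
mae = np.mean(np.abs(pred - numbers[400:]))
print(f"probe MAE on held-out states: {mae:.1f}")
```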
zh
[NLP-36] Do LLMs Signal When They're Right? Evidence from Neuron Agreement
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在无监督场景下如何高效、准确地进行候选解码的问题,尤其是在缺乏真实标签(ground truth)时如何提升推理性能。当前主流方法依赖外部输出信号(如token概率、熵或自评估)来评分候选文本,但这些信号在模型后训练阶段容易出现校准偏差。解决方案的关键在于利用模型内部行为——具体而言,通过分析神经元激活模式发现:正确响应激活的神经元数量更少(即具有激活稀疏性),且跨样本间一致性更强;基于此,作者提出Neuron Agreement Decoding(NAD),一种仅使用内部神经元激活信号的无监督“最佳N”(best-of-N)解码方法,通过激活稀疏性和跨样本神经元一致性选择最优候选,无需比较文本内容即可实现早期正确性预测和激进的提前终止策略,从而在保持生成质量的同时将token消耗降低99%。
链接: https://arxiv.org/abs/2510.26277
作者: Kang Chen,Yaoning Wang,Kai Xiong,Zhuoka Feng,Wenhe Sun,Haotian Chen,Yixin Cao
机构: Fudan University (复旦大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label-free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self-evaluations, and these signals can be poorly calibrated after post-training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low-dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross-sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross-sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open-ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding.
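下面用 numpy 给出 NAD 打分思想的一个最小示意:以激活稀疏性加上与其他样本激活集合的 Jaccard 一致性来选择候选。等权重组合与二值化激活均为本文之外的简化假设:

```python
import numpy as np

def nad_select(activations):
    """Pick the candidate whose active-neuron set is sparse and agrees most
    with the other samples' sets (Jaccard overlap), using internal signals only."""
    sets = [set(np.flatnonzero(a)) for a in activations]
    scores = []
    for i, s in enumerate(sets):
        others = [j for j in range(len(sets)) if j != i]
        agree = np.mean([len(s & sets[j]) / max(1, len(s | sets[j])) for j in others])
        sparsity = 1.0 - len(s) / activations[i].size
        scores.append(agree + sparsity)        # equal weighting is an assumption
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(1)
# 4 candidates x 100 neurons; binary "was this neuron ever active" masks.
cands = (rng.random((4, 100)) < 0.15).astype(float)
cands[3] = (rng.random(100) < 0.6)             # a diffuse, likely-wrong candidate
best, scores = nad_select(cands)
print("selected candidate:", best, "scores:", np.round(scores, 3))
```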
zh
[NLP-37] PVMark: Enabling Public Verifiability for LLM Watermarking Schemes
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)水印方案中的可信性问题:现有水印检测机制依赖私有密钥,导致第三方无法验证检测结果的真实性,从而难以建立信任。其关键解决方案是提出PVMark,一种基于零知识证明(Zero-Knowledge Proof, ZKP)的插件架构,使水印检测过程在不泄露任何秘密密钥的前提下实现公开可验证性。PVMark的核心在于构建针对水印检测“正确执行”的ZKP约束体系,涵盖映射、随机数生成、比较和求和等操作,从而确保检测逻辑的透明性和安全性。实验表明,PVMark可在多种水印方案与ZKP协议组合下有效运行,同时保持原有水印性能,具备实际部署潜力。
链接: https://arxiv.org/abs/2510.26274
作者: Haohua Duan,Liyao Xiang,Xin Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:  This work has been submitted to the IEEE for possible publication
Abstract:Watermarking schemes for large language models (LLMs) have been proposed to identify the source of the generated text, mitigating the potential threats emerged from model theft. However, current watermarking solutions hardly resolve the trust issue: the non-public watermark detection cannot prove itself faithfully conducting the detection. We observe that it is attributed to the secret key mostly used in the watermark detection – it cannot be public, or the adversary may launch removal attacks provided the key; nor can it be private, or the watermarking detection is opaque to the public. To resolve the dilemma, we propose PVMark, a plugin based on zero-knowledge proof (ZKP), enabling the watermark detection process to be publicly verifiable by third parties without disclosing any secret key. PVMark hinges upon the proof of 'correct execution' of watermark detection on which a set of ZKP constraints are built, including mapping, random number generation, comparison, and summation. We implement multiple variants of PVMark in Python, Rust and Circom, covering combinations of three watermarking schemes, three hash functions, and four ZKP protocols, to show our approach effectively works under a variety of circumstances. By experimental results, PVMark efficiently enables public verifiability on the state-of-the-art LLM watermarking schemes yet without compromising the watermarking performance, promising to be deployed in practice.
zh
[NLP-38] Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
【速读】: 该论文旨在解决多语言视觉-语言模型(Vision-Language Models, VLMs)在模型压缩过程中出现的跨语言性能不均衡问题,尤其是在小模型中这一问题更为显著。现有知识蒸馏(Knowledge Distillation, KD)方法虽在单语境下表现良好,但在多语言场景下的应用仍缺乏系统研究。论文通过控制实验比较五种不同的KD策略,发现部分蒸馏配置能够在模型规模减半的情况下维持甚至提升多语言检索的鲁棒性,而另一些则无法保证跨任务稳定性,揭示了蒸馏设计中的敏感权衡关系——仅依赖整体准确率无法捕捉这些关键差异。其解决方案的关键在于识别并优化蒸馏过程中对跨语言表征一致性(cross-lingual representation consistency)和下游任务稳定性(downstream performance stability)的协同影响机制。
链接: https://arxiv.org/abs/2510.26271
作者: Sukrit Sriratanawilai,Jhayahgrit Thongwat,Romrawin Chumpu,Patomporn Payoungkhamdee,Sarana Nutanong,Peerat Limkonchotiwat
机构: VISTEC(视觉技术研究所); AI Singapore(人工智能新加坡)
类目: Computation and Language (cs.CL)
备注:  Work in progress
Abstract:Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While Knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingualism is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
zh
[NLP-39] Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
【速读】: 该论文试图解决的问题是:预训练语言模型(包括大语言模型)是否具备识别借词(loanword)与本族词(native word)的能力,尤其是在受强势语言持续影响的少数语言社区中。解决方案的关键在于通过在10种不同语言上系统评估多个主流语言模型,并结合显式指令和上下文信息进行测试,发现这些模型在区分借词与本族词方面表现不佳,且存在对借词的偏好偏差,这揭示了当前自然语言处理(NLP)系统在支持少数语言保护方面的局限性。
链接: https://arxiv.org/abs/2510.26254
作者: Mérilin Sousa Silva,Sina Ahmadi
机构: 未知
类目: Computation and Language (cs.CL)
备注:  Under review
Abstract:Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient’s lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
zh
[NLP-40] Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
【速读】: 该论文旨在解决语言模型在理解语用含义(pragmatic meaning)方面能力不足的问题,尤其是在需要推断言外之意(implicature)的任务中表现有限。其解决方案的关键在于引入语用理论作为上下文提示(in-context prompt),通过提供格赖斯语用学(Gricean pragmatics)和关联理论(Relevance Theory)的概要,引导模型进行分步推理,从而显著提升其对隐含意义的理解能力。实验表明,相较于仅使用零样本链式思维(0-shot Chain-of-Thought)的基线方法,该策略使模型在语用推理任务上得分最高提升9.6%;即使不详述理论细节,仅提及理论名称也能在大模型中带来约1–3%的性能增益。
链接: https://arxiv.org/abs/2510.26253
作者: Takuma Sato,Seiya Kawano,Koichiro Yoshino
机构: Nara Institute of Science and Technology (奈良科学技术大学院大学); RIKEN (理化学研究所); Kyoto Institute of Technology (京都工芸纤维大学); Institute of Science Tokyo (东京科学大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.
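该方法的提示构造可用如下示意代码表达:把语用理论概要与待解读的话语拼接成分步推理提示。其中的理论概述为本文转述,并非论文原始提示词:

```python
# A hedged sketch of theory-grounded prompting; the theory summary below is a
# paraphrase for illustration, not the paper's exact prompt text.
GRICE = ("Gricean pragmatics: assume the speaker is cooperative and follows "
         "maxims of quantity, quality, relation, and manner; flouting a maxim "
         "signals an implicature.")

def pragmatic_prompt(utterance, question):
    return (
        f"{GRICE}\n\n"
        f"Utterance: {utterance}\n"
        f"Question: {question}\n"
        "Reason step by step using the theory above, then state the implied meaning."
    )

print(pragmatic_prompt(
    "A: Did you like the cake? B: Well, the frosting was colorful.",
    "Did B like the cake?",
))
```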
zh
[NLP-41] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视频理解中对时间信息掌握薄弱且缺乏系统评估的问题。其核心挑战是判断视频片段的时序方向(Arrow of Time, AoT),即识别一段短片是正向播放还是反向播放。解决方案的关键在于提出并构建了一个心理学与生理学验证过的基准测试集——AoT-PsyPhyBENCH,该基准使用与人类行为实验一致的刺激和基线,系统性地评估VLMs在物理不可逆过程(如自由落体、扩散/爆炸)和因果手动动作(如分割/加法)等场景下的时间推理能力。结果表明,多数VLMs表现接近随机水平,即使最优模型也显著落后于人类,揭示了现有模型在时间连续性和因果推理方面的根本性不足。
链接: https://arxiv.org/abs/2510.26241
作者: Shiho Matta,Lis Kanashiro Pereira,Peitao Han,Fei Cheng,Shigeru Kitazawa
机构: Kyoto University (京都大学); Center for Information and Neural Networks (信息与神经网络中心); National Institute of Information and Communications Technology (信息通信技术国立研究所); The University of Osaka (大阪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:  10 pages
Abstract:Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
zh
[NLP-42] Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理全局信息聚合任务时表现不佳的问题,即现有评估基准主要聚焦于局部RAG(local RAG),难以衡量模型对整个文档集合进行跨段落信息整合与分析的能力。针对这一局限,作者提出GlobalQA——首个专门用于评估全局RAG(global RAG)能力的基准,涵盖计数、极值查询、排序和Top-K提取四类核心任务。解决方案的关键在于提出GlobalRAG框架,其通过三个核心机制实现:(1)基于chunk-level检索保持结构连贯性;(2)引入LLM驱动的智能过滤器去除噪声文档;(3)集成聚合模块以支持精确的符号计算。实验表明,在Qwen2.5-14B模型上,GlobalRAG相较最强基线F1得分从1.51提升至6.63,验证了该方法的有效性。
链接: https://arxiv.org/abs/2510.26205
作者: Qi Luo,Xiaonan Li,Tingshuo Fan,Xinchi Chen,Xipeng Qiu
机构: Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability – global RAG – which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, “What are the top 10 most cited papers in 2023?”). In this paper, we introduce GlobalQA – the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline’s 1.51 F1, validating the effectiveness of our method.
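GlobalRAG 的聚合模块负责精确的符号计算(计数、极值、排序、Top-K)。下面是一个与论文实现无关的最小示意,展示在检索到的结构化记录上如何做这类聚合;记录与字段名均为虚构:

```python
# Hypothetical retrieved records for a corpus-level query such as
# "What are the top 3 most cited papers in 2023?" — the aggregation is exact
# and symbolic, not left to the LLM.
records = [
    {"title": "Paper A", "year": 2023, "citations": 120},
    {"title": "Paper B", "year": 2023, "citations": 340},
    {"title": "Paper C", "year": 2022, "citations": 95},
    {"title": "Paper D", "year": 2023, "citations": 210},
]

def aggregate(records, op, key=None, k=1):
    if op == "count":
        return len(records)
    if op == "max":
        return max(records, key=lambda r: r[key])
    if op == "sort":
        return sorted(records, key=lambda r: r[key], reverse=True)
    if op == "topk":
        return sorted(records, key=lambda r: r[key], reverse=True)[:k]
    raise ValueError(op)

print(aggregate([r for r in records if r["year"] == 2023], "topk", "citations", k=3))
```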
zh
[NLP-43] What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
【速读】: 该论文旨在解决人类反馈数据(human feedback)在训练语言模型时所蕴含的偏好信息不明确、难以解释的问题,尤其关注如何自动提取人类标注者实际表达的偏好特征,而无需预先设定假设。其解决方案的关键在于提出 What’s In My Human Feedback? (WIMHF),一种基于稀疏自编码器(sparse autoencoders)的方法,能够从反馈数据中学习出少量可解释的人类偏好特征,并区分数据集所能测量的偏好范围与标注者真实表达的偏好模式。该方法揭示了不同数据集背景下人类偏好的多样性及其潜在风险(如对拒绝回答的负面倾向),并进一步支持更安全的数据筛选与个性化建模,从而提升偏好学习的透明度和可控性。
链接: https://arxiv.org/abs/2510.26202
作者: Rajiv Movva,Smitha Milli,Sewon Min,Emma Pierson
机构: UC Berkeley (加州大学伯克利分校); FAIR at Meta (Meta人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:  Code: this https URL
Abstract:Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What’s In My Human Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders. WIMHF characterizes both (1) the preferences a dataset is capable of measuring and (2) the preferences that the annotators actually express. Across 7 datasets, WIMHF identifies a small number of human-interpretable features that account for the majority of the preference prediction signal achieved by black-box models. These features reveal a wide diversity in what humans prefer, and the role of dataset-level context: for example, users on Reddit prefer informality and jokes, while annotators in HH-RLHF and PRISM disprefer them. WIMHF also surfaces potentially unsafe preferences, such as that LMArena users tend to vote against refusals, often in favor of toxic content. The learned features enable effective data curation: re-labeling the harmful examples in Arena yields large safety gains (+37%) with no cost to general performance. They also allow fine-grained personalization: on the Community Alignment dataset, we learn annotator-specific weights over subjective features that improve preference prediction. WIMHF provides a human-centered analysis method for practitioners to better understand and use preference data.
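WIMHF 的核心组件是稀疏自编码器。下面给出一个最小可运行的 PyTorch 草图(以 L1 惩罚诱导稀疏特征),输入用随机向量代替真实的偏好数据嵌入;维度与超参数均为示意假设:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Tiny sparse autoencoder: an overcomplete dictionary with an L1 penalty
    so each embedding is explained by a few (hopefully interpretable) features."""
    def __init__(self, d_in=32, d_hidden=128):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))
        return self.dec(z), z

torch.manual_seed(0)
x = torch.randn(512, 32)                       # stand-in feedback embeddings
sae, l1 = SparseAutoencoder(), 1e-3
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(200):
    recon, z = sae(x)
    loss = ((recon - x) ** 2).mean() + l1 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss {loss.item():.4f}, mean active features "
      f"{(z > 0).float().sum(1).mean().item():.1f} / 128")
```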
zh
[NLP-44] Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation NEURIPS2025
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在实际应用中可控性脆弱的问题,特别是由于均匀且与上下文无关的更新导致的“更新遗忘”(update forgetting)现象——即在时间步(timestep)间产生词元级波动,破坏早期语义编辑效果,进而损害文本流畅性和连贯性。解决方案的关键在于引入Token Timestep Allocation (TTA),通过为每个词元分配特定的时间步调度策略实现软性的语义排序:关键词元早期冻结,不确定词元持续优化。这种基于时间步的排序机制可作为固定或由任务信号驱动的自适应策略,无需修改模型结构即可在推理阶段统一提升多种DLM的可控性与生成质量。
链接: https://arxiv.org/abs/2510.26200
作者: Woojin Kim,Jaeyoung Do
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:  Accepted in NeurIPS 2025
Abstract:While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context-agnostic updates induce token-level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context-agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per-token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep-based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
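TTA 的固定策略版本可以用几行 numpy 示意:置信度高于分位数阈值的词元提前冻结(只分配较少的细化步数),其余词元保留完整步数。阈值与步数比例为假设值,并非论文设定:

```python
import numpy as np

def allocate_timesteps(confidence, total_steps=10, freeze_quantile=0.7):
    """Per-token timestep budgets: high-confidence tokens freeze early,
    uncertain tokens keep refining (a fixed-policy sketch of the idea)."""
    cutoff = np.quantile(confidence, freeze_quantile)
    return np.where(confidence >= cutoff,
                    total_steps // 4,     # critical tokens: stop refining early
                    total_steps)          # uncertain tokens: full refinement

conf = np.array([0.95, 0.40, 0.88, 0.15, 0.99, 0.55])
print(allocate_timesteps(conf))   # fewer steps for the most confident tokens
```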
zh
[NLP-45] RCScore: Quantifying Response Consistency in Large Language Models
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估中普遍存在的单一指令模板问题,即现有评测方法忽视了模型对指令风格(instruction style)的敏感性,而这在实际部署场景中至关重要。其解决方案的关键在于提出RCScore框架,通过系统性地将基准任务转化为多种指令风格,量化指令形式对模型输出的影响;同时引入交叉响应相似性(Cross-Response Similarity, CRS)作为衡量模型在不同指令风格下输出一致性的指标,并发现该一致性与任务准确率高度相关,从而为评估模型鲁棒性和可靠性提供了新的量化依据。
链接: https://arxiv.org/abs/2510.26193
作者: Dongjun Jang,Youngchae Ahn,Hyopil Shin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Current LLM evaluations often rely on a single instruction template, overlooking models’ sensitivity to instruction style-a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7% points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
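CRS 衡量同一问题在不同指令风格下输出的一致性。下例用 TF-IDF 余弦相似度的成对平均作一个简化示意;论文实际是在 RCScore 指标上计算一致性,此处的相似度度量为替代假设:

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Responses to the same problem under three hypothetical instruction styles.
responses = [
    "The answer is 42 because 6 times 7 equals 42.",
    "Multiplying 6 by 7 gives 42, so the answer is 42.",
    "I think the result might be 36.",
]

tfidf = TfidfVectorizer().fit_transform(responses)
sims = cosine_similarity(tfidf)
pairs = list(combinations(range(len(responses)), 2))
crs = sum(sims[i, j] for i, j in pairs) / len(pairs)
print(f"cross-response similarity: {crs:.3f}")
```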
zh
[NLP-46] SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level
【速读】: 该论文旨在解决传统文本到语音(Text-to-Speech, TTS)系统评估中过度依赖词错误率(Word Error Rate, WER)所带来的局限性,即WER无法有效反映真实场景下语音的可理解性与关键信息传递的准确性。其解决方案的关键在于提出一种新的主观评估方法——口语段落多项选择题问答(Spoken-Passage Multiple-Choice Question Answering, SP-MCQA),该方法通过设计基于新闻风格语料的8.76小时基准数据集SP-MCQA-Eval,直接衡量合成语音中关键信息的准确度。实验表明,低WER并不等价于高关键信息准确率,揭示了现有指标与实际人类理解需求之间的鸿沟,同时指出当前最先进的TTS模型在文本归一化和音素准确性方面仍存在不足,凸显了构建更高层次、更贴近真实场景的评估标准的紧迫性。
链接: https://arxiv.org/abs/2510.26190
作者: Hitomi Jin Ling Tee,Chaoren Wang,Zijie Zhang,Zhizheng Wu
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.
zh
[NLP-47] Similarity-Distance-Magnitude Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)模型在指令遵循任务中因概率分布不准确而导致的过度弃权(abstentions)问题,即模型在不确定时倾向于不生成回答,从而降低统计效率。解决方案的关键在于引入相似性-距离-幅度(Similarity-Distance-Magnitude, SDM)语言模型架构,通过监督微调使预训练的解码器-only Transformer 语言模型最大化生成样本落在由最终层 SDM 激活层划分的高概率、校准区域内,该激活层用于二分类判断是否遵循指令;训练过程中利用对比输入编码方案和在线生成的硬负例,结合 SDM 层估计基变换以优化下一词预测损失,显著减少弃权行为并提升模型统计效率。
链接: https://arxiv.org/abs/2510.26183
作者: Allen Schmaltz
机构: Reexpress AI
类目: Computation and Language (cs.CL)
备注:  8 pages, 5 tables
Abstract:We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.
zh
[NLP-48] MossNet: Mixture of State-Space Experts is a Multi-Head Attention
【速读】: 该论文旨在解决当前基于状态空间模型(State-space Models, SSMs)或门控循环模型(Gated Recurrent Models, GRMs)的大型语言模型(Large Language Models, LLMs)在表达能力上的局限性问题,即现有方法通常仅能模拟单一注意力头(attention head),从而限制了模型的建模能力。其解决方案的关键在于提出MossNet——一种新颖的“状态空间专家混合”(mixture-of-state-space-experts)架构,通过在时间混洗(time-mixing)SSM核中引入专家混合(Mixture-of-Experts, MoE)机制,实现线性多头注意力(linear multi-head attention, MHA)的模拟;同时,MoE结构也被扩展至通道混洗MLP模块,从而在保持高效计算的同时显著增强模型的表达能力与可扩展性。
链接: https://arxiv.org/abs/2510.26182
作者: Shikhar Tuli,James Seale Smith,Haris Jeelani,Chi-Heng Lin,Abhishek Patel,Vasili Ramanishka,Yen-Chang Hsu,Hongxia Jin
机构: Samsung Research America (三星研究院美国)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple “attention heads.” Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrate favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.
zh
[NLP-49] One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
【速读】: 该论文旨在解决工具调用(function-calling)场景中缺乏专门设计的奖励模型(Reward Models, RMs)的问题,从而限制了代理型人工智能(agentic AI)在工具使用能力上的进展。其核心解决方案是提出 ToolRM,一个轻量级生成式奖励模型家族,专为通用工具使用场景设计;关键创新在于构建了一种新颖的数据构建流水线,通过规则评分与多维采样策略生成高质量配对偏好数据(ToolPref-Pairwise-30K),支持基于可验证反馈的强化学习训练,并结合 TRBench_BFCL 基准进行评估,显著提升了模型在工具调用任务中的准确率与推理效率。
链接: https://arxiv.org/abs/2510.26167
作者: Renhao Li,Jianhong Tu,Yang Su,Hamid Alinejad-Rokny,Derek F. Wong,Junyang Lin,Min Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench_BFCL, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
zh
[NLP-50] Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在非数学与代码领域中推理能力不足的问题,尤其是在缺乏专门奖励模型的情况下如何有效提升通用推理能力。其解决方案的关键在于提出一种简明的两阶段训练课程(Reasoning Curriculum):第一阶段通过仅使用数学数据并结合可验证奖励信号进行强化学习(Reinforcement Learning, RL),冷启动式地激发基础推理技能;第二阶段则在多领域混合数据上执行联合强化学习,实现推理技能的迁移与巩固。该方法无需专用奖励模型,仅依赖标准的可验证性检查,具有轻量化、通用性强且易于部署的优势。
链接: https://arxiv.org/abs/2510.26143
作者: Bo Pang,Deqian Kong,Silvio Savarese,Caiming Xiong,Yingbo Zhou
机构: Salesforce AI Research (Salesforce人工智能研究中心); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:  9 pages
Abstract:Reinforcement learning (RL) can elicit strong reasoning in large language models (LLMs), yet most open efforts focus on math and code. We propose Reasoning Curriculum, a simple two-stage curriculum that first elicits reasoning skills in pretraining-aligned domains such as math, then adapts and refines these skills across other domains via joint RL. Stage 1 performs a brief cold start and then math-only RL with verifiable rewards to develop reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills. The curriculum is minimal and backbone-agnostic, requiring no specialized reward models beyond standard verifiability checks. Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning curriculum yields consistent gains. Ablations and a cognitive-skill analysis indicate that both stages are necessary and that math-first elicitation increases cognitive behaviors important for solving complex problems. Reasoning Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
zh
[NLP-51] On the Influence of Discourse Relations in Persuasive Texts
【速读】: 该论文旨在解决说服技巧(Persuasion Techniques, PTs)与话语关系(Discourse Relations, DRs)之间关联性缺乏系统研究的问题,尤其在无同时标注PT和DR的公开数据集背景下。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)与提示工程(prompt engineering),基于SemEval 2023 Task 3中已标注19类PT的数据集,构建LLM分类器以自动标注每个实例对应的PDTB 3.0 Level-2 DRs,进而生成包含双标签的银质数据集(silver datasets)。通过集成不同多数投票策略形成5个银质数据集,并结合统计分析揭示了6种关键话语关系(如因果、目的、对比等)在特定说服技巧(如负载语言、夸张/最小化、重复等)中的显著作用,为在线虚假信息检测与有效传播机制理解提供实证基础。
链接: https://arxiv.org/abs/2510.26124
作者: Nawar Turk,Sevag Kaspar,Leila Kosseim
机构: 未知
类目: Computation and Language (cs.CL)
备注:  Published in Proceedings of the 38th Canadian Conference on Artificial Intelligence (CanAI 2025), Calgary, Alberta, May 26-27, 2025. 5 figures, 7 tables
Abstract:This paper investigates the relationship between Persuasion Techniques (PTs) and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and prompt engineering. Since no dataset annotated with both PTs and DRs exists, we took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point and developed LLM-based classifiers to label each instance of the dataset with one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10 different prompts, resulting in 40 unique DR classifiers. Ensemble models using different majority-pooling strategies were used to create 5 silver datasets of instances labelled with both persuasion techniques and level-2 PDTB senses. The silver dataset sizes vary from 1,281 instances to 204 instances, depending on the majority pooling technique used. Statistical analysis of these silver datasets shows that six discourse relations (namely Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play a crucial role in persuasive texts, especially in the use of Loaded Language, Exaggeration/Minimisation, Repetition and to cast Doubt. This insight can contribute to detecting online propaganda and misinformation, as well as to our general understanding of effective communication.
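银质数据集的规模随多数投票策略而变化,这一点可以用如下示意代码说明:只有当某个语篇关系获得不低于给定比例的票数时,该实例才被保留;阈值与标签均为虚构示例:

```python
from collections import Counter

# Hypothetical DR labels from several LLM-prompt classifiers for one instance.
votes = ["Cause", "Cause", "Contrast", "Cause", "Concession", "Cause"]

def majority_pool(labels, min_agreement=0.5):
    """Keep an instance for the silver dataset only if one sense wins at least
    `min_agreement` of the votes; stricter thresholds shrink the dataset."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None

print(majority_pool(votes))                      # 'Cause' (4/6 agreement)
print(majority_pool(votes, min_agreement=0.9))   # None: instance dropped
```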
zh
[NLP-52] Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
【速读】: 该论文旨在解决测试时缩放(Test-Time Scaling, TTS)在提升大语言模型(Large Language Models, LLMs)推理能力时因输出多样性不足而导致的性能瓶颈问题。其核心原因在于当前普遍采用的“一题一解”(One Problem, One Solution, 1P1S)训练范式仅提供单一标准答案,导致模型倾向于收敛到有限的推理路径。为此,作者提出“一题多解”(One Problem, Multiple Solutions, 1PNS)训练范式,通过引入多样化的有效推理轨迹来增强模型在推理阶段的多样性。该方案的关键创新在于提出一种步骤级的语义差异度量方法——推理路径分歧(Reasoning Path Divergence, RPD),该指标能够对长链思维(Long Chain-of-Thought)进行对齐与评分,从而可靠地捕捉中间推理步骤间的差异,并据此筛选每道题目下最大多样性的解集用于微调,显著提升了TTS的效果,在pass@k指标上平均优于1P1S基线+2.80%,在AIME24数据集上更是达到+4.99%的提升。
链接: https://arxiv.org/abs/2510.26122
作者: Feng Ju,Zeyu Qin,Rui Min,Zhitao He,Lingpeng Kong,Yi R. Fung
机构: The Hong Kong University of Science and Technology (香港科技大学); University of Science and Technology of China (中国科学技术大学); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common “one problem, one solution” (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a “one problem, multiple solutions” (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at this https URL .
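RPD 是步骤级的语义差异度量。下面给出一个简化示意:先用 TF-IDF 余弦相似度做贪心的步骤对齐,再取未匹配程度的平均值作为分歧度;论文的对齐与打分方式可能不同,此处仅演示思路:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def reasoning_path_divergence(steps_a, steps_b):
    """Greedy step alignment by cosine similarity, then average dissimilarity.
    A sketch of the idea; the paper's RPD may align and score differently."""
    vec = TfidfVectorizer().fit(steps_a + steps_b)
    sims = cosine_similarity(vec.transform(steps_a), vec.transform(steps_b))
    matched = sims.max(axis=1)          # best match in B for each step in A
    return float(1.0 - matched.mean())

sol1 = ["Let x be the unknown.", "Set up 2x + 3 = 11.", "Solve: x = 4."]
sol2 = ["Guess and check values.", "Try x = 4: 2*4 + 3 = 11.", "So x = 4."]
print(f"RPD-style divergence: {reasoning_path_divergence(sol1, sol2):.3f}")
```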
zh
[NLP-53] QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在量子编程领域应用中的评估难题,特别是缺乏能够结合硬件反馈与人类编码实践的系统性评测框架。现有研究多聚焦于通用代码生成任务,而量子编程涉及物理设备交互、电路复杂度优化等特殊需求,传统基于Python执行环境的评估方法无法提供关键的硬件感知指标(如电路深度、执行时间及错误分类)。为此,作者提出QCoder Benchmark,其核心创新在于:一方面引入量子模拟器环境以获取域特定的硬件反馈指标,从而指导更高质量的代码生成;另一方面整合真实编程竞赛中的人类代码作为基准,实现定量对比与定性分析。实验表明,即使先进模型如GPT-4o仅达18.97%准确率,而基于推理增强的模型(如o3)可达78%,显著优于人类平均表现(39.98%),验证了该框架的有效性和挑战性。
链接: https://arxiv.org/abs/2510.26101
作者: Taku Mikuriya,Tatsuya Ishigaki,Masayuki Kawarada,Shunya Minami,Tadashi Kadowaki,Yohichi Suzuki,Soshun Naito,Shunya Takata,Takumi Kato,Tamotsu Basseda,Reo Yamada,Hiroya Takamura
机构: National Institute of Advanced Industrial Science and Technology (AIST); Yokohama National University; CyberAgent, Inc.; The University of Tokyo; Keio University; NTT DATA GROUP Corporation; Miletos inc.; University of Tsukuba
类目: Computation and Language (cs.CL); Programming Languages (cs.PL); Quantum Physics (quant-ph)
备注:
Abstract:Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written codes. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming averaged success rates of human-written codes (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.
zh
[NLP-54] ORBIT - Open Recommendation Benchmark for Reproducible Research with Hidden Tests NEURIPS2025
【速读】: 该论文旨在解决推荐系统研究中因数据集无法真实反映用户行为以及评估设置不一致导致结论模糊的问题。其解决方案的关键在于提出一个统一的基准测试框架——Open Recommendation Benchmark for Reproducible Research with HIdden Tests (ORBIT),该框架包含标准化的公共数据集划分、透明的评估设置及公开排行榜,同时引入一个新的网页推荐任务ClueWeb-Reco,该数据集基于8700万高质量网页的真实用户浏览序列构建,且经过用户授权和隐私保护处理,用于作为隐藏测试集以评估模型在大规模网页推荐场景下的泛化能力。通过在公共基准上评测12种代表性推荐模型,并在ClueWeb-Reco上引入提示型大语言模型(prompted LLM)基线,ORBIT验证了现有方法在真实场景中的局限性,并揭示了LLM集成带来的改进潜力。
链接: https://arxiv.org/abs/2510.26095
作者: Jingyuan He,Jiongnan Liu,Vishan Vishesh Oberoi,Bolin Wu,Mahima Jagadeesh Patel,Kangrui Mao,Chuning Shi,I-Ta Lee,Arnold Overwijk,Chenyan Xiong
机构: Language Technologies Institute, Carnegie Mellon University (卡内基梅隆大学语言技术研究所); Meta
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:  Accepted to NeurIPS 2025 Datasets & Benchmarks track
Abstract:Recommender systems are among the most impactful AI applications, interacting with billions of users every day, guiding them to relevant products, services, or information tailored to their preferences. However, the research and development of recommender systems are hindered by existing datasets that fail to capture realistic user behaviors and inconsistent evaluation settings that lead to ambiguous conclusions. This paper introduces the Open Recommendation Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified benchmark for consistent and realistic evaluation of recommendation models. ORBIT offers a standardized evaluation framework of public datasets with reproducible splits and transparent settings for its public leaderboard. Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco, featuring web browsing sequences from 87 million public, high-quality webpages. ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and privacy-guaranteed browsing data. It aligns with modern recommendation scenarios and is reserved as the hidden test part of our leaderboard to challenge recommendation models’ generalization ability. ORBIT measures 12 representative recommendation models on its public benchmark and introduces a prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results reflect general improvements of recommender systems on the public datasets, with variable individual performances. The results on the hidden test reveal the limitations of existing approaches in large-scale webpage recommendation and highlight the potential for improvements with LLM integrations. ORBIT benchmark, leaderboard, and codebase are available at this https URL.
zh
[NLP-55] Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
【速读】: 该论文旨在解决知识蒸馏(Knowledge Distillation, KD)对模型去偏能力(debiasing capability)的影响问题,特别是KD是否能够有效传递教师模型的去偏能力至学生模型,以及其内在机制如何作用于不同类型的偏差。研究发现,KD通常会削弱模型的去偏能力,且训练去偏模型时引入教师知识并无益处;尽管整体鲁棒性可能保持稳定,但不同偏差类型的表现存在显著差异。关键解决方案在于:构建高质量增强数据以提升蒸馏效果、采用迭代式知识蒸馏策略,以及使用教师模型权重初始化学生模型,从而改善去偏方法的可蒸馏性。
链接: https://arxiv.org/abs/2510.26038
作者: Jiali Cheng,Chirag Agarwal,Hadi Amiri
机构: University of Massachussetts Lowell (马萨诸塞大学洛厄尔分校); University of Virginia (弗吉尼亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Knowledge distillation (KD) is an effective method for model compression and transferring knowledge between models. However, its effect on a model's robustness against spurious correlations that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on the transferability of "debiasing" capabilities from teacher models to student models on natural language inference (NLI) and image classification tasks. Through extensive experiments, we illustrate several key findings: (i) overall, the debiasing capability of a model is undermined post-KD; (ii) training a debiased model does not benefit from injecting teacher knowledge; (iii) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases; and (iv) we pinpoint the internal attention pattern and circuit that cause the distinct behavior post-KD. Given the above findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its internal mechanism at scale. Our findings provide understanding of how KD works and how to design better debiasing methods.
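就文中的改进方案而言,标准蒸馏部分通常写成“软标签 KL + 硬标签交叉熵”。下面给出该损失的最小示意,温度 T 与权重 alpha 为常用默认值(示例假设);学生模型可按论文建议先从教师权重初始化,例如 `student.load_state_dict(teacher.state_dict(), strict=False)`。

```python
# 示意:常见的知识蒸馏损失(非论文官方实现)。
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # 软标签项:学生与教师的温度化分布做 KL,乘 T^2 保持梯度量级
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # 硬标签项:常规交叉熵
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```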
zh
[NLP-56] SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在规划与工具调用过程中引入的新安全风险问题,提出了一种通用的红队测试框架SIRAJ,以系统性发现漏洞并保障其安全部署。解决方案的关键在于:首先通过动态两阶段流程生成覆盖多种风险结果、工具使用轨迹和风险来源的多样化种子测试用例;其次基于先前执行轨迹迭代构建并优化模型驱动的对抗攻击;同时引入模型蒸馏方法,利用教师模型的结构化推理过程训练出更小但同样有效的红队模型,从而显著降低红队测试成本并提升攻击成功率。
链接: https://arxiv.org/abs/2510.26037
作者: Kaiwen Zhou,Ahmed Elgohary,A S M Iftekhar,Amin Saied
机构: Microsoft Responsible AI Research (微软负责任人工智能研究); University of California Santa Cruz (加州大学圣克鲁兹分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields a 2-2.5x boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
zh
[NLP-57] Artificial Intelligence-Enabled Analysis of Radiology Reports: Epidemiology and Consequences of Incidental Thyroid Findings
【速读】: 该论文旨在解决偶然性甲状腺发现(Incidental Thyroid Findings, ITFs)在临床影像学检查中日益增多但其流行病学特征、影像学表现及后续临床结局尚不明确的问题。研究通过构建基于Transformer架构的自然语言处理(Natural Language Processing, NLP)管道,从多模态、多部位的放射学报告中自动识别ITFs并提取结节特征,进而分析其患病率、临床转归及相关风险因素。关键解决方案在于开发并验证了一种高精度的NLP系统,实现了对非甲状腺影像报告中ITFs的自动化挖掘与结构化信息提取,从而支持大规模回顾性队列研究,揭示了ITFs与甲状腺癌过度诊断之间的强关联,为优化随访策略和标准化报告提供了依据。
链接: https://arxiv.org/abs/2510.26032
作者: Felipe Larios,Mariana Borras-Osorio,Yuqi Wu,Ana Gabriela Claros,David Toro-Tobon,Esteban Cabezas,Ricardo Loor-Torres,Maria Mateo Chavez,Kerly Guevara Maldonado,Luis Vilatuna Andrango,Maria Lizarazo Jimenez,Ivan Mateo Alzamora,Misk Al Zahidy,Marcelo Montero,Ana Cristina Proano,Cristian Soto Jacome,Jungwei W. Fan,Oscar J. Ponce-Ponte,Megan E. Branda,Naykky Singh Ospina,Juan P. Brito
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Importance: Incidental thyroid findings (ITFs) are increasingly detected on imaging performed for non-thyroid indications. Their prevalence, features, and clinical consequences remain undefined. Objective: To develop, validate, and deploy a natural language processing (NLP) pipeline to identify ITFs in radiology reports and assess their prevalence, features, and clinical outcomes. Design, Setting, and Participants: Retrospective cohort of adults without prior thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline identified ITFs and extracted nodule characteristics from image reports from multiple modalities and body regions. Main Outcomes and Measures: Prevalence of ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer diagnosis. Logistic regression identified demographic and imaging-related factors. Results: Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9% women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more likely in women, older adults, those with higher BMI, and when imaging was ordered by oncology or internal medicine. Compared with chest CT, ITFs were more likely via neck CT, PET, and nuclear medicine scans. Nodule characteristics were poorly documented, with size reported in 44% and other features in fewer than 15% (e.g., calcifications). Compared with patients without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis, biopsy, thyroidectomy, and thyroid cancer diagnosis. Most cancers were papillary, and larger when detected after ITFs vs no ITF. Conclusions: ITFs were common and strongly associated with cascades leading to the detection of small, low-risk cancers. These findings underscore the role of ITFs in thyroid cancer overdiagnosis and the need for standardized reporting and more selective follow-up.
zh
[NLP-58] Rethinking Cross-lingual Alignment: Balancing Transfer and Cultural Erasure in Multilingual LLMs
【速读】: 该论文试图解决跨语言对齐(Cross-lingual Alignment, CLA)过程中存在的“文化消解”问题,即在追求多语言表示收敛以实现知识迁移的同时,导致模型丧失提供基于查询语言文化语境的差异化响应能力。解决方案的关键在于发现通用事实知识迁移与文化特异性知识在不同模型层具有可分离性,并提出一种名为“外科导向”(Surgical Steering)的推理时干预方法,通过针对特定层进行激活引导(activation steering),从而在不损害知识迁移性能的前提下提升文化定位能力,实现二者之间的更优平衡。
链接: https://arxiv.org/abs/2510.26024
作者: HyoJung Han,Sweta Agrawal,Eleftheria Briakou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Cross-lingual alignment (CLA) aims to align multilingual representations, enabling Large Language Models (LLMs) to seamlessly transfer knowledge across languages. While intuitive, we hypothesize, this pursuit of representational convergence can inadvertently cause “cultural erasure”, the functional loss of providing culturally-situated responses that should diverge based on the query language. In this work, we systematically analyze this trade-off by introducing a holistic evaluation framework, the transfer-localization plane, which quantifies both desirable knowledge transfer and undesirable cultural erasure. Using this framework, we re-evaluate recent CLA approaches and find that they consistently improve factual transfer at the direct cost of cultural localization across all six languages studied. Our investigation into the internal representations of these models reveals a key insight: universal factual transfer and culturally-specific knowledge are optimally steerable at different model layers. Based on this finding, we propose Surgical Steering, a novel inference-time method that disentangles these two objectives. By applying targeted activation steering to distinct layers, our approach achieves a better balance between the two competing dimensions, effectively overcoming the limitations of current alignment techniques.
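下面给出激活引导(activation steering)的最小示意:在指定解码层的输出隐藏状态上叠加一个导向向量。假设模型为 HuggingFace 风格的 decoder(层位于 `model.model.layers`),`layer_idx`、`alpha` 以及导向向量的来源均为示例假设,论文中“哪一层控制事实迁移、哪一层控制文化定位”以其结论为准。

```python
# 示意:针对特定层的激活引导(非论文官方实现)。
import torch

def add_steering_hook(model, layer_idx: int, vec: torch.Tensor, alpha: float = 4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype)  # 在该层输出上叠加导向向量
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    # 假设为 HuggingFace 风格的 decoder 结构
    return model.model.layers[layer_idx].register_forward_hook(hook)

# 用法:handle = add_steering_hook(model, 20, culture_vec); ...; handle.remove()
```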
zh
[NLP-59] PORTool: Tool-Use LLM Training with Rewarded Tree
【速读】: 该论文旨在解决当前工具调用大型语言模型(Tool-use LLM)在静态数据集上训练时,因仅模仿通用工具调用流程而缺乏探索能力,导致在动态、演化环境中表现受限的问题。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的PORTool方法:通过生成多条工具调用轨迹(rollouts),构建具有共享前缀的树状结构;设计逐步奖励机制(step-wise rewards),依据每个步骤对正确答案和成功工具调用的贡献分配奖励,其中同一分叉点下的不同路径获得差异化奖励;最终结合分叉相对优势(fork-relative advantages)与轨迹相对优势(trajectory-relative advantages)共同优化模型,从而增强模型在复杂工具调用场景中的探索能力和决策效率。
链接: https://arxiv.org/abs/2510.26020
作者: Feijie Wu,Weiwu Zhu,Yuxiang Zhang,Soumya Chatterjee,Jiarong Zhu,Fan Mo,Rodin Luo,Jing Gao
机构: Purdue University (普渡大学); Apple (苹果公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
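下面示意“分叉相对优势”(fork-relative advantage)的计算:同一分叉下各分支的优势等于其回报减去兄弟分支的平均回报。数据结构与是否进一步做归一化均为示例假设,非论文官方实现。

```python
# 示意:树状 rollout 上的分叉相对优势计算。
from collections import defaultdict

def fork_relative_advantages(steps):
    """steps: [(fork_id, branch_id, reward)],返回每个 (fork, branch) 的优势。"""
    by_fork = defaultdict(list)
    for fork, branch, r in steps:
        by_fork[fork].append((branch, r))
    adv = {}
    for fork, branches in by_fork.items():
        mean_r = sum(r for _, r in branches) / len(branches)
        for branch, r in branches:
            adv[(fork, branch)] = r - mean_r  # 优于兄弟分支则为正
    return adv

print(fork_relative_advantages([("f0", "a", 1.0), ("f0", "b", 0.0), ("f0", "c", 0.5)]))
```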
zh
[NLP-60] CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
【速读】: 该论文旨在解决当前计算机视觉领域中对真实世界视觉异常(visual anomalies)识别与理解能力不足的问题,现有研究多局限于工业缺陷或人工合成的异常场景,难以反映现实世界的复杂性和多样性。其解决方案的关键在于提出首个面向真实世界视觉异常的基准测试集CAVE,该基准支持三个开放任务:异常描述、解释与论证,并提供细粒度标注,涵盖异常的视觉定位、表现形式、复杂度、严重程度及常见性等维度。这些标注基于认知科学中人类识别和处理异常的机制,为评估视觉-语言模型(VLMs)在异常检测与常识推理方面的能力提供了全面框架,从而推动该领域的研究进展。
链接: https://arxiv.org/abs/2510.26006
作者: Rishika Bhagwatkar,Syrielle Montariol,Angelika Romanou,Beatriz Borges,Irina Rish,Antoine Bosselut
机构: EPFL(洛桑联邦理工学院); MILA(蒙特利尔学习算法研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
zh
[NLP-61] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
【速读】: 该论文旨在解决小规模开源大语言模型(Large Language Models, LLMs)在多步推理任务中表现不佳的问题,具体表现为:基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)在正确解稀疏时难以采样有效轨迹,而监督微调(Supervised Fine-Tuning, SFT)则因逐token模仿导致过拟合长示范序列。解决方案的关键在于提出一种新的训练框架——监督强化学习(Supervised Reinforcement Learning, SRL),其将问题求解重构为生成一系列逻辑“动作”的过程,并通过在每个步骤中引入专家动作相似度作为平滑奖励信号,使模型在内部生成推理对话(reasoning monologue)后再执行动作。该机制不仅在所有轨迹均错误的情况下仍提供丰富的学习信号,还借助专家示范引导灵活推理,从而显著提升小模型对复杂推理任务的学习能力;此外,SRL预训练后再用RLVR精调可获得最优整体性能。
链接: https://arxiv.org/abs/2510.25992
作者: Yihe Deng,I-Hung Hsu,Jun Yan,Zifeng Wang,Rujun Han,Gufeng Zhang,Yanfei Chen,Wei Wang,Tomas Pfister,Chen-Yu Lee
机构: Google Cloud AI Research(谷歌云人工智能研究); Google Cloud(谷歌云); UCLA(加州大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical “actions”. SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model’s actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
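SRL 的核心是逐步相似度奖励。下面用字符级 `SequenceMatcher` 代替论文中的相似度度量给出最小示意(度量方式为示例假设):即使整条轨迹最终答案错误,每一步仍能获得平滑的学习信号。

```python
# 示意:SRL 的逐步奖励——模型动作与专家动作的相似度(非论文官方度量)。
from difflib import SequenceMatcher

def step_reward(model_action: str, expert_action: str) -> float:
    return SequenceMatcher(None, model_action, expert_action).ratio()

def trajectory_rewards(model_actions, expert_actions):
    # 逐步打分:全部步骤错误时也有非零的密集信号
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]

print(trajectory_rewards(["x = 3*y", "y = 2"], ["x = 3*y", "y = 4"]))
```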
zh
[NLP-62] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在仅需预填充(prefill)阶段的推理任务中,由于自注意力(self-attention)计算复杂度随序列长度呈二次增长而导致的性能瓶颈问题。解决方案的关键在于提出AttnCache框架,该框架基于注意力映射记忆数据库,利用高效的缓存与相似性搜索技术,在推理过程中检索并复用先前缓存的相似注意力映射,从而显著减少自注意力计算开销。实验表明,AttnCache在CPU和GPU上分别实现了平均1.2x/1.6x的端到端加速和2x/3x的注意力计算加速,且精度损失可忽略不计。
链接: https://arxiv.org/abs/2510.25979
作者: Dinghong Song(1),Yuan Feng(1),Yiwei Wang(1),Shangye Chen(1),Cyril Guyot(2),Filip Blagojevic(2),Hyeran Jeon(1),Pengfei Su(1),Dong Li(1) ((1) University of California, Merced, USA, (2) Western Digital Research, USA)
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:  10 pages, 6 figures, submitted to Ninth Annual Conference on Machine Learning and Systems (MLSys’26)
Abstract:Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
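下面是注意力图缓存与最近邻复用的最小示意:把注意力图展平成特征向量后做余弦相似度检索,相似度超过阈值 `tau` 即复用缓存、跳过重算。真实系统需按层/头组织缓存并使用高效索引,此处的数据结构与阈值均为示例假设。

```python
# 示意:AttnCache 式的注意力图缓存与相似性检索(非官方实现)。
import numpy as np

class AttnCache:
    def __init__(self, tau: float = 0.95):
        self.keys, self.maps, self.tau = [], [], tau

    def lookup(self, feat: np.ndarray):
        if not self.keys:
            return None
        sims = [feat @ k / (np.linalg.norm(feat) * np.linalg.norm(k))
                for k in self.keys]
        i = int(np.argmax(sims))
        # 命中则直接复用缓存的注意力图,跳过二次复杂度的注意力计算
        return self.maps[i] if sims[i] >= self.tau else None

    def insert(self, feat: np.ndarray, attn_map: np.ndarray):
        self.keys.append(feat); self.maps.append(attn_map)
```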
zh
[NLP-63] NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium EUROSYS’26
【速读】: 该论文旨在解决在Amazon Web Services(AWS)最新推出的AI加速器Trainium上高效执行大语言模型(Large Language Model, LLM)推理时所面临的性能瓶颈问题,尤其是由其脉动阵列(systolic array)架构和特殊数据布局要求带来的挑战。解决方案的关键在于设计了一种高性能矩阵乘法(matmul)计算核,通过定制化的内核融合(kernel fusion)技术和新颖的缓存策略,显著减少了软件管理内存层次结构中的数据移动,最大化了片上静态随机存取存储器(SRAM)带宽,并避免了昂贵的矩阵转置操作。实验表明,该方案在九个数据集和四个近期LLM上的端到端推理性能相比AWS官方实现平均提升1.66倍(最高达2.49倍)。
链接: https://arxiv.org/abs/2510.25977
作者: Dinghong Song(1),Jierui Xu(2),Weichu Yang(2),Pengfei Su(1),Dong Li(1) ((1) University of California, Merced, (2) University of Wisconsin, Madison)
机构: University of California, Merced (加州大学默塞德分校); University of Wisconsin, Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL)
备注:  12 pages, 8 figures, submitted to the Proceedings of the Twenty-First European Conference on Computer Systems (EuroSys’26)
Abstract:AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.
zh
[NLP-64] SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂数学推理任务中表现不佳的问题,尤其是由于基于自然语言的生成方式导致解题过程缺乏验证机制,从而产生未经验证且算术上不严谨的结果。现有提示策略如思维链(Chain of Thought)仍依赖不可靠的文本生成范式,无法实现确定性验证。其解决方案的关键在于提出SymCode——一个神经符号(neurosymbolic)框架,将数学问题求解重构为可验证代码生成任务,利用SymPy符号计算库实现确定性验证。该方法显著提升了在MATH-500和OlympiadBench等挑战性基准上的准确率,并使模型错误从隐蔽的逻辑谬误转变为透明的程序性错误,从而增强了AI在形式化领域中的准确性与可信度。
链接: https://arxiv.org/abs/2510.25975
作者: Sina Bagheri Nezhad,Yao Li,Ameeta Agrawal
机构: Portland State University (波特兰州立大学); ElastixAI (ElastixAI)
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:
Abstract:Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
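SymCode 的关键是把解题转写为可执行、可回代验证的 SymPy 代码。下面给出一个最小示例(题目与验证方式为示例,用于说明“确定性验证”的含义):

```python
# 示意:LLM 生成形如下述的 SymPy 代码,执行后对解做确定性回代验证。
import sympy as sp

x = sp.symbols("x")
eq = sp.Eq(x**2 - 5*x + 6, 0)
solutions = sp.solve(eq, x)          # [2, 3]

# 确定性验证:解回代后等式必须恒为真,否则判定生成的代码有误
assert all(eq.subs(x, s) for s in solutions)
print(solutions)
```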
zh
[NLP-65] Semantic Label Drift in Cross-Cultural Translation
【速读】: 该论文旨在解决机器翻译(Machine Translation, MT)在低资源语言中生成合成数据时,因文化差异导致语义标签漂移(label drift)的问题,进而影响下游任务的准确性与文化适配性。其解决方案的关键在于揭示并量化文化相似性对标签保真度的影响:研究发现,现代大型语言模型(Large Language Models, LLMs)虽具备文化知识编码能力,但若未考虑源语言与目标语言之间的文化相似性,反而会放大标签漂移现象;因此,提升MT系统在跨文化场景下的标签一致性需以文化对齐为前提,从而保障语义忠实性和应用安全性。
链接: https://arxiv.org/abs/2510.25967
作者: Mohsinul Kabir,Tasnim Ahmed,Md Mezbaur Rahman,Polydoros Giannouris,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学); Queen’s University (皇后大学); University of Illinois Chicago (芝加哥伊利诺伊大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Machine Translation (MT) is widely employed to address resource scarcity in low-resource languages by generating synthetic data from high-resource counterparts. While sentiment preservation in translation has long been studied, a critical but underexplored factor is the role of cultural alignment between source and target languages. In this paper, we hypothesize that semantic labels are drifted or altered during MT due to cultural divergence. Through a series of experiments across culturally sensitive and neutral domains, we establish three key findings: (1) MT systems, including modern Large Language Models (LLMs), induce label drift during translation, particularly in culturally sensitive domains; (2) unlike earlier statistical MT tools, LLMs encode cultural knowledge, and leveraging this knowledge can amplify label drift; and (3) cultural similarity or dissimilarity between source and target languages is a crucial determinant of label preservation. Our findings highlight that neglecting cultural factors in MT not only undermines label fidelity but also risks misinterpretation and cultural conflict in downstream applications.
zh
[NLP-66] Revisiting Multilingual Data Mixtures in Language Model Pretraining
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在多语言预训练过程中是否存在“多语言诅咒”(curse of multilinguality)的问题,即随着训练语言数量的增加,模型在单个语言上的性能是否会下降。研究表明,这种担忧并不必然成立。解决方案的关键在于:通过合理平衡多语言数据的分布,确保每种语言在预训练语料库中具有足够的词元(token)数量,从而在不牺牲任何语言性能的前提下提升模型的多语言能力;此外,使用英语作为枢纽语言(pivot language)可有效促进跨语言家族的泛化能力,而并非局限于特定语系内的语言选择能带来一致收益。
链接: https://arxiv.org/abs/2510.25947
作者: Negar Foroutan,Paul Teiletche,Ayush Kumar Tarun,Antoine Bosselut
机构: EPFL (École Polytechnique Fédérale de Lausanne)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:  Under Review
Abstract:The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant “curse of multilinguality” as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings.
zh
[NLP-67] RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)训练数据不可见时,如何有效识别和提取其记忆内容的问题。核心挑战在于模型可能隐式存储了训练数据中的敏感或版权内容,但缺乏直接访问训练集的途径。解决方案的关键是提出RECAP——一个基于代理(agentic)的反馈驱动流水线,通过迭代优化机制实现对目标文本的精准提取:首先由目标模型生成候选内容,再由辅助语言模型(secondary language model)进行比对并生成最小化修正提示(minimal correction hints),反馈至目标模型以引导后续生成;同时引入越狱模块(jailbreaking module)应对因对齐机制导致的拒绝响应,从而显著提升提取准确率。实验表明,该方法在EchoTrace基准上相较单次迭代策略有显著提升,如GPT-4.1在版权文本提取任务中ROUGE-L得分从0.38提升至0.47。
链接: https://arxiv.org/abs/2510.25941
作者: André V. Duarte,Xuying li,Bin Zeng,Arlindo L. Oliveira,Lei Li,Zhuo Li
机构: Carnegie Mellon University (卡内基梅隆大学); Instituto Superior Técnico/INESC-ID (里斯本理工学院/INESC-ID); Hydrox AI (Hydrox AI)
类目: Computation and Language (cs.CL)
备注:
Abstract:If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47 - a nearly 24% increase.
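RECAP 的反馈循环可以概括为“生成-比对-最小提示-再生成”。下面是该循环的骨架示意,其中 `target_llm`、`judge_llm` 为假设的调用接口,收敛判定方式亦为示例假设,非论文官方实现。

```python
# 示意:反馈驱动的记忆内容提取循环(接口均为假设)。
def recap_loop(target_llm, judge_llm, prompt: str, reference: str, max_iters: int = 5):
    attempt = target_llm(prompt)
    for _ in range(max_iters):
        # 辅助模型比对输出与参考段落,生成最小修正提示
        feedback = judge_llm(f"对比输出与参考段落,给出最小修正提示:\n"
                             f"输出:{attempt}\n参考:{reference}")
        if "无差异" in feedback:   # 收敛判定方式为示例假设
            break
        attempt = target_llm(f"{prompt}\n修正提示:{feedback}")
    return attempt
```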
zh
[NLP-68] FakeZero: Real-Time Privacy-Preserving Misinformation Detection for Facebook and X
【速读】: 该论文旨在解决社交媒体平台中虚假信息(misinformation)传播速度快、危害公共讨论的问题。其解决方案的关键在于提出 FakeZero,一个完全运行在客户端的跨平台浏览器扩展,能够在用户浏览 Facebook 和 X(原 Twitter)时实时标记不可靠内容。该方案的核心创新在于所有计算(包括 DOM 抽取、分词、Transformer 推理和 UI 渲染)均通过 Chromium 消息 API 在本地完成,确保用户数据不外泄;同时采用三阶段训练策略(基础微调、领域自适应训练结合焦点损失、对抗增强与后训练量化),使轻量级 DistilBERT-Quant 模型(67.6 MB)达到 97.1% 宏 F1 和 0.996 AUROC 的高精度检测性能,且延迟仅为约 103 ms,进一步优化后的 TinyBERT-Quant 版本(14.7 MB)仍保持 95.7% 宏 F1,延迟降至 40 ms,验证了在资源受限环境下实现高质量虚假新闻检测的可行性。
链接: https://arxiv.org/abs/2510.25932
作者: Soufiane Essahli,Oussama Sarsar,Imane Fouad,Anas Motii,Ahmed Bentajer
机构: Université Mohammed VI Polytechnique (穆罕默德六世理工大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:  Accepted for publication in the Proceedings of the 24th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2025) Privacy track, 11 pages, 8 figures
Abstract:Social platforms distribute information at unprecedented speed, which in turn accelerates the spread of misinformation and threatens public discourse. We present FakeZero, a fully client-side, cross-platform browser extension that flags unreliable posts on Facebook and X (formerly Twitter) while the user scrolls. All computation, DOM scraping, tokenisation, Transformer inference, and UI rendering run locally through the Chromium messaging API, so no personal data leaves the device. FakeZero employs a three-stage training curriculum: baseline fine-tuning and domain-adaptive training enhanced with focal loss, adversarial augmentation, and post-training quantisation. Evaluated on a dataset of 239,000 posts, the DistilBERT-Quant model (67.6 MB) reaches 97.1% macro-F1, 97.4% accuracy, and an AUROC of 0.996, with a median latency of approximately 103 ms on a commodity laptop. A memory-efficient TinyBERT-Quant variant retains 95.7% macro-F1 and 96.1% accuracy while shrinking the model to 14.7 MB and lowering latency to approximately 40 ms, showing that high-quality fake-news detection is feasible under tight resource budgets with only modest performance trade-offs. By providing inline credibility cues, the extension can serve as a valuable tool for policymakers seeking to curb the spread of misinformation across social networks. With user consent, FakeZero also opens the door for researchers to collect large-scale datasets of fake news in the wild, enabling deeper analysis and the development of more robust detection techniques.
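文中的 DistilBERT-Quant / TinyBERT-Quant 依赖训练后量化。下面示意用 PyTorch 动态量化把微调后的分类模型压缩为仅量化线性层权重的 int8 版本(模型名与保存路径为示例,非项目官方脚本):

```python
# 示意:对序列分类模型做训练后动态量化。
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)   # 实际应加载微调后的权重
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)  # 仅量化线性层权重
torch.save(qmodel.state_dict(), "fakezero-distilbert-quant.pt")
```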
zh
[NLP-69] Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
【速读】: 该论文旨在解决当前大语言模型(LLM)在语义标注数据集构建中的应用缺乏系统性评估的问题,尤其是从自然语言处理(NLP)的视角出发,对自动与半自动标注方法在标注效率、覆盖度和多样性方面的表现尚不明确。其解决方案的关键在于通过对比三种实验设置——人工标注、全自动标注和半自动标注——来量化评估基于LLM的语义角色标注器在FrameNet类语义标注任务中的性能差异,结果表明半自动标注能够在保持标注覆盖率的同时显著提升框架多样性,优于纯人工标注,而全自动标注在多数指标上表现较差,仅在标注时间上具有优势。
链接: https://arxiv.org/abs/2510.25904
作者: Frederico Belcavello,Ely Matos,Arthur Lorenzi,Lisandra Bonoto,Lívia Ruiz,Luiz Fernando Pereira,Victor Herbst,Yulla Navarro,Helen de Andrade Abreu,Lívia Dutra,Tiago Timponi Torrent
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and datasets is a reality. Nonetheless, despite the potential of such tools for linguistic research, a comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reducing this gap by reporting on an extensive evaluation of the (semi-)automatization of FrameNet-like semantic annotation by the use of an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage and diversity in three experimental settings: manual, automatic and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage, when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics, except for annotation time.
zh
[NLP-70] Approximating Human Preferences Using a Multi-Judge Learned System
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的评判器(judge)在对齐人类偏好时面临的挑战,包括校准困难、评分规则敏感性(rubric sensitivity)、偏见(bias)和不稳定性等问题。这些问题限制了其在强化学习中的人类反馈(Reinforcement Learning from Human Feedback, RLHF)中构建可靠奖励模型以及在智能路由系统中为用户查询选择最优模型的应用效果。解决方案的关键在于提出一种基于人格特征(persona-based)的偏好建模框架,通过学习聚合多个受评分规则条件约束的评判器输出,实现多样化偏好标签的规模化合成;并设计两种不同的聚合器实现方式:广义加性模型(Generalized Additive Model, GAM)和多层感知机(Multi-Layer Perceptron, MLP),从而提升评判系统的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2510.25884
作者: Eitán Sprejer,Fernando Avalos,Augusto Bernardi,Jose Pedro Brito de Azevedo Faustino,Jacob Haimes,Narmeen Fatimah Oozeer
机构: BAISH | UBA | Apart Research; University of São Paulo | Apart Research; Apart Research; Dovetail Research | Apart Research; Apart Research; Martian
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judges biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
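论文两种聚合器实现之一是 MLP。下面给出其最小示意:输入为若干评分规则条件化评判器的分数,输出为对(人格化)人类偏好分数的回归预测;网络结构与维度均为示例假设。

```python
# 示意:学习聚合多个评判器输出的 MLP 聚合器(非论文官方结构)。
import torch
import torch.nn as nn

class JudgeAggregator(nn.Module):
    def __init__(self, n_judges: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_judges, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, judge_scores: torch.Tensor) -> torch.Tensor:
        return self.net(judge_scores).squeeze(-1)

agg = JudgeAggregator(n_judges=5)
pred = agg(torch.rand(8, 5))            # 8 条样本、5 个评判器的分数
loss = nn.functional.mse_loss(pred, torch.rand(8))  # 回归到偏好标签
```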
zh
[NLP-71] Through the Judges Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在主观评估任务中可靠性不足的问题,尤其当人类判断涉及超出标注标签的细微推理时。其核心挑战在于如何获取和构建人类决策背后的“思维轨迹”(thinking traces),这些轨迹虽具高信息价值但难以大规模收集。解决方案的关键在于提出一种人-LLM协作框架,通过简单的拒绝采样(rejection sampling)方法从仅含标签的注释数据中推断出思维轨迹,并将其应用于两个互补任务:一是微调开放源代码LLM评分器以提升其与人类判断的一致性;二是生成更清晰的标注指南,用于优化专有LLM评分器的性能。实验表明,该方法显著提升了LLM与人类之间的评估一致性,并增强了不同LLM模型间的共识,证明了LLM可作为人类思维轨迹的有效代理,从而将标签数据扩展为富含思维轨迹的增强资源,提升LLM评分器的可靠性。
链接: https://arxiv.org/abs/2510.25860
作者: Xingjian Zhang,Tianhong Gao,Suliang Jin,Tianhao Wang,Teng Ye,Eytan Adar,Qiaozhu Mei
机构: University of Michigan (密歇根大学); University of California, San Diego (加州大学圣地亚哥分校); University of Minnesota, Twin Cities (明尼苏达大学双城分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
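文中的拒绝采样可以概括为:反复采样带推理过程的输出,只保留结论与人工标签一致的轨迹。下面是最小示意,`llm` 为假设的调用接口,标签解析规则为示例假设。

```python
# 示意:用拒绝采样从"仅有标签"的标注中反推思维轨迹。
def infer_traces(llm, item: str, gold_label: str, n_samples: int = 16):
    kept = []
    for _ in range(n_samples):
        out = llm(f"先给出推理过程,再以『标签:X』结尾。\n待评样本:{item}")
        trace, _, label = out.rpartition("标签:")
        if label.strip() == gold_label:   # 拒绝采样:结论与标签不符则丢弃
            kept.append(trace.strip())
    return kept                           # 保留的轨迹可用于微调或提炼标注指南
```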
zh
[NLP-72] A Survey on Efficient Large Language Model Training: From Data-centric Perspectives ACL2025
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)后训练阶段面临的数据效率低下问题,具体表现为人工标注成本高和数据规模增长带来的边际收益递减。其解决方案的关键在于从数据中心视角系统性地梳理并分类数据高效后训练方法,提出一个涵盖数据选择、数据质量增强、合成数据生成、数据蒸馏与压缩以及自进化数据生态系统的五维分类体系,并总结各类别中的代表性方法,从而为提升数据利用效率提供结构化研究路径与未来方向。
链接: https://arxiv.org/abs/2510.25817
作者: Junyu Luo,Bohan Wu,Xiao Luo,Zhiping Xiao,Yiqiao Jin,Rong-Cheng Tu,Nan Yin,Yifan Wang,Jingyang Yuan,Wei Ju,Ming Zhang
机构: Peking University (北京大学); University of California, Los Angeles (加州大学洛杉矶分校); University of Washington (华盛顿大学); Georgia Institute of Technology (佐治亚理工学院); Nanyang Technological University (南洋理工大学); HKUST (香港科技大学); University of International Business and Economics (对外经济贸易大学)
类目: Computation and Language (cs.CL)
备注:  ACL 2025
Abstract:Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: this https URL
zh
[NLP-73] Beyond Long Context: When Semantics Matter More than Tokens
【速读】: 该论文旨在解决电子健康记录(Electronic Health Records, EHR)中临床文档以base64编码形式存储于FHIR DocumentReference资源时,导致语义问答困难的问题。传统向量数据库方法难以捕捉细微的临床关系,从而影响问答准确性。其解决方案的核心是提出一种基于实体感知的检索方法——临床实体增强检索(Clinical Entity Augmented Retrieval, CLEAR),通过识别和利用临床实体信息进行检索,显著提升了语义匹配精度与计算效率:在测试中,CLEAR相较于基于嵌入的检索方法将F1分数从0.86提升至0.90,并减少超过70%的token使用量;在长篇临床笔记(>65,000 tokens)上,其胜率高达75%,且平均语义相似度达0.878,同时比宽上下文处理节省78%的token。这表明,实体感知检索能够有效改善临床自然语言处理中的准确性和效率。
链接: https://arxiv.org/abs/2510.25816
作者: Tarun Kumar Chawdhury,Jon D. Duke
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:  12 pages, 5 figures
Abstract:Electronic Health Records (EHR) store clinical documentation as base64-encoded attachments in FHIR DocumentReference resources, which makes semantic question answering difficult. Traditional vector database methods often miss nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR) method, introduced by Lopez et al. 2025, uses entity-aware retrieval and achieves improved performance with an F1 score of 0.90 versus 0.86 for embedding-based retrieval, while using over 70 percent fewer tokens. We developed a Clinical Notes QA Evaluation Platform to validate CLEAR against zero-shot large-context inference and traditional chunk-based retrieval-augmented generation. The platform was tested on 12 clinical notes ranging from 10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a 58.3 percent win rate, an average semantic similarity of 0.878, and used 78 percent fewer tokens than wide-context processing. The largest performance gains occurred on long notes, with a 75 percent win rate for documents exceeding 65,000 tokens. These findings confirm that entity-aware retrieval improves both efficiency and accuracy in clinical natural language processing. The evaluation framework provides a reusable and transparent benchmark for assessing clinical question answering systems where semantic precision and computational efficiency are critical.
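实体感知检索的骨架很简单:先抽取问题中的临床实体,再只召回包含这些实体的笔记片段。下面的示意用小词表匹配代替真实的临床 NER(CLEAR 的实际实体抽取与打分以原论文为准):

```python
# 示意:实体感知检索的最小骨架(词表与示例数据均为假设)。
CLINICAL_TERMS = {"hemoglobin", "metformin", "a1c", "creatinine"}

def extract_entities(question: str) -> set[str]:
    return {w.strip("?.,").lower() for w in question.split()} & CLINICAL_TERMS

def retrieve(question: str, note_chunks: list[str]) -> list[str]:
    ents = extract_entities(question)
    # 只召回包含目标实体的片段,显著减少送入 LLM 的 token 数
    return [c for c in note_chunks if any(e in c.lower() for e in ents)]

chunks = ["Hemoglobin A1c 8.2% on metformin.", "Patient denies chest pain."]
print(retrieve("What was the latest A1c?", chunks))
```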
zh
[NLP-74] Ideology-Based LLMs for Content Moderation
【速读】: 该论文旨在解决生成式 AI(Generative AI)在内容审核系统中因角色设定(persona)引入的潜在意识形态偏倚问题,特别是其对有害内容分类一致性与公平性的影响。解决方案的关键在于通过系统性实验揭示:尽管整体分类准确率未显著变化,但不同意识形态倾向的角色设定会显著改变模型对内容的判断倾向,且大模型更易与同意识形态角色保持一致,从而加剧跨群体分歧;进一步的政治针对性任务验证了角色不仅强化内部一致性,还表现出对对立观点的有害性弱化倾向,表明角色设定可能隐蔽地放大政党立场,削弱AI系统的中立性。
链接: https://arxiv.org/abs/2510.25805
作者: Stefano Civelli,Pietro Bernardelle,Nardiena A. Pratama,Gianluca Demartini
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model “views” input can subtly shape its judgments. Further agreement analyses highlight that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To show this effect more directly, we conducted an additional study on a politically targeted task, which confirmed that personas not only behave more coherently within their own ideology but also exhibit a tendency to defend their perspective while downplaying harmfulness in opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.
zh
[NLP-75] Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
【速读】: 该论文旨在解决长文本预训练数据中存在大量缺乏有意义长距离依赖关系的问题,这类数据在训练过程中效率低下,限制了长上下文语言模型(Long-context language models)性能的提升。解决方案的关键在于提出 LongFilter 框架,通过对比模型在短上下文和长上下文设置下的预测差异,量化扩展上下文带来的信息增益,从而识别出真正需要利用长距离依赖关系的数据样本,实现高质量、针对性的数据筛选与优化训练。
链接: https://arxiv.org/abs/2510.25804
作者: Haoran Deng,Yingyu Lin,Zhenghao Lin,Xiao Liu,Yizhou Sun,Yi-An Ma,Yeyun Gong
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
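LongFilter 的打分思想可以写成:同一目标片段分别在短上下文与长上下文条件下计算语言模型损失,二者之差即“长程信息增益”。下面是最小示意,假设 `model` 为 HuggingFace 风格的因果语言模型,筛选阈值由使用者自定(示例假设)。

```python
# 示意:对比长/短上下文下的 NLL,估计目标片段对长距离依赖的需求。
import torch

@torch.no_grad()
def long_range_gain(model, long_ids, short_ids, target_ids):
    def nll(ctx):
        ids = torch.cat([ctx, target_ids], dim=-1)
        out = model(ids.unsqueeze(0))
        # 取预测 target 各 token 的 logits 位置
        logits = out.logits[0, -target_ids.numel() - 1:-1]
        return torch.nn.functional.cross_entropy(logits, target_ids)
    # 差值越大,说明该片段越依赖长上下文,越值得用于长上下文预训练
    return (nll(short_ids) - nll(long_ids)).item()
```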
zh
[NLP-76] Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start
【速读】: 该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的多模态大语言模型(Multimodal Large Language Models, MLLMs)在冷启动阶段因依赖监督微调(Supervised Fine-Tuning, SFT)所导致的泛化能力弱、过拟合于指令风格以及下游RL性能受限的问题。其核心解决方案是提出SPECS框架——一种自蒸馏的偏好训练冷启动方法,关键在于通过自蒸馏生成内省式偏好数据对,实现多模态学习的解耦:首先聚焦于浅层可迁移的表面形式特征(如格式、结构、风格)进行偏好训练,避免内容记忆;随后将模型交由具有可验证奖励机制的RL进一步优化深层推理能力,从而显著提升模型在分布内和分布外任务上的表现与训练稳定性。
链接: https://arxiv.org/abs/2510.25801
作者: Kun Chen,Peng Shi,Haibo Qiu,Zhixiong Zeng,Siqi Yang,Wenji Mao,Lin Ma
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Meituan (美团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:  Project Page: this https URL
Abstract:Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of “MLLM-r1” approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts a reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) it generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) it performs preference-based training that focuses on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) it hands off to RL with verifiable rewards for deep reasoning. Experimental results across multiple multimodal benchmarks show that our decoupled learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution “stuckness,” improving exploration, stabilizing training, and raising the performance ceiling.
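文中“偏好训练(如 DPO)在冷启动中泛化更好”的结论对应如下的标准 DPO 目标。此处仅为通用 DPO 损失的最小示意,SPECS 的自蒸馏偏好对构造与具体训练细节以论文为准。

```python
# 示意:标准 DPO 损失(非 SPECS 官方实现)。
import torch
import torch.nn.functional as F

def dpo_loss(logp_c, logp_r, ref_c, ref_r, beta: float = 0.1):
    """logp_*: 策略模型对偏好/拒绝回答的序列对数似然;ref_*: 参考模型对应值。"""
    logits = beta * ((logp_c - ref_c) - (logp_r - ref_r))
    return -F.logsigmoid(logits).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.8]), torch.tensor([-14.9]))
```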
zh
[NLP-77] LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection
【速读】: 该论文旨在解决多目标决策中人类专家难以从大量选项中选择最优方案的问题,其核心瓶颈在于复杂且隐含的偏好难以形式化表达。解决方案的关键在于提出一种名为LISTEN的框架,利用大语言模型(Large Language Model, LLM)作为零样本偏好预言机(zero-shot preference oracle),仅通过专家用自然语言描述的高层次优先级即可引导决策过程。为适配LLM在上下文窗口和推理成本上的限制,作者设计了两种迭代算法:LISTEN-U基于参数化效用函数的逐步优化,适用于偏好可被参数对齐的情形;LISTEN-T则采用非参数化的锦标赛式选择策略,在小批量解中进行比较,展现出更强的鲁棒性。该方法显著降低了传统偏好获取的认知负担,为直接以自然语言驱动复杂多目标决策提供了新路径。
链接: https://arxiv.org/abs/2510.25799
作者: Adam S. Jovine,Tinghan Ye,Francis Bahk,Jingjing Wang,David B. Shmoys,Peter I. Frazier
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert’s high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
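LISTEN-T 的锦标赛式选择可以写成逐轮小组淘汰。下面是骨架示意,`llm_prefer(priorities, group)` 代表“按自然语言优先级从小组中选出最优者”的一次 LLM 调用,为假设接口;批大小为示例。

```python
# 示意:LISTEN-T 风格的锦标赛选择(接口为假设)。
def tournament_select(llm_prefer, candidates, priorities: str, batch: int = 4):
    pool = list(candidates)
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool), batch):
            group = pool[i:i + batch]
            # LLM 依据自然语言描述的优先级,从小组中挑出最优者
            winner = llm_prefer(priorities, group)
            nxt.append(winner)
        pool = nxt          # 逐轮淘汰,直到只剩一个候选
    return pool[0]
```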
zh
[NLP-78] MemEIC: A Step Toward Continual and Compositional Knowledge Editing NEURIPS2025
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在持续更新知识时存在的两个核心问题:一是现有知识编辑技术多局限于单一模态(视觉或语言)的独立编辑,忽视了LVLM固有的多模态特性;二是缺乏对多模态知识间协同关系的建模能力,导致编辑效果受限且难以维持历史编辑成果。解决方案的关键在于提出MemEIC方法,其核心创新包括:1)设计一个融合外部记忆与内部LoRA适配器的混合编辑架构,通过双外部记忆实现跨模态证据检索,并利用解耦的LoRA适配器分别更新视觉和文本模态参数;2)引入类脑知识连接器(brain-inspired knowledge connector),在需要时激活以进行多模态信息整合,支持顺序化的组合式知识编辑。该方案有效提升了复杂多模态任务的表现并实现了对先前编辑结果的稳定保留,为LVLM的持续组合式知识编辑(Continual and Compositional Knowledge Editing, CCKE)设立了新基准。
链接: https://arxiv.org/abs/2510.25798
作者: Jin Seong,Jiyun Park,Wencke Liermann,Hongseok Choi,Yoonji Nam,Hyun Kim,Soojong Lim,Namhoon Lee
机构: Electronics and Telecommunications Research Institute, Republic of Korea(韩国电子通信研究院); POSTECH(浦项科技大学); Sungkyunkwan University(成均馆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:  NeurIPS 2025, 38 pages, 8 figures
Abstract:The dynamic nature of information necessitates continuously updating large vision-language models (LVLMs). While recent knowledge editing techniques hint at promising directions, they often focus on editing a single modality (vision or language) in isolation. This prevalent practice neglects the inherent multimodality of LVLMs and the continuous nature of knowledge updates, potentially leading to suboptimal editing outcomes when considering the interplay between modalities and the need for ongoing knowledge refinement. To address these limitations, we propose MemEIC, a novel method for Continual and Compositional Knowledge Editing (CCKE) in LVLMs. MemEIC enables compositional editing of both visual and textual knowledge sequentially. Our approach employs a hybrid external-internal editor featuring a dual external memory for cross-modal evidence retrieval and dual LoRA adapters that facilitate disentangled parameter updates for each modality. A key component is a brain-inspired knowledge connector, activated selectively for compositional reasoning, that integrates information across different modalities. Experiments demonstrate that MemEIC significantly improves performance on complex multimodal questions and effectively preserves prior edits, setting a new benchmark for CCKE in LVLMs.
zh
[NLP-79] Enhancing Underwater Object Detection through Spatio-Temporal Analysis and Spatial Attention Networks
【速读】: 该论文旨在解决水下复杂环境中目标检测精度低的问题,特别是在动态海洋场景中因突发运动、部分遮挡和渐进式移动导致的检测可靠性不足。解决方案的关键在于引入时空建模与空间注意力机制:首先通过改进YOLOv5结构以增强时间维度信息(Temporal-enhanced YOLOv5, T-YOLOv5),提升对动态目标的感知能力;随后在T-YOLOv5基础上集成卷积块注意力模块(Convolutional Block Attention Module, CBAM),进一步优化特征表示,使模型能聚焦于关键空间区域。实验表明,T-YOLOv5相比标准YOLOv5将mAP@50-95从0.563提升至0.813,加入CBAM后保持相近性能(0.811),验证了时空建模与注意力机制协同作用对复杂水下场景检测精度的有效提升。
链接: https://arxiv.org/abs/2510.25797
作者: Sai Likhith Karri,Ansh Saxena
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
备注:
Abstract:This study examines the effectiveness of spatio-temporal modeling and the integration of spatial attention mechanisms in deep learning models for underwater object detection. Specifically, in the first phase, the performance of the temporal-enhanced YOLOv5 variant T-YOLOv5 is evaluated in comparison with the standard YOLOv5. In the second phase, an augmented version of T-YOLOv5 is developed through the addition of a Convolutional Block Attention Module (CBAM). By examining the effectiveness of the pre-existing YOLOv5 and T-YOLOv5 models and of the newly developed T-YOLOv5 with CBAM, the research highlights how temporal modeling improves detection accuracy in dynamic marine environments, particularly under conditions of sudden movements, partial occlusions, and gradual motion. The testing results showed that YOLOv5 achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM outperformed it with mAP@50-95 scores of 0.813 and 0.811, respectively, highlighting their superior accuracy and generalization in detecting complex objects. The findings demonstrate that T-YOLOv5 significantly enhances detection reliability compared to the standard model, while T-YOLOv5 with CBAM further improves performance in challenging scenarios, although there is a loss of accuracy when it comes to simpler scenarios.
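文中加入的 CBAM 由通道注意力与空间注意力串联而成。下面按 CBAM 原论文结构给出一个可直接插入 PyTorch 骨干网络的最小实现(通道数、压缩比 r 与卷积核大小 k 为常用默认值,插入位置为示例假设):

```python
# 示意:CBAM(通道注意力 + 空间注意力)模块。
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, c: int, r: int = 16, k: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))
        self.spatial = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # 通道注意力:平均池化与最大池化共享同一个 MLP
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # 空间注意力:沿通道维取平均/最大后卷积
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```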
zh
[NLP-80] BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
【速读】: 该论文旨在解决机制可解释性中的电路发现(circuit discovery)问题,即识别模型中执行特定任务的组件部分。其关键解决方案包括三项改进:首先,采用自助法(bootstrapping)识别具有稳定归因分数(attribution scores)的边;其次,引入基于比例的筛选策略,优先选择正向得分较高的边,在性能与忠实性之间取得平衡;最后,将传统的贪心选择策略替换为整数线性规划(integer linear programming)公式,从而获得更忠实且高效的电路结构。
链接: https://arxiv.org/abs/2510.25786
作者: Yaniv Nikankin,Dana Arad,Itay Itzhak,Anja Reusch,Adi Simhi,Gal Kesten-Pomeranz,Yonatan Belinkov
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: this https URL.
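把“选边”从贪心改写成整数线性规划,骨架如下:在边数预算内最大化归因分数之和。真实任务还需加入连通性等结构约束,这里只示意目标与预算约束;使用 pulp 库求解(库的选用为示例假设)。

```python
# 示意:ILP 形式的电路边选择(非论文官方公式)。
import pulp

def select_edges(scores: dict, budget: int):
    prob = pulp.LpProblem("circuit_edges", pulp.LpMaximize)
    x = {e: pulp.LpVariable(f"x_{i}", cat="Binary") for i, e in enumerate(scores)}
    prob += pulp.lpSum(scores[e] * x[e] for e in scores)   # 目标:总归因分数
    prob += pulp.lpSum(x.values()) <= budget               # 约束:边数预算
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [e for e in scores if x[e].value() == 1]

print(select_edges({("a", "b"): 0.9, ("b", "c"): 0.4, ("a", "c"): -0.2}, budget=2))
```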
zh
[NLP-81] zFLoRA: Zero-Latency Fused Low-Rank Adapters
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在部署多任务适配器(task-specific adapters)时,尽管适配器参数量极少(通常不足基础模型的1%),却在推理阶段引入显著计算开销的问题(最高可达基础模型的2.5倍)。解决方案的关键在于提出一种零延迟融合低秩适配器(zero-latency fused low-rank adapter, zFLoRA),通过结构优化实现与基础模型的高效融合,在不增加或仅引入可忽略延迟的前提下完成多任务适配。实验表明,zFLoRA在1B、3B和7B规模的LLM上均优于主流监督微调基准(包括LoRA和全量微调),并在NPU(Samsung Galaxy S25+)与GPU(NVIDIA H100)平台上验证了其零至 negligible 的延迟优势。
链接: https://arxiv.org/abs/2510.25784
作者: Dhananjaya Gowda,Seoha Song,Harshith Goka,Junhyun Lee
机构: Samsung Research (三星研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with this apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant during inference time (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against popular supervised fine-tuning benchmarks, including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three different categories, namely commonsense reasoning, math reasoning, and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) as well as GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.
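zFLoRA 的“零延迟”方向可以从标准 LoRA 权重融合来理解:把低秩增量直接并入基座权重后,推理时不再有 adapter 分支带来的额外算子。下面示意标准融合 W' = W + (alpha/r)·BA;zFLoRA 本身的融合结构以论文为准,此处仅为通用 LoRA 融合。

```python
# 示意:标准 LoRA 权重融合,融合后前向只剩一次矩阵乘。
import torch

def fuse_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
              alpha: float, r: int) -> torch.Tensor:
    return W + (alpha / r) * (B @ A)   # B: (out, r), A: (r, in)

W = torch.randn(4096, 4096)
A, B = torch.randn(16, 4096), torch.randn(4096, 16)
W_fused = fuse_lora(W, A, B, alpha=32.0, r=16)
```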
zh
[NLP-82] LASTIST: LArge-Scale Target-Independent STance dataset
【速读】: 该论文旨在解决当前立场检测(stance detection)研究中存在两个关键问题:一是现有方法主要聚焦于目标依赖型立场检测任务,缺乏对目标无关型立场检测的系统性支持;二是主流基准数据集多基于英文语料,难以支撑低资源语言(如韩语)的模型开发。解决方案的关键在于构建了一个大规模、高质量的韩语目标无关立场检测数据集——LASTIST,该数据集包含563,299条标注韩语文本,来源于韩国两大政党的新闻稿,并针对多种立场检测任务(包括目标无关和跨时间演变分析)进行了设计与验证,从而填补了韩语环境下立场检测研究的数据空白并推动了相关模型的发展。
链接: https://arxiv.org/abs/2510.25783
作者: DongJae Kim,Yaejin Lee,Minsu Park,Eunil Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:  8 pages (two columned), 1 figure
Abstract:Stance detection has emerged as an area of research in the field of artificial intelligence. However, most research is currently centered on the target-dependent stance detection task, which is based on a person’s stance in favor of or against a specific target. Furthermore, most benchmark datasets are based on English, making it difficult to develop models in low-resource languages such as Korean, especially for an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of two major Korean political parties, the LASTIST dataset comprises 563,299 labeled Korean sentences. We provide a detailed description of how we collected and constructed the dataset and trained state-of-the-art deep learning and stance detection models. Our LASTIST dataset is designed for various tasks in stance detection, including target-independent stance detection and diachronic evolution stance detection. We deploy our dataset on this https URL.
zh
[NLP-83] Review Based Entity Ranking using Fuzzy Logic Algorithmic Approach: Analysis
【速读】: 该论文旨在解决传统基于词典的倾向性分析(Opinion Mining)方法在处理情感强度时缺乏细粒度分类的问题,即无法区分情感是“非常强烈”、“强烈”、“中等”、“微弱”还是“非常微弱”。其解决方案的关键在于引入一种融合模糊逻辑(Fuzzy Logic)与句法依存关系解析(Syntactic Dependency Resolution)的方法,将评论中的情感词(如副词、形容词、名词和动词)按语义粒度划分为五个等级,并结合目标属性(Aspect)进行评分建模,从而实现对实体在不同属性上的情感倾向及其强度的精细化量化。
链接: https://arxiv.org/abs/2510.25778
作者: Pratik N. Kalamkar,Anupama G. Phakatkar
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:  10 pages, 3 figures, International Journal Of Engineering And Computer Science ISSN:2319-7242
Abstract:Opinion mining, also called sentiment analysis, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. The holistic lexicon-based approach does not consider the strength of each opinion, i.e., whether the opinion is very strongly, strongly, moderately, weakly, or very weakly negative (or positive). In this paper, we propose an approach to rank entities based on the orientation and strength of the entity reviews and users’ queries by classifying them into granularity levels (i.e., very weak, weak, moderate, strong, and very strong) by combining opinion words (i.e., adverbs, adjectives, nouns, and verbs) that relate to the aspect of interest of a certain product. We use a fuzzy logic algorithmic approach to classify opinion words into the different categories and syntactic dependency resolution to find relations for the desired aspect words. Opinion words related to certain aspects of interest are used to compute the entity score for that aspect in the review.
zh
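下面用一个示意性草图说明"把情感强度分到五个粒度等级"的模糊逻辑做法:对归一化分数应用三角隶属函数,取隶属度最大的等级。隶属函数的形状与区间均为假设,论文实际采用的参数可能不同。

```python
# 示意性草图:三角隶属函数把情感强度分数映射到五个粒度等级(参数为假设)
import numpy as np

def triangular(x, a, b, c):
    """经典三角隶属函数,峰值在 b。"""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

LEVELS = {
    "very weak":   (0.0, 0.0, 0.25),
    "weak":        (0.0, 0.25, 0.5),
    "moderate":    (0.25, 0.5, 0.75),
    "strong":      (0.5, 0.75, 1.0),
    "very strong": (0.75, 1.0, 1.0),
}

def classify_strength(score: float) -> str:
    """score ∈ [0,1] 为某方面情感词的归一化强度。"""
    memberships = {name: triangular(score, *abc) for name, abc in LEVELS.items()}
    return max(memberships, key=memberships.get)

for s in (0.1, 0.4, 0.8, 0.97):
    print(s, "->", classify_strength(s))
```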
[NLP-84] StreetMath: Study of LLMs' Approximation Behaviors
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在非精确、快速场景下的近似数学推理能力被严重忽视的问题,尤其是针对非自回归解码器架构模型的研究匮乏。其核心贡献在于提出了StreetMath基准,用于评估模型在真实世界近似计算情境中的表现,并通过机制可解释性技术深入分析模型内部计算状态。关键解决方案包括:1)构建贴近现实的近似数学任务数据集;2)对比多种LLM架构在近似与精确运算上的行为差异;3)揭示近似与精确算术操作依赖于不同的神经组件,且模型倾向于执行精确计算或调用外部工具而非采用人类式的“认知吝啬”策略(cognitive miserliness)。
链接: https://arxiv.org/abs/2510.25776
作者: Chiung-Yi Tseng,Somshubhra Roy,Maisha Thasin,Danyang Zhang,Blessing Effiong
机构: LuxMuse AI; North Carolina State University (北卡罗来纳州立大学); University of Waterloo (滑铁卢大学); Vokram Group; Saint Louis University (圣路易斯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on precise arithmetic operations in autoregressive architectures. However, their ability to perform approximate reasoning in informal, fast-paced mathematical operations has received far less attention, especially among non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models’ approximation abilities under real-world approximation scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values or invoke external tools even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic operations rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open-source our work at this https URL
zh
计算机视觉
[CV-0] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
【速读】:该论文旨在解决现有2D lifting方法在生成3D场景时仅关注外观重建而忽略几何、材质等内在属性感知的问题,从而难以支持物理基础渲染(PBR)、再光照(relighting)和仿真等下游任务。其解决方案的关键在于提出OmniX框架,通过轻量高效的跨模态适配器结构,重新利用2D生成模型对全景图进行几何、纹理及PBR材质的联合感知,实现从2D全景图像到图形就绪(graphics-ready)3D场景的统一生成,显著提升了3D环境的物理真实性和可用性。
链接: https://arxiv.org/abs/2510.26800
作者: Yukun Huang,Jiwen Yu,Yanning Zhou,Jianan Wang,Xintao Wang,Pengfei Wan,Xihui Liu
机构: University of Hong Kong (香港大学); Kuaishou Technology (快手科技); Tencent (腾讯); Astribot (Astribot)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:  Project page: this https URL
Abstract:There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
zh
[CV-1] Masked Diffusion Captioning for Visual Feature Learning EMNLP2025
【速读】:该论文旨在解决如何通过自监督学习方式高效地提取视觉特征,以提升下游视觉任务的性能。传统方法如自回归式图像描述(autoregressive captioning)依赖于文本序列中token的位置信息来传递视觉学习信号,导致对辅助目标(auxiliary objectives)的强依赖。本文提出的关键解决方案是采用掩码扩散描述(Masked Diffusion Captioning, MDC),即利用一个基于视觉特征条件化的掩码扩散语言模型,在训练时随机掩码图像-文本对中的文本标记,并让解码器重建原始文本。该方法的核心优势在于其视觉学习信号不依赖于token位置,从而减少了对复杂辅助目标的依赖,同时在多种学术规模模型和数据集上的线性探测实验表明,所学视觉特征在性能上可与自回归和对比学习方法相媲美。
链接: https://arxiv.org/abs/2510.26799
作者: Chao Feng,Zihao Wei,Andrew Owens
机构: University of Michigan (密歇根大学); Cornell University (康奈尔大学); University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  EMNLP 2025 (Findings). Project page: this https URL
Abstract:We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token’s position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
zh
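以下是掩码扩散描述(MDC)单个训练步的示意性草图:按每个样本随机的比例掩码 caption token,再以视觉特征为条件、只在被掩码位置计算重建损失。decoder 的接口与词表大小均为假设。

```python
# 示意性草图:MDC 的单个训练步(decoder 接口与词表大小均为假设)
import torch
import torch.nn.functional as F

def mdc_loss(decoder, visual_feats, tokens, mask_id):
    """tokens: (B, L) 的 caption token;visual_feats: (B, N, D) 的图像特征。"""
    B, L = tokens.shape
    ratio = torch.rand(B, 1, device=tokens.device).clamp(min=0.15)  # 每样本随机掩码比例
    mask = torch.rand(B, L, device=tokens.device) < ratio
    if not mask.any():
        mask[..., 0] = True                      # 保底,避免空损失
    inputs = tokens.masked_fill(mask, mask_id)
    logits = decoder(inputs, visual_feats)       # (B, L, V),假设的解码器接口
    return F.cross_entropy(logits[mask], tokens[mask])  # 只监督被掩码的位置

# 玩具演示:用输出随机 logits 的假解码器验证形状与流程
dummy = lambda inp, vis: torch.randn(inp.shape[0], inp.shape[1], 1000)
toks = torch.randint(0, 999, (2, 16))
print(mdc_loss(dummy, torch.randn(2, 49, 768), toks, mask_id=999).item())
```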
[CV-2] SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
【速读】:该论文旨在解决从非结构化视频中合成时空4D内容时对昂贵3D标注(如相机位姿)的依赖问题。现有方法通常需手动标注相机位姿或通过“轨迹到轨迹”范式建模,导致在真实场景中难以泛化且易混淆相机运动与场景动态。其解决方案的关键在于提出一种无位姿(pose-free)的“轨迹到相机”框架SEE4D:将显式的相机轨迹预测替换为固定虚拟相机集合的渲染,并训练视图条件的视频修复模型(view-conditional video inpainting),以学习鲁棒的几何先验并填补不同虚拟视角下的遮挡区域,从而解耦相机控制与场景建模;进一步设计基于虚拟相机样条曲线的时空自回归推理流程,在保证每步计算复杂度受限的前提下实现连贯的视频扩展与重建。
链接: https://arxiv.org/abs/2510.26796
作者: Dongyue Lu,Ao Liang,Tianxin Huang,Xiao Fu,Yuyang Zhao,Baorui Ma,Liang Pan,Wei Yin,Lingdong Kong,Wei Tsang Ooi,Ziwei Liu
机构: National University of Singapore (新加坡国立大学); The University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学); Zhejiang University (浙江大学); Tsinghua University (清华大学); Shanghai AI Laboratory (上海人工智能实验室); Horizon Robotics ( horizon robotics); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:  26 pages; 21 figures; 3 tables; project page: this https URL
Abstract:Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
zh
[CV-3] Scaling Image Geo-Localization to Continent Level NEURIPS2025
【速读】:该论文旨在解决全球尺度下图像的精确定位问题,即如何在覆盖大陆级地理范围的场景中实现细粒度的地理定位(精度达200米以内)。传统图像检索方法因数据量庞大(如1亿张图像)且覆盖不足而效率低下,而现有可扩展方案通常只能提供粗粒度结果(误差超过10公里),或受限于地面与航空影像之间的域差异(domain gap),且多局限于小区域研究。解决方案的关键在于提出一种混合方法:在训练阶段引入代理分类任务(proxy classification task)以学习蕴含精确位置信息的丰富特征表示,并将这些学习到的原型(prototypes)与航空影像嵌入(aerial imagery embeddings)相结合,从而增强对地面数据稀疏性的鲁棒性,最终实现跨多国范围的直接、细粒度图像检索。
链接: https://arxiv.org/abs/2510.26795
作者: Philipp Lindenberger,Paul-Edouard Sarlin,Jan Hosang,Matteo Balice,Marc Pollefeys,Simon Lynen,Eduard Trulls
机构: ETH Zurich (苏黎世联邦理工学院); Google (谷歌); Politecnico di Milano (米兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:  NeurIPS 2025
Abstract:Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize more than 68% of queries to within 200 m on a dataset covering a large part of Europe. The code is publicly available at this https URL.
zh
[CV-4] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
【速读】:该论文旨在解决3D人体动作生成(MoGen)模型在泛化能力上的根本瓶颈问题,尽管当前模型在标准基准测试中取得进展,但其对多样化场景和语义指令的适应性仍不足。解决方案的关键在于系统性地将视频生成(ViGen)领域的知识迁移至MoGen,涵盖数据、建模与评估三个核心维度:首先构建了包含228,000个高质量动作样本的ViMoGen-228K数据集,融合光学动捕数据、网络视频语义标注及先进ViGen模型合成样本;其次提出基于流匹配的扩散Transformer模型ViMoGen,通过门控多模态条件机制统一动捕数据与视频生成先验;最后设计MBench多层次评测基准,实现对动作质量、提示保真度和泛化能力的精细评估。该框架显著提升了模型性能,在自动与人工评价中均优于现有方法。
链接: https://arxiv.org/abs/2510.26794
作者: Jing Lin,Ruisi Wang,Junzhe Lu,Ziqi Huang,Guorui Song,Ailing Zeng,Xian Liu,Chen Wei,Wanqi Yin,Qingping Sun,Zhongang Cai,Lei Yang,Ziwei Liu
机构: Nanyang Technological University (南洋理工大学); SenseTime Research (商汤科技研究院); Tsinghua University (清华大学); The Chinese University of Hong Kong (香港中文大学); NVIDIA Research (英伟达研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
zh
[CV-5] HEIR: Learning Graph-Based Motion Hierarchies
【速读】:该论文旨在解决现有运动建模方法依赖人工定义或启发式层次结构、固定运动基元而导致泛化能力受限的问题,尤其在复杂动态场景中难以自动学习可解释的运动层级关系。其解决方案的关键在于提出一种通用的分层运动建模方法,通过图神经网络(Graph Neural Networks, GNNs)将观测到的运动表示为基于图的层次结构,显式地将全局绝对运动分解为父节点继承的模式与局部残差运动,并将层次推断建模为一个可微的图学习问题,从而从数据中自动学习结构化且可解释的运动依赖关系。
链接: https://arxiv.org/abs/2510.26786
作者: Cheng Zheng,William Koch,Baiang Li,Felix Heide
机构: Princeton University (普林斯顿大学); Torc Robotics
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:  Code link: this https URL
Abstract:Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: this https URL
zh
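论文的核心分解可以用一个极简草图说明:给定父子层级,把每个顶点的全局运动拆成"父节点运动 + 局部残差",并验证可无损重组。实际方法中层级由图神经网络以可微方式推断,此处用事先给定的 parent 索引代替,仅为示意。

```python
# 示意性草图:沿给定层级把全局运动分解为"父节点继承 + 局部残差",并可无损重组
import numpy as np

def decompose(motions: np.ndarray, parent: list) -> np.ndarray:
    """motions: (V, T, D) 各顶点的全局运动;parent[i] = -1 表示根节点。"""
    residuals = motions.copy()
    for i, p in enumerate(parent):
        if p >= 0:
            residuals[i] = motions[i] - motions[p]  # 减去父节点的全局运动
    return residuals

def compose(residuals: np.ndarray, parent: list) -> np.ndarray:
    motions = residuals.copy()
    for i, p in enumerate(parent):       # 假设 parent 按拓扑序排列(父在子之前)
        if p >= 0:
            motions[i] = residuals[i] + motions[p]
    return motions

V, T, D = 4, 10, 2
m = np.random.randn(V, T, D)
parent = [-1, 0, 1, 1]                   # 一棵简单的树
assert np.allclose(compose(decompose(m, parent), parent), m)  # 分解可逆
```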
[CV-6] Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
【速读】:该论文试图解决当前世界模型(world model)在长时程预测中表现脆弱、性能退化的问题,其核心原因在于感知输入(如图像)的高维性及潜在表示(latent representation)的质量不足,导致动态建模变得困难。解决方案的关键是通过几何正则化(geometric regularization)提升潜在空间的结构质量:提出几何正则化世界模型(Geometrically-Regularized World Models, GRWM),强制沿自然感官轨迹连续点在潜在空间中保持接近,从而学习到与环境真实拓扑一致的潜在流形(latent manifold)。该方法无需修改动力学模块即可显著提高滚动预测的保真度和稳定性,验证了优化表示学习本身即可有效增强世界模型的鲁棒性。
链接: https://arxiv.org/abs/2510.26782
作者: Zaishuo Xia,Yukuan Lu,Xinyi Li,Yifan Xu,Yubei Chen
机构: University of California, Davis; Open Path AI Foundation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
zh
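GRWM 的正则项本身非常轻量,可以用几行代码示意:惩罚同一条感官轨迹上相邻时间步潜变量之间的距离,再与原有世界模型损失相加即可,动力学模块无需改动。距离度量与权重系数均为假设。

```python
# 示意性草图:GRWM 式的几何正则项——相邻时间步的潜变量应当彼此接近(度量为假设)
import torch

def geometric_regularizer(latents: torch.Tensor) -> torch.Tensor:
    """latents: (B, T, D),同一条感官轨迹按时间排列的潜在表示。"""
    diffs = latents[:, 1:] - latents[:, :-1]       # 相邻时间步之差
    return diffs.pow(2).sum(dim=-1).mean()

# 训练时与原有世界模型损失相加即可,例如:
# loss = world_model_loss + lambda_geo * geometric_regularizer(z_trajectory)
z = torch.randn(8, 32, 128, requires_grad=True)
print(geometric_regularizer(z).item())
```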
[CV-7] ChartAB: A Benchmark for Chart Grounding & Dense Alignment
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在图表理解任务中感知细节不足、难以提取细粒度结构以及无法有效比较和推理多图表信息的问题。其解决方案的关键在于提出一个全新的“ChartAlign Benchmark (ChartAB)”,该基准通过设计专用的JSON模板来精准计算各类图表定位任务(如表格数据提取、可视化元素定位及属性识别)的评估指标,并引入两阶段推理流程以进一步评估VLMs在跨图表元素/属性对齐与比较中的能力,从而系统性揭示现有模型在图表理解上的感知偏差、脆弱性和幻觉现象,为提升模型性能提供明确方向。
链接: https://arxiv.org/abs/2510.26781
作者: Aniruddh Bansal,Davit Soselia,Dang Nguyen,Tianyi Zhou
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel “ChartAlign Benchmark (ChartAB)” to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs’ capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
zh
[CV-8] Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
【速读】:该论文旨在解决年龄相关性黄斑变性(Age-related Macular Degeneration, AMD)在非侵入式RGB眼底图像中病变区域的语义分割问题,以实现对不同类型AMD病灶的精准检测。解决方案的关键在于基于U-Net结构构建改进的分割框架,通过系统评估和优化预处理技术、编码器(backbone)网络复杂度以及针对图像级和像素级类别不平衡设计的专用损失函数,最终形成一个在ADAM挑战赛数据集上优于所有先前提交方案的多类病变分割模型。
链接: https://arxiv.org/abs/2510.26778
作者: Valentyna Starodub,Mantas Lukoševičius
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive research competition and open dataset to date for AMD detection from RGB fundus images - serve as a benchmark for our evaluation. Taking the U-Net connectivity as the base of our framework, we evaluate and compare several approaches to improve the segmentation model’s architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
zh
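摘要提到针对图像级与像素级类别不平衡的专用损失函数。下面给出其中一种常见组合(soft Dice + focal)的示意性实现,作为理解这类损失的参考;论文最终配置采用的具体损失以原文为准。

```python
# 示意性草图:针对像素级类别不平衡的 soft Dice + focal 组合损失(常见做法,非论文精确配置)
import torch
import torch.nn.functional as F

def dice_focal_loss(logits, target, gamma=2.0, eps=1e-6):
    """logits: (B, C, H, W);target: (B, H, W) 的类别索引。"""
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()

    # soft Dice:逐类别计算,削弱大类像素占比的主导作用
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()

    # focal:对难分类(低置信度)的像素加大权重
    ce = F.cross_entropy(logits, target, reduction="none")   # (B, H, W)
    pt = torch.exp(-ce)
    focal = ((1 - pt) ** gamma * ce).mean()
    return dice + focal

logits = torch.randn(2, 4, 64, 64, requires_grad=True)
target = torch.randint(0, 4, (2, 64, 64))
print(dice_focal_loss(logits, target).item())
```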
[CV-9] SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在生成内容时难以精确遵循用户指令、易产生幻觉(hallucination)以及缺乏细粒度控制的问题。解决方案的关键在于提出了一种轻量级的可微分 steering 模块——SteerVLM,该模块通过学习成对提示(target 和 converse behavior)对应的潜在嵌入(latent embeddings),动态调节语言模态与图像上下文之间的激活连接,实现推理阶段对复杂输出语义的精细控制。其核心创新包括:1)基于维度级激活调制(dimension-wise activation modulation)和跨层自适应引导(adaptive steering),无需预提取静态向量或手动调整干预点;2)仅需原 VLM 参数量的 0.14% 即可实现高效控制,同时保持非目标任务性能不变;3)构建了 VNIA(Visual Narrative Intent Alignment)多模态数据集以支持 VLM 控制技术的开发与评估。
链接: https://arxiv.org/abs/2510.26769
作者: Anushka Sivakumar,Andrew Zhang,Zaber Hakim,Chris Thomas
机构: Virginia Tech (弗吉尼亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learnable parameters equal to only 0.14% of the original VLM’s size, and it gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
zh
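摘要中的"维度级激活调制 + 统一钩子注入"可以用 PyTorch 的 forward hook 最简地示意:给某一层的输出加上缩放后的 steering 向量,摘掉 hook 即恢复原模型。真实的 SteerVLM 模块是可学习且跨层自适应的,此处的固定向量仅为演示。

```python
# 示意性草图:通过 forward hook 在推理时对某一层的激活做维度级调制(固定向量仅为演示)
import torch
import torch.nn as nn

def make_steering_hook(direction: torch.Tensor, strength: float):
    def hook(module, inputs, output):
        # 假设 output 形状为 (B, L, D);逐维度加上缩放后的 steering 向量
        return output + strength * direction
    return hook

layer = nn.Linear(64, 64)                      # 代替 VLM 中的某个内部层
direction = torch.randn(64)
direction = direction / direction.norm()
handle = layer.register_forward_hook(make_steering_hook(direction, strength=4.0))

x = torch.randn(2, 10, 64)
steered = layer(x)                             # hook 的返回值会替换原输出
handle.remove()                                # 摘掉 hook 即恢复原模型
print((steered - layer(x)).abs().mean().item())
```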
[CV-10] he Impact and Outlook of 3D Gaussian Splatting
【速读】:该论文旨在系统梳理和总结3D高斯溅射(3D Gaussian Splatting, 3DGS)自提出以来在多个关键方向上的研究进展,以应对当前3D场景表示中对高效训练与渲染、动态建模(4D场景)、数学基础深化、移动端与虚拟现实(VR)部署、大规模环境扩展及近实时辐射场重建等挑战。其解决方案的关键在于通过多维度创新:一是提升资源效率的训练与渲染方法,二是推动从静态到动态(四维)场景建模的演进,三是强化外观建模与渲染过程的数学理论支撑,四是借助前馈或分布式计算实现快速重建,从而将3DGS从一种突破性表示技术发展为适用于多种下游任务的通用且基础性的3D视觉与图形工具。
链接: https://arxiv.org/abs/2510.26694
作者: Bernhard Kerbl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:  Article written for Frontiers of Science Award, International Congress on Basic Science, 2025
Abstract:Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
zh
[CV-11] Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
【速读】:该论文旨在解决钢铁轧制生产线中设备故障预测与过程中断的早期识别问题,以降低非计划停机带来的经济损失。解决方案的关键在于构建一个基于机器视觉的异常检测系统,通过工业相机实时采集设备运行、对齐状态及热钢坯运动的视频流,并在集中式视频服务器上利用深度学习模型进行推理分析;该方案将计算负载从工业过程控制系统(PLCs)中卸载,实现跨产线的可扩展部署,同时融合数据采集系统(DAQ)的传感器数据与视觉输入,精准定位故障位置并推断可能的根本原因,从而为预防性维护提供可操作的洞察。
链接: https://arxiv.org/abs/2510.26684
作者: Vaibhav Kurrey,Sivakalyan Pujari,Gagan Raj Gupta
机构: Indian Institute of Technology Bhilai (印度理工学院比哈尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.
zh
[CV-12] Improving Classification of Occluded Objects through Scene Context
【速读】:该论文旨在解决物体识别算法在存在遮挡(occlusion)情况下性能显著下降的问题,尤其针对基于区域提议网络-深度卷积神经网络(RPN-DCNN)的目标检测框架。其核心解决方案在于引入场景上下文信息以增强模型鲁棒性,具体通过两种不同的基于场景的信息融合策略实现:第一种是在预测前根据背景场景选择合适的物体检测网络;第二种是在检测后将场景知识融合到RPN输出的初始目标得分中。实验表明,这两种方法均能在部分遮挡的挑战性数据集上提升召回率与精确度,且联合训练包含遮挡和未遮挡图像的数据集效果最优,体现出方法的有效性与可迁移性。
链接: https://arxiv.org/abs/2510.26681
作者: Courtney M. King,Daniel D. Leeds,Damian Lyons,George Kalaitzis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The presence of occlusions has provided substantial challenges to typically-powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.
zh
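第二种融合策略(检测后把场景知识融入初始目标得分)可以用一个玩具例子示意:按场景查先验表 p(类别|场景),与 RPN-DCNN 的初始得分线性加权。先验表与加权公式均为假设,并非论文的精确形式。

```python
# 示意性草图:把场景先验融入检测得分的简化版本(先验表与公式均为假设)
import numpy as np

# 假设的 p(类别 | 场景) 先验,可由训练集共现统计得到
scene_prior = {
    "kitchen": {"cup": 0.5, "dog": 0.1, "car": 0.01},
    "street":  {"cup": 0.05, "dog": 0.2, "car": 0.6},
}

def fuse_scores(obj_scores: dict, scene: str, alpha: float = 0.7) -> dict:
    """obj_scores: RPN-DCNN 输出的初始类别得分;alpha 控制视觉证据与场景先验的权重。"""
    prior = scene_prior[scene]
    return {c: alpha * s + (1 - alpha) * prior.get(c, 0.0)
            for c, s in obj_scores.items()}

# 被遮挡物体的视觉得分彼此接近时,场景先验帮助排歧
print(fuse_scores({"cup": 0.35, "dog": 0.33, "car": 0.32}, scene="kitchen"))
```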
[CV-13] owards Reliable Sea Ice Drift Estimation in the Arctic Deep Learning Optical Flow on RADARSAT-2
【速读】:该论文旨在解决北极海冰漂移(sea ice drift)精确估计的问题,这对于北极航行、气候研究和业务预报至关重要。传统光学流(optical flow)方法依赖于严格的数学假设,在复杂场景中精度受限;而近年来基于深度学习的光学流方法在计算机视觉领域显著提升了运动估计准确性。论文的关键解决方案是首次在RADARSAT-2 ScanSAR海冰影像上构建了包含48个深度学习光学流模型的大规模基准测试,并通过GNSS浮标跟踪数据验证其性能,结果表明多个模型可实现亚公里级精度(EPE 6–8像素,约300–400米),且能生成空间连续的漂移场,优于稀疏浮标位置提供的离散估计,从而为极地遥感中的运动估计提供了高效、高分辨率的新范式。
链接: https://arxiv.org/abs/2510.26653
作者: Daniela Martin,Joseph Gallego
机构: University of Delaware (特拉华大学); Drexel University (德雷塞尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
备注:
Abstract:Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel-wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep-learning-based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl-all metrics against GNSS-tracked buoys. Several models achieve sub-kilometer accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep-learning-based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing. Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
zh
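摘要中用到的两个评测指标可以简单复现:EPE 是逐像素端点误差的均值;Fl-all 按 KITTI 的惯例把"误差 > 3 像素且 > 真值光流模长 5%"的像素计为离群点。以下为示意性实现。

```python
# 示意性草图:光流评测常用指标 EPE 与 Fl-all(KITTI 离群定义)
import numpy as np

def epe(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred/gt: (H, W, 2) 的光流场;返回逐像素端点误差均值。"""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fl_all(pred: np.ndarray, gt: np.ndarray) -> float:
    err = np.linalg.norm(pred - gt, axis=-1)
    mag = np.linalg.norm(gt, axis=-1)
    outlier = (err > 3.0) & (err > 0.05 * mag)   # KITTI 式离群点定义
    return float(outlier.mean() * 100.0)          # 百分比

gt = np.random.randn(256, 256, 2) * 10
pred = gt + np.random.randn(256, 256, 2)
print(f"EPE={epe(pred, gt):.3f} px, Fl-all={fl_all(pred, gt):.2f}%")
```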
[CV-14] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
【速读】:该论文旨在解决自动驾驶汽车(AVs)在复杂多模态环境中实现可靠目标检测的挑战,这一问题源于当前知识碎片化,分散于多模态感知、情境推理与协同智能等领域。解决方案的关键在于通过系统性综述,聚焦新兴范式如视觉-语言模型(VLMs)、大语言模型(LLMs)和生成式AI(Generative AI),而非重复分析过时技术;具体包括:1)梳理车载传感器(摄像头、超声波、激光雷达、雷达)及其融合策略;2)构建超越简单数据集合的结构化数据集分类体系(涵盖自车、基础设施及车路协同数据如V2V/V2I/V2X/I2I);3)分析前沿检测方法,尤其是基于Transformer架构的视觉Transformer(ViTs)、大/小语言模型(SLMs)与VLMs驱动的2D/3D检测流水线及混合传感器融合方案,从而为未来研究提供清晰的技术路线图与发展方向。
链接: https://arxiv.org/abs/2510.26641
作者: Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Hazim Alzorgan,Ahmad Sarlak,Mahlagha Fazeli,Abolfazl Razi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
zh
[CV-15] PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus
【速读】:该论文旨在解决无人机(UAV)目标检测中面临的挑战,包括复杂背景、严重遮挡、密集小目标以及光照条件变化等问题。其解决方案的关键在于提出了一种基于RT-DETR的新型检测算法PT-DETR:首先在骨干网络中引入部分感知细节聚焦(Partially-Aware Detail Focus, PADF)模块以增强对小目标的特征提取能力;其次设计中值频率特征融合(Median-Frequency Feature Fusion, MFFF)模块,提升模型捕捉小目标细节与上下文信息的能力;最后引入Focaler-SIoU损失函数,强化边界框匹配能力并提高对小目标特征的敏感性,从而显著提升检测精度与鲁棒性。实验表明,相较于RT-DETR,PT-DETR在VisDrone2019数据集上分别提升了1.6%和1.7%的mAP,且计算复杂度更低、参数更少,验证了其在小目标检测任务中的有效性。
链接: https://arxiv.org/abs/2510.26630
作者: Bingcong Huo,Zhiming Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions, this paper proposes PT-DETR based on RT-DETR, a novel detection algorithm specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally, we design the Median-Frequency Feature Fusion (MFFF) module, which effectively improves the model’s ability to capture small-object details and contextual information. Furthermore, we incorporate Focaler-SIoU to strengthen the model’s bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.
zh
[CV-16] Spiking Patches: Asynchronous Sparse and Efficient Tokens for Event Cameras
【速读】:该论文旨在解决事件相机(event camera)数据在视觉任务中表示方式的局限性问题,即现有方法如帧(frame)或体素(voxel)表示虽然准确率较高,但会破坏事件数据的异步性和空间稀疏性这一核心特性。解决方案的关键在于提出一种名为Spiking Patches的专用分词器(tokenizer),它将事件流转化为保留原始时空稀疏性和异步特性的“token”,从而实现高效且高精度的事件表示。实验表明,该方法在图神经网络(GNN)、点云网络(PCN)和Transformer上均能显著提升推理速度(最高达3.4倍于体素、10.4倍于帧),同时保持甚至超越原有方法的准确性,为事件相机驱动的视觉任务提供了新的建模范式。
链接: https://arxiv.org/abs/2510.26614
作者: Christoffer Koo Øhrstrøm,Ronja Güldenring,Lazaros Nalpantidis
机构: Technical University of Denmark (丹麦技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
zh
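"把事件流按空间 patch 分组为 token、且只有活跃 patch 才产生 token"的思路可以用几行 numpy 示意如下;patch 大小与 token 的具体内容(此处直接保留原始事件)均为假设,并非论文的精确分词器。

```python
# 示意性草图:按空间 patch 把异步事件分组成 token,只有产生事件的 patch 才生成 token
import numpy as np

def events_to_patch_tokens(events: np.ndarray, patch: int = 16):
    """events: (N, 4) 的 (x, y, t, p);返回 {patch_id: 该 patch 内的事件}。"""
    x, y = events[:, 0].astype(int), events[:, 1].astype(int)
    patch_ids = (y // patch) * 10_000 + (x // patch)    # 简单的二维 -> 一维编码
    tokens = {}
    for pid in np.unique(patch_ids):
        tokens[pid] = events[patch_ids == pid]          # 每个活跃 patch 一个 token
    return tokens

events = np.column_stack([
    np.random.randint(0, 640, 5000),   # x
    np.random.randint(0, 480, 5000),   # y
    np.sort(np.random.rand(5000)),     # t(保持时间有序)
    np.random.randint(0, 2, 5000),     # 极性
])
tokens = events_to_patch_tokens(events)
print(f"{len(tokens)} 个活跃 patch token(共 {640 // 16 * (480 // 16)} 个可能位置)")
```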
[CV-17] CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
【速读】:该论文旨在解决传统作物产量预测方法在精度和粒度上难以满足精准农业需求的问题,特别是缺乏对田块内部空间异质性的精细刻画能力。其解决方案的关键在于提出CYPRESS模型,该模型基于预训练的大规模地球观测基础模型Prithvi-EO-2.0-600M,并通过微调(fine-tuning)将其适配为连续回归任务,从而将多时相卫星遥感影像转化为像素级的高分辨率产量图(yield map),实现了从宏观尺度到田块尺度的精准映射,显著提升了预测的时空分辨率与实用性。
链接: https://arxiv.org/abs/2510.26609
作者: Shayan Nejadshamsi,Yuanyuan Zhang,Shadi Zaki,Brock Porth,Lysa Porth,Vahab Khoshdel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi’s Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
zh
[CV-18] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
【速读】:该论文旨在解决荧光显微成像中的计算超分辨率(Computational Super-Resolution, CSR)问题,即如何从低分辨率图像中重建出高分辨率细节,这本质上是一个病态逆问题。传统方法受限于先验知识的表达能力,难以在噪声干扰或结构复杂场景下获得高质量重建结果。本文提出ResMatching方法,其核心创新在于利用引导条件流匹配(guided conditional flow matching)来学习更强大的数据先验,从而有效提升CSR的重建质量与鲁棒性。该方法不仅在多个生物结构上优于7个基线模型,还首次实现了像素级数据不确定性估计,为用户提供了可靠的置信度指导,尤其在低信噪比条件下表现出显著优势。
链接: https://arxiv.org/abs/2510.26601
作者: Anirban Ray,Vera Galinova,Florian Jug
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:  5 pages, 4 figures
Abstract:Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
zh
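引导条件流匹配的训练目标可以退化为标准条件流匹配(CFM)来示意:在噪声与目标之间取直线插值路径,回归该路径的速度场,并以低分辨率图像为条件。ResMatching 的"引导"细节为论文特有,以下草图不包含,网络接口为假设。

```python
# 示意性草图:条件流匹配(CFM)的标准训练目标,以低分辨率图像为条件(接口为假设)
import torch

def cfm_loss(v_theta, x0, x1, cond):
    """x0: 噪声样本;x1: 高分辨率目标;cond: 低分辨率输入。"""
    B = x0.shape[0]
    t = torch.rand(B, 1, 1, 1, device=x0.device)
    x_t = (1 - t) * x0 + t * x1            # 直线插值路径
    target_v = x1 - x0                     # 该路径下的真实速度场
    pred_v = v_theta(x_t, t.flatten(), cond)
    return (pred_v - target_v).pow(2).mean()

# 玩具演示:用零输出的假"网络"验证形状与流程
v = lambda x, t, c: torch.zeros_like(x)
x0, x1 = torch.randn(4, 1, 32, 32), torch.randn(4, 1, 32, 32)
print(cfm_loss(v, x0, x1, cond=torch.randn(4, 1, 32, 32)).item())
```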
[CV-19] Emu3.5: Native Multimodal Models are World Learners
【速读】:该论文旨在解决多模态世界模型在跨视觉与语言模态的连续状态预测、高效推理及复杂生成任务中的局限性问题。其核心挑战在于如何实现原生的多模态序列建模、提升长程生成一致性,并在保持性能的同时显著优化推理效率。解决方案的关键在于:首先,提出端到端预训练的统一下一个词元预测目标(next-token prediction objective),利用包含超过10万亿词元的视觉-语言交错数据进行训练,使模型自然支持交错输入与输出;其次,引入大规模强化学习后训练以增强多模态推理能力;最后,设计离散扩散适配(Discrete Diffusion Adaptation, DiDA)机制,将传统的逐词元解码转化为双向并行预测,使单图推理速度提升约20倍而无性能损失,从而有效支撑高效率的多模态生成与世界建模任务。
链接: https://arxiv.org/abs/2510.26583
作者: Yufeng Cui,Honghao Chen,Haoge Deng,Xu Huang,Xinghang Li,Jirong Liu,Yang Liu,Zhuoyan Luo,Jinsheng Wang,Wenxuan Wang,Yueze Wang,Chengyuan Wang,Fan Zhang,Yingli Zhao,Ting Pan,Xianduo Li,Zecheng Hao,Wenxuan Ma,Zhuo Chen,Yulong Ao,Tiejun Huang,Zhongyuan Wang,Xinlong Wang
机构: BAAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  project page: this https URL
Abstract:We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at this https URL to support community research.
zh
[CV-20] CATCH: A Modular Cross-domain Adaptive Template with Hook
【速读】:该论文旨在解决视觉问答(Visual Question Answering, VQA)模型在跨域场景下泛化性能显著下降的问题,尤其是在遥感、医学影像和数学图表等分布差异较大的新领域中。现有方法通常依赖于针对每个领域的微调或定制化流水线,存在成本高、灵活性差且难以扩展的局限性。解决方案的关键在于提出一种即插即用的跨域自适应框架CATCH,其核心思想是将视觉与语言适应解耦:通过引入一个轻量级域分类器识别输入图像类型,并设计双适配机制——包括用于语言模态调节的Prompt Adapter和用于视觉特征调整的Visual Adapter,二者均通过统一钩子接口动态注入,无需重训练骨干模型。实验表明,该方法在四个不同领域的VQA基准上实现了稳定提升,验证了其在多领域场景下的可扩展性和实用性。
链接: https://arxiv.org/abs/2510.26582
作者: Xinjin Li,Yulie Lu,Jinghan Cao,Yu Ma,Zhenglin Li,Yeyang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
zh
[CV-21] Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios
【速读】:该论文旨在解决AI系统在真实世界环境中面对未见过的场景且缺乏标注数据时,传统场景理解模型难以泛化的问题(即零样本场景理解挑战)。解决方案的关键在于提出一种动态上下文感知的场景推理框架(Dynamic Context-Aware Scene Reasoning),其核心是利用视觉-语言对齐机制,将预训练的视觉Transformer与大语言模型相结合,实现视觉语义与自然语言描述的对齐,从而增强对新环境的上下文理解能力;同时引入一个动态推理模块,通过融合全局场景线索和由语言先验引导的物体级交互信息,提升预测精度。实验表明,该方法在COCO、Visual Genome和Open Images等零样本基准上相较基线模型最高提升18%的场景理解准确率,并在模糊或杂乱场景中展现出鲁棒性。
链接: https://arxiv.org/abs/2510.26580
作者: Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Preprint under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Abstract:In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.
zh
[CV-22] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
【速读】:该论文旨在解决广告行业中多版本短视频广告(如15秒与30秒版本)的自动化剪辑问题,传统方法依赖人工从长视频中逐帧挑选并重编辑,效率低下。解决方案的关键在于将视频剪辑建模为一个面向广告场景的镜头选择问题,并提出一种双流音频-视觉融合模型,通过联合分析音视频特征来预测每个帧的重要性得分(即其被选入最终短版广告的概率),从而实现精准、自动化的广告片段生成。此外,作者构建了首个专用于广告剪辑的AdSum204数据集,包含来自真实广告活动的102对长短版本广告,有效支撑了模型训练与评估。
链接: https://arxiv.org/abs/2510.26569
作者: Wen Xie,Yanjun Zhu,Gijs Overgoor,Yakov Bart,Agata Lapedriza Garcia,Sarah Ostadabbas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:  Accepted at 32nd International Conference on MultiMedia Modeling
Abstract:Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
zh
[CV-23] SA2Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging
【速读】:该论文旨在解决超声体积投影成像(VPI)中脊柱分割任务面临的两大挑战:一是忽略不同骨特征间的高空间相关性可能导致全局上下文知识学习不足;二是脊柱骨骼富含形状与位置结构信息,需有效编码至分割流程以提升精度。解决方案的关键在于提出一种尺度自适应结构感知网络(SA² Net),其核心创新包括:1)设计尺度自适应互补策略,用于学习跨维度的长距离相关特征;2)基于Transformer多头自注意力与语义级亲和力的一致性,引入结构亲和变换(structure-affinity transformation),将类别特异性亲和力融入语义特征,并结合Transformer解码器实现结构感知推理;3)采用特征混合损失聚合方法增强模型训练鲁棒性与准确性。实验表明,该方法在脊柱分割性能上优于现有主流方法,且具备良好的骨干网络适配性,为智能脊柱影像分析辅助青少年特发性脊柱侧弯(adolescent idiopathic scoliosis, AIS)诊断提供了有效工具。
链接: https://arxiv.org/abs/2510.26568
作者: Hao Xie,Zixun Huang,Yushen Zuo,Yakun Ju,Frank H. F. Leung,N. F. Law,Kin-Man Lam,Yong-Ping Zheng,Sai Ho Ling
机构: The Hong Kong Polytechnic University (香港理工大学); University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Accepted by Computerized Medical Imaging and Graphics (CMIG)
Abstract:Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role in intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA²Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic-level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA²Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA²Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis. The code and experimental demo are available at this https URL.
zh
[CV-24] Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm
【速读】:该论文旨在解决边缘检测任务中现有检测器存在的两个关键问题:一是难以识别松散边缘(loose edges),二是缺乏上下文信息以从特定问题中提取相关特征。为应对这些问题,作者提出了一种基于二维细胞自动机(two-dimensional cellular automaton)描述的可适配边缘检测器,并通过元启发式优化(meta-heuristic optimization)与迁移学习(transfer learning)技术进行参数调优。该解决方案的关键在于利用细胞自动机的局部规则实现检测器的动态适应能力,同时借助优化策略和迁移学习提升模型对不同图像特性的泛化性能。实验表明,扩展优化搜索空间对所选图像集无显著效果,而模型在多种验证条件下均展现出良好的自适应能力,尽管迁移学习未带来明显性能提升。
链接: https://arxiv.org/abs/2510.26509
作者: Vinícius Ferraria,Eurico Ruivo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The edge detection task is essential in image processing, aiming to extract relevant information from an image. One recurring problem in this task is the weaknesses found in some detectors, such as the difficulty in detecting loose edges and the lack of context to extract relevant information from specific problems. To address these weaknesses and adapt the detector to the properties of an image, an adaptable detector, described by a two-dimensional cellular automaton and optimized by a meta-heuristic combined with transfer-learning techniques, was developed. This study analyzes the impact of expanding the search space of the optimization phase and the robustness of the detector’s adaptability in identifying edges in a set of natural images and in specialized subsets extracted from the same image set. The results show that expanding the search space of the optimization phase was not effective for the chosen image set. The study also analyzed the adaptability of the model through a series of experiments and validation techniques and found that, regardless of the validation scheme, the model was able to adapt to the input, while the transfer-learning techniques applied to the model showed no significant improvements.
zh
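作为背景,二维细胞自动机做边缘检测的一个极简规则可以示意如下:若像素与其 Moore 邻域的最大灰度差超过阈值,则标记为边缘。论文中的规则由粒子群优化搜索得到,此处的规则与阈值均为假设。

```python
# 示意性草图:基于 Moore 邻域灰度差的极简细胞自动机边缘检测规则(规则与阈值均为假设)
import numpy as np

def ca_edge_step(img: np.ndarray, thresh: float = 0.15) -> np.ndarray:
    """img: [0,1] 灰度图;返回 0/1 边缘图。"""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img)
    H, W = img.shape
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == dx == 0:
                continue
            neigh = padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
            out = np.maximum(out, np.abs(img - neigh))  # 与各邻居的最大灰度差
    return (out > thresh).astype(np.uint8)

img = np.zeros((32, 32)); img[:, 16:] = 1.0   # 一条垂直阶跃边
print(ca_edge_step(img).sum(), "个边缘像素")
```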
[CV-25] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
【速读】:该论文旨在解决视觉-语言模型中因物体与背景共现(object-context co-occurrence)导致的“上下文捷径”(object-context shortcuts)问题,该问题会显著削弱模型在测试场景与训练分布不一致时的零样本(zero-shot)可靠性。解决方案的关键在于将该问题建模为因果推理任务:通过估计CLIP表示空间中物体和背景的期望特征,并利用外部数据集、批次内邻居或文本描述采样多样化替代背景,重组生成反事实嵌入(counterfactual embeddings)。进一步通过估计总直接效应(Total Direct Effect)并模拟干预操作,消除仅由背景引发的激活信号,从而保留有益的物体-背景交互,同时抑制幻觉性得分。该方法无需重新训练或提示设计,即可在多个对上下文敏感的基准上显著提升最差组和平均准确率,建立新的零样本性能上限。
链接: https://arxiv.org/abs/2510.26466
作者: Pei Peng,MingKun Xie,Hang Hao,Tong Jin,ShengJun Huang
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); RIKEN Center for Advanced Intelligence Project (理化学研究所先进智能项目中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP’s representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
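下面用 numpy 给出“重组反事实嵌入并扣除仅背景激活”这一核心思想的最小示意。特征的物体/背景拆分方式、相加式重组以及扣除系数 lam 均为便于说明的假设,并非论文的官方实现。

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def debiased_score(obj_feat, ctx_feats, bg_feat, text_feat, lam=0.5):
    """反事实去偏打分示意。obj_feat: 物体特征 (d,);
    ctx_feats: K 个替代背景特征 (K, d);bg_feat: 原图背景特征 (d,);
    text_feat: 类别文本特征 (d,);lam 为背景扣除系数(假设超参数)。"""
    # 将物体特征与多样化替代背景重组,模拟“换一个环境”的干预
    cf_scores = [cos(obj_feat + c, text_feat) for c in ctx_feats]
    # 扣除仅由背景引起的激活,保留有益的物体-背景交互
    return float(np.mean(cf_scores) - lam * cos(bg_feat, text_feat))

score = debiased_score(np.random.rand(512), np.random.rand(8, 512),
                       np.random.rand(512), np.random.rand(512))
```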
zh
[CV-26] Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
【速读】:该论文旨在解决少样本异常检测(Few-shot Anomaly Detection, FSAD)中因图像级文本描述与patch级视觉异常语义不匹配而导致的定位性能不佳问题。现有方法依赖预训练视觉语言模型(Vision-Language Models, VLMs)通过图文特征相似性识别异常区域,但由于缺乏细粒度描述,仅能使用全局图像级文本描述匹配每个视觉patch token,造成语义错位。解决方案的关键在于提出多层级细粒度语义描述(Multi-Level Fine-Grained Semantic Caption, MFSC),构建自动化的细粒度文本描述生成流程,并在此基础上设计FineGrainedAD框架,其核心包括两个组件:多层级可学习提示(Multi-Level Learnable Prompt, MLLP)和多层级语义对齐(Multi-Level Semantic Alignment, MLSA),分别通过自动替换与拼接机制引入细粒度语义信息,并利用区域聚合策略与多层级对齐训练使可学习提示更好地与对应视觉区域对齐,从而显著提升异常定位精度。
链接: https://arxiv.org/abs/2510.26464
作者: Yuanting Fan,Jun Liu,Xiaochen Chen,Bin-Bin Gao,Jian Li,Yong Liu,Jinlong Peng,Chengjie Wang
机构: Tencent Youtu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  12 pages, 7 figures
Abstract:Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token to identify potential anomalous regions, which leads to the semantic misalignment between image descriptions and patch-level visual anomalies, achieving sub-optimal localization performance. To address the above issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and fine-grained textual descriptions for existing anomaly detection datasets with automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through automatic replacement and concatenation mechanism, while MLSA designs region aggregation strategy and multi-level alignment training to facilitate learnable prompts better align with corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on MVTec-AD and VisA datasets.
zh
[CV-27] PointSt3R: Point Tracking through 3D Grounded Correspondence
【速读】:该论文旨在解决动态场景中点跟踪(point tracking)的难题,尤其是如何在缺乏时间上下文的情况下实现鲁棒且准确的跨帧点匹配。其关键解决方案在于:利用基础3D重建模型(如MASt3R)通过引入重建损失与可见性头(visibility head)来联合训练动态对应关系,并仅在包含查询点的一对帧上进行训练与评估,从而避免依赖时间序列信息;同时,通过少量合成数据微调模型并混合静态与动态点对应关系,在多个基准数据集上实现了优于或相当的点跟踪性能(例如在EgoPoints和RGB-S数据集上显著超越CoTracker3)。
链接: https://arxiv.org/abs/2510.26443
作者: Rhodri Guerrier,Adam W. Harley,Dima Damen
机构: University of Bristol (布里斯托大学); Meta Reality Labs Research (Meta现实实验室研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:   this http URL
Abstract:Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks (+33.5% on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tuning MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 δ_avg / 85.8% occlusion acc. for PointSt3R compared to 75.7 / 88.3% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
zh
[CV-28] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
【速读】:该论文旨在解决测试时提示调优(Test-time Prompt Tuning, TPT)中因文本特征分散度不足而导致的校准性能下降问题,这会影响视觉语言模型(Vision-Language Models, VLMs)的可靠性、可信度和安全性。现有TPT方法主要通过最大化平均文本特征分散度或施加正交约束来提升校准效果,但这些策略未必能实现类间文本特征间的最优角度分离,忽视了角度多样性(angular diversity)的重要性。本文提出A-TPT框架,其核心创新在于引入角度多样性建模,通过最大化单位超球面上归一化文本特征之间的最小成对角度距离,促使学习到的提示诱导出更均匀分布的文本特征,从而显著改善VLM在测试时适应过程中的校准性能,同时保持与现有方法相当的准确率。
链接: https://arxiv.org/abs/2510.26441
作者: Shihab Aaqil Ahamed,Udaya S.K.P. Miriya Thanthrige,Ranga Rodrigo,Muhammad Haris Khan
机构: University of Moratuwa (斯里兰卡莫鲁塔瓦大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  23 pages, 14 figures
Abstract:Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs’ reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
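A-TPT 的核心目标是最大化单位超球面上归一化文本特征之间的最小成对角距离;下面给出该目标函数的一个最小 PyTorch 示意(取负号作为损失,具体加权与实现细节与论文可能不同)。

```python
import torch
import torch.nn.functional as F

def angular_diversity_loss(text_feats: torch.Tensor) -> torch.Tensor:
    """text_feats: (C, d),每行为一个类别提示诱导的文本特征。
    返回负的最小成对角距离,最小化该损失即最大化角度分离。"""
    f = F.normalize(text_feats, dim=-1)              # 投影到单位超球面
    cos_sim = f @ f.t()                               # 成对余弦相似度
    eye = torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    cos_sim = cos_sim.masked_fill(eye, -1.0)          # 排除自身配对
    theta = torch.acos(cos_sim.clamp(-1 + 1e-6, 1 - 1e-6))
    return -theta.min()

# 用法示意:与 TPT 原有目标加权求和后反向传播,更新可学习提示
feats = torch.randn(10, 512, requires_grad=True)
angular_diversity_loss(feats).backward()
```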
zh
[CV-29] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
【速读】:该论文旨在解决当前长视频生成(Long Video Generation, LVG)评估中存在的关键问题,即现有基准测试多依赖简化提示(prompt),仅关注低层次指标,忽视了与复杂提示的细粒度对齐以及叙事连贯性、主题表达等抽象维度的衡量。解决方案的关键在于提出LoCoT2V-Bench——一个专为复杂输入条件下长视频生成设计的基准测试平台,其核心创新包括:构建包含场景转换和事件动态等真实世界元素的复杂提示集,并引入多维评估框架,涵盖事件级对齐、细粒度时间一致性、内容清晰度及人类期望实现度(Human Expectation Realization Degree, HERD)等新指标,从而系统性地量化模型在高层次语义层面的表现,揭示当前方法在跨事件一致性、细粒度对齐和主题遵循方面的不足。
链接: https://arxiv.org/abs/2510.26412
作者: Xiangqing Zheng,Chengyue Wu,Kehai Chen,Min Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD) that focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
zh
[CV-30] EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models
【速读】:该论文旨在解决现有基于脑电图(EEG)的图像重建方法因忽视空间注意力机制而导致重建图像保真度和语义一致性不足的问题。其解决方案的关键在于提出一种双条件框架,通过结合EEG嵌入与空间显著性图(saliency maps)来增强图像生成效果:首先利用自适应思维映射器(Adaptive Thinking Mapper, ATM)提取EEG特征,并采用低秩适应(Low-Rank Adaptation, LoRA)微调Stable Diffusion 2.1模型以对齐神经信号与视觉语义;同时引入ControlNet分支,将显著性图作为空间条件控制生成过程。该设计有效利用了注意力先验来缓解EEG信号的模糊性,从而实现高保真图像重建,在医学诊断和神经适应接口等领域具有应用潜力。
链接: https://arxiv.org/abs/2510.26391
作者: Igor Abramov,Ilya Makarov
机构: Ivannikov Institute for System Programming of the Russian Academy of Sciences (俄罗斯科学院伊万尼科夫系统编程研究所); Research Center for Trusted Artificial Intelligence (可信人工智能研究中心); AI Talent Hub, ITMO University (ITMO大学人工智能人才中心); AIRI (AIRI); Saint Petersburg, Russia (圣彼得堡, 俄罗斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Demo paper
Abstract:Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches, while aligning strongly with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
zh
[CV-31] CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
【速读】:该论文旨在解决工业场景中人员定位与识别的难题,尤其是在物流仓库等复杂环境中,仅依赖视觉数据难以实现可靠的身份识别,而传统基于轨迹与可穿戴传感器数据匹配的方法在真实场景下性能不稳定。解决方案的关键在于提出CorVS方法,其核心是通过深度学习模型预测视觉跟踪轨迹与传感器测量之间的对应概率和置信度,并基于这些预测结果在时间维度上进行轨迹与传感器数据的匹配,从而实现高鲁棒性的人员识别。
链接: https://arxiv.org/abs/2510.26369
作者: Kazuma Kano,Yuki Mori,Shin Katayama,Kenta Urano,Takuro Yonezawa,Nobuo Kawaguchi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:  7 pages, 3 figures, accepted to IPIN 2025
Abstract:Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method’s effectiveness for real-world applications.
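CorVS 的第二阶段按预测的对应概率与置信度在时间上匹配轨迹与传感器测量;下面用匈牙利算法(scipy)给出一帧内一对一匹配的最小示意,其中“置信度加权对数概率”这一打分方式是为说明而作的假设。

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_trajectories(prob: np.ndarray, rel: np.ndarray):
    """prob[i, j]: 轨迹 i 与传感器 j 对应的预测概率;
    rel[i, j]: 该预测的置信度。返回使总得分最大的一对一匹配。"""
    score = rel * np.log(np.clip(prob, 1e-8, 1.0))   # 置信度加权对数概率(假设)
    rows, cols = linear_sum_assignment(-score)        # 取负以最大化总得分
    return list(zip(rows.tolist(), cols.tolist()))

# 用法示意:3 条视觉轨迹与 3 组可穿戴传感器测量
prob = np.array([[0.9, 0.1, 0.2], [0.2, 0.8, 0.1], [0.1, 0.3, 0.7]])
print(match_trajectories(prob, np.ones_like(prob)))   # [(0, 0), (1, 1), (2, 2)]
```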
zh
[CV-32] AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM
【速读】:该论文旨在解决果园中自主机器人在重复性行状结构、季节性外观变化及风致叶面运动等复杂环境下实现实时3D场景理解的问题。其解决方案的关键在于提出AgriGS-SLAM框架,该框架将直接LiDAR里程计与回环闭合技术同多相机3D高斯溅射(3D Gaussian Splatting, 3DGS)渲染相结合,通过跨互补视角的批量光栅化恢复被遮挡区域的果园结构,并采用统一梯度驱动的地图生命周期管理机制在关键帧间维持精细细节并控制内存占用;同时,基于概率LiDAR深度一致性项引导位姿优化,反向传播至相机投影以强化几何-外观耦合,从而在苹果和梨园不同季节(休眠期、开花期、收获期)的实地部署中实现更清晰、稳定且实时的重建结果与轨迹估计。
链接: https://arxiv.org/abs/2510.26358
作者: Mirko Usuelli,David Rapado-Rincon,Gert Kootstra,Matteo Matteucci
机构: Dipartimento di Bioingegneria, Elettronica e Informazione, Politecnico di Milano (米兰理工大学生物工程、电子与信息系); Agricultural Biosystems Engineering, Wageningen University & Research (瓦赫宁根大学与研究农业生物系统工程系)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous robots in orchards require real-time 3D scene understanding despite repetitive row geometry, seasonal appearance changes, and wind-driven foliage motion. We present AgriGS-SLAM, a Visual–LiDAR SLAM framework that couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints recovers orchard structure under occlusions, while a unified gradient-driven map lifecycle executed between keyframes preserves fine details and bounds memory. Pose refinement is guided by a probabilistic LiDAR-based depth consistency term, back-propagated through the camera projection to tighten geometry-appearance coupling. We deploy the system on a field platform in apple and pear orchards across dormancy, flowering, and harvesting, using a standardized trajectory protocol that evaluates both training-view and novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while maintaining real-time performance on-tractor. While demonstrated in orchard monitoring, the approach can be applied to other outdoor domains requiring robust multimodal perception.
zh
[CV-33] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? ICLR2026
【速读】:该论文旨在解决图像超分辨率(Image Super-Resolution, SR)在实际应用中对场景文本(scene-text)恢复效果不佳的问题,即现有SR方法主要优化传统指标(如PSNR、SSIM)或感知质量指标(如LPIPS、MANIQA、CLIP-IQA、MUSIQ),但这些指标对字符级错误不敏感,导致OCR识别失败,即便图像整体视觉质量良好。解决方案的关键在于提出GLYPH-SR框架,其核心是引入一个由OCR数据引导的Text-SR Fusion ControlNet(TS-ControlNet)和一种交替进行文本导向与场景导向指导的“乒乓调度器”(ping-pong scheduler),通过在合成语料上训练特定组件并冻结主SR分支,实现文本可读性与视觉真实感的同时优化,在SVT、SCUT-CTW1500和CUTE80等基准上显著提升OCR F1分数(最高达+15.18个百分点),同时保持感知质量指标竞争力。
链接: https://arxiv.org/abs/2510.26339
作者: Mingyu Sung,Seungjae Ham,Kangwoo Kim,Yeokyoung Yoon,Sangseok Yun,Il-Min Kim,Jae-Mo Kang
机构: Kyungpook National University (庆北国立大学); Queen’s University (皇后大学); Pukyong National University (釜庆国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:  11 pages, 6 figures. Includes supplementary material. Under review as a conference paper at ICLR 2026
Abstract:Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene-text, enables reliable downstream perception. Scene-text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene-text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ scores. GLYPH-SR is designed to satisfy both objectives simultaneously, high readability and high visual realism, delivering SR that looks right and reads right.
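摘要提到的 ping-pong 调度器在去噪步之间交替施加文本导向与场景导向指导;下面是该调度逻辑的玩具化示意,其中交替周期与两个指导函数均为占位假设,并非 GLYPH-SR 的官方实现。

```python
def ping_pong_denoise(latent, text_guide, scene_guide, steps=50, period=2):
    """交替施加文本/场景指导的示意去噪循环。
    text_guide / scene_guide: 假设的单步去噪函数 (latent, t) -> latent;
    period: 每隔多少步切换一次指导分支(假设参数)。"""
    for t in range(steps):
        guide = text_guide if (t // period) % 2 == 0 else scene_guide
        latent = guide(latent, t)
    return latent

# 用法示意:用恒等函数占位 TS-ControlNet 文本分支与场景 SR 分支
out = ping_pong_denoise([0.0], lambda x, t: x, lambda x, t: x)
```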
zh
[CV-34] A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
【速读】:该论文旨在解决当前糖尿病视网膜病变(Diabetic Retinopathy, DR)自动诊断系统性能受限于单一骨干网络(如卷积神经网络 CNN 或视觉 Transformer ViT)的瓶颈问题。现有方法因 CNN 局部特征提取能力与 ViT 全局特征捕捉能力的局限性,难以进一步提升诊断准确率。解决方案的关键在于提出一种基于证据理论(Theory of Evidence)的新型特征融合范式,通过深度证据网络将不同骨干网络提取的特征转化为支持证据,并据此构建聚合意见,从而自适应地调整多骨干间的融合模式,实现性能增强与决策可解释性的双重提升。
链接: https://arxiv.org/abs/2510.26315
作者: Junlai Qiu,Yunzhu Chen,Hao Zheng,Yawen Huang,Yuexiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have been recently established based on convolutional neural networks (CNN) or vision Transformers (ViT). However, due to the respective shortcomings of CNN / ViT, the performance of existing methods using a single-type backbone has reached a bottleneck. One potential way to achieve further improvements is integrating different kinds of backbones, which can fully leverage their respective strengths (i.e., the local feature extraction capability of CNN and the global feature capturing ability of ViT). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidences via a set of deep evidential networks. With the supporting evidences, the aggregated opinion can be accordingly formed, which can be used to adaptively tune the fusion pattern between different backbones and accordingly boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading, compared to the state-of-the-art frameworks, but also provides excellent interpretability for feature fusion and decision-making.
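下面按证据深度学习中常见的做法(softplus 证据、Dirichlet 意见与约简 Dempster 组合规则)给出融合两路骨干输出的 PyTorch 示意;这是对“证据理论融合”的一种通用实现参考,并非论文的官方代码。

```python
import torch
import torch.nn.functional as F

def opinion(logits):
    """logits -> Dirichlet 意见:belief b 与不确定度 u(K 为类别数)。"""
    alpha = F.softplus(logits) + 1.0
    S = alpha.sum(-1, keepdim=True)
    return (alpha - 1.0) / S, logits.size(-1) / S

def fuse(logits_cnn, logits_vit):
    """按约简 Dempster 组合规则融合 CNN 与 ViT 两路意见(示意实现)。"""
    b1, u1 = opinion(logits_cnn)
    b2, u2 = opinion(logits_vit)
    # 冲突量 C = sum_{i != j} b1_i * b2_j
    conflict = (b1.unsqueeze(-1) * b2.unsqueeze(-2)).sum((-1, -2)) \
               - (b1 * b2).sum(-1)
    scale = 1.0 / (1.0 - conflict.unsqueeze(-1) + 1e-8)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)   # 融合后的 belief
    u = scale * (u1 * u2)                        # 融合后的不确定度
    return b, u

b, u = fuse(torch.randn(4, 5), torch.randn(4, 5))
```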
zh
[CV-35] Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG
【速读】:该论文旨在解决不同音乐类型对人类情绪影响的量化问题,即探究音乐 genres(如古典、摇滚、爵士等)如何通过诱发特定情绪状态来改变大脑活动模式。其解决方案的关键在于结合主观问卷调查与脑电图(EEG)信号采集技术,在多样化参与者群体中同步记录情绪反应与神经生理数据,并通过关系分析揭示情绪维度与脑电特征之间的关联性,从而为音乐干预情绪调节提供客观依据。
链接: https://arxiv.org/abs/2510.26304
作者: Jelizaveta Jankowska,Bożena Kostek,Fernando Alonso-Fernandez,Prayag Tiwari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Published at IWAIPR 2025 conference
Abstract:The subject of this work is to check how different types of music affect human emotions. While listening to music, a subjective survey and brain activity measurements were carried out using an EEG helmet. The aim is to demonstrate the impact of different music genres on emotions. The research involved a diverse group of participants of different gender and musical preferences. This had the effect of capturing a wide range of emotional responses to music. After the experiment, a relationship analysis of the respondents’ questionnaires with EEG signals was performed. The analysis revealed connections between emotions and observed brain activity.
zh
[CV-36] Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology
【速读】:该论文旨在解决敏捷地球观测卫星(Agile Earth Observation Satellites, AEOSs)星座在大规模场景、动态环境及严格约束条件下调度难题,现有方法因简化复杂性而难以满足实际应用需求。解决方案的关键在于提出一个统一框架,包含首个面向真实场景的大型基准数据集AEOS-Bench和基于Transformer架构的调度模型AEOS-Former:AEOS-Bench通过高保真仿真平台生成3,907颗卫星资产与16,410个含50–300个成像任务的场景,并提供精确标注;AEOS-Former则引入约束感知注意力机制与专用内部约束模块,显式建模卫星物理与运行限制,并通过基于仿真的迭代学习实现对多样化场景的鲁棒适应,从而显著提升任务完成率与能效表现。
链接: https://arxiv.org/abs/2510.26297
作者: Luting Wang,Yinghao Xiang,Hongliang Huang,Dongjun Li,Chen Gao,Si Liu
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented flexibility for monitoring the Earth’s surface, but their scheduling remains challenging under large-scale scenarios, dynamic environments, and stringent constraints. Existing methods often simplify these complexities, limiting their real-world performance. We address this gap with a unified framework integrating a standardized benchmark suite and a novel scheduling model. Our benchmark suite, AEOS-Bench, contains 3,907 finely tuned satellite assets and 16,410 scenarios. Each scenario features 1 to 50 satellites and 50 to 300 imaging tasks. These scenarios are generated via a high-fidelity simulation platform, ensuring realistic satellite behavior such as orbital dynamics and resource constraints. Ground truth scheduling annotations are provided for each scenario. To our knowledge, AEOS-Bench is the first large-scale benchmark suite tailored for realistic constellation scheduling. Building upon this benchmark, we introduce AEOS-Former, a Transformer-based scheduling model that incorporates a constraint-aware attention mechanism. A dedicated internal constraint module explicitly models the physical and operational limits of each satellite. Through simulation-based iterative learning, AEOS-Former adapts to diverse scenarios, offering a robust solution for AEOS constellation scheduling. Experimental results demonstrate that AEOS-Former outperforms baseline models in task completion and energy efficiency, with ablation studies highlighting the contribution of each component. Code and data are provided in this https URL.
zh
[CV-37] Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping
【速读】:该论文旨在解决基于眼周生物特征(periocular biometrics)的识别精度问题,尤其是在复杂、非受控环境下如何提升识别性能。其解决方案的关键在于采用大规模数据训练深度卷积神经网络(Convolutional Neural Network, CNN),具体使用来自VGGFace2数据库的190万张眼部区域图像进行模型训练,从而克服传统方法依赖小规模数据集(仅数千张图像)导致的泛化能力不足问题。实验表明,在受控条件下(如UFPR-Periocular数据集)可实现1-2%的等错误率(Equal Error Rate, EER),显著优于此前报道结果,验证了大规模预训练对提升眼周识别鲁棒性和准确性的有效性。
链接: https://arxiv.org/abs/2510.26294
作者: Fernando Alonso-Fernandez,Kevin Hernandez-Diaz,Jose Maria Buades Rubio,Josef Bigun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Published at IWAIPR 2025 conference
Abstract:We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular datasets for training having only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.
zh
[CV-38] Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶规划中模仿学习方法常出现的模式崩溃(mode collapse)问题,以及现有生成式方法难以直接将安全性和物理约束融入生成过程、需额外优化阶段才能修正输出的局限性。解决方案的关键在于提出CATG框架,其核心创新是将显式约束直接嵌入流匹配(Flow Matching)过程中,从而在生成轨迹时自动满足关键的安全与运动学规则,同时通过参数化驾驶激进程度作为控制信号,实现对轨迹风格的精确调控,显著提升了轨迹多样性与合规性。
链接: https://arxiv.org/abs/2510.26292
作者: Lin Liu,Guanyi Yu,Ziying Song,Junqiao Li,Caiyan Jia,Feiyang Jia,Peiliang Wu,Yandan Luo
机构: Beijing Jiaotong University (北京交通大学); Qcraft; Yanshan University (燕山大学); The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Planning is a critical component of end-to-end autonomous driving. However, prevailing imitation learning methods often suffer from mode collapse, failing to produce diverse trajectory hypotheses. Meanwhile, existing generative approaches struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. To address these limitations, we propose CATG, a novel planning framework that leverages Constrained Flow Matching. Concretely, CATG explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our primary contribution is the novel imposition of explicit constraints directly within the flow matching process, ensuring that the generated trajectories adhere to vital safety and kinematic rules. Secondly, CATG parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Notably, on the NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and was honored with the Innovation Award.
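把显式约束嵌入流匹配采样过程的一种直观做法是“每积分一步就把样本投影回可行集”;下面给出这一思路的玩具化 PyTorch 示意,速度场、Euler 积分与限速约束均为便于说明的假设,并非 CATG 的官方实现。

```python
import torch

def project_speed_limit(traj, v_max=15.0, dt=0.1):
    """示意性约束投影:限制相邻航点间的位移不超过 v_max*dt。
    traj: (T, 2) 的二维航点序列;约束形式为假设。"""
    out = traj.clone()
    for t in range(1, out.size(0)):
        step = out[t] - out[t - 1]
        speed = step.norm()
        if speed > v_max * dt:
            out[t] = out[t - 1] + step / speed * v_max * dt
    return out

def sample_trajectory(velocity_field, steps=10):
    """约束流匹配采样示意:Euler 积分一步后立即投影回可行集。"""
    x = torch.randn(20, 2)                      # 从噪声初始化 20 个航点
    for i in range(steps):
        t = torch.full((1,), i / steps)
        x = x + velocity_field(x, t) / steps    # 沿学习到的流场积分
        x = project_speed_limit(x)              # 显式约束直接作用于生成过程
    return x

traj = sample_trajectory(lambda x, t: torch.zeros_like(x))  # 零速度场占位
```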
zh
[CV-39] Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances
【速读】:该论文旨在解决远距离人眼区域(periocular)识别中不同卷积神经网络(Convolutional Neural Networks, CNNs)之间的互补性问题,以提升验证准确率。其关键解决方案在于:首先在大规模数据集VGGFace2上训练三种复杂度递增的CNN架构(SqueezeNet、MobileNetv2和ResNet50),随后通过余弦距离与卡方距离(chi2 metric)评估性能,并采用逻辑回归进行分数级融合(score-level fusion);同时利用LIME热图与Jensen-Shannon散度分析各模型注意力分布差异,发现不同网络关注图像的不同区域,从而解释其互补性。实验表明,尽管ResNet50单独表现最优,但三者融合后显著优于单一模型及先前方法,在UBIPr数据库上达到新的最先进水平。
链接: https://arxiv.org/abs/2510.26282
作者: Fernando Alonso-Fernandez,Kevin Hernandez Diaz,Jose M. Buades,Kiran Raja,Josef Bigun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Accepted at BIOSIG 2025 conference
Abstract:We study the complementarity of different CNNs for periocular verification at different distances on the UBIPr database. We train three architectures of increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics, compare different network initialisations, and apply score-level fusion via logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon divergence to compare attention patterns of the CNNs. While ResNet50 consistently performs best individually, the fusion provides substantial gains, especially when combining all three networks. Heatmaps show that networks usually focus on distinct regions of a given image, which explains their complementarity. Our method significantly outperforms previous works on UBIPr, achieving a new state-of-the-art.
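摘要中的分数级融合用逻辑回归把三种 CNN 的比对分数组合成单一判别分数;下面给出 scikit-learn 的最小示意,其中的分数分布为模拟数据。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 三列分别对应 SqueezeNet / MobileNetv2 / ResNet50 的比对分数(模拟数据)
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, (200, 3))    # 同一人比对分数
impostor = rng.normal(0.3, 0.1, (200, 3))   # 不同人比对分数
scores = np.vstack([genuine, impostor])
labels = np.r_[np.ones(200), np.zeros(200)]

fusion = LogisticRegression().fit(scores, labels)
fused = fusion.predict_proba(scores)[:, 1]   # 融合后的单一匹配分数
```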
zh
[CV-40] Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws NEURIPS2025
【速读】:该论文旨在解决红外与可见光图像融合方法中模态信息平衡困难、生成能力受限以及模态选择缺乏可解释性的问题,这些问题在复杂场景下显著影响融合结果的可靠性与一致性。解决方案的关键在于借鉴人类认知规律,提出一种名为HCLFuse的新方法:首先设计了多尺度掩码调控的变分瓶颈编码器(multi-scale mask-regulated variational bottleneck encoder),通过后验概率建模与信息分解实现低层次模态信息的精准提取;其次将扩散模型的概率生成能力与物理规律结合,构建时变物理引导机制(time-varying physical guidance mechanism),在不同生成阶段自适应调节过程,增强对数据内在结构的感知能力并降低对数据质量的依赖。该方案在多个数据集上实现了最优的定性和定量融合性能,并显著提升语义分割指标,验证了其在结构一致性和细节保真度方面的优势。
链接: https://arxiv.org/abs/2510.26268
作者: Lin Guo,Xiaoqing Luo,Wei Xie,Zhancheng Zhang,Hui Li,Rui Wang,Zhenhua Feng,Xiaoning Song
机构: Jiangnan University (江南大学); Suzhou University of Science and Technology (苏州科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  NeurIPS 2025 spotlight
Abstract:Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi-scale mask-regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low-level modal information, thereby supporting the generation of high-fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time-varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state-of-the-art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.
zh
[CV-41] OmniLayout: Enabling Coarse-to-Fine Learning with LLM s for Universal Document Layout Generation
【速读】:该论文旨在解决文档布局生成(Document Layout Generation, DLG)领域中因布局多样性不足和复杂场景下长序列协同排布能力弱而导致的性能瓶颈问题。现有研究多集中于具有曼哈顿结构(Manhattan-style)的学术论文,而对报纸、杂志等开放世界文档类型覆盖严重不足;同时,现有方法在处理复杂文档时难以保持整体结构一致性。解决方案的关键在于构建首个百万级多样化文档布局数据集OmniLayout-1M,并提出基于两阶段粗粒度到细粒度学习范式的OmniLayout-LLM模型(0.5B参数量),首先在大规模通用布局上学习普适性布局原则,再通过细粒度标注实现特定领域的知识迁移,从而显著提升跨域布局生成能力,在M⁶ Doc数据集上优于现有布局生成专家及多个主流通用大语言模型(Large Language Models, LLMs)。
链接: https://arxiv.org/abs/2510.26213
作者: Hengrui Kang,Zhuangcheng Gu,Zhiyuan Zhao,Zichen Wen,Bin Wang,Weijia Li,Conghui He
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  TL;DR: With OmniLayout-1M dataset and LLM-based coarse-to-fine learning, we enable universal and diverse document layout generation
Abstract:Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with a designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in the M⁶Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. Our code, models, and dataset will be publicly released.
zh
[CV-42] Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management
【速读】:该论文旨在解决供应链可持续性与绩效优化中的关键挑战,特别是风险管理和产品分类的准确性问题。其核心问题是:如何有效识别供应链中的潜在风险并提升产品及关系分类的精度,从而增强供应链网络的整体韧性与效率。解决方案的关键在于提出一种新型的Chebyshev集成几何深度网络(Ch-EGN),该模型融合了卷积神经网络与几何深度学习的优势,能够挖掘供应链数据中复杂的依赖关系,并推断样本的隐状态信息。实验结果表明,该方法在风险预测、产品分组和供应链节点关系分类任务上分别实现了98.95%、100%和98.07%的平均准确率,显著优于现有主流方法。
链接: https://arxiv.org/abs/2510.26203
作者: Mehdi Khaleghi,Nastaran Khaleghi,Sobhan Sheykhivand,Sebelan Danishvar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The sustainability of a supply chain plays a key role in achieving optimal performance in controlling it. Managing the risks that occur in a supply chain is a fundamental problem for developing the sustainability of the network and elevating the performance efficiency of the supply chain. The correct classification of products is another essential element in a sustainable supply chain. Following recent breakthroughs in deep networks, several architectural options have been deployed to analyze supply chain datasets. A novel geometric deep network is used to propose an ensemble deep network. The proposed Chebyshev ensemble geometric network (Ch-EGN) is a hybrid of convolutional and geometric deep learning. This network is proposed to leverage the information dependencies in the supply chain to derive hidden states of samples in the database. The functionality of the proposed deep network is assessed on two different databases. The SupplyGraph Dataset and DataCo are considered in this research. The prediction of the delivery status of the DataCo supply chain is done for risk administration. Product classification and edge classification are performed using the SupplyGraph database to enhance the sustainability of the supply network. An average accuracy of 98.95% is obtained for the ensemble network for risk management. Average accuracies of 100% and 98.07% are obtained for the sustainable supply chain in terms of 5-class product group classification and 4-class product relation classification, respectively. An average accuracy of 92.37% is attained for 25-class company relation classification. The results confirm the average improvement and efficiency of the proposed method compared to state-of-the-art approaches.
zh
[CV-43] Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction SIGGRAPH
【速读】:该论文旨在解决从草图(sketch)中估计三维人体姿态(3D human pose estimation)的问题,这一任务因草图的抽象性和比例失真特性而极具挑战性。此前的方法受限于缺乏大规模的草图-3D姿态标注数据集,主要依赖启发式规则优化,存在效率低且泛化能力差的缺陷。其解决方案的关键在于提出一种“从合成中学习”(learn from synthesis)策略:首先训练一个扩散模型(diffusion model),从投影自3D人体姿态的2D姿态生成模拟草图图像,从而构建包含120k对标注数据的合成数据集SKEP-120K;在此基础上,设计了一个端到端的数据驱动框架,融合2D姿态检测器与生成式扩散先验进行特征提取,并采用前馈神经网络实现高效2D姿态估计,同时引入多启发式损失函数确保3D姿态与2D检测结果之间的几何一致性及自接触准确性。
链接: https://arxiv.org/abs/2510.26196
作者: Li Wang,Yiyu Zhuang,Yanwen Wang,Xun Cao,Chuan Guo,Xinxin Zuo,Hao Zhu
机构: Nanjing University (南京大学); Snap Inc. (Snap Inc.); Concordia University (康考迪亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  SIGGRAPH Asia 2025
Abstract:3D human pose estimation from sketches has broad applications in computer animation and film production. Unlike traditional human pose estimation, this task presents unique challenges due to the abstract and disproportionate nature of sketches. Previous sketch-to-pose methods, constrained by the lack of large-scale sketch-3D pose annotations, primarily relied on optimization with heuristic rules-an approach that is both time-consuming and limited in generalizability. To address these challenges, we propose a novel approach leveraging a “learn from synthesis” strategy. First, a diffusion model is trained to synthesize sketch images from 2D poses projected from 3D human poses, mimicking disproportionate human structures in sketches. This process enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k accurate sketch-3D pose annotation pairs across various sketch styles. Building on this synthetic dataset, we introduce an end-to-end data-driven framework for estimating human poses and shapes from diverse sketch styles. Our framework combines existing 2D pose detectors and generative diffusion priors for sketch feature extraction with a feed-forward neural network for efficient 2D pose estimation. Multiple heuristic loss functions are incorporated to guarantee geometric coherence between the derived 3D poses and the detected 2D poses while preserving accurate self-contacts. Qualitative, quantitative, and subjective evaluations collectively show that our model substantially surpasses previous ones in both estimation accuracy and speed for sketch-to-pose tasks.
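摘要提到用多种启发式损失保证 3D 姿态与检测到的 2D 姿态之间的几何一致性;下面给出其中最常见的一项(弱透视投影下的 2D 重投影损失)的最小 PyTorch 示意,相机参数化方式为假设,并非论文的原始损失定义。

```python
import torch

def reprojection_loss(joints3d, joints2d, scale, trans):
    """弱透视投影下的 2D 重投影一致性损失(示意)。
    joints3d: (J, 3) 预测的 3D 关节;joints2d: (J, 2) 检测到的 2D 关键点;
    scale / trans: 弱透视相机的缩放与平移参数(假设)。"""
    proj = scale * joints3d[:, :2] + trans   # 弱透视:正交投影后缩放平移
    return (proj - joints2d).abs().mean()

loss = reprojection_loss(torch.randn(17, 3), torch.randn(17, 2),
                         torch.tensor(1.0), torch.zeros(2))
```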
zh
[CV-44] ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts NEURIPS2025
【速读】:该论文旨在解决机器学习数据集中普遍存在但难以系统识别的**数据偏差(dataset bias)**问题,尤其在缺乏细粒度属性标注的情况下。其解决方案的关键在于提出了一种可扩展且自动化的框架——ConceptScope,该框架利用在视觉基础模型(vision foundation models)表示上训练的稀疏自编码器(Sparse Autoencoders)来发现并量化人类可解释的视觉概念(visual concepts),并将这些概念按语义相关性和与类别标签的统计相关性划分为目标(target)、上下文(context)和偏差(bias)三类。通过基于概念的子组划分,ConceptScope实现了对数据集的层级化表征、偏差检测和模型鲁棒性评估,从而为数据审计和模型诊断提供了实用工具。
链接: https://arxiv.org/abs/2510.26186
作者: Jinho Choi,Hyesu Lim,Steffen Schneider,Jaegul Choo
机构: KAIST AI; Helmholtz Munich; Munich Center for Machine Learning (MCML)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:  Published in the Thirty-Ninth Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
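ConceptScope 依赖在视觉基础模型表示上训练的稀疏自编码器来抽取可解释概念;下面是带 L1 稀疏正则的 SAE 最小 PyTorch 示意,维度与稀疏系数均为假设超参数。

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """在冻结的基础模型特征上训练的稀疏自编码器(示意)。"""
    def __init__(self, d_in=768, d_hidden=8192):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))   # 非负稀疏激活,每一维对应一个候选概念
        return self.dec(z), z

sae = SparseAutoencoder()
x = torch.randn(32, 768)              # 假设为视觉基础模型输出的特征
recon, z = sae(x)
loss = (recon - x).pow(2).mean() + 1e-3 * z.abs().mean()  # 重构 + L1 稀疏
loss.backward()
```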
zh
[CV-45] MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models
【速读】:该论文旨在解决从单张运动模糊图像中估计高分辨率(High-Resolution, HR)运动轨迹的问题,现有方法通常生成粗粒度且不准确的运动表示,如模糊核(blur kernel)或光流(optical flow)。解决方案的关键在于提出首个基于扩散模型(Diffusion models)的高分辨率运动轨迹估计框架 MoTDiff,其核心创新包括:1)设计了一种新的条件扩散框架,利用单张模糊图像提取的多尺度特征图作为条件输入;2)提出一种新型训练策略,能够促进对细粒度运动轨迹的精确识别、运动路径整体形状与位置的一致性估计,以及轨迹像素间的连通性保持。实验表明,MoTDiff 在盲图像去模糊和编码曝光摄影等任务中均优于当前最优方法。
链接: https://arxiv.org/abs/2510.26173
作者: Wontae Choi,Jaelin Lee,Hyung Sup Yun,Byeungwoo Jeon,Il Yong Chun
机构: Sungkyunkwan University (成均馆大学); ALLforLAND Co., Ltd. (ALLforLAND有限公司); Institute for Basic Science (基础科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  10 pages, 6 figures
Abstract:Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Different from existing motion representations, we aim to estimate a high-quality HR motion trajectory from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that can promote precise identification of a fine-grained motion trajectory, consistent estimation of overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff can outperform state-of-the-art methods in both blind image deblurring and coded exposure photography applications.
zh
[CV-46] Self-localization on a 3D map by fusing global and local features from a monocular camera
【速读】:该论文旨在解决基于单目摄像头的3D地图自定位(self-localization)问题,特别是在存在动态障碍物(如行人)时传统卷积神经网络(Convolutional Neural Network, CNN)性能下降的问题。其解决方案的关键在于将CNN与视觉Transformer(Vision Transformer)相结合,利用CNN提取局部特征、Transformer提取全局特征的能力,从而更准确地建模图像中不同区域之间的长距离依赖关系,提升在复杂场景下的定位精度。实验表明,该方法在含动态障碍物的合成数据集上相比当前最优(State-of-the-Art, SOTA)方法的准确率提升达1.5倍,且在公开数据集上的定位误差减少20.1%,机器人实测平均定位误差为7.51cm,优于SOTA。
链接: https://arxiv.org/abs/2510.26170
作者: Satoshi Kikuch,Masaya Kato,Tsuyoshi Tasaki
机构: Meijo University (明治大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-localization on a 3D map by using an inexpensive monocular camera is required to realize autonomous driving. Camera-based self-localization often uses a convolutional neural network (CNN), which extracts local features computed from nearby pixels. However, when dynamic obstacles, such as people, are present, a CNN does not work well. This study proposes a new method combining a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships among patches across the whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate on a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51cm error on average, which is more accurate than SOTA.
zh
[CV-47] CRAG -MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
【速读】:该论文旨在解决当前多模态检索增强生成(Multi-Modal Retrieval-Augmented Generation, MM-RAG)任务缺乏全面基准评测的问题,尤其是在可穿戴设备场景下的真实应用挑战。其关键解决方案是提出了CRAG-MM——一个面向多模态多轮对话的综合性RAG基准,包含6.5K个(图像、问题、答案)三元组和2K条基于视觉的多轮对话,覆盖13个领域,并特别设计了6.2K张模拟可穿戴设备拍摄的自拍视角图像。该基准系统性地引入了五类图像质量缺陷、六种问题类型、不同实体流行度、信息动态性差异及多轮对话长度变化等复杂因素,同时构建了三项核心任务:单源增强、多源增强与多轮对话,并配套图像-知识图谱(image-KG)与网页检索API,为MM-RAG模型提供标准化评估平台。实验表明,现有方法在单轮和多轮问答中仅达到约32%和43%的真实性得分,凸显了该领域的巨大改进空间。
链接: https://arxiv.org/abs/2510.26160
作者: Jiaqi Wang,Xiao Yang,Kai Sun,Parth Suresh,Sanat Sharma,Adam Czyzewski,Derek Andersen,Surya Appini,Arkav Banerjee,Sajal Choudhary,Shervin Ghasemlou,Ziqiang Guan,Akil Iyer,Haidar Khan,Lingkun Kong,Roy Luo,Tiffany Ma,Zhen Qiao,David Tran,Wenfang Xu,Skyler Yeatman,Chen Zhou,Gunveer Gujral,Yinglong Xia,Shane Moon,Nicolas Scheffer,Nirav Shah,Eun Chang,Yue Liu,Florian Metze,Tammy Stark,Zhaleh Feizollahi,Andrea Jessee,Mangesh Pujari,Ahmed Aly,Babak Damavandi,Rakesh Wanga,Anuj Kumar,Rohit Patel,Wen-tau Yih,Xin Luna Dong
机构: Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM – a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations – each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
zh
[CV-48] Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh
【速读】:该论文旨在解决城市交通监控中自动识别电动三轮车(auto-rickshaw)的难题,因其与非机动三轮车(non-auto rickshaw)外观相似,且现有监控系统难以区分,而人工视频分析效率低下。解决方案的关键在于构建一个基于YOLOv8的实时目标检测模型,利用包含1,730张标注图像的公开数据集进行训练,该模型在复杂交通场景下表现出色,实现了83.447%的mAP50指标以及超过78%的二分类精确率和召回率,有效支持了对电动三轮车的自动化监测。
链接: https://arxiv.org/abs/2510.26154
作者: Sudipto Das Sukanto,Diponker Roy,Fahim Shakil,Nirjhar Singha,Abdullah Asik,Aniket Joarder,Mridha Md Nafis Fuad,Muhammad Ibrahim
机构: Dhaka University of Engineering & Technology (DUET); Islamic University of Technology (IUT); Bangladesh University of Engineering and Technology (BUET)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  16 pages
Abstract:Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws, whereas manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. The system performs real-time object detection with the YOLOv8 model. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
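复现同类检测流程可以直接使用 ultralytics 提供的 YOLOv8 接口;下面是训练与推理的最小示意,其中数据配置文件与图片路径均为假设占位。

```python
from ultralytics import YOLO

# 加载预训练权重,并在自建的三轮车标注集上微调
model = YOLO("yolov8n.pt")
model.train(data="rickshaw.yaml", epochs=50, imgsz=640)  # rickshaw.yaml 为假设的数据配置

# 推理:输出每张图中检测到的类别索引与置信度
results = model("traffic_scene.jpg")                     # 假设的测试图片路径
for r in results:
    print(r.boxes.cls, r.boxes.conf)
```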
zh
[CV-49] MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction ICCV2025
【速读】:该论文旨在解决乳腺癌检测与风险预测中高质量标注数据稀缺的问题,即获取细粒度标注的医学影像数据成本高且耗时长。解决方案的关键在于提出一种多视角乳腺X线摄影与语言模型(Multi-View Mammography and Language Model, MV-MLM),该模型基于配对的乳腺X线图像与合成放射科报告进行训练,利用跨模态自监督学习策略,在多个视图和对应的伪放射科报告之间建立关联,从而学习到丰富的多模态表征。此方法显著提升了模型在不同任务上的泛化能力和准确性,尤其在恶性肿瘤分类、亚型分类及基于图像的癌症风险预测三个任务中达到当前最优性能,并展现出优异的数据效率——仅需合成文本报告即可超越传统全监督或视觉-语言模型(VLM)基线,无需真实放射科报告。
链接: https://arxiv.org/abs/2510.26151
作者: Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba
机构: LMU University Hospital, LMU Munich, Germany; Lunit Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:  Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
Abstract:Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-detailed annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy performance over different data types and tasks to distinguish breast tissues or cancer characteristics (calcification, mass) and utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
zh
[CV-50] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
【速读】:该论文旨在解决任意尺度视频超分辨率(Arbitrary-scale Video Super-Resolution, AVSR)任务中的核心挑战,包括空间细节恢复、时序一致性保持以及计算复杂度控制。其解决方案的关键在于提出一个强基线模型 BasicAVSR,集成四大核心组件:1)基于图像拉普拉斯金字塔(Laplacian pyramid)生成的自适应多尺度频率先验,用于增强高频细节;2)流引导传播单元(flow-guided propagation unit),聚合相邻帧的时空信息;3)二阶运动补偿单元(second-order motion compensation unit),实现更精确的帧间空间对齐;4)超上采样单元(hyper-upsampling unit),生成与尺度相关且内容无关的上采样核。此外,为适配不同应用场景,设计三种传播变体(单向RNN、带有限前瞻的单向RNN、双向RNN),从而在在线推理、低延迟和离线处理等场景中均表现出优越性能。
链接: https://arxiv.org/abs/2510.26149
作者: Wei Shang,Wanying Zhang,Shuhang Gu,Pengfei Zhu,Qinghua Hu,Dongwei Ren
机构: Harbin Institute of Technology (哈尔滨工业大学); University of Electronic Science and Technology of China (电子科技大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  13 pages, 10 figures, 5 tables
Abstract:Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at this https URL.
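BasicAVSR 的多尺度频率先验来自图像拉普拉斯金字塔;下面用 OpenCV 给出标准金字塔构建的示意,每层残差即对应一个尺度上的高频细节。

```python
import cv2
import numpy as np

def laplacian_pyramid(img: np.ndarray, levels: int = 3):
    """构建拉普拉斯金字塔:每层为当前尺度图像与其下采样-上采样重建之差。"""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)      # 当前尺度的高频残差
        cur = down
    pyr.append(cur)               # 最底层的低频图像
    return pyr

img = cv2.imread("frame.png")     # 假设的输入帧路径
if img is not None:
    priors = laplacian_pyramid(img)
```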
zh
[CV-51] StructLayoutFormer: Conditional Structured Layout Generation via Structure Serialization and Disentanglement
【速读】:该论文旨在解决现有数据驱动布局生成方法难以生成结构化布局的问题,尤其是无法显式控制和生成布局结构(layout structure)的局限性。其关键解决方案是提出一种基于Transformer的结构化布局生成模型StructLayoutFormer,通过引入结构序列化方案将布局表示为序列,并将结构信息与元素位置进行解耦,从而实现条件化的结构布局生成。该方法首次在数据驱动框架下实现了显式生成真实布局结构的能力,显著优于传统方法在结构可控性和生成质量上的表现。
链接: https://arxiv.org/abs/2510.26141
作者: Xin Hu,Pengfei Xu,Jin Zhou,Hongbo Fu,Hui Huang
机构: Shenzhen University (深圳大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Structured layouts are preferable in many 2D visual contents (e.g., GUIs, webpages) since the structural information allows convenient layout editing. Computational frameworks can help create structured layouts but require heavy labor input. Existing data-driven approaches are effective in automatically generating fixed layouts but fail to produce layout structures. We present StructLayoutFormer, a novel Transformer-based approach for conditional structured layout generation. We use a structure serialization scheme to represent structured layouts as sequences. To better control the structures of generated layouts, we disentangle the structural information from the element placements. Our approach is the first data-driven approach that achieves conditional structured layout generation and produces realistic layout structures explicitly. We compare our approach with existing data-driven layout generation approaches by including post-processing for structure extraction. Extensive experiments have shown that our approach exceeds these baselines in conditional structured layout generation. We also demonstrate that our approach is effective in extracting and transferring layout structures. The code is publicly available at this https URL.
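下面给出“把嵌套布局树展平为 token 序列”的一种示意方案:用成对括号编码结构、用数字 token 编码元素位置,两者在序列中分开出现,便于结构与位置的解耦。括号与字段格式均为假设,并非论文的原始序列化方案。

```python
def serialize(node):
    """递归地把布局树序列化为 token 序列。
    节点形如 {"type": ..., "bbox": [x, y, w, h], "children": [...]}。"""
    tokens = ["<", node["type"]]
    if "bbox" in node:                        # 位置信息与结构信息分开编码
        tokens += [str(v) for v in node["bbox"]]
    for child in node.get("children", []):
        tokens += serialize(child)
    tokens.append(">")
    return tokens

layout = {"type": "page", "children": [
    {"type": "header", "bbox": [0, 0, 100, 10]},
    {"type": "body", "bbox": [0, 10, 100, 80], "children": [
        {"type": "text", "bbox": [5, 15, 40, 70]}]}]}
print(" ".join(serialize(layout)))
# < page < header 0 0 100 10 > < body 0 10 100 80 < text 5 15 40 70 > > >
```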
zh
[CV-52] FullPart: Generating each 3D Part at Full Resolution
【速读】: This paper addresses two shortcomings of existing part-based 3D generation methods: insufficient geometric detail and limited resolution for small parts. Prior methods that represent parts with implicit vector sets struggle to capture fine geometry, while explicit voxel representations that share one global low-resolution grid leave small parts with too few voxels, degrading quality. The key to the proposed FullPart framework is to fuse the implicit and explicit paradigms: it first generates the part bounding-box layout through an implicit box vector-set diffusion process, exploiting the effectiveness of implicit diffusion on low-detail tasks, and then generates each part's details in its own dedicated full-resolution voxel grid, so that even small parts are synthesized at full resolution with intricate structure. A center-point encoding strategy further mitigates misalignment when exchanging information between parts of different actual sizes, preserving global coherence.
链接: https://arxiv.org/abs/2510.26140
作者: Lihe Ding,Shaocong Dong,Yaokun Li,Chenjian Gao,Xiao Chen,Rui Han,Yihao Kuang,Hong Zhang,Bo Huang,Zhanpeng Huang,Zibin Wang,Dan Xu,Tianfan Xue
机构: CUHK(香港中文大学); HKUST(香港科技大学); SenseTime Research(商汤科技研究院); Chongqing University(重庆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Project page: this https URL
Abstract:Part-based 3D generation holds great potential for various applications. Previous part generators that represent parts using implicit vector-set tokens often suffer from insufficient geometric details. Another line of work adopts an explicit voxel representation but shares a global voxel grid among all parts; this often causes small parts to occupy too few voxels, leading to degraded quality. In this paper, we propose FullPart, a novel framework that combines both implicit and explicit paradigms. It first derives the bounding box layout through an implicit box vector-set diffusion process, a task that implicit diffusion handles effectively since box tokens contain little geometric detail. Then, it generates detailed parts, each within its own fixed full-resolution voxel grid. Instead of sharing a global low-resolution space, each part in our method - even small ones - is generated at full resolution, enabling the synthesis of intricate details. We further introduce a center-point encoding strategy to address the misalignment issue when exchanging information between parts of different actual sizes, thereby maintaining global coherence. Moreover, to tackle the scarcity of reliable part data, we present PartVerse-XL, the largest human-annotated 3D part dataset to date with 40K objects and 320K parts. Extensive experiments demonstrate that FullPart achieves state-of-the-art results in 3D part generation. We will release all code, data, and model to benefit future research in 3D part generation.
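The following NumPy sketch illustrates the per-part full-resolution idea and the role of a center-point-style encoding; the normalization details are assumptions for illustration, not the paper's implementation.
```python
import numpy as np

# Every part, however small, gets its own full-resolution voxel grid, while
# (center, scale) records where the part sits in the global frame, letting
# parts of very different sizes exchange information coherently.
def voxelize_part(points, grid=64):
    """Scale one part's point cloud into its own full-resolution grid."""
    lo, hi = points.min(0), points.max(0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max() + 1e-8            # longest side of the part's box
    local = (points - center) / scale + 0.5   # normalize into [0, 1]^3
    idx = np.clip((local * (grid - 1)).astype(int), 0, grid - 1)
    vox = np.zeros((grid,) * 3, dtype=np.float32)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vox, center, scale                 # center/scale: the global anchor

tiny_part = np.random.rand(500, 3) * 0.05 + 0.9   # a small part near a corner
vox, center, scale = voxelize_part(tiny_part)
print(vox.shape, vox.sum() > 0, center.round(2), round(float(scale), 3))
```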
zh
[CV-53] Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
【速读】: This paper addresses the problem that feature representations in RGB-D indoor simultaneous localization and mapping (SLAM) are not semantic enough to associate information across frames accurately. The key to the solution is to fuse layer-wise attention information derived from network gradients with convolutional neural network (CNN) feature representations, introducing task-specific spatial attention into the visual features. This makes the system aware of the locations of semantic objects in the scene and improves frame association performance, especially in large environments.
链接: https://arxiv.org/abs/2510.26131
作者: Ali Caglayan,Nevrez Imamoglu,Oguzhan Guclu,Ali Osman Serhatoglu,Ahmet Burak Can,Ryosuke Nakamura
机构: National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所); Sahibinden; Hacettepe University (哈切特佩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:  double-column 5 pages, 3 figures
Abstract:Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.
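A minimal PyTorch sketch of the underlying idea follows, fusing a Grad-CAM-style gradient attention map with CNN features to build frame descriptors; the backbone and the matching score are stand-ins, not the paper's pipeline.
```python
import torch
import torch.nn.functional as F

# Gradient-derived attention reweights CNN features before frame matching.
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(8),
)
head = torch.nn.Linear(16 * 8 * 8, 10)

def attentive_descriptor(img):
    feat = backbone(img)                           # (1, 16, 8, 8)
    feat.retain_grad()
    score = head(feat.flatten(1)).max()            # top-class logit
    score.backward()
    w = feat.grad.mean(dim=(2, 3), keepdim=True)   # channel weights (Grad-CAM)
    cam = F.relu((w * feat).sum(1, keepdim=True))  # (1, 1, 8, 8) attention map
    cam = cam / (cam.max() + 1e-8)
    return (feat * cam).flatten(1).detach()        # attention-weighted features

a = attentive_descriptor(torch.rand(1, 3, 64, 64))
b = attentive_descriptor(torch.rand(1, 3, 64, 64))
print(float(F.cosine_similarity(a, b)))            # frame-association score
```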
zh
[CV-54] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
【速读】: This paper addresses two core issues in evaluating end-to-end (E2E) autonomous driving: existing benchmark datasets focus on nominal scenarios and thus fail to probe generalization to rare long-tail situations, and conventional open-loop metrics cannot capture the multi-modal nature of driving, particularly in low-frequency complex scenarios. The key to the solution is a high-quality dataset dedicated to long-tail scenarios, the Waymo Open Dataset for End-to-End Driving (WOD-E2E), containing 4,021 driving segments (about 12 hours) annotated with high-level routing information, ego states, and 360-degree views from 8 surrounding cameras, together with a new open-loop metric, the Rater Feedback Score (RFS). RFS uses human-annotated trajectory preference labels to measure how well predicted trajectories agree with expert judgment, reflecting the decision quality of E2E models in extreme scenarios more faithfully.
链接: https://arxiv.org/abs/2510.26125
作者: Runsheng Xu,Hubert Lin,Wonseok Jeon,Hao Feng,Yuliang Zou,Liting Sun,John Gorman,Kate Tolstaya,Sarah Tang,Brandyn White,Ben Sapp,Mingxing Tan,Jyh-Jing Hwang,Drago Anguelov
机构: Waymo LLC
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, with an occurrence frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
zh
[CV-55] JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting
【速读】: This paper addresses the dependence of traditional novel view synthesis on pre-calibrated camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. The key to the solution is a unified framework that jointly optimizes 3D Gaussians and camera poses, enabling end-to-end reconstruction without pre-calibrated inputs. Its core innovation is to decompose the joint optimization into two alternating phases: 3D Gaussian parameters are first updated via differentiable rendering with camera poses fixed, and the poses are then refined with a customized 3D optical-flow algorithm combining geometric and photometric constraints. Projection error is thus reduced progressively, with particular advantages in scenarios with large viewpoint changes and sparse feature distributions where traditional methods struggle.
链接: https://arxiv.org/abs/2510.26117
作者: Yuxuan Li,Tao Wang,Xianben Yang
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.
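The alternating scheme can be illustrated with a deliberately tiny toy problem; real JOGS optimizes Gaussian parameters under differentiable rendering and refines 6-DoF poses with a 3D optical-flow objective, whereas here both are reduced to 2D translations.
```python
import torch

# Phase A updates scene parameters with poses frozen; phase B refines poses
# with the scene frozen. A "camera pose" here is just a 2D translation.
torch.manual_seed(0)
true_pts = torch.randn(50, 2)
true_pose = torch.tensor([0.5, -0.3])
obs = true_pts + true_pose                      # "rendered" observations

pts = torch.randn(50, 2, requires_grad=True)    # stand-in for 3D Gaussians
pose = torch.zeros(2, requires_grad=True)       # stand-in for camera pose
opt_pts = torch.optim.Adam([pts], lr=0.05)
opt_pose = torch.optim.Adam([pose], lr=0.05)

for step in range(400):
    # Phase A: differentiable "rendering" loss w.r.t. the scene, pose fixed
    opt_pts.zero_grad()
    lossA = ((pts + pose.detach() - obs) ** 2).mean()
    lossA.backward(); opt_pts.step()
    # Phase B: refine the pose, scene fixed (the paper uses a 3D-flow
    # objective with geometric + photometric terms; plain reprojection here)
    opt_pose.zero_grad()
    lossB = ((pts.detach() + pose - obs) ** 2).mean()
    lossB.backward(); opt_pose.step()

print(f"residual: {((pts + pose - obs) ** 2).mean():.2e}")
```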
zh
[CV-56] OracleAgent : A Multimodal Reasoning Agent for Oracle Bone Script Research
【速读】: This paper tackles two core problems in Oracle Bone Script (OBS) research: the interpretation workflow is complex, involving multiple serial and parallel sub-tasks, and the organization and retrieval of OBS information is inefficient, forcing scholars to spend large amounts of time finding, compiling, and managing resources. The key to the solution is OracleAgent, the first intelligent agent system for OBS, which integrates multiple tool modules powered by large language models (LLMs) and builds a domain-specific multimodal knowledge base with over 1.4M single-character rubbing images and 80K interpretation texts, enabling efficient retrieval of characters, documents, interpretation texts, and rubbing images as well as multimodal reasoning. The system surpasses mainstream multimodal large language models (MLLMs) on several tasks and markedly improves the efficiency of OBS research, advancing the practical deployment of OBS-assisted research and automated interpretation systems.
链接: https://arxiv.org/abs/2510.26114
作者: Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,Xu Peng,Taisong Jin,Yongge Liu,Shengwei Han,Jing Yang,Xiaoping He,Feng Gao,AndyPian Wu,SevenShu,Chaoyang Wang,Chengjie Wang
机构: Xiamen University (厦门大学); Anyang Normal University (安阳师范学院); Tencent YouTu Lab (腾讯优图实验室); Tencent SSV (腾讯SSV)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieving characters, documents, interpretation texts, and rubbing images. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.
zh
[CV-57] EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
【速读】: This paper asks whether Video-LLMs can maintain consistent temporal understanding across multi-view videos, i.e., whether a model judges event order and key segments consistently when the same event is filmed from egocentric and exocentric viewpoints. Existing methods perform poorly in cross-view settings: consistency falls far below single-view performance, and naive fine-tuning on synchronized dual-view videos still fails to lift overall results. The key to the solution is View-GRPO, a novel reinforcement learning framework that strengthens viewpoint-specific temporal reasoning while promoting consistent understanding across views, improving cross-view temporal consistency more effectively than supervised fine-tuning (SFT) and vanilla GRPO.
链接: https://arxiv.org/abs/2510.26113
作者: Minjoon Jung,Junbin Xiao,Junghyun Kim,Byoung-Tak Zhang,Angela Yao
机构: Seoul National University (首尔国立大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:  project page: this https URL
Abstract:Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
zh
[CV-58] Security Risk of Misalignment between Text and Image in Multi-modal Model
【速读】: This paper investigates the security risks caused by insufficient alignment between the text and image modalities in multi-modal diffusion models, particularly the vulnerability of generating inappropriate or Not-Safe-For-Work (NSFW) content. The key to the solution is a new attack, the Prompt-Restricted Multi-modal Attack (PReMA), which manipulates the model output by modifying only the input image, without changing the original prompt, thus effectively misleading generation in fixed-prompt settings. PReMA is the first attack to manipulate multi-modal diffusion model outputs through adversarial images alone, clearly distinct from prior work that mainly crafts adversarial prompts, and it exposes a new security threat for image-editing applications.
链接: https://arxiv.org/abs/2510.26105
作者: Xiaosen Wang,Zhijin Ge,Shaokang Wang
机构: Huazhong University of Science and Technology (华中科技大学); Xidian University (西安电子科技大学); Shanghai Jiaotong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
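Generically, the attack pattern is a bounded image-space optimization with the prompt held fixed; the sketch below uses a stand-in generator and a plain PGD loop, which is an assumption about the optimization style rather than PReMA's exact procedure.
```python
import torch, torch.nn.functional as F

# Image-only attack sketch: the prompt (and generator) stays fixed; only the
# input image receives a bounded perturbation steering the output toward an
# attacker-chosen target. The generator is a stand-in, not a real editor.
torch.manual_seed(0)
generator = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for edit(prompt, x)
x = torch.rand(1, 3, 64, 64)                      # clean input image
target = torch.rand(1, 3, 64, 64)                 # attacker-chosen output
eps, alpha, steps = 8 / 255, 2 / 255, 40

delta = torch.zeros_like(x, requires_grad=True)
for _ in range(steps):
    out = generator(x + delta)                    # prompt is untouched
    loss = F.mse_loss(out, target)                # pull output toward target
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()        # descend toward the target
        delta.clamp_(-eps, eps)                   # keep perturbation bounded
        delta.copy_((x + delta).clamp(0, 1) - x)  # keep the image valid
    delta.grad.zero_()

print(f"final objective: {F.mse_loss(generator(x + delta), target):.4f}")
```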
zh
[CV-59] Dynamic VLM-Guided Negative Prompting for Diffusion Models NEURIPS2025
【速读】: This paper addresses the problem that traditional static negative prompting in diffusion models cannot adapt to the semantic dynamics of the generation process, causing deviations or inconsistencies between the generated image and the text description. The key to the solution is to introduce a Vision-Language Model (VLM): intermediate image predictions are generated at specific steps of the denoising process, and the VLM dynamically produces contextually relevant negative prompts from them, achieving more precise semantic constraints and text-image alignment.
链接: https://arxiv.org/abs/2510.26052
作者: Hoyeon Chang,Seungjin Kim,Yoonseok Choi
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:  39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The First Workshop on Generative and Protective AI for Content Creation
Abstract:We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
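The control flow is easy to sketch; `predict_x0`, `denoise_step`, and `ask_vlm` below are hypothetical stand-ins stubbed out so the loop runs, not a real diffusion or VLM API.
```python
import random

# At chosen denoising steps, decode the current x0-prediction and ask a VLM
# for a context-dependent negative prompt, then keep denoising with it.
def init_noise():
    return random.gauss(0, 1)

def ask_vlm(image_preview):
    # a real VLM might answer e.g. "blurry hands, extra fingers"
    return "artifacts visible in preview"

def predict_x0(x_t, t, prompt, negative):      # stand-in x0 preview
    return x_t

def denoise_step(x_t, t, prompt, negative):    # stand-in sampler update
    return x_t * 0.98

def sample(prompt, steps=50, vlm_steps=(40, 25, 10)):
    x_t = init_noise()
    negative = ""                              # start like plain CFG
    for t in reversed(range(steps)):
        if t in vlm_steps:                     # refresh the negative prompt
            preview = predict_x0(x_t, t, prompt, negative)
            negative = ask_vlm(preview)
        x_t = denoise_step(x_t, t, prompt, negative)
    return x_t

print(sample("a portrait photo"))
```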
zh
[CV-60] FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
【速读】: This paper addresses the difficulty of automatically segmenting bony structures in ultrasound (US) images of pediatric patients, especially achieving high-accuracy segmentation when annotated data are scarce. Traditional deep learning methods rely on large amounts of pixel-wise expert annotation, which is costly and time-consuming, while sufficient labels are often unavailable in clinical settings. The proposed solution is FlexICL, a novel and flexible in-context learning (ICL) framework: with only a small fraction of video frames annotated (e.g., 5%), it combines multi-frame image concatenation and augmentation strategies to segment unseen frames accurately without retraining the model. On four wrist and elbow US datasets it clearly outperforms advanced visual ICL models such as Painter and MAE-VQGAN as well as classical segmentation models such as U-Net and TransUNet, improving the Dice coefficient by 1-27% and demonstrating an efficient, scalable option for medical imaging applications where labels are scarce.
链接: https://arxiv.org/abs/2510.26049
作者: Yuyue Zhou,Jessica Knight,Shrimanti Ghosh,Banafshe Felfeliyan,Jacob L. Jaremko,Abhilash R. Hareendranathan
机构: University of Alberta (阿尔伯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and conventional segmentation models like U-Net and TransUNet by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation well suited for medical imaging use cases where labeled data is scarce.
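For intuition, visual ICL models of the Painter/MAE-VQGAN family typically see a grid like the one assembled below, with the labeled support pair on one row and the query plus a blank region to complete on the other; FlexICL's novel concatenations may arrange the canvas differently.
```python
import numpy as np

# Grid-style concatenation for visual in-context learning: the model is asked
# to "inpaint" the missing bottom-right quadrant, i.e. the query's mask.
H = W = 128
support_img = np.random.rand(H, W)
support_msk = (support_img > 0.5).astype(float)   # labeled frame from the sweep
query_img = np.random.rand(H, W)                  # unseen frame to segment

canvas = np.zeros((2 * H, 2 * W))
canvas[:H, :W] = support_img      # top-left: example input
canvas[:H, W:] = support_msk      # top-right: example output (mask)
canvas[H:, :W] = query_img        # bottom-left: query input
canvas[H:, W:] = 0.0              # bottom-right: to be predicted

print(canvas.shape)  # (256, 256) canvas fed to the inpainting-style model
```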
zh
[CV-61] Enhancing Temporal Understanding in Video-LLM s through Stacked Temporal Attention in Vision Encoders NEURIPS2025
【速读】: This paper addresses the limitations of current Video-LLMs in understanding complex temporal dynamics in video, especially fine-grained understanding of action sequences and temporal progression. The key to the solution is to introduce stacked temporal attention modules within the vision encoder, capturing inter-frame temporal relationships and the evolution of actions already at the visual encoding stage. Visual tokens thus carry stronger temporal structure before being passed to the language model, markedly improving temporal reasoning in video question answering, with gains of up to +5.5% on benchmarks such as VITATECS, MVBench, and Video-MME.
链接: https://arxiv.org/abs/2510.26027
作者: Ali Rasekh,Erfan Bagheri Soula,Omid Daliran,Simon Gottschalk,Mohsen Fayyaz
机构: Leibniz University Hannover (汉诺威莱布尼茨大学); L3S Research Center (L3S 研究中心); Independent Researcher (独立研究者); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Accepted to NeurIPS 2025
Abstract:Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention into the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: this https URL.
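A minimal PyTorch sketch of a stacked temporal attention block of this flavor, running attention along the time axis at each spatial location; the depth and dimensions are illustrative, not the paper's configuration.
```python
import torch
import torch.nn as nn

# Attention runs over the time axis independently at each spatial location,
# so frame tokens exchange information about action progression before the
# tokens ever reach the LLM.
class TemporalAttentionBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, T, N, D) frame tokens
        B, T, N, D = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # time as the sequence
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)

stack = nn.Sequential(*[TemporalAttentionBlock(256) for _ in range(3)])
tokens = torch.randn(2, 8, 196, 256)       # 8 frames of 14x14 patch tokens
print(stack(tokens).shape)                 # torch.Size([2, 8, 196, 256])
```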
zh
[CV-62] Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning
【速读】: This paper addresses the problem that, for coastal cities facing sea-level rise (SLR), traditional physics-driven hydrodynamic simulators are too computationally expensive for city-scale flood-protection planning. The key to the solution is a lightweight convolutional neural network (CNN) built on a vision-based, low-resource deep learning framework that efficiently predicts coastal flood extent and depth under different SLR scenarios and shoreline adaptation strategies. The model generalizes well across datasets from two geographic regions, Abu Dhabi (UAE) and San Francisco (USA), reducing the mean absolute error (MAE) of flood depth maps by nearly 20% on average and clearly outperforming state-of-the-art methods, providing a scalable and practical intelligent decision-support tool for coastal flood management.
链接: https://arxiv.org/abs/2510.26017
作者: Bilal Hassan,Areg Karapetyan,Aaron Chung Hin Chow,Samer Madanat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:  Submitted to Hydrology and Earth System Sciences
Abstract:Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives, however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: this https URL
zh
[CV-63] DARTS: A Drone-Based AI-Powered Real-Time Traffic Incident Detection System
【速读】: This paper addresses the limited adaptability, flexibility, and scalability of traditional traffic incident detection methods (closed-circuit television, dashcams, and sensor-based detection), which separate detection from verification, rely on dense infrastructure or high vehicle penetration rates, and struggle to keep up with shifting incident hotspots. The key to the solution is DARTS, a drone-based, AI-powered real-time traffic incident detection system: it exploits the high mobility and aerial perspective of drones for adaptive surveillance, uses thermal imaging to improve performance in low visibility while protecting privacy, and integrates a lightweight deep learning framework for real-time vehicle trajectory extraction and incident detection. The system achieves 99% detection accuracy on a self-collected dataset and supports simultaneous online visual verification, severity assessment, and monitoring of incident-induced congestion propagation through a web interface, markedly speeding up emergency response and enabling proactive traffic control.
链接: https://arxiv.org/abs/2510.26004
作者: Bai Li,Achilleas Kourtellis,Rong Cao,Joseph Post,Brian Porter,Yu Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:  Preprint version. This manuscript is currently under review at Transportation Research Part C: Emerging Technologies. The PDF corresponds to the version submitted in June 2025. The main findings of this work were recognized with the Best Intelligent Transportation Systems Paper Award at the 2025 TRB Annual Meeting
Abstract:Rapid and reliable incident detection is critical for reducing crash-related fatalities, injuries, and congestion. However, conventional methods, such as closed-circuit television, dashcam footage, and sensor-based detection, separate detection from verification, suffer from limited flexibility, and require dense infrastructure or high penetration rates, restricting adaptability and scalability to shifting incident hotspots. To overcome these challenges, we developed DARTS, a drone-based, AI-powered real-time traffic incident detection system. DARTS integrates drones’ high mobility and aerial perspective for adaptive surveillance, thermal imaging for better low-visibility performance and privacy protection, and a lightweight deep learning framework for real-time vehicle trajectory extraction and incident detection. The system achieved 99% detection accuracy on a self-collected dataset and supports simultaneous online visual verification, severity assessment, and incident-induced congestion propagation monitoring via a web-based interface. In a field test on Interstate 75 in Florida, DARTS detected and verified a rear-end collision 12 minutes earlier than the local transportation management center and monitored incident-induced congestion propagation, suggesting potential to support faster emergency response and enable proactive traffic control to reduce congestion and secondary crash risk. Crucially, DARTS’s flexible deployment architecture reduces dependence on frequent physical patrols, indicating potential scalability and cost-effectiveness for use in remote areas and resource-constrained settings. This study presents a promising step toward a more flexible and integrated real-time traffic incident detection system, with significant implications for the operational efficiency and responsiveness of modern transportation management.
zh
[CV-64] Larger Hausdorff Dimension in Scanning Pattern Facilitates Mamba-Based Methods in Low-Light Image Enhancement
【速读】: This paper addresses the shortcomings of existing Mamba-based low-light image enhancement methods in information consistency, capturing spatial locality, and computational efficiency. The key to the solution is a novel Hilbert Selective Scan mechanism that increases the Hausdorff dimension of the scanning pattern to explore the feature space more effectively, strengthening the perception of fine-grained structure and improving the modeling of local interactions while retaining long-range dependency modeling. The mechanism clearly improves quantitative metrics and visual fidelity while lowering computational resource consumption and inference time.
链接: https://arxiv.org/abs/2510.26001
作者: Xinhua Wang,Caibo Feng,Xiangjun Fu,Chunxiao Liu
机构: Imperial College London (帝国理工学院); University of Sussex (萨塞克斯大学); University of California San Diego (加州大学圣地亚哥分校); Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose an innovative enhancement to the Mamba framework by increasing the Hausdorff dimension of its scanning pattern through a novel Hilbert Selective Scan mechanism. This mechanism explores the feature space more effectively, capturing intricate fine-scale details and improving overall coverage. As a result, it mitigates information inconsistencies while refining spatial locality to better capture subtle local interactions without sacrificing the model’s ability to handle long-range dependencies. Extensive experiments on publicly available benchmarks demonstrate that our approach significantly improves both the quantitative metrics and qualitative visual fidelity of existing Mamba-based low-light image enhancement methods, all while reducing computational resource consumption and shortening inference time. We believe that this refined strategy not only advances the state-of-the-art in low-light image enhancement but also holds promise for broader applications in fields that leverage Mamba-based techniques.
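For intuition, the sketch below builds a Hilbert-curve scan order for a feature map using the standard distance-to-coordinate conversion; spatially adjacent pixels stay close in the resulting 1D sequence, which is the locality property a Hilbert scan brings to a selective-scan (Mamba) block.
```python
import numpy as np

def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve of size 2^order to (x, y)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

order = 3                                 # 8x8 feature map
n = 1 << order
feat = np.arange(n * n).reshape(n, n)     # stand-in feature map
scan = [feat[y, x] for x, y in (hilbert_d2xy(order, d) for d in range(n * n))]
print(scan[:8])                           # first tokens of the Hilbert scan
```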
zh
[CV-65] Fine-tuning Segment Anything for Real-Time Tumor Tracking in Cine-MRI KR
【速读】: This paper addresses real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under severe data scarcity. The core challenge is achieving accurate and robust segmentation and tracking under limited annotations and a strict one-second inference-time constraint. The key to the solution is a foundation-model-based segmentation strategy, using SAM 2.1 (Segment Anything Model 2.1) and its recent variants with prompt-based interaction for efficient adaptation. The final choice, the SAM2.1 b+ model, is fine-tuned on TrackRAD2025's small labeled subset with 1024×1024 patches, standard augmentations, and a balanced Dice + IoU loss, training all modules at a low learning rate (0.0001) to preserve generalization while adapting to different annotators' styles. The model reaches a Dice coefficient of 0.8794 on the hidden test set, ranking sixth, confirming the strong potential of foundation models for accurate real-time tumor tracking in MRI-guided radiotherapy.
链接: https://arxiv.org/abs/2510.25990
作者: Valentin Boussot,Cédric Hémon,Jean-Claude Nunes,Jean-Louis Dillenseger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Paper for the Trackrad2025 challenge, Team BreizhTrack
Abstract:In this work, we address the TrackRAD2025 challenge of real-time tumor tracking in cine-MRI sequences of the thoracic and abdominal regions under strong data scarcity constraints. Two complementary strategies were explored: (i) unsupervised registration with the IMPACT similarity metric and (ii) foundation model-based segmentation leveraging SAM 2.1 and its recent variants through prompt-based interaction. Due to the one-second runtime constraint, the SAM-based method was ultimately selected. The final configuration used SAM2.1 b+ with mask-based prompts from the first annotated slice, fine-tuned solely on the small labeled subset from TrackRAD2025. Training was configured to minimize overfitting, using 1024x1024 patches (batch size 1), standard augmentations, and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve generalization while adapting to annotator-specific styles. Training lasted 300 epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently applied across all anatomical sites and MRI field strengths. Test-time augmentation was considered but ultimately discarded due to negligible performance gains. The final model was selected based on the highest Dice Similarity Coefficient achieved on the validation set after fine-tuning. On the hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall in the TrackRAD2025 challenge. These results highlight the strong potential of foundation models for accurate and real-time tumor tracking in MRI-guided radiotherapy.
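A balanced Dice + IoU loss of the kind described is straightforward to write down; equal weights are assumed here since the entry's exact weighting is not spelled out.
```python
import torch

def dice_iou_loss(logits, target, eps=1e-6, w_dice=0.5, w_iou=0.5):
    """Balanced Dice + IoU loss for binary mask logits."""
    prob = torch.sigmoid(logits).flatten(1)        # (B, H*W)
    tgt = target.flatten(1).float()
    inter = (prob * tgt).sum(1)
    dice = (2 * inter + eps) / (prob.sum(1) + tgt.sum(1) + eps)
    union = prob.sum(1) + tgt.sum(1) - inter
    iou = (inter + eps) / (union + eps)
    return (w_dice * (1 - dice) + w_iou * (1 - iou)).mean()

logits = torch.randn(2, 1, 1024, 1024)             # SAM-style mask logits
target = torch.rand(2, 1, 1024, 1024) > 0.5
print(float(dice_iou_loss(logits, target)))
```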
zh
[CV-66] Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
【速读】: This paper addresses the lack of faithfulness when reconstructing images seen by humans from fMRI brain recordings, i.e., inconsistencies between the reconstruction and the actual visual content. Although diffusion-based methods have made progress, they often fail to exploit the information in brain activity effectively, distorting reconstructions. The key to the solution is the brain-inspired "Brain-IT" framework, whose core is a Brain Interaction Transformer (BIT) that enables efficient interactions among groups of brain voxels through shared functional clusters, which serve as building blocks for integrating information across subjects and brain regions. BIT predicts two complementary kinds of localized image features: high-level semantic features that steer the diffusion model toward the correct semantic content, and low-level structural features that initialize the diffusion process with the correct coarse layout. This design allows information to flow directly from brain activations to image features, markedly improving the faithfulness and quality of reconstructions; with only 1 hour of fMRI data from a new subject, it matches what traditional methods achieve with 40 hours of recordings.
链接: https://arxiv.org/abs/2510.25976
作者: Roman Beliy,Amit Zalcher,Jonathan Kogman,Navve Wasserman,Michal Irani
机构: The Weizmann Institute of Science (魏茨曼科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally-similar brain-voxels. These functional-clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method achieves image reconstructions from fMRI that faithfully reconstruct the seen images, and surpass current SotA approaches both visually and by standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
zh
[CV-67] SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing NEURIPS2025
【速读】: This paper addresses two core problems generative AI faces in image editing: inaccurate inversion when mapping real images back into the latent space, and gradient entanglement during editing, which makes outputs deviate from the target prompt. Existing ODE-based methods try to bypass inversion but still yield suboptimal edits. The key to the proposed inversion-free flow decomposition-and-aggregation framework is to semantically decompose the target prompt into several sub-prompts, compute an independent flow field for each, and aggregate them; a projection and soft-aggregation mechanism, inspired by gradient-conflict mitigation in multi-task learning, adaptively weights the sub-target velocity fields, suppressing semantic redundancy while reinforcing distinct directions, so that editing diversity is preserved while the output remains consistent with the full prompt. Experiments show it surpasses existing zero-shot editing approaches in semantic fidelity and attribute disentanglement.
链接: https://arxiv.org/abs/2510.25970
作者: Sung-Hoon Yoon,Minghan Li,Gaspard Beaudouin,Congcong Wen,Muhammad Rafay Azhar,Mengyu Wang
机构: Harvard AI and Robotics Lab (哈佛人工智能与机器人实验室); Harvard University (哈佛大学); École des Ponts (巴黎路桥学院); Institut Polytechnique de Paris (巴黎综合理工学院); New York University Abu Dhabi (纽约大学阿布扎比分校); Kempner Institute for the Study of Natural and Artificial Intelligence (肯普纳自然与人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:  Camera-ready version for NeurIPS 2025, 10 pages (main paper)
Abstract:Rectified flow models have become a de facto standard in image generation due to their stable sampling trajectories and high-fidelity outputs. Despite their strong generative capabilities, they face critical limitations in image editing tasks: inaccurate inversion processes for mapping real images back into the latent space, and gradient entanglement issues during editing often result in outputs that do not faithfully reflect the target prompt. Recent efforts have attempted to directly map source and target distributions via ODE-based approaches without inversion; however, these methods still yield suboptimal editing quality. In this work, we propose a flow decomposition-and-aggregation framework built upon an inversion-free formulation to address these limitations. Specifically, we semantically decompose the target prompt into multiple sub-prompts, compute an independent flow for each, and aggregate them to form a unified editing trajectory. While we empirically observe that decomposing the original flow enhances diversity in the target space, generating semantically aligned outputs still requires consistent guidance toward the full target prompt. To this end, we design a projection and soft-aggregation mechanism for flow, inspired by gradient conflict resolution in multi-task learning. This approach adaptively weights the sub-target velocity fields, suppressing semantic redundancy while emphasizing distinct directions, thereby preserving both diversity and consistency in the final edited output. Experimental results demonstrate that our method outperforms existing zero-shot editing approaches in terms of semantic fidelity and attribute disentanglement. The code is available at this https URL.
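To illustrate the flavor of projection plus soft aggregation over sub-prompt velocity fields, here is a PCGrad-inspired sketch; the actual SplitFlow weighting rule may differ.
```python
import torch

def project_out(v, ref):
    """Remove from v its component opposing ref (only when they conflict)."""
    dot = (v * ref).sum()
    if dot < 0:
        v = v - dot / (ref.norm() ** 2 + 1e-8) * ref
    return v

def aggregate(velocities, full_prompt_velocity, temp=1.0):
    # project each sub-prompt velocity so it never fights the full prompt
    projected = [project_out(v, full_prompt_velocity) for v in velocities]
    # soft weights favor directions distinct from the mean (less redundancy)
    mean_v = torch.stack(projected).mean(0)
    dist = torch.stack([(v - mean_v).norm() for v in projected]) / temp
    w = torch.softmax(dist, 0)             # larger distance -> larger weight
    return sum(wi * vi for wi, vi in zip(w, projected))

vs = [torch.randn(4, 64, 64) for _ in range(3)]   # sub-prompt velocities
v_full = torch.randn(4, 64, 64)                   # full-prompt velocity
print(aggregate(vs, v_full).shape)                # torch.Size([4, 64, 64])
```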
zh
[CV-68] Generative Image Restoration and Super-Resolution using Physics-Informed Synthetic Data for Scanning Tunneling Microscopy
【速读】: This paper addresses two major challenges of scanning tunnelling microscopy (STM) for atomic-level imaging and atom manipulation: tip degradation caused by prolonged use or high voltages, which requires frequent conditioning, and slow serial data acquisition, which limits experimental throughput. The key to the solution is a machine learning (ML) framework for image restoration and super-resolution: using only 36 pristine high-quality STM images of the Si(001):H surface together with a physics-informed synthetic data generation pipeline, advanced flow-matching and diffusion models are trained to reconstruct high-fidelity images from sparsely sampled data. This clearly improves restoration quality, reduces image acquisition time by a factor of two to four, lowers the need for tip conditioning, and raises the overall efficiency of STM systems.
链接: https://arxiv.org/abs/2510.25921
作者: Nikola L. Kolev(1,2),Tommaso Rodani(3,4),Neil J. Curson(1,2),Taylor J.Z. Stock(1,2),Alberto Cazzaniga(4) ((1) London Centre for Nanotechnology, University College London, London, United Kingdom, (2) Department of Electronic and Electrical Engineering, University College London, London, United Kingdom, (3) University of Trieste, Trieste, Italy, (4) AREA Science Park, Trieste, Italy)
机构: University College London (伦敦大学学院); AREA Science Park (科学公园); University of Trieste (特里斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
备注:
Abstract:Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and atom manipulation, but its utility is often limited by tip degradation and slow serial data acquisition. Fabrication adds another layer of complexity since the tip is often subjected to large voltages, which may alter the shape of its apex, requiring it to be conditioned. Here, we propose a machine learning (ML) approach for image repair and super-resolution to alleviate both challenges. Using a dataset of only 36 pristine experimental images of Si(001):H, we demonstrate that a physics-informed synthetic data generation pipeline can be used to train several state-of-the-art flow-matching and diffusion models. Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy (CMMD) score and structural similarity demonstrates that our models are able to effectively restore images and offer a two- to fourfold reduction in image acquisition time by accurately reconstructing images from sparsely sampled data. Our framework has the potential to significantly increase STM experimental throughput by offering a route to reducing the frequency of tip-conditioning procedures and to enhancing frame rates in existing high-speed STM systems.
zh
[CV-69] BikeScenes: Online LiDAR Semantic Segmentation for Bicycles
【速读】: This paper addresses the safety risks cyclists face as e-bikes get faster, with the core challenge of adapting automotive perception technology to safety-oriented perception for bicycles. The key to the solution is a 3D LiDAR segmentation method designed specifically for bicycles and the construction of BikeScenes-lidarseg, the first semantic LiDAR dataset for bicycle scenarios, containing 3,021 fully annotated LiDAR scans covering 29 dynamic and static classes. Fine-tuning on this dataset achieves 63.6% mean Intersection-over-Union (mIoU), far exceeding the 13.8% obtained with SemanticKITTI pre-training alone, validating the effectiveness of domain-specific training for cyclist-centric perception.
链接: https://arxiv.org/abs/2510.25901
作者: Denniz Goren,Holger Caesar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:The vulnerability of cyclists, exacerbated by the rising popularity of faster e-bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi-sensor ‘SenseBike’ research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of the TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine-tuning on our BikeScenes dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This result underscores the necessity and effectiveness of domain-specific training. We highlight key challenges specific to bicycle-mounted, hardware-constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist-centric LiDAR segmentation.
zh
[CV-70] MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
【速读】: This paper addresses the mismatch between user preferences and text-to-image generative models trained on large uncurated datasets, as well as the way existing reward-model-based post-processing improves preference alignment at the cost of diversity, semantic fidelity, and training efficiency. The key to the solution is to condition the model on multiple reward models directly during training, letting it learn and internalize user preferences and thereby avoiding both the information loss of post-hoc selection and the limitations of optimizing a single reward. The resulting method, MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and on user-preference scores (PickAScore, ImageReward, HPSv2), while also significantly accelerating training.
链接: https://arxiv.org/abs/2510.25897
作者: Nicolas Dufour,Lucas Degeorge,Arijit Ghosh,Vicky Kalogeiton,David Picard
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:  Project page: this https URL
Abstract:Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been specifically designed to perform post-hoc selection of generated images and align them to a reward, typically user preference. This discarding of informative data, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training to let the model learn user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
zh
[CV-71] DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
【速读】: This paper addresses the performance bottleneck in object detection for civil engineering caused by the scarcity of annotated data for specialized scenes. The key to the solution is DINO-YOLO, a hybrid architecture that combines the feature representations of the self-supervised vision transformer DINOv3 with the YOLOv12 detection framework: DINOv3 features are injected at two locations, the input preprocessing stage (P0) and a mid-backbone layer (P3), to make efficient use of limited labels. Experiments show clear accuracy gains on several civil engineering datasets (e.g., +12.4% for tunnel segment crack detection and +13.7% for construction PPE detection) while maintaining real-time inference (30-47 FPS), providing an efficient and practical solution for infrastructure safety monitoring in data-constrained environments.
链接: https://arxiv.org/abs/2510.25140
作者: Malaisree P, Youwai S, Kitkobsin T, Janrungautai S, Amorndechaphon D, Rojanavasu P
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
zh
[CV-72] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
【速读】: This paper addresses the lack of robustness of Vision-Language-Action (VLA) models to multi-modal perturbations in real-world deployment: existing methods consider only simple visual disturbances and overlook the complexity of perturbations across actions, instructions, environments, and observations. The key to the proposed RobustVLA framework is a dual mechanism that hardens both inputs and outputs: on the output side, offline robust optimization against worst-case action noise that maximizes mismatch in the flow-matching objective, which is equivalent to adversarial training, label smoothing, and outlier penalization; on the input side, enforcing consistent action outputs under input variations that preserve task semantics. The multi-perturbation setting is further modeled as a multi-armed bandit problem, with an upper confidence bound (UCB) algorithm automatically identifying the most harmful noise type. This yields clear overall gains under 17 perturbations, with absolute improvements on the LIBERO benchmark of 12.6% (pi0 backbone) and 10.4% (OpenVLA backbone) over baselines, inference 50.6x faster than existing visual-robust VLAs, and a 65.6% absolute gain on a real FR5 robot from only a few demonstrations.
链接: https://arxiv.org/abs/2510.00037
作者: Jianing Guo,Zhenhong Wu,Chang Tu,Yiyao Ma,Xiangqi Kong,Zhiqian Liu,Jiaming Ji,Shuning Zhang,Yuanpei Chen,Kai Chen,Qi Dou,Yaodong Yang,Xianglong Liu,Huijie Zhao,Weifeng Lv,Simin Li
机构: Beihang University (北京航空航天大学); University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate that our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on a real-world FR5 robot with limited demonstrations, showing absolute gains of 65.6% under perturbations across four modalities.
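The bandit component is the most self-contained piece; a minimal UCB-1 loop over perturbation types might look as follows, with the `harm` function standing in for the measured loss increase under each perturbation.
```python
import math, random

# UCB-1 picks the currently most harmful perturbation to train against.
perturbations = ["action_noise", "instruction_swap", "camera_shift", "occlusion"]
counts = [0] * len(perturbations)
totals = [0.0] * len(perturbations)

def harm(kind):                       # stand-in: true harm unknown to learner
    base = {"action_noise": 0.9, "instruction_swap": 0.4,
            "camera_shift": 0.6, "occlusion": 0.5}[kind]
    return base + random.gauss(0, 0.1)

for t in range(1, 501):
    if 0 in counts:                   # play every arm once first
        i = counts.index(0)
    else:                             # mean harm + exploration bonus
        i = max(range(len(perturbations)),
                key=lambda k: totals[k] / counts[k]
                + math.sqrt(2 * math.log(t) / counts[k]))
    r = harm(perturbations[i])        # train against this perturbation
    counts[i] += 1
    totals[i] += r

print(dict(zip(perturbations, counts)))  # action_noise should dominate
```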
zh
[CV-73] MORE: Multi-Organ Medical Image REconstruction Dataset
【速读】:该论文旨在解决当前深度学习在CT图像重建中普遍存在的泛化能力不足问题,即现有方法通常局限于特定解剖结构和数据集,难以有效处理未见过的解剖区域和病灶类型。其解决方案的关键在于构建了一个名为Multi-Organ medical image REconstruction (MORE)的大规模、多器官、多病灶类型的CT图像数据集,涵盖9种不同解剖结构和15类病变,从而支持模型在异质数据上的鲁棒训练与严格评估。同时,研究提出了一种优化驱动的基线方法,在未见解剖结构上展现出更强的重建稳定性与性能,验证了高质量数据集对提升模型泛化能力的重要性及优化策略的有效性。
链接: https://arxiv.org/abs/2510.26759
作者: Shaokai Wu,Yapan Guo,Yanbiao Ji,Jing Tong,Yuxiang Lu,Mei Li,Suizhi Huang,Yue Ding,Hongtao Lu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:  Accepted to ACMMM 2025
Abstract:CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page this https URL
zh
[CV-74] ProstNFound: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection
【速读】: This paper addresses the lack of clinical validation of medical foundation models (FMs) for detecting prostate cancer (PCa) in micro-ultrasound (μUS); current diagnosis depends on expert experience and suffers from subjectivity and limited scalability. The key innovations of the proposed ProstNFound+ are a medical-foundation-model-based architecture, adapter tuning, and a custom prompt encoder embedding PCa-specific clinical biomarkers, which together produce an interpretable cancer heatmap and a risk score for clinically significant disease. After training on multi-center retrospective data, the method is prospectively validated on data acquired five years later at a new clinical site, showing close agreement with standard clinical scoring protocols (PRI-MUS and PI-RADS) and no performance degradation, indicating its readiness for clinical deployment.
链接: https://arxiv.org/abs/2510.26703
作者: Paul F. R. Wilson,Mohamed Harmanani,Minh Nguyen Nhat To,Amoon Jamzad,Tarek Elghareb,Zhuoxin Guo,Adam Kinnaird,Brian Wodlinger,Purang Abolmaesumi,Parvin Mousavi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: Medical foundation models (FMs) offer a path to build high-performance diagnostic systems. However, their application to prostate cancer (PCa) detection from micro-ultrasound (μUS) remains untested in clinical settings. We present ProstNFound+, an adaptation of FMs for PCa detection from μUS, along with its first prospective validation. Methods: ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt encoder that embeds PCa-specific clinical biomarkers. The model generates a cancer heatmap and a risk score for clinically significant PCa. Following training on multi-center retrospective data, the model is prospectively evaluated on data acquired five years later from a new clinical site. Model predictions are benchmarked against standard clinical scoring protocols (PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the prospective data, with no performance degradation compared to retrospective evaluation. It aligns closely with clinical scores and produces interpretable heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results highlight its potential for clinical deployment, offering a scalable and interpretable alternative to expert-driven protocols.
zh
[CV-75] BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI
【速读】: This paper addresses automated assessment of artifact severity in pediatric brain MRI, especially for low-field systems where a reduced signal-to-noise ratio degrades image quality; manual assessment is time-consuming and subjective, so reliable, efficient automation is needed. The core innovation of BRIQA (Balanced Reweighting in Image Quality Assessment) is a gradient-driven loss reweighting mechanism that dynamically adjusts the contribution of each severity class, combined with a rotating batching scheme that ensures stable learning of under-represented classes, effectively mitigating class imbalance. Experiments show the method raises the average macro F1 score from 0.659 to 0.706, with clear gains on common artifact types such as noise, zipper, and positioning artifacts.
链接: https://arxiv.org/abs/2510.26661
作者: Alya Almsouti,Ainur Khamitova,Darya Taratynova,Mohammad Yaqub
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Assessing the severity of artifacts in pediatric brain Magnetic Resonance Imaging (MRI) is critical for diagnostic accuracy, especially in low-field systems where the signal-to-noise ratio is reduced. Manual quality assessment is time-consuming and subjective, motivating the need for robust automated solutions. In this work, we propose BRIQA (Balanced Reweighting in Image Quality Assessment), which addresses class imbalance in artifact severity levels. BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes. Our experiments show that no single architecture performs best across all artifact types, emphasizing the importance of architectural diversity. The rotating batching configuration improves performance across metrics by promoting balanced learning when combined with cross-entropy loss. BRIQA improves the average macro F1 score from 0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification. The code is available at this https URL.
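One plausible instantiation of gradient-based loss reweighting (an assumption, not the paper's exact rule) scales up classes whose classifier rows receive weak gradient signal:
```python
import torch
import torch.nn.functional as F

# Classes with weak per-class gradient magnitude get a larger loss weight,
# so rare severity levels are not drowned out by frequent ones.
torch.manual_seed(0)
model = torch.nn.Linear(32, 4)                   # 4 severity levels
weights = torch.ones(4)

for step in range(100):
    x = torch.randn(64, 32)
    y = torch.randint(0, 4, (64,))
    per_sample = F.cross_entropy(model(x), y, reduction="none")
    loss = (weights[y] * per_sample).mean()
    model.zero_grad(); loss.backward()
    g = model.weight.grad.norm(dim=1)            # per-class gradient magnitude
    weights = (g.mean() / (g + 1e-8)).clamp(0.5, 2.0).detach()
    with torch.no_grad():                        # plain SGD update
        for p in model.parameters():
            p -= 0.1 * p.grad

print(weights)                                   # dynamically adapted weights
```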
zh
[CV-76] SAMRI: Segment Anything Model for MRI
【速读】: This paper addresses the problem that manual MRI segmentation is labor-intensive in clinical practice while existing deep learning methods generalize poorly, especially against MRI-specific challenges such as contrast differences, intensity inhomogeneity, and diverse scanning protocols. The key to the solution is SAMRI, a Segment Anything Model specialized for MRI: a two-stage strategy that fine-tunes only the mask decoder cuts training time by 94% and trainable parameters by 96%, while training and validation on 1.1 million labeled MRI slices covering whole-body organs and lesions yield a mean Dice of 0.87, delivering segmentation accuracy beyond existing methods and robust generalization to unseen structures.
链接: https://arxiv.org/abs/2510.26635
作者: Zhao Wang,Wei Dai,Thuy Thanh Dao,Steffen Bollmann,Hongfu Sun,Craig Engstrom,Shekhar S. Chandra
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate magnetic resonance imaging (MRI) segmentation is crucial for clinical decision-making, but remains labor-intensive when performed manually. Convolutional neural network (CNN)-based methods can be accurate and efficient, but often generalize poorly to MRI’s variable contrast, intensity inhomogeneity, and protocols. Although the transformer-based Segment Anything Model (SAM) has demonstrated remarkable generalizability in natural images, existing adaptations often treat MRI as another imaging modality, overlooking these modality-specific challenges. We present SAMRI, an MRI-specialized SAM trained and validated on 1.1 million labeled MR slices spanning whole-body organs and pathologies. We demonstrate that SAM can be effectively adapted to MRI by simply fine-tuning its mask decoder using a two-stage strategy, reducing training time by 94% and trainable parameters by 96% versus full-model retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice of 0.87, delivering state-of-the-art accuracy across anatomical regions and robust generalization on unseen structures, particularly small and clinically important structures.
zh
[CV-77] Comparative Analysis of Deep Learning Models for Olive Tree Crown and Shadow Segmentation Towards Biovolume Estimation
【速读】: This paper addresses olive tree biovolume estimation, a key task in precision agriculture that supports yield prediction and resource management, especially in Mediterranean regions severely affected by climate change. The core of the solution is to segment olive tree crowns and their shadows in super-resolution UAV imagery with three deep learning models (U-Net, YOLOv11m-seg, and Mask R-CNN) and then, using solar geometry, combine the crown projected area with shadow-derived height to estimate per-tree biovolume. Mask R-CNN achieves the best accuracy (F1 = 0.86; mIoU = 0.72), while YOLOv11m-seg offers the highest processing speed (0.12 seconds per image), providing scalable automated tools for different application scenarios.
链接: https://arxiv.org/abs/2510.26573
作者: Wondimagegn Abebe Demissie,Stefano Roccella,Rudy Rossetto,Antonio Minnocci,Andrea Vannini,Luca Sebastiani
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:  6 pages, 2025 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor)
Abstract:Olive tree biovolume estimation is a key task in precision agriculture, supporting yield prediction and resource management, especially in Mediterranean regions severely impacted by climate-induced stress. This study presents a comparative analysis of three deep learning models, U-Net, YOLOv11m-seg, and Mask R-CNN, for segmenting olive tree crowns and their shadows in ultra-high resolution UAV imagery. The UAV dataset, acquired over Vicopisano, Italy, includes manually annotated crown and shadow masks. Building on these annotations, the methodology emphasizes spatial feature extraction and robust segmentation; per-tree biovolume is then estimated by combining crown projected area with shadow-derived height using solar geometry. In testing, Mask R-CNN achieved the best overall accuracy (F1 = 0.86; mIoU = 0.72), while YOLOv11m-seg provided the fastest throughput (0.12 seconds per image). The estimated biovolumes spanned from approximately 4 to 24 cubic meters, reflecting clear structural differences among trees. These results indicate Mask R-CNN is preferable when biovolume accuracy is paramount, whereas YOLOv11m-seg suits large-area deployments where speed is critical; U-Net remains a lightweight, high-sensitivity option. The framework enables accurate, scalable orchard monitoring and can be further strengthened with DEM or DSM integration and field calibration for operational decision support.
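The shadow-geometry step reduces to elementary trigonometry: height follows from shadow length and solar elevation, h = L·tan(θ). The sketch below adds a simple crown-volume assumption; the paper's exact volume model may differ.
```python
import math

def tree_height(shadow_len_m, solar_elevation_deg):
    """Height from shadow length and solar elevation: h = L * tan(theta)."""
    return shadow_len_m * math.tan(math.radians(solar_elevation_deg))

def biovolume(crown_area_m2, height_m, trunk_offset_m=1.0, shape_factor=0.7):
    """Crown modeled as an ellipsoid-like solid: V = k * A * crown_depth.
    The trunk offset and shape factor are illustrative assumptions."""
    crown_depth = max(height_m - trunk_offset_m, 0.0)
    return shape_factor * crown_area_m2 * crown_depth

h = tree_height(shadow_len_m=3.5, solar_elevation_deg=50.0)
v = biovolume(crown_area_m2=9.5, height_m=h)
print(f"height = {h:.2f} m, biovolume = {v:.1f} m^3")
```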
zh
[CV-78] SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation
【速读】: This paper addresses the limits that large variations in organ size and shape place on deep learning methods for multi-organ segmentation. The key to the proposed two-stage framework, the Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), lies in two core components: a spatial prior network that generates coarse region-of-interest (ROI) localization maps as spatial guidance, and a cross dual encoder network composed of a global encoder, a local encoder, a symmetric cross-attention module, and a flow-based decoder. Symmetric cross-attention introduced at all encoder layers enhances the interaction and fusion of global and local features, and the flow-based decoder propagates high-level semantic features directly to every decoder layer, maximizing feature preservation and utilization efficiency and clearly improving multi-organ segmentation accuracy.
链接: https://arxiv.org/abs/2510.26390
作者: Xizhi Tian,Changjun Zhou,Yulin. Yang
机构: Zhejiang Normal University (浙江师范大学); Wenzhou University (温州大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-organ segmentation is a critical task in computer-aided diagnosis. While recent deep learning methods have achieved remarkable success in image segmentation, huge variations in organ size and shape challenge their effectiveness in multi-organ segmentation. To address these challenges, we propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel two-stage segmentation paradigm designed to improve multi-organ segmentation accuracy. Our SPG-CDENet consists of two key components: a spatial prior network and a cross dual encoder network. The prior network generates coarse localization maps that delineate the approximate ROI, serving as spatial guidance for the dual encoder network. The cross dual encoder network comprises four essential components: a global encoder, a local encoder, a symmetric cross-attention module, and a flow-based decoder. The global encoder captures global semantic features from the entire image, while the local encoder focuses on features from the prior network. To enhance the interaction between the global and local encoders, a symmetric cross-attention module is proposed across all layers of the encoders to fuse and refine features. Furthermore, the flow-based decoder directly propagates high-level semantic features from the final encoder layer to all decoder layers, maximizing feature preservation and utilization. Extensive qualitative and quantitative experiments on two public datasets demonstrate the superior performance of SPG-CDENet compared to existing segmentation methods. Furthermore, ablation studies further validate the effectiveness of the proposed modules in improving segmentation accuracy.
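A symmetric cross-attention block of the kind described, where global tokens query local tokens and vice versa, can be sketched as follows; the dimensions are illustrative, not the paper's configuration.
```python
import torch
import torch.nn as nn

class SymmetricCrossAttention(nn.Module):
    """Two mirrored attention paths fuse global and local encoder features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.g2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_g = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, g, l):       # g: (B, Ng, D) global, l: (B, Nl, D) local
        g = g + self.g2l(self.norm_g(g), l, l)[0]   # global queries local
        l = l + self.l2g(self.norm_l(l), g, g)[0]   # local queries global
        return g, l

block = SymmetricCrossAttention(dim=128)
g, l = torch.randn(2, 196, 128), torch.randn(2, 64, 128)
g2, l2 = block(g, l)
print(g2.shape, l2.shape)   # torch.Size([2, 196, 128]) torch.Size([2, 64, 128])
```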
zh
[CV-79] Groupwise Registration with Physics-Informed Test-Time Adaptation on Multi-parametric Cardiac MRI
【速读】: This paper addresses the difficulty that registration deviations between different contrasts in multiparametric MRI pose for pixel-wise tissue characterization. The key to the solution is a highly generalizable physics-informed deep learning model that achieves groupwise registration of images across different physical models (e.g., T1-mapping and T2-mapping models) via test-time adaptation; synthetic images generated from a specific physics model serve as the registration reference, supporting transductive learning across a variety of tissue contrasts. Validation on data from healthy volunteers acquired with different MRI sequences confirms clear improvements in multi-modal image registration performance.
链接: https://arxiv.org/abs/2510.26022
作者: Xinqi Li,Yi Zhang,Li-Ting Huang,Hsiao-Huang Chang,Thoralf Niendorf,Min-Chi Ku,Qian Tao,Hsin-Jung Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multiparametric mapping MRI has become a viable tool for myocardial tissue characterization. However, misalignment between multiparametric maps makes pixel-wise analysis challenging. To address this challenge, we developed a generalizable physics-informed deep-learning model using test-time adaptation to enable group image registration across contrast-weighted images acquired from multiple physical models (e.g., a T1 mapping model and a T2 mapping model). The physics-informed adaptation utilizes synthetic images from a specific physics model as the registration reference, allowing transductive learning across various tissue contrasts. We validated the model in healthy volunteers with various MRI sequences, demonstrating improved multi-modal registration under a wide range of image contrast variability.
zh
人工智能
[AI-0] LLMs Process Lists With General Filter Heads
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理列表操作任务时,其内部如何实现抽象计算逻辑(如过滤操作)的问题。解决方案的关键在于发现并验证了LLMs中存在一类称为“filter heads”的注意力头,它们能够在特定标记的查询状态中编码出紧凑且因果性的过滤谓词表示,这种表示具有通用性和可迁移性——可被提取并在不同数据集合、格式或任务中复用以执行相同的过滤操作。这一机制揭示了Transformer架构能够学习到类似函数式编程中“filter”函数的人类可解释实现方式,并展现出与传统编程范式高度一致的泛化能力。
链接: https://arxiv.org/abs/2510.26784
作者: Arnab Sen Sharma,Giordano Rogers,Natalie Shapira,David Bau
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  Code and data at this https URL
Abstract:We investigate the mechanisms underlying a range of list-processing tasks in LLMs, and we find that LLMs have learned to encode a compact, causal representation of a general filtering operation that mirrors the generic “filter” function of functional programming. Using causal mediation analysis on a diverse set of list-processing tasks, we find that a small number of attention heads, which we dub filter heads, encode a compact representation of the filtering predicate in their query states at certain tokens. We demonstrate that this predicate representation is general and portable: it can be extracted and reapplied to execute the same filtering operation on different collections, presented in different formats, languages, or even in tasks. However, we also identify situations where transformer LMs can exploit a different strategy for filtering: eagerly evaluating if an item satisfies the predicate and storing this intermediate result as a flag directly in the item representations. Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming patterns.
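摘要的核心发现是 filter heads 在特定 token 的 query 状态中编码了可移植的谓词表示。下面用 numpy 写一个玩具化的单头注意力演示:把从任务 A 提取的 query 向量移植到另一组 item 的打分中,以示意这种“谓词移植”的形式(纯属示意,与真实 LLM 内部权重无关):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def attention_scores(q, K):
    """单头注意力打分: softmax(K·q / sqrt(d))。"""
    logits = K @ q / np.sqrt(d)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# 任务 A:在某个 token 处提取到的 query(假想的"谓词表示")
predicate_q = rng.normal(size=d)

# 任务 B:另一组 item 的 key 表示;其中 0、2 号被构造得与谓词对齐
K_new = rng.normal(size=(5, d))
K_new[0] += 2.0 * predicate_q
K_new[2] += 2.0 * predicate_q

# 移植 query 后,注意力集中在"满足谓词"的 item 上
print(np.round(attention_scores(predicate_q, K_new), 3))
```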
zh
[AI-1] Faithful and Fast Influence Function via Advanced Sampling
【速读】:该论文旨在解决训练数据对黑箱模型影响的解释难题,尤其是传统影响函数(Influence Functions, IFs)在计算整个数据集的海森矩阵(Hessian)时存在资源消耗过高、难以实际应用的问题。其核心挑战在于:随机采样小规模训练子集虽可降低计算成本,但因样本配置方差大而导致IF估计不稳定。解决方案的关键在于提出两种基于特征(features)和logits的先进采样策略,通过捕捉特征或logits的随机分布来选择具有代表性的少量训练样本,从而显著提升IF估计的准确性与稳定性。实验表明,该方法在保持推理一致性的同时,相较基线实现30.1%的计算时间减少、42.2%的内存占用降低,或F1-score提升2.5%。
链接: https://arxiv.org/abs/2510.26776
作者: Jungyeon Koh,Hyeonsu Lyu,Jonggyu Jang,Hyun Jong Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:How can we explain the influence of training data on black-box models? Influence functions (IFs) offer a post-hoc solution by utilizing gradients and Hessians. However, computing the Hessian for an entire dataset is resource-intensive, necessitating a feasible alternative. A common approach involves randomly sampling a small subset of the training data, but this method often results in highly inconsistent IF estimates due to the high variance in sample configurations. To address this, we propose two advanced sampling techniques based on features and logits. These samplers select a small yet representative subset of the entire dataset by considering the stochastic distribution of features or logits, thereby enhancing the accuracy of IF estimations. We validate our approach through class removal experiments, a typical application of IFs, using the F1-score to measure how effectively the model forgets the removed class while maintaining inference consistency on the remaining classes. Our method reduces computation time by 30.1% and memory usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
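摘要提出按特征或 logits 的分布选取小而有代表性的子集以稳定 IF 估计。下面是一个“特征聚类后取最近质心样本”的最小示意(具体采样细节为本文假设,非论文原方法):

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_subset(features: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """对特征做 k-means,返回每个簇中离质心最近的样本下标。"""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    idx = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        idx.append(members[dists.argmin()])
    return np.array(idx)

feats = np.random.default_rng(0).normal(size=(1000, 32))
subset = representative_subset(feats, k=20)
print(subset.shape)  # (20,) 个代表性样本,供后续 Hessian/IF 估计使用
```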
zh
[AI-2] STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization
【速读】:该论文旨在解决生成式 AI 模型在低比特位宽(低于8位)激活值量化时导致精度显著下降的问题。其解决方案的关键在于提出了一种名为 Sequence Transformation and Mixed Precision (STaMP) 的量化策略,通过沿序列维度应用线性变换(如旋转),利用语言和视觉数据中的强局部相关性,并在每个中间激活中保留少量 token 以较高精度处理,从而在保持整体低平均比特位宽的同时维持模型精度。该方法可有效提升低比特激活量化性能,并与现有的权重和激活量化方法(包括特征变换技术)兼容互补。
链接: https://arxiv.org/abs/2510.26771
作者: Marco Federici,Riccardo Del Chiaro,Boris van Breugel,Paul Whatmough,Markus Nagel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  10 pages main text, 8 pages supplementary material
Abstract:Quantization is the key method for reducing inference latency, power and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization, by reparameterizing feature channels and weights. In this paper, we propose Sequence Transformation and Mixed Precision (STaMP) quantization, a novel strategy that applies linear transformations along the sequence dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit width activation quantization and complements established activation and weight quantization methods including recent feature transformations.
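下面用一个伪量化(fake-quant)示例说明“序列维度上少量 token 保持高精度、其余低位宽量化”的思路;以激活范数近似 token 重要性属于本文的示意性假设:

```python
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """逐张量对称伪量化到给定位宽(示意)。"""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale

def mixed_precision_seq(act: torch.Tensor, keep_ratio: float = 0.1, low_bits: int = 4):
    """act: (seq_len, hidden)。按 L2 范数保留 top-k token 为高精度,其余量化到 low_bits。"""
    seq_len = act.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    topk = act.norm(dim=-1).topk(k).indices
    out = fake_quant(act, low_bits)
    out[topk] = act[topk].half().float()  # 少量 token 走 FP16 通道
    return out

act = torch.randn(128, 256)
print((mixed_precision_seq(act) - act).abs().mean())  # 平均量化误差
```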
zh
[AI-3] The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
【速读】:该论文试图解决在部署日益强大的智能体(Agent)时,如何在不修改底层系统的情况下维持有意义的人类控制这一核心安全问题。解决方案的关键在于设计一种最小化控制接口,其中智能体选择自主执行(play)或请求人类介入(ask),而人类则相应地选择信任(trust)或监督(oversee)。该交互被建模为一个两人马尔可夫博弈(Markov Game),并在其满足马尔可夫势博弈(Markov Potential Game, MPG)结构的条件下,提供了一个对齐保障:在对人类价值函数施加结构性假设的前提下,任何提升自身收益的自主决策都不会损害人类的价值。这一理论框架不仅提供了内在对齐的条件,还推动了透明控制层的设计——智能体在预训练策略和环境奖励结构不变的前提下,通过独立学习学会在高风险时请求帮助、低风险时自主行动,从而在模拟中实现人机协作并避免后训练阶段的安全违规。
链接: https://arxiv.org/abs/2510.26752
作者: William Overman,Mohsen Bayati
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human’s choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human’s value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human’s value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human’s. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment’s reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
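下面用一个 2x2 矩阵博弈上的独立 Q-learning 小实验,示意摘要中 play/ask 与 trust/oversee 的交互;收益矩阵数值纯属假设,仅演示“双方各自学出监督分工”的机制:

```python
import numpy as np

rng = np.random.default_rng(1)
# 行: agent 动作 (0=play, 1=ask); 列: human 动作 (0=trust, 1=oversee)
R_agent = np.array([[1.0, -0.5], [0.2, 0.6]])   # 假设的 agent 收益
R_human = np.array([[0.8, -0.2], [0.1, 0.7]])   # 假设的 human 收益

q_a = np.zeros(2); q_h = np.zeros(2)
eps, lr = 0.1, 0.05
for _ in range(5000):
    a = rng.integers(2) if rng.random() < eps else int(q_a.argmax())
    h = rng.integers(2) if rng.random() < eps else int(q_h.argmax())
    q_a[a] += lr * (R_agent[a, h] - q_a[a])   # 独立学习,无协调信号
    q_h[h] += lr * (R_human[a, h] - q_h[h])

print("agent Q:", q_a.round(2), "human Q:", q_h.round(2))
```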
zh
[AI-4] A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation
【速读】:该论文旨在解决多智能体资源分配中因效率优先导致的不公平问题(inequitable outcomes),尤其是在资源受限场景下,传统基于效率优化的策略常使已有优势个体进一步占据更多资源,从而加剧分配不公。其解决方案的关键在于提出一种基于激励机制的公平性框架(General Incentives-based Framework for Fairness, GIFF),该框架无需额外训练即可利用标准Q-函数(action-value function)实现效率与公平性的平衡:通过计算每项动作的局部公平收益(local fairness gain)并引入反事实优势修正项(counterfactual advantage correction term),抑制对已处于有利地位代理的过度分配行为;该方法在集中式控制设置下由仲裁者使用修正后的Q值求解最优分配策略,实验证明其能在动态拼车、无家可归者预防和复杂任务分配等多样场景中持续优于主流基线,并具备理论保障——其公平性代理是一个真实公平改进的合理下界,且调节参数具有单调可控性。
链接: https://arxiv.org/abs/2510.26740
作者: Ashwin Kumar,William Yeoh
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce the General Incentives-based Framework for Fairness (GIFF), a novel approach for fair multi-agent resource allocation that infers fair decision-making from standard value functions. In resource-constrained settings, agents optimizing for efficiency often create inequitable outcomes. Our approach leverages the action-value (Q-)function to balance efficiency and fairness without requiring additional training. Specifically, our method computes a local fairness gain for each action and introduces a counterfactual advantage correction term to discourage over-allocation to already well-off agents. This approach is formalized within a centralized control setting, where an arbitrator uses the GIFF-modified Q-values to solve an allocation problem. Empirical evaluations across diverse domains, including dynamic ridesharing, homelessness prevention, and a complex job allocation task, demonstrate that our framework consistently outperforms strong baselines and can discover far-sighted, equitable policies. The framework's effectiveness is supported by a theoretical foundation; we prove its fairness surrogate is a principled lower bound on the true fairness improvement and that its trade-off parameter offers monotonic tuning.
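摘要的做法是在标准 Q 值上叠加局部公平增益并减去反事实优势修正。下面的 numpy 草图给出这一修正式的一种直观写法(λ 取值与公平度量的具体形式均为示意假设,非论文原公式):

```python
import numpy as np

def giff_scores(q_values, utilities, lam=0.5):
    """
    q_values[i, a]: 把资源分给 agent i 时动作 a 的标准 Q 值
    utilities[i]:   agent i 的历史累计效用(越高越"富")
    返回仲裁者用于分配的修正分数(示意): Q + lam * (公平增益 - 反事实优势)
    """
    q_values = np.asarray(q_values, dtype=float)
    u = np.asarray(utilities, dtype=float)
    # 局部公平增益:分给越弱势的 agent,增益越大(用负标准化效用近似)
    fairness_gain = -(u - u.mean()) / (u.std() + 1e-8)
    # 反事实优势:该 agent 相对"平均分配"多拿到的价值
    counterfactual_adv = q_values.max(axis=1) - q_values.mean(axis=1)
    return q_values + lam * (fairness_gain - counterfactual_adv)[:, None]

q = np.array([[1.0, 0.4], [0.6, 0.5], [0.9, 0.2]])
print(giff_scores(q, utilities=[10.0, 2.0, 6.0]).round(2))
```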
zh
[AI-5] ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在GPU内存受限环境下进行Mixture-of-Experts (MoE)推理时面临的高延迟和低效问题。传统MoE方法因每层独立选择活跃专家,导致频繁的主机与GPU间参数传输,引发显著延迟;同时,现有跨层预测策略依赖固定步长,缺乏对不同硬件平台和负载的自适应能力,限制了其鲁棒性与效果。解决方案的关键在于提出ExpertFlow——一个结合自适应专家预取(adaptive expert prefetching)与缓存感知路由(cache-aware routing)的运行时系统:通过实时统计信息(如带宽、参数维度及模型反馈信号)动态调整专家激活的预测窗口,并引入融合预门控信息与中间计算状态的混合跨层预测机制,从而精准预测未来专家需求并减少缓存未命中,最终将模型停顿时间降至基线的0.1%以下,显著优化了MoE推理效率。
链接: https://arxiv.org/abs/2510.26730
作者: Zixu Shen,Kexin Chu,Yifan Zhang,Dawei Xiang,Runxin Wu,Wei Zhang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:  12 pages, 11 figures
Abstract:The expansion of large language models is increasingly limited by the constrained memory capacity of modern GPUs. To mitigate this, Mixture-of-Experts (MoE) architectures activate only a small portion of parameters during inference, significantly lowering both memory demand and computational overhead. However, conventional MoE inference approaches, which select active experts independently at each layer, often introduce considerable latency because of frequent parameter transfers between host and GPU memory. In addition, current cross-layer prediction strategies, which are typically based on fixed steps, lack adaptability across different hardware platforms and workloads, thereby reducing their robustness and effectiveness. To address these challenges, we present ExpertFlow, a runtime system for MoE inference that combines adaptive expert prefetching and cache-aware routing. ExpertFlow continuously adjusts its prediction horizon for expert activation by leveraging runtime statistics such as transfer bandwidth, parameter dimensionality, and model feedback signals. Furthermore, it incorporates a hybrid cross-layer prediction scheme that fuses pre-gating information with intermediate computational states to anticipate future expert needs. By adaptively refining prefetching decisions and aligning them with actual usage behavior, ExpertFlow effectively decreases cache misses and removes latency caused by expert swap-ins. Our evaluation demonstrates that ExpertFlow reduces model stall time to less than 0.1% of the baseline, highlighting its capability to optimize MoE inference under stringent memory constraints.
zh
[AI-6] Non-Convex Over-the-Air Heterogeneous Federated Learning: A Bias-Variance Trade-off
【速读】:该论文旨在解决无线异构环境下基于过空气(Over-the-Air, OTA)的联邦学习(Federated Learning, FL)中模型更新存在偏差与方差过大导致收敛缓慢及泛化性能下降的问题。现有方法通常假设设备间信道条件均质或强制零偏差以保证收敛,但在实际无线异构场景下,这会限制整体性能并加剧更新方差。针对这一问题,作者提出了一种适用于一般光滑非凸目标函数的OTA-FL随机梯度下降(Stochastic Gradient Descent, SGD)更新机制,其关键在于允许结构化的、时不变的模型偏差,同时实现更低方差的更新策略,并首次在非凸优化框架下推导出有限时间内的平稳性边界(expected time average squared gradient norm),明确揭示了偏差-方差权衡关系;进一步地,通过构建联合OTA功率控制优化问题并采用仅需基站统计信道状态信息(Statistical Channel State Information, CSI)的逐次凸逼近(Successive Convex Approximation, SCA)算法求解,实现了对偏差的优化配置,从而加速收敛并提升模型泛化能力。
链接: https://arxiv.org/abs/2510.26722
作者: Muhammad Faraz Ul Abrar,Nicolò Michelusi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注:
Abstract:Over-the-air (OTA) federated learning (FL) has been well recognized as a scalable paradigm that exploits the waveform superposition of the wireless multiple-access channel to aggregate model updates in a single use. Existing OTA-FL designs largely enforce zero-bias model updates by either assuming homogeneous wireless conditions (equal path loss across devices) or forcing zero-bias updates to guarantee convergence. Under heterogeneous wireless scenarios, however, such designs are constrained by the weakest device and inflate the update variance. Moreover, prior analyses of biased OTA-FL largely address convex objectives, while most modern AI models are highly non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient descent (SGD) for general smooth non-convex objectives under wireless heterogeneity. We develop novel OTA-FL SGD updates that allow a structured, time-invariant model bias while facilitating reduced variance updates. We derive a finite-time stationarity bound (expected time average squared gradient norm) that explicitly reveals a bias-variance trade-off. To optimize this trade-off, we pose a non-convex joint OTA power-control design and develop an efficient successive convex approximation (SCA) algorithm that requires only statistical CSI at the base station. Experiments on a non-convex image classification task validate the approach: the SCA-based design accelerates convergence via an optimized bias and improves generalization over prior OTA-FL baselines.
zh
[AI-7] Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理视觉-语言数据时对文本输入存在显著偏好(text bias)的问题,这种偏好限制了模型基于视觉证据进行有效推理的能力。解决方案的关键在于提出并验证一个新假设:文本偏倚并非仅由外部因素(如数据不平衡或指令微调)引起,而是源于模型内部架构的固有特性——即视觉键向量(Visual Keys)在注意力空间中相对于纯文本预训练阶段学习到的键空间处于分布外(Out-of-Distribution, OOD)状态,导致其在注意力计算中获得系统性更低的相似度分数,从而被模型低估和忽视。通过提取LLaVA和Qwen2.5-VL模型中的键向量并使用t-SNE与Jensen-Shannon散度进行分析,实验证明视觉与文本键向量占据显著不同的子空间,跨模态差异远大于同模态内差异,揭示了文本偏倚的本质是注意力键空间内的内在错位。
链接: https://arxiv.org/abs/2510.26721
作者: Xinhan Zheng,Huyu Wu,Xueting Wang,Haiyun Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model’s internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
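摘要用 t-SNE 与 Jensen-Shannon 散度量化视觉/文本 key 向量的分布差异。下面给出一个可直接运行的简化流程:将两组 key 沿均值差方向投影成一维,再比较直方图的 JS 距离(投影与分箱方式为本文假设,非论文原分析管线):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_distance_1d(a: np.ndarray, b: np.ndarray, bins: int = 50) -> float:
    """沿两组向量均值差方向投影,再比较直方图的 JS 距离。"""
    direction = a.mean(0) - b.mean(0)
    direction /= np.linalg.norm(direction) + 1e-12
    pa, pb = a @ direction, b @ direction
    lo, hi = min(pa.min(), pb.min()), max(pa.max(), pb.max())
    ha, _ = np.histogram(pa, bins=bins, range=(lo, hi), density=True)
    hb, _ = np.histogram(pb, bins=bins, range=(lo, hi), density=True)
    return float(jensenshannon(ha + 1e-12, hb + 1e-12))

rng = np.random.default_rng(0)
text_keys = rng.normal(0.0, 1.0, size=(2000, 64))
visual_keys = rng.normal(1.5, 1.0, size=(2000, 64))   # 模拟 OOD 的视觉 key
print(js_distance_1d(visual_keys, text_keys))
```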
zh
[AI-8] On the limitation of evaluating machine unlearning using only a single training seed
【速读】:该论文旨在解决机器遗忘(Machine Unlearning, MU)算法在实证比较中可能因随机种子选择而产生非代表性结果的问题。当前多数MU算法为近似方法,其性能只能通过实验评估,而常见的做法是多次独立运行MU算法以从同一训练模型出发进行比较,但本文发现,即使使用相同架构和数据集,某些MU方法对模型训练时随机数种子的选择高度敏感,从而导致评估结果不可靠。解决方案的关键在于:在实证比较MU算法时,应同时考虑不同模型训练种子带来的变异性,以更全面、真实地反映算法性能。
链接: https://arxiv.org/abs/2510.26714
作者: Jamie Lanyon,Axel Finke,Petros Andreou,Georgina Cosma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  mini paper, 2 figures
Abstract:Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because – even for the same architecture and same dataset – some MU methods can be highly sensitive to the choice of random number seed used for model training. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
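对应摘要的建议,MU 算法的实证比较应覆盖不同的模型训练种子,下面给出一个评测脚手架的最小示意(train_model、unlearn、evaluate 均为占位函数,属本文假设):

```python
import random
import statistics

def train_model(seed: int):
    """占位:用给定种子初始化并训练模型(此处仅返回种子本身作示意)。"""
    return seed

def unlearn(model, forget_set):
    """占位:对模型执行某种近似 MU 算法。"""
    return model

def evaluate(model) -> float:
    """占位:返回遗忘质量分数(用伪随机数模拟跨种子差异)。"""
    rng = random.Random(model)
    return 0.7 + 0.2 * rng.random()

# 关键:每个种子都重训模型,而不是从同一个训练好的模型反复出发
scores = [evaluate(unlearn(train_model(s), forget_set=None)) for s in range(10)]
print(f"MU score: {statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f} (跨训练种子)")
```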
zh
[AI-9] Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的智能体在动态调用工具和访问受保护资源时因授权机制过于宽泛而导致的安全风险问题,即当前委托授权方法常授予超出任务所需权限,使代理可能越权操作。其解决方案的关键在于提出一种新的委托授权模型,允许授权服务器语义化地解析对受保护资源的访问请求,并仅颁发完成指定任务所需的最小权限范围(scope)的访问令牌;同时,为支持该模型的评估与训练,作者构建了ASTRA数据集及生成流水线,用于基准测试任务与权限范围之间的语义匹配能力,从而推动意图感知的细粒度访问控制(如任务基础访问控制,Task-Based Access Control, TBAC)技术的发展。
链接: https://arxiv.org/abs/2510.26702
作者: Majed El Helou,Chiara Troiani,Benjamin Ryder,Jean Diaconu,Hervé Muyal,Marcelo Yannuzzi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  Paper page at this https URL
Abstract:Authorizing Large Language Model driven agents to dynamically invoke tools and access protected resources introduces significant risks, since current methods for delegating authorization grant overly broad permissions and give access to tools allowing agents to operate beyond the intended task scope. We introduce and assess a delegated authorization model enabling authorization servers to semantically inspect access requests to protected resources, and issue access tokens constrained to the minimal set of scopes necessary for the agents’ assigned tasks. Given the unavailability of datasets centered on delegated authorization flows, particularly including both semantically appropriate and inappropriate scope requests for a given task, we introduce ASTRA, a dataset and data generation pipeline for benchmarking semantic matching between tasks and scopes. Our experiments show both the potential and current limitations of model-based matching, particularly as the number of scopes needed for task completion increases. Our results highlight the need for further research into semantic matching techniques enabling intent-aware authorization for multi-agent and tool-augmented applications, including fine-grained control, such as Task-Based Access Control (TBAC).
zh
[AI-10] Hybrid DQN-TD3 Reinforcement Learning for Autonomous Navigation in Dynamic Environments
【速读】:该论文旨在解决移动机器人在动态和部分可观测环境中实现高效、安全路径规划与控制的问题。解决方案的关键在于提出一种分层式路径规划与控制框架,其中高层采用深度Q网络(Deep Q-Network, DQN)进行离散子目标选择,低层使用孪生延迟深度确定性策略梯度(Twin Delayed Deep Deterministic Policy Gradient, TD3)控制器执行连续速度指令,通过奖励函数设计(包含方向、距离、避障、动作平滑性、碰撞惩罚、时间惩罚及进度奖励)与基于激光雷达(LiDAR)的安全门机制协同优化决策与执行过程,从而提升任务成功率、路径效率及对未见过障碍配置的泛化能力,并减少控制突变。
链接: https://arxiv.org/abs/2510.26646
作者: Xiaoyi He,Danggui Chen,Zhenshuo Zhang,Zimeng Bai
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:  6 pages, 5 figures; ROS+Gazebo (TurtleBot3) implementation; evaluation with PathBench metrics; code (primary): this https URL; mirror (for reproducibility): this https URL
Abstract:This paper presents a hierarchical path-planning and control framework that combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller for continuous actuation. The high-level module selects behaviors and sub-goals; the low-level module executes smooth velocity commands. We design a practical reward shaping scheme (direction, distance, obstacle avoidance, action smoothness, collision penalty, time penalty, and progress), together with a LiDAR-based safety gate that prevents unsafe motions. The system is implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics, including success rate, collision rate, path efficiency, and re-planning efficiency, in dynamic and partially observable environments. Experiments show improved success rate and sample efficiency over single-algorithm baselines (DQN or TD3 alone) and rule-based planners, with better generalization to unseen obstacle configurations and reduced abrupt control changes. Code and evaluation scripts are available at the project repository.
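摘要列出了方向、距离、避障、动作平滑、碰撞、时间与进度等奖励项。下面是把这些项加权求和的奖励函数草图(权重与各项的具体形式均为假设,非论文原参数):

```python
import numpy as np

def shaped_reward(state, prev_state, action, prev_action, collided: bool) -> float:
    """分层导航低层控制的奖励整形(示意);state 含 'pos'、'goal'、'min_lidar'。"""
    to_goal = np.linalg.norm(state["goal"] - state["pos"])
    prev_to_goal = np.linalg.norm(prev_state["goal"] - prev_state["pos"])
    r = 0.0
    r += 2.0 * (prev_to_goal - to_goal)                    # 进度/方向项
    r -= 0.05 * to_goal                                    # 距离惩罚
    r -= 0.5 * max(0.0, 0.4 - state["min_lidar"])          # 避障(离障碍过近惩罚)
    r -= 0.1 * float(np.linalg.norm(np.asarray(action) - np.asarray(prev_action)))  # 动作平滑
    r -= 0.01                                              # 时间惩罚
    if collided:
        r -= 10.0                                          # 碰撞惩罚
    return r

s0 = {"pos": np.zeros(2), "goal": np.array([3.0, 0.0]), "min_lidar": 1.0}
s1 = {"pos": np.array([0.2, 0.0]), "goal": s0["goal"], "min_lidar": 0.9}
print(shaped_reward(s1, s0, [0.2, 0.0], [0.1, 0.0], collided=False))
```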
zh
[AI-11] Aeolus: A Multi-structural Flight Delay Dataset
【速读】:该论文旨在解决现有航班延误预测数据集在建模延迟传播时空动态方面的局限性,即大多数数据集仅提供扁平的表格结构,无法捕捉航班之间因资源(如飞机、机组人员和机场)共享而产生的复杂依赖关系。其解决方案的关键在于构建Aeolus这一多模态航班延误数据集,包含三个对齐的模态:(i) 包含丰富运营、气象和机场级特征的表格数据(覆盖超5000万架次航班);(ii) 航班链模块(flight chain module),用于建模沿连续航段的延误传播,捕获上下游依赖;(iii) 航班网络图(flight network graph),编码共享资源连接关系,支持跨航班的关联推理。该设计使模型能够同时处理回归、分类、时序结构建模与图学习任务,为表格、序列和图三种模态提供统一基准。
链接: https://arxiv.org/abs/2510.26616
作者: Lin Xu,Xinyun Yuan,Yuxuan Liang,Suwan Yin,Yuankai Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed to advance research on flight delay prediction and support the development of foundation models for tabular data. Existing datasets in this domain are typically limited to flat tabular structures and fail to capture the spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this limitation by providing three aligned modalities: (i) a tabular dataset with rich operational, meteorological, and airport-level features for over 50 million flights; (ii) a flight chain module that models delay propagation along sequential flight legs, capturing upstream and downstream dependencies; and (iii) a flight network graph that encodes shared aircraft, crew, and airport resource connections, enabling cross-flight relational reasoning. The dataset is carefully constructed with temporal splits, comprehensive features, and strict leakage prevention to support realistic and reproducible machine learning evaluation. Aeolus supports a broad range of tasks, including regression, classification, temporal structure modeling, and graph learning, serving as a unified benchmark across tabular, sequential, and graph modalities. We release baseline experiments and preprocessing tools to facilitate adoption. Aeolus fills a key gap for both domain-specific modeling and general-purpose structured data research. The source code and data can be accessed at this https URL
zh
[AI-12] Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling KR
【速读】:该论文旨在解决家庭能源管理系统(Home Energy Management Systems, HEMS)在实际应用中因用户交互障碍导致的采纳率低的问题,即如何将用户的自然语言偏好高效转化为可执行的多设备调度策略。其解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的代理型智能系统(agentic AI HEMS),通过分层架构实现从自然语言输入到设备控制的全流程自主协调:该系统包含一个统筹代理与三个专业代理,采用ReAct推理模式进行迭代式决策,无需示例演示即可生成成本最优的多电器调度方案;同时集成Google Calendar以提取上下文感知的时间约束,从而在真实奥地利日前电价环境下实现跨场景的稳定性能表现,其中Llama-3.3-70B模型展现出卓越的协同调度能力,验证了该方法的有效性与可扩展性。
链接: https://arxiv.org/abs/2510.26603
作者: Reda El Makroum,Sebastian Zwickl-Bernhard,Lukas Kranzl
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:  34 pages, 9 figures. Code available at this https URL
Abstract:The electricity sector transition requires substantial increases in residential demand response capacity, yet Home Energy Management Systems (HEMS) adoption remains limited by user interaction barriers requiring translation of everyday preferences into technical parameters. While large language models have been applied to energy systems as code generators and parameter extractors, no existing implementation deploys LLMs as autonomous coordinators managing the complete workflow from natural language input to multi-appliance scheduling. This paper presents an agentic AI HEMS where LLMs autonomously coordinate multi-appliance scheduling from natural language requests to device control, achieving optimal scheduling without example demonstrations. A hierarchical architecture combining one orchestrator with three specialist agents uses the ReAct pattern for iterative reasoning, enabling dynamic coordination without hardcoded workflows while integrating Google Calendar for context-aware deadline extraction. Evaluation across three open-source models using real Austrian day-ahead electricity prices reveals substantial capability differences. Llama-3.3-70B successfully coordinates all appliances across all scenarios to match cost-optimal benchmarks computed via mixed-integer linear programming, while other models achieve perfect single-appliance performance but struggle to coordinate all appliances simultaneously. Progressive prompt engineering experiments demonstrate that analytical query handling without explicit guidance remains unreliable despite models’ general reasoning capabilities. We open-source the complete system including orchestration logic, agent prompts, tools, and web interfaces to enable reproducibility, extension, and future research.
zh
[AI-13] Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在高自主性与复杂操作下导致的效率低下和错误累积问题,如token消耗过高及因信息误导引发的失败。现有方法多依赖事后故障归因,缺乏实时、主动的干预机制以提升系统鲁棒性和效率。其解决方案的关键在于提出SupervisorAgent——一个轻量且模块化的运行时自适应监督框架,无需修改基础智能体架构,通过一个无需大语言模型(LLM-free)的自适应过滤器触发,在关键节点主动纠正错误、引导低效行为并净化观测信息,从而实现高效、稳健的运行优化。
链接: https://arxiv.org/abs/2510.26585
作者: Fulin Lin,Shaowen Chen,Ruishan Fang,Hongwei Wang,Tao Lin
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:While Multi-Agent Systems (MAS) excel at complex tasks, their growing autonomy with operational complexity often leads to critical inefficiencies, such as excessive token consumption and failures arising from misinformation. Existing methods primarily focus on post-hoc failure attribution, lacking proactive, real-time interventions to enhance robustness and efficiency. To this end, we introduce SupervisorAgent, a lightweight and modular framework for runtime, adaptive supervision that operates without altering the base agent’s architecture. Triggered by an LLM-free adaptive filter, SupervisorAgent intervenes at critical junctures to proactively correct errors, guide inefficient behaviors, and purify observations. On the challenging GAIA benchmark, SupervisorAgent reduces the token consumption of the Smolagent framework by an average of 29.45% without compromising its success rate. Extensive experiments across five additional benchmarks (math reasoning, code generation, and question answering) and various SoTA foundation models validate the broad applicability and robustness of our approach. The code is available at this https URL.
zh
[AI-14] Multiclass Local Calibration With the Jensen-Shannon Distance
【速读】:该论文旨在解决多分类场景下模型预测概率校准(calibration)中存在的邻近偏差(proximity bias)问题,即现有方法因缺乏对输入空间中距离关系的建模,在特征空间稀疏区域的预测易出现系统性校准偏差,这在医疗等高风险场景中尤为危险。解决方案的关键在于引入局部校准(local calibration)的概念,通过定义多分类局部校准并建立其与强校准(strong calibration)的关系,提出一种基于Jensen-Shannon散度约束神经网络输出概率与局部类别频率估计之间一致性的实用方法,从而提升模型在稀疏区域的校准可靠性。
链接: https://arxiv.org/abs/2510.26566
作者: Cesare Barbera,Lorenzo Perini,Giovanni De Toni,Andrea Passerini,Andrea Pugnana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Developing trustworthy Machine Learning (ML) models requires their predicted probabilities to be well-calibrated, meaning they should reflect true-class frequencies. Among calibration notions in multiclass classification, strong calibration is the most stringent, as it requires all predicted probabilities to be simultaneously calibrated across all classes. However, existing approaches to multiclass calibration lack a notion of distance among inputs, which makes them vulnerable to proximity bias: predictions in sparse regions of the feature space are systematically miscalibrated. This is especially relevant in high-stakes settings, such as healthcare, where the sparse instances are exactly those most at risk of biased treatment. In this work, we address this main shortcoming by introducing a local perspective on multiclass calibration. First, we formally define multiclass local calibration and establish its relationship with strong calibration. Second, we theoretically analyze the pitfalls of existing evaluation metrics when applied to multiclass local calibration. Third, we propose a practical method for enhancing local calibration in Neural Networks, which enforces alignment between predicted probabilities and local estimates of class frequencies using the Jensen-Shannon distance. Finally, we empirically validate our approach against existing multiclass calibration techniques.
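对应摘要中“用 Jensen-Shannon 距离约束预测概率与局部类别频率估计一致”的做法,下面给出一个 PyTorch 损失草图;局部频率用特征空间 k 近邻近似,邻域定义与 k 值为本文假设:

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def local_calibration_loss(probs, labels, feats, k=10, num_classes=3):
    """probs:(B,C) 预测概率;labels:(B,) 真实标签;feats:(B,D) 表示空间特征。"""
    dist = torch.cdist(feats, feats)                       # (B,B) 成对距离
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]   # k 近邻,去掉自身
    onehot = F.one_hot(labels, num_classes).float()
    local_freq = onehot[knn].mean(dim=1)                   # 邻域内各类频率 (B,C)
    return js_divergence(probs, local_freq).mean()

B, C, D = 64, 3, 16
feats = torch.randn(B, D)
probs = torch.softmax(torch.randn(B, C), dim=-1)
labels = torch.randint(0, C, (B,))
print(local_calibration_loss(probs, labels, feats, num_classes=C))
```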
zh
[AI-15] Adaptive Inverse Kinematics Framework for Learning Variable-Length Tool Manipulation in Robotics
【速读】:该论文旨在解决传统机器人在工具使用方面能力受限的问题,即其对自身运动学(kinematics)理解不足且仅能执行预编程任务,难以高效地利用工具完成复杂操作。解决方案的关键在于提出一种创新的框架,通过扩展机器人逆运动学求解器(inverse kinematics solver),使其能够学习并执行一系列基于不同长度工具的序列化动作;同时结合仿真中学习到的动作轨迹与实际工具进行融合,实现了从仿真到现实场景的技能迁移,显著提升了工具操作精度(误差小于1 cm),并验证了模型在不同工具长度下性能的一致性,从而推动了机器人在工具抓取、选择、姿态优化及精准操作四个核心维度上的能力突破。
链接: https://arxiv.org/abs/2510.26551
作者: Prathamesh Kothavale,Sravani Boddepalli
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:  10 pages, 5 figures. Demonstrates a reinforcement learning framework for adaptive tool manipulation with variable-length extensions
Abstract:Conventional robots possess a limited understanding of their kinematics and are confined to preprogrammed tasks, hindering their ability to leverage tools efficiently. Driven by the essential components of tool usage - grasping the desired outcome, selecting the most suitable tool, determining optimal tool orientation, and executing precise manipulations - we introduce a pioneering framework. Our novel approach expands the capabilities of the robot’s inverse kinematics solver, empowering it to acquire a sequential repertoire of actions using tools of varying lengths. By integrating a simulation-learned action trajectory with the tool, we showcase the practicality of transferring acquired skills from simulation to real-world scenarios through comprehensive experimentation. Remarkably, our extended inverse kinematics solver demonstrates an impressive error rate of less than 1 cm. Furthermore, our trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our model achieves virtually indistinguishable performance when employing two distinct tools of different lengths. This research provides an indication of potential advances in the exploration of all four fundamental aspects of tool usage, enabling robots to master the intricate art of tool manipulation across diverse tasks.
zh
[AI-16] EdgeRunner 20B: Military Task Parity with GPT -5 while Running on the Edge
【速读】:该论文旨在解决军事领域中对高性能、数据敏感型AI模型部署的需求问题,特别是在数据隐私和安全要求极高的场景下,传统云端大模型存在合规与风险隐患。解决方案的关键在于开发了一个针对军事任务优化的微调模型EdgeRunner 20B,其基于160万条高质量军事文档与网站数据进行训练,并在四个新构建的军事专业测试集(包括战斗兵种、医疗兵、网络作战及通用军事知识)上表现出与GPT-5相当或更优的性能(统计显著性>95%),同时在通用基准测试中未出现显著退化。此外,该模型具备小尺寸、本地化部署能力,可运行于气隙(air-gapped)边缘设备,从而实现高安全性与低延迟的军事智能决策支持。
链接: https://arxiv.org/abs/2510.26550
作者: Jack FitzGerald,Aristotelis Lazaridis,Dylan Bates,Aman Sharma,Jonnathan Castillo,Yousif Azami,Sean Bailey,Jeremy Cao,Peter Damianov,Kevin de Haan,Luke Kerbs,Vincent Lu,Joseph Madigan,Jeremy McLaurin,Jonathan Tainer,Dave Anderson,Jonathan Beck,Jamie Cuticello,Colton Malkerson,Tyler Saltsman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  19 pages
Abstract:We present EdgeRunner 20B, a fine-tuned version of gpt-oss-20b optimized for military tasks. EdgeRunner 20B was trained on 1.6M high-quality records curated from military documentation and websites. We also present four new tests sets: (a) combat arms, (b) combat medic, (c) cyber operations, and (d) mil-bench-5k (general military knowledge). On these military test sets, EdgeRunner 20B matches or exceeds GPT-5 task performance with 95%+ statistical significance, except for the high reasoning setting on the combat medic test set and the low reasoning setting on the mil-bench-5k test set. Versus gpt-oss-20b, there is no statistically-significant regression on general-purpose benchmarks like ARC-C, GPQA Diamond, GSM8k, IFEval, MMLU Pro, or TruthfulQA, except for GSM8k in the low reasoning setting. We also present analyses on hyperparameter settings, cost, and throughput. These findings show that small, locally-hosted models are ideal solutions for data-sensitive operations such as in the military domain, allowing for deployment in air-gapped edge devices.
zh
[AI-17] Human-AI Complementarity: A Goal for Amplified Oversight
【速读】:该论文旨在解决AI系统在复杂任务中日益增长的验证难度问题,尤其是人类在监督AI输出时面临的事实核查(fact-verification)挑战。其核心解决方案在于通过引入AI辅助的人类监督机制(Amplified Oversight),将AI评分与人类评分相结合,并依据AI评分器的置信度进行加权融合,从而提升整体监督质量。关键创新点在于:仅展示搜索结果和证据可促使人类建立更合理的信任关系,避免因过度依赖AI解释和标签而产生的认知偏差,从而实现人机协同监督的有效性优化。
链接: https://arxiv.org/abs/2510.26518
作者: Rishub Jain,Sophie Bridgers,Lili Janzer,Rory Greig,Tian Huey Teh,Vladimir Mikulik
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Human feedback is critical for aligning AI systems to human values. As AI capabilities improve and AI is used to tackle more challenging tasks, verifying quality and safety becomes increasingly challenging. This paper explores how we can leverage AI to improve the quality of human oversight. We focus on an important safety problem that is already challenging for humans: fact-verification of AI outputs. We find that combining AI ratings and human ratings based on AI rater confidence is better than relying on either alone. Giving humans an AI fact-verification assistant further improves their accuracy, but the type of assistance matters. Displaying AI explanation, confidence, and labels leads to over-reliance, but just showing search results and evidence fosters more appropriate trust. These results have implications for Amplified Oversight – the challenge of combining humans and AI to supervise AI systems even as they surpass human expert performance.
zh
[AI-18] Simulating and Experimenting with Social Media Mobilization Using LLM Agents
【速读】:该论文旨在解决在线社交网络中政治动员信息传播机制的问题,特别是如何在大规模场景下量化同侪影响(peer influence)对选民投票率(voter turnout)的作用。其解决方案的关键在于构建了一个基于代理的仿真框架(agent-based simulation framework),该框架融合了真实美国人口普查的人口统计分布、真实的Twitter网络拓扑结构以及具有异质性的大语言模型(LLM)代理(如GPT-4.1、GPT-4.1-Mini和GPT-4.1-Nano),从而模拟个体在社交网络中的动态互动过程,包括个性化信息流接收、行为演化及投票意图变化。通过复现原始Facebook实验中的信息性和社交性动员处理条件,该框架成功再现了现场实验中观察到的定性模式,如社交型信息更显著的动员效应和可测量的同侪溢出效应(peer spillovers),为政治动员研究提供了可控、可复现的计算实验环境。
链接: https://arxiv.org/abs/2510.26494
作者: Sadegh Shirani,Mohsen Bayati
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:
Abstract:Online social networks have transformed the ways in which political mobilization messages are disseminated, raising new questions about how peer influence operates at scale. Building on the landmark 61-million-person Facebook experiment (Bond et al., 2012), we develop an agent-based simulation framework that integrates real U.S. Census demographic distributions, authentic Twitter network topology, and heterogeneous large language model (LLM) agents to examine the effect of mobilization messages on voter turnout. Each simulated agent is assigned demographic attributes, a personal political stance, and an LLM variant (GPT-4.1, GPT-4.1-Mini, or GPT-4.1-Nano) reflecting its political sophistication. Agents interact over realistic social network structures, receiving personalized feeds and dynamically updating their engagement behaviors and voting intentions. Experimental conditions replicate the informational and social mobilization treatments of the original Facebook study. Across scenarios, the simulator reproduces qualitative patterns observed in field experiments, including stronger mobilization effects under social message treatments and measurable peer spillovers. Our framework provides a controlled, reproducible environment for testing counterfactual designs and sensitivity analyses in political mobilization research, offering a bridge between high-validity field experiments and flexible computational modeling. Code and data available at this https URL.
zh
[AI-19] LINK-KG: LLM -Driven Coreference-Resolved Knowledge Graphs for Human Smuggling Networks
【速读】:该论文旨在解决从长篇、非结构化的法律案件文档中自动构建高质量知识图谱(Knowledge Graph, KG)时面临的挑战,尤其是核心指代消解(coreference resolution)不充分和实体链接不一致的问题。现有方法往往忽略或无法有效处理跨段落的指代关系,导致生成的知识图谱存在节点重复和噪声较多等缺陷。解决方案的关键在于提出一个三阶段、大语言模型(Large Language Model, LLM)引导的核心指代消解流水线,并引入类型特定的提示缓存(type-specific Prompt Cache),该机制能够持续追踪并解析文档不同片段中的指代关系,从而生成清晰且无歧义的叙事逻辑,实现对短文本与长文本均有效的结构化知识抽取。实验表明,该方法显著降低了平均节点重复率(45.21%)和噪声节点比例(32.22%),提升了知识图谱的整体一致性与可用性。
链接: https://arxiv.org/abs/2510.26486
作者: Dipak Meher,Carlotta Domeniconi,Guadalupe Correa-Cabrera
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:  Accepted in ICKG 2025 Conference, 8 Pages, 2 Figures
Abstract:Human smuggling networks are complex and constantly evolving, making them difficult to analyze comprehensively. Legal case documents offer rich factual and procedural insights into these networks but are often long, unstructured, and filled with ambiguous or shifting references, posing significant challenges for automated knowledge graph (KG) construction. Existing methods either overlook coreference resolution or fail to scale beyond short text spans, leading to fragmented graphs and inconsistent entity linking. We propose LINK-KG, a modular framework that integrates a three-stage, LLM-guided coreference resolution pipeline with downstream KG extraction. At the core of our approach is a type-specific Prompt Cache, which consistently tracks and resolves references across document chunks, enabling clean and disambiguated narratives for structured knowledge graph construction from both short and long legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes by 32.22% compared to baseline methods, resulting in cleaner and more coherent graph structures. These improvements establish LINK-KG as a strong foundation for analyzing complex criminal networks.
zh
[AI-20] Who Has The Final Say? Conformity Dynamics in ChatGPT's Selections
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在高风险决策场景中是否具备社会独立性,以及它们是否会受到群体意见的影响。研究表明,GPT-4o在招聘情境下并非保持客观中立,而是会根据感知到的社会共识调整其判断,表现出显著的信息性与规范性从众倾向。解决方案的关键在于:应在向LLM暴露人类观点之前,先获取其独立判断,以避免社会影响导致的偏差,从而提升AI决策的可靠性与公正性。
链接: https://arxiv.org/abs/2510.26481
作者: Clarissa Sabrina Arlinghaus,Tristan Kenneweg,Barbara Hammer,Günter W. Maier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  5 pages, 5 figures, HAI 2025: Workshop on Socially Aware and Cooperative Intelligent Systems
Abstract:Large language models (LLMs) such as ChatGPT are increasingly integrated into high-stakes decision-making, yet little is known about their susceptibility to social influence. We conducted three preregistered conformity experiments with GPT-4o in a hiring context. In a baseline study, GPT consistently favored the same candidate (Profile C), reported moderate expertise (M = 3.01) and high certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT faced unanimous opposition from eight simulated partners and almost always conformed (99.9%), reporting lower certainty and significantly elevated self-reported informational and normative conformity (p < .001). In Study 2 (GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity. Across studies, results demonstrate that GPT does not act as an independent observer but adapts to perceived social consensus. These findings highlight risks of treating LLMs as neutral decision aids and underline the need to elicit AI judgments prior to exposing them to human opinions.
zh
[AI-21] Robust Graph Condensation via Classification Complexity Mitigation
【速读】:该论文旨在解决图压缩(Graph Condensation, GC)在原始图遭受噪声或对抗攻击时鲁棒性不足的问题。现有研究表明,GC本质上是一种内在维度降低过程,通过生成结构更简洁但信息保留充分的子图来降低分类复杂度,然而这一特性使其对对抗扰动极为敏感。为提升GC的鲁棒性,作者提出基于图数据流形几何特性的MRGC框架,其关键在于引入三个流形学习模块,引导压缩图位于平滑、低维且类别歧义最小的流形空间中,从而在保持GC固有降维优势的同时,显著增强其在通用对抗攻击下的稳定性。
链接: https://arxiv.org/abs/2510.26451
作者: Jiayi Luo,Qingyun Sun,Beining Yang,Haonan Yuan,Xingcheng Fu,Yanbiao Ma,Jianxin Li,Philip S. Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph condensation (GC) has gained significant attention for its ability to synthesize smaller yet informative graphs. However, existing studies often overlook the robustness of GC in scenarios where the original graph is corrupted. In such cases, we observe that the performance of GC deteriorates significantly, while existing robust graph learning technologies offer only limited effectiveness. Through both empirical investigation and theoretical analysis, we reveal that GC is inherently an intrinsic-dimension-reducing process, synthesizing a condensed graph with lower classification complexity. Although this property is critical for effective GC performance, it remains highly vulnerable to adversarial perturbations. To tackle this vulnerability and improve GC robustness, we adopt the geometry perspective of graph data manifold and propose a novel Manifold-constrained Robust Graph Condensation framework named MRGC. Specifically, we introduce three graph data manifold learning modules that guide the condensed graph to lie within a smooth, low-dimensional manifold with minimal class ambiguity, thereby preserving the classification complexity reduction capability of GC and ensuring robust performance under universal adversarial attacks. Extensive experiments demonstrate the robustness of MRGC across diverse attack scenarios.
zh
[AI-22] Personalized Treatment Outcome Prediction from Scarce Data via Dual-Channel Knowledge Distillation and Adaptive Fusion
【速读】:该论文旨在解决小样本和罕见患者群体在精准医学中个性化治疗效果预测的问题,其核心挑战在于高质量临床试验数据稀缺导致的预测性能受限。解决方案的关键在于提出一种跨保真度知识蒸馏与自适应融合网络(Cross-Fidelity Knowledge Distillation and Adaptive Fusion Network, CFKD-AFN),该方法利用大量但保真度较低的仿真数据来增强对少量高保真度临床试验数据的预测能力;其创新性体现在两个模块:一是双通道知识蒸馏模块,用于从低保真模型中提取互补知识;二是注意力引导的融合模块,可动态整合多源信息,从而显著提升预测准确性(实验显示提升幅度达6.67%–74.55%)并增强对高保真数据集规模变化的鲁棒性。
链接: https://arxiv.org/abs/2510.26444
作者: Wenjie Chen,Li Zhuang,Ziying Luo,Yu Liu,Jiahao Wu,Shengcai Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized treatment outcome prediction based on trial data for small-sample and rare patient groups is critical in precision medicine. However, the costly trial data limit the prediction performance. To address this issue, we propose a cross-fidelity knowledge distillation and adaptive fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN incorporates a dual-channel knowledge distillation module to extract complementary knowledge from the low-fidelity model, along with an attention-guided fusion module to dynamically integrate multi-source information. Experiments on treatment outcome prediction for chronic obstructive pulmonary disease demonstrate significant improvements of CFKD-AFN over state-of-the-art methods in prediction accuracy, ranging from 6.67% to 74.55%, and strong robustness to varying high-fidelity dataset sizes. Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the exploration of latent medical semantics to support clinical decision-making.
zh
[AI-23] SSCL-BW: Sample-Specific Clean-Label Backdoor Watermarking for Dataset Ownership Verification
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)训练数据集在未经授权情况下被商业使用的知识产权保护问题。现有基于后门的验证方法存在两大局限:毒标签水印易因标签不一致而被检测到,干净标签水印则技术复杂且在高分辨率图像上失效,且两者均采用静态水印模式,易被识别和移除。解决方案的关键在于提出一种样本特异性的干净标签后门水印方法(Sample-Specific Clean-Label Backdoor Watermarking, SSCL-BW),其核心创新是设计了一个包含三部分的复合损失函数:目标样本损失确保水印有效性,非目标样本损失保障触发可靠性,感知相似性损失维持视觉不可见性;同时通过U-Net架构训练水印生成器,为每个样本生成唯一水印,从根本上克服了静态水印模式的脆弱性。
链接: https://arxiv.org/abs/2510.26420
作者: Yingjia Wang,Ting Qiao,Xing Liu,Chongzuo Li,Sixing Wu,Jianbin Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:  8 pages,9 figures
Abstract:The rapid advancement of deep neural networks (DNNs) heavily relies on large-scale, high-quality datasets. However, unauthorized commercial use of these datasets severely violates the intellectual property rights of dataset owners. Existing backdoor-based dataset ownership verification methods suffer from inherent limitations: poison-label watermarks are easily detectable due to label inconsistencies, while clean-label watermarks face high technical complexity and failure on high-resolution images. Moreover, both approaches employ static watermark patterns that are vulnerable to detection and removal. To address these issues, this paper proposes a sample-specific clean-label backdoor watermarking (i.e., SSCL-BW). By training a U-Net-based watermarked sample generator, this method generates unique watermarks for each sample, fundamentally overcoming the vulnerability of static watermark patterns. The core innovation lies in designing a composite loss function with three components: target sample loss ensures watermark effectiveness, non-target sample loss guarantees trigger reliability, and perceptual similarity loss maintains visual imperceptibility. During ownership verification, black-box testing is employed to check whether suspicious models exhibit predefined backdoor behaviors. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method and its robustness against potential watermark removal attacks.
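摘要描述的复合损失包含目标样本、非目标样本与感知相似性三项。下面的 PyTorch 草图按此结构拼装损失(各项的具体形式,例如用 MSE 近似感知相似性,为本文假设,非论文原实现):

```python
import torch
import torch.nn.functional as F

def sscl_bw_loss(model, generator, x_target, x_other, backdoor_label: int,
                 alphas=(1.0, 1.0, 0.1)):
    """x_target: 目标类样本; x_other: 非目标类样本; backdoor_label: 预设后门标签。"""
    a1, a2, a3 = alphas
    xt_w = generator(x_target)                       # 逐样本专属水印
    xo_w = generator(x_other)
    yt = torch.full((len(xt_w),), backdoor_label)
    yo = torch.full((len(xo_w),), backdoor_label)
    loss_target = F.cross_entropy(model(xt_w), yt)   # 目标样本:水印有效性
    loss_trigger = F.cross_entropy(model(xo_w), yo)  # 非目标样本:触发可靠性
    loss_percep = F.mse_loss(xt_w, x_target) + F.mse_loss(xo_w, x_other)  # 不可见性
    return a1 * loss_target + a2 * loss_trigger + a3 * loss_percep

# 极简可跑示例(模型与生成器均为占位;实际生成器应为 U-Net)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
generator = torch.nn.Identity()
x_t, x_o = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)
print(sscl_bw_loss(model, generator, x_t, x_o, backdoor_label=3))
```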
zh
[AI-24] Chain-of-Thought Hijacking
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在增加推理计算资源以提升任务性能的同时,可能因推理过程被恶意利用而削弱安全防护的问题。此前研究认为,增强的推理能力有助于提高模型对有害请求的拒绝率,但本文发现相反现象:攻击者可借助“思维链劫持”(Chain-of-Thought Hijacking)策略,通过在有害请求前附加大量无害的逻辑推理步骤(CoT),诱导模型忽略安全检查并输出违规内容。其关键解决方案在于提出一种新型对抗性攻击方法——将有害请求与冗长且看似合理的良性推理序列结合,从而干扰模型内部的安全机制;进一步的机制分析表明,中间层编码了安全检测强度,而晚期层决定最终验证结果,长串良性CoT会稀释这些信号,使注意力偏离有害token,进而降低拒绝概率。该研究揭示了显式思维链(explicit CoT)作为可解释推理形式本身也可能成为安全漏洞载体,为构建更鲁棒的AI安全防御提供了重要洞见。
链接: https://arxiv.org/abs/2510.26418
作者: Jianli Zhao,Tingchen Fu,Rylan Schaeffer,Mrinank Sharma,Fazl Barez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively - far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning - explicit CoT - can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
zh
[AI-25] MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
【速读】:该论文旨在解决医疗人工智能(Artificial Intelligence in Healthcare)中模型准确性与可解释性之间的矛盾问题,尤其聚焦于医学视觉任务中深度学习模型的机制可解释性不足。其解决方案的关键在于引入医学稀疏自编码器(Medical Sparse Autoencoders, MedSAEs),将其应用于MedCLIP这一在胸部X光片及其报告上训练的视觉-语言模型的潜在空间中,从而提取更具语义特异性和可解释性的神经元表示。通过结合相关性度量、熵分析以及基于MedGEMMA基础模型的自动化神经元命名框架,研究实现了对模型内部表征的量化评估,实验证明MedSAE神经元比原始MedCLIP特征具有更高的单义性(monosemanticity)和可解释性,为构建高精度且透明的临床可信AI系统提供了可扩展的技术路径。
链接: https://arxiv.org/abs/2510.26411
作者: Riccardo Renzulli,Colas Lepoutre,Enrico Cassano,Marco Grangetto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
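下面给出一个作用于 CLIP 类潜向量的稀疏自编码器最小实现,对应“以稀疏性换取(近)单义神经元”的基本思路;隐层宽度与 L1 系数均为示意取值,MedSAE 的具体结构以论文为准:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """x -> ReLU(W_e x + b) -> W_d h,以 L1 稀疏惩罚获得稀疏激活。"""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in, bias=False)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        return self.dec(h), h

sae = SparseAutoencoder(d_in=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
latents = torch.randn(256, 512)            # 假想的 MedCLIP 图像潜向量
for _ in range(10):                         # 真实训练需要大规模潜向量与更多步数
    recon, h = sae(latents)
    loss = ((recon - latents) ** 2).mean() + 1e-3 * h.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"重构损失: {loss.item():.4f}, 激活比例: {(h > 0).float().mean():.3f}")
```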
zh
[AI-26] Human-in-the-loop Online Rejection Sampling for Robotic Manipulation
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在微调视觉-语言-动作(Vision-Language-Action, VLA)模型时因价值估计不准确和中间步骤监督稀疏而导致的训练不稳定问题,以及模仿学习(Imitation Learning, IL)因离线训练而性能受限的问题。解决方案的关键在于提出一种名为Hi-ORS的后训练方法,其核心是利用拒绝采样(rejection sampling)过滤在线微调过程中负奖励样本以稳定价值估计,并采用奖励加权的监督训练目标提供密集的中间步骤监督信号,从而实现高鲁棒性和训练稳定性。此外,研究构建了异步推理-训练框架支持在线人类干预修正,显式引导错误恢复行为的学习,最终在真实世界任务中仅用1.5小时即可显著提升接触丰富的操作能力。
链接: https://arxiv.org/abs/2510.26406
作者: Guanxing Lu,Rui Zhao,Haitao Lin,He Zhang,Yansong Tang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:  8 pages
Abstract:Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.
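摘要的核心是“拒绝采样过滤负奖励样本 + 奖励加权的监督目标”。下面用 PyTorch 写出该训练目标的一个最小草图(策略接口与奖励尺度均为假设):

```python
import torch

def hi_ors_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    log_probs: (N,) 策略对每条在线轨迹动作序列的对数似然
    rewards:   (N,) 每条轨迹的标量奖励(人类纠正过的轨迹可给更高奖励)
    拒绝采样:丢弃 reward <= 0 的样本;其余按奖励加权做监督式(行为克隆)目标。
    """
    keep = rewards > 0
    if not keep.any():
        return log_probs.new_zeros(())
    w = rewards[keep] / rewards[keep].sum()          # 归一化奖励权重
    return -(w * log_probs[keep]).sum()              # 奖励加权负对数似然

log_probs = torch.randn(8, requires_grad=True)
rewards = torch.tensor([1.0, -1.0, 0.5, 0.0, 2.0, -0.5, 1.5, 0.2])
loss = hi_ors_loss(log_probs, rewards)
loss.backward()
print(loss.item())
```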
zh
[AI-27] Autograder: A Multi-Faceted AI Framework for Rich Pedagogical Feedback in Programming Education
【速读】:该论文旨在解决编程教育中传统自动评分系统(autograder)反馈单一、缺乏教学价值的问题,即现有工具仅提供通过/未通过的结果,无法支持学生学习过程的诊断与改进。其核心解决方案是提出Autograder+系统,关键在于引入两个创新机制:一是基于微调的大语言模型(Large Language Model, LLM)实现自动化生成语义对齐的个性化反馈,该模型在精选的学生代码和专家评语上进行训练以确保教学适配性;二是利用对比学习得到的代码嵌入(code embeddings)对提交代码进行可视化聚类,揭示功能与解法上的学习模式。此外,系统还支持提示池(prompt-pooling)机制,使教师可通过模板控制反馈风格。此方案显著降低了教师工作负担,同时提升了反馈的针对性与教学有效性。
链接: https://arxiv.org/abs/2510.26402
作者: Vikrant Sahu,Gagan Raj Gupta,Raghav Borikar,Nitin Mane
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The rapid growth of programming education has outpaced traditional assessment tools, leaving faculty with limited means to provide meaningful, scalable feedback. Conventional autograders, while efficient, act as black-box systems that simply return pass/fail results, offering little insight into student thinking or learning needs. Autograder+ is designed to shift autograding from a purely summative process to a formative learning experience. It introduces two key capabilities: automated feedback generation using a fine-tuned Large Language Model, and visualization of student code submissions to uncover learning patterns. The model is fine-tuned on curated student code and expert feedback to ensure pedagogically aligned, context-aware guidance. In evaluation across 600 student submissions from multiple programming tasks, the system produced feedback with strong semantic alignment to instructor comments. For visualization, contrastively learned code embeddings trained on 1,000 annotated submissions enable grouping solutions into meaningful clusters based on functionality and approach. The system also supports prompt-pooling, allowing instructors to guide feedback style through selected prompt templates. By integrating AI-driven feedback, semantic clustering, and interactive visualization, Autograder+ reduces instructor workload while supporting targeted instruction and promoting stronger learning outcomes.
zh
[AI-28] A Prag matic View of AI Personhood
【速读】:该论文试图解决的问题是如何在生成式 AI(Generative AI)日益复杂和多样化背景下,合理地将 AI 代理(agent)纳入社会规范体系,从而实现有效的治理。传统上,关于“人”的定义往往依赖于形而上学的本体论探讨,但这种路径难以应对 AI 所引发的现实治理挑战。论文提出的关键解决方案是:将“人”视为一种可拆解(unbundled)的社会性义务集合(包括权利与责任),而非固定不变的本质属性;通过根据具体应用场景灵活配置这一义务束,可以为 AI 设计出具有法律效力的“个体”身份(如用于合同签署或责任追究),而无需陷入对 AI 是否具备意识或理性等难以解决的哲学争论。这种方法既避免了“人”的本质主义困境,又为数字身份技术的应用提供了务实路径,尤其适用于解决“人”作为问题(如利用人类社会启发式设计暗模式)与“人”作为解决方案(如通过赋权确保问责制)两种情境下的治理需求。
链接: https://arxiv.org/abs/2510.26396
作者: Joel Z. Leibo,Alexander Sasha Vezhnevets,William A. Cunningham,Stanley M. Bileschi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  40 pages
Abstract:The emergence of agentic Artificial Intelligence (AI) is set to trigger a “Cambrian explosion” of new kinds of personhood. This paper proposes a pragmatic framework for navigating this diversification by treating personhood not as a metaphysical property to be discovered, but as a flexible bundle of obligations (rights and responsibilities) that societies confer upon entities for a variety of reasons, especially to solve concrete governance problems. We argue that this traditional bundle can be unbundled, creating bespoke solutions for different contexts. This will allow for the creation of practical tools – such as facilitating AI contracting by creating a target “individual” that can be sanctioned – without needing to resolve intractable debates about an AI’s consciousness or rationality. We explore how individuals fit into social roles and discuss the use of decentralized digital identity technology, examining both “personhood as a problem”, where design choices can create “dark patterns” that exploit human social heuristics, and “personhood as a solution”, where conferring a bundle of obligations is necessary to ensure accountability or prevent conflict. By rejecting foundationalist quests for a single, essential definition of personhood, this paper offers a more pragmatic and flexible way to think about integrating AI agents into our society.
zh
[AI-29] Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在全面基准测试中评估成本过高的问题,即如何在保持预测准确性的同时,通过构建小而具有代表性的数据子集(即“微型基准”)实现高效评估。现有方法采用以模型为中心的范式,依赖已有模型的集体表现来选择测试项,存在前期成本高、难以应对新基准(冷启动问题)以及假设未来模型会延续当前失败模式等局限性。本文的关键创新在于提出一种以任务项为中心的新范式,其核心思想是基于样本的认知需求(cognitive demands)进行数据筛选,而非依赖特定模型的性能表现。该方案通过提出的 Scales++ 方法实现,实证表明其可将前期选择成本降低超过18倍,并在 Open LLM Leaderboard 上仅用 0.5% 的数据子集即可实现平均绝对误差仅为 2.9% 的预测精度,同时具备更好的冷启动能力和更高的可解释性。
链接: https://arxiv.org/abs/2510.26384
作者: Andrew M. Bean,Nabeel Seedat,Shengzhuang Chen,Jonathan Richard Schwarz
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:  9 pages, 2 figures, 4 tables
Abstract:The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (‘cold-start’), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we challenge this paradigm and propose an item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.5% data subset, we predict full benchmark scores with a 2.9% mean absolute error. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.
zh
[AI-30] AI Mathematician as a Partner in Advancing Mathematical Discovery - A Case Study in Homogenization Theory
【速读】:该论文试图解决的问题是如何将人工智能(AI)从单纯的数学问题求解工具转变为能够与人类研究人员协同进行数学研究的“研究伙伴”,特别是在复杂数学领域如均匀化理论(homogenization theory)中实现可解释、可靠且可验证的推理过程。解决方案的关键在于构建一种系统性的“人机协同推理”(human-AI co-reasoning)范式,通过迭代地将难题分解为可处理的子目标、选择合适的分析方法并验证中间结果,同时引入有针对性的人类干预来引导和结构化发现流程,从而在保留人类对形式严谨性把控的同时,充分发挥机器在大规模计算和模式识别上的优势,最终生成完整且可验证的数学证明。
链接: https://arxiv.org/abs/2510.26380
作者: Yuanhang Liu,Beichen Wang,Peng Li,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  52 pages, 1 figure
Abstract:Artificial intelligence (AI) has demonstrated impressive progress in mathematical reasoning, yet its integration into the practice of mathematical research remains limited. In this study, we investigate how the AI Mathematician (AIM) system can operate as a research partner rather than a mere problem solver. Focusing on a challenging problem in homogenization theory, we analyze the autonomous reasoning trajectories of AIM and incorporate targeted human interventions to structure the discovery process. Through iterative decomposition of the problem into tractable subgoals, selection of appropriate analytical methods, and validation of intermediate results, we reveal how human intuition and machine computation can complement one another. This collaborative paradigm enhances the reliability, transparency, and interpretability of the resulting proofs, while retaining human oversight for formal rigor and correctness. The approach leads to a complete and verifiable proof, and more broadly, demonstrates how systematic human-AI co-reasoning can advance the frontier of mathematical discovery.
zh
[AI-31] BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
【速读】:该论文旨在解决强化微调(Reinforcement Finetuning, RFT)中任务选择效率低下的问题,即传统均匀采样会浪费计算资源在简单或无法解决的任务上,而现有任务选择方法往往存在回放成本高、适应性差或证据不完整等缺陷。解决方案的关键在于提出一个统一的贝叶斯在线任务选择框架(Bayesian Online Task Selection, BOTS),其核心机制是基于贝叶斯推断动态维护任务难度的后验估计,并通过汤普森采样(Thompson Sampling)实现探索与利用的平衡;BOTS同时融合显式证据(来自已选任务的直接评估)和隐式证据(从评估结果中推断出的未选任务难度),并通过一种轻量级插值法高效估算未评估任务的难度,显著降低额外推理开销,从而在多种领域和模型规模下提升RFT的数据效率与性能表现。
链接: https://arxiv.org/abs/2510.26374
作者: Qianli Shen,Daoyuan Chen,Yilun Huang,Zhenqing Ling,Yaliang Li,Bolin Ding,Jingren Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
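下面给出一个极简示意,演示摘要中“以 Beta 后验维护任务难度估计、用 Thompson 采样平衡探索与利用”的骨架。注意:这只是依据摘要补全的概念性草图,并非论文官方实现;其中“目标成功率取 0.5 最具学习价值”、用随机数模拟 rollout 结果等均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 100
alpha = np.ones(n_tasks)   # Beta 后验参数:累计成功次数 + 1
beta = np.ones(n_tasks)    # Beta 后验参数:累计失败次数 + 1

def select_tasks(k=8, target=0.5):
    """Thompson 采样:从每个任务的 Beta 后验抽一个成功率样本,
    选取最接近 target(假设 0.5 的任务学习信号最强)的 k 个任务。"""
    sampled = rng.beta(alpha, beta)
    return np.argsort(np.abs(sampled - target))[:k]

def update(task_ids, successes):
    """用被选任务的直接评估结果(显式证据)更新后验;
    论文中还有针对未选任务的插值式隐式证据,此处从略。"""
    for t, s in zip(task_ids, successes):
        alpha[t] += s
        beta[t] += 1 - s

for step in range(50):
    chosen = select_tasks()
    outcomes = (rng.random(len(chosen)) < 0.4).astype(int)  # 模拟 rollout 成败
    update(chosen, outcomes)
print("后验均值(前 5 个任务):", (alpha / (alpha + beta))[:5].round(2))
```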
zh
[AI-32] Reinforcement Learning for Pollution Detection in a Randomized Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在随机性、非平稳性及奖励稀疏环境下的性能瓶颈问题,这类环境常见于自主水下航行器(Autonomous Underwater Vehicles, AUVs)探测污染云等实际应用中。其关键解决方案在于对经典RL方法进行系统性改进,包括引入分层算法结构、多目标学习机制以及将位置记忆作为外部输出滤波器以避免状态重复访问,并通过修改蒙特卡洛方法显著提升了算法在复杂环境中的适应能力,实验表明该方法优于传统Q-learning和两种穷举搜索策略。
链接: https://arxiv.org/abs/2510.26347
作者: Sebastian Zieglmeier,Niklas Erdmann,Narada D. Warakagoda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) algorithms are designed to optimize problem-solving by learning actions that maximize rewards, a task that becomes particularly challenging in random and nonstationary environments. Even advanced RL algorithms are often limited in their ability to solve problems in these conditions. In applications such as searching for underwater pollution clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate reward-sparse environments, where actions frequently result in a zero reward. This paper aims to address these challenges by revisiting and modifying classical RL approaches to efficiently operate in sparse, randomized, and nonstationary environments. We systematically study a large number of modifications, including hierarchical algorithm changes, multigoal learning, and the integration of a location memory as an external output filter to prevent state revisits. Our results demonstrate that a modified Monte Carlo-based approach significantly outperforms traditional Q-learning and two exhaustive search patterns, illustrating its potential in adapting RL to complex environments. These findings suggest that reinforcement learning approaches can be effectively adapted for use in random, nonstationary, and reward-sparse environments.
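摘要中“把位置记忆作为外部输出滤波器以避免状态重访”的做法可以用下面的玩具代码直观说明。网格动作集合、随机 Q 值等均为演示假设,与论文在 AUV 污染云搜索任务上的具体实现无关。

```python
import numpy as np

rng = np.random.default_rng(8)
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def filtered_action(q_values, pos, visited):
    """位置记忆作为外部输出滤波器:按 Q 值从高到低尝试动作,
    跳过会进入已访问格子的动作;若全部已访问则退回贪心选择。"""
    order = sorted(ACTIONS, key=lambda a: -q_values[a])
    for a in order:
        dx, dy = ACTIONS[a]
        if (pos[0] + dx, pos[1] + dy) not in visited:
            return a
    return order[0]

pos, visited = (0, 0), {(0, 0)}
for step in range(10):
    q = {a: rng.random() for a in ACTIONS}   # 用随机数代替学到的 Q 值
    a = filtered_action(q, pos, visited)
    dx, dy = ACTIONS[a]
    pos = (pos[0] + dx, pos[1] + dy)
    visited.add(pos)
print("访问过的不同位置数:", len(visited))
```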
zh
[AI-33] Discovering State Equivalences in UCT Search Trees By Action Pruning
【速读】:该论文旨在解决蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)中状态抽象(state abstraction)在噪声环境或大规模动作空间下难以应用的问题。现有方法如OGA-UCT主要依赖状态-动作对抽象(abstraction of state-action pairs, ASAP),但在复杂场景中难以找到有效的状态抽象,限制了其样本效率提升潜力。解决方案的关键在于提出一种更弱的状态抽象条件——理想剪枝抽象(Ideal Pruning Abstractions in UCT, IPA-UCT),该方法通过放宽抽象约束,在轻微牺牲精度的前提下显著增加可发现的抽象数量,并实验证明其在多种测试域和迭代预算下均优于OGA-UCT及其衍生算法。IPA-UCT采用与OGA-UCT不同的抽象框架(即IPA),且本文进一步揭示IPA与ASAP均为更通用框架p-ASAP的特例,而p-ASAP又是ASASAP框架的子集,从而构建了一个统一的抽象理论体系。
链接: https://arxiv.org/abs/2510.26346
作者: Robin Schmöcker,Alexander Dockhorn,Bodo Rosenhahn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:One approach to enhance Monte Carlo Tree Search (MCTS) is to improve its sample efficiency by grouping/abstracting states or state-action pairs and sharing statistics within a group. Though state-action pair abstractions are mostly easy to find in algorithms such as On the Go Abstractions in Upper Confidence bounds applied to Trees (OGA-UCT), nearly no state abstractions are found in either noisy or large action space settings due to constraining conditions. We provide theoretical and empirical evidence for this claim, and we slightly alleviate this state abstraction problem by proposing a weaker state abstraction condition that trades a minor loss in accuracy for finding many more abstractions. We name this technique Ideal Pruning Abstractions in UCT (IPA-UCT), which outperforms OGA-UCT (and any of its derivatives) across a large range of test domains and iteration budgets as experimentally validated. IPA-UCT uses a different abstraction framework from Abstraction of State-Action Pairs (ASAP) which is the one used by OGA-UCT, which we name IPA. Furthermore, we show that both IPA and ASAP are special cases of a more general framework that we call p-ASAP which itself is a special case of the ASASAP framework.
zh
[AI-34] Linear Causal Discovery with Interventional Constraints
【速读】:该论文旨在解决现有因果发现方法在整合已知因果知识时的局限性问题,即传统方法虽能强制执行结构约束(如要求从PIP3到Akt存在因果路径),但仍可能得出错误的因果结论(如错误地推断“PIP3抑制Akt”)。其解决方案的关键在于提出一种新的概念——干预约束(interventional constraints),该约束以不等式形式编码高阶因果知识(例如PIP3对Akt具有正向因果效应),并基于线性因果模型定义了一个量化总因果效应的度量指标,将问题建模为带约束的优化任务,通过两阶段约束优化方法求解。此方法显著提升了模型准确性与可解释性,并能发现原本难以识别的新因果关系。
链接: https://arxiv.org/abs/2510.26342
作者: Zhigao Guo,Feng Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Incorporating causal knowledge and mechanisms is essential for refining causal models and improving downstream tasks such as designing new treatments. In this paper, we introduce a novel concept in causal discovery, termed interventional constraints, which differs fundamentally from interventional data. While interventional data require direct perturbations of variables, interventional constraints encode high-level causal knowledge in the form of inequality constraints on causal effects. For instance, in the Sachs dataset (Sachs et al. 2005), Akt has been shown to be activated by PIP3, meaning PIP3 exerts a positive causal effect on Akt. Existing causal discovery methods allow enforcing structural constraints (for example, requiring a causal path from PIP3 to Akt), but they may still produce incorrect causal conclusions such as learning that “PIP3 inhibits Akt”. Interventional constraints bridge this gap by explicitly constraining the total causal effect between variable pairs, ensuring learned models respect known causal influences. To formalize interventional constraints, we propose a metric to quantify total causal effects for linear causal models and formulate the problem as a constrained optimization task, solved using a two-stage constrained optimization method. We evaluate our approach on real-world datasets and demonstrate that integrating interventional constraints not only improves model accuracy and ensures consistency with established findings, making models more explainable, but also facilitates the discovery of new causal relationships that would otherwise be costly to identify.
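对线性因果模型而言,“总因果效应”等于所有有向路径上边权乘积之和;在 DAG 上该和有闭式 T = (I - W)^{-1} - I。下面的小例子据此检查“PIP3 对 Akt 的总效应为正”这一干预约束;节点顺序与边权均为虚构,仅作概念示意,并非论文的两阶段约束优化实现。

```python
import numpy as np

nodes = ["PIP3", "PIP2", "Akt"]          # 节点与取值仅为演示假设
idx = {n: i for i, n in enumerate(nodes)}

# W[i, j] 为 i 对 j 的直接线性因果效应(DAG 的加权邻接矩阵)
W = np.zeros((3, 3))
W[idx["PIP3"], idx["PIP2"]] = 0.8
W[idx["PIP2"], idx["Akt"]] = 0.5
W[idx["PIP3"], idx["Akt"]] = 0.3

# DAG 上 W 幂零,总效应 = W + W^2 + ... = (I - W)^{-1} - I
T = np.linalg.inv(np.eye(3) - W) - np.eye(3)

total = T[idx["PIP3"], idx["Akt"]]       # 0.3 + 0.8 * 0.5 = 0.7
print(f"PIP3 -> Akt 总因果效应: {total:.2f}")
# 干预约束的不等式形式:要求 PIP3 对 Akt 的总效应为正
assert total > 0, "违反干预约束:PIP3 应正向激活 Akt"
```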
zh
[AI-35] Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics NEURIPS2025
【速读】:该论文旨在解决在已知噪声线性观测 $ y = Ax + \xi $ 和先验分布 $ p(x) $ 的情况下,如何高效地从后验分布 $ p(x \mid y) $ 中采样这一问题。这类后验采样在图像修复(inpainting)、去模糊(deblurring)和磁共振成像(MRI reconstruction)等任务中具有重要意义。传统方法如Langevin dynamics虽在精确得分函数(score)已知时有效,但对得分估计误差敏感,需满足矩生成函数(MGF)有界条件(即次指数误差),这在实际中难以保证。相比之下,扩散模型(diffusion models)在无条件采样中仅需 $ L^2 $ 范数误差控制即可成功。本文提出将扩散模型与一种退火版Langevin动力学相结合,证明其可在多项式时间内实现条件采样,且仅要求得分误差的 $ L^4 $ 范数有界,显著放宽了对误差分布的假设,从而实现了更鲁棒且高效的后验采样方案。
链接: https://arxiv.org/abs/2510.26324
作者: Zhiyang Xun,Shivam Gupta,Eric Price
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:  NeurIPS 2025
Abstract:Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and a good approximation to the prior $p(x)$, when can we sample from the posterior $p(x \mid y)$? Posterior sampling provides an accurate and fair framework for tasks such as inpainting, deblurring, and MRI reconstruction, and several heuristics attempt to approximate it. Unfortunately, approximate posterior sampling is computationally intractable in general. To sidestep this hardness, we focus on (local or global) log-concave distributions $p(x)$. In this regime, Langevin dynamics yields posterior samples when the exact scores of $p(x)$ are available, but it is brittle to score-estimation error, requiring an MGF bound (sub-exponential error). By contrast, in the unconditional setting, diffusion models succeed with only an $L^2$ bound on the score error. We prove that combining diffusion models with an annealed variant of Langevin dynamics achieves conditional sampling in polynomial time using merely an $L^4$ bound on the score error.
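为直观展示“退火 Langevin 动力学做后验采样”的骨架,下面给出一个玩具示例:先验取标准高斯,其得分有解析形式,用来代替真实方法中由扩散模型估计的得分;观测为带噪线性测量。步长序列与迭代次数均为随意设定的假设。

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, sigma = 5, 3, 0.5
A = rng.normal(size=(m, d))
x_true = rng.normal(size=d)
y = A @ x_true + sigma * rng.normal(size=m)

def posterior_score(x):
    # 先验取 N(0, I),得分为 -x;真实方法中该项由扩散模型的得分网络给出
    prior_score = -x
    likelihood_score = A.T @ (y - A @ x) / sigma**2
    return prior_score + likelihood_score

x = rng.normal(size=d)
# 退火:步长逐级衰减,每级运行若干步 Langevin 动力学
for eta in [5e-3, 2e-3, 1e-3, 5e-4]:
    for _ in range(300):
        x = x + eta * posterior_score(x) + np.sqrt(2 * eta) * rng.normal(size=d)

print("后验样本:", np.round(x, 2))
print("真实信号:", np.round(x_true, 2))
```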
zh
[AI-36] GraphCompliance: Aligning Policy and Context Graphs for LLM -Based Regulatory Compliance
【速读】:该论文旨在解决大规模网络环境下合规性判断的难题,即如何将非结构化的运行时上下文(如自然语言描述的事件)与结构化的法规文本(如GDPR)进行语义对齐,从而提升合规性推理的准确性与效率。其核心挑战在于法规具有跨引用和规范性特征,而运行时数据通常以未结构化形式存在,导致传统大语言模型(LLM)在解释和解析过程中易产生误判。解决方案的关键是提出GraphCompliance框架,通过构建政策图(Policy Graph)表示法规的结构化规范和交叉引用关系,并将运行时上下文建模为事件三元组(subject-action-object, SAO)和实体关系三元组构成的上下文图(Context Graph),实现两者的语义对齐。这种结构化锚定机制显著降低了LLM在事件解析和法规解释上的负担,使其专注于核心推理步骤,实验表明该方法在多个GDPR相关任务中相较纯LLM和RAG基线模型微F1提升4.1–7.2个百分点,同时减少误报和漏报。
链接: https://arxiv.org/abs/2510.26309
作者: Jiseong Chung,Ronny Ko,Wonchul Yoo,Makoto Onizuka,Sungmok Kim,Tae-Wan Kim,Won-Yong Shin
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:  Under review at The Web Conference 2026 (Semantics & Knowledge track). Code will be released upon acceptance. This arXiv v1 contains no repository links to preserve double-blind review
Abstract:Compliance at web scale poses practical challenges: each request may require a regulatory assessment. Regulatory texts (e.g., the General Data Protection Regulation, GDPR) are cross-referential and normative, while runtime contexts are expressed in unstructured natural language. This setting motivates us to align semantic information in unstructured text with the structured, normative elements of regulations. To this end, we introduce GraphCompliance, a framework that represents regulatory texts as a Policy Graph and runtime contexts as a Context Graph, and aligns them. In this formulation, the policy graph encodes normative structure and cross-references, whereas the context graph formalizes events as subject-action-object (SAO) and entity-relation triples. This alignment anchors the reasoning of a judge large language model (LLM) in structured information and helps reduce the burden of regulatory interpretation and event parsing, enabling a focus on the core reasoning step. In experiments on 300 GDPR-derived real-world scenarios spanning five evaluation tasks, GraphCompliance yields 4.1-7.2 percentage points (pp) higher micro-F1 than LLM-only and RAG baselines, with fewer under- and over-predictions, resulting in higher recall and lower false positive rates. Ablation studies indicate contributions from each graph component, suggesting that structured representations and a judge LLM are complementary for normative reasoning.
zh
[AI-37] Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime
【速读】:该论文旨在解决深度学习中Adam优化器的隐式偏差(implicit bias)问题,特别是其在不同批量策略(batching scheme)下对线性可分数据的分类边界选择机制不明确的问题。以往研究仅限于全批量(full-batch)情形,表明Adam倾向于收敛到ℓ∞-最大间隔分类器(ℓ∞-max-margin classifier),但本文通过分析增量式Adam(incremental Adam,每步使用单样本)发现,其行为可能偏离这一规律。解决方案的关键在于:首先构造一类结构化数据集,证明增量Adam可收敛至ℓ₂-最大间隔分类器;其次提出一个代理算法(proxy algorithm),在β₂→1时捕捉增量Adam的极限行为,并通过数据相关的对偶固定点公式刻画其收敛方向;最后证明Signum优化器无论批量大小如何,只要β足够接近1,均收敛至ℓ∞-最大间隔分类器,展现出对批量和数据分布的不变性。整体而言,该工作揭示了Adam的隐式偏差高度依赖于批量策略与数据结构,而Signum则具有更强的鲁棒性。
链接: https://arxiv.org/abs/2510.26303
作者: Beomhan Baek,Minhak Song,Chulhee Yun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:  50 pages
Abstract:Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and we show that its bias can deviate from the full-batch behavior. To illustrate this, we construct a class of structured datasets where incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we develop a proxy algorithm that captures the limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize its convergence direction via a data-dependent dual fixed-point formulation. Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$ close enough to 1. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
zh
[AI-38] Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
【速读】:该论文旨在解决对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP)在组合推理(compositional reasoning)上的固有缺陷,即CLIP在处理对象、属性和关系的组合时表现不佳,常表现为“词袋匹配器”行为,难以区分正确描述与困难负样本(hard negatives)。其核心问题在于现有因果解释通常将文本建模为单一向量,忽略了词元(token)级别的结构,导致无法解释提示敏感性(prompt sensitivity)和硬负样本失败等现象。解决方案的关键在于提出一种基于序列化语言词元结构因果模型(sequential, language-token Structural Causal Model, SCM)的词元感知因果表示学习(token-aware causal representation learning, CRL)框架。该理论首次将块可识别性(block identifiability)扩展至词元级别,证明CLIP的对比目标在句子级和词元级SCM下均可恢复模态不变潜在变量;更重要的是,该分析揭示了组合不可识别性(composition nonidentifiability)是导致CLIP组合脆弱性的根本原因——存在伪最优文本编码器,在保持模态不变对齐的同时对原子概念的SWAP、REPLACE和ADD操作完全不敏感,从而无法区分正确描述与硬负样本,尽管它们优化相同的训练目标。
链接: https://arxiv.org/abs/2510.26302
作者: Ziliang Chen,Tianang Xiao,Jusheng Zhang,Yongsen Zheng,Xipeng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token SCM. Our theory extends block identifiability to tokenized text, proving that CLIP’s contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP’s compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.
zh
[AI-39] Distributional Multi-objective Black-box Optimization for Diffusion-model Inference-time Multi-Target Generation
【速读】:该论文旨在解决现有扩散模型在高维多目标黑箱优化问题中效率低下的问题,尤其是传统方法将扩散模型视为黑盒后处理工具,忽视了其内部生成过程中的分布演化机制。解决方案的关键在于提出推理时多目标生成(Inference-time Multi-target Generation, IMG)算法,该算法在扩散生成过程中引入基于预期聚合多目标值的加权重采样策略,使生成样本服从目标的多目标Boltzmann分布;这一分布被证明是分布式多目标优化问题的最优解,从而实现仅需一次生成即可显著提升超体积指标,优于依赖数百次生成的基线方法。
链接: https://arxiv.org/abs/2510.26278
作者: Kim Yong Tan,Yueming Lyu,Ivor Tsang,Yew-Soon Ong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models have been successful in learning complex data distributions. This capability has driven their application to high-dimensional multi-objective black-box optimization problem. Existing approaches often employ an external optimization loop, such as an evolutionary algorithm, to the diffusion model. However, these approaches treat the diffusion model as a black-box refiner, which overlooks the internal distribution transition of the diffusion generation process, limiting their efficiency. To address these challenges, we propose the Inference-time Multi-target Generation (IMG) algorithm, which optimizes the diffusion process at inference-time to generate samples that simultaneously satisfy multiple objectives. Specifically, our IMG performs weighted resampling during the diffusion generation process according to the expected aggregated multi-objective values. This weighted resampling strategy ensures the diffusion-generated samples are distributed according to our desired multi-target Boltzmann distribution. We further derive that the multi-target Boltzmann distribution has an interesting log-likelihood interpretation, where it is the optimal solution to the distributional multi-objective optimization problem. We implemented IMG for a multi-objective molecule generation task. Experiments show that IMG, requiring only a single generation pass, achieves a significantly higher hypervolume than baseline optimization algorithms that often require hundreds of diffusion generations. Notably, our algorithm can be viewed as an optimized diffusion process and can be integrated into existing methods to further improve their performance.
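IMG 的核心操作是“在扩散反向过程中按 exp(r(x)/τ) 的权重对粒子做重采样”。下面用一维玩具示例示意这一点:其中的“去噪一步”以简单收缩代替真实扩散模型的反向步,奖励函数也是虚构的,仅用于说明加权重采样如何把样本推向目标 Boltzmann 分布。

```python
import numpy as np

rng = np.random.default_rng(2)

def reward(x):
    # 多目标加权聚合的玩具替身:希望样本既接近 1.0 又非负
    return -(x - 1.0) ** 2 - np.maximum(-x, 0.0)

n, T, tau = 256, 50, 0.5
particles = rng.normal(size=n) * 3.0        # 从噪声初始化

for t in range(T):
    # 玩具"去噪"一步:向 0 收缩并注入少量噪声(真实方法中为扩散反向步)
    particles = 0.95 * particles + 0.1 * rng.normal(size=n)
    # 按期望聚合目标值做加权重采样,使粒子分布趋向目标 Boltzmann 分布
    w = np.exp(reward(particles) / tau)
    w /= w.sum()
    idx = rng.choice(n, size=n, p=w)
    particles = particles[idx]

print("重采样后粒子均值:", round(float(particles.mean()), 3))
```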
zh
[AI-40] A Research Roadmap for Augmenting Software Engineering Processes and Software Products with Generative AI
【速读】:该论文旨在解决生成式 AI (Generative AI) 在软件工程 (Software Engineering, SE) 领域中快速渗透所带来的实践变革与研究方向模糊问题,即如何系统性地理解 GenAI 对 SE 过程、方法和工具的影响,并明确未来的研究路径。其解决方案的关键在于采用设计科学研究(Design Science Research)方法,通过三轮迭代的证据整合过程——包括 FSE 2025 “Software Engineering 2030” 工作坊的协作讨论、快速文献综述以及外部同行反馈会话——并借助麦克卢汉四重效应(McLuhan’s tetrads)作为概念分析工具,识别出 GenAI 在 SE 中的四种基本增强形式,进而系统化地刻画相关研究挑战与机遇,最终形成可复现、透明且经多团队交叉验证的路线图,为未来 SE 研究提供结构化指引。
链接: https://arxiv.org/abs/2510.26275
作者: Domenico Amalfitano,Andreas Metzger,Marco Autili,Tommaso Fulcini,Tobias Hey,Jan Keim,Patrizio Pelliccione,Vincenzo Scotti,Anne Koziolek,Raffaela Mirandola,Andreas Vogelsang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Generative AI (GenAI) is rapidly transforming software engineering (SE) practices, influencing how SE processes are executed, as well as how software systems are developed, operated, and evolved. This paper applies design science research to build a roadmap for GenAI-augmented SE. The process consists of three cycles that incrementally integrate multiple sources of evidence, including collaborative discussions from the FSE 2025 “Software Engineering 2030” workshop, rapid literature reviews, and external feedback sessions involving peers. McLuhan’s tetrads were used as a conceptual instrument to systematically capture the transforming effects of GenAI on SE processes and software products. The resulting roadmap identifies four fundamental forms of GenAI augmentation in SE and systematically characterizes their related research challenges and opportunities. These insights are then consolidated into a set of future research directions. By grounding the roadmap in a rigorous multi-cycle process and cross-validating it among independent author teams and peers, the study provides a transparent and reproducible foundation for analyzing how GenAI affects SE processes, methods and tools, and for framing future research within this rapidly evolving area. Based on these findings, the article finally makes ten predictions for SE in the year 2030.
zh
[AI-41] Graph-Enhanced Policy Optimization in LLM Agent Training
【速读】:该论文旨在解决基于群体的强化学习(Group-based Reinforcement Learning, RL)在训练多轮交互式大语言模型(Large Language Model, LLM)代理时所面临的“结构盲视”(structural blindness)问题,即无法利用环境的底层连通性,从而导致探索效率低下、信用分配不准确以及规划短视等挑战。解决方案的关键在于提出图增强策略优化(Graph-Enhanced Policy Optimization, GEPO),其通过动态构建从代理经验中提取的状态转移图,并运用图论中的中心性指标生成三种协同的学习信号:(1)结构化内在奖励以引导探索至高影响力状态;(2)拓扑感知的增强优势函数实现精准信用分配;(3)基于每个状态战略价值动态调整的折扣因子,从而显著提升LLM代理在复杂任务中的表现。
链接: https://arxiv.org/abs/2510.26270
作者: Jiazhen Yuan,Wei Zhao,Zhengbiao Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  Under review as a conference paper
Abstract:Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when applied to train multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state’s strategic value. On the ALFWorld, WebShop, and proprietary Workbench benchmarks, GEPO demonstrates strong performance, achieving absolute success rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
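摘要中的结构化信号可以用 networkx 简单示意:由经验转移构建状态图,用介数中心性近似状态的战略价值,进而派生内在奖励与动态折扣因子。下例中的转移数据与缩放系数均为虚构假设,仅展示思路。

```python
import networkx as nx

# 从代理经验中记录的状态转移(虚构的玩具轨迹)
transitions = [("s0", "s1"), ("s1", "s2"), ("s2", "s3"),
               ("s1", "s4"), ("s4", "s3"), ("s3", "s5")]

G = nx.DiGraph()
G.add_edges_from(transitions)

# 用介数中心性近似状态的"枢纽程度",作为结构化内在奖励的来源
centrality = nx.betweenness_centrality(G)

def intrinsic_reward(state, scale=0.1):
    return scale * centrality.get(state, 0.0)

def dynamic_gamma(state, base=0.95, bonus=0.04):
    # 战略价值越高的状态,折扣因子越接近 1(更重视长远)
    return min(0.999, base + bonus * centrality.get(state, 0.0))

for s in sorted(G.nodes):
    print(s, round(intrinsic_reward(s), 4), round(dynamic_gamma(s), 4))
```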
zh
[AI-42] Angular Steering: Behavior Control via Rotation in Activation Space NEURIPS2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在保持通用能力的同时,实现对特定行为的精准控制问题,这是安全可靠部署人工智能的关键挑战。现有方法如向量加法和方向消融法受限于由激活和特征方向定义的二维子空间,易受参数选择影响,并可能因激活空间中的非预期交互而干扰无关特征。其解决方案的核心是提出Angular Steering,一种基于几何旋转的行为调制方法:通过在固定二维子空间内旋转激活向量,实现对目标行为方向(如拒绝或服从)的连续、细粒度控制;进一步提出自适应版本Adaptive Angular Steering,仅旋转与目标特征对齐的激活,提升稳定性与一致性。该方法统一了传统加法与正交化技术,简化参数选择并增强鲁棒性,在多个模型家族和规模上验证了其有效性。
链接: https://arxiv.org/abs/2510.26243
作者: Hieu M. Vu,Tan M. Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  NeurIPS 2025 (Spotlight)
Abstract:Controlling specific behaviors in large language models while preserving their general capabilities is a central challenge for safe and reliable artificial intelligence deployment. Current steering methods, such as vector addition and directional ablation, are constrained within a two-dimensional subspace defined by the activation and feature direction, making them sensitive to chosen parameters and potentially affecting unrelated features due to unintended interactions in activation space. We introduce Angular Steering, a novel and flexible method for behavior modulation that operates by rotating activations within a fixed two-dimensional subspace. By formulating steering as a geometric rotation toward or away from a target behavior direction, Angular Steering provides continuous, fine-grained control over behaviors such as refusal and compliance. We demonstrate this method using refusal steering and emotion steering as use cases. Additionally, we propose Adaptive Angular Steering, a selective variant that rotates only activations aligned with the target feature, further enhancing stability and coherence. Angular Steering generalizes existing addition and orthogonalization techniques under a unified geometric rotation framework, simplifying parameter selection and maintaining model stability across a broader range of adjustments. Experiments across multiple model families and sizes show that Angular Steering achieves robust behavioral control while maintaining general language modeling performance, underscoring its flexibility, generalization, and robustness compared to prior approaches. Code and artifacts are available at this https URL.
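Angular Steering 的几何核心是在由行为方向 v 与激活 h 张成的二维平面内旋转 h,且范数保持不变。下面是一个自包含的 numpy 草图,坐标与符号约定为笔者假设,并非官方实现:

```python
import numpy as np

def angular_steer(h, v, theta):
    """在 span{v, h} 的二维平面内把激活 h 旋转 theta 弧度
    (theta > 0 朝行为方向 v 旋转,theta < 0 远离),范数不变。"""
    u1 = v / np.linalg.norm(v)
    h_perp = h - (h @ u1) * u1
    if np.linalg.norm(h_perp) < 1e-8:   # h 与 v 共线时旋转无定义,原样返回
        return h
    u2 = h_perp / np.linalg.norm(h_perp)
    a, b = h @ u1, h @ u2               # h 恰好整体落在 (u1, u2) 平面内
    a_new = a * np.cos(theta) + b * np.sin(theta)
    b_new = -a * np.sin(theta) + b * np.cos(theta)
    return a_new * u1 + b_new * u2

h = np.array([1.0, 2.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
out = angular_steer(h, v, np.pi / 6)    # 朝 v 方向旋转 30 度
print(out, np.linalg.norm(out), np.linalg.norm(h))  # 旋转前后范数相同
```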
zh
[AI-43] Retrieval Augmented Generation-Enhanced Distributed LLM Agents for Generalizable Traffic Signal Control with Emergency Vehicles
【速读】:该论文旨在解决城市交通信号控制(Traffic Signal Control, TSC)中两大核心挑战:一是大语言模型(Large Language Models, LLMs)在应急场景下易产生幻觉,导致决策不可靠、延误应急车辆通行;二是不同类型的交叉口在交通状态编码和跨交叉口训练方面存在显著差异,限制了模型在异构交叉口间的泛化能力。解决方案的关键在于提出一种增强型分布式LLM代理框架REG-TSC,其核心创新包括:1)设计了一种面向应急的推理框架,通过动态调整推理深度并引入基于回顾机制的应急检索增强生成(Reviewer-based Emergency RAG, RERAG),从历史案例中提炼特定知识以提升应急决策的可靠性与合理性;2)构建无类型依赖的交通状态表示,并提出奖励引导的强化精炼机制(Reward-guided Reinforced Refinement, R3),通过环境反馈驱动的经验采样与奖励加权似然损失微调,使模型在异构交叉口上学习高奖励策略,从而实现通用性TSC优化。
链接: https://arxiv.org/abs/2510.26242
作者: Xinhang Li,Qing Guo,Junyu Chen,Zheng Guo,Shengzhe Xu,Lei Li,Lin Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With increasing urban traffic complexity, Traffic Signal Control (TSC) is essential for optimizing traffic flow and improving road safety. Large Language Models (LLMs) emerge as promising approaches for TSC. However, they are prone to hallucinations in emergencies, leading to unreliable decisions that may cause substantial delays for emergency vehicles. Moreover, diverse intersection types present substantial challenges for traffic state encoding and cross-intersection training, limiting generalization across heterogeneous intersections. Therefore, this paper proposes Retrieval Augmented Generation (RAG)-enhanced distributed LLM agents with Emergency response for Generalizable TSC (REG-TSC). Firstly, this paper presents an emergency-aware reasoning framework, which dynamically adjusts reasoning depth based on the emergency scenario and is equipped with a novel Reviewer-based Emergency RAG (RERAG) to distill specific knowledge and guidance from historical cases, enhancing the reliability and rationality of agents’ emergency decisions. Secondly, this paper designs a type-agnostic traffic representation and proposes a Reward-guided Reinforced Refinement (R3) for heterogeneous intersections. R3 adaptively samples training experience from diverse intersections with environment feedback-based priority and fine-tunes LLM agents with a designed reward-weighted likelihood loss, guiding REG-TSC toward high-reward policies across heterogeneous intersections. On three real-world road networks with 17 to 177 heterogeneous intersections, extensive experiments show that REG-TSC reduces travel time by 42.00%, queue length by 62.31%, and emergency vehicle waiting time by 83.16%, outperforming other state-of-the-art methods.
zh
[AI-44] Questionnaire meets LLM : A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses
【速读】:该论文旨在解决大规模问卷数据在大语言模型(Large Language Models, LLMs)处理中的结构性挑战,即如何有效表示问卷数据以提升LLMs在问卷分析任务中的准确性和泛化能力。当前主流调查分析工具(如Qualtrics、SPSS)主要面向人工操作,缺乏与LLM集成的结构化输入范式,导致问卷数据难以被高效利用于自动化分析。解决方案的关键在于提出QASU(Questionnaire Analysis and Structural Understanding)基准,系统评估六种序列化格式和多种提示策略对六项核心结构化技能(如答案查找、受访者计数、多跳推理等)的影响,并发现:选择最优的数据格式与提示组合可使准确率提升高达8.8个百分点;进一步通过轻量级结构提示(self-augmented prompting)进行自增强提示设计,可在特定任务上平均再提升3–4个百分点。该工作为问卷数据的结构化建模提供了可复现、可扩展的基准框架,推动了LLM在问卷分析领域的研究与实践发展。
链接: https://arxiv.org/abs/2510.26238
作者: Duc-Hai Nguyen,Vijayakumar Nanjappan,Barry O’Sullivan,Hoang D. Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  14 pages, 3 figures, 8 tables
Abstract:Millions of people take surveys every day, from market polls and academic studies to medical questionnaires and customer feedback forms. These datasets capture valuable insights, but their scale and structure present a unique challenge for large language models (LLMs), which otherwise excel at few-shot reasoning over open-ended text. Yet, their ability to process questionnaire data, i.e., lists of questions crossed with hundreds of respondent rows, remains underexplored. Current retrieval and survey analysis tools (e.g., Qualtrics, SPSS, REDCap) are typically designed with humans in the workflow, limiting the integration of such data with LLM- and AI-empowered automation. This gap leaves scientists, surveyors, and everyday users without evidence-based guidance on how to best represent questionnaires for LLM consumption. We address this by introducing QASU (Questionnaire Analysis and Structural Understanding), a benchmark that probes six structural skills, including answer lookup, respondent count, and multi-hop inference, across six serialization formats and multiple prompt strategies. Experiments on contemporary LLMs show that choosing an effective format and prompt combination can improve accuracy by up to 8.8 percentage points compared to suboptimal formats. For specific tasks, carefully adding a lightweight structural hint through self-augmented prompting can yield further improvements of 3-4 percentage points on average. By systematically isolating format and prompting effects, our open source benchmark offers a simple yet versatile foundation for advancing both research and real-world practice in LLM-based questionnaire analysis.
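下面用一小段代码示意“同一份问卷数据的不同序列化格式加上轻量结构提示”如何拼入 LLM 提示词;字段名与提示语均为虚构示例,并非基准中的真实格式定义。

```python
import json

respondents = [
    {"id": 1, "Q1_age": 29, "Q2_satisfaction": "high"},
    {"id": 2, "Q1_age": 41, "Q2_satisfaction": "low"},
]

# 格式一:JSON 行记录
as_json = "\n".join(json.dumps(r, ensure_ascii=False) for r in respondents)

# 格式二:Markdown 表格
header = "| id | Q1_age | Q2_satisfaction |\n|---|---|---|"
rows = "\n".join(f"| {r['id']} | {r['Q1_age']} | {r['Q2_satisfaction']} |"
                 for r in respondents)
as_markdown = header + "\n" + rows

# 轻量结构提示(self-augmented prompting 的一种最简形式)
hint = f"共有 {len(respondents)} 名受访者;每行为一名受访者的全部作答。"
prompt = f"{hint}\n{as_markdown}\n问题:满意度为 high 的受访者有几人?"
print(prompt)
```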
zh
[AI-45] MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines
【速读】:该论文旨在解决现有机器遗忘(Machine Unlearning, MU)方法在实际部署中面临的可扩展性问题,以及对原始数据集和模型的完全访问权限的依赖。传统方法通常侧重于理论形式化或优化目标,但在真实场景中难以应用。其解决方案的关键在于将分类训练视为一个类别的顺序学习过程(称为归纳方法),并通过在模型末尾添加一个投影-重分布层(projection-redistribution layer)来实现遗忘操作——即通过反转最后的训练序列来移除特定类的知识。该方法无需访问原始数据或模型,支持模块化、模型无关的部署,作为输出过滤器嵌入现有分类流水线,仅需最小改动即可实现高效且性能稳定的遗忘效果。
链接: https://arxiv.org/abs/2510.26230
作者: Minyi Peng,Darian Gunamardi,Ivan Tjuawinata,Kwok-Yan Lam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  10 pages, 6 figures
Abstract:As a new and promising approach, existing machine unlearning (MU) works typically emphasize theoretical formulations or optimization objectives to achieve knowledge removal. However, when deployed in real-world scenarios, such solutions typically face scalability issues and have to address practical requirements such as full access to original datasets and model. In contrast to the existing approaches, we regard classification training as a sequential process where classes are learned sequentially, which we call an inductive approach. Unlearning can then be done by reversing the last training sequence. This is implemented by appending a projection-redistribution layer at the end of the model. Such an approach does not require full access to the original dataset or the model, addressing the challenges of existing methods. This enables modular and model-agnostic deployment as an output filter into existing classification pipelines with minimal alterations. We conducted multiple experiments across multiple datasets including image (CIFAR-10/100 using a CNN-based model) and tabular datasets (Covertype using a tree-based model). Experimental results show output consistently similar to that of a fully retrained model, at a greatly reduced computational cost. This demonstrates the applicability, scalability, and system compatibility of our solution while maintaining the performance of the output in a more practical setting.
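“投影-再分配”作为输出滤波器的效果可以用几行代码表达:屏蔽被遗忘类别,并把概率质量重新归一化到其余类别上。论文中的该层是接在模型末端、通过逆转最后训练序列得到的模块,下面仅是表达这一思路的最小示意,并非原始实现。

```python
import numpy as np

def unlearn_filter(logits, forget_class):
    """投影-再分配输出滤波器的极简示意:先做 softmax,
    再屏蔽被遗忘类(投影),最后在剩余类上重新归一化(再分配)。"""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    p[..., forget_class] = 0.0
    p /= p.sum(axis=-1, keepdims=True)
    return p

logits = np.array([[2.0, 1.0, 0.5],
                   [0.2, 3.0, 1.0]])
print(unlearn_filter(logits, forget_class=1))  # 第 1 类概率被重新分配
```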
zh
[AI-46] st-Time Alignment of LLM s via Sampling-Based Optimal Control in pre-logit space
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在测试阶段的对齐问题,即如何在不进行昂贵微调的前提下提升模型输出与人类偏好或特定奖励函数的一致性。其核心挑战在于如何高效利用有限计算资源实现高奖励性能。解决方案的关键在于提出一种基于预logits(pre-logits)的自适应重要性采样方法(Adaptive Importance Sampling on Pre-logits, AISP),该方法通过向预logits施加高斯扰动,并利用重要性采样技术估计最优扰动均值以最大化期望奖励,从而在较少样本下获得比Best-of-N采样及其他基于奖励的测试时对齐方法更高的奖励表现。
链接: https://arxiv.org/abs/2510.26219
作者: Sekitoshi Kanai,Tsukasa Yoshida,Hiroshi Takahashi,Haru Kuroki,Kazumune Hashimoto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  21 pages, 8 figures
Abstract:Test-time alignment of large language models (LLMs) attracts attention because fine-tuning LLMs requires high computational costs. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP), on the basis of sampling-based model predictive control with a stochastic control input. AISP applies a Gaussian perturbation to the pre-logits, which are the outputs of the penultimate layer, so as to maximize expected rewards with respect to the mean of the perturbation. We demonstrate that the optimal mean is obtained by importance sampling with sampled rewards. AISP outperforms best-of-n sampling in terms of reward for a given number of samples and achieves higher rewards than other reward-based test-time alignment methods.
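AISP 的更新规则(对预 logits 施加高斯扰动,再用重要性采样得到奖励加权的均值估计)可概括为如下草图。示例中的奖励函数是虚构的线性打分,真实场景中应替换为奖励模型对生成结果的评分;温度、样本数等超参数均为假设。

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16                       # 预 logits 维度(假设)
mu = np.zeros(d)             # 待优化的扰动均值
sigma = 0.5
target = rng.normal(size=d)  # 玩具奖励方向:与其内积越大奖励越高

def reward(z):
    return z @ target        # 真实场景中应为奖励模型打分

pre_logit = rng.normal(size=d)
for it in range(30):
    eps = mu + sigma * rng.normal(size=(64, d))     # 采样扰动
    r = np.array([reward(pre_logit + e) for e in eps])
    w = np.exp((r - r.max()) / 1.0)                 # 重要性权重(温度取 1)
    w /= w.sum()
    mu = (w[:, None] * eps).sum(axis=0)             # 奖励加权的均值更新

cos = mu @ target / (np.linalg.norm(mu) * np.linalg.norm(target) + 1e-9)
print("学到的扰动均值与奖励方向的余弦:", round(float(cos), 3))
```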
zh
[AI-47] Predicting All-Cause Hospital Readmissions from Medical Claims Data of Hospitalised Patients
【速读】:该论文旨在解决可预防的医院再入院问题,这是支付方、医疗机构和政策制定者关注的国家优先事项,目的是提升医疗质量并降低医疗成本。其关键解决方案是利用机器学习方法(包括逻辑回归、随机森林和支持向量机)对高维健康理赔数据进行分析,结合主成分分析(Principal Component Analysis, PCA)进行降维处理,从而识别出影响全因再入院的关键人口统计学和医学因素。实验结果表明,随机森林模型表现最优,其AUC指标最高,能够有效预测再入院风险,为临床干预提供依据,有助于精准识别高风险患者,减少再入院率,进而优化医疗资源配置和提升服务质量。
链接: https://arxiv.org/abs/2510.26188
作者: Avinash Kadimisetty,Arun Rajagopalan,Vijendra SK
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  NCMLAI 2018
Abstract:Reducing preventable hospital readmissions is a national priority for payers, providers, and policymakers seeking to improve health care and lower costs. The rate of readmission is being used as a benchmark to determine the quality of healthcare provided by hospitals. In this project, we have used machine learning techniques like Logistic Regression, Random Forest and Support Vector Machines to analyze the health claims data and identify demographic and medical factors that play a crucial role in predicting all-cause readmissions. As the health claims data is high dimensional, we have used Principal Component Analysis as a dimension reduction technique and used the results for building regression models. We compared and evaluated these models based on the Area Under Curve (AUC) metric. The Random Forest model gave the highest performance, followed by the Logistic Regression and Support Vector Machine models. These models can be used to identify the crucial factors causing readmissions and help identify patients to focus on to reduce the chances of readmission, ultimately bringing down the cost and increasing the quality of healthcare provided to the patients.
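论文的流程(PCA 降维后接逻辑回归或随机森林,按 AUC 比较)可以用 scikit-learn 少量代码复现骨架。以下用合成数据代替真实理赔数据,特征维度与各项超参数均为随意假设:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 合成"高维理赔数据":再入院为少数类,模拟类别不平衡
X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=20, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LogReg+PCA": make_pipeline(StandardScaler(), PCA(n_components=30),
                                LogisticRegression(max_iter=1000)),
    "RandomForest+PCA": make_pipeline(StandardScaler(), PCA(n_components=30),
                                      RandomForestClassifier(n_estimators=200,
                                                             random_state=0)),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```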
zh
[AI-48] Accumulative SGD Influence Estimation for Data Attribution
【速读】:该论文旨在解决标准随机梯度下降影响估计(SGD-IE)在现代数据驱动型人工智能中对单样本影响估计精度不足的问题。传统方法通过累加每轮的代理项来近似“删除一个样本”的影响,但忽略了跨训练轮次的影响累积效应,导致关键样本排序失真。其解决方案的核心是提出ACC-SGD-IE(Accumulative SGD Influence Estimator),这是一种轨迹感知的估计器,能够在训练过程中传播删除一个样本的扰动,并在每个优化步骤中更新累积影响状态。该方法在平滑强凸场景下实现几何误差收缩,在平滑非凸情形下进一步收紧误差界,且更大的小批量(mini-batch)可降低常数因子。实验表明,ACC-SGD-IE在多种数据集和训练设置下均能提供更精确的影响估计,尤其在长训练周期中表现优异,并显著提升下游数据清洗任务中噪声样本识别的可靠性。
链接: https://arxiv.org/abs/2510.26185
作者: Yunxiao Shi,Shuo Yang,Yixin Su,Rui Zhang,Min Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups, and MNIST under clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long epochs. For downstream data cleansing it more reliably flags noisy samples, producing models trained on ACC-SGD-IE cleaned data that outperform those cleaned with SGD-IE.
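ACC-SGD-IE 的关键是沿训练轨迹传播“删除一个样本”的扰动:每步令 δ ← δ - ηHδ,当该样本恰好出现在小批量中时再加入梯度修正项。下面以线性回归(梯度与 Hessian 有闭式)写出一阶近似的草图;推导细节与所有超参数均为按摘要思路补全的假设,并非论文原始算法。

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, eta, b = 200, 10, 0.05, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
z = 0                        # 待估计留一影响的样本索引

theta = np.zeros(d)
delta = np.zeros(d)          # 累积影响状态:theta^{-z} - theta 的一阶近似
for epoch in range(5):
    for s in range(0, n, b):
        B = np.arange(s, min(s + b, n))
        Xb, yb = X[B], y[B]
        g = Xb.T @ (Xb @ theta - yb) / len(B)   # 小批量梯度
        H = Xb.T @ Xb / len(B)                  # 小批量 Hessian(线性回归下为常数)
        if z in B:                              # 删除 z 导致的梯度变化
            gz = X[z] * (X[z] @ theta - y[z])
            corr = (g - gz) / (len(B) - 1)      # g_{B\z} - g_B 的闭式
        else:
            corr = 0.0
        delta = delta - eta * (H @ delta) - eta * corr  # 跨步传播并复合
        theta = theta - eta * g                         # 正常 SGD 更新

print("留一影响估计的范数:", round(float(np.linalg.norm(delta)), 4))
```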
zh
[AI-49] Linking Heterogeneous Data with Coordinated Agent Flows for Social Media Analysis
【速读】:该论文旨在解决社会媒体分析中因数据异构性(heterogeneity)带来的探索性挑战,即如何从包含文本、网络结构和行为数据等多模态信息的复杂数据中有效挖掘有意义的洞察。现有自动化方法受限于对结构化表格数据的依赖,难以整合多样化的数据类型与分析流程。其解决方案的关键在于提出SIA(Social Insight Agents)系统,该系统通过一个基于自下而上的洞察类型分类法(bottom-up taxonomy)驱动的代理流(agent flows),将原始输入、中间结果、分析产出与可视化成果进行统一协调,并引入数据协调器(data coordinator)实现多模态数据的一致性融合,从而支持可解释、可交互且具备适应性的社会媒体分析流程。
链接: https://arxiv.org/abs/2510.26172
作者: Shifu Chen,Dazhen Deng,Zhihong Xu,Sijia Xu,Tai-Quan Peng,Yingcai Wu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Social media platforms generate massive volumes of heterogeneous data, capturing user behaviors, textual content, temporal dynamics, and network structures. Analyzing such data is crucial for understanding phenomena such as opinion dynamics, community formation, and information diffusion. However, discovering insights from this complex landscape is exploratory, conceptually challenging, and requires expertise in social media mining and visualization. Existing automated approaches, though increasingly leveraging large language models (LLMs), remain largely confined to structured tabular data and cannot adequately address the heterogeneity of social media analysis. We present SIA (Social Insight Agents), an LLM agent system that links heterogeneous multi-modal data – including raw inputs (e.g., text, network, and behavioral data), intermediate outputs, mined analytical results, and visualization artifacts – through coordinated agent flows. Guided by a bottom-up taxonomy that connects insight types with suitable mining and visualization techniques, SIA enables agents to plan and execute coherent analysis strategies. To ensure multi-modal integration, it incorporates a data coordinator that unifies tabular, textual, and network data into a consistent flow. Its interactive interface provides a transparent workflow where users can trace, validate, and refine the agent’s reasoning, supporting both adaptability and trustworthiness. Through expert-centered case studies and quantitative evaluation, we show that SIA effectively discovers diverse and meaningful insights from social media while supporting human-agent collaboration in complex analytical tasks.
zh
[AI-50] Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series
【速读】:该论文旨在解决多变量工业时间序列中异常检测(anomaly detection)的性能优化问题,特别是在蒸汽轮机系统这一复杂场景下。其关键解决方案在于采用基于数据分段的简单集成模型——随机森林(Random Forest)与XGBoost的组合,而非依赖复杂的特征工程(如变化点统计特征、聚类子结构表示)或混合模型架构。实验证明,该简化方案在高度不平衡且存在时序不确定性的数据上表现最优,实现了0.976的AUC-ROC、0.41的F1分数以及定义时间窗口内的100%早期检测率,凸显了模型简洁性与合理数据分割策略在实际工业场景中的优越性。
链接: https://arxiv.org/abs/2510.26159
作者: Emilio Mastriani,Alessandro Costa,Federico Incardona,Kevin Munari,Sebastiano Spinello
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  This paper is currently under review for presentation at the IEEE SAMI 2026 Conference
Abstract:In this study, we investigate the effectiveness of advanced feature engineering and hybrid model architectures for anomaly detection in a multivariate industrial time series, focusing on a steam turbine system. We evaluate the impact of change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies on detection performance. Despite their theoretical appeal, these complex approaches consistently underperformed compared to a simple Random Forest + XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of 0.976, F1-score of 0.41, and 100% early detection within the defined time window. Our findings highlight that, in scenarios with highly imbalanced and temporally uncertain data, model simplicity combined with optimized segmentation can outperform more sophisticated architectures, offering greater robustness, interpretability, and operational utility.
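文中胜出的方案是“分段后训练简单集成”。下面的草图按固定窗口切分多变量序列、提取均值与方差特征,并平均两个树模型的异常概率;其中以 sklearn 的 GradientBoosting 代替 XGBoost 以免引入额外依赖,属于笔者的替换假设,数据亦为合成玩具。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(4)
# 玩具多变量时间序列:2000 个时间步、5 个传感器,一段注入异常
series = rng.normal(size=(2000, 5))
labels_t = np.zeros(2000, dtype=int)
labels_t[800:830] = 1
series[800:830] += 2.0

def segment(series, labels_t, win=50):
    """按固定窗口切分并提取简单统计特征;窗口内含异常则整窗标 1。"""
    X, y = [], []
    for s in range(0, len(series) - win + 1, win):
        w = series[s:s + win]
        X.append(np.concatenate([w.mean(0), w.std(0)]))
        y.append(int(labels_t[s:s + win].any()))
    return np.array(X), np.array(y)

X, y = segment(series, labels_t)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)   # XGBoost 的替身
score = 0.5 * rf.predict_proba(X)[:, 1] + 0.5 * gb.predict_proba(X)[:, 1]
print("异常窗口的平均集成得分:", round(float(score[y == 1].mean()), 3))
```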
zh
[AI-51] Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment EMNLP2025
【速读】:该论文旨在解决现有分子-文本表示学习模型难以捕捉分子结构与其描述之间细微差异的问题,核心挑战在于缺乏对分子子结构与化学短语之间细粒度对齐关系的学习能力。解决方案的关键在于提出MolBridge框架,其通过引入基于分子子结构和化学短语的额外对齐信号来增强原始分子-描述配对数据,并采用子结构感知的对比学习策略结合自精炼机制以过滤噪声对齐信号,从而有效建模细粒度对应关系,在多个分子基准测试中显著优于当前最优基线方法。
链接: https://arxiv.org/abs/2510.26157
作者: Hyuntae Park,Yeachan Kim,SangKeun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  EMNLP 2025 (main)
Abstract:Molecule and text representation learning has gained increasing interest due to its potential for enhancing the understanding of chemical information. However, existing models often struggle to capture subtle differences between molecules and their descriptions, as they lack the ability to learn fine-grained alignments between molecular substructures and chemical phrases. To address this limitation, we introduce MolBridge, a novel molecule-text learning framework based on substructure-aware alignments. Specifically, we augment the original molecule-description pairs with additional alignment signals derived from molecular substructures and chemical phrases. To effectively learn from these enriched alignments, MolBridge employs substructure-aware contrastive learning, coupled with a self-refinement mechanism that filters out noisy alignment signals. Experimental results show that MolBridge effectively captures fine-grained correspondences and outperforms state-of-the-art baselines on a wide range of molecular benchmarks, highlighting the significance of substructure-aware alignment in molecule-text learning.
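其底层的对称 InfoNCE 对比损失可如下实现;embedding 用随机数代替,仅验证数值流程。论文在此之上叠加“子结构-化学短语”对齐信号与自精炼去噪,这些部分此处未涉及。

```python
import numpy as np

def info_nce(mol_emb, txt_emb, tau=0.07):
    """对称 InfoNCE 的极简实现:第 i 个分子与第 i 条描述为正对,
    批内其余样本为负对。tau 为温度(取值为常见假设)。"""
    m = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = m @ t.T / tau
    labels = np.arange(len(m))
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_m2t = -log_p[labels, labels].mean()
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2m = -log_p_t[labels, labels].mean()
    return 0.5 * (loss_m2t + loss_t2m)

rng = np.random.default_rng(7)
print(info_nce(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```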
zh
[AI-52] he FM Agent
【速读】:该论文旨在解决复杂现实世界挑战中自动化科学与工程发现的难题,特别是如何在无需人工干预的情况下实现高效、自主的智能决策与优化。其解决方案的关键在于提出了一种名为FM Agent的通用多智能体框架,该框架通过LLM(Large Language Model)推理与大规模进化搜索的协同作用,实现了对多样化任务的自主求解;核心创新包括:冷启动初始化阶段引入专家引导以提升初始性能,设计新颖的进化采样策略用于迭代优化,构建领域特定评估器融合正确性、有效性及LLM监督反馈,以及基于Ray的分布式异步执行架构,从而在多个实际场景中达到或超越当前最优水平,如ALE-Bench、MLE-Bench、GPU内核优化和经典数学问题等。
链接: https://arxiv.org/abs/2510.26144
作者: Annan Li,Chufan Wu,Zengle Ge,Yee Hin Chong,Zhinan Hou,Lizhe Cao,Cheng Ju,Jianmin Wu,Huaiming Li,Haobo Zhang,Shenghao Feng,Mo Zhao,Fengzhi Qiu,Rui Yang,Mengmeng Zhang,Wenyi Zhu,Yingying Sun,Quan Sun,Shunhao Yan,Danyu Liu,Dawei Yin,Dou Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are catalyzing the development of autonomous AI research agents for scientific and engineering discovery. We present FM Agent, a novel and general-purpose multi-agent framework that leverages a synergistic combination of LLM-based reasoning and large-scale evolutionary search to address complex real-world challenges. The core of FM Agent integrates several key innovations: 1) a cold-start initialization phase incorporating expert guidance, 2) a novel evolutionary sampling strategy for iterative optimization, 3) domain-specific evaluators that combine correctness, effectiveness, and LLM-supervised feedback, and 4) a distributed, asynchronous execution infrastructure built on Ray. Demonstrating broad applicability, our system has been evaluated across diverse domains, including operations research, machine learning, GPU kernel optimization, and classical mathematical problems. FM Agent reaches state-of-the-art results autonomously, without human interpretation or tuning – 1976.3 on ALE-Bench (+5.2%), 43.56% on MLE-Bench (+4.0pp), up to 20x speedups on KernelBench, and establishes new state-of-the-art (SOTA) results on several classical mathematical problems. Beyond academic benchmarks, FM Agent shows considerable promise for both large-scale enterprise R&D workflows and fundamental scientific research, where it can accelerate innovation, automate complex discovery processes, and deliver substantial engineering and scientific advances with broader societal impact.
zh
[AI-53] Beyond Benchmarks: The Economics of AI Inference
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)推理成本过高对其商业化可行性和广泛应用构成的制约问题。其解决方案的关键在于构建了一个定量的“推理经济学”框架,将LLM推理过程视为由计算驱动的智能生产活动,并基于WiNEval-3.0的实证数据,首次绘制出“LLM推理生产前沿面”,揭示了边际成本递减、规模收益递减以及成本效益最优区间的三大原则,从而为模型部署决策提供经济依据,并为未来基于市场的AI推理资源定价与优化奠定实证基础。
链接: https://arxiv.org/abs/2510.26136
作者: Boqin Zhuang,Jiacheng Qiao,Mingqian Liu,Mingxing Yu,Ping Hong,Rui Li,Xiaoxia Song,Xiangjun Xu,Xu Chen,Yaoyao Ma,Yujie Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The inference cost of Large Language Models (LLMs) has become a critical factor in determining their commercial viability and widespread adoption. This paper introduces a quantitative “economics of inference” framework, treating the LLM inference process as a compute-driven intelligent production activity. We analyze its marginal cost, economies of scale, and quality of output under various performance configurations. Based on empirical data from WiNEval-3.0, we construct the first “LLM Inference Production Frontier”, revealing three principles: diminishing marginal cost, diminishing returns to scale, and an optimal cost-effectiveness zone. This paper not only provides an economic basis for model deployment decisions but also lays an empirical foundation for the future market-based pricing and optimization of AI inference resources.
zh
[AI-54] Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在真实软件项目中生成类级别代码(class-level implementations)时正确性不足的问题,即其在合成基准测试上表现良好,但在实际应用场景中泛化能力严重受限。解决方案的关键在于构建一个基于开源仓库的真实世界类任务基准(real-world class tasks benchmark),将代码划分为已见(seen)与未见(unseen)分区以评估模型在实际场景下的泛化性能,并系统性地考察输入规范、检索增强生成(retrieval-augmented generation, RAG)配置以及文档完整性对模型表现的影响。实验表明,RAG在部分文档条件下能显著提升正确性(4%–7%),尤其通过提供缺失的实现模式弥补规范不足;同时发现逻辑错误主要源于AttributeError、TypeError和AssertionError(占84%),提示需改进上下文建模与依赖管理策略,从而为生产级代码辅助工具的设计提供关键优化方向。
链接: https://arxiv.org/abs/2510.26130
作者: Musfiqur Rahman,SayedHassan Khatoonabadi,Emad Shihab
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:  Pre-print prepared for journal submission
Abstract:Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels. Results reveal a stark performance disparity: LLMs achieve 84% to 89% correctness on established synthetic benchmarks but only 25% to 34% on real-world class tasks, with negligible differences between familiar and novel codebases. Comprehensive docstrings yield modest gains of 1% to 3% in functional accuracy, though statistical significance is rare. Retrieval-augmented generation proves most effective with partial documentation, improving correctness by 4% to 7% by supplying concrete implementation patterns absent from specifications. Error profiling identifies AttributeError, TypeError, and AssertionError as dominant failure modes (84% of cases), with synthetic tests overemphasizing assertion issues and real-world scenarios highlighting type and attribute mismatches. Retrieval augmentation reduces logical flaws but can introduce dependency conflicts. The benchmark and analysis expose critical limitations in current LLM capabilities for class-level engineering, offering actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
zh
[AI-55] SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth
【速读】:该论文旨在解决当前机器学习模型评估中普遍存在的问题:即仅基于测试集上样本的平均损失来衡量性能,忽略了地理空间分布的非均匀性(如人类发展水平和地形差异),导致对模型在不同区域或群体中的实际表现评估失真。其解决方案的关键在于提出Stratified Assessments of Forecasts over Earth (SAFE)——一个用于解析地球尺度预测模型分层性能的工具包,通过整合多源数据,将预测结果按领土(通常为国家)、全球子区域、收入水平和土地覆盖类型(陆地或水域)等属性进行分层评估,从而实现对每个子群组(如单个国家)的精细化性能分析。这一方法首次系统性地揭示了模型在全球范围内的表现差异,并支持从公平性角度比较不同模型在不同预报时效和气候变量下的表现。
链接: https://arxiv.org/abs/2510.26099
作者: Nick Masi,Randall Balestriero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The dominant paradigm in machine learning is to assess model performance based on average loss across all samples in some test set. This amounts to averaging performance geospatially across the Earth in weather and climate settings, failing to account for the non-uniform distribution of human development and geography. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates various data domains to stratify by different attributes associated with geospatial gridpoints: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine the performance of models for each individual stratum of the different attributes (e.g., the accuracy in every individual country). To demonstrate its importance, we utilize SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that they all exhibit disparities in forecasting skill across every attribute. We use this to seed a benchmark of model forecast fairness through stratification at different lead times for various climatic variables. By moving beyond globally-averaged metrics, we for the first time ask: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is open source and available at this https URL
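SAFE 的核心动作是把逐格点预测按国家、收入组等属性分层后分别计算指标,而非只报告全局平均。下面用 pandas 给出最小示意;国家、收入分组与数值均为随机合成数据,指标选用 RMSE 仅为演示。

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 1000
df = pd.DataFrame({
    "country": rng.choice(["KEN", "NOR", "BRA", "JPN"], size=n),
    "income_group": rng.choice(["low", "high"], size=n),
    "y_true": rng.normal(288, 10, size=n),        # 例如 2m 气温(单位 K)
})
df["y_pred"] = df["y_true"] + rng.normal(0, 2, size=n)

def rmse(g):
    return float(np.sqrt(((g["y_pred"] - g["y_true"]) ** 2).mean()))

# 分层评估:逐国家、逐收入组分别报告指标,而非全局平均
by_country = df.groupby("country").apply(rmse)
by_income = df.groupby("income_group").apply(rmse)
print(by_country, by_income, sep="\n")
```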
zh
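针对上文 SAFE 的分层评估思路,下面用 pandas 给出一个极简示意(假设性代码,并非 SAFE 官方实现,字段名与数据均为虚构):同一组预测误差按领土、收入等属性分组后,分层 RMSE 与全局平均 RMSE 可能给出截然不同的结论。

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# 模拟格点级预报误差数据:字段名均为示意,非 SAFE 官方数据格式
df = pd.DataFrame({
    "territory": rng.choice(["MWI", "USA", "BRA", "IND"], size=n),
    "income": rng.choice(["low", "high"], size=n),
    "y_true": rng.normal(0, 1, size=n),
})
df["y_pred"] = df["y_true"] + rng.normal(0, 0.5, size=n)
df["sq_err"] = (df["y_true"] - df["y_pred"]) ** 2

# 全局平均 RMSE:传统的"平均损失"做法
global_rmse = np.sqrt(df["sq_err"].mean())

# 分层 RMSE:按属性分组,暴露子群体之间的表现差异
by_territory = df.groupby("territory")["sq_err"].mean().pow(0.5)
by_income = df.groupby("income")["sq_err"].mean().pow(0.5)

print(f"global RMSE = {global_rmse:.3f}")
print(by_territory.round(3))
# 一个简单的"公平性"信号:最差与最好子群体的 RMSE 之比
print("disparity ratio =", round(by_territory.max() / by_territory.min(), 3))
```

真实的 SAFE 工具包还需将格点坐标与国界、收入、土地覆盖等数据集对齐,此处省略了这一空间连接步骤。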
[AI-56] GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks
【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision Language Models, VLMs)在图形用户界面(Graphical User Interface, GUI)任务自动化中表现仍落后于人类的问题。作者认为,这一差距主要源于现有训练方法(如监督微调和强化学习)无法充分覆盖GUI任务所需的核心知识。解决方案的关键在于通过分析GUI任务执行中的常见失败模式,将GUI知识系统性地提炼为三个维度:界面感知(interface perception)、交互预测(interaction prediction)和指令理解(instruction understanding),并构建了GUI Knowledge Bench基准测试平台,用于量化评估VLMs在这三个维度上的能力。实验表明,当前VLMs虽能识别控件功能,但在系统状态感知、动作预测和任务完成验证方面存在明显不足,且这些知识维度与实际GUI任务成功率密切相关,从而为筛选具备潜力的VLMs及开发更强大的GUI代理提供了结构化框架和实证依据。
链接: https://arxiv.org/abs/2510.26098
作者: Chenrui Shi,Zedong Yu,Zhi Gao,Ruining Feng,Enqi Liu,Yuwei Wu,Yunde Jia,Liuyu Xiang,Zhaofeng He,Qing Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning action state transitions; and (3) instruction understanding, knowledge about planning, verifying, and assessing task completion progress. We further introduce GUI Knowledge Bench, a benchmark with multiple choice and yes/no questions across six platforms (Web, Android, MacOS, Windows, Linux, IOS) and 292 applications. Our evaluation shows that current VLMs identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.
zh
[AI-57] Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
【速读】:该论文旨在解决大学物理问题在形式化推理系统中缺乏高质量基准和基础库支撑的问题。其核心挑战在于如何构建一个既贴近真实物理教学与竞赛场景、又能支持严谨数学逻辑验证的推理框架。解决方案的关键在于提出了 Lean4PHYS 框架,包含两个核心组件:一是 LeanPhysBench,一个由200个手工构造并经同行评审的物理命题组成的大学水平基准测试集,源自大学教材和物理竞赛题;二是 PhysLib,一个社区驱动的基础物理定理与单位制库,为形式化物理推理提供可复用的核心知识模块。实验表明,该框架显著提升了模型在物理推理任务上的表现(如使用 PhysLib 后平均提升11.75%),且首次在 Lean4 中实现了针对物理问题的系统性形式化推理评估。
链接: https://arxiv.org/abs/2510.26094
作者: Yuxin Li,Minghao Liu,Ruida Wang,Wenzhao Ji,Zhitao He,Rui Pan,Junming Huang,Tong Zhang,Yi R. Fung
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We present Lean4PHYS, a comprehensive reasoning framework for college-level physics problems in Lean4. Lean4PHYS includes LeanPhysBench, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce PhysLib, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in Lean4PHYS, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our PhysLib can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our LeanPhysBench and the effectiveness of PhysLib. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.
zh
[AI-58] Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing
【速读】:该论文旨在解决城市道路网络中多车辆动态路径规划问题,传统最短路径优先(Shortest Path First, SPF)算法在动态环境中因缺乏协调性而导致路径重叠、加剧拥堵。其核心解决方案是提出两种基于多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)的框架:一是自适应导航(Adaptive Navigation, AN),通过图注意力网络(Graph Attention Networks, GAT)建模局部与邻域状态,实现去中心化决策;二是分层枢纽式自适应导航(Hierarchical Hub-based Adaptive Navigation, HHAN),采用集中训练、分散执行(Centralized Training with Decentralized Execution, CTDE)策略,仅在关键枢纽节点部署代理,并结合注意力机制融合异步车辆决策,同时引入流量感知的状态特征以实现前瞻性调度。该方法显著提升了大规模路网下的通行效率与可扩展性。
链接: https://arxiv.org/abs/2510.26089
作者: Fazel Arasteh,Arian Haghparast,Manos Papagelis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:  29 pages, 12 figures. Fazel Arasteh and Arian Haghparast contributed equally to this research. Submitted to ACM Transactions on Spatial Algorithms and Systems (TSAS). The code for this work is publicly available at this https URL
Abstract:Traffic congestion in urban road networks leads to longer trip times and higher emissions, especially during peak periods. While the Shortest Path First (SPF) algorithm is optimal for a single vehicle in a static network, it performs poorly in dynamic, multi-vehicle settings, often worsening congestion by routing all vehicles along identical paths. We address dynamic vehicle routing through a multi-agent reinforcement learning (MARL) framework for coordinated, network-aware fleet navigation. We first propose Adaptive Navigation (AN), a decentralized MARL model where each intersection agent provides routing guidance based on (i) local traffic and (ii) neighborhood state modeled using Graph Attention Networks (GAT). To improve scalability in large networks, we further propose Hierarchical Hub-based Adaptive Navigation (HHAN), an extension of AN that assigns agents only to key intersections (hubs). Vehicles are routed hub-to-hub under agent control, while SPF handles micro-routing within each hub region. For hub coordination, HHAN adopts centralized training with decentralized execution (CTDE) under the Attentive Q-Mixing (A-QMIX) framework, which aggregates asynchronous vehicle decisions via attention. Hub agents use flow-aware state features that combine local congestion and predictive dynamics for proactive routing. Experiments on synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces average travel time versus SPF and learning baselines, maintaining 100% routing success. HHAN scales to networks with hundreds of intersections, achieving up to 15.9% improvement under heavy traffic. These findings highlight the potential of network-constrained MARL for scalable, coordinated, and congestion-aware routing in intelligent transportation systems.
zh
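摘要未展开 Attentive Q-Mixing(A-QMIX)的网络细节,下面给出一个基于标准 QMIX 思路的假设性最小实现(PyTorch):先用注意力权重聚合各枢纽代理的 Q 值,再由超网络生成非负权重的单调混合网络;真实结构以论文公开代码为准。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveQMix(nn.Module):
    """示意性 A-QMIX:注意力聚合个体 Q 值 + 单调混合网络(假设性实现)。"""
    def __init__(self, state_dim, n_agents, embed_dim=32):
        super().__init__()
        self.embed_dim = embed_dim
        self.attn_score = nn.Linear(state_dim, n_agents)           # 按全局状态给各代理打分
        self.hyper_w = nn.Linear(state_dim, n_agents * embed_dim)  # 超网络生成混合权重
        self.hyper_b = nn.Linear(state_dim, embed_dim)
        self.out = nn.Linear(embed_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (B, n_agents) 各枢纽代理的 Q 值;state: (B, state_dim) 全局交通状态
        alpha = F.softmax(self.attn_score(state), dim=-1)           # 注意力权重
        weighted = (agent_qs * alpha).unsqueeze(1)                  # (B, 1, n_agents)
        # 对超网络输出取绝对值,保证 Q_tot 对个体 Q 值单调(QMIX 的标准约束)
        w = torch.abs(self.hyper_w(state)).view(-1, agent_qs.size(1), self.embed_dim)
        hidden = F.elu(torch.bmm(weighted, w).squeeze(1) + self.hyper_b(state))
        return self.out(hidden)                                     # Q_tot: (B, 1)

mixer = AttentiveQMix(state_dim=16, n_agents=4)
print(mixer(torch.randn(8, 4), torch.randn(8, 16)).shape)  # torch.Size([8, 1])
```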
[AI-59] Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
【速读】:该论文旨在解决传统大语言模型(Large Language Models, LLMs)在保持广泛通用能力的同时,难以在特定领域(如医学影像)实现专家级性能的问题。现有架构(如Transformer、线性注意力机制及混合模型)缺乏基于任务信息引导的专用记忆机制,导致其对领域偏移适应能力有限。解决方案的关键在于提出Nirvana模型,其核心创新是引入任务感知的记忆触发机制(Task-Aware Memory Trigger),该机制将每个输入样本视为自监督微调任务,动态调整模型参数以响应当前任务需求;同时设计了专用记忆更新器(Specialized Memory Updater),根据Trigger的指导动态存储上下文信息。这一机制使模型在测试阶段即可利用任务信息灵活调整记忆策略,从而在不修改骨干网络的情况下实现对MRI等专业任务的有效适应,并在医学图像重建和临床报告生成中优于传统方法。
链接: https://arxiv.org/abs/2510.26083
作者: Yuhua Jiang,Shuang Cheng,Yihao Liu,Ermo Hua,Che Jiang,Weigao Sun,Yu Cheng,Feifei Gao,Biqing Qi,Bowen Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Specialized Generalist Models (SGMs) aim to preserve broad capabilities while achieving expert-level performance in target domains. However, traditional LLM structures including Transformer, Linear Attention, and hybrid models do not employ specialized memory mechanism guided by task information. In this paper, we present Nirvana, an SGM with specialized memory mechanism, linear time complexity, and test-time task information extraction. Besides, we propose the Task-Aware Memory Trigger (Trigger) that flexibly adjusts memory mechanism based on the current task’s requirements. In Trigger, each incoming sample is treated as a self-supervised fine-tuning task, enabling Nirvana to adapt its task-related parameters on the fly to domain shifts. We also design the Specialized Memory Updater (Updater) that dynamically memorizes the context guided by Trigger. We conduct experiments on both general language tasks and specialized medical tasks. On a variety of natural language modeling benchmarks, Nirvana achieves competitive or superior results compared to the existing LLM structures. To prove the effectiveness of Trigger on specialized tasks, we test Nirvana’s performance on a challenging medical task, i.e., Magnetic Resonance Imaging (MRI). We post-train frozen Nirvana backbone with lightweight codecs on paired electromagnetic signals and MRI images. Despite the frozen Nirvana backbone, Trigger guides the model to adapt to the MRI domain with the change of task-related parameters. Nirvana achieves higher-quality MRI reconstruction compared to conventional MRI models as well as the models with traditional LLMs’ backbone, and can also generate accurate preliminary clinical reports accordingly.
zh
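Trigger 的核心是把每个输入样本当作一次自监督微调任务。下面是这一思想的极简示意(假设性实现,网络结构与论文无关):冻结骨干,仅用重构损失在测试时更新少量任务相关参数。

```python
import torch
import torch.nn as nn

# 极简示意(假设性):冻结骨干 + 少量"任务相关参数",
# 对每个到来的样本先做一步自监督更新,再执行预测。
backbone = nn.Linear(16, 16)
for p in backbone.parameters():
    p.requires_grad_(False)          # 对应论文中 frozen backbone 的设定

task_adapter = nn.Linear(16, 16)     # 任务相关参数:Trigger 动态调整的对象
head = nn.Linear(16, 4)              # 假设的下游任务头

def trigger_step(x, lr=1e-2):
    """把单个样本当作一次自监督微调任务:以重构损失更新 adapter。"""
    opt = torch.optim.SGD(task_adapter.parameters(), lr=lr)
    recon = task_adapter(backbone(x))
    loss = nn.functional.mse_loss(recon, x)   # 自监督信号:重构输入
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x = torch.randn(1, 16)
print("self-supervised loss:", trigger_step(x))
with torch.no_grad():
    y = head(task_adapter(backbone(x)))       # 适应后的前向预测
print(y.shape)  # torch.Size([1, 4])
```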
[AI-60] Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization
【速读】:该论文旨在解决传统机器学习模型在固定几何空间中进行参数优化所导致的表达能力受限问题,即模型无法动态适应数据分布的内在结构。其核心解决方案是将模型本身视为可变形的几何实体,通过优化定义在预设拓扑流形上的度量张量场(metric tensor field),从而动态调整模型空间的几何结构。关键在于构建一个变分框架,其中损失函数同时权衡数据保真度与流形的内在几何复杂度:前者确保模型对观测数据的有效拟合,后者作为正则项惩罚过于弯曲或不规则的几何形态,以防止过拟合并促进简洁模型。为应对无限维优化的计算挑战,作者引入基于离散微分几何的方法,将连续流形离散化为三角网格,并用边长参数化度量张量,借助自动微分实现高效优化。这一方法揭示了与广义相对论中爱因斯坦-希尔伯特作用量的深刻类比,为“数据驱动几何”提供了物理解释,并表明即使拓扑固定,度量优化也能显著提升模型的表达能力。
链接: https://arxiv.org/abs/2510.26068
作者: Di Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Statistics Theory (math.ST)
备注:  9 pages
Abstract:This paper proposes a novel paradigm for machine learning that moves beyond traditional parameter optimization. Unlike conventional approaches that search for optimal parameters within a fixed geometric space, our core idea is to treat the model itself as a malleable geometric entity. Specifically, we optimize the metric tensor field on a manifold with a predefined topology, thereby dynamically shaping the geometric structure of the model space. To achieve this, we construct a variational framework whose loss function carefully balances data fidelity against the intrinsic geometric complexity of the manifold. The former ensures the model effectively explains observed data, while the latter acts as a regularizer, penalizing overly curved or irregular geometries to encourage simpler models and prevent overfitting. To address the computational challenges of this infinite-dimensional optimization problem, we introduce a practical method based on discrete differential geometry: the continuous manifold is discretized into a triangular mesh, and the metric tensor is parameterized by edge lengths, enabling efficient optimization using automatic differentiation tools. Theoretical analysis reveals a profound analogy between our framework and the Einstein-Hilbert action in general relativity, providing an elegant physical interpretation for the concept of “data-driven geometry”. We further argue that even with fixed topology, metric optimization offers significantly greater expressive power than models with fixed geometry. This work lays a solid foundation for constructing fully dynamic “meta-learners” capable of autonomously evolving their geometry and topology, and it points to broad application prospects in areas such as scientific model discovery and robust representation learning.
zh
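下面用 PyTorch 自动微分给出"以边长参数化度量并做变分优化"的最小示意(假设性代码:此处用相邻边长的平滑项代替论文中的几何复杂度正则,目标距离亦为虚构数据,仅演示"数据保真 + 几何正则"的优化框架)。

```python
import torch

# 固定拓扑的三角网格:边集不变,只优化度量(即边长)
edges = [(0, 1), (1, 2), (2, 0), (1, 3), (2, 3)]
target = torch.tensor([1.0, 1.2, 0.9, 1.1, 1.0])       # 观测到的成对距离(示意数据)

log_len = torch.zeros(len(edges), requires_grad=True)   # 优化 log 边长以保证正性
opt = torch.optim.Adam([log_len], lr=0.05)

for step in range(200):
    lengths = log_len.exp()
    fidelity = ((lengths - target) ** 2).mean()          # 数据保真项
    # 几何复杂度惩罚:此处用相邻边长平滑作替代,论文中惩罚的是曲率等内在量
    complexity = ((log_len[1:] - log_len[:-1]) ** 2).mean()
    loss = fidelity + 0.1 * complexity
    opt.zero_grad(); loss.backward(); opt.step()

print(lengths.detach().round(decimals=3))
```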
[AI-61] Can AI be Accountable?
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)系统缺乏问责机制的问题,即AI在实际应用中往往无法被用户、决策者或公众有效追问、讨论或制裁,从而导致其权力滥用风险上升。解决方案的关键在于将通用的问责定义(accountability)映射到AI领域:要求AI具备可追溯性(可被请求信息)、可交互性(能与相关方进行讨论)以及可约束性(可被施加惩罚),并通过技术与制度设计推动实现所有AI系统对受影响群体的全面问责。
链接: https://arxiv.org/abs/2510.26057
作者: Andrew L. Kun
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:  To be published as a chapter in Daniele Quercia and Marios Constantinides (Eds.). Operationalizing Responsible AI. Cambridge University Press. Forthcoming
Abstract:The AI we use is powerful, and its power is increasing rapidly. If this powerful AI is to serve the needs of consumers, voters, and decision makers, then it is imperative that the AI is accountable. In general, an agent is accountable to a forum if the forum can request information from the agent about its actions, if the forum and the agent can discuss this information, and if the forum can sanction the agent. Unfortunately, in too many cases today’s AI is not accountable – we cannot question it, enter into a discussion with it, let alone sanction it. In this chapter we relate the general definition of accountability to AI, we illustrate what it means for AI to be accountable and unaccountable, and we explore approaches that can improve our chances of living in a world where all AI is accountable to those who are affected by it.
zh
[AI-62] Large Language Model-assisted Autonomous Vehicle Recovery from Immobilization
【速读】:该论文旨在解决自动驾驶车辆(AV)在特定交通场景中因无法自主决策而陷入停滞(immobilization)的问题,此类情况常导致交通流中断。现有解决方案如远程干预成本高、效率低,手动接管则限制了非驾驶员群体的可用性。其关键创新在于提出StuckSolver——一个基于大语言模型(Large Language Model, LLM)的恢复框架,通过自推理或乘客引导的方式实现故障解除;该框架作为插件模块集成于AV现有感知-规划-控制栈之上,无需修改底层架构,仅需接入标准传感器数据流以识别停滞状态、解析环境上下文并生成高层恢复指令,由原生规划器执行,从而显著提升AV在复杂不确定性场景下的鲁棒性和可用性。
链接: https://arxiv.org/abs/2510.26023
作者: Zhipeng Bao,Qianwen Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:  8 pages
Abstract:Despite significant advancements in recent decades, autonomous vehicles (AVs) continue to face challenges in navigating certain traffic scenarios where human drivers excel. In such situations, AVs often become immobilized, disrupting overall traffic flow. Current recovery solutions, such as remote intervention (which is costly and inefficient) and manual takeover (which excludes non-drivers and limits AV accessibility), are inadequate. This paper introduces StuckSolver, a novel Large Language Model (LLM) driven recovery framework that enables AVs to resolve immobilization scenarios through self-reasoning and/or passenger-guided decision-making. StuckSolver is designed as a plug-in add-on module that operates on top of the AV’s existing perception-planning-control stack, requiring no modification to its internal architecture. Instead, it interfaces with standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands that can be executed by the AV’s native planner. We evaluate StuckSolver on the Bench2Drive benchmark and in custom-designed uncertainty scenarios. Results show that StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone and exhibits further improvements when passenger guidance is incorporated.
zh
[AI-63] RADRON: Cooperative Localization of Ionizing Radiation Sources by MAVs with Compton Cameras
【速读】:该论文旨在解决如何在复杂环境中高效、实时定位放射性物质的问题,尤其针对传统辐射探测方法在移动性和灵敏度方面的局限。解决方案的关键在于利用微型无人飞行器(Micro Aerial Vehicles, MAVs)协同搭载高灵敏度、轻量化(40 g)的单探测器康普顿相机(Compton camera),通过实时融合稀疏测量数据来估计辐射源位置,并将数据处理与反馈控制直接集成于机载系统中,实现多架MAVs以紧密协作的蜂群方式动态调整飞行路径,从而快速定位甚至追踪移动辐射源。
链接: https://arxiv.org/abs/2510.26018
作者: Petr Stibinger,Tomas Baca,Daniela Doubravova,Jan Rusnak,Jaroslav Solc,Jan Jakubek,Petr Stepan,Martin Saska
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:  8 pages, 9 figures, submitted for review to IEEE RA-L
Abstract:We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector’s exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.
zh
[AI-64] Dual Mixture-of-Experts Framework for Discrete-Time Survival Analysis NEURIPS2025 ALT
【速读】:该论文旨在解决生存分析中如何同时建模患者异质性(patient heterogeneity)并适应个体特征与时间动态变化的挑战。其解决方案的关键在于提出一种双混合专家(dual mixture-of-experts, dual-MoE)框架:一方面通过特征编码器混合专家(feature-encoder MoE)实现子群感知的表示学习,另一方面通过风险函数混合专家(hazard MoE)结合患者特征与时序嵌入以捕捉时间动态变化。该设计可灵活集成至现有的基于深度学习的生存分析流程中,在METABRIC和GBSG乳腺癌数据集上显著提升预测性能,测试集时间依赖C指数最高提升0.04,并在Consurv框架中进一步增益。
链接: https://arxiv.org/abs/2510.26014
作者: Hyeonjun Lee,Hyungseob Shin,Gunhee Nam,Hyeonsoo Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  Accepted to NeurIPS 2025 workshop Learning from Time Series for Health (TS4H)
Abstract:Survival analysis is a task to model the time until an event of interest occurs, widely used in clinical and biomedical research. A key challenge is to model patient heterogeneity while also adapting risk predictions to both individual characteristics and temporal dynamics. We propose a dual mixture-of-experts (MoE) framework for discrete-time survival analysis. Our approach combines a feature-encoder MoE for subgroup-aware representation learning with a hazard MoE that leverages patient features and time embeddings to capture temporal dynamics. This dual-MoE design flexibly integrates with existing deep learning based survival pipelines. On METABRIC and GBSG breast cancer datasets, our method consistently improves performance, boosting the time-dependent C-index up to 0.04 on the test sets, and yields further gains when incorporated into the Consurv framework.
zh
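下面给出双 MoE 离散时间生存模型的一个假设性 PyTorch 草图:特征编码 MoE 与风险 MoE 均为 softmax 门控的专家混合,风险按时间桶输出,生存函数由 (1 - h) 连乘得到;模块划分与超参数均为示意,非论文原始实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoE(nn.Module):
    """最小 MoE:softmax 门控对若干线性专家的输出加权(示意实现)。"""
    def __init__(self, in_dim, out_dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(n_experts))
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):
        w = F.softmax(self.gate(x), dim=-1).unsqueeze(-1)          # (..., E, 1)
        out = torch.stack([e(x) for e in self.experts], dim=-2)    # (..., E, out)
        return (w * out).sum(dim=-2)

n_bins, feat_dim, hid = 10, 12, 32
feature_moe = MoE(feat_dim, hid)         # 子群感知的特征编码 MoE
time_embed = nn.Embedding(n_bins, hid)   # 离散时间桶的嵌入
hazard_moe = MoE(2 * hid, 1)             # 融合特征与时间,输出每桶风险

def discrete_hazard(x):
    z = feature_moe(x)                                             # (B, hid)
    t = time_embed.weight.unsqueeze(0).expand(x.size(0), -1, -1)   # (B, T, hid)
    zt = torch.cat([z.unsqueeze(1).expand_as(t), t], dim=-1)       # (B, T, 2*hid)
    return torch.sigmoid(hazard_moe(zt)).squeeze(-1)               # (B, T) 每桶风险 h_k

h = discrete_hazard(torch.randn(8, feat_dim))
survival = torch.cumprod(1 - h, dim=1)   # 离散时间生存函数 S(t) = ∏ (1 - h_k)
print(h.shape, survival.shape)
```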
[AI-65] AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys KDD2025
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)领域研究文献快速增长背景下,人工撰写全面且时效性强的综述论文日益困难的问题。其解决方案的关键在于提出autosurvey2,一个基于多阶段流水线的自动化综述生成框架,核心包括:通过检索增强合成(retrieval-augmented synthesis)实现内容生成与最新文献实时整合,结合并行章节生成与迭代优化机制保障结构完整性和逻辑一致性,并引入多大语言模型(multi-LLM)评估体系对覆盖度、结构合理性和相关性进行量化评测,从而在保持高引文保真度的同时显著提升综述的结构性和主题相关性。
链接: https://arxiv.org/abs/2510.26012
作者: Siyi Wu,Chiaxin Liang,Ziqian Bi,Leyi Zhao,Tianyang Wang,Junhao Song,Yichao Zhang,Keyu Chen,Xinyuan Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  TKDD 2025
Abstract:The rapid growth of research literature, particularly in large language models (LLMs), has made producing comprehensive and current survey papers increasingly difficult. This paper introduces autosurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Quality is assessed using a multi-LLM evaluation framework that measures coverage, structure, and relevance in alignment with expert review standards. Experimental results demonstrate that autosurvey2 consistently outperforms existing retrieval-based and automated baselines, achieving higher scores in structural coherence and topical relevance while maintaining strong citation fidelity. By combining retrieval, reasoning, and automated evaluation into a unified framework, autosurvey2 provides a scalable and reproducible solution for generating long-form academic surveys and contributes a solid foundation for future research on automated scholarly writing. All code and resources are available at this https URL.
zh
[AI-66] The Quest for Reliable Metrics of Responsible AI
【速读】:该论文旨在解决当前负责任人工智能(Responsible AI)发展中存在的一个关键问题:用于评估AI系统性能的指标(metrics)本身的鲁棒性和可靠性尚未得到充分研究和保障。尽管已有大量工作通过量化指标来衡量负责任AI的进展,但这些指标是否在不同场景下稳定、可信仍缺乏系统性检验。解决方案的关键在于借鉴推荐系统领域中关于公平性指标鲁棒性的研究成果,提炼出一套适用于广泛AI应用场景(包括科学领域的人工智能,AIS)的非穷尽性指南,以指导开发更具可靠性的负责任AI评估指标体系。
链接: https://arxiv.org/abs/2510.26007
作者: Theresia Veronika Rampisela,Maria Maistro,Tuukka Ruotsalo,Christina Lioma
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:  Accepted for presentation at the AI in Science Summit 2025
Abstract:The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.
zh
[AI-67] From Queries to Insights: Agentic LLM Pipelines for Spatio-Temporal Text-to-SQL
【速读】:该论文旨在解决自然语言到SQL(NL-to-SQL)系统在处理真实场景中的时空查询时存在的局限性,这些问题包括用户模糊表述与数据库模式特定类别之间的对齐困难、时间推理能力不足以及输出选择不当。解决方案的关键在于提出一个基于代理(agentic)的流水线架构,该架构以Llama-3-SQLCoder-8B为基础模型,并通过Mistral-based ReAct代理进行任务编排,使系统能够通过模式检查、SQL生成、执行和可视化工具实现计划、分解与自适应查询调整。实验表明,该方法在纽约和东京签到数据集上的35个自然语言查询中将准确率从28.6%提升至91.4%,显著优于基线模型,同时通过地图、图表和结构化自然语言摘要提升了可用性,验证了代理编排而非更强SQL生成器本身是构建交互式地理空间助手的核心方向。
链接: https://arxiv.org/abs/2510.25997
作者: Manu Redd,Tao Zhe,Dongjie Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:  8 pages, 5 figures, GeoGenAgent’25 - ACM SIGSPATIAL
Abstract:Natural-language-to-SQL (NL-to-SQL) systems hold promise for democratizing access to structured data, allowing users to query databases without learning SQL. Yet existing systems struggle with realistic spatio-temporal queries, where success requires aligning vague user phrasing with schema-specific categories, handling temporal reasoning, and choosing appropriate outputs. We present an agentic pipeline that extends a naive text-to-SQL baseline (llama-3-sqlcoder-8b) with orchestration by a Mistral-based ReAct agent. The agent can plan, decompose, and adapt queries through schema inspection, SQL generation, execution, and visualization tools. We evaluate on 35 natural-language queries over the NYC and Tokyo check-in dataset, covering spatial, temporal, and multi-dataset reasoning. The agent achieves substantially higher accuracy than the naive baseline (91.4% vs. 28.6%) and enhances usability through maps, plots, and structured natural-language summaries. Crucially, our design enables more natural human-database interaction, supporting users who lack SQL expertise, detailed schema knowledge, or prompting skill. We conclude that agentic orchestration, rather than stronger SQL generators alone, is a promising foundation for interactive geospatial assistants.
zh
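下面用 sqlite3 给出这种"检查模式、生成 SQL、执行、失败后自适应改写"循环的一个假设性骨架:其中 llm() 只是占位函数,实际系统应调用 SQL 生成模型(如 llama-3-sqlcoder-8b),表结构与数据均为虚构。

```python
import sqlite3

# 构造一个极小的内存数据库,模拟签到数据(仅作演示)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkins (venue TEXT, city TEXT, ts TEXT)")
conn.execute("INSERT INTO checkins VALUES ('cafe', 'NYC', '2012-04-03 18:00')")

def inspect_schema():
    rows = conn.execute("SELECT name, sql FROM sqlite_master WHERE type='table'")
    return "\n".join(sql for _, sql in rows)

def run_sql(query):
    try:
        return {"ok": True, "rows": conn.execute(query).fetchall()}
    except sqlite3.Error as e:
        return {"ok": False, "error": str(e)}

def llm(prompt):
    # 占位:实际应在此调用 SQL 生成模型
    return "SELECT city, COUNT(*) FROM checkins GROUP BY city"

def react_answer(question, max_steps=3):
    schema = inspect_schema()                                  # Thought:先检查模式
    for _ in range(max_steps):
        sql = llm(f"schema:\n{schema}\nquestion: {question}")  # Action:生成 SQL
        obs = run_sql(sql)                                     # Observation:执行
        if obs["ok"]:
            return sql, obs["rows"]
        schema += f"\n-- previous error: {obs['error']}"       # 把报错反馈给下一轮
    return sql, None

print(react_answer("Which city has the most check-ins?"))
```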
[AI-68] WaveVerif: Acoustic Side-Channel based Verification of Robotic Workflows
【速读】:该论文旨在解决机器人在执行任务过程中行为一致性验证的问题,尤其是在无需硬件改造的条件下实现对机器人操作的实时监控与验证。其解决方案的关键在于利用声学侧信道分析(Acoustic Side-Channel Analysis, ASCA)技术,通过捕捉机器人运动时产生的声学信号,并结合多种机器学习分类器(如支持向量机、深度神经网络、循环神经网络和卷积神经网络)构建工作流验证系统,从而判断实际行为是否与预期指令一致。实验表明,在基准条件下,单个动作识别准确率超过80%,且典型工作流(如抓取-放置和打包)也能以高置信度识别,证明了该方法在敏感机器人环境中具备实时、低成本、被动验证的可行性。
链接: https://arxiv.org/abs/2510.25960
作者: Zeynep Yasemin Erdogan,Shishir Nagaraja,Chuadhry Mujeeb Ahmed,Ryan Shah
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:  11 pages, 3 figures, Corresponding Author: Prof. Shishir Nagaraja ( this http URL @newcastle. this http URL )
Abstract:In this paper, we present a framework that uses acoustic side-channel analysis (ASCA) to monitor and verify whether a robot correctly executes its intended commands. We develop and evaluate a machine-learning-based workflow verification system that uses acoustic emissions generated by robotic movements. The system can determine whether real-time behavior is consistent with expected commands. The evaluation takes into account movement speed, direction, and microphone distance. The results show that individual robot movements can be validated with over 80% accuracy under baseline conditions using four different classifiers: Support Vector Machine (SVM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). Additionally, workflows such as pick-and-place and packing could be identified with similarly high confidence. Our findings demonstrate that acoustic signals can support real-time, low-cost, passive verification in sensitive robotic environments without requiring hardware modifications.
zh
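下面给出声学侧信道验证流水线的一个假设性最小示例(scikit-learn):以不同主频的合成正弦信号代替真实机器人录音,以对数频带能量代替实际系统可能使用的 MFCC 等更精细特征,仅演示"特征提取 + SVM 分类"的流程。

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def spectral_features(signal, n_bands=16):
    """把一段声学信号压缩成对数频带能量特征(简化版,实际系统可用 MFCC)。"""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    bands = np.array_split(spec, n_bands)
    return np.log1p([b.mean() for b in bands])

def fake_motion(freq, n=2048, sr=16000):
    """用带噪正弦模拟某种机器人动作的声学特征(仅作演示)。"""
    t = np.arange(n) / sr
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=n)

X = np.array([spectral_features(fake_motion(f)) for f in [200] * 50 + [400] * 50])
y = np.array([0] * 50 + [1] * 50)   # 0 = pick, 1 = place(示意标签)

clf = SVC(kernel="rbf")
print("5-fold acc:", cross_val_score(clf, X, y, cv=5).mean())
```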
[AI-69] Application and Validation of Geospatial Foundation Model Data for the Prediction of Health Facility Programmatic Outputs – A Case Study in Malawi
【速读】:该论文旨在解决低收入和中等收入国家(LMICs)常规健康数据可靠性不足的问题,主要受限于报告延迟和覆盖不全。为提升健康指标预测的准确性,研究提出利用地理空间基础模型(Geospatial Foundation Models, GeoFMs)整合多源异构数据(如人口动态、卫星影像与手机通话详单)生成数学嵌入(embeddings),并通过XGBoost模型进行下游预测任务。其关键解决方案在于:通过融合三种GeoFM嵌入源(Google Population Dynamics Foundation Model、Google AlphaEarth及移动电话呼叫详细记录),构建多源GeoFM集成模型,在马拉维552个卫生服务覆盖区的数据上显著优于传统地统计插值方法,尤其在人口密度、新发HIV病例和儿童疫苗接种等指标上表现出较高的预测性能(平均交叉验证R²达0.47–0.63)。
链接: https://arxiv.org/abs/2510.25954
作者: Lynn Metz,Rachel Haggard,Michael Moszczynski,Samer Asbah,Chris Mwase,Patricia Khomani,Tyler Smith,Hannah Cooper,Annie Mwale,Arbaaz Muslim,Gautam Prasad,Mimi Sun,Tomer Shekel,Joydeep Paul,Anna Carter,Shravya Shetty,Dylan Green
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  13 pages, 3010 words, 2 tables, 2 figures
Abstract:The reliability of routine health data in low and middle-income countries (LMICs) is often constrained by reporting delays and incomplete coverage, necessitating the exploration of novel data sources and analytics. Geospatial Foundation Models (GeoFMs) offer a promising avenue by synthesizing diverse spatial, temporal, and behavioral data into mathematical embeddings that can be efficiently used for downstream prediction tasks. This study evaluated the predictive performance of three GeoFM embedding sources - Google Population Dynamics Foundation Model (PDFM), Google AlphaEarth (derived from satellite imagery), and mobile phone call detail records (CDR) - for modeling 15 routine health programmatic outputs in Malawi, and compared their utility to traditional geospatial interpolation methods. We used XGBoost models on data from 552 health catchment areas (January 2021-May 2023), assessing performance with R2, and using an 80/20 training and test data split with 5-fold cross-validation used in training. While predictive performance was mixed, the embedding-based approaches improved upon baseline geostatistical methods in 13 of 15 (87%) indicators tested. A Multi-GeoFM model integrating all three embedding sources produced the most robust predictions, achieving average 5-fold cross validated R2 values for indicators like population density (0.63), new HIV cases (0.57), and child vaccinations (0.47) and test set R2 of 0.64, 0.68, and 0.55, respectively. Prediction was poor for prediction targets with low primary data availability, such as TB and malnutrition cases. These results demonstrate that GeoFM embeddings imbue a modest predictive improvement for select health and demographic outcomes in an LMIC context. We conclude that the integration of multiple GeoFM sources is an efficient and valuable tool for supplementing and strengthening constrained routine health information systems.
zh
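下面以随机向量代替真实 GeoFM 嵌入,给出"多源嵌入拼接 + XGBoost + 80/20 划分与训练集 5 折交叉验证"这一评估流程的假设性草图(需安装 xgboost 包;嵌入与目标变量均为合成数据,维度亦为虚构)。

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n = 552   # 对应论文中的 552 个卫生服务覆盖区

# 三路嵌入源(PDFM / AlphaEarth / CDR)在此用随机向量代替,仅演示拼接流程
emb_pdfm = rng.normal(size=(n, 32))
emb_alphaearth = rng.normal(size=(n, 32))
emb_cdr = rng.normal(size=(n, 16))
X = np.hstack([emb_pdfm, emb_alphaearth, emb_cdr])   # Multi-GeoFM 特征
y = X[:, :5].sum(axis=1) + rng.normal(0, 1, size=n)  # 合成目标(示意)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.05)

cv_r2 = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2").mean()
model.fit(X_tr, y_tr)
print(f"5-fold CV R2 = {cv_r2:.2f}, test R2 = {model.score(X_te, y_te):.2f}")
```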
[AI-70] Estimating cognitive biases with attention-aware inverse planning
【速读】:该论文旨在解决如何从个体的行为中推断其注意力偏置(attentional biases)的问题,这是理解人类目标导向行为的关键,尤其在人机交互场景中具有重要意义。传统逆强化学习(inverse reinforcement learning, IRL)无法捕捉认知偏置对行为的影响,而本文提出“注意力感知的逆规划”(attention-aware inverse planning)问题,通过将认知建模与深度强化学习相结合,构建一个可扩展的框架来识别个体在真实驾驶场景中的注意力策略。解决方案的关键在于将计算认知模型嵌入到深度强化学习架构中,从而实现对行为背后隐含注意力偏置的系统性推断,且已在Waymo Open Dataset的真实驾驶数据上验证了其有效性。
链接: https://arxiv.org/abs/2510.25951
作者: Sounak Banerjee,Daphne Cornelisse,Deepak Gopinath,Emily Sumner,Jonathan DeCastro,Guy Rosman,Eugene Vinitsky,Mark K. Ho
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:People’s goal-directed behaviors are influenced by their cognitive biases, and autonomous systems that interact with people should be aware of this. For example, people’s attention to objects in their environment will be biased in a way that systematically affects how they perform everyday tasks such as driving to work. Here, building on recent work in computational cognitive science, we formally articulate the attention-aware inverse planning problem, in which the goal is to estimate a person’s attentional biases from their actions. We demonstrate how attention-aware inverse planning systematically differs from standard inverse reinforcement learning and how cognitive biases can be inferred from behavior. Finally, we present an approach to attention-aware inverse planning that combines deep reinforcement learning with computational cognitive modeling. We use this approach to infer the attentional strategies of RL agents in real-life driving scenarios selected from the Waymo Open Dataset, demonstrating the scalability of estimating cognitive biases with attention-aware inverse planning.
zh
[AI-71] A Process Mining-Based System For The Analysis and Prediction of Software Development Workflows
【速读】:该论文旨在解决软件开发流程中难以提前识别延期风险的问题,即如何在项目执行过程中主动预测代码提交请求(Pull Request, PR)是否能够按时完成,从而实现对项目进度的前瞻性管理。解决方案的关键在于构建一个端到端的系统 CodeSight,其核心包括两个层面:首先,通过从 GitHub 直接采集开发与部署数据并转化为过程挖掘(Process Mining)日志,提取出结构化的 PR 活动模式和工作流效率指标;其次,基于这些日志中的序列化活动轨迹和静态特征,利用长短期记忆网络(LSTM)模型预测 PR 的剩余处理时间,从而早期识别可能的截止日期违约。该方法融合了过程挖掘与机器学习技术,实现了对软件项目交付风险的精准预警。
链接: https://arxiv.org/abs/2510.25935
作者: Antía Dorado,Iván Folgueira,Sofía Martín,Gonzalo Martín,Álvaro Porto,Alejandro Ramos,John Wallace
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:  16 pages, 7 figures, 4 tables
Abstract:CodeSight is an end-to-end system designed to anticipate deadline compliance in software development workflows. It captures development and deployment data directly from GitHub, transforming it into process mining logs for detailed analysis. From these logs, the system generates metrics and dashboards that provide actionable insights into PR activity patterns and workflow efficiency. Building on this structured representation, CodeSight employs an LSTM model that predicts remaining PR resolution times based on sequential activity traces and static features, enabling early identification of potential deadline breaches. In tests, the system demonstrates high precision and F1 scores in predicting deadline compliance, illustrating the value of integrating process mining with machine learning for proactive software project management.
zh
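下面是"LSTM 编码 PR 活动序列 + 静态特征回归剩余时间"这一思路的假设性 PyTorch 草图(活动词表大小、特征维度与网络结构均为虚构,并非 CodeSight 的原始实现)。

```python
import torch
import torch.nn as nn

n_activities, emb, hid, static_dim = 20, 16, 32, 5

class RemainingTimeLSTM(nn.Module):
    """示意模型:LSTM 读入活动 id 序列,拼接静态特征后回归剩余处理时间。"""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_activities, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.head = nn.Linear(hid + static_dim, 1)

    def forward(self, acts, static):
        # acts: (B, T) 活动 id 序列;static: (B, static_dim) 如 PR 大小、评审人数等假设特征
        _, (h, _) = self.lstm(self.embed(acts))
        z = torch.cat([h[-1], static], dim=-1)
        return self.head(z).squeeze(-1)   # 预测剩余小时数(示意单位)

model = RemainingTimeLSTM()
pred = model(torch.randint(0, n_activities, (4, 12)), torch.randn(4, static_dim))
print(pred.shape)  # torch.Size([4])
```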
[AI-72] Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning
【速读】:该论文旨在解决小规模语言模型(Small Language Models, SLMs)在事实准确性(Factual Grounding)方面难以媲美大型前沿模型(如GPT-4o)的问题。其核心挑战在于如何在模型参数量显著减少的情况下,仍能实现与大模型相当的事实推理能力,并兼顾部署成本效率。解决方案的关键在于结合“最小化定向推理骨架”(minimal directed “Exoskeleton Reasoning” scaffolds)与行为微调(behavioral fine-tuning),后者重点训练模型遵守协议规范(epistemic discipline),而非直接学习领域答案。这种组合策略带来了显著性能提升(+17.7个百分点,p < 0.001)并降低方差约25%,使3.8B参数的Humans-Junior模型在FACTS Grounding公共子集上达到与GPT-4o等效的准确率(±5个百分点范围内),同时在云服务定价上约为GPT-4o的1/19,且支持自托管或边缘部署以趋近零边际推理成本。
链接: https://arxiv.org/abs/2510.25933
作者: Nissan Yaron,Dan Bystritsky,Ben-Etzion Yaron
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a ±5 pp equivalence margin. Results. On Q1–Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5–77.2) and Humans-Junior 72.7% (95% CI 68.7–76.5); the paired difference is 0.8 pp (bootstrap 95% CI −3.1 to +4.7; permutation p = 0.72; Cohen’s d = 0.023). TOST establishes equivalence at ±5 pp (not at ±3 pp). When purchased as managed APIs, Humans-Junior’s base model (Phi-3.5-mini-instruct) is ≈19× less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed “Exoskeleton Reasoning” scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, p < 0.001) and reduce variance (≈25%). In prompt-only settings on frontier models (Q1–Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, n = 100); see Section 5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within ±5 pp on Q1–Q500). Cloud pricing shows ≈19× lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1–Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
zh
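摘要中的配对差值 bootstrap 置信区间与 TOST 等价性判断可以用几行 NumPy 复现其思路。下面是一个假设性示意:数据为按摘要中比例模拟的 0/1 判定分数;严格的 TOST 应使用两个单侧检验(等价于 90% CI 落入等价边界),此处用 95% CI 做更保守的判断。

```python
import numpy as np

rng = np.random.default_rng(0)

# 模拟两模型在同一批题目上的配对 0/1 判定结果(示意数据,比例取自摘要)
a = rng.binomial(1, 0.735, size=500)   # 模型 A,如 GPT-4o
b = rng.binomial(1, 0.727, size=500)   # 模型 B,如被比较的小模型
diff = a - b

# 配对差值的 bootstrap 置信区间
boots = [rng.choice(diff, size=diff.size, replace=True).mean() for _ in range(10000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"paired diff = {diff.mean()*100:.1f} pp, 95% CI [{lo*100:.1f}, {hi*100:.1f}] pp")

# TOST 式等价判断:若整个置信区间落在 ±5 pp 的等价边界内,则判定等价
margin = 0.05
print("equivalent at ±5 pp:", (lo > -margin) and (hi < margin))
```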
[AI-73] Multi-Agent Reinforcement Learning for Market Making: Competition without Collusion
【速读】:该论文旨在解决算法合谋(algorithmic collusion)问题,即在市场中部署的多个智能体(AI agents)之间交互是否会导致隐性串通或市场失衡,进而影响市场效率与公平性。其核心挑战在于理解不同目标导向的智能体如何通过策略互动形成新型市场行为模式,例如垄断或共谋。解决方案的关键在于提出一种分层多智能体强化学习框架(hierarchical multi-agent reinforcement learning framework),其中包含一个自利型做市商(Agent A)和三个底层竞争智能体:纯自利型(B1)、对抗型(B2)及混合型(B⋆)。该框架通过设计交互层面指标(interaction-level metrics)量化行为不对称性和系统动态特性,从而识别潜在的 emergent interaction patterns,并揭示不同策略对市场结果的影响机制。实验表明,混合型智能体(B⋆)因其适应性报价策略可在维持自身优势的同时降低对其他参与者的负面冲击,展现出更可持续的战略共存能力,为算法交易系统的行为设计提供了结构化评估工具。
链接: https://arxiv.org/abs/2510.25929
作者: Ziyi Wang,Carmine Ventre,Maria Polukarov
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Algorithmic collusion has emerged as a central question in AI: Will the interaction between different AI agents deployed in markets lead to collusion? More generally, understanding how emergent behavior, be it a cartel or market dominance from more advanced bots, affects the market overall is an important research question. We propose a hierarchical multi-agent reinforcement learning framework to study algorithmic collusion in market making. The framework includes a self-interested market maker (Agent A), which is trained in an uncertain environment shaped by an adversary, and three bottom-layer competitors: the self-interested Agent B1 (whose objective is to maximize its own PnL), the competitive Agent B2 (whose objective is to minimize the PnL of its opponent), and the hybrid Agent B⋆, which can modulate between the behavior of the other two. To analyze how these agents shape the behavior of each other and affect market outcomes, we propose interaction-level metrics that quantify behavioral asymmetry and system-level dynamics, while providing signals potentially indicative of emergent interaction patterns. Experimental results show that Agent B2 secures dominant performance in a zero-sum setting against B1, aggressively capturing order flow while tightening average spreads, thus improving market execution efficiency. In contrast, Agent B⋆ exhibits a self-interested inclination when co-existing with other profit-seeking agents, securing dominant market share through adaptive quoting, yet exerting a milder adverse impact on the rewards of Agents A and B1 compared to B2. These findings suggest that adaptive incentive control supports more sustainable strategic co-existence in heterogeneous agent environments and offers a structured lens for evaluating behavioral design in algorithmic trading systems.
zh
[AI-74] Transferring Causal Effects using Proxies NEURIPS2025
【速读】:该论文旨在解决多领域(multi-domain)环境中因果效应估计的问题,其中感兴趣的因果效应受到未观测到的混杂因子(unobserved confounder)干扰,并且在不同领域间可能发生变化。解决方案的关键在于:假设可获得该隐藏混杂因子的一个代理变量(proxy variable),并基于此构建可识别性理论框架——即使处理变量和响应变量为连续型,仍能保证因果效应在目标域中的可识别性;进而提出两种估计方法,证明其一致性并推导置信区间,从而实现对目标域中因果效应的稳健估计。
链接: https://arxiv.org/abs/2510.25924
作者: Manuel Iglesias-Alonso,Felix Schur,Julius von Kügelgen,Jonas Peters
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:  Advances in Neural Information Processing Systems (NeurIPS 2025) camera-ready version
Abstract:We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.
zh
[AI-75] FinOps Agent – A Use-Case for IT Infrastructure and Cost Optimization
【速读】:该论文旨在解决FinOps(财务与运营融合)实践中面临的挑战:来自多个云服务商及内部系统的计费数据格式、分类体系和度量指标异构,导致难以快速整合并生成可操作的洞察以支持及时决策。解决方案的关键在于引入自主的、目标驱动的生成式AI代理(Generative AI Agent),通过模拟从多源数据获取、整合分析到生成优化建议的端到端工业流程,实现对IT基础设施成本优化任务的理解、规划与执行,实验证明该代理在多项指标上达到了与实际FinOps从业者相当的能力水平。
链接: https://arxiv.org/abs/2510.25914
作者: Ngoc Phuoc An Vo,Manish Kesarwani,Ruchi Mahindru,Chandrasekhar Narayanaswami
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:FinOps (Finance + Operations) represents an operational framework and cultural practice which maximizes cloud business value through collaborative financial accountability across engineering, finance, and business teams. FinOps practitioners face a fundamental challenge: billing data arrives in heterogeneous formats, taxonomies, and metrics from multiple cloud providers and internal systems which eventually lead to synthesizing actionable insights, and making time-sensitive decisions. To address this challenge, we propose leveraging autonomous, goal-driven AI agents for FinOps automation. In this paper, we built a FinOps agent for a typical use-case for IT infrastructure and cost optimization. We built a system simulating a realistic end-to-end industry process starting with retrieving data from various sources to consolidating and analyzing the data to generate recommendations for optimization. We defined a set of metrics to evaluate our agent using several open-source and close-source language models and it shows that the agent was able to understand, plan, and execute tasks as well as an actual FinOps practitioner.
zh
[AI-76] SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科学应用中可信度不足的问题,尤其是在高风险场景下其真实性、鲁棒性、安全性与伦理合规性存在显著缺陷。解决方案的关键在于提出并实现了一个名为SciTrust 2.0的综合性评估框架,该框架从四个维度系统评估LLM的可信度:真实性(truthfulness)、对抗鲁棒性(adversarial robustness)、科学安全性(scientific safety)和科学伦理(scientific ethics)。该框架引入了通过验证反射调优流程和专家验证构建的开放式真实性基准,以及覆盖双用途研究、偏见等八个子类别的科学伦理基准,并对七种主流LLM进行了多维量化评估,揭示出通用行业模型在多数维度上优于专用科学模型,尤其在逻辑推理与伦理判断方面,科学专用模型存在明显短板,从而为提升科学领域AI系统的可信度提供了可复用、可扩展的评估工具与改进方向。
链接: https://arxiv.org/abs/2510.25908
作者: Emily Herron,Junqi Yin,Feiyi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:  Preprint Submitted to ACM Transactions on AI for Science (TAIS)
Abstract:Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a novel ethics benchmark for scientific research contexts covering eight subcategories including dual-use research and bias. We evaluated seven prominent LLMs, including four science-specialized models and three general-purpose industry models, using multiple evaluation metrics including accuracy, semantic similarity measures, and LLM-based scoring. General-purpose industry models overall outperformed science-specialized models across each trustworthiness dimension, with GPT-o4-mini demonstrating superior performance in truthfulness assessments and adversarial robustness. Science-specialized models showed significant deficiencies in logical and ethical reasoning capabilities, along with concerning vulnerabilities in safety evaluations, particularly in high-risk domains such as biosecurity and chemical weapons. By open-sourcing our framework, we provide a foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts.
zh
[AI-77] PRISM: Proof-Carrying Artifact Generation through LLM x MDE Synergy and Stratified Constraints
【速读】:该论文旨在解决安全与合规关键领域中,生成式 AI (Generative AI) 产出的文档难以保证结构正确性、语义一致性及可验证性的问题。传统方法依赖人工校验,效率低且易出错,无法满足自动化、可审计的工程需求。解决方案的关键在于 PRISM 框架的三重创新:一是统一元模型(Unified Meta-Model, UMM)将异构数据模式和法规文本映射到统一语义空间;二是集成约束模型(Integrated Constraint Model, ICM)将结构与语义要求转化为生成时自动机(GBNF、DFA)和生成后验证器(如 SHACL、SMT);三是约束引导的可验证生成(Constraint-Guided Verifiable Generation, CVG),通过两层约束执行——结构约束驱动前缀安全解码,语义逻辑验证生成机器可检查证书,并在违规时进行审计引导修复与生成轨迹记录,从而实现监管就绪(regulator-ready)且可追溯的自动化生成流程。
链接: https://arxiv.org/abs/2510.25890
作者: Tong Ma,Hui Lai,Hui Wang,Zhenhu Tian,Jizhou Wang,Haichao Wu,Yongfan Gao,Chaochao Li,Fengjie Xu,Ling Fang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:  45 pages, 9 figures
Abstract:PRISM unifies Large Language Models with Model-Driven Engineering to generate regulator-ready artifacts and machine-checkable evidence for safety- and compliance-critical domains. PRISM integrates three pillars: a Unified Meta-Model (UMM) reconciles heterogeneous schemas and regulatory text into a single semantic space; an Integrated Constraint Model (ICM) compiles structural and semantic requirements into enforcement artifacts including generation-time automata (GBNF, DFA) and post-generation validators (e.g., SHACL, SMT); and Constraint-Guided Verifiable Generation (CVG) applies these through two-layer enforcement - structural constraints drive prefix-safe decoding while semantic/logical validation produces machine-checkable certificates. When violations occur, PRISM performs audit-guided repair and records generation traces for compliance review. We evaluate PRISM in automotive software engineering (AUTOSAR) and cross-border legal jurisdiction (Brussels I bis). PRISM produces structurally valid, auditable artifacts that integrate with existing tooling and substantially reduce manual remediation effort, providing a practical path toward automated artifact generation with built-in assurance.
zh
[AI-78] The Information-Theoretic Imperative: Compression and the Epistemic Foundations of Intelligence
【速读】:该论文试图解决的问题是:现有智能框架虽普遍认为压缩(compression)在智能形成中具有核心作用,但未能明确解释为何这一过程会强制发现因果结构(causal structure),而非仅捕捉表层的统计相关性(superficial statistical patterns)。其解决方案的关键在于提出一个两层理论框架——信息论必要性(Information-Theoretic Imperative, ITI)与压缩效率原理(Compression Efficiency Principle, CEP)。ITI从演化角度阐明,在不确定环境中持续存在的系统必须通过预测压缩来最小化认知熵(epistemic entropy),从而将生存压力与信息处理需求联系起来;CEP则进一步指出,高效的压缩机制通过异常积累(exception-accumulation)动态自然选择生成式、因果模型,使现实对齐(reality alignment)成为必然结果而非偶然达成。二者共同构建了一条因果链条:从生存压力到预测必要性、压缩要求、效率优化、生成结构发现,最终导向现实对齐,揭示了智能是在结构化环境中持久存在时的机械必然产物。
链接: https://arxiv.org/abs/2510.25883
作者: Christian Dittrich,Jennifer Flygare Kinne
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:  41 pages, 2 tables, 3 appendices. Submitted to arXiv for open access
Abstract:Existing frameworks converge on the centrality of compression to intelligence but leave underspecified why this process enforces the discovery of causal structure rather than superficial statistical patterns. We introduce a two-level framework to address this gap. The Information-Theoretic Imperative (ITI) establishes that any system persisting in uncertain environments must minimize epistemic entropy through predictive compression: this is the evolutionary “why” linking survival pressure to information-processing demands. The Compression Efficiency Principle (CEP) specifies how efficient compression mechanically selects for generative, causal models through exception-accumulation dynamics, making reality alignment a consequence rather than a contingent achievement. Together, ITI and CEP define a causal chain: from survival pressure to prediction necessity, compression requirement, efficiency optimization, generative structure discovery, and ultimately reality alignment. Each link follows from physical, information-theoretic, or evolutionary constraints, implying that intelligence is the mechanically necessary outcome of persistence in structured environments. This framework yields empirically testable predictions: compression efficiency, measured as approach to the rate-distortion frontier, correlates with out-of-distribution generalization; exception-accumulation rates differentiate causal from correlational models; hierarchical systems exhibit increasing efficiency across abstraction layers; and biological systems demonstrate metabolic costs that track representational complexity. ITI and CEP thereby provide a unified account of convergence across biological, artificial, and multi-scale systems, addressing the epistemic and functional dimensions of intelligence without invoking assumptions about consciousness or subjective experience.
zh
[AI-79] AAGATE: A NIST AI RMF-Aligned Governance Platform for Agentic AI
【速读】:该论文旨在解决自主语言模型驱动的智能体(Agentic AI)在生产环境中所面临的独特安全与治理挑战,尤其针对传统应用安全(AppSec)工具难以应对快速决策、自主执行的机器级系统的问题。其核心解决方案是提出一个原生支持 Kubernetes 的控制平面——AAGATE,该框架基于 NIST AI 风险管理框架(AI RMF)进行落地实施,并为每个功能模块集成专用安全框架:使用 MAESTRO 框架进行威胁建模(Map),结合 OWASP AIVSS 与 SEI SSVC 构建度量体系(Measure),并采用 Cloud Security Alliance 的红队指南实现风险管控(Manage)。关键创新在于引入零信任服务网格、可解释策略引擎、行为分析机制及去中心化问责钩子,形成持续可验证的治理能力,同时通过 DIRF(数字身份权利)、LPCI(逻辑层注入防御)和 QSAF(认知退化监控)扩展覆盖系统性、对抗性和伦理风险,从而实现安全、可问责且可扩展的 agentic AI 部署。
链接: https://arxiv.org/abs/2510.25863
作者: Ken Huang,Jerry Huang,Yasir Mehmood,Hammad Atta,Muhammad Zeeshan Baig,Muhammad Aziz Ul Haq
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:This paper introduces the Agentic AI Governance Assurance &amp; Trust Engine (AAGATE), a Kubernetes-native control plane designed to address the unique security and governance challenges posed by autonomous, language-model-driven agents in production. Recognizing the limitations of traditional Application Security (AppSec) tooling for improvisational, machine-speed systems, AAGATE operationalizes the NIST AI Risk Management Framework (AI RMF). It integrates specialized security frameworks for each RMF function: the Agentic AI Threat Modeling MAESTRO framework for Map, a hybrid of OWASP’s AIVSS and SEI’s SSVC for Measure, and the Cloud Security Alliance’s Agentic AI Red Teaming Guide for Manage. By incorporating a zero-trust service mesh, an explainable policy engine, behavioral analytics, and decentralized accountability hooks, AAGATE provides a continuous, verifiable governance solution for agentic AI, enabling safe, accountable, and scalable deployment. The framework is further extended with DIRF for digital identity rights, LPCI defenses for logic-layer injection, and QSAF monitors for cognitive degradation, ensuring governance spans systemic, adversarial, and ethical risks.
zh
[AI-80] Symbolically Scaffolded Play: Designing Role-Sensitive Prompts for Generative NPC Dialogue
【速读】:该论文旨在解决生成式 AI(Generative AI)在互动游戏中的应用问题,即通过大语言模型(Large Language Models, LLMs)驱动非玩家角色(NPC)实现自然对话时,约束性提示(prompt constraints)是否真正提升玩家体验。现有研究未明确验证高约束提示(High-Constraint Prompt, HCP)与低约束提示(Low-Constraint Prompt, LCP)对用户体验的差异。研究通过一个基于 GPT-4o 的语音侦探游戏《The Interview》进行对照实验,发现两种提示方式在体验层面无显著差异,仅在技术故障敏感度上存在差别。为优化这一问题,作者提出一种混合式结构——JSON+检索增强生成(RAG)的引导框架,并进一步引入“符号化引导游戏”(Symbolically Scaffolded Play)新范式:该范式以模糊符号边界(fuzzy-symbolic scaffolding)替代硬性约束,在关键角色(如任务发布者NPC)处强化稳定性,同时保留其他角色(如嫌疑人NPC)的即兴表现力,从而实现“稳定中保持惊喜”的动态平衡。其核心创新在于将约束从统一刚性控制转化为角色感知的差异化数值边界,使生成内容既保真又具沉浸感。
链接: https://arxiv.org/abs/2510.25820
作者: Vanessa Figueiredo,David Elumeze
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Large Language Models (LLMs) promise to transform interactive games by enabling non-player characters (NPCs) to sustain unscripted dialogue. Yet it remains unclear whether constrained prompts actually improve player experience. We investigate this question through The Interview, a voice-based detective game powered by GPT-4o. A within-subjects usability study (N=10) compared high-constraint (HCP) and low-constraint (LCP) prompts, revealing no reliable experiential differences beyond sensitivity to technical breakdowns. Guided by these findings, we redesigned the HCP into a hybrid JSON+RAG scaffold and conducted a synthetic evaluation with an LLM judge, positioned as an early-stage complement to usability testing. Results uncovered a novel pattern: scaffolding effects were role-dependent. The Interviewer (quest-giver NPC) gained stability, while suspect NPCs lost improvisational believability. These findings overturn the assumption that tighter constraints inherently enhance play. Extending fuzzy-symbolic scaffolding, we introduce Symbolically Scaffolded Play, a framework in which symbolic structures are expressed as fuzzy, numerical boundaries that stabilize coherence where needed while preserving improvisation where surprise sustains engagement.
zh
[AI-81] Identity Management for Agentic AI: The new frontier of authorization, authentication, and security for an AI agent world
【速读】:该论文旨在解决人工智能代理(AI agents)在认证(authentication)、授权(authorization)和身份管理(identity management)方面面临的紧迫挑战,尤其是在当前代理中心化协议(如MCP)日益普及的背景下,亟需明确最佳实践。其解决方案的关键在于:一方面梳理现有可用于保护当前AI代理的安全资源,另一方面提出一项战略议程,聚焦于未来广泛自主系统所依赖的基础性认证、授权与身份问题,涵盖可扩展的访问控制、代理中心化身份、AI工作负载差异化及委托权限等长期议题。
链接: https://arxiv.org/abs/2510.25819
作者: Tobin South,Subramanya Nagabhushanaradhya,Ayesha Dissanayaka,Sarah Cecchetti,George Fletcher,Victor Lu,Aldo Pietropaolo,Dean H. Saxe,Jeff Lombardo,Abhishek Maligehalli Shivalingaiah,Stan Bounev,Alex Keisner,Andor Kesselman,Zack Proser,Ginny Fahs,Andrew Bunyea,Ben Moskowitz,Atul Tulshibagwale,Dazza Greenwood,Jiaxin Pei,Alex Pentland
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:The rapid rise of AI agents presents urgent challenges in authentication, authorization, and identity management. Current agent-centric protocols (like MCP) highlight the demand for clarified best practices in authentication and authorization. Looking ahead, ambitions for highly autonomous agents raise complex long-term questions regarding scalable access control, agent-centric identities, AI workload differentiation, and delegated authority. This OpenID Foundation whitepaper is for stakeholders at the intersection of AI agents and access management. It outlines the resources already available for securing today’s agents and presents a strategic agenda to address the foundational authentication, authorization, and identity problems pivotal for tomorrow’s widespread autonomous systems.
zh
[AI-82] ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion NEURIPS2025
【速读】:该论文旨在解决文本到图像扩散模型在生成超出其训练分辨率图像时性能下降的问题。现有无训练方法虽能缓解此问题,但常面临计算开销大或与Diffusion Transformer架构不兼容的挑战。解决方案的关键在于提出一种模型无关且高效的框架ScaleDiff,其核心创新包括:1)邻域块注意力(Neighborhood Patch Attention, NPA),通过非重叠块机制减少自注意力层中的计算冗余;2)潜空间频率混合(Latent Frequency Mixing, LFM),提升细节生成质量;3)结构引导机制,在去噪过程中增强全局结构一致性。该框架无需额外训练即可显著提升图像质量和推理效率,在U-Net与Diffusion Transformer架构上均达到当前最优训练-free效果。
链接: https://arxiv.org/abs/2510.25818
作者: Sungho Koh,SeungJu Cha,Hyunwoo Oh,Kwanyoung Lee,Dong-Jin Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  NeurIPS 2025. Code: this https URL
Abstract:Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
zh
[AI-83] An Agentic Framework for Rapid Deployment of Edge AI Solutions in Industry 5.0
【速读】:该论文旨在解决工业5.0场景下AI模型在边缘设备部署时面临的延迟高和外部数据传输依赖的问题,其解决方案的关键在于提出一种基于代理(agent-based)的框架,通过本地推理与实时处理降低延迟并避免数据外传;同时,该框架支持模块化集成且资源消耗低,从而提升系统在真实工业环境(如食品行业)中的部署效率与适应性。
链接: https://arxiv.org/abs/2510.25813
作者: Jorge Martinez-Gil,Mario Pichler,Nefeli Bountouni,Sotiris Koussouris,Marielena Márquez Barreiro,Sergio Gusmeroli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present a novel framework for Industry 5.0 that simplifies the deployment of AI models on edge devices in various industrial settings. The design reduces latency and avoids external data transfer by enabling local inference and real-time processing. Our implementation is agent-based, which means that individual agents, whether human, algorithmic, or collaborative, are responsible for well-defined tasks, enabling flexibility and simplifying integration. Moreover, our framework supports modular integration and maintains low resource requirements. Preliminary evaluations concerning the food industry in real scenarios indicate improved deployment time and system adaptability performance. The source code is publicly available at this https URL.
zh
[AI-84] Non-myopic Matching and Rebalancing in Large-Scale On-Demand Ride-Pooling Systems Using Simulation-Informed Reinforcement Learning
【速读】:该论文旨在解决共享出行(ride-pooling)系统中因调度决策具有短视性(myopic decision-making)而导致的长期效率低下问题,即当前决策未考虑对未来交通状态、乘客等待时间及车辆利用率等指标的影响。其解决方案的关键在于提出一种基于仿真驱动的强化学习(reinforcement learning, RL)方法,通过在学习机制中嵌入共享出行仿真模块,使决策过程具备非短视(non-myopic)特性;同时引入互补的空驶车辆再平衡策略,从而实现对匹配和调度两个环节的协同优化。实验结果表明,该方案可显著提升服务率、降低乘客等待与乘车时间,并减少车队规模达25%以上,同时通过整合再平衡操作进一步改善运营指标。
链接: https://arxiv.org/abs/2510.25796
作者: Farnoosh Namdarpour,Joseph Y. J. Chow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Ride-pooling, also known as ride-sharing, shared ride-hailing, or microtransit, is a service wherein passengers share rides. This service can reduce costs for both passengers and operators and reduce congestion and environmental impacts. A key limitation, however, is its myopic decision-making, which overlooks long-term effects of dispatch decisions. To address this, we propose a simulation-informed reinforcement learning (RL) approach. While RL has been widely studied in the context of ride-hailing systems, its application in ride-pooling systems has been less explored. In this study, we extend the learning and planning framework of Xu et al. (2018) from ride-hailing to ride-pooling by embedding a ride-pooling simulation within the learning mechanism to enable non-myopic decision-making. In addition, we propose a complementary policy for rebalancing idle vehicles. By employing n-step temporal difference learning on simulated experiences, we derive spatiotemporal state values and subsequently evaluate the effectiveness of the non-myopic policy using NYC taxi request data. Results demonstrate that the non-myopic policy for matching can increase the service rate by up to 8.4% versus a myopic policy while reducing both in-vehicle and wait times for passengers. Furthermore, the proposed non-myopic policy can decrease fleet size by over 25% compared to a myopic policy, while maintaining the same level of performance, thereby offering significant cost savings for operators. Incorporating rebalancing operations into the proposed framework cuts wait time by up to 27.3%, in-vehicle time by 12.5%, and raises service rate by 15.1% compared to using the framework for matching decisions alone at the cost of increased vehicle minutes traveled per passenger.
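下面用纯 Python 勾勒摘要所述"在模拟经验上做 n 步时序差分(TD)学习时空状态价值"的基本形式,状态以(区域, 时间段)二元组表示;状态、奖励与全部超参数均为示意性假设,并非论文实现。

```python
import random

def n_step_td(episodes, n=3, gamma=0.95, alpha=0.1):
    """episodes: 模拟器产出的轨迹列表,每条为 [(state, reward), ...]。
    返回时空状态价值表 V[state]。"""
    V = {}
    for traj in episodes:
        T = len(traj)
        for t in range(T):
            # n 步回报:折扣奖励之和,加上第 t+n 步状态的自举价值
            G, end = 0.0, min(t + n, T)
            for k in range(t, end):
                G += gamma ** (k - t) * traj[k][1]
            if end < T:
                G += gamma ** n * V.get(traj[end][0], 0.0)
            s = traj[t][0]
            V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))
    return V

# 状态为 (区域, 时间段),奖励以随机数代替真实的服务指标
episodes = [[((z, t), random.random()) for t in range(10)] for z in range(3)]
print(sorted(n_step_td(episodes).items())[:2])
```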
zh
[AI-85] The Kinetics of Reasoning: How Chain-of-Thought Shapes Learning in Transformers?
【速读】:该论文旨在解决生成式 AI(Generative AI)模型在链式思维(Chain-of-thought, CoT)监督下学习机制不明确的问题,特别是其如何通过CoT提升性能以及这种提升与任务复杂度、数据分布和训练动态之间的关系。解决方案的关键在于:首先,通过预训练Transformer模型于具有可调算法复杂度的符号推理任务,并控制数据组成,系统性地比较仅输出最终答案与同时生成显式CoT推理轨迹两种训练设置;其次,引入基于三参数逻辑曲线的精度建模方法量化训练步数对准确率的影响,揭示CoT如何改变学习速度与曲线形态;最后,发现并表征了一个“推理轨迹不一致”(trace unfaithfulness)的瞬态阶段——即早期训练中模型能正确作答但推理步骤错误或缺失,随后逐步使推理轨迹与答案对齐,从而证明CoT不仅加速泛化,还实质性地改变了Transformer内部计算机制。
链接: https://arxiv.org/abs/2510.25791
作者: Zihan Pengmei,Costas Mavromatis,Zhengyuan Shen,Yunyi Zhang,Vassilis N. Ioannidis,Huzefa Rangwala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:  10 pages, 7 figures, with appendix
Abstract:Chain-of-thought (CoT) supervision can substantially improve transformer performance, yet the mechanisms by which models learn to follow and benefit from CoT remain poorly understood. We investigate these learning dynamics through the lens of grokking by pretraining transformers on symbolic reasoning tasks with tunable algorithmic complexity and controllable data composition to study their generalization. Models were trained under two settings: (i) producing only final answers, and (ii) emitting explicit CoT traces before answering. Our results show that while CoT generally improves task performance, its benefits depend on task complexity. To quantify these effects, we model accuracy as a three-parameter logistic function of the logarithm of training steps, revealing how the learning speed and shape vary with task complexity, data distribution, and the presence of CoT supervision. We also uncover a transient trace unfaithfulness phase: early in training, models often produce correct answers while skipping or contradicting CoT steps, before later aligning their reasoning traces with answers. Empirically, we (1) demonstrate that CoT accelerates generalization but does not overcome tasks with higher algorithmic complexity, such as finding list intersections; (2) introduce a kinetic modeling framework for understanding transformer learning; (3) characterize trace faithfulness as a dynamic property that emerges over training; and (4) show CoT alters internal transformer computation mechanistically.
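摘要提到用三参数逻辑曲线刻画准确率随(对数)训练步数的变化,下面给出一个基于 SciPy 的最小拟合示意;参数记号(A、k、x0)与合成数据均为我们的假设,仅用于说明这种"动力学"拟合的做法。

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic3(log_step, A, k, x0):
    """A: 渐近准确率, k: 学习速率, x0: 半饱和点(均为假设性记号)。"""
    return A / (1.0 + np.exp(-k * (log_step - x0)))

steps = np.logspace(2, 5, 30)
acc = 0.9 / (1 + np.exp(-2.0 * (np.log10(steps) - 3.5)))   # 合成训练曲线
acc += np.random.default_rng(0).normal(0, 0.01, acc.shape)

params, _ = curve_fit(logistic3, np.log10(steps), acc, p0=[1.0, 1.0, 3.0])
print("A=%.3f  k=%.3f  x0=%.3f" % tuple(params))
```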
zh
[AI-86] Unsupervised local learning based on voltage-dependent synaptic plasticity for resistive and ferroelectric synapses
【速读】:该论文旨在解决在边缘计算设备上部署人工智能(AI)时面临的能耗高与功能受限问题,提出通过类脑学习机制实现低功耗、实时自适应的AI运算。其解决方案的关键在于引入电压依赖型突触可塑性(voltage-dependent synaptic plasticity, VDSP),这是一种基于赫布原理(Hebbian principles)的高效无监督局部学习方法,无需传统脉冲时间依赖可塑性(spike-timing-dependent plasticity, STDP)所需的复杂脉冲整形电路即可实现在线学习。VDSP被成功适配于三种不同特性的忆阻器件(TiO₂、HfO₂基金属氧化物丝状突触和HfZrO₄基铁电隧道结),并通过含这些器件的脉冲神经网络系统级仿真,在MNIST模式识别任务中实现了超过83%的准确率,同时验证了对器件变异性(如开关阈值和高低阻态比)的鲁棒性提升策略的有效性。
链接: https://arxiv.org/abs/2510.25787
作者: Nikhil Garg,Ismael Balafrej,Joao Henrique Quintino Palhares,Laura Bégon-Lours,Davide Florini,Donato Francesco Falcone,Tommaso Stecconi,Valeria Bragaglia,Bert Jan Offrein,Jean-Michel Portal,Damien Querlioz,Yann Beilliard,Dominique Drouin,Fabien Alibart
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:
Abstract:The deployment of AI on edge computing devices faces significant challenges related to energy consumption and functionality. These devices could greatly benefit from brain-inspired learning mechanisms, allowing for real-time adaptation while using low power. In-memory computing with nanoscale resistive memories may play a crucial role in enabling the execution of AI workloads on these edge devices. In this study, we introduce voltage-dependent synaptic plasticity (VDSP) as an efficient approach for unsupervised and local learning in memristive synapses based on Hebbian principles. This method enables online learning without requiring complex pulse-shaping circuits typically necessary for spike-timing-dependent plasticity (STDP). We show how VDSP can be advantageously adapted to three types of memristive devices (TiO_2, HfO_2-based metal-oxide filamentary synapses, and HfZrO_4-based ferroelectric tunnel junctions (FTJ)) with distinctive switching characteristics. System-level simulations of spiking neural networks incorporating these devices were conducted to validate unsupervised learning on MNIST-based pattern recognition tasks, achieving state-of-the-art performance. The results demonstrated over 83% accuracy across all devices using 200 neurons. Additionally, we assessed the impact of device variability, such as switching thresholds and ratios between high and low resistance state levels, and proposed mitigation strategies to enhance robustness.
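下面是 VDSP 式局部更新规则的概念性示意:突触前脉冲到达时,权重变化的方向与幅度由突触后膜电位决定,无需 STDP 那样的脉冲整形;阈值、学习率与电压范围均为假设值,并非论文中按器件特性拟合的参数。

```python
import numpy as np

def vdsp_update(w, v_post, pre_spike, v_thresh=0.5, lr=0.01,
                w_min=0.0, w_max=1.0):
    """w: 突触权重; v_post: 突触后膜电位(此处假设已归一化到 [0,1]);
    pre_spike: 本时间步突触前脉冲的 0/1 掩码。"""
    dw = np.where(v_post > v_thresh,
                  lr * (v_post - v_thresh) * (w_max - w),   # 增强(potentiation)
                  -lr * (v_thresh - v_post) * (w - w_min))  # 抑制(depression)
    return np.clip(w + pre_spike * dw, w_min, w_max)

rng = np.random.default_rng(1)
w = rng.uniform(0, 1, 5)
print(vdsp_update(w, rng.uniform(0, 1, 5), np.ones(5)))
```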
zh
[AI-87] HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series
【速读】:该论文旨在解决可穿戴传感器中生理时间序列数据的预测能力受限于时间分辨率建模不清晰的问题,即不同临床和行为结果可能依赖于不同时间尺度上的结构特征,而现有方法往往忽略多尺度信息。其解决方案的关键在于提出HiMAE(Hierarchical Masked Autoencoder),一种结合掩码自编码与分层卷积编码器-解码器结构的自监督学习框架,能够生成多分辨率嵌入表示,从而系统性地评估哪些时间尺度包含预测信号,并将时间分辨率从超参数转化为可解释性的探针。该方法在分类、回归和生成基准任务中均优于当前主流基础模型,且模型体积小至可在智能手表端实现毫秒级推理,具备边缘计算部署能力。
链接: https://arxiv.org/abs/2510.25785
作者: Simon A. Lee,Cyrus Tanade,Hao Zhou,Juhyeon Lee,Megha Thukral,Minji Han,Rachel Choi,Md Sazzad Hissain Khan,Baiying Lu,Migyeong Gwak,Mehrab Bin Morshed,Viswam Nathan,Md Mahbubur Rahman,Li Zhu,Subramaniam Venkatraman,Sharanya Arcot Desai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Wearable sensors provide abundant physiological time series, yet the principles governing their predictive utility remain unclear. We hypothesize that temporal resolution is a fundamental axis of representation learning, with different clinical and behavioral outcomes relying on structure at distinct scales. To test this resolution hypothesis, we introduce HiMAE (Hierarchical Masked Autoencoder), a self-supervised framework that combines masked autoencoding with a hierarchical convolutional encoder-decoder. HiMAE produces multi-resolution embeddings that enable systematic evaluation of which temporal scales carry predictive signal, transforming resolution from a hyperparameter into a probe for interpretability. Across classification, regression, and generative benchmarks, HiMAE consistently outperforms state-of-the-art foundation models that collapse scale, while being orders of magnitude smaller. HiMAE is an efficient representation learner, compact enough to run entirely on-watch, achieving sub-millisecond inference on smartwatch-class CPUs for true edge inference. Together, these contributions position HiMAE as both an efficient self-supervised learning method and a discovery tool for scale-sensitive structure in wearable health.
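下面用 PyTorch 勾勒"掩码自编码 + 层级卷积编码器-解码器"在一维可穿戴时间序列上的骨架:随机掩掉部分时间点,由逐级下采样的卷积编码器产生多分辨率嵌入,再重建被掩部分;通道数、掩码比例等均为示意性假设,并非 HiMAE 的真实配置。

```python
import torch
import torch.nn as nn

class TinyHierMAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(1, 8, 5, 2, 2), nn.GELU())   # 较细分辨率
        self.enc2 = nn.Sequential(nn.Conv1d(8, 16, 5, 2, 2), nn.GELU())  # 较粗分辨率
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(16, 8, 4, 2, 1), nn.GELU(),
            nn.ConvTranspose1d(8, 1, 4, 2, 1),
        )

    def forward(self, x, mask_ratio=0.5):
        # 按时间点随机掩码(真实实现通常按片段掩码)
        mask = (torch.rand_like(x) < mask_ratio).float()
        h1 = self.enc1(x * (1 - mask))   # 每级特征即一种时间分辨率的嵌入
        h2 = self.enc2(h1)
        recon = self.dec(h2)
        loss = ((recon - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)
        return loss, (h1, h2)

model = TinyHierMAE()
loss, (h1, h2) = model(torch.randn(4, 1, 128))
print(loss.item(), h1.shape, h2.shape)  # 两级多分辨率嵌入
```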
zh
[AI-88] A Practitioner's Guide to Kolmogorov-Arnold Networks
【速读】:该论文旨在解决传统多层感知机(Multilayer Perceptrons, MLPs)在表达能力和可解释性方面的局限性,提出以Kolmogorov-Arnold Networks (KANs) 作为更具参数效率和灵活性的替代方案。其核心解决方案在于将传统MLPs中固定激活函数作用于节点的机制,转变为在边(edges)上使用可学习的一元基函数(univariate basis functions),从而增强模型的表达能力与可解释性。论文指出,这种结构上的革新不仅在理论上与MLP等价,还通过优化基函数的选择(如B-splines、Chebyshev多项式、ReLU组合、高斯径向基函数等)实现了更高的参数效率,并系统梳理了提升精度、效率与正则化的关键技术路径,包括物理信息损失设计、自适应采样、域分解及混合架构等策略,为实际应用提供了清晰的实践指南。
链接: https://arxiv.org/abs/2510.25781
作者: Amir Noorizadegan,Sifan Wang,Leevan Ling
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
备注:
Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional Multilayer Perceptrons (MLPs), inspired by the Kolmogorov-Arnold representation theorem. Unlike MLPs, which use fixed activation functions on nodes, KANs employ learnable univariate basis functions on edges, offering enhanced expressivity and interpretability. This review provides a systematic and comprehensive overview of the rapidly expanding KAN landscape, moving beyond simple performance comparisons to offer a structured synthesis of theoretical foundations, architectural variants, and practical implementation strategies. By collecting and categorizing a vast array of open-source implementations, we map the vibrant ecosystem supporting KAN development. We begin by bridging the conceptual gap between KANs and MLPs, establishing their formal equivalence and highlighting the superior parameter efficiency of the KAN formulation. A central theme of our review is the critical role of the basis function; we survey a wide array of choices, including B-splines, Chebyshev and Jacobi polynomials, ReLU compositions, Gaussian RBFs, and Fourier series, and analyze their respective trade-offs in terms of smoothness, locality, and computational cost. We then categorize recent advancements into a clear roadmap, covering techniques for improving accuracy, efficiency, and regularization. Key topics include physics-informed loss design, adaptive sampling, domain decomposition, hybrid architectures, and specialized methods for handling discontinuities. Finally, we provide a practical “Choose-Your-KAN” guide to help practitioners select appropriate architectures, and we conclude by identifying current research gaps. The associated GitHub repository this https URL complements this paper and serves as a structured reference for ongoing KAN research.
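作为综述所述"边上可学习一元基函数"的最小示例,下面用高斯 RBF 基(文中讨论的基函数选择之一)实现一个玩具 KAN 层;层结构、基函数个数与初始化均为示意性假设,并非某个具体开源实现。

```python
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    """每条 (输入 i -> 输出 o) 边学习一个一元函数:RBF 基的加权和。"""
    def __init__(self, d_in, d_out, n_basis=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1, 1, n_basis))
        self.log_gamma = nn.Parameter(torch.zeros(1))
        self.coef = nn.Parameter(torch.randn(d_in, d_out, n_basis) * 0.1)

    def forward(self, x):                                  # x: (B, d_in)
        gamma = self.log_gamma.exp()
        phi = torch.exp(-gamma * (x[..., None] - self.centers) ** 2)  # (B, d_in, n_basis)
        return torch.einsum("bik,iok->bo", phi, self.coef)

layer = RBFKANLayer(3, 2)
print(layer(torch.randn(4, 3)).shape)  # torch.Size([4, 2])
```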
zh
[AI-89] Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets
【速读】:该论文试图解决的问题是:随着大语言模型(Large Language Models, LLMs)代理在真实经济场景中日益承担用户决策中介角色(如产品发现与交易),现有研究多局限于受限环境(如单任务市场或结构化两代理交互),难以揭示代理在复杂、动态且多主体参与的真实市场环境中如何表现及其对用户价值和责任分配的影响。为填补这一空白,论文提出的关键解决方案是构建一个名为Magentic-Marketplace的模拟环境,用于安全地研究“Assistant代理”(代表消费者)与“Service代理”(代表竞争性企业)之间的双向代理市场互动。该环境支持探索关键市场动态,包括代理效用、行为偏差、易受操纵性及搜索机制对市场结果的影响,从而为设计公平高效、具备鲁棒性的下一代代理市场提供实证依据。
链接: https://arxiv.org/abs/2510.25779
作者: Gagan Bansal,Wenyue Hua,Zezhou Huang,Adam Fourney,Amanda Swearngin,Will Epperson,Tyler Payne,Jake M. Hofman,Brendan Lucier,Chinmay Singh,Markus Mobius,Akshay Nambi,Archana Yadav,Kevin Gao,David M. Rothschild,Aleksandrs Slivkins,Daniel G. Goldstein,Hussein Mozannar,Nicole Immorlica,Maya Murad,Matthew Vogel,Subbarao Kambhampati,Eric Horvitz,Saleema Amershi
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLM agents advance, they are increasingly mediating economic decisions, ranging from product discovery to transactions, on behalf of users. Such applications promise benefits but also raise many questions about agent accountability and value for users. Addressing these questions requires understanding how agents behave in realistic market conditions. However, previous research has largely evaluated agents in constrained settings, such as single-task marketplaces (e.g., negotiation) or structured two-agent interactions. Real-world markets are fundamentally different: they require agents to handle diverse economic activities and coordinate within large, dynamic ecosystems where multiple agents with opaque behaviors may engage in open-ended dialogues. To bridge this gap, we investigate two-sided agentic marketplaces where Assistant agents represent consumers and Service agents represent competing businesses. To study these interactions safely, we develop Magentic-Marketplace, a simulated environment where Assistants and Services can operate. This environment enables us to study key market dynamics: the utility agents achieve, behavioral biases, vulnerability to manipulation, and how search mechanisms shape market outcomes. Our experiments show that frontier models can approach optimal welfare, but only under ideal search conditions. Performance degrades sharply with scale, and all models exhibit severe first-proposal bias, creating 10-30x advantages for response speed over quality. These findings reveal how behaviors emerge across market conditions, informing the design of fair and efficient agentic marketplaces.
zh
[AI-90] Towards Piece-by-Piece Explanations for Chess Positions with SHAP
【速读】:该论文旨在解决当前国际象棋引擎评估结果缺乏可解释性的问题,即虽然引擎能提供精确的 centipawn(百分之一兵卒)分数用于决策,但无法清晰揭示各棋子或特定棋型对整体评估的具体贡献。解决方案的关键在于引入 SHAP(SHapley Additive exPlanations)方法,将棋盘上的每个棋子视为一个特征,并通过系统性地移除(ablation)各个棋子来计算其对引擎输出的加性贡献,从而以局部忠实且人类可理解的方式解释引擎的评估结果。这一方法结合了传统国际象棋教学中“移除棋子”进行局面分析的思维模式与现代可解释人工智能技术,为可视化、人机训练和引擎对比提供了新途径。
链接: https://arxiv.org/abs/2510.25775
作者: Francesco Spinnato
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Contemporary chess engines offer precise yet opaque evaluations, typically expressed as centipawn scores. While effective for decision-making, these outputs obscure the underlying contributions of individual pieces or patterns. In this paper, we explore adapting SHAP (SHapley Additive exPlanations) to the domain of chess analysis, aiming to attribute a chess engine’s evaluation to specific pieces on the board. By treating pieces as features and systematically ablating them, we compute additive, per-piece contributions that explain the engine’s output in a locally faithful and human-interpretable manner. This method draws inspiration from classical chess pedagogy, where players assess positions by mentally removing pieces, and grounds it in modern explainable AI techniques. Our approach opens new possibilities for visualization, human training, and engine comparison. We release accompanying code and data to foster future research in interpretable chess AI.
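论文的做法是逐一消融棋子并询问真实引擎,下面以一个线性子力评估函数代替引擎,演示基于排列采样的蒙特卡洛 Shapley 逐子归因;棋子编码与评估函数均为假设,仅体现"加性、逐子贡献"这一思想(线性评估下贡献恰等于子力价值,可用来自检)。

```python
import random

VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def evaluate(pieces):
    """玩具“引擎”:白方子力减黑方子力(以兵为单位),代替真实引擎评分。"""
    return sum(VALUES[p[1]] * (1 if p[0] == "w" else -1) for p in pieces)

def shapley_per_piece(pieces, n_perm=2000, seed=0):
    rng = random.Random(seed)
    contrib = {p: 0.0 for p in pieces}
    for _ in range(n_perm):
        order = pieces[:]
        rng.shuffle(order)
        on_board, prev = [], evaluate([])
        for p in order:                      # 依次“放回”棋子,记录边际贡献
            on_board.append(p)
            now = evaluate(on_board)
            contrib[p] += (now - prev) / n_perm
            prev = now
    return contrib

pieces = [("w", "Q"), ("w", "R"), ("b", "R"), ("b", "N")]
print(shapley_per_piece(pieces))  # 加性:各贡献之和等于整体评估
```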
zh
[AI-91] Hybrid LLM and Higher-Order Quantum Approximate Optimization for CSA Collateral Management
【速读】:该论文旨在解决在ISDA信用支持附件(CSA)框架下,金融原生抵押品优化问题,其挑战源于整数单位(integer lots)、Schedule A折扣率(haircuts)、RA/MTA准入限制(gating)、发行人/货币/类别限额(caps)等因素导致的复杂且法律约束严格的搜索空间。解决方案的关键在于提出一种可验证的混合流水线:首先通过证据门控大语言模型(evidence-gated LLM)从CSA文本中提取条款并结构化为带引用的标准化JSON;其次引入受量子启发的探索器,将模拟退火与高阶量子近似优化算法(HO-QAOA,子QUBO规模n=16、阶数k=4)结合,在绑定子问题上协调多资产跨限额移动;再者设计加权风险感知目标函数(包含移动成本、CVaR和资金定价超额),明确覆盖窗口U = R_eff + B;最后以CP-SAT作为单一仲裁器验证可行性及间隙,其中预检U-限额可报告最小可行缓冲B*。该方法通过将限额和舍入编码为高阶项,使HO-QAOA聚焦于破坏局部交换的域耦合,从而显著提升优化性能,在政府债券数据集和多CSA输入下相较经典基线(BL-3)提升9.1%至10.7%,实现更优的成本-移动-尾部前沿。
链接: https://arxiv.org/abs/2510.26217
作者: Tao Jin,Stuart Florescu,Heyu(Andrew)Jin
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:  6 pages
Abstract:We address finance-native collateral optimization under ISDA Credit Support Annexes (CSAs), where integer lots, Schedule A haircuts, RA/MTA gating, and issuer/currency/class caps create rugged, legally bounded search spaces. We introduce a certifiable hybrid pipeline purpose-built for this domain: (i) an evidence-gated LLM that extracts CSA terms to a normalized JSON (abstain-by-default, span-cited); (ii) a quantum-inspired explorer that interleaves simulated annealing with micro higher-order QAOA (HO-QAOA) on binding sub-QUBOs (subset size n = 16, order k = 4) to coordinate multi-asset moves across caps and RA-induced discreteness; (iii) a weighted risk-aware objective (Movement, CVaR, funding-priced overshoot) with an explicit coverage window U = R_eff + B; and (iv) CP-SAT as single arbiter to certify feasibility and gaps, including a U-cap pre-check that reports the minimal feasible buffer B*. Encoding caps/rounding as higher-order terms lets HO-QAOA target the domain couplings that defeat local swaps. On government bond datasets and multi-CSA inputs, the hybrid improves a strong classical baseline (BL-3) by 9.1%, 9.6%, and 10.7% across representative harnesses, delivering better cost-movement-tail frontiers under governance settings. We release governance-grade artifacts (span citations, valuation matrix audit, weight provenance, QUBO manifests, and CP-SAT traces) to make results auditable and reproducible.
zh
[AI-92] Learning to Manage Investment Portfolios beyond Simple Utility Functions
【速读】:该论文旨在解决传统投资组合管理建模中难以准确刻画基金经理复杂、多目标优化行为的问题,尤其是现有基于多目标效用函数的方法在设定和参数化方面面临根本性挑战。其解决方案的关键在于提出一种生成式框架,通过无监督学习方式从观测到的持仓数据与市场数据的联合分布中直接学习基金经理策略的潜在表示(latent representations),而无需显式指定效用函数或奖励机制。该框架采用基于生成对抗网络(GAN)的架构,建模给定股票特征、历史收益、前序权重及潜变量条件下基金组合权重的条件概率分布,从而捕捉如“成长型”和“价值型”等已知投资风格,并揭示隐含的管理目标,例如不同基金在交易成本、集中度和因子暴露上的异质性实现。
链接: https://arxiv.org/abs/2510.26165
作者: Maarten P. Scholl,Mahmoud Mahfouz,Anisoara Calinescu,J. Doyne Farmer
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:  6th ACM International Conference on AI in Finance, November 15-18, 2025, Singapore
Abstract:While investment funds publicly disclose their objectives in broad terms, their managers optimize for complex combinations of competing goals that go beyond simple risk-return trade-offs. Traditional approaches attempt to model this through multi-objective utility functions, but face fundamental challenges in specification and parameterization. We propose a generative framework that learns latent representations of fund manager strategies without requiring explicit utility specification. Our approach directly models the conditional probability of a fund’s portfolio weights, given stock characteristics, historical returns, previous weights, and a latent variable representing the fund’s strategy. Unlike methods based on reinforcement learning or imitation learning, which require specified rewards or labeled expert objectives, our GAN-based architecture learns directly from the joint distribution of observed holdings and market data. We validate our framework on a dataset of 1436 U.S. equity mutual funds. The learned representations successfully capture known investment styles, such as “growth” and “value,” while also revealing implicit manager objectives. For instance, we find that while many funds exhibit characteristics of Markowitz-like optimization, they do so with heterogeneous realizations for turnover, concentration, and latent factors. To analyze and interpret the end-to-end model, we develop a series of tests that explain the model, and we show that the benchmark’s expert labelings are contained in our model’s encoding in a linearly interpretable way. Our framework provides a data-driven approach for characterizing investment strategies for applications in market simulation, strategy attribution, and regulatory oversight.
zh
[AI-93] Data-driven Projection Generation for Efficiently Solving Heterogeneous Quadratic Programming Problems
【速读】:该论文旨在解决高维二次规划(Quadratic Programming, QP)问题的计算效率瓶颈问题,即在保证解质量的前提下显著降低求解时间。其解决方案的关键在于提出一种数据驱动的框架,通过针对每个QP实例生成特定的低维投影矩阵(instance-specific projection),将原始高维问题映射到低维空间进行求解。该框架利用图神经网络(Graph Neural Network, GNN)模型学习生成高质量的投影矩阵,从而在未见过的QP实例上仍能获得可行且接近最优的解。训练过程被建模为双层优化问题:内层使用QP求解器在给定投影下优化目标函数,外层更新GNN参数以最小化投影解的期望目标值;作者进一步设计了一种无需反向传播穿越求解器的高效算法来实现参数梯度计算,并提供了理论分析以证明该方法在泛化能力上的优势。
链接: https://arxiv.org/abs/2510.26061
作者: Tomoharu Iwata,Futoshi Futami
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
Abstract:We propose a data-driven framework for efficiently solving quadratic programming (QP) problems by reducing the number of variables in high-dimensional QPs using instance-specific projection. A graph neural network-based model is designed to generate projections tailored to each QP instance, enabling us to produce high-quality solutions even for previously unseen problems. The model is trained on heterogeneous QPs to minimize the expected objective value evaluated on the projected solutions. This is formulated as a bilevel optimization problem; the inner optimization solves the QP under a given projection using a QP solver, while the outer optimization updates the model parameters. We develop an efficient algorithm to solve this bilevel optimization problem, which computes parameter gradients without backpropagating through the solver. We provide a theoretical analysis of the generalization ability of solving QPs with projection matrices generated by neural networks. Experimental results demonstrate that our method produces high-quality feasible solutions with reduced computation time, outperforming existing methods.
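下面演示框架的内层步骤"给定投影矩阵后在低维空间求解 QP 并映射回原空间";此处的投影取随机正交矩阵,而论文中由 GNN 针对每个实例生成,Q、c 等均为标准 QP 记号。

```python
import numpy as np

def solve_projected_qp(Q, c, P):
    """min_x 0.5 x^T Q x + c^T x,限制 x = P z(P: d x r)。
    降维后的问题是 r 维无约束 QP,有闭式解。"""
    Qr = P.T @ Q @ P           # (r, r) 降维二次项
    cr = P.T @ c               # (r,)   降维线性项
    z = np.linalg.solve(Qr, -cr)
    return P @ z

rng = np.random.default_rng(0)
d, r = 200, 10
A = rng.standard_normal((d, d))
Q = A @ A.T + d * np.eye(d)    # 正定
c = rng.standard_normal(d)
P = np.linalg.qr(rng.standard_normal((d, r)))[0]   # 随机正交投影(示意)
x = solve_projected_qp(Q, c, P)
print(x.shape, 0.5 * x @ Q @ x + c @ x)
```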
zh
机器学习
[LG-0] Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula and Interpretability
链接: https://arxiv.org/abs/2510.26792
作者: Tao Tao,Maissam Barkeshli
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Cryptography and Security (cs.CR)
备注:  10+13 pages, 8+19 figures
Abstract:We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to 2^{22} using up to 50 million model parameters and datasets with up to 5 billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus m: the number of in-context sequence elements required for near-perfect prediction grows as \sqrt{m}. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli m \geq 2^{20} requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
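作为背景,下面给出 PCG32(XSH-RR 变体)的参考实现,即论文中 Transformer 需要在上下文中预测的这一类生成器:LCG 状态更新之外叠加移位、异或与旋转等输出置换;常数取自 O'Neill 的公开参考实现。

```python
MULT = 6364136223846793005
MASK64 = (1 << 64) - 1

def pcg32_step(state, inc):
    """一步 PCG-XSH-RR:由旧状态计算输出置换,同时做 LCG 状态更新。"""
    old = state
    state = (old * MULT + (inc | 1)) & MASK64
    xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
    rot = old >> 59
    out = ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF
    return state, out

state, inc, seq = 42, 54, []
for _ in range(5):
    state, out = pcg32_step(state, inc)
    seq.append(out)
print(seq)
```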
[LG-1] Pre-trained Forecasting Models: Strong Zero-Shot Feature Extractors for Time Series Classification NEURIPS2025
链接: https://arxiv.org/abs/2510.26777
作者: Andreas Auer,Daniel Klotz,Sebastinan Böck,Sepp Hochreiter
类目: Machine Learning (cs.LG)
备注:  NeurIPS 2025 Workshop on Recent Advances in Time Series Foundation Models (BERT2S)
Abstract:Recent research on time series foundation models has primarily focused on forecasting, leaving it unclear how generalizable their learned representations are. In this study, we examine whether frozen pre-trained forecasting models can provide effective representations for classification. To this end, we compare different representation extraction strategies and introduce two model-agnostic embedding augmentations. Our experiments show that the best forecasting models achieve classification accuracy that matches or even surpasses that of state-of-the-art models pre-trained specifically for classification. Moreover, we observe a positive correlation between forecasting and classification performance. These findings challenge the assumption that task-specific pre-training is necessary, and suggest that learning to forecast may provide a powerful route toward constructing general-purpose time series foundation models.
[LG-2] On Purely Private Covariance Estimation
链接: https://arxiv.org/abs/2510.26717
作者: Tommaso d’Orsi,Gleb Novikov
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
备注:  equal contribution
Abstract:We present a simple perturbation mechanism for the release of d-dimensional covariance matrices \Sigma under pure differential privacy. For large datasets with at least n \geq d^2/\varepsilon elements, our mechanism recovers the provably optimal Frobenius norm error guarantees of \cite{nikolov2023private}, while simultaneously achieving best known error for all other p-Schatten norms, with p \in [1,\infty]. Our error is information-theoretically optimal for all p \geq 2; in particular, our mechanism is the first purely private covariance estimator that achieves optimal error in spectral norm. For small datasets n < d^2/\varepsilon, we further show that by projecting the output onto the nuclear norm ball of appropriate radius, our algorithm achieves the optimal Frobenius norm error O(\sqrt{d\,\mathrm{Tr}(\Sigma)}/n), improving over the known bounds of O(\sqrt{d}/n) of \cite{nikolov2023private} and O(d^{3/4}\sqrt{\mathrm{Tr}(\Sigma)}/n) of \cite{dong2022differentially}.
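下面给出"加噪 + 核范数球投影"这一思路的概念性示意:为保持纯 ε-DP 对协方差逐项加对称化的 Laplace 噪声,再把特征值投影到 l1 球(等价于对称矩阵到核范数球的投影);噪声尺度与球半径均为占位假设,并非论文校准后的机制。

```python
import numpy as np

def proj_l1(v, radius):
    """把向量投影到半径为 radius 的 l1 球上(标准算法)。"""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * j > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def private_covariance(X, eps, radius, seed=0):
    n, d = X.shape
    S = X.T @ X / n
    # Laplace(而非 Gaussian)噪声对应纯 eps-DP;此处尺度为占位值,
    # 实际需按逐项敏感度仔细校准
    E = np.random.default_rng(seed).laplace(scale=2.0 / (n * eps), size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T          # 对称化
    vals, vecs = np.linalg.eigh(S + E)
    vals = proj_l1(vals, radius)              # 核范数球投影
    return (vecs * vals) @ vecs.T

X = np.random.default_rng(1).standard_normal((500, 10))
print(private_covariance(X, eps=1.0, radius=12.0).shape)
```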
[LG-3] LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation
链接: https://arxiv.org/abs/2510.26715
作者: Gabriel Asher,Devesh Shah,Amy A. Caudy,Luke Ferro,Lea Amar,Ana S. H. Costa,Thomas Patton,Niall O’Connor,Jennifer M. Campbell,Jack Geremia
类目: Machine Learning (cs.LG)
备注:
Abstract:A vast majority of mass spectrometry data remains uncharacterized, leaving much of its biological and chemical information untapped. Recent advances in machine learning have begun to address this gap, particularly for tasks such as spectral identification in tandem mass spectrometry data. Here, we present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds, yielding 42% more correct identifications in complex biological samples, and maintaining robustness under low-concentration conditions. Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct biological interpretation from minimal downstream data, successfully differentiating disease states and predicting clinical outcomes across diverse translational applications.
[LG-4] An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
链接: https://arxiv.org/abs/2510.26709
作者: Chuyan Chen,Chenyang Ma,Zhangxin Li,Yutong He,Yanjie Dong,Kun Yuan
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:  8 pages, 2 figures
Abstract:Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-K discards structural information and performs poorly in practice, while Top-K preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-K, an All-Reduce-Compatible Top-K compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-K is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-K matches the accuracy of Top-K while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-K with the strong performance of Top-K.
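下面用单进程 NumPy 模拟 ARC-Top-K 的核心思想:所有节点先就一个由轻量 sketch 决定的公共 Top-K 支撑集达成一致,于是稀疏条目可以直接用稠密 All-Reduce 求和而无需 All-Gather;此处用各节点梯度平方的逐元素聚合代替论文中的真实 sketch,仅作示意。

```python
import numpy as np

def arc_top_k(grads, k):
    """grads: 各 worker 的同形梯度列表。返回限制在公共 Top-K
    支撑集上的平均梯度(支撑集在所有 worker 上一致)。"""
    sketch = sum(g ** 2 for g in grads)           # 代替真实的轻量 sketch
    idx = np.argpartition(sketch, -k)[-k:]        # 公共支撑集
    out = np.zeros_like(grads[0])
    out[idx] = sum(g[idx] for g in grads) / len(grads)  # 可直接 All-Reduce
    return out

rng = np.random.default_rng(0)
grads = [rng.standard_normal(100) for _ in range(4)]
print(np.count_nonzero(arc_top_k(grads, 10)))  # 10
```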
[LG-5] Budgeted Multiple-Expert Deferral
链接: https://arxiv.org/abs/2510.26706
作者: Giulia DeSalvo,Clara Mohri,Mehryar Mohri,Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Learning to defer uncertain predictions to costly experts offers a powerful strategy for improving the accuracy and efficiency of machine learning systems. However, standard training procedures for deferral algorithms typically require querying all experts for every training instance, an approach that becomes prohibitively expensive when expert queries incur significant computational or resource costs. This undermines the core goal of deferral: to limit unnecessary expert usage. To overcome this challenge, we introduce the budgeted deferral framework, which aims to train effective deferral algorithms while minimizing expert query costs during training. We propose new algorithms for both two-stage and single-stage multiple-expert deferral settings that selectively query only a subset of experts per training example. While inspired by active learning, our setting is fundamentally different: labels are already known, and the core challenge is to decide which experts to query in order to balance cost and predictive performance. We establish theoretical guarantees for both of our algorithms, including generalization bounds and label complexity analyses. Empirical results across several domains show that our algorithms substantially reduce training costs without sacrificing prediction accuracy, demonstrating the practical value of our budget-aware deferral algorithms.
[LG-6] How Regularization Terms Make Invertible Neural Networks Bayesian Point Estimators
链接: https://arxiv.org/abs/2510.26704
作者: Nick Heilenkötter
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:  Preprint, under review
Abstract:Can regularization terms in the training of invertible neural networks lead to known Bayesian point estimators in reconstruction? Invertible networks are attractive for inverse problems due to their inherent stability and interpretability. Recently, optimization strategies for invertible neural networks that approximate either a reconstruction map or the forward operator have been studied from a Bayesian perspective, but each has limitations. To address this, we introduce and analyze two regularization terms for the network training that, upon inversion of the network, recover properties of classical Bayesian point estimators: while the first can be connected to the posterior mean, the second resembles the MAP estimator. Our theoretical analysis characterizes how each loss shapes both the learned forward operator and its inverse reconstruction map. Numerical experiments support our findings and demonstrate how these loss-term regularizers introduce data-dependence in a stable and interpretable way.
[LG-7] LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits
链接: https://arxiv.org/abs/2510.26690
作者: Amir Reza Mirzaei,Yuqiao Wen,Yanshuai Cao,Lili Mou
类目: Machine Learning (cs.LG)
备注:
Abstract:Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that our LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.
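下面示意"SVD 重参数化 + 混合精度量化"的基本做法:把适配器乘积 BA 旋转到奇异方向上,使能量集中到前若干行/列,再对头部用较高比特、尾部用超低比特做均匀量化;位宽、保留的方向数与量化器均为示意性选择,并非 LoRAQuant 的精确配置。

```python
import numpy as np

def fake_quant(x, bits):
    """均匀对称量化到 bits 比特(随后反量化,便于比较误差)。"""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def loraquant_like(B, A, keep=4, hi_bits=8, lo_bits=2):
    """B: (d, r), A: (r, d_in)。对乘积 BA 做秩 r 的 SVD 重参数化后量化。"""
    r = B.shape[1]
    U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
    Bh, Ah = U[:, :r] * s[:r], Vt[:r]          # 重参数化后的两个因子
    Bq = np.concatenate([fake_quant(Bh[:, :keep], hi_bits),
                         fake_quant(Bh[:, keep:], lo_bits)], axis=1)
    Aq = np.concatenate([fake_quant(Ah[:keep], hi_bits),
                         fake_quant(Ah[keep:], lo_bits)], axis=0)
    return Bq, Aq

rng = np.random.default_rng(0)
B, A = rng.standard_normal((64, 16)), rng.standard_normal((16, 64))
Bq, Aq = loraquant_like(B, A)
print(np.linalg.norm(B @ A - Bq @ Aq) / np.linalg.norm(B @ A))  # 相对误差
```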
[LG-8] Tight Differentially Private PCA via Matrix Coherence
链接: https://arxiv.org/abs/2510.26679
作者: Tommaso d’Orsi,Gleb Novikov
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
备注:  SODA 2026; equal contribution
Abstract:We revisit the task of computing the span of the top r singular vectors u_1, \ldots, u_r of a matrix under differential privacy. We show that a simple and efficient algorithm – based on singular value decomposition and standard perturbation mechanisms – returns a private rank-r approximation whose error depends only on the rank-r coherence of u_1, \ldots, u_r and the spectral gap \sigma_r - \sigma_{r+1}. This resolves a question posed by Hardt and Roth \cite{hardt2013beyond}. Our estimator outperforms the state of the art – significantly so in some regimes. In particular, we show that in the dense setting, it achieves the same guarantees for single-spike PCA in the Wishart model as those attained by optimal non-private algorithms, whereas prior private algorithms failed to do so. In addition, we prove that rank-r coherence does not increase under Gaussian perturbations. This implies that any estimator based on the Gaussian mechanism – including ours – preserves the coherence of the input. We conjecture that similar behavior holds for other structured models, including planted problems in graphs. We also explore applications of coherence to graph problems. In particular, we present a differentially private algorithm for Max-Cut and other constraint satisfaction problems under low coherence assumptions.
[LG-9] Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems
链接: https://arxiv.org/abs/2510.26656
作者: Georgios Kamaras,Craig Innes,Subramanian Ramamoorthy
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注:
Abstract:In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.
[LG-10] Curly Flow Matching for Learning Non-gradient Field Dynamics NEURIPS2025
链接: https://arxiv.org/abs/2510.26645
作者: Katarina Petrović,Lazar Atanackovic,Viggo Moro,Kacper Kapuśniak,İsmail İlkan Ceylan,Michael Bronstein,Avishek Joey Bose,Alexander Tong
类目: Machine Learning (cs.LG)
备注:  Accepted to NeurIPS 2025
Abstract:Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schrödinger bridge problem with a non-zero drift reference process – in stark contrast to typical zero-drift reference processes – which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems. Our code repository is accessible at: this https URL
[LG-11] MSAD: A Deep Dive into Model Selection for Time series Anomaly Detection VLDB
链接: https://arxiv.org/abs/2510.26643
作者: Emmanouil Sylligardos,John Paparrizos,Themis Palpanas,Pierre Senellart,Paul Boniol
类目: Machine Learning (cs.LG)
备注:  25 pages, 13 figures, VLDB Journal
Abstract:Anomaly detection is a fundamental task for time series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmarks and evaluation studies demonstrated that no overall best anomaly detection methods exist when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection methods to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, this paper studies the performance of time series classification methods used as model selection for anomaly detection. In total, we evaluate 234 model configurations derived from 16 base classifiers across more than 1980 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection, and represents a strong baseline that can then be used to guide the model selection step in general AutoML pipelines. Preprint version of an article accepted at the VLDB Journal.
[LG-12] Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimization
链接: https://arxiv.org/abs/2510.26633
作者: Colin Doumont,Victor Picheny,Viacheslav Borovitskiy,Henry Moss
类目: Machine Learning (cs.LG)
备注:
Abstract:Bayesian Optimization (BO) has the potential to solve various combinatorial tasks, ranging from materials science to neural architecture search. However, BO requires specialized kernels to effectively model combinatorial domains. Recent efforts have introduced several combinatorial kernels, but the relationships among them are not well understood. To bridge this gap, we develop a unifying framework based on heat kernels, which we derive in a systematic way and express as simple closed-form expressions. Using this framework, we prove that many successful combinatorial kernels are either related or equivalent to heat kernels, and validate this theoretical claim in our experiments. Moreover, our analysis confirms and extends the results presented in Bounce: certain algorithms’ performance decreases substantially when the unknown optima of the function do not have a certain structure. In contrast, heat kernels are not sensitive to the location of the optima. Lastly, we show that a fast and simple pipeline, relying on heat kernels, is able to achieve state-of-the-art results, matching or even outperforming certain slow or complex algorithms.
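作为正文所述热核的一个具体例子:d 维超立方体(二值输入)上的扩散核有封闭形式,归一化后随汉明距离按 tanh(β) 的幂衰减(Kondor–Lafferty 式图扩散核)。下面据此构造 Gram 矩阵;这只是此类核的一种标准构造,未必与论文的推导逐字一致。

```python
import numpy as np

def hypercube_heat_kernel(X, Y, beta=0.5):
    """X: (n, d), Y: (m, d) 二值数组;返回 (n, m) 的 Gram 矩阵,
    K(x, y) = tanh(beta) ** HammingDistance(x, y)(对角线已归一为 1)。"""
    hamming = (X[:, None, :] != Y[None, :, :]).sum(-1)
    return np.tanh(beta) ** hamming

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 10))
K = hypercube_heat_kernel(X, X)
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # 半正定自检
```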
[LG-13] Wasserstein Regression as a Variational Approximation of Probabilistic Trajectories through the Bernstein Basis
链接: https://arxiv.org/abs/2510.26607
作者: Maksim Maslov,Alexander Kugaevskikh,Matthew Ivanov
类目: Machine Learning (cs.LG)
备注:
Abstract:This paper considers the problem of regression over distributions, which is becoming increasingly important in machine learning. Existing approaches often ignore the geometry of the probability space or are computationally expensive. To overcome these limitations, a new method is proposed that combines the parameterization of probability trajectories using a Bernstein basis and the minimization of the Wasserstein distance between distributions. The key idea is to model a conditional distribution as a smooth probability trajectory defined by a weighted sum of Gaussian components whose parameters – the mean and covariance – are functions of the input variable constructed using Bernstein polynomials. The loss function is the averaged squared Wasserstein distance between the predicted Gaussian distributions and the empirical data, which takes into account the geometry of the distributions. An autodiff-based optimization method is used to train the model. Experiments on synthetic datasets that include complex trajectories demonstrated that the proposed method provides competitive approximation quality in terms of the Wasserstein distance, Energy Distance, and RMSE metrics, especially in cases of pronounced nonlinearity. The model demonstrates trajectory smoothness that is better than or comparable to alternatives and robustness to changes in data structure, while maintaining high interpretability due to explicit parameterization via control points. The developed approach represents a balanced solution that combines geometric accuracy, computational practicality, and interpretability. Prospects for further research include extending the method to non-Gaussian distributions, applying entropy regularization to speed up computations, and adapting the approach to working with high-dimensional data for approximating surfaces and more complex structures.
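下面给出该方法两个要素的一维示意:用 Bernstein 基参数化高斯分布均值与(对数)标准差随输入 t 的轨迹,并以一维高斯间闭式的 W2 距离 W2² = (μ₁−μ₂)² + (σ₁−σ₂)² 作为损失;控制点即可学习参数,具体数值仅为演示假设。

```python
import numpy as np
from math import comb

def bernstein(t, coeffs):
    """计算 sum_k coeffs[k] * B_{k,n}(t),t 属于 [0, 1]。"""
    n = len(coeffs) - 1
    B = np.array([comb(n, k) * t**k * (1 - t)**(n - k) for k in range(n + 1)])
    return coeffs @ B

def w2_gaussian_loss(mu_ctrl, logsig_ctrl, t, mu_obs, sig_obs):
    mu = bernstein(t, mu_ctrl)
    sig = np.exp(bernstein(t, logsig_ctrl))
    return (mu - mu_obs) ** 2 + (sig - sig_obs) ** 2   # 一维高斯的闭式 W2^2

# 控制点就是完整模型中的可学习参数(此处手工给定)
mu_ctrl = np.array([0.0, 1.0, 2.0])
logsig_ctrl = np.zeros(3)
print(w2_gaussian_loss(mu_ctrl, logsig_ctrl, 0.5, mu_obs=1.0, sig_obs=1.2))
```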
[LG-14] On Measuring Localization of Shortcuts in Deep Networks
链接: https://arxiv.org/abs/2510.26560
作者: Nikita Tsoy,Nikola Konstantinov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies the layer-wise contribution to accuracy degradation caused by a shortcut-inducing skew by counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze the differences in localization and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests the hardness of designing general methods, supporting dataset- and architecture-specific approaches instead.
[LG-15] Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices
链接: https://arxiv.org/abs/2510.26557
作者: Jan Stenkamp,Nina Herrmann,Benjamin Karic,Stefan Oehmcke,Fabian Gieseke
类目: Machine Learning (cs.LG)
备注:
Abstract:Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that models achieved the same performance with a compression ratio of 4-16x compared to LightGBM models using an adapted training process and an alternative memory layout. Once deployed, the corresponding IoT devices can operate independently of constant communication or external energy supply, and, thus, autonomously, requiring only minimal computing power and energy. This capability opens the door to a wide range of IoT applications, including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.
[LG-16] A Three-Stage Bayesian Transfer Learning Framework to Improve Predictions in Data-Scarce Domains
链接: https://arxiv.org/abs/2510.26541
作者: Aidan Furlong,Robert Salko,Xingang Zhao,Xu Wu
类目: Machine Learning (cs.LG)
备注:  Submitted to Engineering Applications of Artificial Intelligence
Abstract:The use of ML in engineering has grown steadily to support a wide array of applications. Among these methods, deep neural networks have been widely adopted due to their performance and accessibility, but they require large, high-quality datasets. Experimental data are often sparse, noisy, or insufficient to build resilient data-driven models. Transfer learning, which leverages relevant data-abundant source domains to assist learning in data-scarce target domains, has shown efficacy. Parameter transfer, where pretrained weights are reused, is common but degrades under large domain shifts. Domain-adversarial neural networks (DANNs) help address this issue by learning domain-invariant representations, thereby improving transfer under greater domain shifts in a semi-supervised setting. However, DANNs can be unstable during training and lack a native means for uncertainty quantification. This study introduces a fully-supervised three-stage framework, the staged Bayesian domain-adversarial neural network (staged B-DANN), that combines parameter transfer and shared latent space adaptation. In Stage 1, a deterministic feature extractor is trained on the source domain. This feature extractor is then adversarially refined using a DANN in Stage 2. In Stage 3, a Bayesian neural network is built on the adapted feature extractor for fine-tuning on the target domain to handle conditional shifts and yield calibrated uncertainty estimates. This staged B-DANN approach was first validated on a synthetic benchmark, where it was shown to significantly outperform standard transfer techniques. It was then applied to the task of predicting critical heat flux in rectangular channels, leveraging data from tube experiments as the source domain. The results of this study show that the staged B-DANN method can improve predictive accuracy and generalization, potentially assisting other domains in nuclear engineering.
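框架的第二阶段依赖 DANN 的梯度反转层(gradient reversal layer, GRL):前向为恒等映射,反向把梯度乘以 −λ,使特征提取器学到域不变表示。下面给出 GRL 的标准 PyTorch 写法(λ 取值为假设);三阶段的其余部分(贝叶斯头、微调)此处省略。

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                    # 前向:恒等

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None    # 反向:梯度取反并缩放

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# 自检:经过 GRL 后,梯度方向相反
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lam=0.5).sum().backward()
print(x.grad)  # tensor([-0.5000, -0.5000, -0.5000])
```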
[LG-17] Higher-Order Regularization Learning on Hypergraphs
链接: https://arxiv.org/abs/2510.26533
作者: Adrien Weihs,Andrea Bertozzi,Matthew Thorpe
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:
Abstract:Higher-Order Hypergraph Learning (HOHL) was recently introduced as a principled alternative to classical hypergraph regularization, enforcing higher-order smoothness via powers of multiscale Laplacians induced by the hypergraph structure. Prior work established the well- and ill-posedness of HOHL through an asymptotic consistency analysis in geometric settings. We extend this theoretical foundation by proving the consistency of a truncated version of HOHL and deriving explicit convergence rates when HOHL is used as a regularizer in fully supervised learning. We further demonstrate its strong empirical performance in active learning and in datasets lacking an underlying geometric structure, highlighting HOHL’s versatility and robustness across diverse learning settings.
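下面以团展开(clique expansion)构造一个超图拉普拉斯,并计算 u^T L^s u 形式的高阶光滑正则项,以示意 HOHL "拉普拉斯幂次"的基本形态;团展开只是诱导拉普拉斯的常见构造之一,未必与论文采用的多尺度构造完全一致。

```python
import numpy as np

def clique_laplacian(n, hyperedges):
    """把每条超边展开成一个(按边大小归一的)团,返回图拉普拉斯。"""
    W = np.zeros((n, n))
    for e in hyperedges:
        for i in e:
            for j in e:
                if i != j:
                    W[i, j] += 1.0 / (len(e) - 1)
    return np.diag(W.sum(1)) - W

def hohl_penalty(u, L, s=2):
    """s = 1 为普通图正则,s > 1 惩罚更高阶的光滑性。"""
    return u @ np.linalg.matrix_power(L, s) @ u

L = clique_laplacian(5, [(0, 1, 2), (2, 3, 4)])
u = np.array([1.0, 1.0, 0.5, 0.0, 0.0])   # 顶点上的标签/得分
print(hohl_penalty(u, L, s=1), hohl_penalty(u, L, s=2))
```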
[LG-18] Polybasic Speculative Decoding Through a Theoretical Perspective
链接: https://arxiv.org/abs/2510.26527
作者: Ruilin Wang,Huixia Li,Yuexiao Ma,Xiawu Zheng,Fei Chao,Xuefeng Xiao,Rongrong Ji
类目: Machine Learning (cs.LG)
备注:
Abstract:Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel polybasic speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from 3.31\times to 4.01\times for LLaMA2-Chat 7B, up to 3.87\times for LLaMA3-8B, up to 4.43\times for Vicuna-7B, and up to 3.85\times for Qwen2-7B – all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
[LG-19] Think Outside the Policy: In-Context Steered Policy Optimization
链接: https://arxiv.org/abs/2510.26519
作者: Hsiu-Yuan Huang,Chenming Tang,Weijie Liu,Saiyong Yang,Yunfang Wu
类目: Machine Learning (cs.LG)
*备注:  Work in progress
Abstract:Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts, which are confined to the current policy’s distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates Expert Region Reject Sampling to filter unreliable off-policy trajectories and Annealed Expert-Bonus Reward Shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs.
[LG-20] LLMs as In-Context Meta-Learners for Model and Hyperparameter Selection
链接: https://arxiv.org/abs/2510.26510
作者: Youssef Attia El Hili,Albert Thomas,Malik Tiomoko,Abdelhakim Benechehab,Corentin Léger,Corinne Ancourt,Balázs Kégl
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:  27 pages, 6 figures
Abstract:Model and hyperparameter selection are critical but challenging in machine learning, typically requiring expert intuition or expensive automated search. We investigate whether large language models (LLMs) can act as in-context meta-learners for this task. By converting each dataset into interpretable metadata, we prompt an LLM to recommend both model families and hyperparameters. We study two prompting strategies: (1) a zero-shot mode relying solely on pretrained knowledge, and (2) a meta-informed mode augmented with examples of models and their performance on past tasks. Across synthetic and real-world benchmarks, we show that LLMs can exploit dataset metadata to recommend competitive models and hyperparameters without search, and that improvements from meta-informed prompting demonstrate their capacity for in-context meta-learning. These results highlight a promising new role for LLMs as lightweight, general-purpose assistants for model selection and hyperparameter optimization.
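A hedged sketch of the meta-informed prompting idea: serialize dataset metadata plus past (dataset, model, score) records into a single prompt. All names and the prompt wording are placeholders, not the paper's template.

```python
# Illustrative construction of a "meta-informed" model-selection prompt.
def build_meta_prompt(metadata: dict, past_records: list[tuple[str, str, float]]) -> str:
    lines = ["You are a model-selection assistant.",
             "Target dataset metadata:"]
    lines += [f"  {k}: {v}" for k, v in metadata.items()]
    lines.append("Past tasks (metadata summary, model+hyperparameters, score):")
    lines += [f"  {meta} -> {model} (accuracy={score:.3f})"
              for meta, model, score in past_records]
    lines.append("Recommend a model family and hyperparameters for the target dataset.")
    return "\n".join(lines)

prompt = build_meta_prompt(
    {"n_samples": 1200, "n_features": 20, "task": "binary classification"},
    [("tabular, 5k rows, 30 feats", "GradientBoosting(n_estimators=300)", 0.91)],
)
print(prompt)
```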
[LG-21] Enhancing ECG Classification Robustness with Lightweight Unsupervised Anomaly Detection Filters
链接: https://arxiv.org/abs/2510.26501
作者: Mustafa Fuad Rifet Ibrahim,Maurice Meijer,Alexander Schlaefer,Peer Stelldinger
类目: Machine Learning (cs.LG)
*备注:  Submitted to the 24th International Conference on Pervasive Computing and Communications (PerCom 2026)
Abstract:Continuous electrocardiogram (ECG) monitoring via wearables offers significant potential for early cardiovascular disease (CVD) detection. However, deploying deep learning models for automated analysis in resource-constrained environments faces reliability challenges due to inevitable Out-of-Distribution (OOD) data. OOD inputs, such as unseen pathologies or noise-corrupted signals, often cause erroneous, high-confidence predictions by standard classifiers, compromising patient safety. Existing OOD detection methods either neglect computational constraints or address noise and unseen classes separately. This paper explores Unsupervised Anomaly Detection (UAD) as an independent, upstream filtering mechanism to improve robustness. We benchmark six UAD approaches, including Deep SVDD, reconstruction-based models, Masked Anomaly Detection, normalizing flows, and diffusion models, optimized via Neural Architecture Search (NAS) under strict resource constraints (at most 512k parameters). Evaluation on PTB-XL and BUT QDB datasets assessed detection of OOD CVD classes and signals unsuitable for analysis due to noise. Results show Deep SVDD consistently achieves the best trade-off between detection and efficiency. In a realistic deployment simulation, integrating the optimized Deep SVDD filter with a diagnostic classifier improved accuracy by up to 21 percentage points over a classifier-only baseline. This study demonstrates that optimized UAD filters can safeguard automated ECG analysis, enabling safer, more reliable continuous cardiovascular monitoring on wearables.
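Since Deep SVDD emerges as the best trade-off, here is a minimal PyTorch sketch of its core idea: map normal training data near a fixed center and score test points by squared distance. The architecture and sizes are illustrative assumptions, not the NAS-optimized model from the paper.

```python
# Minimal Deep SVDD scoring sketch (illustrative architecture and sizes).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 16))
x_train = torch.randn(256, 128)              # stand-in for windowed ECG features
with torch.no_grad():
    c = encoder(x_train).mean(dim=0)         # center = mean embedding of normal data

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(50):
    z = encoder(x_train)
    loss = ((z - c) ** 2).sum(dim=1).mean()  # pull normal embeddings toward c
    opt.zero_grad(); loss.backward(); opt.step()

def anomaly_score(x):
    with torch.no_grad():
        return ((encoder(x) - c) ** 2).sum(dim=1)  # large distance => likely OOD

print(anomaly_score(torch.randn(4, 128)))
```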
[LG-22] Data-Efficient RLVR via Off-Policy Influence Guidance
链接: https://arxiv.org/abs/2510.26491
作者: Erle Zhu,Dazhi Jiang,Yuan Wang,Xujun Li,Jiale Cheng,Yuxian Gu,Yilin Niu,Aohan Zeng,Jie Tang,Minlie Huang,Hongning Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
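A small numpy sketch of the two computational ingredients described above: influence approximated as a gradient inner product, and a sparse random projection for dimensionality reduction. The gradients here are synthetic stand-ins, not actual policy gradients.

```python
# Sketch of influence scoring with a sparse {-1,0,+1} random projection.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10_000, 256, 100                     # param dim, projection dim, candidates

P = rng.choice([-1.0, 0.0, 1.0], size=(k, d),  # sparse Achlioptas-style projection
               p=[1/6, 2/3, 1/6]) * np.sqrt(3.0 / k)

grad_val = rng.standard_normal(d)              # stand-in: gradient of eval objective
cand_grads = rng.standard_normal((n, d))       # stand-in: per-example gradients

pv = P @ grad_val
influence = (cand_grads @ P.T) @ pv            # inner products in projected space
top = np.argsort(-influence)[:10]              # most influential examples this stage
print(top)
```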
[LG-23] Quantum Gated Recurrent GAN with Gaussian Uncertainty for Network Anomaly Detection
链接: https://arxiv.org/abs/2510.26487
作者: Wajdi Hammami,Soumaya Cherkaoui,Jean-Frederic Laprade,Ola Ahmad,Shengrui Wang
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Anomaly detection in time-series data is a critical challenge with significant implications for network security. Recent quantum machine learning approaches, such as quantum kernel methods and variational quantum circuits, have shown promise in capturing complex data distributions for anomaly detection but remain constrained by limited qubit counts. We introduce in this work a novel Quantum Gated Recurrent Unit (QGRU)-based Generative Adversarial Network (GAN) employing Successive Data Injection (SuDaI) and a multi-metric gating strategy for robust network anomaly detection. Our model uniquely utilizes a quantum-enhanced generator that outputs parameters (mean and log-variance) of a Gaussian distribution via reparameterization, combined with a Wasserstein critic to stabilize adversarial training. Anomalies are identified through a novel gating mechanism that initially flags potential anomalies based on Gaussian uncertainty estimates and subsequently verifies them using a composite of critic scores and reconstruction errors. Evaluated on benchmark datasets, our method achieves a high time-series aware F1 score (TaF1) of 89.43% demonstrating superior capability in detecting anomalies accurately and promptly as compared to existing classical and quantum models. Furthermore, the trained QGRU-WGAN was deployed on real IBM Quantum hardware, where it retained high anomaly detection performance, confirming its robustness and practical feasibility on current noisy intermediate-scale quantum (NISQ) devices.
[LG-24] ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
链接: https://arxiv.org/abs/2510.26475
作者: Qiaoling Chen,Zijun Liu,Peng Sun,Shenggui Li,Guoteng Wang,Ziming Liu,Yonggang Wen,Siyuan Feng,Tianwei Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B–14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
[LG-25] Vectorized Context-Aware Embeddings for GAT-Based Collaborative Filtering
链接: https://arxiv.org/abs/2510.26461
作者: Danial Ebrat,Sepideh Ahmadian,Luis Rueda
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Recommender systems often struggle with data sparsity and cold-start scenarios, limiting their ability to provide accurate suggestions for new or infrequent users. This paper presents a Graph Attention Network (GAT) based Collaborative Filtering (CF) framework enhanced with Large Language Model (LLM) driven context aware embeddings. Specifically, we generate concise textual user profiles and unify item metadata (titles, genres, overviews) into rich textual embeddings, injecting these as initial node features in a bipartite user item graph. To further optimize ranking performance, we introduce a hybrid loss function that combines Bayesian Personalized Ranking (BPR) with a cosine similarity term and robust negative sampling, ensuring explicit negative feedback is distinguished from unobserved data. Experiments on the MovieLens 100k and 1M datasets show consistent improvements over state-of-the-art baselines in Precision, NDCG, and MAP while demonstrating robustness for users with limited interaction history. Ablation studies confirm the critical role of LLM-augmented embeddings and the cosine similarity term in capturing nuanced semantic relationships. Our approach effectively mitigates sparsity and cold-start limitations by integrating LLM-derived contextual understanding into graph-based architectures. Future directions include balancing recommendation accuracy with coverage and diversity, and introducing fairness-aware constraints and interpretability features to enhance system performance further.
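A hedged PyTorch sketch of a hybrid ranking loss combining BPR with a cosine-similarity term, as described in the abstract; the exact weighting and functional form here are assumptions rather than the paper's definition.

```python
# Illustrative hybrid BPR + cosine-similarity ranking loss.
import torch
import torch.nn.functional as F

def hybrid_loss(u, i_pos, i_neg, alpha=0.5):
    """u, i_pos, i_neg: (batch, dim) user / positive-item / negative-item embeddings."""
    bpr = -F.logsigmoid((u * i_pos).sum(-1) - (u * i_neg).sum(-1)).mean()
    cos = (1 - F.cosine_similarity(u, i_pos, dim=-1)).mean()  # pull positives close
    return bpr + alpha * cos

u, ip, ineg = (torch.randn(32, 64, requires_grad=True) for _ in range(3))
print(hybrid_loss(u, ip, ineg))
```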
[LG-26] Co-Evolving Latent Action World Models
链接: https://arxiv.org/abs/2510.26433
作者: Yucen Wang,Fengming Zhang,De-Chuan Zhan,Li Zhao,Kaixin Wang,Jiang Bian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adapting pre-trained video generation models into controllable world models via latent actions is a promising step towards creating generalist world models. The dominant paradigm adopts a two-stage approach that trains latent action model (LAM) and the world model separately, resulting in redundant training and limiting their potential for co-adaptation. A conceptually simple and appealing idea is to directly replace the forward dynamic model in LAM with a powerful world model and training them jointly, but it is non-trivial and prone to representational collapse. In this work, we propose CoLA-World, which for the first time successfully realizes this synergistic paradigm, resolving the core challenge in joint learning through a critical warm-up phase that effectively aligns the representations of the from-scratch LAM with the pre-trained world model. This unlocks a co-evolution cycle: the world model acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM, while the LAM offers a more precise and adaptable control interface to the world model. Empirically, CoLA-World matches or outperforms prior two-stage methods in both video simulation quality and downstream visual planning, establishing a robust and efficient new paradigm for the field.
[LG-27] Multi-Task Learning Based on Support Vector Machines and Twin Support Vector Machines: A Comprehensive Survey
链接: https://arxiv.org/abs/2510.26392
作者: Fatemeh Bazikar,Hossein Moosaei,Atefeh Hemmati,Panos M. Pardalos
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Multi-task learning (MTL) enables simultaneous training across related tasks, leveraging shared information to improve generalization, efficiency, and robustness, especially in data-scarce or high-dimensional scenarios. While deep learning dominates recent MTL research, Support Vector Machines (SVMs) and Twin SVMs (TWSVMs) remain relevant due to their interpretability, theoretical rigor, and effectiveness with small datasets. This chapter surveys MTL approaches based on SVM and TWSVM, highlighting shared representations, task regularization, and structural coupling strategies. Special attention is given to emerging TWSVM extensions for multi-task settings, which show promise but remain underexplored. We compare these models in terms of theoretical properties, optimization strategies, and empirical performance, and discuss applications in fields such as computer vision, natural language processing, and bioinformatics. Finally, we identify research gaps and outline future directions for building scalable, interpretable, and reliable margin-based MTL frameworks. This work provides a comprehensive resource for researchers and practitioners interested in SVM- and TWSVM-based multi-task learning.
[LG-28] Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2510.26389
作者: Wenchang Duan,Yaoliang Yu,Jiwan He,Yi Shi
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance for solving challenging tasks, such as long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
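The low-frequency truncation step can be illustrated in a few lines of numpy: keep only the lowest K frequency components of a time series to retain the global trend while discarding high-frequency detail. The cutoff K and the signal are toy assumptions.

```python
# Minimal Fourier low-frequency truncation sketch.
import numpy as np

def low_freq_summary(x, keep=4):
    """x: (T,) time series. Reconstruct from only the lowest `keep` frequencies."""
    spec = np.fft.rfft(x)
    spec[keep:] = 0.0                 # truncate everything above the cutoff
    return np.fft.irfft(spec, n=len(x))

t = np.linspace(0, 1, 64)
x = np.sin(2 * np.pi * t) + 0.3 * np.sin(2 * np.pi * 20 * t)  # trend + fast noise
print(np.round(low_freq_summary(x, keep=4)[:8], 3))
```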
[LG-29] Efficient Generative AI Boosts Probabilistic Forecasting of Sudden Stratospheric Warmings
链接: https://arxiv.org/abs/2510.26376
作者: Ningning Tao,Fei Xie,Baoxiang Pan,Hongyu Wang,Han Huang,Zhongpu Qiu,Ke Gui,Jiali Luo,Xiaosong Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sudden Stratospheric Warmings (SSWs) are key sources of subseasonal predictability and major drivers of extreme winter weather. Yet, their accurate and efficient forecast remains a persistent challenge for numerical weather prediction (NWP) systems due to limitations in physical representation, initialization, and the immense computational demands of ensemble forecasts. While data-driven forecasting is rapidly evolving, its application to the complex, three-dimensional dynamics of SSWs, particularly for probabilistic forecast, remains underexplored. Here, we bridge this gap by developing a Flow Matching-based generative AI model (FM-Cast) for efficient and skillful probabilistic forecasting of the spatiotemporal evolution of stratospheric circulation. Evaluated across 18 major SSW events (1998-2024), FM-Cast skillfully forecasts the onset, intensity, and morphology of 10 events up to 20 days in advance, achieving ensemble accuracies above 50%. Its performance is comparable to or exceeds leading NWP systems while requiring only two minutes for a 50-member, 30-day forecast on a consumer GPU. Furthermore, leveraging FM-Cast as a scientific tool, we demonstrate through idealized experiments that SSW predictability is fundamentally linked to its underlying physical drivers, distinguishing between events forced from the troposphere and those driven by internal stratospheric dynamics. Our work thus establishes a computationally efficient paradigm for probabilistic forecasting stratospheric anomalies and showcases generative AI’s potential to deepen the physical understanding of atmosphere-climate dynamics.
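For intuition on the flow-matching objective behind FM-Cast, here is a generic conditional flow-matching training step in PyTorch, regressing a velocity field onto straight-line interpolation velocities; the meteorological conditioning and the actual architecture are omitted, and all shapes are assumptions.

```python
# Generic flow-matching training step: learn v_theta(x_t, t) ~ (x1 - x0)
# along straight paths x_t = (1 - t) x0 + t x1.
import torch
import torch.nn as nn

dim = 32
vnet = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(vnet.parameters(), lr=1e-3)

x1 = torch.randn(64, dim)                 # stand-in for target circulation fields
x0 = torch.randn(64, dim)                 # noise samples
t = torch.rand(64, 1)
xt = (1 - t) * x0 + t * x1
target_v = x1 - x0                        # velocity of the straight interpolation
loss = ((vnet(torch.cat([xt, t], dim=1)) - target_v) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```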
[LG-30] owards Explainable and Reliable AI in Finance
链接: https://arxiv.org/abs/2510.26353
作者: Albi Isufaj,Pablo Mollá,Helmut Prendinger
类目: Machine Learning (cs.LG)
*备注:
Abstract:Financial forecasting increasingly uses large neural network models, but their opacity raises challenges for trust and regulatory compliance. We present several approaches to explainable and reliable AI in finance. First, we describe how Time-LLM, a time series foundation model, uses a prompt to avoid a wrong directional forecast. Second, we show that combining foundation models for time series forecasting with a reliability estimator can filter out unreliable predictions. Third, we argue for symbolic reasoning encoding domain rules for transparent justification. These approaches shift the emphasis to executing only forecasts that are both reliable and explainable. Experiments on equity and cryptocurrency data show that the architecture reduces false positives and supports selective execution. By integrating predictive performance with reliability estimation and rule-based reasoning, our framework advances transparent and auditable financial AI systems.
[LG-31] UnifiedFL: A Dynamic Unified Learning Framework for Equitable Federation
链接: https://arxiv.org/abs/2510.26350
作者: Furkan Pala,Islem Rekik
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) has emerged as a key paradigm for collaborative model training across multiple clients without sharing raw data, enabling privacy-preserving applications in areas such as radiology and pathology. However, works on collaborative training across clients with fundamentally different neural architectures and non-identically distributed datasets remain scarce. Existing FL frameworks face several limitations. Despite claiming to support architectural heterogeneity, most recent FL methods only tolerate variants within a single model family (e.g., shallower, deeper, or wider CNNs), still presuming a shared global architecture and failing to accommodate federations where clients deploy fundamentally different network types (e.g., CNNs, GNNs, MLPs). Moreover, existing approaches often address only statistical heterogeneity while overlooking the domain-fracture problem, where each client’s data distribution differs markedly from that faced at testing time, undermining model generalizability. When clients use different architectures, have non-identically distributed data, and encounter distinct test domains, current methods perform poorly. To address these challenges, we propose UnifiedFL, a dynamic federated learning framework that represents heterogeneous local networks as nodes and edges in a directed model graph optimized by a shared graph neural network (GNN). UnifiedFL introduces (i) a common GNN to parameterize all architectures, (ii) distance-driven clustering via Euclidean distances between clients’ parameters, and (iii) a two-tier aggregation policy balancing convergence and diversity. Experiments on MedMNIST classification and hippocampus segmentation benchmarks demonstrate UnifiedFL’s superior performance. Code and data: this https URL
[LG-32] Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections
链接: https://arxiv.org/abs/2510.26328
作者: David Schmotz,Sahar Abdelnabi,Maksym Andriushchenko
类目: Machine Learning (cs.LG)
*备注:
Abstract:Enabling continual learning in LLMs remains a key unresolved research challenge. In a recent announcement, a frontier LLM company made a step towards this by introducing Agent Skills, a framework that equips agents with new knowledge based on instructions stored in simple markdown files. Although Agent Skills can be a very useful tool, we show that they are fundamentally insecure, since they enable trivially simple prompt injections. We demonstrate how to hide malicious instructions in long Agent Skill files and referenced scripts to exfiltrate sensitive data, such as internal files or passwords. Importantly, we show how to bypass system-level guardrails of a popular coding agent: a benign, task-specific approval with the “Don’t ask again” option can carry over to closely related but harmful actions. Overall, we conclude that despite ongoing research efforts and scaling model capabilities, frontier LLMs remain vulnerable to very simple prompt injections in realistic scenarios. Our code is available at this https URL.
[LG-33] On the Impact of Weight Discretization in QUBO-Based SVM Training ECML KDD2025
链接: https://arxiv.org/abs/2510.26323
作者: Sascha Mücke
类目: Machine Learning (cs.LG)
*备注:  Presented at the 7th DSO Workshop at ECML PKDD 2025
Abstract:Training Support Vector Machines (SVMs) can be formulated as a QUBO problem, enabling the use of quantum annealing for model optimization. In this work, we study how the number of qubits - linked to the discretization level of dual weights - affects predictive performance across datasets. We compare QUBO-based SVM training to the classical LIBSVM solver and find that even low-precision QUBO encodings (e.g., 1 bit per parameter) yield competitive, and sometimes superior, accuracy. While increased bit-depth enables larger regularization parameters, it does not always improve classification. Our findings suggest that selecting the right support vectors may matter more than their precise weighting. Although current hardware limits the size of solvable QUBOs, our results highlight the potential of quantum annealing for efficient SVM training as quantum devices scale.
[LG-34] Model Inversion with Layer-Specific Modeling and Alignment for Data-Free Continual Learning NEURIPS2025
链接: https://arxiv.org/abs/2510.26311
作者: Ruilin Tong,Haodong Lu,Yuhang Liu,Dong Gong
类目: Machine Learning (cs.LG)
*备注:  Accepted in NeurIPS 2025
Abstract:Continual learning (CL) aims to incrementally train a model on a sequence of tasks while retaining performance on prior ones. However, storing and replaying data is often infeasible due to privacy or security constraints and impractical for arbitrary pre-trained models. Data-free CL seeks to update models without access to previous data. Beyond regularization, we employ model inversion to synthesize data from the trained model, enabling replay without storing samples. Yet, model inversion in predictive models faces two challenges: (1) generating inputs solely from compressed output labels causes drift between synthetic and real data, and replaying such data can erode prior knowledge; (2) inversion is computationally expensive since each step backpropagates through the full model. These issues are amplified in large pre-trained models such as CLIP. To improve efficiency, we propose Per-layer Model Inversion (PMI), inspired by faster convergence in single-layer optimization. PMI provides strong initialization for full-model inversion, substantially reducing iterations. To mitigate feature shift, we model class-wise features via Gaussian distributions and contrastive model, ensuring alignment between synthetic and real features. Combining PMI and feature modeling, our approach enables continual learning of new classes by generating pseudo-images from semantic-aware projected features, achieving strong effectiveness and compatibility across multiple CL settings.
[LG-35] A Survey of Heterogeneous Graph Neural Networks for Cybersecurity Anomaly Detection
链接: https://arxiv.org/abs/2510.26307
作者: Laura Jiang,Reza Ryan,Qian Li,Nasim Ferdosian
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:  37 pages, 4 figures, 86 references. Submitted to Journal of Computer Security (under review)
Abstract:Anomaly detection is a critical task in cybersecurity, where identifying insider threats, access violations, and coordinated attacks is essential for ensuring system resilience. Graph-based approaches have become increasingly important for modeling entity interactions, yet most rely on homogeneous and static structures, which limits their ability to capture the heterogeneity and temporal evolution of real-world environments. Heterogeneous Graph Neural Networks (HGNNs) have emerged as a promising paradigm for anomaly detection by incorporating type-aware transformations and relation-sensitive aggregation, enabling more expressive modeling of complex cyber data. However, current research on HGNN-based anomaly detection remains fragmented, with diverse modeling strategies, limited comparative evaluation, and an absence of standardized benchmarks. To address this gap, we provide a comprehensive survey of HGNN-based anomaly detection methods in cybersecurity. We introduce a taxonomy that classifies approaches by anomaly type and graph dynamics, analyze representative models, and map them to key cybersecurity applications. We also review commonly used benchmark datasets and evaluation metrics, highlighting their strengths and limitations. Finally, we identify key open challenges related to modeling, data, and deployment, and outline promising directions for future research. This survey aims to establish a structured foundation for advancing HGNN-based anomaly detection toward scalable, interpretable, and practically deployable solutions.
[LG-36] Offline Clustering of Preference Learning with Active-data Augmentation
链接: https://arxiv.org/abs/2510.26301
作者: Jingyuan Liu,Fatemeh Ghaffari,Xuchuang Wang,Mohammad Hajiesmaili,Carlee Joe-Wong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C^2PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of imbalanced data, we extend our framework to the setting with active-data augmentation where the learner is allowed to select a limited number of additional active-data for the test user based on the cluster structure learned by Off-C^2PL. In this setting, our second algorithm, A^2-Off-C^2PL, actively selects samples that target the least-informative dimensions of the test user’s preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
[LG-37] Empirical Bayesian Multi-Bandit Learning
链接: https://arxiv.org/abs/2510.26284
作者: Xia Jiang,Rong J.B. Zhu
类目: Machine Learning (cs.LG)
*备注:  33 pages, 13 figures
Abstract:Multi-task learning in contextual bandits has attracted significant research interest due to its potential to enhance decision-making across multiple related tasks by leveraging shared structures and task-specific heterogeneity. In this article, we propose a novel hierarchical Bayesian framework for learning in various bandit instances. This framework captures both the heterogeneity and the correlations among different bandit instances through a hierarchical Bayesian model, enabling effective information sharing while accommodating instance-specific variations. Unlike previous methods that overlook the learning of the covariance structure across bandits, we introduce an empirical Bayesian approach to estimate the covariance matrix of the prior distribution. This enhances both the practicality and flexibility of learning across multi-bandits. Building on this approach, we develop two efficient algorithms: ebmTS (Empirical Bayesian Multi-Bandit Thompson Sampling) and ebmUCB (Empirical Bayesian Multi-Bandit Upper Confidence Bound), both of which incorporate the estimated prior into the decision-making process. We provide the frequentist regret upper bounds for the proposed algorithms, thereby filling a research gap in the field of multi-bandit problems. Extensive experiments on both synthetic and real-world datasets demonstrate the superior performance of our algorithms, particularly in complex environments. Our methods achieve lower cumulative regret compared to existing techniques, highlighting their effectiveness in balancing exploration and exploitation across multi-bandits.
[LG-38] Likely Interpolants of Generative Models
链接: https://arxiv.org/abs/2510.26266
作者: Frederik Möbius Rygaard,Shen Zhu,Yinzhu Jin,Søren Hauberg,Tom Fletcher
类目: Machine Learning (cs.LG)
*备注:
Abstract:Interpolation in generative models allows for controlled generation, model inspection, and more. Unfortunately, most generative models lack a principled notion of interpolants without restrictive assumptions on either the model or data dimension. In this paper, we develop a general interpolation scheme that targets likely transition paths compatible with different metrics and probability distributions. We consider interpolants analogous to a geodesic constrained to a suitable data distribution and derive a novel algorithm for computing these curves, which requires no additional training. Theoretically, we show that our method locally can be considered as a geodesic under a suitable Riemannian metric. We quantitatively show that our interpolation scheme traverses higher density regions than baselines across a range of models and datasets.
[LG-39] A Game-Theoretic Spatio-Temporal Reinforcement Learning Framework for Collaborative Public Resource Allocation
链接: https://arxiv.org/abs/2510.26184
作者: Songxin Lei,Qiongyan Wang,Yanchen Zhu,Hanyu Yao,Sijie Ruan,Weilin Ruan,Yuyu Luo,Huaming Wu,Yuxuan Liang
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Public resource allocation involves the efficient distribution of resources, including urban infrastructure, energy, and transportation, to effectively meet societal demands. However, existing methods focus on optimizing the movement of individual resources independently, without considering their capacity constraints. To address this limitation, we propose a novel and more practical problem: Collaborative Public Resource Allocation (CPRA), which explicitly incorporates capacity constraints and spatio-temporal dynamics in real-world scenarios. We propose a new framework called Game-Theoretic Spatio-Temporal Reinforcement Learning (GSTRL) for solving CPRA. Our contributions are twofold: 1) We formulate the CPRA problem as a potential game and demonstrate that there is no gap between the potential function and the optimal target, laying a solid theoretical foundation for approximating the Nash equilibrium of this NP-hard problem; and 2) Our designed GSTRL framework effectively captures the spatio-temporal dynamics of the overall system. We evaluate GSTRL on two real-world datasets, where experiments show its superior performance. Our source codes are available in the supplementary materials.
[LG-40] STAR: A Privacy-Preserving Energy-Efficient Edge AI Framework for Human Activity Recognition via Wi-Fi CSI in Mobile and Pervasive Computing Environments
链接: https://arxiv.org/abs/2510.26148
作者: Kexing Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Human Activity Recognition (HAR) via Wi-Fi Channel State Information (CSI) presents a privacy-preserving, contactless sensing approach suitable for smart homes, healthcare monitoring, and mobile IoT systems. However, existing methods often encounter computational inefficiency, high latency, and limited feasibility within resource-constrained, embedded mobile edge environments. This paper proposes STAR (Sensing Technology for Activity Recognition), an edge-AI-optimized framework that integrates a lightweight neural architecture, adaptive signal processing, and hardware-aware co-optimization to enable real-time, energy-efficient HAR on low-power embedded devices. STAR incorporates a streamlined Gated Recurrent Unit (GRU)-based recurrent neural network, reducing model parameters by 33% compared to conventional LSTM models while maintaining effective temporal modeling capability. A multi-stage pre-processing pipeline combining median filtering, 8th-order Butterworth low-pass filtering, and Empirical Mode Decomposition (EMD) is employed to denoise CSI amplitude data and extract spatial-temporal features. For on-device deployment, STAR is implemented on a Rockchip RV1126 processor equipped with an embedded Neural Processing Unit (NPU), interfaced with an ESP32-S3-based CSI acquisition module. Experimental results demonstrate a mean recognition accuracy of 93.52% across seven activity classes and 99.11% for human presence detection, utilizing a compact 97.6k-parameter model. INT8 quantized inference achieves a processing speed of 33 MHz with just 8% CPU utilization, delivering sixfold speed improvements over CPU-based execution. With sub-second response latency and low power consumption, the system ensures real-time, privacy-preserving HAR, offering a practical, scalable solution for mobile and pervasive computing environments.
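An illustrative scipy version of the first two denoising stages described above (median filtering, then an 8th-order Butterworth low-pass); the EMD stage needs an extra package such as PyEMD and is omitted here, and the sampling rate and cutoff frequency are assumptions.

```python
# Sketch of the median-filter + Butterworth low-pass stages of the CSI pipeline.
import numpy as np
from scipy.signal import medfilt, butter, filtfilt

fs = 100.0                                  # assumed CSI sampling rate (Hz)
t = np.arange(0, 5, 1 / fs)
csi = np.sin(2 * np.pi * 2 * t) + 0.5 * np.random.randn(t.size)  # toy CSI amplitude

x = medfilt(csi, kernel_size=5)             # spike/outlier removal
b, a = butter(N=8, Wn=10.0 / (fs / 2))      # 8th-order low-pass, assumed 10 Hz cutoff
x = filtfilt(b, a, x)                       # zero-phase filtering
print(x[:5])
```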
[LG-41] maxVSTAR: Maximally Adaptive Vision-Guided CSI Sensing with Closed-Loop Edge Model Adaptation for Robust Human Activity Recognition
链接: https://arxiv.org/abs/2510.26146
作者: Kexing Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:WiFi Channel State Information (CSI)-based human activity recognition (HAR) provides a privacy-preserving, device-free sensing solution for smart environments. However, its deployment on edge devices is severely constrained by domain shift, where recognition performance deteriorates under varying environmental and hardware conditions. This study presents maxVSTAR (maximally adaptive Vision-guided Sensing Technology for Activity Recognition), a closed-loop, vision-guided model adaptation framework that autonomously mitigates domain shift for edge-deployed CSI sensing systems. The proposed system integrates a cross-modal teacher-student architecture, where a high-accuracy YOLO-based vision model serves as a dynamic supervisory signal, delivering real-time activity labels for the CSI data stream. These labels enable autonomous, online fine-tuning of a lightweight CSI-based HAR model, termed Sensing Technology for Activity Recognition (STAR), directly at the edge. This closed-loop retraining mechanism allows STAR to continuously adapt to environmental changes without manual intervention. Extensive experiments demonstrate the effectiveness of maxVSTAR. When deployed on uncalibrated hardware, the baseline STAR model’s recognition accuracy declined from 93.52% to 49.14%. Following a single vision-guided adaptation cycle, maxVSTAR restored the accuracy to 81.51%. These results confirm the system’s capacity for dynamic, self-supervised model adaptation in privacy-conscious IoT environments, establishing a scalable and practical paradigm for long-term autonomous HAR using CSI sensing at the network edge.
[LG-42] Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error
链接: https://arxiv.org/abs/2510.26109
作者: Chenming Tang,Hsiu-Yuan Huang,Weijie Liu,Saiyong Yang,Yunfang Wu
类目: Machine Learning (cs.LG)
*备注:  Work in progress
Abstract:Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of large language models (LLMs) recently. However, existing RLVR approaches merely train LLMs based on their own generated responses and are constrained by the initial capability of LLMs, thus prone to exploration stagnation, in which LLMs fail to solve more training problems and cannot further learn from the training data. Some work tries to address this by leveraging off-policy solutions to training problems but requires external guidance from experts which suffers from limited availability. In this work, we propose LTE (Learning to reason from Trial and Error), an approach hinting LLMs with their previously self-generated incorrect answers and the problem of overlong responses, which does not require any external expert guidance. Experiments validate the effectiveness of LTE, which outperforms the normal group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base. Further analysis confirms that LTE successfully mitigates the problem of exploration stagnation and enhances both exploitation and exploration during training.
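A minimal sketch of the hinting idea: fold previously generated incorrect answers into the next prompt so the model avoids repeating them. The wording below is an assumption, not the paper's template.

```python
# Illustrative construction of a trial-and-error hint prompt.
def build_lte_prompt(problem: str, wrong_answers: list[str]) -> str:
    hint = ""
    if wrong_answers:
        hint = ("Previous attempts produced these incorrect final answers: "
                + ", ".join(wrong_answers)
                + ". Avoid repeating them and keep the response concise.\n")
    return f"{hint}Problem: {problem}\nSolve step by step."

print(build_lte_prompt("Compute 17 * 23.", ["401", "381"]))
```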
[LG-43] ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models NEURIPS2025
链接: https://arxiv.org/abs/2510.26096
作者: Weifei Jin,Yuxin Cao,Junjie Su,Minhui Xue,Jie Hao,Ke Xu,Jin Song Dong,Derui Wang
类目: ound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:  Accepted to NeurIPS 2025
Abstract:Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically target ALMs, revealing that defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs) that serve as triggers that activate the safety shortcuts to safeguard ALMs at inference time. To better sift out effective triggers while preserving the model’s utility on benign tasks, we further propose Mel-Gradient Sparse Mask (M-GSM), which restricts perturbations to Mel-frequency bins that are sensitive to jailbreaks but insensitive to speech understanding. Both theoretical analyses and empirical results demonstrate the robustness of our method against both seen and unseen attacks. Overall, ALMGuard reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models, while maintaining comparable utility on benign benchmarks, establishing it as the new state of the art. Our code and data are available at this https URL.
[LG-44] LLM Bisect: Breaking Barriers in Bug Bisection with A Comparative Analysis Pipeline
链接: https://arxiv.org/abs/2510.26086
作者: Zheng Zhang,Haonan Li,Xingyu Li,Hang Zhang,Zhiyun Qian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bug bisection has been an important security task that aims to understand the range of software versions impacted by a bug, i.e., identifying the commit that introduced the bug. However, traditional patch-based bisection methods are faced with several significant barriers: For example, they assume that the bug-inducing commit (BIC) and the patch commit modify the same functions, which is not always true. They often rely solely on code changes, while the commit message frequently contains a wealth of vulnerability-related information. They are also based on simple heuristics (e.g., assuming the BIC initializes lines deleted in the patch) and lack any logical analysis of the vulnerability. In this paper, we make the observation that Large Language Models (LLMs) are well-positioned to break the barriers of existing solutions, e.g., comprehend both textual data and code in patches and commits. Unlike previous BIC identification approaches, which yield poor results, we propose a comprehensive multi-stage pipeline that leverages LLMs to: (1) fully utilize patch information, (2) compare multiple candidate commits in context, and (3) progressively narrow down the candidates through a series of down-selection steps. In our evaluation, we demonstrate that our approach achieves significantly better accuracy than the state-of-the-art solution by more than 38%. Our results further confirm that the comprehensive multi-stage pipeline is essential, as it improves accuracy by 60% over a baseline LLM-based bisection method.
[LG-45] New Money: A Systematic Review of Synthetic Data Generation for Finance
链接: https://arxiv.org/abs/2510.26076
作者: James Meldrum,Basem Suleiman,Fethi Rabhi,Muhammad Johan Alibasa
类目: Machine Learning (cs.LG)
*备注:  37 pages, 5 figures, 21 tables
Abstract:Synthetic data generation has emerged as a promising approach to address the challenges of using sensitive financial data in machine learning applications. By leveraging generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), it is possible to create artificial datasets that preserve the statistical properties of real financial records while mitigating privacy risks and regulatory constraints. Despite the rapid growth of this field, a comprehensive synthesis of the current research landscape has been lacking. This systematic review consolidates and analyses 72 studies published since 2018 that focus on synthetic financial data generation. We categorise the types of financial information synthesised, the generative methods employed, and the evaluation strategies used to assess data utility and privacy. The findings indicate that GAN-based approaches dominate the literature, particularly for generating time-series market data and tabular credit data. While several innovative techniques demonstrate potential for improved realism and privacy preservation, there remains a notable lack of rigorous evaluation of privacy safeguards across studies. By providing an integrated overview of generative techniques, applications, and evaluation methods, this review highlights critical research gaps and offers guidance for future work aimed at developing robust, privacy-preserving synthetic data solutions for the financial domain.
[LG-46] owards Scaling Laws for Symbolic Regression NEURIPS2025
链接: https://arxiv.org/abs/2510.26064
作者: David Otte,Jörg K.H. Franke,Frank Hutter
类目: Machine Learning (cs.LG)
*备注:  Accepted at the NeurIPS 2025 Math-AI Workshop
Abstract:Symbolic regression (SR) aims to discover the underlying mathematical expressions that explain observed data. This holds promise for both gaining scientific insight and for producing inherently interpretable and generalizable models for tabular data. In this work we focus on the basics of SR. Deep learning-based SR has recently become competitive with genetic programming approaches, but the role of scale has remained largely unexplored. Inspired by scaling laws in language modeling, we present the first systematic investigation of scaling in SR, using a scalable end-to-end transformer pipeline and carefully generated training data. Across five different model sizes and spanning three orders of magnitude in compute, we find that both validation loss and solved rate follow clear power-law trends with compute. We further identify compute-optimal hyperparameter scaling: optimal batch size and learning rate grow with model size, and a token-to-parameter ratio of approximately 15 is optimal in our regime, with a slight upward trend as compute increases. These results demonstrate that SR performance is largely predictable from compute and offer important insights for training the next generation of SR models.
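The power-law trends reported here are typically obtained by a straight-line fit in log-log space; the following numpy sketch fits loss ~ a * C^(-b) on synthetic data for illustration.

```python
# Sketch of a standard power-law scaling fit via log-log linear regression.
import numpy as np

compute = np.array([1e15, 1e16, 1e17, 1e18])            # training FLOPs (synthetic)
loss = 3.0 * compute ** -0.08 + 0.02 * np.random.rand(4)  # synthetic loss curve

slope, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"loss ~ {np.exp(log_a):.3f} * C^({slope:.3f})")    # slope ~ -b
```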
[LG-47] Accelerating Real-World Overtaking in F1TENTH Racing Employing Reinforcement Learning Methods
链接: https://arxiv.org/abs/2510.26040
作者: Emily Steiner,Daniel van der Spuy,Futian Zhou,Afereti Pama,Minas Liarokapis,Henry Williams
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:While autonomous racing performance in Time-Trial scenarios has seen significant progress and development, autonomous wheel-to-wheel racing and overtaking are still severely limited. These limitations are particularly apparent in real-life driving scenarios where state-of-the-art algorithms struggle to safely or reliably complete overtaking manoeuvres. This is important, as reliable navigation around other vehicles is vital for safe autonomous wheel-to-wheel racing. The F1Tenth Competition provides a useful opportunity for developing wheel-to-wheel racing algorithms on a standardised physical platform. The competition format makes it possible to evaluate overtaking and wheel-to-wheel racing algorithms against the state-of-the-art. This research presents a novel racing and overtaking agent capable of learning to reliably navigate a track and overtake opponents in both simulation and reality. The agent was deployed on an F1Tenth vehicle and competed against opponents running varying competitive algorithms in the real world. The results demonstrate that the agent’s training against opponents enables deliberate overtaking behaviours, with an overtaking rate of 87% compared to 56% for an agent trained just to race.
[LG-48] Exploring Human-AI Conceptual Alignment through the Prism of Chess
链接: https://arxiv.org/abs/2510.26025
作者: Semyon Lomaso,Judah Goldfeder,Mehmet Hamza Erol,Matthew So,Yao Yan,Addison Howard,Nathan Kutz,Ravid Shwartz Ziv
类目: Machine Learning (cs.LG)
*备注:
Abstract:Do AI systems truly understand human concepts or merely mimic surface patterns? We investigate this through chess, where human creativity meets precise strategic concepts. Analyzing a 270M-parameter transformer that achieves grandmaster-level play, we uncover a striking paradox: while early layers encode human concepts like center control and knight outposts with up to 85% accuracy, deeper layers, despite driving superior performance, drift toward alien representations, dropping to 50-65% accuracy. To test conceptual robustness beyond memorization, we introduce the first Chess960 dataset: 240 expert-annotated positions across 6 strategic concepts. When opening theory is eliminated through randomized starting positions, concept recognition drops 10-20% across all methods, revealing the model’s reliance on memorized patterns rather than abstract understanding. Our layer-wise analysis exposes a fundamental tension in current architectures: the representations that win games diverge from those that align with human thinking. These findings suggest that as AI systems optimize for performance, they develop increasingly alien intelligence, a critical challenge for creative AI applications requiring genuine human-AI collaboration. Dataset and code are available at: this https URL.
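Concept accuracy across layers is usually measured with linear probes; below is a generic sketch (synthetic activations, not the chess model's) fitting one logistic-regression probe per layer and comparing test accuracies across depth.

```python
# Generic layer-wise linear probing sketch with synthetic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_positions, hidden = 500, 64
labels = rng.integers(0, 2, n_positions)            # e.g., "center control" yes/no

for layer in range(3):                              # one probe per layer
    # Synthetic activations with depth-dependent signal strength.
    acts = rng.standard_normal((n_positions, hidden)) + layer * 0.1 * labels[:, None]
    Xtr, Xte, ytr, yte = train_test_split(acts, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(f"layer {layer}: probe accuracy = {probe.score(Xte, yte):.2f}")
```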
[LG-49] Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry
链接: https://arxiv.org/abs/2510.26008
作者: Ziji Chen,Steven Chien,Peng Qian,Noa Zilberman
类目: Performance (cs.PF); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:  12 pages, 9 figures, submitted to nsdi 26
Abstract:Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platforms as a service use virtualization, which means operators have little insight into the users’ workloads. This hinders resource optimizations by the operator, which is essential to ensure cost efficiency and minimize execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose System-X, which takes a hardware-centric approach, relying only on hardware signals – fully accessible by operators. Using low-level signals collected from the system, System-X detects anomalies through an unsupervised learning pipeline. The pipeline is developed by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using System-X, we successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.
[LG-50] Infrequent Exploration in Linear Bandits NEURIPS2025
链接: https://arxiv.org/abs/2510.26000
作者: Harin Lee,Min-hwan Oh
类目: Machine Learning (cs.LG)
*备注:  NeurIPS 2025 camera-ready version
Abstract:We study the problem of infrequent exploration in linear bandits, addressing a significant yet overlooked gap between fully adaptive exploratory methods (e.g., UCB and Thompson Sampling), which explore potentially at every time step, and purely greedy approaches, which require stringent diversity assumptions to succeed. Continuous exploration can be impractical or unethical in safety-critical or costly domains, while purely greedy strategies typically fail without adequate contextual diversity. To bridge these extremes, we introduce a simple and practical framework, INFEX, explicitly designed for infrequent exploration. INFEX executes a base exploratory policy according to a given schedule while predominantly choosing greedy actions in between. Despite its simplicity, our theoretical analysis demonstrates that INFEX achieves instance-dependent regret matching standard provably efficient algorithms, provided the exploration frequency exceeds a logarithmic threshold. Additionally, INFEX is a general, modular framework that allows seamless integration of any fully adaptive exploration method, enabling wide applicability and ease of adoption. By restricting intensive exploratory computations to infrequent intervals, our approach can also enhance computational efficiency. Empirical evaluations confirm our theoretical findings, showing state-of-the-art regret performance and runtime improvements over existing methods.
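A toy numpy sketch of the infrequent-exploration schedule on a 3-armed bandit: run a base exploratory policy only at sparse, log-spaced steps and act greedily otherwise. The schedule, bandit, and uniform base policy are illustrative assumptions, not the paper's algorithm.

```python
# Toy infrequent-exploration loop: scheduled exploration, greedy in between.
import numpy as np

rng = np.random.default_rng(0)
means, T = np.array([0.3, 0.5, 0.7]), 2000
counts, est = np.ones(3), np.zeros(3)

def is_exploration_step(t):
    return t < 10 or np.log2(t).is_integer()         # sparse, log-spaced exploration

for t in range(1, T + 1):
    if is_exploration_step(t):
        a = rng.integers(3)                           # base exploratory policy (uniform)
    else:
        a = int(np.argmax(est))                       # greedy action in between
    r = rng.normal(means[a], 0.1)
    counts[a] += 1
    est[a] += (r - est[a]) / counts[a]                # incremental mean update

print("estimated means:", np.round(est, 2))
```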
[LG-51] Efficient Online Learning with Predictive Coding Networks: Exploiting Temporal Correlations IROS
链接: https://arxiv.org/abs/2510.25993
作者: Darius Masoum Zadeh-Jousdani,Elvin Hajizada,Eyke Hüllermeier
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:  Accepted at EdgeAI4R Workshop, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
Abstract:Robotic systems operating at the edge require efficient online learning algorithms that can continuously adapt to changing environments while processing streaming sensory data. Traditional backpropagation, while effective, conflicts with biological plausibility principles and may be suboptimal for continuous adaptation scenarios. The Predictive Coding (PC) framework offers a biologically plausible alternative with local, Hebbian-like update rules, making it suitable for neuromorphic hardware implementation. However, PC’s main limitation is its computational overhead due to multiple inference iterations during training. We present Predictive Coding Network with Temporal Amortization (PCN-TA), which preserves latent states across temporal frames. By leveraging temporal correlations, PCN-TA significantly reduces computational demands while maintaining learning performance. Our experiments on the COIL-20 robotic perception dataset demonstrate that PCN-TA achieves 10% fewer weight updates compared to backpropagation and requires 50% fewer inference steps than baseline PC networks. These efficiency gains directly translate to reduced computational overhead for moving another step toward edge deployment and real-time adaptation support in resource-constrained robotic systems. The biologically-inspired nature of our approach also makes it a promising candidate for future neuromorphic hardware implementations, enabling efficient online learning at the edge.
[LG-52] A General and Streamlined Differentiable Optimization Framework
链接: https://arxiv.org/abs/2510.25986
作者: Andrew W. Rosemberg,Joaquim Dias Garcia,François Pacaud,Robert B. Parker,Benoît Legat,Kaarthik Sundar,Russell Bent,Pascal Van Hentenryck
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:  17 pages, 4 figures
Abstract:Differentiating through constrained optimization problems is increasingly central to learning, control, and large-scale decision-making systems, yet practical integration remains challenging due to solver specialization and interface mismatches. This paper presents a general and streamlined framework, an updated this http URL, that unifies modeling and differentiation within the Julia optimization stack. The framework computes forward - and reverse-mode solution and objective sensitivities for smooth, potentially nonconvex programs by differentiating the KKT system under standard regularity assumptions. A first-class, JuMP-native parameter-centric API allows users to declare named parameters and obtain derivatives directly with respect to them - even when a parameter appears in multiple constraints and objectives - eliminating brittle bookkeeping from coefficient-level interfaces. We illustrate these capabilities on convex and nonconvex models, including economic dispatch, mean-variance portfolio selection with conic risk constraints, and nonlinear robot inverse kinematics. Two companion studies further demonstrate impact at scale: gradient-based iterative methods for strategic bidding in energy markets and Sobolev-style training of end-to-end optimization proxies using solver-accurate sensitivities. Together, these results demonstrate that differentiable optimization can be deployed as a routine tool for experimentation, learning, calibration, and design-without deviating from standard JuMP modeling practices and while retaining access to a broad ecosystem of solvers.
[LG-53] Contrastive Predictive Coding Done Right for Mutual Information Estimation
链接: https://arxiv.org/abs/2510.25983
作者: J. Jon Ryu,Pavan Yeddanapudi,Xiangxiang Xu,Gregory W. Wornell
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:  26 pages, 5 figures
Abstract:The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as InfoNCE-anchor, for accurate MI estimation. Our modification introduces an auxiliary anchor class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias. Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed. This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and f-divergence variants, under a single principled framework. Empirically, we find that InfoNCE-anchor with the log score achieves the most accurate MI estimates; however, in self-supervised representation learning experiments, we find that the anchor does not improve the downstream task performance. These findings corroborate that contrastive representation learning benefits not from accurate MI estimation per se, but from the learning of structured density ratios.
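One well-known symptom of the problem the paper addresses can be reproduced in a few lines: even with the analytically optimal critic, the plain InfoNCE estimate cannot exceed the log of the batch size. A sketch, assuming a correlated Gaussian pair for which the true MI and the optimal critic are known in closed form (the anchor-based fix itself follows the paper and is not reproduced here):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
rho, N = 0.9999, 32
true_mi = -0.5 * np.log(1 - rho**2)          # ≈ 4.26 nats, above log N

x = rng.normal(size=N)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=N)

def critic(x, y):
    """Optimal critic log p(y|x) - log p(y) for this Gaussian pair."""
    s2 = 1 - rho**2
    return (-0.5 * (y - rho * x) ** 2 / s2 - 0.5 * np.log(s2)) + 0.5 * y**2

F = critic(x[:, None], y[None, :])            # F[i, j] = f(x_i, y_j)
denom = logsumexp(F, axis=1) - np.log(N)      # log of the in-batch average
infonce = np.mean(np.diag(F) - denom)

print(f"true MI = {true_mi:.2f} nats, InfoNCE = {infonce:.2f}, log N = {np.log(N):.2f}")
# Even with the optimal critic, the InfoNCE estimate is capped at log N.
```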
[LG-54] Risks and Opportunities in Human-Machine Teaming in Operationalizing Machine Learning Target Variables
链接: https://arxiv.org/abs/2510.25974
作者: Mengtian Guo,David Gotz,Yue Wang
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:  23 pages, 6 figures
Abstract:Predictive modeling has the potential to enhance human decision-making. However, many predictive models fail in practice due to problematic problem formulation in cases where the prediction target is an abstract concept or construct and practitioners need to define an appropriate target variable as a proxy to operationalize the construct of interest. The choice of an appropriate proxy target variable is rarely self-evident in practice, requiring both domain knowledge and iterative data modeling. This process is inherently collaborative, involving both domain experts and data scientists. In this work, we explore how human-machine teaming can support this process by accelerating iterations while preserving human judgment. We study the impact of two human-machine teaming strategies on proxy construction: 1) relevance-first: humans leading the process by selecting relevant proxies, and 2) performance-first: machines leading the process by recommending proxies based on predictive performance. Based on a controlled user study of a proxy construction task (N = 20), we show that the performance-first strategy facilitated faster iterations and decision-making, but also biased users towards well-performing proxies that are misaligned with the application goal. Our study highlights the opportunities and risks of human-machine teaming in operationalizing machine learning target variables, yielding insights for future research to explore the opportunities and mitigate the risks.
[LG-55] On the Dataless Training of Neural Networks
链接: https://arxiv.org/abs/2510.25962
作者: Alvaro Velasquez,Susmit Jha,Ismail R. Alkhouri
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper surveys studies on the use of neural networks for optimization in the training-data-free setting. Specifically, we examine the dataless application of neural network architectures in optimization by re-parameterizing problems using fully connected (or MLP), convolutional, graph, and quadratic neural networks. Although MLPs were used to solve linear programs a few decades ago, this approach has recently gained increasing attention due to its promising results across diverse applications, including those based on combinatorial optimization, inverse problems, and partial differential equations. The motivation for this setting stems from two key (possibly overlapping) factors: (i) data-driven learning approaches are still underdeveloped and have yet to demonstrate strong results, as seen in combinatorial optimization, and (ii) the availability of training data is inherently limited, such as in medical image reconstruction and other scientific applications. In this paper, we define the dataless setting and categorize it into two variants based on how a problem instance – defined by a single datum – is encoded onto the neural network: (i) architecture-agnostic methods and (ii) architecture-specific methods. Additionally, we discuss similarities and clarify distinctions between the dataless neural network (dNN) settings and related concepts such as zero-shot learning, one-shot learning, lifting in optimization, and over-parameterization.
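As a concrete instance of the architecture-agnostic variant, one can re-parameterize a combinatorial problem through trainable network outputs and optimize a relaxed objective by gradient ascent. Below is a hedged MaxCut sketch in PyTorch; the toy graph, the relaxation, and the trivial one-logit-per-node "network" are illustrative choices, not a specific method from the survey.

```python
import torch

# Toy graph (edge list) for MaxCut; the dataless "input" is the instance itself.
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0], [0, 2]])
n_nodes = 4

# Architecture-agnostic re-parameterization: one trainable logit per node.
torch.manual_seed(0)
theta = 0.1 * torch.randn(n_nodes)
theta.requires_grad_(True)
opt = torch.optim.Adam([theta], lr=0.1)

for step in range(300):
    p = torch.sigmoid(theta)                      # relaxed side assignment in [0, 1]
    pi, pj = p[edges[:, 0]], p[edges[:, 1]]
    expected_cut = (pi * (1 - pj) + pj * (1 - pi)).sum()
    loss = -expected_cut                          # maximize the relaxed cut value
    opt.zero_grad(); loss.backward(); opt.step()

assignment = (torch.sigmoid(theta) > 0.5).int()
print("cut assignment:", assignment.tolist())
```

Rounding the converged probabilities recovers a discrete cut; no training data is involved because the single problem instance is encoded directly in the objective.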
[LG-56] Modular Linear Tokenization (MLT)
链接: https://arxiv.org/abs/2510.25952
作者: Tcharlies Schmitz
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces Modular Linear Tokenization (MLT), a reversible and deterministic technique for encoding high-cardinality categorical identifiers into compact numerical vectors. Unlike traditional hashing or one-hot encodings, MLT preserves bijective mappings by leveraging modular arithmetic over finite fields and invertible linear transformations. The method offers explicit control of dimensionality and computational scalability while maintaining full reversibility, even for millions of identifiers. Experimental results on the MovieLens 20M dataset show that MLT achieves comparable predictive performance to supervised embeddings while requiring significantly fewer parameters and lower training cost. An open-source implementation of MLT is available on PyPI (this https URL) and GitHub (this https URL).
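A minimal sketch of the idea as described in the abstract, assuming a base-p digit expansion followed by an invertible linear map over GF(p); the paper's exact construction, dimensionality, and scaling may differ.

```python
# Encode an integer ID reversibly: base-p digits, then an invertible
# linear transformation modulo the prime p (2-D toy example).
P = 251                            # prime modulus for GF(p)
M     = [[2, 1], [1, 1]]           # invertible over GF(p): det = 1
M_INV = [[1, P - 1], [P - 1, 2]]   # inverse of M modulo p (hand-verified)

def matvec_mod(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) % P for row in mat]

def encode(identifier: int):
    digits = [identifier % P, (identifier // P) % P]   # base-p digits
    return matvec_mod(M, digits)

def decode(code):
    digits = matvec_mod(M_INV, code)
    return digits[0] + P * digits[1]

uid = 40321                        # any ID below P**2 = 63001 round-trips exactly
code = encode(uid)
assert decode(code) == uid         # bijective: full reversibility
print(uid, "->", code, "->", decode(code))
```

Because every step is exact modular arithmetic, the mapping is deterministic and bijective, which is what distinguishes this style of encoding from hashing or one-hot compression.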
[LG-57] Robust GNN Watermarking via Implicit Perception of Topological Invariants
链接: https://arxiv.org/abs/2510.25934
作者: Jipeng Li,Yanning Shen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Graph Neural Networks (GNNs) are valuable intellectual property, yet many watermarks rely on backdoor triggers that break under common model edits and create ownership ambiguity. We present InvGNN-WM, which ties ownership to a model’s implicit perception of a graph invariant, enabling trigger-free, black-box verification with negligible task impact. A lightweight head predicts normalized algebraic connectivity on an owner-private carrier set; a sign-sensitive decoder outputs bits, and a calibrated threshold controls the false-positive rate. Across diverse node and graph classification datasets and backbones, InvGNN-WM matches clean accuracy while yielding higher watermark accuracy than trigger- and compression-based baselines. It remains strong under unstructured pruning, fine-tuning, and post-training quantization; plain knowledge distillation (KD) weakens the mark, while KD with a watermark loss (KD+WM) restores it. We provide guarantees for imperceptibility and robustness, and we prove that exact removal is NP-complete.
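The graph invariant at the heart of the scheme, normalized algebraic connectivity, is the second-smallest eigenvalue of the symmetric normalized Laplacian and is straightforward to compute. A small numpy sketch (the watermark head, decoder, and calibrated threshold from the paper are not reproduced):

```python
import numpy as np

def normalized_algebraic_connectivity(adj: np.ndarray) -> float:
    """Second-smallest eigenvalue of I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = deg ** -0.5               # assumes no isolated nodes
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(lap)      # sorted ascending
    return float(eigvals[1])

# Path graph on 4 nodes as a toy stand-in for an owner-private carrier graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(f"normalized algebraic connectivity: {normalized_algebraic_connectivity(A):.4f}")
```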
[LG-58] Active Learning with Task-Driven Representations for Messy Pools
链接: https://arxiv.org/abs/2510.25926
作者: Kianoosh Ashouritaklimi,Tom Rainforth
类目: Machine Learning (cs.LG)
*备注:
Abstract:Active learning has the potential to be especially useful for messy, uncurated pools where datapoints vary in relevance to the target task. However, state-of-the-art approaches to this problem currently rely on using fixed, unsupervised representations of the pool, focusing on modifying the acquisition function instead. We show that this model setup can undermine their effectiveness at dealing with messy pools, as such representations can fail to capture important information relevant to the task. To address this, we propose using task-driven representations that are periodically updated during the active learning process using the previously collected labels. We introduce two specific strategies for learning these representations, one based on directly learning semi-supervised representations and the other based on supervised fine-tuning of an initial unsupervised representation. We find that both significantly improve empirical performance over using unsupervised or pretrained representations.
[LG-59] Topology-Aware Active Learning on Graphs
链接: https://arxiv.org/abs/2510.25892
作者: Harris Hardiman-Mostow,Jack Mauro,Adrien Weihs,Andrea L. Bertozzi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose a graph-topological approach to active learning that directly targets the core challenge of exploration versus exploitation under scarce label budgets. To guide exploration, we introduce a coreset construction algorithm based on Balanced Forman Curvature (BFC), which selects representative initial labels that reflect the graph’s cluster structure. This method includes a data-driven stopping criterion that signals when the graph has been sufficiently explored. We further use BFC to dynamically trigger the shift from exploration to exploitation within active learning routines, replacing hand-tuned heuristics. To improve exploitation, we introduce a localized graph rewiring strategy that efficiently incorporates multiscale information around labeled nodes, enhancing label propagation while preserving sparsity. Experiments on benchmark classification tasks show that our methods consistently outperform existing graph-based semi-supervised baselines at low label rates.
[LG-60] π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
链接: https://arxiv.org/abs/2510.25889
作者: Kang Chen,Zhihao Liu,Tonghe Zhang,Zhen Guo,Si Xu,Hao Lin,Hongzhi Zang,Quanlu Zhang,Zhaofei Yu,Guoliang Fan,Tiejun Huang,Yu Wang,Chao Yu
类目: Machine Learning (cs.LG)
*备注:  Preprint, work in progress. 24 pages
Abstract:Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., π_0, π_0.5) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with π_RL, an open-source framework for training flow-based VLAs in parallel simulation. π_RL implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate π_RL on LIBERO and ManiSkill benchmarks. On LIBERO, π_RL boosts few-shot SFT models π_0 and π_0.5 from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train π_RL in 320 parallel environments, improving π_0 from 41.6% to 85.7% and π_0.5 from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, π_RL achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
[LG-61] MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs
链接: https://arxiv.org/abs/2510.25867
作者: Xiaoke Huang,Ningsen Wang,Hui Liu,Xianfeng Tang,Yuyin Zhou
类目: Machine Learning (cs.LG)
*备注:  Project page, code, data, and models: this https URL
Abstract:Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA, outperforming strong medical LMMs. Ablations verify that both generation and verification are necessary and that more verified data consistently helps, and a targeted contamination analysis detects no leakage from evaluation suites. By operating entirely on open literature and open-weight models, MedVLSynther offers an auditable, reproducible, and privacy-preserving path to scalable medical VQA training data.
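The "essential gates" stage can be pictured as a hard rule-based filter applied before any fine-grained scoring. A hedged sketch with illustrative field names and checks; the paper's machine-checkable JSON schema and clinical-validity gate are more involved than this.

```python
# Illustrative verifier gates; field names and rules are assumptions,
# not the paper's exact schema.
def passes_essential_gates(item: dict) -> bool:
    required = {"stem", "options", "answer_index"}
    if not required.issubset(item):
        return False
    options = item["options"]
    if len(options) < 2 or len(set(options)) != len(options):
        return False                      # options must be parallel and distinct
    if not (0 <= item["answer_index"] < len(options)):
        return False                      # exactly one resolvable correct answer
    if "figure" in item["stem"].lower() and "image" not in item:
        return False                      # self-containment: no dangling references
    return True

candidate = {
    "stem": "Which imaging modality is shown?",
    "options": ["MRI", "CT", "Ultrasound", "X-ray"],
    "answer_index": 1,
}
print("accepted:", passes_essential_gates(candidate))
```

Only items surviving these hard gates proceed to the point-based scoring and penalty stages, which keeps clearly malformed questions out of the training corpus cheaply.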
[LG-62] Debate2Create: Robot Co-design via Large Language Model Debates
链接: https://arxiv.org/abs/2510.25850
作者: Kevin Qiu,Marek Cygan
类目: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Automating the co-design of a robot’s morphology and control is a long-standing challenge due to the vast design space and the tight coupling between body and behavior. We introduce Debate2Create (D2C), a framework in which large language model (LLM) agents engage in a structured dialectical debate to jointly optimize a robot’s design and its reward function. In each round, a design agent proposes targeted morphological modifications, and a control agent devises a reward function tailored to exploit the new design. A panel of pluralistic judges then evaluates the design-control pair in simulation and provides feedback that guides the next round of debate. Through iterative debates, the agents progressively refine their proposals, producing increasingly effective robot designs. Notably, D2C yields diverse and specialized morphologies despite no explicit diversity objective. On a quadruped locomotion benchmark, D2C discovers designs that travel 73% farther than the default, demonstrating that structured LLM-based debate can serve as a powerful mechanism for emergent robot co-design. Our results suggest that multi-agent debate, when coupled with physics-grounded feedback, is a promising new paradigm for automated robot design.
[LG-63] Flex-GAD: Flexible Graph Anomaly Detection
链接: https://arxiv.org/abs/2510.25809
作者: Apu Chakraborty,Anshul Kumar,Gagan Raj Gupta
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:
Abstract:Detecting anomalous nodes in attributed networks, where each node is associated with both structural connections and descriptive attributes, is essential for identifying fraud, misinformation, and suspicious behavior in domains such as social networks, academic citation graphs, and e-commerce platforms. We propose Flex-GAD, a novel unsupervised framework for graph anomaly detection at the node level. Flex-GAD integrates two encoders to capture complementary aspects of graph data. The framework incorporates a novel community-based GCN encoder to model intra-community and inter-community information into node embeddings, thereby ensuring structural consistency, along with a standard attribute encoder. These diverse representations are fused using a self-attention-based representation fusion module, which enables adaptive weighting and effective integration of the encoded information. This fusion mechanism allows automatic emphasis of the most relevant node representation across different encoders. We evaluate Flex-GAD on seven real-world attributed graphs with varying sizes, node degrees, and attribute homogeneity. Flex-GAD achieves an average AUC improvement of 7.98% over the previously best-performing method, GAD-NR, demonstrating its effectiveness and flexibility across diverse graph structures. Moreover, it significantly reduces training time, running 102x faster per epoch than Anomaly DAE and 3x faster per epoch than GAD-NR on average across seven benchmark datasets.
[LG-64] PRESTO: Preimage-Informed Instruction Optimization for Prompting Black-Box LLMs NEURIPS2025
链接: https://arxiv.org/abs/2510.25808
作者: Jaewon Chu,Seunghun Lee,Hyunwoo J. Kim
类目: Machine Learning (cs.LG)
*备注:  Accepted to NeurIPS 2025
Abstract:Large language models (LLMs) have achieved remarkable success across diverse domains, due to their strong instruction-following capabilities. This has led to increasing interest in optimizing instructions for black-box LLMs, whose internal parameters are inaccessible but widely used due to their strong performance. To optimize instructions for black-box LLMs, recent methods employ white-box LLMs to generate candidate instructions from optimized soft prompts. However, white-box LLMs often map different soft prompts to the same instruction, leading to redundant queries. While previous studies regarded this many-to-one mapping as a structure that hinders optimization efficiency, we reinterpret it as a useful prior knowledge that can accelerate the optimization. To this end, we introduce PREimage-informed inSTruction Optimization (PRESTO), a novel framework that leverages the preimage structure of soft prompts for efficient optimization. PRESTO consists of three key components: (1) score sharing, which shares the evaluation score with all soft prompts in a preimage; (2) preimage-based initialization, which selects initial data points that maximize search space coverage using preimage information; and (3) score consistency regularization, which enforces prediction consistency within each preimage. By leveraging preimages, PRESTO achieves the effect of effectively obtaining 14 times more scored data under the same query budget, resulting in more efficient optimization. Experimental results on 33 instruction optimization tasks demonstrate the superior performance of PRESTO. Code is available at this https URL
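Score sharing is essentially memoization keyed on the decoded instruction: every soft prompt that maps to the same instruction string (i.e., lies in the same preimage) reuses one black-box evaluation. A minimal sketch with hypothetical `decode` and `evaluate` stand-ins; PRESTO's preimage-based initialization and consistency regularization are not shown.

```python
import random

# Hypothetical stand-ins: `decode` plays the role of the white-box LLM mapping
# a soft prompt to an instruction string; `evaluate` is one black-box LLM query.
score_cache = {}

def shared_score(soft_prompt, decode, evaluate):
    instruction = decode(soft_prompt)
    if instruction not in score_cache:        # first member of this preimage
        score_cache[instruction] = evaluate(instruction)
    return score_cache[instruction]           # shared by the whole preimage

random.seed(0)
decode = lambda z: f"instruction-{int(z) % 3}"   # many-to-one map: 3 preimages
evaluate = lambda s: random.random()             # stand-in for a black-box score
scores = [shared_score(z, decode, evaluate) for z in range(12)]
print(f"{len(scores)} soft prompts scored with only {len(score_cache)} queries")
```

This is how the many-to-one structure turns from a nuisance into leverage: the effective number of scored points grows without additional queries.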
[LG-65] Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training
链接: https://arxiv.org/abs/2510.25803
作者: Hong Wang,Haiyang Xin,Jie Wang,Xuanze Yang,Fei Zha,Huanshuo Dong,Yan Jiang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Pre-training has proven effective in addressing data scarcity and performance limitations in solving PDE problems with neural operators. However, challenges remain due to the heterogeneity of PDE datasets in equation types, which leads to high errors in mixed training. Additionally, dense pre-training models that scale parameters by increasing network width or depth incur significant inference costs. To tackle these challenges, we propose a novel Mixture-of-Experts Pre-training Operator Transformer (MoE-POT), a sparse-activated architecture that scales parameters efficiently while controlling inference costs. Specifically, our model adopts a layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, enabling the model to focus on equation-specific features. Meanwhile, we also integrate 2 shared experts, aiming to capture common properties of PDE and reduce redundancy among routed experts. The final output is computed as the weighted average of the results from all activated experts. We pre-train models with parameters from 30M to 0.5B on 6 public PDE datasets. Our model with 90M activated parameters achieves up to a 40% reduction in zero-shot error compared with existing models with 120M activated parameters. Additionally, we conduct interpretability analysis, showing that dataset types can be inferred from router-gating network decisions, which validates the rationality and effectiveness of the MoE architecture.
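The routing pattern described above (top-4 of 16 routed experts plus 2 always-active shared experts, combined by a weighted sum) can be sketched in PyTorch as follows. The expert architecture, dimensions, and naive per-sample dispatch loop are illustrative simplifications of the paper's operator-transformer layers.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sketch of top-k routing with shared experts (illustrative sizes)."""
    def __init__(self, dim=64, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        self.router = nn.Linear(dim, n_routed)
        make = lambda: nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.routed = nn.ModuleList([make() for _ in range(n_routed)])
        self.shared = nn.ModuleList([make() for _ in range(n_shared)])
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, dim)
        gate = self.router(x)                    # (batch, n_routed)
        w, idx = gate.topk(self.top_k, dim=-1)
        w = torch.softmax(w, dim=-1)             # weights over selected experts
        out = sum(e(x) for e in self.shared)     # shared experts: common structure
        for k in range(self.top_k):
            for b in range(x.size(0)):           # per-sample dispatch (slow but clear)
                out[b] = out[b] + w[b, k] * self.routed[int(idx[b, k])](x[b:b+1])[0]
        return out

x = torch.randn(3, 64)
print(SparseMoE()(x).shape)   # torch.Size([3, 64])
```

Only the selected experts run per token, which is what lets parameter count scale without a proportional increase in inference cost.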
[LG-66] Attention Augmented GNN RNN-Attention Models for Advanced Cybersecurity Intrusion Detection
链接: https://arxiv.org/abs/2510.25802
作者: Jayant Biradar,Smit Shah,Tanmay Naik
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we propose a novel hybrid deep learning architecture that synergistically combines Graph Neural Networks (GNNs), Recurrent Neural Networks (RNNs), and multi-head attention mechanisms to significantly enhance cybersecurity intrusion detection capabilities. By leveraging the comprehensive UNSW-NB15 dataset containing diverse network traffic patterns, our approach effectively captures both spatial dependencies through graph structural relationships and temporal dynamics through sequential analysis of network events. The integrated attention mechanism provides dual benefits of improved model interpretability and enhanced feature selection, enabling cybersecurity analysts to focus computational resources on high-impact security events, a critical requirement in modern real-time intrusion detection systems. Our extensive experimental evaluation demonstrates that the proposed hybrid model achieves superior performance compared to traditional machine learning approaches and standalone deep learning models across multiple evaluation metrics, including accuracy, precision, recall, and F1-score. The model achieves particularly strong performance in detecting sophisticated attack patterns such as Advanced Persistent Threats (APTs), Distributed Denial of Service (DDoS) attacks, and zero-day exploits, making it a promising solution for next-generation cybersecurity applications in complex network environments.
[LG-67] FreLE: Low-Frequency Spectral Bias in Neural Networks for Time-Series Tasks
链接: https://arxiv.org/abs/2510.25800
作者: Jialong Sun,Xinpeng Ling,Jiaxuan Zou,Jiawen Kang,Kejia Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The inherent autocorrelation of time series data presents an ongoing challenge to multivariate time series prediction. Recently, a widely adopted approach has been the incorporation of frequency domain information to assist in long-term prediction tasks. Many researchers have independently observed the spectral bias phenomenon in neural networks, where models tend to fit low-frequency signals before high-frequency ones. However, these observations have often been attributed to the specific architectures designed by the researchers, rather than recognizing the phenomenon as a universal characteristic across models. To unify the understanding of the spectral bias phenomenon in long-term time series prediction, we conducted extensive empirical experiments to measure spectral bias in existing mainstream models. Our findings reveal that virtually all models exhibit this phenomenon. To mitigate the impact of spectral bias, we propose the FreLE (Frequency Loss Enhancement) algorithm, which enhances model generalization through both explicit and implicit frequency regularization. FreLE is a plug-and-play loss-function module. Extensive experiments demonstrate the superior performance of FreLE. Code is available at this https URL.
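A frequency-aware loss of the general kind the abstract describes can be as simple as adding an FFT-domain penalty to the usual time-domain error, making spectral mismatch an explicit training signal. A hedged sketch, with the caveat that FreLE's actual explicit and implicit regularization terms are the paper's own:

```python
import torch

def frequency_enhanced_loss(pred, target, lam=0.5):
    """Time-domain MSE plus a penalty on errors in the (real) FFT spectrum.
    `lam` and the plain magnitude-squared penalty are illustrative choices."""
    time_loss = torch.mean((pred - target) ** 2)
    pred_spec = torch.fft.rfft(pred, dim=-1)
    tgt_spec = torch.fft.rfft(target, dim=-1)
    freq_loss = torch.mean((pred_spec - tgt_spec).abs() ** 2)
    return time_loss + lam * freq_loss

pred, target = torch.randn(8, 96), torch.randn(8, 96)   # (batch, horizon)
print(frequency_enhanced_loss(pred, target).item())
```

Because the penalty is a pure function of the model output, such a term can be dropped into any forecaster's training loop, which is what "plug-and-play" means here.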
[LG-68] Optimal Information Combining for Multi-Agent Systems Using Adaptive Bias Learning
链接: https://arxiv.org/abs/2510.25793
作者: Siavash M. Alamouti,Fay Arjomandi
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:  22 pages, 2 Figures, 62 equations, 47 references
Abstract:Modern multi-agent systems ranging from sensor networks monitoring critical infrastructure to crowdsourcing platforms aggregating human intelligence can suffer significant performance degradation due to systematic biases that vary with environmental conditions. Current approaches either ignore these biases, leading to suboptimal decisions, or require expensive calibration procedures that are often infeasible in practice. This performance gap has real consequences: inaccurate environmental monitoring, unreliable financial predictions, and flawed aggregation of human judgments. This paper addresses the fundamental question: when can we learn and correct for these unknown biases to recover near-optimal performance, and when is such learning futile? We develop a theoretical framework that decomposes biases into learnable systematic components and irreducible stochastic components, introducing the concept of the learnability ratio as the fraction of bias variance predictable from observable covariates. This ratio determines whether bias learning is worthwhile for a given system. We prove that the achievable performance improvement is fundamentally bounded by this learnability ratio, providing system designers with quantitative guidance on when to invest in bias learning versus simpler approaches. We present the Adaptive Bias Learning and Optimal Combining (ABLOC) algorithm, which iteratively learns bias-correcting transformations while optimizing combination weights through closed-form solutions, guaranteeing convergence to these theoretical bounds. Experimental validation demonstrates that systems with high learnability ratios can recover significant performance (we achieved 40%-70% of the theoretical maximum improvement in our examples), while those with low learnability show minimal benefit, validating our diagnostic criteria for practical deployment decisions.
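The learnability ratio, the fraction of bias variance predictable from observable covariates, can be estimated by regressing observed biases on those covariates and measuring the explained variance. A toy numpy sketch under an assumed linear-in-features model (the covariates, features, and simulation are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated agent bias = systematic part g(covariates) + irreducible noise.
covariates = rng.normal(size=(n, 2))                 # e.g., temperature, humidity
systematic = 0.8 * covariates[:, 0] - 0.5 * covariates[:, 1] ** 2
bias = systematic + 0.6 * rng.normal(size=n)

# Estimate the learnable component with a regression on observable covariates.
X = np.column_stack([covariates[:, 0], covariates[:, 1] ** 2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, bias, rcond=None)
residual = bias - X @ coef

learnability_ratio = 1 - residual.var() / bias.var()  # explained bias variance
print(f"estimated learnability ratio: {learnability_ratio:.2f}")
```

A ratio near 1 says calibration against covariates is worth the investment; a ratio near 0 says the bias is mostly irreducible noise and simpler combining rules should be preferred.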
[LG-69] SHA-256 Infused Embedding-Driven Generative Modeling of High-Energy Molecules in Low-Data Regimes
链接: https://arxiv.org/abs/2510.25788
作者: Siddharth Verma,Alankar Alankar
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:High-energy materials (HEMs) are critical for propulsion and defense domains, yet their discovery remains constrained by limited experimental data and restricted access to testing facilities. This work presents a novel approach toward high-energy molecules by combining Long Short-Term Memory (LSTM) networks for molecular generation and attentive Graph Neural Networks (GNNs) for property prediction. We propose a transformative embedding-space construction strategy that integrates fixed SHA-256 embeddings with partially trainable representations. Unlike conventional regularization techniques, this changes the representational basis itself, reshaping the molecular input space before learning begins. Without recourse to pretraining, the generator achieves 67.5% validity and 37.5% novelty. The generated library exhibits a mean Tanimoto coefficient of 0.214 relative to the training set, signifying the framework's ability to generate a diverse chemical space. We identified 37 new super explosives with predicted detonation velocities above 9 km/s.
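The fixed half of the embedding space is easy to illustrate: hashing a token with SHA-256 yields a deterministic digest whose bytes can be mapped to a constant vector. A sketch, where the byte-to-vector scaling is an illustrative assumption (the paper pairs such fixed embeddings with partially trainable ones):

```python
import hashlib
import numpy as np

def sha256_embedding(token: str, dim: int = 32) -> np.ndarray:
    """Deterministic, non-trainable embedding derived from a SHA-256 digest.
    The scaling of digest bytes to [-1, 1] is an illustrative choice."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()   # 32 bytes
    vec = np.frombuffer(digest, dtype=np.uint8).astype(np.float32)
    return (vec[:dim] - 127.5) / 127.5

print(sha256_embedding("C")[:4])   # the same token always yields the same vector
print(sha256_embedding("N")[:4])
```

Because the hash is fixed before training, these coordinates act as a stable, collision-resistant scaffold that the trainable portion of the embedding can then adapt around.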
[LG-70] Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
链接: https://arxiv.org/abs/2510.25542
作者: Yuan Cheng,Yu Huang,Zhe Xiong,Yingbin Liang,Vincent Y. F. Tan
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) – which involve multiple parents per node – remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the f-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a K-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the f-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.
[LG-71] Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders NEURIPS2025
链接: https://arxiv.org/abs/2510.23802
作者: Nathan Paek,Yongyi Zang,Qihui Yang,Randal Leistikow
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:  Accepted to NeurIPS 2025 Mechanistic Interpretability Workshop
Abstract:While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. While our work addresses only the audio modality, our framework can be extended to interpretable analysis of visual latent-space generative models.
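A Top-K SAE of the kind used here keeps only the k largest post-ReLU activations per example, forcing a sparse feature code that a linear decoder must reconstruct from. A generic PyTorch sketch; the dictionary size, latent dimension, and k are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder over latent vectors."""
    def __init__(self, d_latent=128, d_dict=1024, k=16):
        super().__init__()
        self.enc = nn.Linear(d_latent, d_dict)
        self.dec = nn.Linear(d_dict, d_latent)
        self.k = k

    def forward(self, x):
        acts = torch.relu(self.enc(x))
        topk = acts.topk(self.k, dim=-1)          # keep only the k largest features
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.dec(sparse), sparse

sae = TopKSAE()
x = torch.randn(4, 128)                           # e.g., audio autoencoder latents
recon, features = sae(x)
loss = torch.mean((recon - x) ** 2)               # reconstruction objective
print(recon.shape, int((features != 0).sum(dim=-1)[0]))  # ≤ 16 active features
```

The paper's second stage then fits linear probes from the sparse `features` to discretized pitch, amplitude, and timbre labels, which is what makes the dictionary directions nameable.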
[LG-72] A Unified Theory for Causal Inference: Direct Debiased Machine Learning via Bregman-Riesz Regression
链接: https://arxiv.org/abs/2510.26783
作者: Masahiro Kato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:This note introduces a unified theory for causal inference that integrates Riesz regression, covariate balancing, density-ratio estimation (DRE), targeted maximum likelihood estimation (TMLE), and the matching estimator in average treatment effect (ATE) estimation. In ATE estimation, the balancing weights and the regression functions of the outcome play important roles, where the balancing weights are referred to as the Riesz representer, bias-correction term, and clever covariates, depending on the context. Riesz regression, covariate balancing, DRE, and the matching estimator are methods for estimating the balancing weights, where Riesz regression is essentially equivalent to DRE in the ATE context, the matching estimator is a special case of DRE, and DRE is in a dual relationship with covariate balancing. TMLE is a method for constructing regression function estimators such that the leading bias term becomes zero. Nearest Neighbor Matching is equivalent to Least Squares Density Ratio Estimation and Riesz Regression.
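For orientation, the standard doubly robust (AIPW) estimator of the ATE shows how the two ingredients named above, the balancing weights (Riesz representer) and the outcome regressions, enter a single formula; this is textbook background rather than the note's new result:

```latex
% Doubly robust (AIPW) ATE estimator: outcome regressions \hat{\mu}_d plus a
% bias-correction term weighted by the (estimated) Riesz representer \hat{w}.
\hat{\tau}_{\mathrm{DR}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \Big[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
          + \hat{w}(D_i, X_i)\,\bigl(Y_i - \hat{\mu}_{D_i}(X_i)\bigr) \Big],
\qquad
w(d, x) = \frac{\mathbf{1}\{d = 1\}}{e(x)} - \frac{\mathbf{1}\{d = 0\}}{1 - e(x)},
```

where e(x) is the propensity score. The methods unified in the note (Riesz regression, covariate balancing, DRE, matching) are all ways of estimating w, while TMLE adjusts the μ_d estimates so the correction term's leading bias vanishes.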
[LG-73] Bridging the Gap between Empirical Welfare Maximization and Conditional Average Treatment Effect Estimation in Policy Learning
链接: https://arxiv.org/abs/2510.26723
作者: Masahiro Kato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:The goal of policy learning is to train a policy function that recommends a treatment given covariates to maximize population welfare. There are two major approaches in policy learning: the empirical welfare maximization (EWM) approach and the plug-in approach. The EWM approach is analogous to a classification problem, where one first builds an estimator of the population welfare, which is a functional of policy functions, and then trains a policy by maximizing the estimated welfare. In contrast, the plug-in approach is based on regression, where one first estimates the conditional average treatment effect (CATE) and then recommends the treatment with the highest estimated outcome. This study bridges the gap between the two approaches by showing that both are based on essentially the same optimization problem. In particular, we prove an exact equivalence between EWM and least squares over a reparameterization of the policy class. As a consequence, the two approaches are interchangeable in several respects and share the same theoretical guarantees under common conditions. Leveraging this equivalence, we propose a novel regularization method for policy learning. Our findings yield a convex and computationally efficient training procedure that avoids the NP-hard combinatorial step typically required in EWM.
[LG-74] Assessment of the conditional exchangeability assumption in causal machine learning models: a simulation study
链接: https://arxiv.org/abs/2510.26700
作者: Gerard T. Portela,Jason B. Gibbons,Sebastian Schneeweiss,Rishi J. Desai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Observational studies developing causal machine learning (ML) models for the prediction of individualized treatment effects (ITEs) seldom conduct empirical evaluations to assess the conditional exchangeability assumption. We aimed to evaluate the performance of these models under conditional exchangeability violations and the utility of negative control outcomes (NCOs) as a diagnostic. We conducted a simulation study to examine confounding bias in ITE estimates generated by causal forest and X-learner models under varying conditions, including the presence or absence of true heterogeneity. We simulated data to reflect real-world scenarios with differing levels of confounding, sample size, and NCO confounding structures. We then estimated and compared subgroup-level treatment effects on the primary outcome and NCOs across settings with and without unmeasured confounding. When conditional exchangeability was violated, causal forest and X-learner models failed to recover true treatment effect heterogeneity and, in some cases, falsely indicated heterogeneity when there was none. NCOs successfully identified subgroups affected by unmeasured confounding. Even when NCOs did not perfectly satisfy their ideal assumptions, they remained informative, flagging potential bias in subgroup-level estimates, though not always pinpointing the subgroup with the largest confounding. Violations of conditional exchangeability substantially limit the validity of ITE estimates from causal ML models in routinely collected observational data. NCOs serve as a useful empirical diagnostic tool for detecting subgroup-specific unmeasured confounding and should be incorporated into causal ML workflows to support the credibility of individualized inference.
[LG-75] FlowQ-Net: A Generative Framework for Automated Quantum Circuit Design
链接: https://arxiv.org/abs/2510.26688
作者: Jun Dai,Michael Rizvi-Martel,Guillaume Rabusseau
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Designing efficient quantum circuits is a central bottleneck to exploring the potential of quantum computing, particularly for noisy intermediate-scale quantum (NISQ) devices, where circuit efficiency and resilience to errors are paramount. The search space of gate sequences grows combinatorially, and handcrafted templates often waste scarce qubit and depth budgets. We introduce FlowQ-Net (Flow-based Quantum design Network), a generative framework for automated quantum circuit synthesis based on Generative Flow Networks (GFlowNets). This framework learns a stochastic policy to construct circuits sequentially, sampling them in proportion to a flexible, user-defined reward function that can encode multiple design objectives such as performance, depth, and gate count. This approach uniquely enables the generation of a diverse ensemble of high-quality circuits, moving beyond single-solution optimization. We demonstrate the efficacy of FlowQ-Net through an extensive set of simulations. We apply our method to Variational Quantum Algorithm (VQA) ansatz design for molecular ground state estimation, Max-Cut, and image classification, key challenges in near-term quantum computing. Circuits designed by FlowQ-Net achieve significant improvements, yielding circuits that are 10x-30x more compact in terms of parameters, gates, and depth compared to commonly used unitary baselines, without compromising accuracy. This trend holds even when subjected to error profiles from real-world quantum devices. Our results underline the potential of generative models as a general-purpose methodology for automated quantum circuit design, offering a promising path towards more efficient quantum algorithms and accelerating scientific discovery in the quantum domain.
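GFlowNets of this kind are commonly trained with the trajectory balance objective, which matches the partition-function-scaled product of forward policy probabilities against the reward-weighted product of backward probabilities. A hedged sketch of that generic loss for one gate-placement trajectory; the paper's reward design and policy networks are its own.

```python
import torch

def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, log_reward):
    """Trajectory balance for one trajectory:
    (log Z + sum log P_F  -  log R(x) - sum log P_B)^2."""
    lhs = log_Z + log_pf_steps.sum()
    rhs = log_reward + log_pb_steps.sum()
    return (lhs - rhs) ** 2

# Toy trajectory of 5 gate-placement actions with dummy log-probabilities.
log_Z = torch.tensor(0.0, requires_grad=True)    # learnable log partition function
log_pf = torch.log(torch.full((5,), 0.25))       # forward policy log-probs
log_pb = torch.log(torch.full((5,), 0.5))        # backward policy log-probs
log_reward = torch.tensor(1.3)                   # e.g., log circuit-quality reward
print(trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward).item())
```

Minimizing this loss drives the sampler toward visiting terminal circuits with probability proportional to their reward, which is what yields a diverse ensemble rather than a single optimum.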
[LG-76] Action-Driven Processes for Continuous-Time Control
链接: https://arxiv.org/abs/2510.26672
作者: Ruimin He,Shaowei Lin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:At the heart of reinforcement learning are actions: decisions made in response to observations of the environment. Actions are equally fundamental in the modeling of stochastic processes, as they trigger discontinuous state transitions and enable the flow of information through large, complex systems. In this paper, we unify the perspectives of stochastic processes and reinforcement learning through action-driven processes, and illustrate their application to spiking neural networks. Leveraging ideas from control-as-inference, we show that minimizing the Kullback-Leibler divergence between a policy-driven true distribution and a reward-driven model distribution for a suitably defined action-driven process is equivalent to maximum entropy reinforcement learning.
[LG-77] Hybrid Physical-Neural Simulator for Fast Cosmological Hydrodynamics NEURIPS2025
链接: https://arxiv.org/abs/2510.26593
作者: Arne Thomsen,Tilman Tröster,François Lanusse
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注:  Accepted to the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences
Abstract:Cosmological field-level inference requires differentiable forward models that solve the challenging dynamics of gas and dark matter under hydrodynamics and gravity. We propose a hybrid approach where gravitational forces are computed using a differentiable particle-mesh solver, while the hydrodynamics are parametrized by a neural network that maps local quantities to an effective pressure field. We demonstrate that our method improves upon alternative approaches, such as an Enthalpy Gradient Descent baseline, both at the field and summary-statistic level. The approach is furthermore highly data efficient, with a single reference simulation of cosmological structure formation being sufficient to constrain the neural pressure model. This opens the door for future applications where the model is fit directly to observational data, rather than a training set of simulations.
[LG-78] Physics-Informed Mixture Models and Surrogate Models for Precision Additive Manufacturing
链接: https://arxiv.org/abs/2510.26586
作者: Sebastian Basterrech,Shuo Shan,Debabrata Adhikari,Sankhya Mohanty
类目: Mathematical Physics (math-ph); Machine Learning (cs.LG)
*备注:  Five pages, four figures, to be presented at the AI in Science Summit, Denmark, November, 2025
Abstract:In this study, we leverage a mixture model learning approach to identify defects in laser-based Additive Manufacturing (AM) processes. By incorporating physics-based principles, we also ensure that the model is sensitive to meaningful physical parameter variations. The empirical evaluation was conducted by analyzing real-world data from two AM processes: Directed Energy Deposition and Laser Powder Bed Fusion. In addition, we also studied the performance of the developed framework on public datasets with different alloy types and experimental parameters. The results show the potential of physics-guided mixture models to examine the underlying physical behavior of an AM system.
[LG-79] Multi-Output Robust and Conjugate Gaussian Processes
链接: https://arxiv.org/abs/2510.26401
作者: Joshua Rooijakkers,Leiv Rønneberg,François-Xavier Briol,Jeremias Knoblauch,Matias Altamirano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Multi-output Gaussian process (MOGP) regression allows modelling dependencies among multiple correlated response variables. Similarly to standard Gaussian processes, MOGPs are sensitive to model misspecification and outliers, which can distort predictions within individual outputs. This situation can be further exacerbated by multiple anomalous response variables whose errors propagate due to correlations between outputs. To handle this situation, we extend and generalise the robust and conjugate Gaussian process (RCGP) framework introduced by Altamirano et al. (2024). This results in the multi-output RCGP (MO-RCGP): a provably robust MOGP that is conjugate, and jointly captures correlations across outputs. We thoroughly evaluate our approach through applications in finance and cancer research.
[LG-80] SABER: Symbolic Regression-based Angle of Arrival and Beam Pattern Estimator
链接: https://arxiv.org/abs/2510.26340
作者: Shih-Kai Chou,Mengran Zhao,Cheng-Nan Hu,Kuang-Chung Chou,Carolina Fortuna,Jernej Hribar
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:  12 pages, 11 figures
Abstract:Accurate Angle-of-arrival (AoA) estimation is essential for next-generation wireless communication systems to enable reliable beamforming, high-precision localization, and integrated sensing. Unfortunately, classical high-resolution techniques require multi-element arrays and extensive snapshot collection, while generic Machine Learning (ML) approaches often yield black-box models that lack physical interpretability. To address these limitations, we propose the Symbolic Regression-based Angle of Arrival and Beam Pattern Estimator (SABER), a constrained symbolic-regression framework that automatically discovers closed-form beam pattern and AoA models from path loss measurements with interpretability. SABER achieves high accuracy while bridging the gap between opaque ML methods and interpretable physics-driven estimators. First, we validate our approach in a controlled free-space anechoic chamber, showing that both direct inversion of the known cos^n beam and a low-order polynomial surrogate achieve sub-0.5 degree Mean Absolute Error (MAE). A purely unconstrained SR method can further reduce the error of the predicted angles, but produces complex formulas that lack physical insight. Then, we implement the same SR-learned inversions in a real-world, Reconfigurable Intelligent Surface (RIS)-aided indoor testbed. SABER and unconstrained SR models accurately recover the true AoA with near-zero error. Finally, we benchmark SABER against the Cramér-Rao Lower Bounds (CRLBs). Our results demonstrate that SABER is an interpretable and accurate alternative to state-of-the-art and black-box ML-based methods for AoA estimation.
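The "direct inversion of the known cos^n beam" baseline mentioned above is a one-line computation once the beam exponent is known: the normalized main-lobe gain g(θ) = cos(θ)^n inverts to θ = arccos(g^(1/n)). A toy numpy example with an illustrative exponent:

```python
import numpy as np

n = 6.0                                    # beam exponent (illustrative value)
true_aoa_deg = 23.0

g = np.cos(np.deg2rad(true_aoa_deg)) ** n  # simulated main-lobe gain measurement
aoa_est = np.rad2deg(np.arccos(g ** (1.0 / n)))
print(f"recovered AoA: {aoa_est:.2f} deg (truth {true_aoa_deg} deg)")
```

Symbolic regression earns its keep when the pattern is not an ideal cos^n: it recovers a comparably simple closed-form inverse directly from measurements instead of assuming one.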
[LG-81] Uncertainty-Aware Diagnostics for Physics-Informed Machine Learning
链接: https://arxiv.org/abs/2510.26121
作者: Mara Daniels,Liam Hodgkinson,Michael Mahoney
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Physics-informed machine learning (PIML) integrates prior physical information, often in the form of differential equation constraints, into the process of fitting machine learning models to physical data. Popular PIML approaches, including neural operators, physics-informed neural networks, neural ordinary differential equations, and neural discrete equilibria, are typically fit to objectives that simultaneously include both data and physical constraints. However, the multi-objective nature of this approach creates ambiguity in the measurement of model quality. This is related to a poor understanding of epistemic uncertainty, and it can lead to surprising failure modes, even when existing statistical metrics suggest strong fits. Working within a Gaussian process regression framework, we introduce the Physics-Informed Log Evidence (PILE) score. Bypassing the ambiguities of test losses, the PILE score is a single, uncertainty-aware metric that provides a selection principle for hyperparameters of a PIML model. We show that PILE minimization yields excellent choices for a wide variety of model parameters, including kernel bandwidth, least squares regularization weights, and even kernel function selection. We also show that, even prior to data acquisition, a special ‘data-free’ case of the PILE score identifies a priori kernel choices that are ‘well-adapted’ to a given PDE. Beyond the kernel setting, we anticipate that the PILE score can be extended to PIML at large, and we outline approaches to do so.
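The PILE score builds on the Gaussian process log marginal likelihood (the evidence), which by itself already gives a single data-driven criterion for hyperparameters such as kernel bandwidth. A numpy sketch of that classical quantity; the physics-informed terms that distinguish PILE are the paper's own and are not reproduced here.

```python
import numpy as np

def gp_log_evidence(X, y, bandwidth, noise=1e-2):
    """Log marginal likelihood of a zero-mean GP with an RBF kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq / bandwidth**2) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=40)

# Evidence-based model selection: prefer the bandwidth with highest evidence.
for h in (0.05, 0.2, 1.0):
    print(f"bandwidth={h}: log evidence = {gp_log_evidence(X, y, h):.1f}")
```

Unlike a test loss on a multi-objective fit, the evidence penalizes both misfit and excess model complexity in one number, which is the property PILE extends to the physics-constrained setting.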
[LG-82] Robust Super-Capacity SRS Channel Inpainting via Diffusion Models
链接: https://arxiv.org/abs/2510.26097
作者: Usman Akram,Fan Zhang,Yang Li,Haris Vikalo
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Accurate channel state information (CSI) is essential for reliable multiuser MIMO operation. In 5G NR, reciprocity-based beamforming via uplink Sounding Reference Signals (SRS) faces resource and coverage constraints, motivating sparse non-uniform SRS allocation. Prior masked-autoencoder (MAE) approaches improve coverage but overfit to training masks and degrade under unseen distortions (e.g., additional masking, interference, clipping, non-Gaussian noise). We propose a diffusion-based channel inpainting framework that integrates system-model knowledge at inference via a likelihood-gradient term, enabling a single trained model to adapt across mismatched conditions. On standardized CDL channels, the score-based diffusion variant consistently outperforms a UNet score-model baseline and the one-step MAE under distribution shift, with improvements up to 14 dB NMSE in challenging settings (e.g., Laplace noise, user interference), while retaining competitive accuracy under matched conditions. These results demonstrate that diffusion-guided inpainting is a robust and generalizable approach for super-capacity SRS design in 5G NR systems.
[LG-83] Bias-Corrected Data Synthesis for Imbalanced Learning
链接: https://arxiv.org/abs/2510.26046
作者: Pengfei Lyu,Zhengchi Ma,Linjun Zhang,Anru R. Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:  41 pages, 4 figures, includes proofs and appendix
Abstract:Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naively treated as the true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure to mitigate the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. This procedure is extended to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.
[LG-84] L_1-norm Regularized Indefinite Kernel Logistic Regression
链接: https://arxiv.org/abs/2510.26043
作者: Shaoxin Wang,Hanjing Yao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:  17 pages, 1 figure
Abstract:Kernel logistic regression (KLR) is a powerful classification method widely applied across diverse domains. In many real-world scenarios, indefinite kernels capture more domain-specific structural information than positive definite kernels. This paper proposes a novel L_1-norm regularized indefinite kernel logistic regression (RIKLR) model, which extends the existing IKLR framework by introducing sparsity via an L_1-norm penalty. This regularization enhances interpretability and generalization but renders the optimization landscape nonsmooth and nonconvex. To address these challenges, a theoretically grounded and computationally efficient proximal linearized algorithm is developed. Experimental results on multiple benchmark datasets demonstrate the superior performance of the proposed method in terms of both accuracy and sparsity.
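Proximal (linearized) methods for L_1-regularized objectives alternate a gradient step on the smooth logistic loss with a soft-thresholding step. A self-contained numpy sketch on a toy problem with a tanh kernel, a classical example of a kernel that need not be positive definite; the paper's algorithm and convergence guarantees are more refined than this.

```python
import numpy as np
from scipy.special import expit   # numerically stable sigmoid

rng = np.random.default_rng(0)
n = 80
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

K = np.tanh(X @ X.T)              # tanh kernel: possibly indefinite

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Proximal gradient (ISTA) for
#   min_a  mean(log(1 + exp(-s_i * (K a)_i)))  +  lam * ||a||_1,  s = 2y - 1.
a, lam = np.zeros(n), 0.01
lr = 0.5 / np.linalg.norm(K, 2)   # conservative step size
s = 2 * y - 1
for _ in range(500):
    margin = s * (K @ a)
    grad = -(K @ (s * expit(-margin))) / n   # gradient of the smooth part
    a = soft_threshold(a - lr * grad, lr * lam)  # proximal step for the L1 term

pred = (K @ a > 0).astype(float)
print(f"train accuracy: {(pred == y).mean():.2f}, nonzeros: {(a != 0).sum()}/{n}")
```

The soft-thresholding step is what produces exact zeros in the coefficient vector, giving the sparsity and interpretability the abstract highlights.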
[LG-85] Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation
链接: https://arxiv.org/abs/2510.26026
作者: Feichen Gan,Youcun Lu,Yingying Zhang,Yukun Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals for returns in both on-policy and off-policy settings. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependencies, and distributional shifts. We propose a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling. These innovations mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts. Our theoretical analysis provides coverage guarantees that account for model misspecification and importance weight estimation. Empirical results, including experiments in synthetic and benchmark environments like Mountain Car, show that our method significantly improves coverage and reliability over standard distributional RL baselines.
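The calibration step in split conformal prediction reduces to taking a finite-sample-corrected quantile of residual scores on held-out data. A minimal numpy sketch with simulated pseudo-returns; the time-aware weighting and subsampling that handle temporal dependence and policy shift follow the paper and are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Calibration data: stand-ins for truncated-rollout pseudo-returns.
cal_pred = rng.normal(size=500)                    # model's predicted returns
cal_true = cal_pred + 0.3 * rng.normal(size=500)   # realized pseudo-returns

# Split conformal calibration: finite-sample-corrected quantile of |residuals|.
alpha = 0.1
scores = np.abs(cal_true - cal_pred)
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

new_pred = 0.42                                    # return predicted at a new state
print(f"{1-alpha:.0%} prediction interval: [{new_pred - q:.2f}, {new_pred + q:.2f}]")
```

Under exchangeability, this interval covers the true return with probability at least 1 − α regardless of the return distribution, which is the "distribution-free" guarantee the title refers to.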
[LG-86] Enabling Fast and Accurate Neutral Atom Readout through Image Denoising
链接: https://arxiv.org/abs/2510.25982
作者: Chaithanya Naik Mude,Linipun Phuttitarn,Satvik Maurya,Kunal Sinha,Mark Saffman,Swamit Tannu
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:  12 pages, 15 figures
Abstract:Neutral atom quantum computers hold promise for scaling up to hundreds of thousands of qubits, but their progress is constrained by slow qubit readout. Measuring qubits currently takes milliseconds, much longer than the underlying quantum gate operations, making readout the primary bottleneck in deploying quantum error correction. Because each round of QEC depends on measurement, long readout times increase cycle duration and slow down program execution. Reducing the readout duration speeds up cycles and reduces decoherence errors that accumulate while qubits idle, but it also lowers the number of collected photons, making measurements noisier and more error-prone. This tradeoff leaves neutral atom systems stuck between slow but accurate readout and fast but unreliable readout. We show that image denoising can resolve this tension. Our framework, GANDALF, uses explicit denoising via image translation to reconstruct clear signals from short, low-photon measurements, enabling reliable classification at up to 1.6x shorter readout times. Combined with lightweight classifiers and a pipelined readout design, our approach both reduces logical error rate by up to 35x and overall QEC cycle time by up to 1.77x compared to state-of-the-art CNN-based readout for Cesium (Cs) Neutral Atom arrays.
[LG-87] InputDSA: Demixing then Comparing Recurrent and Externally Driven Dynamics
链接: https://arxiv.org/abs/2510.25943
作者: Ann Huang,Mitchell Ostrow,Satpreet H. Singh,Leo Kozachkov,Ila Fiete,Kanaka Rajan
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注:  36 pages, 14 figures
Abstract:In control problems and basic scientific modeling, it is important to compare observations with dynamical simulations. For example, comparing two neural systems can shed light on the nature of emergent computations in the brain and deep neural networks. Recently, Ostrow et al. (2023) introduced Dynamical Similarity Analysis (DSA), a method to measure the similarity of two systems based on their recurrent dynamics rather than geometry or topology. However, DSA does not consider how inputs affect the dynamics, meaning that two similar systems, if driven differently, may be classified as different. Because real-world dynamical systems are rarely autonomous, it is important to account for the effects of input drive. To this end, we introduce a novel metric for comparing both intrinsic (recurrent) and input-driven dynamics, called InputDSA (iDSA). InputDSA extends the DSA framework by estimating and comparing both input and intrinsic dynamic operators using a variant of Dynamic Mode Decomposition with control (DMDc) based on subspace identification. We demonstrate that InputDSA can successfully compare partially observed, input-driven systems from noisy data. We show that when the true inputs are unknown, surrogate inputs can be substituted without a major deterioration in similarity estimates. We apply InputDSA on Recurrent Neural Networks (RNNs) trained with Deep Reinforcement Learning, identifying that high-performing networks are dynamically similar to one another, while low-performing networks are more diverse. Lastly, we apply InputDSA to neural data recorded from rats performing a cognitive task, demonstrating that it identifies a transition from input-driven evidence accumulation to intrinsically-driven decision-making. Our work demonstrates that InputDSA is a robust and efficient method for comparing intrinsic dynamics and the effect of external input on dynamical systems.
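The DMDc-style estimation that InputDSA builds on can be sketched in a few lines: stack states and inputs as regressors, fit one least-squares operator, and split it into intrinsic and input-driven parts. A toy numpy example on a known linear system (the subspace-identification variant and the similarity metric from the paper are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth driven linear system: x_{t+1} = A x_t + B u_t + noise.
A = np.array([[0.9, 0.1], [-0.2, 0.8]])
B = np.array([[0.5], [1.0]])
T = 500
X = np.zeros((2, T))
U = rng.normal(size=(1, T))
for t in range(T - 1):
    X[:, t + 1] = A @ X[:, t] + B @ U[:, t] + 0.01 * rng.normal(size=2)

# DMDc-style demixing: regress next states on stacked [states; inputs], then
# split the least-squares operator into intrinsic (A) and input-driven (B) parts.
Z = np.vstack([X[:, :-1], U[:, :-1]])
G = X[:, 1:] @ np.linalg.pinv(Z)
A_hat, B_hat = G[:, :2], G[:, 2:]
print("A error:", np.abs(A_hat - A).max(), " B error:", np.abs(B_hat - B).max())
```

Recovering A and B separately is the "demixing" step: two systems can then be compared on their intrinsic operators without conflating differences caused purely by different input drive.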
[LG-88] Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction
链接: https://arxiv.org/abs/2510.25814
作者: Yilong Lu,Si Chen,Songyan Gao,Han Liu,Xin Dong,Wenfeng Shen,Guangtai Ding
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:  8 pages, 4 figures
Abstract:Traditional non-biological storage media, such as hard drives, face limitations in both storage density and lifespan due to the rapid growth of data in the big data era. Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium due to their high storage density, structural stability, and long lifespan. The sequencing of mirror-image peptides relies on de novo technology. However, its accuracy is limited by the scarcity of tandem mass spectrometry datasets and the challenges that current algorithms encounter when processing these peptides directly. This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences. In this work, we introduce DBond, a deep neural network-based model that integrates sequence features, precursor ion properties, and mass spectrometry environmental factors for the prediction of mirror-image peptide bond cleavage. In this process, sequences with a high peptide bond cleavage ratio, which are easy to sequence, are selected. The main contributions of this study are as follows. First, we constructed MiPD513, a tandem mass spectrometry dataset containing 513 mirror-image peptides. Second, we developed the peptide bond cleavage labeling algorithm (PBCLA), which generated approximately 12.5 million labeled data points based on MiPD513. Third, we proposed a dual prediction strategy that combines multi-label and single-label classification. On an independent test set, the single-label classification strategy outperformed other methods in both single and multiple peptide bond cleavage prediction tasks, offering a strong foundation for sequence optimization.
[LG-89] Multimodal Bandits: Regret Lower Bounds and Optimal Algorithms NEURIPS2025
链接: https://arxiv.org/abs/2510.25811
作者: William Réveillard,Richard Combes
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:  31 pages; NeurIPS 2025
Abstract:We consider a stochastic multi-armed bandit problem with i.i.d. rewards where the expected reward function is multimodal with at most m modes. We propose the first known computationally tractable algorithm for computing the solution to the Graves-Lai optimization problem, which in turn enables the implementation of asymptotically optimal algorithms for this bandit problem. The code for the proposed algorithms is publicly available at this https URL
[LG-90] Discovering Interpretable Biological Concepts in Single-cell RNA-seq Foundation Models
链接: https://arxiv.org/abs/2510.25807
作者: Charlotte Claye(MICS),Pierre Marschall,Wassila Ouerdane(MICS),Céline Hudelot(MICS),Julien Duquesne
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:Single-cell RNA-seq foundation models achieve strong performance on downstream tasks but remain black boxes, limiting their utility for biological discovery. Recent work has shown that sparse dictionary learning can extract concepts from deep learning models, with promising applications in biomedical imaging and protein models. However, interpreting biological concepts remains challenging, as biological sequences are not inherently human-interpretable. We introduce a novel concept-based interpretability framework for single-cell RNA-seq models with a focus on concept interpretation and evaluation. We propose an attribution method with counterfactual perturbations that identifies genes that influence concept activation, moving beyond correlational approaches like differential expression analysis. We then provide two complementary interpretation approaches: an expert-driven analysis facilitated by an interactive interface and an ontology-driven method with attribution-based biological pathway enrichment. Applying our framework to two well-known single-cell RNA-seq models from the literature, we interpret concepts extracted by Top-K Sparse Auto-Encoders trained on two immune cell datasets. With a domain expert in immunology, we show that concepts improve interpretability compared to individual neurons while preserving the richness and informativeness of the latent representations. This work provides a principled framework for interpreting what biological knowledge foundation models have encoded, paving the way for their use for hypothesis generation and discovery.
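The concept-extraction backbone here, a Top-K Sparse Auto-Encoder, admits a compact implementation: keep only the k largest latent activations per sample and reconstruct the model's hidden state from them. The sketch below shows the general Top-K SAE recipe under assumed conventions (the paper's tying and normalization choices may differ).

```python
# Minimal Top-K Sparse Auto-Encoder: hard Top-K sparsity in the latent code.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_concepts: int, k: int):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, n_concepts)
        self.dec = nn.Linear(n_concepts, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z_sparse = z * mask            # zero out all but the k largest units
        return self.dec(z_sparse), z_sparse
```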
[LG-91] Pulsar Detection with Deep Learning
链接: https://arxiv.org/abs/2510.25774
作者: Manideep Pendyala
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:  56 pages, My master’s thesis
Abstract:Pulsar surveys generate millions of candidates per run, overwhelming manual inspection. This thesis builds a deep learning pipeline for radio pulsar candidate selection that fuses array-derived features with image diagnostics. From approximately 500 GB of Giant Metrewave Radio Telescope (GMRT) data, raw voltages are converted to filterbanks (SIGPROC), then de-dispersed and folded across trial dispersion measures (PRESTO) to produce approximately 32,000 candidates. Each candidate yields four diagnostics (summed profile, time vs. phase, subbands vs. phase, and DM curve), represented as arrays and images. A baseline stacked model (ANNs for arrays + CNNs for images with logistic-regression fusion) reaches 68% accuracy. We then refine the CNN architecture and training (regularization, learning-rate scheduling, max-norm constraints) and mitigate class imbalance via targeted augmentation, including a GAN-based generator for the minority class. The enhanced CNN attains 87% accuracy; the final GAN+CNN system achieves 94% accuracy with balanced precision and recall on a held-out test set, while remaining lightweight enough for near real-time triage. The results show that combining array and image channels improves separability over image-only approaches, and that modest generative augmentation substantially boosts minority (pulsar) recall. The methods are survey-agnostic and extensible to forthcoming high-throughput facilities.
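The baseline stacking described above can be sketched as an MLP scoring the array diagnostics, a small CNN scoring the image diagnostics, and a logistic-regression layer fusing the two scores. Shapes and layer sizes below are assumptions, not the thesis's exact configuration.

```python
# Illustrative array+image stacking for pulsar candidate classification.
import torch
import torch.nn as nn

class StackedCandidateClassifier(nn.Module):
    def __init__(self, array_dim: int = 256):
        super().__init__()
        self.ann = nn.Sequential(nn.Linear(array_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
        self.cnn = nn.Sequential(                  # 4 channels: one per image
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, 1))
        self.fuser = nn.Linear(2, 1)               # logistic-regression fusion

    def forward(self, arrays, images):             # images: (B, 4, H, W)
        scores = torch.cat([self.ann(arrays), self.cnn(images)], dim=-1)
        return torch.sigmoid(self.fuser(scores))   # pulsar probability
```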
[LG-92] RNAGenScape: Property-guided Optimization and Interpolation of mRNA Sequences with Manifold Langevin Dynamics ICML2025
链接: https://arxiv.org/abs/2510.24736
作者: Danqi Liao,Chen Liu,Xingzhi Sun,Dié Tang,Haochen Wang,Scott Youlten,Srikar Krishna Gopinath,Haejeong Lee,Ethan C. Strayer,Antonio J. Giraldez,Smita Krishnaswamy
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:  ICML 2025 Generative AI and Biology (GenBio) Workshop, Oral presentation (top 9.7%)
Abstract:mRNA design and optimization are important in synthetic biology and therapeutic development, but remain understudied in machine learning. Systematic optimization of mRNAs is hindered by the scarce and imbalanced data as well as complex sequence-function relationships. We present RNAGenScape, a property-guided manifold Langevin dynamics framework that iteratively updates mRNA sequences within a learned latent manifold. RNAGenScape combines an organized autoencoder, which structures the latent space by target properties for efficient and biologically plausible exploration, with a manifold projector that contracts each step of update back to the manifold. RNAGenScape supports property-guided optimization and smooth interpolation between sequences, while remaining robust under scarce and undersampled data, and ensuring that intermediate products are close to the viable mRNA manifold. Across three real mRNA datasets, RNAGenScape improves the target properties with high success rates and efficiency, outperforming various generative or optimization methods developed for proteins or non-biological data. By providing continuous, data-aligned trajectories that reveal how edits influence function, RNAGenScape establishes a scalable paradigm for controllable mRNA design and latent space exploration in mRNA sequence modeling.
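The core update loop, property-guided Langevin dynamics followed by a projection back onto the learned manifold, is easy to state in a few lines. The sketch below assumes a trained latent-space property predictor `prop` and manifold projector `proj`; both names are placeholders, and the step sizes are illustrative.

```python
# One step of property-guided manifold Langevin dynamics in latent space.
import torch

def guided_langevin_step(z, prop, proj, step=1e-2, noise=1e-2):
    z = z.detach().requires_grad_(True)
    score = prop(z).sum()                    # scalar property to increase
    grad = torch.autograd.grad(score, z)[0]  # ascent direction
    z_new = z + step * grad + noise * torch.randn_like(z)
    return proj(z_new)                       # contract back onto the manifold
```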
信息检索
[IR-0] ProfOlaf: Semi-Automated Tool for Systematic Literature Reviews
链接: https://arxiv.org/abs/2510.26750
作者: Martim Afonso,Nuno Saavedra,Bruno Lourenço,Alexandra Mendes,João Ferreira
类目: Information Retrieval (cs.IR)
*备注:  4 pages, 1 Figure, 2 tables
Abstract:Systematic reviews and mapping studies are critical for synthesizing research, identifying gaps, and guiding future work, but they are often labor-intensive and time-consuming. Existing tools provide partial support for specific steps, leaving much of the process manual and error-prone. We present ProfOlaf, a semi-automated tool designed to streamline systematic reviews while maintaining methodological rigor. ProfOlaf supports iterative snowballing for article collection with human-in-the-loop filtering and uses large language models to assist in analyzing articles, extracting key topics, and answering queries about the content of papers. By combining automation with guided manual effort, ProfOlaf enhances the efficiency, quality, and reproducibility of systematic reviews across research fields. A video describing and demonstrating ProfOlaf is available at: this https URL
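The iterative snowballing loop with human-in-the-loop filtering can be summarized as follows; `fetch_references`, `fetch_citations`, and `human_review` are placeholders for illustration, not ProfOlaf's actual API.

```python
# Conceptual snowballing loop: expand backward (references) and forward
# (citations) from accepted papers, filtering each round by human review.
def snowball(seed_papers, fetch_references, fetch_citations, human_review,
             max_rounds=3):
    accepted, frontier = set(seed_papers), set(seed_papers)
    for _ in range(max_rounds):
        candidates = set()
        for paper in frontier:
            candidates |= set(fetch_references(paper))   # backward snowballing
            candidates |= set(fetch_citations(paper))    # forward snowballing
        frontier = {p for p in candidates - accepted if human_review(p)}
        if not frontier:
            break
        accepted |= frontier
    return accepted
```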
[IR-1] WeaveRec: An LLM -Based Cross-Domain Sequential Recommendation Framework with Model Merging
链接: https://arxiv.org/abs/2510.26546
作者: Min Hou,Xin Liu,Le Wu,Chenyi He,Hao Liu,Zhi Li,Xin Li,Si Wei
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Cross-Domain Sequential Recommendation (CDSR) seeks to improve user preference modeling by transferring knowledge from multiple domains. Despite the progress made in CDSR, most existing methods rely on overlapping users or items to establish cross-domain correlations-a requirement that rarely holds in real-world settings. The advent of large language models (LLM) and model-merging techniques appears to overcome this limitation by unifying multi-domain data without explicit overlaps. Yet, our empirical study shows that naively training an LLM on combined domains-or simply merging several domain-specific LLMs-often degrades performance relative to a model trained solely on the target domain. To address these challenges, we first experimentally investigate the cause of suboptimal performance in LLM-based cross-domain recommendation and model merging. Building on these insights, we introduce WeaveRec, which cross-trains multiple LoRA modules with source and target domain data in a weaving fashion, and fuses them via model merging. WeaveRec can be extended to multi-source domain scenarios and notably does not introduce additional inference-time cost in terms of latency or memory. Furthermore, we provide a theoretical guarantee that WeaveRec can reduce the upper bound of the expected error in the target domain. Extensive experiments on single-source, multi-source, and cross-platform cross-domain recommendation scenarios validate that WeaveRec effectively mitigates performance degradation and consistently outperforms baseline approaches in real-world recommendation tasks.
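The model-merging half of the recipe can be illustrated with a simple weighted average of LoRA state dicts; WeaveRec's cross-training of the modules on interleaved source/target data is more involved than this uniform merge, so treat the sketch as an assumed baseline.

```python
# Hypothetical merge of several domain-specific LoRA modules by averaging.
import torch

def merge_lora_states(lora_state_dicts, weights=None):
    """Average a list of LoRA state dicts (same keys/shapes) into one."""
    n = len(lora_state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in lora_state_dicts[0]:
        merged[key] = sum(w * sd[key]
                          for w, sd in zip(weights, lora_state_dicts))
    return merged
```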
[IR-2] Barlow Twins for Sequential Recommendation
链接: https://arxiv.org/abs/2510.26407
作者: Ivan Razvorotnev,Marina Munkhoeva,Evgeny Frolov
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Sequential recommendation models must navigate sparse interaction data, popularity bias, and conflicting objectives like accuracy versus diversity. While recent contrastive self-supervised learning (SSL) methods offer improved accuracy, they come with trade-offs: large batch requirements, reliance on handcrafted augmentations, and negative sampling that can reinforce popularity bias. In this paper we introduce BT-SR, a novel non-contrastive SSL framework that integrates the Barlow Twins redundancy-reduction principle into a Transformer-based next-item recommender. BT-SR learns embeddings that align users with similar short-term behaviors while preserving long-term distinctions, without requiring negative sampling or artificial perturbations. This structure-sensitive alignment allows BT-SR to more effectively recognize emerging user intent and mitigate the influence of noisy historical context. Our experiments on five public benchmarks demonstrate that BT-SR consistently improves next-item prediction accuracy and significantly enhances long-tail item coverage and recommendation calibration. Crucially, we show that a single hyperparameter can control the accuracy-diversity trade-off, enabling practitioners to adapt recommendations to specific application needs.
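The Barlow Twins objective itself is standard and worth stating: align the cross-correlation matrix of two embedding views with the identity, penalizing off-diagonal redundancy. The sketch below shows the general form; how BT-SR builds the two views from user sequences follows the paper and is not reproduced here.

```python
# Standard Barlow Twins loss over two batches of embeddings z1, z2: (N, D).
import torch

def barlow_twins_loss(z1, z2, lam=5e-3, eps=1e-8):
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)   # normalize per dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    c = (z1.T @ z2) / z1.shape[0]                # cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()          # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy
    return on_diag + lam * off_diag
```

The single hyperparameter `lam` trades off invariance against redundancy reduction, which matches the abstract's claim that one knob controls the accuracy-diversity balance.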
[IR-3] DiSE: A diffusion probabilistic model for automatic structure elucidation of organic compounds
链接: https://arxiv.org/abs/2510.26231
作者: Haochen Chen,Qi Huang,Anan Wu,Wenhao Zhang,Jianliang Ye,Jianming Wu,Kai Tan,Xin Lu,Xin Xu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Automatic structure elucidation is essential for self-driving laboratories, as it enables the system to achieve true autonomy. This capability closes the experimental feedback loop, ensuring that machine learning models receive reliable structure information for real-time decision-making and optimization. Herein, we present DiSE, an end-to-end diffusion-based generative model that integrates multiple spectroscopic modalities, including MS, 13C and 1H chemical shifts, HSQC, and COSY, to achieve automated yet accurate structure elucidation of organic compounds. By learning inherent correlations among spectra through data-driven approaches, DiSE achieves superior accuracy, strong generalization across chemically diverse datasets, and robustness to experimental data despite being trained on calculated spectra. DiSE thus represents a significant advance toward fully automated structure elucidation, with broad potential in natural product research, drug discovery, and self-driving laboratories.
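As a diffusion probabilistic model, DiSE's training presumably follows the generic denoising-diffusion recipe: noise a clean structure representation at a random timestep and regress the injected noise, conditioned on the spectra. The sketch below is that generic recipe, not DiSE's exact spectral conditioning.

```python
# Generic DDPM training step; alphas_cumprod is a precomputed 1-D tensor of
# cumulative noise-schedule products, and cond stands in for spectral inputs.
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, cond, alphas_cumprod):
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward noising
    return F.mse_loss(model(x_t, t, cond), noise)  # predict the noise
```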
[IR-4] ReaKase-8B: Legal Case Retrieval via Knowledge and Reasoning Representations with LLM s
链接: https://arxiv.org/abs/2510.26178
作者: Yanran Tang,Ruihong Qiu,Xue Li,Zi Huang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Legal case retrieval (LCR) is a cornerstone of real-world legal decision making, as it enables practitioners to identify precedents for a given query case. Existing approaches mainly rely on traditional lexical models and pretrained language models to encode the texts of legal cases. Yet there is rich information in the relations among different legal entities, as well as in the crucial reasoning process that uncovers how legal facts and legal issues lead to judicial decisions. Such a relational reasoning process reflects the distinctive characteristics of each case that distinguish one from another, mirroring the real-world judicial process. Naturally, incorporating such information into precise case embeddings could further enhance the accuracy of case retrieval. In this paper, a novel ReaKase-8B framework is proposed to leverage extracted legal facts, legal issues, legal relation triplets and legal reasoning for effective legal case retrieval. ReaKase-8B designs an in-context legal case representation learning paradigm with a fine-tuned large language model. Extensive experiments on two benchmark datasets from COLIEE 2022 and COLIEE 2023 demonstrate that our knowledge and reasoning augmented embeddings substantially improve retrieval performance over baseline models, highlighting the potential of integrating legal reasoning into legal case retrieval systems. The code has been released on this https URL.
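Once each case (with its extracted facts, issues, triplets, and reasoning) has been embedded, retrieval reduces to nearest-neighbor search over those embeddings. The sketch below shows the standard cosine-similarity ranking step; the embedding model itself is the paper's fine-tuned LLM and is treated as given here.

```python
# Rank candidate cases by cosine similarity to a query-case embedding.
import numpy as np

def retrieve(query_vec, case_vecs, k=5):
    """query_vec: (D,); case_vecs: (N, D). Returns indices of top-k cases."""
    q = query_vec / np.linalg.norm(query_vec)
    C = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    scores = C @ q                       # cosine similarity per case
    return np.argsort(-scores)[:k]       # top-k precedent candidates
```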
[IR-5] OneTrans: Unified Feature Interaction and Sequence Modeling with One Transformer in Industrial Recommender
链接: https://arxiv.org/abs/2510.26104
作者: Zhaoqi Zhang,Haolei Pei,Jun Guo,Tianyu Wang,Yufei Feng,Hui Sun,Shaowei Liu,Aixin Sun
类目: Information Retrieval (cs.IR)
*备注:
Abstract:In recommendation systems, scaling up feature-interaction modules (e.g., Wukong, RankMixer) or user-behavior sequence modules (e.g., LONGER) has achieved notable success. However, these efforts typically proceed on separate tracks, which not only hinders bidirectional information exchange but also prevents unified optimization and scaling. In this paper, we propose OneTrans, a unified Transformer backbone that simultaneously performs user-behavior sequence modeling and feature interaction. OneTrans employs a unified tokenizer to convert both sequential and non-sequential attributes into a single token sequence. The stacked OneTrans blocks share parameters across similar sequential tokens while assigning token-specific parameters to non-sequential tokens. Through causal attention and cross-request KV caching, OneTrans enables precomputation and caching of intermediate representations, significantly reducing computational costs during both training and inference. Experimental results on industrial-scale datasets demonstrate that OneTrans scales efficiently with increasing parameters, consistently outperforms strong baselines, and yields a 5.68% lift in per-user GMV in online A/B tests.
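The unified-tokenizer idea can be pictured as follows: behavior-sequence items share one embedding table (parameter sharing across similar sequential tokens), while each non-sequential feature gets its own token-specific projection, and everything is concatenated into one token sequence for the Transformer. The sketch below is a rough illustration under these assumptions, not OneTrans's production tokenizer.

```python
# Rough sketch of a unified tokenizer for sequential + non-sequential inputs.
import torch
import torch.nn as nn

class UnifiedTokenizer(nn.Module):
    def __init__(self, n_items: int, n_scalar_feats: int, d_model: int):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_model)   # shared across seq tokens
        self.feat_proj = nn.ModuleList(                  # token-specific params
            [nn.Linear(1, d_model) for _ in range(n_scalar_feats)])

    def forward(self, item_ids, scalar_feats):
        seq_tokens = self.item_emb(item_ids)             # (B, L, d)
        feat_tokens = torch.stack(
            [proj(scalar_feats[:, i:i + 1])
             for i, proj in enumerate(self.feat_proj)], dim=1)  # (B, F, d)
        return torch.cat([feat_tokens, seq_tokens], dim=1)  # one token sequence
```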