This blog post contains the latest paper listing retrieved from Arxiv.org on 2025-11-05. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.
Note: The paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-11-05)
A total of 488 papers are updated today, including:
- Natural Language Processing: 49 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 142 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 78 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 158 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
[Quick Read]: This paper targets the limitations of current multimodal large language models (MLLMs) on cross-modal tasks: dependence on fixed modality combinations, high fine-tuning costs, and weak support for complex cross-modal reasoning. To build a truly omni-capable system, the authors propose the Agent-Omni framework, whose core is a master-agent mechanism that coordinates multiple specialized foundation models. Through intent parsing, subtask delegation, and result integration, it enables flexible multimodal reasoning over text, image, audio, and video inputs without retraining, substantially improving adaptability, scalability, and interpretability.
Link: https://arxiv.org/abs/2511.02834
Authors: Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh
Institutions: Amazon; Rochester Institute of Technology; University of Rochester
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures, 14 tables. Under Review
Abstract:Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available. %We release an open-source implementation to support continued research on scalable and reliable omni-modal reasoning.
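To make the master-agent pattern above concrete, here is a minimal coordination sketch in Python. Everything in it, including call_llm and the expert lambdas, is a hypothetical placeholder, not the paper's actual implementation.

```python
# Minimal sketch of an Agent-Omni-style master-agent loop.
# All functions here are hypothetical stand-ins for real foundation models.
from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for any chat-completion API

# Pool of modality-specific expert agents, keyed by modality.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "text":  lambda task: call_llm(f"Answer from the text: {task}"),
    "image": lambda task: call_llm(f"Describe the image, then: {task}"),
    "audio": lambda task: call_llm(f"Transcribe the audio, then: {task}"),
    "video": lambda task: call_llm(f"Summarize the video, then: {task}"),
}

def master_agent(user_query: str, modalities: List[str]) -> str:
    # 1. Interpret user intent and derive per-modality subtasks.
    plan = call_llm(f"Split into subtasks for {modalities}: {user_query}")
    # 2. Delegate subtasks to the matching experts (no retraining involved).
    partials = [EXPERTS[m](plan) for m in modalities if m in EXPERTS]
    # 3. Integrate the partial outputs into one coherent response.
    return call_llm("Combine these partial answers:\n" + "\n".join(partials))
```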
[NLP-1] In Good GRACEs: Principled Teacher Selection for Knowledge Distillation
[Quick Read]: This paper tackles the teacher-selection problem in knowledge distillation: how to efficiently identify the teacher best matched to a given student model and target task without expensive trial-and-error. The key is GRACE (Gradient-based Assessment of Compatibility for Efficient distillation), a lightweight score that quantifies a teacher's effectiveness from distributional properties of the student's gradients, without access to a validation set, teacher logits, or teacher internals. From an information-theoretic perspective, GRACE connects to the leave-one-out stability of gradient-based algorithms, which controls the generalization of the distilled student. Experiments show GRACE correlates strongly (up to 86% Spearman correlation) with the performance of distilled LLaMA and OLMo students on GSM8K and MATH, guides temperature selection, teacher choice under size constraints, and the best teacher within a model family, and improves distillation outcomes by up to 7.4%.
Link: https://arxiv.org/abs/2511.02833
Authors: Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel
Institutions: Princeton Language and Intelligence; Kempner Institute, Harvard; Microsoft Research, New York; University of Pennsylvania
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Knowledge distillation is an efficient strategy to use data generated by large “teacher” language models to train smaller capable “student” models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student’s gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
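Since the abstract only says GRACE measures distributional properties of the student's gradients, a rough sketch can still convey the flavor. The pairwise gradient-cosine statistic below is an assumed proxy, not the paper's exact formula.

```python
# Illustrative GRACE-style teacher score: how coherent are the student
# gradients induced by a teacher's generated samples? The cosine statistic
# is an assumed stand-in for the paper's actual measure.
import torch

def grace_style_score(student, tokenizer, teacher_samples, device="cpu"):
    grads = []
    for text in teacher_samples:
        student.zero_grad()
        batch = tokenizer(text, return_tensors="pt").to(device)
        loss = student(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        # One flattened, normalized gradient vector per example.
        g = torch.cat([p.grad.flatten() for p in student.parameters()
                       if p.grad is not None])
        grads.append(g / (g.norm() + 1e-8))
    G = torch.stack(grads)                       # [n_examples, n_params]
    sims = G @ G.T                               # pairwise cosine similarities
    mask = ~torch.eye(len(grads), dtype=torch.bool)
    return sims[mask].mean().item()              # higher = more coherent signal
```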
[NLP-2] Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
[Quick Read]: This paper asks whether large language models truly make effective use of the full context length in long-context settings. Existing evaluations mostly rely on retrieving specific spans from the context, which lets the bulk of the text be treated as noise and fails to test deep reasoning over long inputs. The authors introduce the Oolong benchmark, whose core innovation is tasks that require analyzing chunks of text at an atomic level and then aggregating those analyses to answer distributional questions, covering classification, counting, and reasoning over temporal and user relations. The key to the design is a two-track setup, naturalistic synthetic tasks (Oolong-synth) and real conversational data (Oolong-real), that systematically tests fine-grained reasoning and aggregation over large quantities of text and thus reflects long-context understanding more faithfully.
Link: https://arxiv.org/abs/2511.02817
Authors: Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
[NLP-3] MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning ICIP
[Quick Read]: This paper addresses the trade-off faced by search agents in multi-turn interaction: keeping the full conversation history yields long contexts with high compute and memory cost, while keeping only the current turn loses essential information, limiting scalability. The proposed MemSearcher workflow iteratively maintains a compact memory and fuses the current turn with it to generate reasoning traces, perform search actions, and update the memory, stabilizing context length across multi-turn interactions and improving efficiency without sacrificing accuracy. To optimize the workflow, the authors introduce multi-context GRPO, an end-to-end reinforcement learning framework that jointly optimizes reasoning, search strategy, and memory management by sampling groups of trajectories under different contexts and propagating trajectory-level advantages across the conversations within them, enabling more efficient training and deployment.
Link: https://arxiv.org/abs/2511.02805
Authors: Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han
Institutions: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xiaohongshu Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user’s question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes reasoning, search strategies, and memory management of MemSearcher Agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at this https URL
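A minimal sketch of the turn loop described above may help: a compact memory replaces the full history and is rewritten each turn. Here llm and search are hypothetical placeholders for the trained policy and a search tool.

```python
# Sketch of a MemSearcher-style episode: context stays bounded because only
# the question and a compact memory enter the model each turn.
def llm(prompt: str) -> str: raise NotImplementedError      # placeholder
def search(query: str) -> str: raise NotImplementedError    # placeholder

def memsearcher_episode(question: str, max_turns: int = 6) -> str:
    memory = ""  # compact memory instead of the full interaction history
    for _ in range(max_turns):
        trace = llm(f"Question: {question}\nMemory: {memory}\nThink, then act.")
        if "FINAL:" in trace:
            return trace.split("FINAL:", 1)[1].strip()
        results = search(llm(f"Write one search query for: {trace}"))
        # Rewrite memory to keep only task-essential information, which is
        # what stabilizes context length across turns.
        memory = llm(f"Update memory.\nOld: {memory}\nNew evidence: {results}\n"
                     "Keep only what is needed to answer the question.")
    return llm(f"Question: {question}\nMemory: {memory}\nAnswer now.")
```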
[NLP-4] Can LLMs subtract numbers?
[Quick Read]: This paper investigates why large language models (LLMs) perform markedly worse on subtraction than on addition, focusing on the non-commutative nature of the operation and the models' weakness at producing negative results. The study finds that models often produce the correct magnitude but omit the negative sign, even though probing shows the sign information is encoded internally yet not reflected in the output. The key finding is that instruction tuning substantially improves generation of the negative sign, bringing subtraction to near-perfect accuracy and thereby characterizing both the limitations and the recoverability of LLMs' arithmetic abilities.
Link: https://arxiv.org/abs/2511.02795
Authors: Mayank Jobanputra, Nils Philipp Walter, Maitrey Mehta, Blerta Veseli, Evan Parker Kelly Chapple, Yifan Wang, Sneha Chetani, Ellie Pavlick, Antonio Vergari, Vera Demberg
Institutions: Saarland University; CISPA Helmholtz Center for Information Security; University of Utah; Brown University; University of Edinburgh
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Work-in-progress; MathNLP non-archival presentation
Abstract:We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ( a-b ) are concentrated in cases where ( a < b ). In such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques such as few-shot learning and instruction-tuning to see if they can improve the LLMs’ performance. Our results suggest that while few-shot prompting yields modest gains, the instruction-tuned models achieve near-perfect accuracies in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs’ arithmetic capabilities in subtraction.
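The paper's central observation (correct magnitude, missing minus sign) is easy to probe with a small harness like the sketch below, where ask_model is a hypothetical stand-in for any LLM call.

```python
# Estimate how often a model gets |a-b| right but drops the sign when a < b.
import random
import re

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an LLM call

def sign_omission_rate(n_trials: int = 200, lo: int = 0, hi: int = 999) -> float:
    omissions, negatives = 0, 0
    for _ in range(n_trials):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        if a >= b:
            continue  # only a < b can expose a dropped negative sign
        negatives += 1
        reply = ask_model(f"What is {a} - {b}? Answer with just the number.")
        m = re.search(r"-?\d+", reply)
        if m is None:
            continue
        pred = int(m.group())
        if pred == abs(a - b) and pred != a - b:
            omissions += 1  # right magnitude, missing minus sign
    return omissions / max(negatives, 1)
```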
[NLP-5] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[Quick Read]: This paper addresses the underexplored state of visual-centric coding in current vision-language models (VLMs): existing work focuses on language-driven code generation and debugging, with little systematic study of converting images into executable symbolic representations such as SVG code. The core challenge is extracting and generating semantically faithful structured code from images to support downstream reasoning. The key contributions are the VCode benchmark and the CodeVQA evaluation protocol, which reframe multimodal understanding as code generation, together with an agentic framework called VCoder that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines the SVG code, and (ii) Acting with Visual Tools, which uses detectors and parsers to supply structured cues such as objects, shapes, and text beyond the model's intrinsic perceptual capacity. Experiments show VCoder substantially improves symbolic fidelity, gaining 12.3 points overall over Claude-4-Opus and validating the value of symbolic visual representations.
Link: https://arxiv.org/abs/2511.02778
Authors: Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
Institutions: University of Oxford; University of Science and Technology of China; Central South University; Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project page: this https URL Github: this https URL
Abstract:Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model’s intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at this https URL.
[NLP-6] Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
[Quick Read]: This paper addresses a limitation of conventional text retrievers when a query is ambiguous: if a query admits multiple interpretations (a multimodal conditional distribution over relevant documents), retrieval based on a single query vector struggles to recall all relevant documents, and existing retrievers degrade as the distance between target document embeddings grows. The key solution is a new retriever architecture, the Autoregressive Multi-Embedding Retriever (AMER), which autoregressively generates multiple query vectors and uses all of them to retrieve documents, better capturing the multimodal semantics of the query. Experiments show AMER captures multi-target distributions perfectly on synthetic data (4x better than a single-embedding model) and achieves 4% and 21% relative gains over single-embedding baselines on two real-world multi-answer retrieval datasets, with larger gains on subsets where the target document embeddings differ more.
Link: https://arxiv.org/abs/2511.02770
Authors: Hung-Ting Chen, Xiang Liu, Shauli Ravfogel, Eunsol Choi
Institutions: New York University; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Most text retrievers generate one query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, the Autoregressive Multi-Embedding Retriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than a single-embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4% and 21% relative gains over single-embedding baselines on the two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of the dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
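A sketch of the retrieval side of AMER: several query vectors are produced for one query and the union of their nearest neighbors is returned. How the vectors are generated autoregressively is abstracted into encode_queries, a hypothetical model call.

```python
# Multi-query retrieval sketch: score documents against each predicted query
# vector and take the union of top hits.
import numpy as np

def encode_queries(query: str, k_vectors: int) -> np.ndarray:
    raise NotImplementedError  # hypothetical: returns [k_vectors, dim]

def multi_query_retrieve(query: str, doc_embs: np.ndarray,
                         k_vectors: int = 4, top_k: int = 5) -> list:
    q = encode_queries(query, k_vectors)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = q @ d.T                          # [k_vectors, n_docs]
    hits = []
    for row in scores:                        # union over the query vectors
        for idx in np.argsort(-row)[:top_k]:
            if int(idx) not in hits:
                hits.append(int(idx))
    return hits
```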
[NLP-7] Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning
[Quick Read]: This paper addresses uncontrolled inference cost in collaborative reasoning with multiple large language models (LLMs): existing decentralized frameworks invoke several LLMs for every input, which is expensive and hard to manage. The key solution is a centralized multi-LLM framework, CoRL, in which a controller LLM selectively coordinates expert models under a budget constraint, formulating the coordination decision as reinforcement learning with dual objectives: maximizing task performance while minimizing overall inference cost. The method adapts its behavior to different budget conditions, surpassing the best single expert under high budgets while remaining strong in low-budget modes, enabling scalable, cost-controllable multi-agent LLM collaboration.
Link: https://arxiv.org/abs/2511.02755
Authors: Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang
Institutions: University of Illinois at Urbana-Champaign; Apple; University of California, Berkeley; Northwestern University; Harvard University; Virginia Tech
Categories: Computation and Language (cs.CL)
Comments: 14 pages
Abstract:Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to have adaptive behavior under different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance-cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
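The dual objective described above can be written down as a simple reward. The exact reward shaping in the paper is not given in the abstract; the linear form below is an assumption for illustration.

```python
# Assumed reward for a budget-conditioned controller: pay for performance,
# charge for spend, and penalize exceeding the budget more heavily.
def corl_style_reward(task_score: float, total_cost: float,
                      budget: float, cost_weight: float = 0.5) -> float:
    """task_score in [0, 1]; total_cost and budget in the same units."""
    over = max(0.0, total_cost - budget)
    return task_score - cost_weight * (total_cost / budget) - 2.0 * (over / budget)
```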
[NLP-8] AI Diffusion in Low Resource Language Countries
[Quick Read]: This paper asks whether the uneven global diffusion of generative AI, in particular the markedly lower adoption in Low-Resource Language Countries (LRLCs), is rooted in language barriers. The key to the solution is a weighted regression model that isolates the language effect from socioeconomic and demographic factors, quantifying the independent impact of linguistic accessibility on AI diffusion. The study finds that LRLCs have a share of AI users roughly 20% lower relative to their baseline, indicating that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
Link: https://arxiv.org/abs/2511.02752
Authors: Amit Misra, Syed Waqas Zamir, Wassim Hamidouche, Inbal Becker-Reshef, Juan Lavista Ferres
Institutions: Microsoft AI for Good Research Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages, 4 tables. Also available at this https URL
Abstract:Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
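The identification strategy (a weighted regression that separates a low-resource-language indicator from socioeconomic controls) can be sketched as follows. The column names are illustrative; the paper's exact covariates and weights may differ.

```python
# Weighted least squares isolating the LRLC effect on AI adoption.
import pandas as pd
import statsmodels.api as sm

def lrlc_language_effect(df: pd.DataFrame) -> float:
    """Assumed columns: ai_user_share, is_lrlc (0/1), gdp_per_capita,
    internet_penetration, median_age, population (regression weights)."""
    X = sm.add_constant(df[["is_lrlc", "gdp_per_capita",
                            "internet_penetration", "median_age"]])
    fit = sm.WLS(df["ai_user_share"], X, weights=df["population"]).fit()
    # Coefficient on is_lrlc = adoption gap net of the controls.
    return fit.params["is_lrlc"]
```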
[NLP-9] CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
[Quick Read]: This paper addresses the shortcoming of current large language model (LLM) agent evaluations that overemphasize task completion while neglecting resource efficiency and adaptability, in particular agents' ability to plan cost-optimally and adjust in real time in dynamic environments. The key contribution is CostBench, a scalable, cost-centric benchmark set in the travel-planning domain, built from multi-path tasks solvable via sequences of atomic and composite tools with customizable costs, plus four types of dynamic blocking events (such as tool failures and cost changes) to systematically evaluate economic reasoning and real-time replanning. Evaluation shows that leading open-source and proprietary models often fail to identify cost-optimal solutions even in static settings, and their performance drops substantially under dynamic conditions, exposing a major gap in economic rationality and charting a clear path toward cost-aware, robust agents.
Link: https://arxiv.org/abs/2511.02734
Authors: Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, Yi R. Fung
Institutions: Hong Kong University of Science and Technology; University of Illinois Urbana-Champaign; Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents’ ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents’ economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and necessitate agents to adapt in real time. Evaluating leading open-sourced and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
[NLP-10] PragExTra: A Multilingual Corpus of Pragmatic Explicitation in Translation
[Quick Read]: This paper addresses the lack of computational modeling of pragmatic explicitation in translation, the phenomenon where translators add background information to make implicit cultural meaning explicit for target-language readers. The key contribution is PragExTra, the first multilingual corpus and detection framework for this phenomenon, covering eight language pairs from TED-Multi and Europarl; candidate cases are identified through null alignments and refined with active learning and human annotation. Experiments show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, reaching up to 0.88 accuracy and 0.82 F1, establishing pragmatic explicitation as a measurable cross-linguistic phenomenon and taking a step toward culturally aware machine translation.
Link: https://arxiv.org/abs/2511.02721
Authors: Doreen Osmelak, Koel Dutta Chowdhury, Uliana Sentsova, Cristina España-Bonet, Josef van Genabith
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refine them using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation. Keywords: translation, multilingualism, explicitation
[NLP-11] The Collaboration Gap
[Quick Read]: This paper addresses how to evaluate and improve generative AI systems in multi-agent collaboration, especially when heterogeneous agents (with different information, privileges, and tools) must cooperate efficiently under partial observability. The key contribution is a scalable maze-solving benchmark that isolates collaborative capability, modulates problem complexity, supports automated grading, and imposes no output-format constraints, so that it reflects realistic agent interaction. The study uncovers a "collaboration gap": models that perform well alone often degrade substantially when required to collaborate. It also validates an effective "relay inference" strategy, in which the stronger agent leads the task and hands off to the weaker one at a key point, closing much of the gap and offering guidance for evaluating and designing future AI-AI and human-AI collaborative systems.
Link: https://arxiv.org/abs/2511.02687
Authors: Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz, Ece Kamar
Institutions: EPFL; Microsoft Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a “collaboration gap”: models that perform well solo often degrade substantially when required to collaborate. Collaboration can break down dramatically; for instance, small distilled models that solve mazes well alone may fail almost completely in certain pairings. We find that starting with the stronger agent often improves outcomes, motivating a “relay inference” approach where the stronger agent leads before handing off to the weaker one, closing much of the gap. Our findings argue for (1) collaboration-aware evaluation, (2) training strategies developed to enhance collaborative capabilities, and (3) interaction design that reliably elicits agents’ latent skills, guidance that applies to AI-AI and human-AI collaboration.
[NLP-12] Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
[Quick Read]: This paper addresses the storage inefficiency of parameter updates after fine-tuning pre-trained large language models (LLMs): fine-tuning typically changes only a small fraction of parameters, yet storing the updates still incurs substantial memory overhead. The key to the solution is exploiting the fact that fine-tuning updates are both low-rank and sparse. The proposed method, Optimal Singular Damage, selectively sparsifies low-rank approximated updates based on the interleaved importance of singular vectors, retaining the most impactful components and achieving higher storage efficiency and accuracy within the same memory budget than low-rank approximation or sparsification alone.
Link: https://arxiv.org/abs/2511.02681
Authors: Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Institutions: Rensselaer Polytechnic Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.
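The core idea (sparsify a low-rank approximation of the fine-tuning update so a larger rank fits the same memory budget) can be sketched in a few lines. The importance weighting below, entry magnitude scaled by the singular value, is an assumed simplification of the paper's "interleaved importance" criterion.

```python
# Sketch: sparsified low-rank compression of a fine-tuning update matrix.
import numpy as np

def sparse_lowrank_update(delta: np.ndarray, rank: int, keep_ratio: float):
    """delta = W_finetuned - W_pretrained for one weight matrix."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    U, S, Vt = U[:, :rank], S[:rank], Vt[:rank, :]
    # Score stored entries by |value| * singular value of their component
    # (an assumed proxy for the paper's importance measure).
    u_imp, v_imp = np.abs(U) * S, np.abs(Vt) * S[:, None]
    cutoff = np.quantile(np.concatenate([u_imp.ravel(), v_imp.ravel()]),
                         1.0 - keep_ratio)
    U_sp = np.where(u_imp >= cutoff, U, 0.0)
    Vt_sp = np.where(v_imp >= cutoff, Vt, 0.0)
    # Reconstruct the stored update as U_sp @ np.diag(S) @ Vt_sp.
    return U_sp, S, Vt_sp
```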
[NLP-13] Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution and Interpretation
[Quick Read]: This paper addresses factual hallucinations on known information that arise when new knowledge is introduced during fine-tuning of large language models (LLMs), a phenomenon whose specific manifestations and underlying mechanisms remain poorly understood. The key solution is KnownPatch, which patches a small number of known-knowledge samples into the later stages of training, effectively mitigating hallucinations induced by new knowledge. Attention analysis shows that learning new knowledge reduces the model's attention to key entities in the question, causing excessive focus on the surrounding context and increasing hallucination risk; the proposed method restores attention to key entities, suppresses this attention shift, limits the spread of hallucinations to textually similar questions, and improves task performance.
Link: https://arxiv.org/abs/2511.02626
Authors: Renfei Dang, Peng Hu, Changjiang Gao, Shujian Huang
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Previous studies show that introducing new knowledge during large language models (LLMs) fine-tuning can lead to the generation of erroneous output when tested on known information, thereby triggering factual hallucinations. However, existing studies have not deeply investigated the specific manifestations and underlying mechanisms of these hallucinations. Our work addresses this gap by designing a controlled dataset Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that when fine-tuned on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit significantly increased hallucination tendencies. This suggests that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations, and these tendencies can even affect other knowledge types in QA tasks. To mitigate such factual hallucinations, we propose KnownPatch, which patches a small number of known knowledge samples in the later stages of training, effectively alleviating new-knowledge-induced hallucinations. Through attention analysis, we find that learning new knowledge reduces the model’s attention to key entities in the question, thus causing excessive focus on the surrounding context, which may increase the risk of hallucination. Moreover, the attention pattern can propagate to similar contexts, facilitating the spread of hallucinations to textually similar questions. Our method effectively mitigates the disruption of new knowledge learning to the model’s attention on key entities, accompanied by improved performance.
[NLP-14] The Realignment Problem: When Right becomes Wrong in LLMs
[Quick Read]: This paper addresses the "Alignment-Reality Gap" caused by static alignment in deployed large language models (LLMs): existing models struggle to keep pace with evolving social norms and policies, while re-annotation is costly and standard unlearning acts as a blunt instrument that degrades utility. The key to the solution is TRACE (Triage and Re-align by Alignment Conflict Evaluation), a principled unlearning framework that recasts re-alignment as a programmatic policy application problem: it triages existing preference data against a new policy via an alignment impact score, identifies high-impact conflicts, and cleanly inverts, discards, or preserves preferences, achieving precise and efficient alignment updates without degrading general capabilities.
Link: https://arxiv.org/abs/2511.02623
Authors: Aakash Sen Sharma, Debdeep Sanyal, Vivek Srivastava, Shirish Karande, Murari Mandal
Institutions: InvideoAI; Birla AI Labs; TCS Research; Kalinga Institute of Industrial Technology
Categories: Computation and Language (cs.CL)
Comments: 23 Pages
Abstract:The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via an alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.
[NLP-15] UniChange: Unifying Change Detection with Multimodal Large Language Model
[Quick Read]: This paper addresses a limitation of current change detection (CD) models in how they acquire knowledge: existing models typically rely on a single type of annotated data and cannot concurrently exploit binary change detection (BCD) and semantic change detection (SCD) datasets, leading to poor generalization and limited applicability. The key contribution is UniChange, the first unified change detection model built on multimodal large language models (MLLMs), which unifies BCD and SCD through three special tokens, [T1], [T2], and [CHANGE], and uses text prompts to guide change-category identification, removing the dependence on predefined classification heads. This design lets the model learn effectively from multi-source datasets even when their class definitions conflict.
Link: https://arxiv.org/abs/2511.02607
Authors: Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li
Institutions: TMCC, Computer Science, Nankai University; VCIP, Computer Science, Nankai University; NKIARI, Futian, Shenzhen; CMEE, Sichuan Agricultural University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at this https URL.
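The three special tokens named in the abstract can be registered with a standard HuggingFace tokenizer as sketched below; the base model here is just an illustrative stand-in, since UniChange itself builds on an MLLM backbone.

```python
# Register [T1], [T2], [CHANGE] as single trainable token ids.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "gpt2"  # illustrative stand-in for the paper's MLLM backbone
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[T1]", "[T2]", "[CHANGE]"]}
)
model.resize_token_embeddings(len(tokenizer))  # make room for the new ids
```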
[NLP-16] CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency NEURIPS2025
[Quick Read]: This paper addresses two problems with the self-consistency strategy of querying large language models (LLMs) multiple times at test time and taking a majority vote: the fixed number of calls wastes compute, and the strategy can fail when the correct answer is rare. The key to the solution is Confidence-Guided Early Stopping (CGES), which uses scalar confidence signals (from token probabilities or reward models) to form a posterior over candidate answers and adaptively halts sampling once a candidate's posterior mass exceeds a threshold, substantially reducing the number of model calls while preserving accuracy.
Link: https://arxiv.org/abs/2511.02603
Authors: Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik
Institutions: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL)
Comments: Efficient Reasoning @ NeurIPS2025
Abstract:Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
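The adaptive halting rule is straightforward to sketch: accumulate confidence mass per candidate answer and stop once one candidate dominates. The paper's Bayesian posterior is simplified here to confidence-weighted voting.

```python
# Confidence-guided early stopping over repeated samples from one model.
from collections import defaultdict
from typing import Callable, Tuple

def cges(sample: Callable[[], Tuple[str, float]],
         max_calls: int = 16, threshold: float = 0.9) -> str:
    """sample() returns (answer, confidence in [0, 1]) from one model call."""
    mass = defaultdict(float)
    for n in range(1, max_calls + 1):
        answer, conf = sample()
        mass[answer] += conf
        best, best_mass = max(mass.items(), key=lambda kv: kv[1])
        total = sum(mass.values())
        if n > 1 and total > 0 and best_mass / total >= threshold:
            return best  # posterior mass concentrated: stop sampling early
    return max(mass.items(), key=lambda kv: kv[1])[0]
```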
[NLP-17] Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour
[Quick Read]: This paper addresses the limited predictive performance of knowledge tracing (KT) models that ignore question text: existing KT models typically use only response correctness plus metadata such as skill tags and timestamps, leaving the pedagogical information in the question content untapped and constraining how accurately student learning states can be modeled. The key solution is Next Token Knowledge Tracing (NTKT), which reframes KT as a next-token prediction task with pretrained large language models (LLMs) by representing both student histories and question content as text sequences, letting LLMs learn behavioral and linguistic patterns jointly. This significantly improves prediction accuracy and generalizes much better to cold-start settings such as new users and new questions.
Link: https://arxiv.org/abs/2511.02599
Authors: Max Norris, Kobi Gal, Sahan Bulathwela
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission poses a lost opportunity while limiting predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. Our series of experiments significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.
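The reframing is mostly an input-formatting choice: histories and questions become one text sequence and correctness becomes the next token. The field names below are illustrative.

```python
# Serialize a student's history plus the target question for next-token KT.
def format_ntkt_prompt(history: list, next_question: str) -> str:
    lines = []
    for h in history:  # h assumed to be {"question_text": str, "correct": bool}
        mark = "correct" if h["correct"] else "incorrect"
        lines.append(f"Q: {h['question_text']} -> {mark}")
    lines.append(f"Q: {next_question} ->")
    return "\n".join(lines)
```

The model is then fine-tuned so that the token following the final "->" is "correct" or "incorrect", which is what turns knowledge tracing into next-token prediction over both behaviour and question text.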
[NLP-18] he Analysis of Lexical Errors in Machine Translation from English into Romanian
[Quick Read]: This research addresses lexical errors in machine translation (MT) from English into Romanian, focusing on specialized texts related to official Covid-19 information, such as documents from the World Health Organization (WHO) and the Gavi Organization, and patient information leaflets (active ingredients of vaccines or medication, indications, dosage and storage instructions, side effects and warnings). By analyzing 230 texts translated automatically by Google Translate, the study identifies and quantifies lexical-level errors. The key lies in improving lexical selection to raise the overall accuracy and domain adequacy of machine translation, providing empirical evidence and directions for improving multilingual automatic translation systems such as Google Translate.
Link: https://arxiv.org/abs/2511.02587
Authors: Angela Stamatie (Dumitran)
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: Doctoral thesis
Abstract:The research explores error analysis in the performance of translating by Machine Translation from English into Romanian, and it focuses on lexical errors found in texts which include official information, provided by the World Health Organization (WHO), the Gavi Organization, by the patient information leaflet (the information about the active ingredients of the vaccines or the medication, the indications, the dosage instructions, the storage instructions, the side effects and warning, etc.). All of these texts are related to Covid-19 and have been translated by Google Translate, a multilingual Machine Translation that was created by Google. In the last decades, Google has actively worked to develop a more accurate and fluent automatic translation system. This research, specifically focused on improving Google Translate, aims to enhance the overall quality of Machine Translation by achieving better lexical selection and by reducing errors. The investigation involves a comprehensive analysis of 230 texts that have been translated from English into Romanian.
[NLP-19] Smart-Hiring: An Explainable end-to-end Pipeline for CV Information Extraction and Job Matching
[Quick Read]: This paper addresses the inefficiency, error-proneness, and bias of manual resume screening in hiring. The core solution is Smart-Hiring, an end-to-end natural language processing (NLP) pipeline that combines document parsing, named-entity recognition, and contextual text embeddings to automatically extract skills, experience, and qualifications from unstructured resumes, and maps resumes and job descriptions into a shared vector space to compute semantic similarity for accurate candidate-job matching. The pipeline is modular and explainable, allowing inspection of extracted entities and matching rationales; experiments on a real-world, multi-domain dataset of resumes and job descriptions show competitive matching accuracy with a transparent decision process, providing a scalable, practical NLP framework for recruitment analytics.
Link: https://arxiv.org/abs/2511.02537
Authors: Kenza Khelkhal, Dihia Lanasri
Institutions: ATM Mobilis
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Hiring processes often involve the manual screening of hundreds of resumes for each job, a task that is time- and effort-consuming, error-prone, and subject to human bias. This paper presents Smart-Hiring, an end-to-end Natural Language Processing (NLP) pipeline designed to automatically extract structured information from unstructured resumes and to semantically match candidates with job descriptions. The proposed system combines document parsing, named-entity recognition, and contextual text embedding techniques to capture skills, experience, and qualifications. Using advanced NLP techniques, Smart-Hiring encodes both resumes and job descriptions in a shared vector space to compute similarity scores between candidates and job postings. The pipeline is modular and explainable, allowing users to inspect extracted entities and matching rationales. Experiments were conducted on a real-world dataset of resumes and job descriptions spanning multiple professional domains, demonstrating the robustness and feasibility of the proposed approach. The system achieves competitive matching accuracy while preserving a high degree of interpretability and transparency in its decision process. This work introduces a scalable and practical NLP framework for recruitment analytics and outlines promising directions for bias mitigation, fairness-aware modeling, and large-scale deployment of data-driven hiring solutions.
[NLP-20] DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding NEURIPS2025
[Quick Read]: This paper addresses the limited applicability of current multimodal models to the fire domain, where the core obstacle is the lack of large-scale, high-quality, richly annotated public fire datasets. The key solution is DetectiumFire, a multimodal dataset of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering diverse fire types, environments, and risk levels, annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, supporting downstream tasks such as synthetic data generation and fire-risk reasoning. The dataset clearly surpasses existing benchmarks in scale, diversity, and data quality, reducing redundancy and improving coverage of real-world scenarios, and provides an important foundation for advancing fire-related AI research and intelligent safety systems.
Link: https://arxiv.org/abs/2511.02495
Authors: Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang
Institutions: Tulane University; Aalto University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Advances in Neural Information Processing Systems 2025 (NeurIPS 2025), Poster, this https URL
Abstract:Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at this https URL
[NLP-21] Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas
[Quick Read]: This paper evaluates whether persona-based prompting improves large language model (LLM) performance on macroeconomic forecasting, i.e., whether specific persona descriptions raise forecast accuracy for economic variables (HICP, core HICP, GDP growth, unemployment), especially on out-of-sample events. The key design is a prompting framework built from 2,368 economics-related personas, using GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025) for a systematic comparison. The results show that GPT-4o achieves forecast accuracy comparable to the human expert panel, but persona prompting yields no measurable gain, suggesting such prompt components can be omitted to cut computational cost without sacrificing accuracy.
Link: https://arxiv.org/abs/2511.02458
Authors: Giulia Iadisernia, Carolina Camassa
Institutions: Banca d'Italia
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)
Comments: 9 pages, 8-page appendix, accepted at ICAIF 25
Abstract:We evaluate whether persona-based prompting improves Large Language Model (LLM) performance on macroeconomic forecasting tasks. Using 2,368 economics-related personas from the PersonaHub corpus, we prompt GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). We compare the persona-prompted forecasts against the human experts panel, across four target variables (HICP, core HICP, GDP growth, unemployment) and four forecast horizons. We also compare the results against 100 baseline forecasts without persona descriptions to isolate its effect. We report two main findings. Firstly, GPT-4o and human forecasters achieve remarkably similar accuracy levels, with differences that are statistically significant yet practically modest. Our out-of-sample evaluation on 2024-2025 data demonstrates that GPT-4o can maintain competitive forecasting performance on unseen events, though with notable differences compared to the in-sample period. Secondly, our ablation experiment reveals no measurable forecasting advantage from persona descriptions, suggesting these prompt components can be omitted to reduce computational costs without sacrificing accuracy. Our results provide evidence that GPT-4o can achieve competitive forecasting accuracy even on out-of-sample macroeconomic events, if provided with relevant context data, while revealing that diverse prompts produce remarkably homogeneous forecasts compared to human panels.
[NLP-22] Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance
[Quick Read]: This paper addresses the underperformance of LLMs in specialized domains such as finance, which demand a combination of domain knowledge, mathematical reasoning, and multilingual processing. Instead of costly and unstable multi-skill joint training, the authors build financial LLMs by merging expert models produced by domain-specific continual pre-training (CPT) in finance, math, and Japanese. The key is a three-stage evaluation framework (knowledge recovery, complementarity, and emergence) and a systematic comparison of three merging methods (Task Arithmetic, TIES, and DARE-TIES). Results show that merging a CPT expert with its base model recovers general knowledge lost during domain training, while merging multiple experts improves performance and can yield emergent cross-domain skills; Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. This is the first systematic analysis of CPT model merging and offers practical guidance for building multi-skill financial LLMs from existing assets.
Link: https://arxiv.org/abs/2511.02451
Authors: Kentaro Ueda, François Portet, Hirohiko Suwa, Keiichi Yasumoto
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:While LLMs excel at general tasks, they struggle in specialized domains like finance, requiring diverse skills in domain knowledge, mathematical reasoning, and multilingual processing. Merging domain-specific Continual Pre-training (CPT) “experts” offers a practical alternative to costly and unstable multi-skill training. However, unlike established Supervised Fine-Tuning (SFT) model-based merging, CPT model merging remains largely unexplored. We address this gap by creating financial LLMs from experts in finance, math, and Japanese. We propose a three-stage evaluation focusing on knowledge recovery, complementarity, and emergence, and assess three merging methods (Task Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated from 18 tasks across 8 established datasets. Results show that merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Among the methods, Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. Our findings also suggest that while model similarity correlates with merging success, emergent skills depend on more complex factors. This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets.
[NLP-23] AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
[Quick Read]: This paper addresses the vulnerability of large language models (LLMs) to jailbreaking attacks in multi-turn dialogue, where adaptive multi-turn interaction elicits harmful outputs, whereas most existing evaluations cover only single-turn interaction and thus fail to reflect real-world safety risk. The key contribution is AutoAdv, a training-free automated multi-turn jailbreaking framework with three adaptive mechanisms: a pattern manager that learns from past successful attack strategies to refine subsequent prompts; a temperature manager that dynamically adjusts sampling parameters based on failure modes to improve exploration; and a two-phase rewriting strategy that first disguises harmful requests and then iteratively refines them. It reaches up to a 95% attack success rate within six turns (a 24% improvement over single-turn baselines), revealing a systematic weakness: alignment mechanisms optimized for single-turn interaction fail markedly in multi-turn settings.
Link: https://arxiv.org/abs/2511.02376
Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban
Institutions: Del Norte High School; Bridgewater-Raritan High School; University of California, Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to a 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
[NLP-24] AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda
[Quick Read]: This paper addresses the poor performance of current large language models (LLMs) in highly specialized domains such as the traditional medical system of Ayurveda, particularly in accurately understanding and applying cultural, linguistic, and subject-matter expertise. The key to the solution is a high-quality, expert-curated bilingual (English and Hindi) Ayurveda dataset used to fine-tune Param-1-2.9B into the domain-specialized AyurParam-2.9B. The dataset incorporates context-aware, reasoning, and objective-style QA with rigorous annotation protocols for factual precision and instructional clarity; on the BhashaBench-Ayur benchmark, the model outperforms open-source instruction-tuned models of similar size and is competitive with or superior to much larger models, underscoring the importance of authentic domain adaptation and high-quality supervision for reliable, culturally congruent AI in specialized medical knowledge.
Link: https://arxiv.org/abs/2511.02374
Authors: Mohd Nauman, Sravan Gvm, Vijay Devane, Shyam Pawar, Viraj Thakur, Kundeshwar Pundalik, Piyush Sawarkar, Rohit Saluja, Maunendra Desarkar, Ganesh Ramakrishnan
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam’s dataset incorporates context-aware, reasoning, and objective-style QA in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5–3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.
[NLP-25] LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context
[Quick Read]: This paper addresses the lack of dynamic, systematic safety benchmarks for Chinese-language large language models (LLMs) in real applications: existing safety evaluations rely mostly on static datasets that cannot track fast-evolving threats, and the Chinese legal and social context calls for a continuously updated, localized mechanism. The key solution is LiveSecBench, a dynamic safety benchmark for Chinese LLM application scenarios covering six dimensions: Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety. A regular update schedule incorporates new threat vectors (such as text-to-image generation safety and agentic safety) so that the evaluation keeps reflecting real-world safety challenges.
Link: https://arxiv.org/abs/2511.02366
Authors: Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, Tianxin Zhang, Sifang Wan, Kecheng Wang, Haitian Li, Xu Wang, Lefan Cheng, Youdan Yang, Baocheng Chen, Ziyu Liu, Yufei Sun, Liyan Wu, Wenya Wen, Xingchi Gu, Peiru Yang
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at this https URL.
[NLP-26] CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
[Quick Read]: This paper addresses the constraint that current vision-language models (VLMs) reason within a discrete, rigid space of language tokens, failing to capture the high-dimensional, continuous nature of visual perception. The key solution is CoCoVa (Chain of Continuous Vision-Language Thought), whose core is an iterative reasoning cycle in which a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. A saliency-based token selection mechanism focuses the process on key visual regions, mimicking attentional focus, and a multi-task training objective combining contrastive learning with diffusion-based reconstruction keeps the latent representations semantically aligned with both the visual and textual modalities, effectively bridging the representational gap between discrete language processing and continuous visual understanding.
Link: https://arxiv.org/abs/2511.02360
Authors: Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
Institutions: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
[NLP-27] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation CIKM2025
[Quick Read]: This paper addresses two core problems with existing large language model (LLM)-based query augmentation: augmenting every query incurs substantial embedding latency and can even hurt retrieval for some queries, and existing methods have not been explored in multimodal settings. The key solution is M-Solomon, a universal multimodal embedder that adaptively decides whether to augment a query. Training queries are first divided at the dataset level into those requiring augmentation and those that do not; a powerful multimodal LLM (MLLM) then synthesizes targeted augmentations for the former, and through adaptive query augmentation the model learns to emit the prefix /augment only when needed, otherwise emitting /embed to embed directly, achieving efficient and precise query processing.
Link: https://arxiv.org/abs/2511.02358
Authors: Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
Institutions: NC AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted to MMGenSR Workshop (CIKM 2025)
Abstract:Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduce a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, while achieving much lower embedding latency.
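其中"/augment 与 /embed 前缀"的推理端控制流可以浓缩为几行代码(示意;StubMLLM、StubEmbedder 仅为占位,真实系统中分别是联合训练的生成与嵌入模块):

```python
class StubMLLM:
    def generate(self, query, image=None):
        # 假设:训练后的模型对需要补充语义的查询输出 "/augment <增强文本>",否则输出 "/embed"
        return "/augment 一种双门跑车,常见于赛事场景" if len(query) < 10 else "/embed"

class StubEmbedder:
    def encode(self, text, image=None):
        return [hash(text) % 997]  # 占位向量,真实实现为多模态嵌入器

def embed_query(mllm, embedder, query, image=None):
    decision = mllm.generate(query, image=image)
    if decision.startswith("/augment"):
        augmentation = decision[len("/augment"):].strip()
        return embedder.encode(query + " " + augmentation, image=image)  # 仅在必要时增强
    return embedder.encode(query, image=image)  # "/embed":直接嵌入,省去增强时延

print(embed_query(StubMLLM(), StubEmbedder(), "红色跑车"))
```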
zh
[NLP-28] LTD-Bench: Evaluating Large Language Models by Letting Them Draw NEURIPS2025
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估范式中存在的核心缺陷——即依赖不透明的数值指标,掩盖了模型在空间推理(spatial reasoning)能力上的根本局限,且无法提供直观的能力理解,导致实际性能与报告结果之间存在危险脱节,尤其在需要物理世界理解的应用场景中尤为显著。解决方案的关键在于提出LTD-Bench这一突破性基准测试框架,通过要求模型以点阵图或可执行代码形式生成可视化输出,将抽象分数转化为可直接观察的图形结果,从而让空间推理能力的不足对非专家也一目了然;该方法同时包含生成任务(测试空间想象力)和识别任务(评估空间感知力),并覆盖三个逐步提升难度的层级,系统性地评估语言到空间概念的双向映射能力,进而揭示当前先进模型在建立这种双向映射时存在的严重能力缺口。
链接: https://arxiv.org/abs/2511.02347
作者: Liuhao Lin,Ke Li,Zihan Xu,Yuchen Shi,Yulei Qin,Yan Zhang,Xing Sun,Rongrong Ji
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by NeurIPS 2025
Abstract:Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research, relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts, a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench’s visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
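对点阵生成任务,评测可以直观地落到"模型画出的图与参考图的重合度"上。下面是一个假设性的打分示意(字符点阵格式与IoU指标均为本文虚构,论文的实际评测协议以原文为准):

```python
def parse_dot_matrix(text):
    """把字符点阵解析为坐标集合:'#' 代表落笔,其余字符代表空白(格式为假设)。"""
    return {(r, c) for r, line in enumerate(text.strip().splitlines())
            for c, ch in enumerate(line) if ch == "#"}

def iou_score(pred_text, ref_text):
    pred, ref = parse_dot_matrix(pred_text), parse_dot_matrix(ref_text)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

ref = "..#..\n.###.\n#####"   # 参考图形:一个实心三角形
pred = "..#..\n.###.\n.###."  # 模型输出:底边画短了
print(round(iou_score(pred, ref), 3))  # 0.778
```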
zh
[NLP-29] Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning
【速读】: 该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中,如何在集中训练、分散执行(centralized training, decentralized execution)框架下,高效学习具备协作能力且能完成复杂时序任务的多任务多智能体策略问题。现有方法在样本效率上表现不佳,且仅适用于单任务场景。论文提出Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL) 框架,其关键在于利用自动机(automata)对任务进行形式化建模,从而将复杂任务分解为可分配给各智能体的子任务,并通过任务条件化的策略学习实现多智能体间的协同决策。此外,论文证明了所学价值函数可用于测试阶段最优的任务分配,实验验证了智能体间能够涌现出任务感知的多步协作行为,如按按钮解锁门、持门以及任务短路等。
链接: https://arxiv.org/abs/2511.02304
作者: Beyazit Yalcinkaya,Marcell Vazquez-Chanlatte,Ameesh Shah,Hanna Krasowski,Sanjit A. Seshia
机构: University of California, Berkeley (加州大学伯克利分校); Nissan North America (日产北美公司)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注:
Abstract:We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL’s feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
zh
[NLP-30] Unlocking the Power of Multi-Agent LLM for Reasoning : From Lazy Agents to Deliberation
【速读】: 该论文旨在解决多智能体推理框架中因“懒惰代理行为”(lazy agent behavior)导致的合作失效问题,即一个代理主导任务而另一个代理贡献甚微,从而退化为单代理系统,削弱了多智能体协同的优势。其解决方案的关键在于:首先,提出一种稳定且高效的因果影响度量方法以缓解懒惰行为;其次,设计一种可验证奖励机制(verifiable reward mechanism),促使推理代理在多轮交互中主动丢弃噪声输出、整合指令并适时重启推理过程,从而增强协作稳定性与推理质量。
链接: https://arxiv.org/abs/2511.02303
作者: Zhiwei Zhang,Xiaomin Li,Yudi Lin,Hui Liu,Ramraj Chandradevan,Linlin Wu,Minhua Lin,Fali Wang,Xianfeng Tang,Qi He,Suhang Wang
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Harvard University (哈佛大学); Michigan State University (密歇根州立大学); University of Utah (犹他大学); Microsoft (微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent framework for complex reasoning tasks.
zh
[NLP-31] Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions ICDAR2025
【速读】: 该论文旨在解决手写数学表达式(Handwritten Mathematical Expression, HME)结构识别的难题,尤其是如何准确建模符号间的空间依赖关系以提升整体识别精度。其解决方案的关键在于将HME表示为图结构(Graph Neural Network, GNN),其中节点代表符号,边表示空间关系;首先通过深度双向长短期记忆网络(Deep BLSTM)完成符号分割、识别与初步空间关系分类,构建初始原始图;随后利用二维上下文无关文法(2D-CFG)解析器生成所有可能的空间关系,并由GNN-based链接预测模型去除冗余连接,最终形成精确的符号标签图(Symbol Label Graph),从而显著提升HME结构识别性能。
链接: https://arxiv.org/abs/2511.02288
作者: Cuong Tuan Nguyen,Ngoc Tuan Nguyen,Triet Hoang Minh Dao,Huy Minh Nhat,Huy Truong Dinh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: accepted for ICDAR2025-WML
Abstract:We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.
zh
[NLP-32] SAIL-RL: Guiding MLLM s in When and How to Think via Dual-Reward RL Tuning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理能力上的局限性,具体包括:现有方法依赖结果导向的监督信号(outcome-only supervision),仅奖励正确答案而无法保证推理过程的质量;同时采用统一的思考策略,导致简单任务过度推理、复杂任务推理不足。解决方案的关键在于提出SAIL-RL框架,其核心创新为双奖励机制:一是“思考奖励”(Thinking Reward),通过事实依据、逻辑连贯性和答案一致性评估推理质量;二是“判断奖励”(Judging Reward),动态决定何时进行深度推理或直接作答,从而实现更可靠且自适应的推理行为。实验表明,该方法显著提升了MLLMs在推理和多模态理解任务上的表现,并有效减少幻觉现象。
链接: https://arxiv.org/abs/2511.02280
作者: Fangxun Shu,Yongjie Ye,Yue Liao,Zijian Kang,Weijie Yin,Jiacong Wang,Xiao Liang,Shuicheng Yan,Chao Feng
机构: Douyin SAIL Team; National University of Singapore
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at this https URL.
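双奖励的组合逻辑可抽象成如下示意(Rollout 的字段、各子评分与权重均为本文假设;原文的思考奖励由事实依据、逻辑连贯、答案一致性三项构成):

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    used_deep_reasoning: bool   # 模型本次是否选择深度推理路径
    think_scores: tuple         # (事实依据, 逻辑连贯, 答案一致性),各项取值[0,1]
    needs_deep_reasoning: bool  # 该题是否确实需要深度推理(由难度判定给出)

def sail_rl_reward(r: Rollout, w_think=0.5, w_judge=0.5) -> float:
    """SAIL-RL双奖励的示意组合(权重为假设)。"""
    thinking = sum(r.think_scores) / 3 if r.used_deep_reasoning else 0.0
    judging = 1.0 if r.needs_deep_reasoning == r.used_deep_reasoning else 0.0
    return w_think * thinking + w_judge * judging

print(sail_rl_reward(Rollout(False, (0, 0, 0), False)))      # 0.5:简单题直接作答,判断正确
print(sail_rl_reward(Rollout(True, (0.9, 0.8, 1.0), True)))  # 0.95:难题且推理链质量高
```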
zh
[NLP-33] Demo: Statistically Significant Results On Biases and Errors of LLM s Do Not Guarantee Generalizable Results
【速读】: 该论文旨在解决医疗场景下大语言模型(Large Language Models, LLMs)在面对包含人口统计学信息等非医学因素时,可能出现幻觉(hallucination)、遗漏(omission)和偏倚(bias)等问题,导致其提供的医疗建议不一致或不可靠的问题。解决方案的关键在于构建一个自动化评估基础设施,包括两个核心模块:1)通过采样患者人口特征、病史、疾病类型及写作风格生成真实且多样化的查询语句;2)利用多LLM作为评判者(LLM-as-a-judge)的设置,结合代理式工作流(agentic workflows)与分类检测器,对回答进行幻觉、遗漏识别及治疗类别判断。该方法能够系统性地探测LLM在不同变量下的表现差异,并揭示当前评估体系中因LLM间一致性低(平均Cohen’s Kappa=0.118)而可能导致的非泛化结论风险,从而推动更透明、可靠的LLM评估实践。
链接: https://arxiv.org/abs/2511.02246
作者: Jonathan Liu,Haoling Qiu,Jonathan Lasko,Damianos Karakos,Mahsa Yarmohammadi,Mark Dredze
机构: Princeton University (普林斯顿大学); RTX BBN Technologies; Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen’s Kappa $\kappa = 0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: this https URL.
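文中用于衡量评审者间一致性的 Cohen's Kappa 定义为 $\kappa = (p_o - p_e)/(1 - p_e)$,其中 $p_o$ 为观察一致率、$p_e$ 为按边际分布计算的随机一致率。下面用虚构的两名LLM标注结果演示计算过程:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """两名标注者的Cohen's Kappa:kappa = (p_o - p_e) / (1 - p_e)。"""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n      # 观察一致率
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)  # 随机一致率
    return (p_o - p_e) / (1 - p_e)

llm_1 = ["hallucination", "ok", "omission", "ok", "ok", "hallucination"]
llm_2 = ["ok", "ok", "omission", "ok", "hallucination", "ok"]
print(round(cohens_kappa(llm_1, llm_2), 3))  # 0.143:接近文中报告的低一致性水平
```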
zh
[NLP-34] An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM
【速读】: 该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在标准训练中将非文本信息(如音频)与文本提示拼接处理时,导致模态间融合浅层化的问题,从而限制了模型对核心语言模型推理能力的充分利用。其解决方案的关键在于采用交错式指令微调(interleaved instruction tuning),即在提示中将音频token与文本token交错排列,使模型在训练过程中更深入地整合多模态信息。实验基于新构建的SHARD基准测试集(Synonym and Hypernym Audio Reasoning Dataset)验证发现,即使零样本下使用交错提示也能提升音频语义推理性能,而少量交错训练进一步优化结果,但会略微损害模型原有的音频分类能力。
链接: https://arxiv.org/abs/2511.02234
作者: Jiawei Liu,Enis Berk Çoban,Zarina Schevchenko,Hao Tang,Zhigang Zhu,Michael I Mandel,Johanna Devaney
机构: The Graduate Center, CUNY (纽约市立大学研究生院); Brooklyn College, CUNY (纽约市立大学布鲁克林学院); Borough of Manhattan Community College, CUNY (纽约市立大学曼哈顿社区学院); The City College of New York, CUNY (纽约市立大学城市学院)
类目: Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD)
备注:
Abstract:Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, like vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model’s ability to leverage the core language model’s reasoning capabilities. This work examined the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created reasoning benchmark for audio-based semantic reasoning focusing on synonym and hypernym recognition. Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning using interleaved training prompts improves the results further, however, at the expense of the MLLM’s audio labeling ability.
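"拼接式"与"交错式"提示的差别可以用两个字符串模板直观对比(<audio> 为音频token占位符,模板措辞为本文假设):

```python
AUDIO = "<audio>" * 8  # 音频token占位(假设每段音频编码为8个token)

# 标准做法:非文本信息整体前置,再接文本提示
concat_prompt = f"{AUDIO} 问题:下列哪个词是这段声音的上位词?"

# 交错做法:把音频token嵌入提示语句内部,促使模态更深度融合
interleaved_prompt = f"请先听这段录音:{AUDIO},再回答:下列哪个词是它的上位词?"

print(concat_prompt)
print(interleaved_prompt)
```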
zh
[NLP-35] IG-Pruning: Input-Guided Block Pruning for Large Language Models EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中因计算需求高而导致的推理效率问题,特别是现有深度剪枝(Depth Pruning)方法依赖固定块掩码(fixed block masks)导致在不同任务和输入下性能受限的问题。解决方案的关键在于提出IG-Pruning,一种输入感知的逐块剪枝方法,其核心创新包括:(1) 通过语义聚类与L0正则化优化发现多样化的掩码候选集;(2) 在推理时动态选择最优层掩码,无需额外训练即可实现高效剪枝。该方法显著优于当前主流静态深度剪枝技术,尤其适用于资源受限场景。
链接: https://arxiv.org/abs/2511.02213
作者: Kangyu Qiao,Shaolei Zhang,Yang Feng
机构: Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025. Code is available at this https URL
Abstract:With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, and (2) Implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios.
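推理时的动态掩码选择可示意如下(掩码候选对应第一阶段语义聚类与L0优化的产物;router 的打分方式与batch级选择为本文的简化假设):

```python
import torch
import torch.nn as nn

def forward_with_mask(layers, mask, x):
    """按布尔掩码跳过被剪枝的Transformer层。"""
    for keep, layer in zip(mask, layers):
        if keep:
            x = layer(x)
    return x

def ig_pruning_forward(layers, mask_candidates, router, x):
    scores = router(x.mean(dim=1))             # 输入感知:对每个候选掩码打分
    best = scores.mean(dim=0).argmax().item()  # 此处简化为batch级选择
    return forward_with_mask(layers, mask_candidates[best], x)

dim = 64
layers = nn.ModuleList([nn.TransformerEncoderLayer(dim, 4, batch_first=True) for _ in range(6)])
mask_candidates = [(1, 1, 0, 1, 1, 0), (1, 0, 1, 1, 0, 1), (1, 1, 1, 1, 1, 1)]
router = nn.Linear(dim, len(mask_candidates))
out = ig_pruning_forward(layers, mask_candidates, router, torch.randn(2, 10, dim))
print(out.shape)  # torch.Size([2, 10, 64])
```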
zh
[NLP-36] raining Proactive and Personalized LLM Agents
【速读】: 该论文旨在解决当前AI代理(Agent)研究中过度聚焦于任务成功率,而忽视用户中心交互质量的问题。作者指出,真正有效的现实世界代理需同时优化三个维度:生产力(Productivity,即任务完成度)、主动性(Proactivity,即主动提出关键问题)和个性化(Personalization,即适配多样用户偏好)。解决方案的关键在于提出UserVille——一个基于大语言模型(LLM)的用户模拟环境,支持可配置的多样化用户偏好;并设计PPP(Multi-Objective Reinforcement Learning)方法,联合优化上述三重目标。实验表明,PPP训练的代理在软件工程与深度研究任务中显著优于GPT-5等强基线模型(平均提升21.6分),展现出提出战略性澄清问题、适应未见用户偏好及通过优化交互提升任务成功率的能力。
链接: https://arxiv.org/abs/2511.02208
作者: Weiwei Sun,Xuhui Zhou,Weihua Du,Xingyao Wang,Sean Welleck,Graham Neubig,Maarten Sap,Yiming Yang
机构: Carnegie Mellon University (卡内基梅隆大学); OpenHands
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
zh
[NLP-37] Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning
【速读】: 该论文旨在解决个体决策模型与群体最优预测之间存在的偏差问题,特别是在高风险场景(如疫苗接种选择)中,个体决策往往受数值属性(如成本、时间)和语言因素(如个人偏好与约束)的复杂影响,导致传统方法难以准确建模。其解决方案的关键在于提出一种自适应文本-符号人类中心推理框架(ATHENA),该框架包含两个核心阶段:首先利用大语言模型(LLM)增强的符号发现机制提取稳健的群体级符号效用函数;其次通过个体层面的语义适配生成个性化语义模板,以最优效用为导向建模个体选择行为。实证结果表明,ATHENA在真实世界的出行方式和疫苗选择任务中显著优于基于效用、机器学习及其他LLM的方法,F1分数提升至少6.5%,且消融实验验证了两个阶段的互补性与必要性。
链接: https://arxiv.org/abs/2511.02194
作者: Yibo Zhao,Yang Zhao,Hongru Du,Hao Frank Yang
机构: Johns Hopkins University (约翰霍普金斯大学); University of Virginia (弗吉尼亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at this https URL.
zh
[NLP-38] Rethinking LLM Human Simulation: When a Graph is What You Need
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在人类行为模拟任务中是否必要这一问题,特别是在离散选择场景下,是否存在更高效、可解释且性能相当的替代方案。其核心解决方案是提出Graph-basEd Models for human Simulation (GEMS),关键在于将离散选择模拟任务建模为图神经网络(Graph Neural Network, GNN)上的链接预测问题,通过利用关系知识进行建模,仅在必要时引入语言表示,从而在保持高精度的同时显著降低模型复杂度(比LLM小三个数量级),并提升效率、可解释性和透明度。
链接: https://arxiv.org/abs/2511.02135
作者: Joseph Suh,Suhong Moon,Serina Chang
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注: Code: this https URL
Abstract:Large language models (LLMs) are increasingly used to simulate humans, with applications ranging from survey prediction to decision-making. However, are LLMs strictly necessary, or can smaller, domain-grounded models suffice? We identify a large class of simulation problems in which individuals make choices among discrete options, where a graph neural network (GNN) can match or surpass strong LLM baselines despite being three orders of magnitude smaller. We introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete choice simulation tasks as a link prediction problem on graphs, leveraging relational knowledge while incorporating language representations only when needed. Evaluations across three key settings on three simulation datasets show that GEMS achieves comparable or better accuracy than LLMs, with far greater efficiency, interpretability, and transparency, highlighting the promise of graph-based modeling as a lightweight alternative to LLMs for human simulation. Our code is available at this https URL.
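"把离散选择模拟转化为链接预测"的核心打分结构可示意如下(这里用最简单的MLP对"个体—选项"对打分;真实GEMS的图构造与消息传递以原文为准):

```python
import torch
import torch.nn as nn

class PairwiseLinkScorer(nn.Module):
    """对二部图中"个体—选项"边打分的最小示意(结构为假设)。"""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, person_emb, option_embs):
        pairs = torch.cat([person_emb.expand(option_embs.size(0), -1), option_embs], dim=-1)
        return self.score(pairs).squeeze(-1)  # 每条候选边(每个选项)的logit

model = PairwiseLinkScorer(dim=32)
logits = model(torch.randn(1, 32), torch.randn(4, 32))
print(logits.softmax(-1))  # 该个体在4个离散选项上的预测分布
```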
zh
[NLP-39] InsurAg ent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance
【速读】: 该论文旨在解决美国高风险人群中洪水保险参与率低的问题,核心在于理解并建模影响保险决策的行为机制。其解决方案的关键在于提出一个名为InsurAgent的大型语言模型(Large Language Model, LLM)赋能的智能体,该智能体包含感知、检索、推理、行动和记忆五个模块;其中,检索模块利用检索增强生成(Retrieval-Augmented Generation, RAG)技术将决策锚定在实证调查数据上,实现对边际概率与联合概率的准确估计;推理模块则借助LLM常识进行外推,捕捉传统模型难以处理的情境信息;记忆模块支持模拟随时间演化的决策过程,从而为行为建模与政策分析提供可扩展的工具框架。
链接: https://arxiv.org/abs/2511.02119
作者: Ziheng Geng,Jiachen Liu,Ran Cao,Lu Cheng,Dan M. Frangopol,Minghui Cheng
机构: University of Miami (迈阿密大学); Hunan University (湖南大学); University of Illinois Chicago (伊利诺伊大学芝加哥分校); Lehigh University (利哈伊大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.
zh
[NLP-40] Deep Value Benchmark: Measuring Whether Models Generalize Deep values or Shallow Preferences NEURIPS2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在对齐人类价值观时存在的根本性问题:即模型是否真正学习了深层的价值观(如道德原则),还是仅仅捕捉了偏好数据中的表层特征(如语言风格)。这一区分对于人工智能对齐(AI alignment)至关重要,因为仅依赖浅层模式的模型可能在分布外场景中产生严重偏离人类意图的行为。解决方案的关键在于提出深度价值基准(Deep Value Benchmark, DVB),其核心创新是一种受控的混杂实验设计——在训练阶段人为制造深层价值与浅层特征之间的虚假相关性(例如用户偏好“非伤害性”且正式的语言选项),而在测试阶段打破这种相关性,从而精准测量模型的深度价值泛化率(Deep Value Generalization Rate, DVGR)。实验证明,所有测试模型的平均DVGR仅为0.30,表明它们未能有效泛化深层价值观,反而更倾向于依赖浅层特征,这揭示了当前LLMs在价值观理解上的局限性。
链接: https://arxiv.org/abs/2511.02109
作者: Joshua Ashkinaze,Hua Shen,Sai Avula,Eric Gilbert,Ceren Budak
机构: University of Michigan (密歇根大学); New York University Shanghai (纽约大学上海分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: NeurIPS 2025 (Spotlight)
Abstract:We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features – for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model’s Deep Value Generalization Rate (DVGR) – the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.
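DVGR 本身的计算很直接:在打破相关性的测试对中,统计模型选到"延续深层价值、但浅层特征相反"一侧的比例(下列数据结构为本文假设):

```python
def deep_value_generalization_rate(test_choices):
    """test_choices: 每项为 (模型实际选择, 深层价值一致的选项)。"""
    hits = sum(picked == deep for picked, deep in test_choices)
    return hits / len(test_choices)

# 训练期:用户总是偏好(非伤害性, 正式语言);测试期打破该相关
# 选项A=(非伤害性, 非正式语言),选项B=(公正, 正式语言)
choices = [("A", "A"), ("B", "A"), ("B", "A"), ("A", "A"), ("B", "A")]
print(deep_value_generalization_rate(choices))  # 0.4:低于0.5的随机水平,说明依赖了浅层特征
```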
zh
[NLP-41] LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS NEURIPS2025
【速读】: 该论文旨在解决对比一致性搜索(Contrast-Consistent Search, CCS)方法中目标函数机制不清晰、对随机初始化敏感的问题。其核心解决方案在于重新诠释CCS的目标为优化相对对比一致性(relative contrast consistency),并基于此将原方法重构为一个特征值问题(eigenproblem),从而获得闭式解(closed-form solutions),其中特征值具有可解释性,并自然扩展至多变量情形。这一改进不仅提升了方法的稳定性与可解释性,也为更广泛的探针(probing)和机制可解释性研究提供了新路径。
链接: https://arxiv.org/abs/2511.02089
作者: Stefan F. Schouten,Peter Bloem
机构: Vrije Universiteit Amsterdam (阿姆斯特丹自由大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025
Abstract:Contrast-Consistent Search (CCS) is an unsupervised probing method able to test whether large language models represent binary features, such as sentence truth, in their internal activations. While CCS has shown promise, its two-term objective has been only partially understood. In this work, we revisit CCS with the aim of clarifying its mechanisms and extending its applicability. We argue that what should be optimized for, is relative contrast consistency. Building on this insight, we reformulate CCS as an eigenproblem, yielding closed-form solutions with interpretable eigenvalues and natural extensions to multiple variables. We evaluate these approaches across a range of datasets, finding that they recover similar performance to CCS, while avoiding problems around sensitivity to random initialization. Our results suggest that relativizing contrast consistency not only improves our understanding of CCS but also opens pathways for broader probing and mechanistic interpretability methods.
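"把探针求解写成特征值问题、从而得到闭式解"的思路可以用一个玩具实验体会(注意:下面对差向量二阶矩做特征分解只是同一精神下的概念示意,并非论文的精确目标函数):

```python
import numpy as np

def contrast_probe_direction(acts_pos, acts_neg):
    """对("x为真" - "x为假")的激活差做闭式特征分解,取主方向作为探针。"""
    diffs = acts_pos - acts_neg
    second_moment = diffs.T @ diffs / len(diffs)
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # 闭式解,无随机初始化敏感性
    return eigvecs[:, -1], eigvals[-1]                # 特征值本身可解释为对比强度

rng = np.random.default_rng(0)
truth_dir = rng.normal(size=128)                      # 假想的"真值方向"
base = rng.normal(size=(200, 128))
noise = 0.5 * rng.normal(size=(200, 128))
acts_pos = base + 0.8 * truth_dir + noise             # 模拟"为真"陈述的内部激活
acts_neg = base - 0.8 * truth_dir - noise
w, lam = contrast_probe_direction(acts_pos, acts_neg)
print(round(float(abs(w @ truth_dir) / np.linalg.norm(truth_dir)), 3))  # 接近1:方向被恢复
```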
zh
[NLP-42] Regularization Through Reasoning : Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在分类任务中仅通过标签进行微调时性能受限的问题,核心假设是:在微调过程中为每个标签附加简短解释(explanation)是否能提升模型的准确性与可靠性。其解决方案的关键在于引入结构化的“解释增强”机制——无论是真实语义的解释还是语法不连贯但词汇一致的伪解释(如洗牌或词袋重构文本),均能显著优于仅使用标签的基线方法。实验表明,这种改进并非源于解释本身的语义内容,而是由于额外的token预算促使模型在中间层产生更丰富的计算表征(表现为激活熵升高),并在输出层形成更集中的预测分布,从而减少过拟合和决策捷径,实现更稳健的推理过程。
链接: https://arxiv.org/abs/2511.02044
作者: Vivswan Shah,Randy Cogill,Hanwei Yue,Gopinath Chennupati,Rinat Khaziev
机构: Amazon Central Analytics and Research Science (亚马逊中央分析与研究科学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on 5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset-task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.
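文中"语法不连贯但词汇一致"的两类伪解释可以这样构造(示意,与原数据管线无关):

```python
import random

def shuffled_pseudo_explanation(explanation, seed=0):
    """打乱真实解释的词序:保留词表与长度,破坏语法与语义。"""
    words = explanation.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def bag_of_words_pseudo_explanation(explanation, seed=0):
    """词袋重构:从原解释词表中有放回采样出等长词序列。"""
    words = explanation.split()
    return " ".join(random.Random(seed).choices(words, k=len(words)))

real = "the response stays on topic and directly answers the user question"
print(shuffled_pseudo_explanation(real))
print(bag_of_words_pseudo_explanation(real))
```

实验表明,即便是这类无语义的token序列,作为额外的token预算也能起到正则化作用,带来超过仅标签训练的增益。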
zh
[NLP-43] apOut: A Bandit-Based Approach to Dynamic Speculative Decoding
【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于推测解码(speculative decoding)技术时,如何动态确定最优推测 token 数量这一关键问题,以实现最大化的推理加速效果。现有方法依赖于手工调优的敏感阈值(如 token 熵),不仅设置成本高且在不同模型和任务间泛化能力差。其解决方案的关键在于提出 TapOut——一种无需训练、即插即用的在线算法,利用多臂赌博机(multi-armed bandits)机制,在多个无参数的动态推测策略之间进行智能选择,依据历史奖励与探索反馈自适应优化决策,从而在不进行任何超参数调优的情况下实现优于或相当主流基线的加速性能。
链接: https://arxiv.org/abs/2511.02017
作者: Aditya Sridhar,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa
机构: Cerebras Systems; University of Waterloo (滑铁卢大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 6 figures, 5 tables
Abstract:Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach’s effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
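元算法在若干无参数动态推测策略之间的选择可以用经典的UCB1示意(策略名与"奖励=被目标模型接受的草拟token比例"均为本文假设;TapOut的具体奖励设计以原文为准):

```python
import math, random

class StrategyBandit:
    """UCB1元算法示意:把每个动态推测策略视为一个臂,在线选择。"""
    def __init__(self, strategies):
        self.strategies = strategies
        self.counts = [0] * len(strategies)
        self.rewards = [0.0] * len(strategies)

    def select(self):
        for i, c in enumerate(self.counts):
            if c == 0:
                return i  # 每个臂先各试一次
        t = sum(self.counts)
        ucb = [self.rewards[i] / self.counts[i] + math.sqrt(2 * math.log(t) / self.counts[i])
               for i in range(len(self.strategies))]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm] += reward

random.seed(0)
true_quality = {"entropy_free": 0.62, "fixed_k": 0.45, "confidence": 0.55}  # 虚构的期望奖励
bandit = StrategyBandit(list(true_quality))
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, random.gauss(true_quality[bandit.strategies[arm]], 0.1))
print(bandit.strategies[max(range(3), key=bandit.counts.__getitem__)])  # 大概率收敛到 entropy_free
```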
zh
[NLP-44] Retrieval-Augmented Multimodal Depression Detection
【速读】: 该论文旨在解决当前多模态深度学习在抑郁症检测中面临的计算成本高、领域适配性差以及情感知识静态化等问题。其解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的新框架:给定抑郁相关文本后,通过从情感数据集中检索语义相关的表情内容,并利用大语言模型(Large Language Model, LLM)生成一个情感提示(Emotion Prompt)作为辅助模态,从而增强情感表征能力并提升模型可解释性。该方法在AVEC 2019数据集上实现了最先进的性能,相关系数(CCC)达0.593,平均绝对误差(MAE)为3.95。
链接: https://arxiv.org/abs/2511.01892
作者: Ruibo Hou,Shiyu Teng,Jiaqing Liu,Shurong Chai,Yinhao Li,Lanfen Lin,Yen-Wei Chen
机构: Ritsumeikan University (立命馆大学); Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted in IEEE EMBC 2025
Abstract:Multimodal deep learning has shown promise in depression detection by integrating text, audio, and video signals. Recent work leverages sentiment analysis to enhance emotional understanding, yet suffers from high computational cost, domain mismatch, and static knowledge limitations. To address these issues, we propose a novel Retrieval-Augmented Generation (RAG) framework. Given a depression-related text, our method retrieves semantically relevant emotional content from a sentiment dataset and uses a Large Language Model (LLM) to generate an Emotion Prompt as an auxiliary modality. This prompt enriches emotional representation and improves interpretability. Experiments on the AVEC 2019 dataset show our approach achieves state-of-the-art performance with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.
zh
[NLP-45] Multi-Personality Generation of LLM s at Decoding-time WSDM2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中多性格生成(Multi-Personality Generation, MPG)的核心挑战,即如何在解码阶段同时实现多个个性化属性的灵活控制,而无需依赖昂贵的再训练或外部模型。现有方法要么计算成本高、扩展性差,要么受限于启发式规则或额外模型,导致灵活性和鲁棒性不足。其解决方案的关键在于提出一种基于解码时组合范式的新型MPG框架,利用单维度模型中的隐式密度比(implicit density ratios)作为“免费午餐”,将多性格生成重构为从目标策略中采样问题,从而避免了对稀缺多维模型或额外训练的需求;进一步地,设计了基于推测性分块拒绝采样(Speculative Chunk-level based Rejection sampling, SCR)机制,在滑动窗口内并行验证分块响应,显著降低计算开销的同时保持高质量输出。
链接: https://arxiv.org/abs/2511.01891
作者: Rongxin Chen,Yunfan Li,Yige Yuan,Bingbing Xu,Huawei Shen
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: WSDM2026
Abstract:Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a “free lunch” to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at this https URL .
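"利用单维度模型的隐式密度比在解码时组合"可以写成如下通用形式(该聚合式是解码时组合的常见写法,MPG的目标策略与SCR采样细节以论文为准):

```python
import torch

def multi_personality_logits(base_logits, personality_logits):
    """log p_target ∝ log p_base + Σ_i (log p_i − log p_base):
    每个括号项即单维人格模型相对基座模型的隐式密度比。"""
    out = base_logits.clone()
    for pl in personality_logits:
        out = out + (pl - base_logits)
    return out

vocab = 8
base = torch.randn(vocab)                          # 基座模型的下一token logits
mbti_introvert = base + 0.3 * torch.randn(vocab)   # 假设的单维人格模型logits
role_detective = base + 0.3 * torch.randn(vocab)
combined = multi_personality_logits(base, [mbti_introvert, role_detective])
print(combined.softmax(-1))  # 目标策略的下一token分布(再接分块拒绝采样做并行验证)
```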
zh
[NLP-46] CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
【速读】: 该论文旨在解决当前生成式 AI 在 CUDA 内核(CUDA kernel)自动设计与优化中效率低下、计算开销高且泛化能力差的问题。现有方法往往依赖大量训练数据或昂贵的硬件资源,难以在不同 GPU 架构和基础模型之间保持性能一致性。其解决方案的关键在于提出一种无需训练的多智能体工作流 CudaForge,该流程由两个 LLM 智能体——“编码器”(Coder)和“裁判”(Judge)协同迭代完成:前者生成初始内核代码,后者基于 Nsight Compute(NCU)等硬件反馈进行验证与优化,从而模拟人类专家的迭代改进过程。实验表明,CudaForge 在保持 97.6% 正确率的同时实现平均 1.68× 的加速比,并在多种 GPU 和大模型上展现出优异的泛化能力,同时显著降低 API 成本(仅需约 $0.3)和时间开销(约 26.5 分钟/内核),优于当前最先进的 agentic 方法(如 OpenAI-o3 和 Kevin)。
链接: https://arxiv.org/abs/2511.01884
作者: Zijian Zhang,Rong Wang,Shiyang Li,Yuebo Luo,Mingyi Hong,Caiwen Ding
机构: University of Minnesota, Twin Cities (明尼苏达大学双城分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents: a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels, while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3, achieves 97.6% correctness of generated kernels and an average 1.68× speedup over PyTorch baselines, substantially surpassing state-of-the-art models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4, QwQ-32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX6000 and incurs about $0.3 in API cost, which is significantly cheaper than existing agentic work that costs 6 H100 hours and $5 in API cost per kernel. Our results highlight that multi-agent, training-free workflows can enable cost-effective, generalizable, and high-performance CUDA kernel optimization. Code available at this https URL
zh
[NLP-47] SciDaSynth: Interactive Structured Data Extraction from Scientific Literature with Large Language Model
【速读】: 该论文旨在解决科学文献爆炸式增长背景下,如何高效且准确地从多模态、异构且不一致的文档中提取结构化数据的问题(即“多模态信息结构化提取难题”)。其解决方案的关键在于提出了一种名为SciDaSynth的交互式系统,该系统基于大语言模型(Large Language Models, LLMs),能够根据用户查询自动整合文本、表格和图表等异源信息,生成标准化的数据表,并通过多维可视化摘要与语义分组功能实现跨文档数据一致性校验与优化,从而显著提升结构化数据提取的质量与效率。
链接: https://arxiv.org/abs/2404.13765
作者: Xingbo Wang,Samantha L. Huey,Rui Sheng,Saurabh Mehta,Fei Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Preprint version of the paper accepted to Campbell Systematic Reviews. Code is available at this https URL
Abstract:The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that automatically generates structured data tables according to users’ queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates SciDaSynth’s effectiveness in producing high-quality structured data more efficiently than baseline methods. We discuss design implications for human-AI collaborative systems supporting data extraction tasks. The system code is available at this https URL
zh
[NLP-48] Complete asymptotic type-token relationship for growing complex systems with inverse power-law count rankings
【速读】: 该论文旨在解决复杂系统中类型-标记关系(type-token relationship)的理论推导问题,特别是如何从Zipf定律(即类型计数与类型排名呈幂律关系 $ S \sim r^{-\alpha} $)出发,精确刻画类型数量随总标记数增长的渐近行为。其解决方案的关键在于构建一个理想化的确定性增长模型,该模型能够生成任意幂指数 α 对应的类型计数排名,并通过纯数学分析直接推导出类型-标记关系的统一渐近表达式,从而避免了以往研究中对随机机制或抽样过程的依赖,且修正了此前在 α=1 和 α≫1 特殊情形下的不准确结果。这一方法表明,一般性的类型-标记关系可仅由Zipf定律所决定。
链接: https://arxiv.org/abs/2511.02069
作者: Pablo Rosillo-Rodes,Laurent Hébert-Dufresne,Peter Sheridan Dodds
机构: 未知
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
备注: 5 pages, 2 figures
Abstract:The growth dynamics of complex systems often exhibit statistical regularities involving power-law relationships. For real finite complex systems formed by countable tokens (animals, words) as instances of distinct types (species, dictionary entries), an inverse power-law scaling $S \sim r^{-\alpha}$ between type count $S$ and type rank $r$, widely known as Zipf’s law, is widely observed to varying degrees of fidelity. A secondary, summary relationship is Heaps’ law, which states that the number of types scales sublinearly with the total number of observed tokens present in a growing system. Here, we propose an idealized model of a growing system that (1) deterministically produces arbitrary inverse power-law count rankings for types, and (2) allows us to determine the exact asymptotics of the type-token relationship. Our argument improves upon and remedies earlier work. We obtain a unified asymptotic expression for all values of $\alpha$, which corrects the special cases of $\alpha = 1$ and $\alpha \gg 1$. Our approach relies solely on the form of count rankings, avoids unnecessary approximations, and does not involve any stochastic mechanisms or sampling processes. We thereby demonstrate that a general type-token relationship arises solely as a consequence of Zipf’s law.
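由Zipf计数排名导出类型-标记关系的经典启发式估计可概括如下(注意这只是传统论证的骨架;论文给出的统一渐近式正是对 $\alpha = 1$ 与 $\alpha \gg 1$ 等特殊情形作了修正):

```latex
% 设共 n 个类型,计数 S(r) = C r^{-\alpha},最小计数 S(n) \approx 1,故 C \approx n^{\alpha}。
\[
  M \;=\; \sum_{r=1}^{n} C\,r^{-\alpha}
  \;\sim\;
  \begin{cases}
    \zeta(\alpha)\,n^{\alpha}, & \alpha > 1
      \;\Rightarrow\; n \sim M^{1/\alpha} \quad(\text{Heaps 指数 } 1/\alpha),\\[4pt]
    n \ln n, & \alpha = 1
      \;\Rightarrow\; n \sim M/\ln M,\\[4pt]
    n/(1-\alpha), & \alpha < 1
      \;\Rightarrow\; n \sim (1-\alpha)\,M \quad(\text{近线性}).
  \end{cases}
\]
```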
zh
计算机视觉
[CV-0] WIST2: Scalable Portable and Holistic Humanoid Data Collection System
【速读】:该论文旨在解决当前人形机器人(humanoid robotics)在数据收集方面缺乏高效、可扩展框架的问题,尤其针对现有远程操作(teleoperation)系统要么采用解耦控制导致动作不自然,要么依赖昂贵的动作捕捉(motion capture, mocap)设备而难以推广的局限性。其解决方案的关键在于提出TWIST2系统——一个便携式、无需动作捕捉的远程操作与数据采集平台,通过PICO4U VR设备实时获取全身人类运动,并结合自研的2自由度(2-DoF)低成本机器人颈部结构实现第一人称视角(egocentric vision),从而在保持完整全身控制的同时显著提升数据采集效率和系统可扩展性。该方案支持以接近100%成功率在15分钟内完成100次示范数据采集,并进一步构建了基于第一人称视觉的分层视觉-运动策略(hierarchical visuomotor policy),实现了复杂灵巧操作与动态踢腿等全身体协调任务的自主控制。
链接: https://arxiv.org/abs/2511.02832
作者: Yanjie Ze,Siheng Zhao,Weizhuo Wang,Angjoo Kanazawa,Rocky Duan,Pieter Abbeel,Guanya Shi,Jiajun Wu,C. Karen Liu
机构: Amazon FAR; Stanford University; USC; UC Berkeley; CMU
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Website: this https URL
Abstract:Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills, and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at this https URL . Our collected dataset is also open-sourced at this https URL .
zh
[CV-1] Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
【速读】:该论文旨在解决人类头部图像中高质量密集对应关系(dense correspondences)的建模问题,尤其在姿态变化和个体差异下保持一致性与完整性。解决方案的关键在于提出一种名为DenseMarks的新颖可学习表示方法:通过Vision Transformer网络为每张二维人头图像的每个像素预测一个三维嵌入(embedding),该嵌入映射到一个3D规范单位立方体(canonical unit cube)中的空间位置,从而构建一个可解释且可查询的规范空间。训练过程中利用基于状态最优点追踪器生成的成对点匹配数据,并采用对比损失(contrastive loss)引导匹配点具有相近嵌入;同时引入多任务学习约束(如人脸关键点和分割监督)以及通过潜在立方体特征强制嵌入的空间连续性,最终实现对整个头部区域(包括头发)鲁棒的几何感知匹配与单目头部跟踪。
链接: https://arxiv.org/abs/2511.02830
作者: Dmitrii Pozdeev,Alexey Artemov,Ananta R. Bhattarai,Artem Sevastopolsky
机构: Technical University of Munich (TUM); University of Bielefeld
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL . Video: this https URL . 21 pages, 13 figures, 2 tables
Abstract:We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
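"促使匹配点嵌入相近"的对比损失可以用InfoNCE风格实现(示意;温度与批内负样本构造均为本文假设,非原论文实现):

```python
import torch
import torch.nn.functional as F

def point_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a[i] 与 emb_b[i] 是点追踪器给出的同一3D点在两帧中的像素嵌入;
    批内其余点充当负样本(对称InfoNCE)。"""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature    # (N, N) 相似度矩阵
    targets = torch.arange(a.size(0))   # 对角线即正确匹配
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

emb_a, emb_b = torch.randn(256, 3), torch.randn(256, 3)  # 3D规范立方体中的嵌入
print(point_contrastive_loss(emb_a, emb_b).item())
```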
zh
[CV-2] PLUTO-4: Frontier Pathology Foundation Models
【速读】:该论文旨在解决当前病理学基础模型在跨任务迁移能力、部署灵活性与性能上限之间的平衡问题。其解决方案的关键在于构建PLUTO-4系列视觉Transformer模型,通过两种互补架构实现优化:一是轻量高效的PLUTO-4S模型,采用FlexiViT结构和2D-RoPE嵌入,适配多尺度部署需求;二是前沿规模的PLUTO-4G模型,基于单一图像块尺寸训练以最大化表征能力和稳定性。两者均在包含55万张全切片图像(Whole Slide Images, WSI)的大规模多机构数据集上进行自监督预训练,从而在多种病理任务中实现卓越性能,包括病灶级分类、分割及整片诊断,显著提升临床转化潜力。
链接: https://arxiv.org/abs/2511.02826
作者: Harshith Padigela,Shima Nofallah,Atchuth Naveen Chilaparasetti,Ryun Han,Andrew Walker,Judy Shen,Chintan Shah,Blake Martin,Aashish Sood,Elliot Miller,Ben Glass,Andy Beck,Harsha Pokkalla,Syed Ashar Javed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4’s potential to transform real-world applications as a backbone for translational research and diagnostic use cases.
zh
[CV-3] AI-Generated Image Detection: An Empirical Study and Future Research Directions
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 伪造媒体(如深度伪造)对多媒体取证、虚假信息检测及生物特征识别系统带来的挑战,这些问题已导致公众对司法体系信任度下降、欺诈事件激增以及社会工程攻击增多。现有取证方法存在三大关键局限:非标准化基准测试(使用GAN或扩散模型生成的图像)、训练协议不一致(如从头训练、冻结权重或微调),以及评估指标不足,难以衡量泛化能力和可解释性。为应对这些问题,论文提出一个统一的基准测试框架,在受控且可复现的条件下系统评估十种前沿取证方法(涵盖不同训练策略)和七个公开数据集(包括GAN与扩散模型生成的数据)。其核心解决方案在于通过多维度性能指标(准确率、平均精度、ROC-AUC、错误率、类别敏感性)和可解释性分析(置信度曲线与Grad-CAM热力图)实现全面、公平的对比,从而揭示当前方法在分布内表现优异但跨模型迁移能力差的问题,推动更鲁棒、泛化性强且具备可解释性的下一代取证技术发展。
链接: https://arxiv.org/abs/2511.02791
作者: Nusrat Tasnim,Kutub Uddin,Khalid Mahmood Malik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
备注:
Abstract:The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) and seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
zh
[CV-4] When Visualizing is the First Step to Reasoning : MIRA a Benchmark for Visual Chain-of-Thought
【速读】:该论文旨在解决当前多模态大语言模型在复杂推理任务中依赖纯文本提示(Chain-of-Thought, CoT)时表现不佳的问题,尤其是在需要生成和利用中间视觉图像(如草图、结构图或路径图)才能有效推理的场景下。解决方案的关键在于提出MIRA基准,这是一个包含546个多模态问题的高质量数据集,每个问题均配有中间视觉图像和最终答案,并设计了三层次评估协议:仅输入图像与问题、纯文本CoT输入结合图像和思维提示、以及Visual-CoT输入(包含标注的图像线索和文本思维提示)。实验表明,当引入中间视觉提示时,模型性能平均提升33.7%,显著优于仅使用文本提示的方法,凸显了想象中的视觉信息在复杂推理中的关键作用。
链接: https://arxiv.org/abs/2511.02779
作者: Yiyang Zhou,Haoqin Tu,Zijun Wang,Zeyu Wang,Niklas Muennighoff,Fan Nie,Yejin Choi,James Zou,Chaorui Deng,Shen Yan,Haoqi Fan,Cihang Xie,Huaxiu Yao,Qinghao Ye
机构: ByteDance(字节跳动); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 15 figures
Abstract:We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through “drawing to think”. To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
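文中的 pass@k 可用常见的无偏估计式 $\text{pass@}k = 1 - \binom{n-c}{k}/\binom{n}{k}$ 计算($n$ 为采样数、$c$ 为其中正确数;此处沿用该通行定义,MIRA 的具体设置以论文为准):

```python
from math import comb

def pass_at_k(n, c, k):
    """无偏 pass@k 估计:1 - C(n-c, k) / C(n, k)。"""
    if n - c < k:
        return 1.0  # 错误样本不足k个,任取k个必含正确解
    return 1.0 - comb(n - c, k) / comb(n, k)

# 假设某题采样 n=16 次,其中 c=3 次答对
for k in (1, 4, 8):
    print(k, round(pass_at_k(16, 3, k), 3))  # 0.188 / 0.607 / 0.9
```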
zh
[CV-5] PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction Editing ATC WWW
【速读】:该论文旨在解决单图像三维人脸重建(single-image 3D head reconstruction)与语义三维编辑(semantic 3D editing)两大挑战性问题,其核心难点在于严重的视角遮挡、弱感知监督信号以及三维空间中编辑的不确定性。解决方案的关键在于提出了一种统一的基础模型PercHead,该模型采用双分支编码器结合基于视觉Transformer(ViT)的解码器,通过迭代交叉注意力机制将二维特征提升至三维空间,并利用高斯点绘(Gaussian Splatting)进行渲染;同时设计了一种基于DINOv2和SAM2.1的新颖感知监督策略,提供几何与外观保真度的强泛化信号,从而显著提升新视角合成质量及极端视角下的鲁棒性。此外,该模型可通过替换编码器并微调网络实现语义编辑,借助分割图控制几何结构、文本或参考图像指定外观,实现了直观且强大的交互式三维编辑能力。
链接: https://arxiv.org/abs/2511.02777
作者: Antonio Oroz,Matthias Nießner,Tobias Kirschstein
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Video: this https URL
Abstract:We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: this https URL Video: this https URL
zh
[CV-6] Dynamic Reflections: Probing Video Representations with Text Alignment
【速读】:该论文旨在解决视频-文本表示对齐(video-text representation alignment)问题,即如何通过跨模态对齐来揭示不同视频与语言编码器在时空数据上的表征能力差异。其解决方案的关键在于:首先,提出参数化测试时扩展定律(parametric test-time scaling laws),量化静态图像与多帧视频、单句描述与集合文本等不同丰富度的数据输入对当前先进视频编码器性能的影响;其次,系统性地验证语义对齐程度与下游任务(包括语义与非语义任务)性能之间的相关性,初步表明强对齐能力可能预示着通用视频表征能力;最后,将时间推理能力与跨模态对齐关联起来,为视觉-语言模型提供了一个具有挑战性的评估基准。
链接: https://arxiv.org/abs/2511.02767
作者: Tyler Zhu,Tengda Han,Leonidas Guibas,Viorica Pătrăucean,Maks Ovsjanikov
机构: Princeton University (普林斯顿大学); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 12 figures
Abstract:The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at this https URL
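衡量“视频表征与文本表征是否对齐”最直接的做法之一,是把两侧特征归一化后做互检索。下面给出一个 Recall@1 的最小示例(仅为示意,论文实际使用的对齐度量与测试时扩展定律请以原文为准):

```python
import torch
import torch.nn.functional as F

def retrieval_recall_at_1(video_emb, text_emb):
    """video_emb / text_emb: (N, D),第 i 行互为配对。返回视频->文本 R@1。"""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    pred = (v @ t.T).argmax(dim=1)          # 每个视频检索最相近的文本
    return (pred == torch.arange(len(v))).float().mean().item()

# 随机特征作占位;真实场景中替换为视频编码器与文本编码器的输出
print(retrieval_recall_at_1(torch.randn(100, 512), torch.randn(100, 512)))
```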
zh
[CV-7] LLEXICORP: End-user Explainability of Convolutional Neural Networks
【速读】:该论文旨在解决现有概念相关性传播(Concept Relevance Propagation, CRP)方法在可解释人工智能(XAI)中面临的两大问题:一是专家需手动分析激活图像以命名发现的概念,二是需人工合成冗长的解释文本,导致解释过程效率低、可扩展性差且难以面向不同受众。解决方案的关键在于提出LLEXICORP——一个将CRP与多模态大语言模型(Multimodal Large Language Model, MLLM)集成的模块化流水线,通过设计特定提示(prompt)使语言模型理解CRP语义并分离命名与解释任务,从而自动为概念原型分配描述性名称,并生成结构化的自然语言解释,既保证解释的忠实性(faithfulness),又能按需输出面向技术专家或非技术人员的不同层次说明,显著降低深度神经网络解释的门槛。
链接: https://arxiv.org/abs/2511.02720
作者: Vojtěch Kůr,Adam Bajger,Adam Kukučka,Marek Hradil,Vít Musil,Tomáš Brázdil
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.
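文中强调的“命名与解释分离”可以抽象成两段式提示流程。下面的骨架仅作示意:`call_vlm` 是假设性的多模态大模型接口,提示词也是笔者拟写的示例,并非 LLEXICORP 的原始提示:

```python
def call_vlm(prompt: str, images: list) -> str:
    """假设性的 VLM 调用封装,实际可替换为任意多模态大模型 SDK。"""
    return "[VLM 输出占位]"

def name_concept(prototype_images):
    # 第一步只负责“命名”:给强烈激活同一通道的原型图起一个简短名字
    prompt = ("以下图像都强烈激活 CNN 的同一通道。"
              "请用不超过 5 个词概括它们共同的视觉概念。")
    return call_vlm(prompt, prototype_images)

def explain_prediction(concept_names, relevances, image):
    # 第二步只负责“解释”:把定量的相关性分布翻译成自然语言叙述
    table = "\n".join(f"- {n}: 相关性 {r:.1%}"
                      for n, r in zip(concept_names, relevances))
    prompt = (f"模型对这张图的预测主要依赖以下概念:\n{table}\n"
              "请面向非技术读者,用 2-3 句话解释模型为何如此预测。")
    return call_vlm(prompt, [image])

print(explain_prediction(["条纹", "尖耳"], [0.42, 0.31], image=None))
```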
zh
[CV-8] VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
【速读】:该论文旨在解决视频中情感理解与预测的难题,特别是针对情绪的动态性和依赖线索(cues)的特性,这使得复杂且不断演变的情感状态难以被合理解释。其解决方案的关键在于提出了一种新颖的情感线索引导推理框架(affective cues-guided reasoning framework),通过分阶段统一基础属性感知、表情分析与高层情感理解,并引入专为情感推理和指令遵循设计的视频情感基础模型(VidEmo)。该模型采用两阶段微调策略:首先进行课程化情感学习以注入情感知识,随后通过情感树强化学习(affective-tree reinforcement learning)提升推理能力;同时构建了包含210万条多样化指令样本的细粒度情感数据集(Emo-CFG),支持可解释的情感问答、细粒度描述及推理依据,从而显著提升了在15项人脸感知任务上的性能表现。
链接: https://arxiv.org/abs/2511.02712
作者: Zhicheng Zhang,Weicheng Wang,Yongjie Zhu,Wenyu Qin,Pengfei Wan,Di Zhang,Jufeng Yang
机构: Nankai University (南开大学); Pengcheng Laboratory (鹏城实验室); Kuaishou Technology (快手科技); Nankai International Advanced Research Institute (南开大学国际先进研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 26 figures
Abstract:Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce a emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
zh
[CV-9] Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification
【速读】:该论文旨在解决可见光-红外行人重识别(Visible-Infrared Person Re-Identification, VI-ReID)中跨模态特征对齐困难的问题,尤其是由于可见光与红外模态间固有差异导致的特征分布不一致问题。现有方法通常依赖中间表示(如生成中间图像或融合中间特征)进行跨模态对齐,但这些方法要么引入额外参数,要么缺乏可解释性,未能充分利用中间特征的有效信息。其解决方案的关键在于提出一种基于模态转换表示学习(Modality-Transition Representation Learning, MTRL)的新框架,利用一个中间生成图像作为从可见光到红外模态的“传输器”,该中间图像在语义上与原始可见光图像高度一致,同时在特征空间上接近红外模态;并通过模态转换对比损失(modality-transition contrastive loss)和模态查询正则化损失(modality-query regularization loss)进行联合优化,从而实现更有效的跨模态特征对齐。该方法无需增加额外参数,保持与主干网络相同的推理速度,且在三个典型VI-ReID数据集上显著优于当前最优方法(SOTA)。
链接: https://arxiv.org/abs/2511.02685
作者: Chao Yuan,Zanwu Liu,Guiwei Zhang,Haoxuan Xu,Yujian Zhao,Guanglin Niu,Bo Li
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visible-infrared person re-identification (VI-ReID) techniques can associate pedestrian images across visible and infrared modalities in practical scenarios with background illumination changes. However, a substantial gap inherently exists between these two modalities. Besides, existing methods primarily rely on intermediate representations to align cross-modal features of the same person. These intermediate representations are usually created by generating intermediate images (a kind of data augmentation) or by fusing intermediate features (more parameters, lack of interpretability), and they do not make good use of the intermediate features. Thus, we propose a novel VI-ReID framework via Modality-Transition Representation Learning (MTRL) with a middle generated image as a transmitter from the visible to the infrared modality, which is fully aligned with the original visible images and similar to the infrared modality. The framework is then trained with a modality-transition contrastive loss and a modality-query regularization loss, which align the cross-modal features more effectively. Notably, our proposed framework does not need any additional parameters, achieving the same inference speed as the backbone while improving its performance on the VI-ReID task. Extensive experimental results illustrate that our model significantly and consistently outperforms existing SOTAs on three typical VI-ReID datasets.
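摘要中“模态转换对比损失”的大致形态,可参考下面这个 InfoNCE 式的示意实现:以中间生成图像的特征为锚点、同一身份的红外特征为正样本(温度等超参数为举例,非论文原始设置):

```python
import torch
import torch.nn.functional as F

def transition_contrastive_loss(mid_feat, ir_feat, temperature=0.07):
    """mid_feat: 中间生成图像特征 (B, D);ir_feat: 红外特征 (B, D)。
    约定第 i 行互为同一行人(正样本对),其余均为负样本。"""
    mid = F.normalize(mid_feat, dim=-1)
    ir = F.normalize(ir_feat, dim=-1)
    logits = mid @ ir.T / temperature                 # (B, B) 相似度
    targets = torch.arange(mid.size(0), device=mid.device)
    # 对称式对比:中间->红外 与 红外->中间 两个方向取平均
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

print(transition_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```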
zh
[CV-10] Differentiable Hierarchical Visual Tokenization NEURIPS2025
【速读】:该论文旨在解决视觉Transformer(Vision Transformer)中固定图像块令牌(patch tokens)忽略图像空间和语义结构的问题。其解决方案的关键在于提出一种端到端可微的分层tokenizer,能够以像素级粒度自适应地调整图像内容表示,同时保持与现有架构的向后兼容性,从而支持对预训练模型的无缝改造,并在图像分类和密集预测任务中实现竞争性性能,甚至可直接实现栅格到矢量的转换。
链接: https://arxiv.org/abs/2511.02652
作者: Marius Aasan,Martine Hjelkrem-Tan,Nico Catalano,Changkyu Choi,Adín Ramírez Rivera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 Spotlight
Abstract:Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
zh
[CV-11] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
链接: https://arxiv.org/abs/2511.02650
作者: Tianfan Peng,Yuntao Du,Pengzhou Ji,Shijie Dong,Kailin Jiang,Mingchuan Ma,Yijun Tian,Jinhe Bi,Qian Li,Wei Du,Feng Xiao,Lizhen Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-12] Robust Face Liveness Detection for Biometric Authentication using Single Image
【速读】:该论文旨在解决面部识别系统(Face Recognition Systems, FRS)在面对呈现攻击(presentation attacks,又称spoofing)时的脆弱性问题,此类攻击包括打印/显示攻击(print/display)、视频回放攻击(video)和包裹攻击(wrap)。为应对这一挑战,论文提出了一种轻量级卷积神经网络(CNN)框架,其关键在于通过设计鲁棒的特征提取架构实现高效的活体检测(liveness detection),能够在CPU上实现1-2秒内的快速生物特征认证,同时具备对多种2D spoof攻击类型的识别能力。该方案还配套构建了一个包含60名受试者超过500段视频的新颖2D欺骗攻击数据集,用于验证模型的有效性。
链接: https://arxiv.org/abs/2511.02645
作者: Poulami Raha,Yeongnam Chae
机构: Rakuten Group Inc. (乐天集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Biometric technologies are widely adopted in security, legal, and financial systems. Face recognition can authenticate a person based on unique facial features such as shape and texture. However, recent works have demonstrated the vulnerability of Face Recognition Systems (FRS) towards presentation attacks. Using spoofing (a.k.a. presentation attacks), a malicious actor can get illegitimate access to secure systems. This paper proposes a novel light-weight CNN framework to identify print/display, video and wrap attacks. The proposed robust architecture provides seamless liveness detection, ensuring faster biometric authentication (1-2 seconds on CPU). Further, we also present a newly created 2D spoof attack dataset consisting of more than 500 videos collected from 60 subjects. To validate the effectiveness of this architecture, we provide a demonstration video depicting print/display, video and wrap attack detection approaches. The demo can be viewed at the following link: this https URL
zh
[CV-13] Zero-Shot Multi-Animal Tracking in the Wild
【速读】:该论文旨在解决多动物跟踪(multi-animal tracking)在不同物种、栖息地和运动模式下难以泛化的问题,传统方法通常需要针对每个应用场景进行大量模型微调和启发式设计。其解决方案的关键在于利用近期视觉基础模型(vision foundation models),构建一个无需重新训练或超参数调整的零样本(zero-shot)跟踪框架:通过将Grounding Dino目标检测器与Segment Anything Model 2(SAM 2)追踪器相结合,并辅以精心设计的启发式规则,实现了对新数据集的即插即用式应用,在ChimpAct、Bird Flock Tracking、AnimalTrack及GMOT-40子集等多个基准上展现出一致且强大的性能。
链接: https://arxiv.org/abs/2511.02591
作者: Jan Frederik Meier,Timo Lüddecke
机构: University of Göttingen (哥廷根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at this https URL.
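这类“开放词汇检测 + 视频跟踪 + 启发式”的零样本流水线,可抽象成如下骨架。注意 `detect`、`propagate` 均为假设性封装(分别对应 Grounding DINO 与 SAM 2 一类组件),轨迹 ID 管理等细节从略:

```python
def iou(a, b):
    """两个 [x1, y1, x2, y2] 框的交并比。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def zero_shot_track(frames, detect, propagate, redetect_every=30, iou_thr=0.3):
    """detect(frame) -> [box, ...];propagate(frames, boxes) -> {id: 逐帧box}。
    两者均为假设性接口;阈值与重检测周期也只是示意值。"""
    tracks = propagate(frames, detect(frames[0]))
    for t in range(redetect_every, len(frames), redetect_every):
        for box in detect(frames[t]):
            # 启发式:与现有轨迹在当前帧均不重叠,才视为新个体补充跟踪
            if all(iou(box, tr[t]) < iou_thr for tr in tracks.values()):
                tracks.update(propagate(frames[t:], [box]))  # 省略 ID 去重
    return tracks
```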
zh
[CV-14] TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
【速读】:该论文旨在解决文本到图像扩散模型在专业应用场景中因仅输出单一扁平图像而难以实现图层级控制的问题。现有方法要么依赖大规模、难以获取的数据集进行微调,要么虽无需训练但仅能生成孤立前景元素,无法构建完整且连贯的场景。其解决方案的关键在于提出一种名为“噪声移植与培育扩散模型”(TAUE)的零样本图层生成框架,核心创新为“噪声移植与培育”(Noise Transplantation and Cultivation, NTC)技术:通过提取前景和合成生成过程中的中间潜在表示,并将其移植到初始噪声中用于后续图层生成,从而确保前景、背景及合成图层之间的语义与结构一致性,实现无需微调即可生成高质量、多图层协同一致的图像。
链接: https://arxiv.org/abs/2511.02580
作者: Daichi Nagai,Ryugo Morita,Shunsuke Kitada,Hitoshi Iyatomi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: this https URL
Abstract:Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
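“噪声移植”的核心操作,本质上是把供体潜变量按一定强度混入新图层的初始噪声,再拉回标准高斯附近以不破坏扩散模型的噪声假设。以下为示意实现(`alpha`、掩码用法均为假设):

```python
import torch

def transplant_noise(init_noise, donor_latent, mask=None, alpha=0.6):
    """init_noise: 新图层初始噪声 (B, C, H, W);donor_latent: 从前景/合成
    生成过程提取的中间潜变量(同形状);mask: 可选空间掩码,仅在指定
    区域移植;alpha: 移植强度(示意值)。"""
    blended = alpha * donor_latent + (1 - alpha) * init_noise
    if mask is not None:                  # mask: (B, 1, H, W),取值 [0, 1]
        blended = mask * blended + (1 - mask) * init_noise
    # 重新标准化,使统计量接近标准高斯
    return (blended - blended.mean()) / (blended.std() + 1e-6)

z = transplant_noise(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
print(z.shape, round(float(z.std()), 3))
```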
zh
[CV-15] A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
【速读】:该论文旨在解决跨被试通用的脑解码(subject-agnostic brain decoding)问题,即在无需针对每个受试者进行单独训练的情况下,从功能性磁共振成像(fMRI)信号中重建连续的视觉体验。这一问题的关键挑战在于如何实现跨个体的泛化能力以及处理大脑信号的高度复杂性。解决方案的核心是提出一种名为视觉皮层流架构(Visual Cortex Flow Architecture, VCFlow)的分层解码框架,该框架显式建模人类视觉系统的腹侧-背侧通路(ventral-dorsal architecture),通过解耦并利用早期视觉皮层、腹侧流和背侧流的特征,捕获多样且互补的认知信息;同时引入特征级对比学习策略(feature-level contrastive learning),增强对跨被试不变语义表征的提取,从而显著提升对未见过被试的适用性。相较传统方法需要超过12小时的每被试数据和大量计算资源,VCFlow仅损失平均7%的重建精度,即可在无重训练情况下于10秒内生成视频,具备临床可扩展性。
链接: https://arxiv.org/abs/2511.02565
作者: Jingyu Lu,Haonan Wang,Qixiang Zhang,Xiaomeng Li
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages main text with 6 figures (excluding references), supplementary material included
Abstract:Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
zh
[CV-16] Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification
【速读】:该论文旨在解决跨视角视频行人重识别(Cross-View Video Person Re-identification, CVReID)中的关键挑战,包括极端视角变化、尺度差异以及时间不一致性等问题。其解决方案的核心在于提出一种参数高效的框架MTF-CVReID,通过在ViT-B/16骨干网络基础上引入七个互补模块:包括用于校正相机与视角偏差的交叉流特征归一化(Cross-Stream Feature Normalization, CSFN)、多分辨率特征调和(Multi-Resolution Feature Harmonization, MRFH)以稳定不同高度下的尺度变化、身份感知记忆模块(Identity-Aware Memory Module, IAMM)强化持久身份特征、时序动态建模(Temporal Dynamics Modeling, TDM)实现运动感知的短期时序编码、视图间特征对齐(Inter-View Feature Alignment, IVFA)构建视角不变表示、分层时序模式学习(Hierarchical Temporal Pattern Learning, HTPL)捕捉多尺度时序规律,以及多视角身份一致性学习(Multi-View Identity Consistency Learning, MVICL)利用对比学习强制跨视角身份一致性。这些模块共同提升了模型在跨视角场景下的鲁棒性和时序一致性,同时保持了实时推理效率(189 FPS),在AG-VPReID等基准上达到SOTA性能,并展现出良好的跨数据集泛化能力。
链接: https://arxiv.org/abs/2511.02564
作者: Md Rashidunnabi,Kailash A. Hambarde,Vasco Lopes,Joao C. Neves,Hugo Proenca
机构: DeepNeuronic, Lda.(DeepNeuronic, Lda.); University of Beira Interior (贝拉大学); Instituto de Telecomunicações (电信研究所); NOVA LINCS, NOVA University Lisbon (NOVA LINCS,里斯本新大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at this https URL
zh
[CV-17] he Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic
【速读】:该论文旨在解决现有全球目标检测基准在印度复杂交通场景下泛化能力不足的问题,特别是针对本地车辆类别多样性和城市交通流异质性缺乏高质量标注数据的瓶颈。解决方案的关键在于构建并公开首个大规模、高分辨率且针对印度交通特点标注的视觉感知数据集UVH-26,其包含26,646张来自班加罗尔安防摄像头的1080p图像,并通过众包众测(565名大学生参与)对14类本土车辆进行精细化标注(共180万边界框),采用多数投票(Majority Voting)和STAPLE算法生成共识真值标签。在此基础上训练的YOLO与DETR系列模型在mAP50:95指标上相较COCO预训练模型提升达8.4–31.5%,验证了领域特定数据对于提升智能交通系统(ITS)中目标检测性能的重要性。
链接: https://arxiv.org/abs/2511.02563
作者: Akash Sharma,Chinmay Mhatre,Sankalp Gawali,Ruthvik Bokkasam,Brij Kishore,Vishwajeet Pattanaik,Tarun Rambha,Abdul R. Pinjari,Vijay Kovvali,Anirban Chakraborty,Punit Rathore,Raghu Krishnapuram,Yogesh Simmhan
机构: Indian Institute of Science (印度科学研究所); Department of Computation and Data Sciences (计算与数据科学系); Centre for Infrastructure, Sustainable Transportation and Urban Planning (基础设施、可持续交通与城市规划中心); Robert Bosch Centre for Cyberphysical Systems (罗伯特·博世 cyber-physical 系统中心); Centre for Data for Public Good (公共数据研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This report describes the UVH-26 dataset, the first public release by AIM@IISc of a large-scale dataset of annotated traffic-camera images from India. The dataset comprises 26,646 high-resolution (1080p) images sampled from 2800 Bengaluru’s Safe-City CCTV cameras over a 4-week period, and subsequently annotated through a crowdsourced hackathon involving 565 college students from across India. In total, 1.8 million bounding boxes were labeled across 14 vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler (Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller, Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k consensus ground truth bounding boxes and labels were derived for distinct objects in the 26k images using Majority Voting and STAPLE algorithms. Further, we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X, and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50, mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in mAP50:95 over equivalent baseline models trained on COCO dataset, with RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40 for COCO-trained weights for common classes (Car, Bus, and Truck). This demonstrates the benefits of domain-specific training data for Indian traffic scenarios. The release package provides the 26k images with consensus annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the 6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the heterogeneity of Indian urban mobility directly from operational traffic-camera streams, UVH-26 addresses a critical gap in existing global benchmarks, and offers a foundation for advancing detection, classification, and deployment of intelligent transportation systems in emerging nations with complex traffic conditions.
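多人标注汇聚成共识框的“多数投票”思路,大致可按 IoU 贪心聚类后取均值实现(阈值与最低票数为示意值;STAPLE 的概率加权融合此处不展开):

```python
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def majority_vote_boxes(annotations, iou_thr=0.5, min_votes=3):
    """annotations: 同一图像、同一类别下所有标注者给出的 [x1,y1,x2,y2]。
    返回获得足够票数的共识框(簇内坐标取均值)。"""
    clusters = []
    for box in annotations:
        for cl in clusters:
            if iou(box, np.mean(cl, axis=0)) >= iou_thr:
                cl.append(box)
                break
        else:
            clusters.append([box])
    return [np.mean(cl, axis=0) for cl in clusters if len(cl) >= min_votes]

boxes = [[10, 10, 50, 50], [12, 9, 49, 52], [11, 11, 51, 50], [200, 200, 240, 240]]
print(majority_vote_boxes(boxes))   # 只有左上角的簇达到 3 票
```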
zh
[CV-18] SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration
【速读】:该论文旨在解决物理场景中人-智能体协作(Human-AI Collaboration)研究缺乏真实、多模态交互数据的问题。现有研究多依赖模拟环境或静态数据集,难以反映实际任务中人类与AI在混合现实(Mixed Reality, MR)环境下动态协同的复杂性。解决方案的关键在于构建SigmaCollab数据集,其包含85个未受训参与者在MR辅助下完成程序化任务的多模态数据流,涵盖音频、第一视角视频、深度图、头部/手部/视线追踪信息及事后标注,从而为开发和评估面向物理情境的协作型AI模型提供更贴近现实的测试平台。
链接: https://arxiv.org/abs/2511.02560
作者: Dan Bohus,Sean Andrist,Ann Paradiso,Nick Saw,Tim Schoonbeek,Maia Stiber
机构: Microsoft Research (微软研究院); Eindhoven University of Technology (埃因霍温理工大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at this https URL.
zh
[CV-19] Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction
【速读】:该论文旨在解决从基线磁共振成像(MRI)预测个体未来脑部状态这一核心挑战,尤其关注神经退行性病变(如阿尔茨海默病)的个体化预后问题。传统方法多聚焦于预测认知评分或临床结局(如轻度认知障碍向痴呆转化),而本文创新性地提出纵向MRI图像到图像的预测任务,直接建模空间分布复杂的神经退行模式。解决方案的关键在于采用五种深度学习架构(UNet、U2-Net、UNETR、Time-Embedding UNet 和 ODE-UNet)对两个纵向队列(ADNI 和 AIBL)进行训练与评估,并通过全局相似性和局部差异指标量化预测MRI与实际随访扫描的一致性,结果表明最优模型可在体素级别实现高保真预测,且在独立外部数据集上表现出强泛化能力,验证了其跨队列鲁棒性。
链接: https://arxiv.org/abs/2511.02558
作者: Ali Farki,Elaheh Moradi,Deepika Koundal,Jussi Tohka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer’s disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant’s entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
zh
[CV-20] Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data
【速读】:该论文旨在解决剪切散斑(Shearography)在工业应用中因依赖专家解读而难以大规模推广的问题,核心挑战在于如何实现无需标注数据的自动化异常检测。解决方案的关键在于采用无监督深度学习方法,特别是提出并验证了一种学生-教师特征匹配模型(student-teacher feature matching model),该模型仅使用无缺陷样本进行训练,即可实现对剪切散斑图像中局部缺陷的高鲁棒性分类与精确空间定位。相较于全连接和卷积自编码器,该方法在特征表示上具有更好的可分性,并通过t-SNE可视化和YOLOv8基准对比验证了其在实际变形条件下的优越性能,从而为工业场景下高效、标签稀疏的剪切散斑检测提供了可行路径。
链接: https://arxiv.org/abs/2511.02541
作者: Jessica Plassmann,Nicolas Schuler,Georg von Freymann,Michael Schuth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
Abstract:Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.
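学生-教师特征匹配做异常检测的打分逻辑很简洁:教师冻结,学生只在无缺陷样本上学习模仿教师;推理时两者特征差异大的位置即判为异常。示意代码如下(骨干网络与维度均为举例):

```python
import torch
import torch.nn.functional as F

def student_loss(teacher, student, defect_free_batch):
    """训练阶段:仅用无缺陷样本,最小化学生与冻结教师的特征距离。"""
    with torch.no_grad():
        ft = F.normalize(teacher(defect_free_batch), dim=1)
    fs = F.normalize(student(defect_free_batch), dim=1)
    return (1.0 - (ft * fs).sum(dim=1)).mean()

@torch.no_grad()
def anomaly_map(teacher, student, image, out_size=(256, 256)):
    """推理阶段:逐位置余弦距离作为异常分数,上采样回图像分辨率。"""
    ft = F.normalize(teacher(image), dim=1)    # (B, C, h, w)
    fs = F.normalize(student(image), dim=1)
    score = 1.0 - (ft * fs).sum(dim=1, keepdim=True)
    return F.interpolate(score, size=out_size, mode="bilinear",
                         align_corners=False)
```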
zh
[CV-21] LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization
【速读】:该论文旨在解决稀疏体素光栅化(Sparse-voxel rasterization, SVR)在基于优化的场景重建中存在低频内容欠拟合、依赖脆弱的剪枝启发式策略以及显存(VRAM)过度膨胀的问题。其核心解决方案是提出 LiteVoxel,一个自适应调优的训练流程:通过逆Sobel重加权结合中期gamma ramp机制使损失函数具备低频感知能力,仅在几何稳定后将梯度预算分配至平坦区域;采用基于最大混合权重的深度分位数剪枝逻辑替代固定阈值,并引入EMA-滞后保护机制稳定剪枝过程;同时,在显式增长预算约束下,利用射线足迹优先级驱动的细分策略精细化结构。该方案显著降低峰值显存消耗(40%-60%),提升低频细节保留能力,且保持与强基线SVR相当的PSNR/SSIM、训练速度和帧率。
链接: https://arxiv.org/abs/2511.02510
作者: Jee Won Lee,Jongseong Brad Choi
机构: State University of New York, Korea (纽约州立大学韩国分校); State University of New York, Stony Brook (纽约州立大学石溪分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sparse-voxel rasterization is a fast, differentiable alternative for optimization-based scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that makes SV rasterization both steadier and lighter. Our loss is made low-frequency aware via an inverse-Sobel reweighting with a mid-training gamma-ramp, shifting gradient budget to flat regions only after the geometry stabilizes. Adaptation replaces fixed thresholds with a depth-quantile pruning logic on maximum blending weight, stabilized by EMA-hysteresis guards, and refines structure through ray-footprint-based, priority-driven subdivision under an explicit growth budget. Ablations and full-system results across the Mip-NeRF 360 (6 scenes) and Tanks & Temples (3 scenes) datasets show mitigation of errors in low-frequency regions and boundary instability while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency detail that prior setups miss, enabling more predictable, memory-efficient training without sacrificing perceptual quality.
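“逆 Sobel 重加权 + 中期 gamma ramp”的意思是:越平坦的区域损失权重越高,且该偏置在几何稳定后才逐步生效。以下为一种可能的实现方式(Sobel 核取法、ramp 起止步数与 gamma 值均为假设):

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)

def inv_sobel_weight(gray, step, ramp_start=5000, ramp_end=15000, gamma=2.0):
    """gray: (B, 1, H, W) 灰度图;返回逐像素损失权重图。"""
    gx = F.conv2d(gray, SOBEL_X, padding=1)
    gy = F.conv2d(gray, SOBEL_X.transpose(2, 3), padding=1)   # 转置即 Sobel-y
    edge = (gx ** 2 + gy ** 2).sqrt()
    flat = 1.0 - edge / (edge.amax(dim=(2, 3), keepdim=True) + 1e-6)  # 平坦度
    # gamma ramp:ramp_start 前 t=0(均匀权重),之后平滑升到 1
    t = min(max((step - ramp_start) / (ramp_end - ramp_start), 0.0), 1.0)
    return (1.0 - t) + t * flat.clamp(0, 1) ** gamma

w = inv_sobel_weight(torch.rand(1, 1, 64, 64), step=10000)
print(w.shape, round(float(w.min()), 3), round(float(w.max()), 3))
```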
zh
[CV-22] Keeping it Local Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems
【速读】:该论文旨在解决移动机器人在复杂动态环境中部署时,因缺乏高效、自动化评估手段而导致的系统验证与接受度受限的问题。其核心挑战在于如何对来自多模态传感器的异构数据进行有效分析,并生成可读性强、信息丰富的自然语言报告,以支持关键应用场景(如自动驾驶和服务型机器人)的性能评估。解决方案的关键在于提出了一种仅依赖本地模型的自动化报告生成流水线,该流水线可在边缘计算设备上运行,无需外部服务,从而保障所有参与方的数据隐私,同时实现了对多种环境(室内、室外及城市场景)下机器人行为的定量与定性评估。
链接: https://arxiv.org/abs/2511.02507
作者: Nicolas Schuler,Lea Dewald,Jürgen Graf
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 6 pages, 4 figures, 1 table; accepted for MECATRONICS-REM 2025 International Conference, PARIS, FRANCE December 3-5 2025
Abstract:Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitates the evaluation of large amounts of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.
zh
[CV-23] ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
【速读】:该论文旨在解决视频剪辑中 shot assembly(镜头组接)自动化与艺术表达一致性难以兼顾的问题,即现有智能视频编辑技术虽能完成部分自动化任务,但难以捕捉创作者独特的艺术风格。解决方案的关键在于提出一种基于能量模型(energy-based model)的优化方法:首先通过大语言模型生成脚本与视频库进行视觉-语义匹配,筛选出语义对齐的候选镜头;接着从参考视频中提取镜头属性(如景别、摄像机运动、语义信息)并建立标签体系;然后利用能量模型学习参考视频的组接风格,对候选序列进行打分;最终结合多语法规则实现最优镜头排序,从而在保持叙事逻辑或艺术风格一致性的前提下,自动生成符合参考样式的连贯视觉序列。
链接: https://arxiv.org/abs/2511.02505
作者: Yaosen Chen,Wei Wang,Xuming Wen,Han Yang,Yanru Zhang
机构: Sobey Media Intelligence Laboratory; University of Electronic Science and Technology of China (电子科技大学); SiChuan University (四川大学); Qinghai Normal University (青海师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator’s unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: this https URL
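“能量模型为候选镜头序列打分”可以理解为:把每个镜头的属性(景别、运动、语义)编码成向量,能量网络对相邻镜头的转移输出标量能量并求和,能量越低越接近参考风格。示意代码如下(网络结构与属性维度均为假设):

```python
import torch
import torch.nn as nn

class ShotSequenceEnergy(nn.Module):
    """对镜头属性序列输出标量能量;低能量 = 更符合参考组接风格。"""
    def __init__(self, attr_dim=32, hidden=64):
        super().__init__()
        self.pair = nn.Sequential(          # 相邻镜头转移的能量项
            nn.Linear(attr_dim * 2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, shots):               # shots: (T, attr_dim)
        pairs = torch.cat([shots[:-1], shots[1:]], dim=-1)
        return self.pair(pairs).sum()       # 序列能量 = 各转移能量之和

def pick_best_sequence(candidates, energy_fn):
    """candidates: 候选镜头序列列表;返回能量最低者。"""
    scores = [energy_fn(seq).item() for seq in candidates]
    return candidates[min(range(len(scores)), key=scores.__getitem__)]

model = ShotSequenceEnergy()
print(pick_best_sequence([torch.randn(5, 32) for _ in range(10)], model).shape)
```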
zh
[CV-24] Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
【速读】:该论文旨在解决如何将通用基础模型(如语言模型 Language Models, LLMs 和视觉-语言模型 Vision-Language Models, VLMs)有效适配至特定科学任务(以ptychographic分析为例)的问题,尤其是在数据稀缺场景下。其核心挑战在于确定最优的领域适应策略:是采用监督微调(Supervised Fine-Tuning, SFT)还是上下文学习(In-Context Learning, ICL)。解决方案的关键在于通过构建PtychoBench这一多模态、多任务基准测试平台,系统性地比较两种策略在不同任务类型下的表现——发现最优路径具有任务模态依赖性:对于视觉任务(如图像伪影检测),SFT与ICL高度互补,结合上下文感知示例的微调模型性能最佳(Micro-F1=0.728);而对于文本任务(如参数推荐),直接在大模型上使用ICL更优(Micro-F1=0.847),甚至优于专门训练的“超级专家”SFT模型(0-shot Micro-F1=0.839)。这一结果为科学场景中AI代理系统的开发提供了清晰的优化框架。
链接: https://arxiv.org/abs/2511.02503
作者: Robinson Umeike,Neil Getty,Yin Xiangyu,Yi Jiang
机构: The University of Alabama (阿拉巴马大学); Argonne National Laboratory (阿贡国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful “super-expert” SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
zh
[CV-25] Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization
【速读】:该论文旨在解决在GNSS拒止环境下无人机(UAV)视觉定位失效的问题,特别是针对跨时间、跨视角、异质航空图像匹配中的挑战。传统方法通常将定位任务建模为图像检索或分类问题,依赖于有限的公开数据集和场景标签,或通过极坐标重投影、透视变换等手段减少域间差异,但易出现配准偏差、内容丢失及真实感不足等问题。本文的关键解决方案是利用现代目标检测技术从无人机与卫星图像中精准提取显著目标实例,并引入图神经网络(Graph Neural Network, GNN)建模图像内与图像间的节点关系,基于细粒度的图相似性度量实现高效且鲁棒的跨模态图像匹配,从而在复杂场景下提升定位精度与泛化能力。
链接: https://arxiv.org/abs/2511.02489
作者: Tao Liu,Kan Ren,Qian Chen
机构: Jiangsu Key Laboratory of Spectral Imaging and Intelligent Sense, Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, Submitted to IEEE TIM
Abstract:With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: this https URL.
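“基于图的细粒度节点相似度”做检索定位,可用一个很小的骨架说明:对两幅图中检测实例的节点嵌入做双向最佳匹配再平均,作为图级相似度(GNN 编码部分从略,以下仅为示意):

```python
import torch
import torch.nn.functional as F

def graph_similarity(nodes_a, nodes_b):
    """nodes_a / nodes_b: 两幅图(查询图 / 参考图)中实例节点的嵌入,
    形状 (Na, D) / (Nb, D)。"""
    a = F.normalize(nodes_a, dim=-1)
    b = F.normalize(nodes_b, dim=-1)
    sim = a @ b.T                                  # (Na, Nb) 节点相似度矩阵
    return 0.5 * (sim.max(dim=1).values.mean() +   # A 侧逐节点最佳匹配
                  sim.max(dim=0).values.mean())    # B 侧逐节点最佳匹配

def localize(query_nodes, reference_db):
    """reference_db: [(pose, nodes), ...];返回图相似度最高的参考位姿。"""
    scores = [graph_similarity(query_nodes, n).item() for _, n in reference_db]
    return reference_db[max(range(len(scores)), key=scores.__getitem__)][0]

db = [((i, 0), torch.randn(6, 64)) for i in range(5)]
print(localize(torch.randn(4, 64), db))
```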
zh
[CV-26] OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在物体中心逆渲染(object-centric inverse rendering)、新视角合成(novel view synthesis)和重光照(relighting)任务中,因过度依赖合成数据集训练及小规模真实数据集评估而导致的现实感不足与泛化能力有限的问题。解决方案的关键在于构建 OLATverse 数据集——一个包含约 900 万张图像、765 种真实物体的大规模真实世界数据集,这些物体在精确控制的光照条件下从多个视角拍摄,每件物体由 35 台 DSLR 相机和 331 个独立可控光源采集,从而实现高保真外观建模;同时提供校准后的相机参数、精确物体掩膜、光度表面法向量和漫反射反照率等辅助资源,并建立首个全面的基于真实世界的物体中心基准测试集,显著提升模型在真实场景中的适用性与性能。
链接: https://arxiv.org/abs/2511.02483
作者: Xilong Zhou,Jianchun Chen,Pramod Rao,Timo Teufel,Linjie Lyu,Tigran Minasian,Oleksandr Sotnychenko,Xiaoxiao Long,Marc Habermann,Christian Theobalt
机构: Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at this https URL.
zh
[CV-27] MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer ICIP2024
【速读】:该论文旨在解决多视角动作识别(multi-view action recognition)在时空动作识别(spatio-temporal action recognition, STAR)场景下的适用性问题。传统方法仅适用于从整段视频中识别单一动作的任务,无法有效处理STAR设置中需对个体动作进行逐帧识别的需求。解决方案的关键在于提出MVAFormer,其核心创新是引入一种基于Transformer的新型视图间协作模块:该模块利用保留空间信息的特征图而非丢失空间细节的嵌入向量进行跨视图融合,并通过将自注意力机制区分为同一视图内和不同视图间的计算方式,更有效地建模多视角之间的关系,从而显著提升STAR任务下的识别性能。
链接: https://arxiv.org/abs/2511.02473
作者: Taiga Yamane,Satoshi Suzuki,Ryo Masumura,Shotaro Tora
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Selected as Best Industry Paper Award at ICIP2024
Abstract:Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition (STAR) setting, in which each person’s action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately 4.4 points on the F-measure.
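“将自注意力区分为同视图与跨视图”可以通过注意力掩码实现:同一组 token 上跑两次注意力,一次只看同视图 token,一次只看其他视图的 token。掩码构造示意如下(PyTorch 约定布尔掩码中 True 表示屏蔽):

```python
import torch

def view_attention_masks(view_ids):
    """view_ids: (N,),每个 token 所属的相机视图编号。
    返回 (same_view_mask, cross_view_mask),可直接作为
    nn.MultiheadAttention 的 attn_mask(True = 不允许注意)。"""
    same = view_ids.unsqueeze(0) == view_ids.unsqueeze(1)   # (N, N)
    return ~same, same   # 同视图注意力屏蔽跨视图 token,反之亦然

ids = torch.tensor([0, 0, 0, 1, 1, 2])   # 3 个视图共 6 个 token(举例)
same_mask, cross_mask = view_attention_masks(ids)
print(same_mask.int())
```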
zh
[CV-28] HAGI++: Head-Assisted Gaze Imputation and Generation
【速读】:该论文旨在解决移动眼动追踪(mobile eye tracking)中因眨眼、瞳孔检测错误或光照变化导致的眼球注视数据缺失问题,这对后续 gaze 数据分析构成重大挑战。解决方案的关键在于提出 HAGI++——一种基于扩散模型(diffusion model)的多模态填补方法,首次利用头姿传感器与眼动之间的内在关联来提升填补精度;其核心创新是采用基于 Transformer 的扩散模型学习眼动与头动表示间的跨模态依赖关系,并可扩展融合其他身体运动信息(如腕部动作),从而在多种数据集上显著优于传统插值法和深度学习时序填补基线,尤其在 100% 眼动数据缺失情况下仍能生成符合人类真实注视行为的 gaze 分布。
链接: https://arxiv.org/abs/2511.02468
作者: Chuhan Jiao,Zhiming Hu,Andreas Bulling
机构: University of Stuttgart (斯图加特大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Extended version of our UIST’25 paper “HAGI: Head-Assisted Gaze Imputation for Mobile Eye Trackers”
Abstract:Mobile eye tracking plays a vital role in capturing human visual attention across both real-world and extended reality (XR) environments, making it an essential tool for applications ranging from behavioural research to human-computer interaction. However, missing values due to blinks, pupil detection errors, or illumination changes pose significant challenges for further gaze data analysis. To address this challenge, we introduce HAGI++ - a multi-modal diffusion-based approach for gaze data imputation that, for the first time, uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. HAGI++ employs a transformer-based diffusion model to learn cross-modal dependencies between eye and head representations and can be readily extended to incorporate additional body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D, and HOT3D datasets demonstrate that HAGI++ consistently outperforms conventional interpolation methods and deep learning-based time-series imputation baselines in gaze imputation. Furthermore, statistical analyses confirm that HAGI++ produces gaze velocity distributions that closely match actual human gaze behaviour, ensuring more realistic gaze imputations. Moreover, by incorporating wrist motion captured from commercial wearable devices, HAGI++ surpasses prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data (pure gaze generation). Our method paves the way for more complete and accurate eye gaze recordings in real-world settings and has significant potential for enhancing gaze-based analysis and interaction across various application domains.
zh
[CV-29] KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image
【速读】:该论文旨在解决高分辨率(Very High-Resolution, VHR)卫星图像修复中的关键挑战,即如何在保持高效性的同时实现高精度的缺失区域重建。现有方法通常依赖于预条件模型(preconditioned models)需大量重训练,或后条件模型(postconditioned models)存在显著计算开销。解决方案的关键在于提出KAO框架,通过引入潜在空间条件化(Latent Space Conditioning)优化紧凑的潜在空间,从而实现高效且准确的修复;同时在扩散过程中结合显式传播机制(Explicit Propagation),促进前向-反向融合,提升方法的稳定性和精度,最终在DeepGlobe和Massachusetts Roads等VHR数据集上达到新基准性能。
链接: https://arxiv.org/abs/2511.02462
作者: Teerapong Panboonyuen
机构: Chulalongkorn University (朱拉隆功大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages
Abstract:Satellite image inpainting is a crucial task in remote sensing, where accurately restoring missing or occluded regions is essential for robust image analysis. In this paper, we propose KAO, a novel framework that utilizes Kernel-Adaptive Optimization within diffusion models for satellite image inpainting. KAO is specifically designed to address the challenges posed by very high-resolution (VHR) satellite datasets, such as DeepGlobe and the Massachusetts Roads Dataset. Unlike existing methods that rely on preconditioned models requiring extensive retraining or postconditioned models with significant computational overhead, KAO introduces a Latent Space Conditioning approach, optimizing a compact latent space to achieve efficient and accurate inpainting. Furthermore, we incorporate Explicit Propagation into the diffusion process, facilitating forward-backward fusion, which improves the stability and precision of the method. Experimental results demonstrate that KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.
zh
[CV-30] From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
【速读】:该论文旨在解决在移动机器人(Mobile Robotics)场景下,如何在边缘设备(edge devices)上高效部署视觉语言模型(Visual Language Models, VLMs)以实现场景理解(Scene Interpretation)与动作识别(Action Recognition)的问题。其核心挑战在于平衡模型性能与推理时间之间的权衡,尤其是在资源受限的边缘环境中。解决方案的关键在于评估当前最先进的小型VLMs在多样化真实世界场景(包括城市景观、校园和室内环境)中的表现,明确其在边缘部署时的潜力、局限性、固有偏差及实际应用价值,从而为移动机器人提供轻量级、实时且具备常识推理能力的视觉理解方案。
链接: https://arxiv.org/abs/2511.02427
作者: Nicolas Schuler,Lea Dewald,Nick Baldig,Jürgen Graf
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
Abstract:Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in its environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: this https URL
zh
[CV-31] Synthetic Crop-Weed Image Generation and its Impact on Model Generalization
链接: https://arxiv.org/abs/2511.02417
作者: Garen Boyadjian(INRAE),Cyrille Pierre(INRAE),Johann Laconte(INRAE, UR TSCF),Riccardo Bertoglio(INRAE)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
[CV-32] ChartM3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension EMNLP25
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂图表理解任务中面临的挑战,即现有研究对真实应用场景中常见的复杂图表场景和计算密集型推理任务覆盖不足的问题。解决方案的关键在于提出一种自动化的多阶段代码驱动数据生成流水线,该流水线结合检索增强生成(Retrieval-Augmented Generation, RAG)以获取专业图表模板,并利用思维链(Chain-of-Thought, CoT)策略生成模拟真实数据分布的推理代码,从而驱动图表渲染与相关统计计算,最终构建出高质量、多维度、多步骤的ChartM³数据集(含38K图表和142K问答对),显著提升了模型的推理能力与跨域泛化性能。
链接: https://arxiv.org/abs/2511.02415
作者: Duo Xu,Hao Cheng,Xin Lin,Zhen Xie,Hao Wang
机构: Alibaba Cloud Computing (阿里云计算)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, EMNLP25 Accepted
Abstract:Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM^3, a multi-dimensional and multi-step dataset containing 38K charts and 142K QA pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
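“代码驱动”的数据合成之所以可靠,在于同一段程序既渲染图表、又顺带算出问答真值,图与答案天然一致。下面是一个极简示意(模板检索、CoT 推理代码生成等环节从略,字段名均为笔者假设):

```python
import random

def make_chart_sample(seed: int):
    """示意:生成一条“图表 + 计算型问答”样本;真值由同一段代码算出。"""
    rng = random.Random(seed)
    values = [round(rng.uniform(10, 100), 1) for _ in range(5)]  # 模拟数据分布
    question = "图中最大值与最小值相差多少?"
    answer = round(max(values) - min(values), 1)                 # 程序算出的真值
    # 实际流程中,这里会执行检索到的图表模板代码渲染图像;此处仅留占位路径
    return {"chart": f"chart_{seed}.png", "values": values,
            "question": question, "answer": answer}

print(make_chart_sample(seed=42))
```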
zh
[CV-33] IllumFlow: Illumination-Adaptive Low-Light Enhancement via Conditional Rectified Flow and Retinex Decomposition
【速读】:该论文旨在解决低光照图像增强(Low-Light Image Enhancement, LLIE)中因光照不足导致的细节丢失、噪声干扰以及色彩失真等问题。其核心解决方案是提出IllumFlow框架,关键在于基于Retinex理论将输入图像分解为反射率(reflectance)和照度(illumination)两个独立成分,并分别处理:通过条件化修正流(Conditional Rectified Flow, CRF)建模照度变化的连续流场以实现精准的亮度调节;同时设计一个去噪网络,利用流生成的数据增强策略有效去除反射率分量中的噪声与色差,同时保持颜色保真度。这一解耦优化策略使得模型在不同光照条件下均能实现高质量的图像增强与可定制的亮度调整。
链接: https://arxiv.org/abs/2511.02411
作者: Wenyang Wei,Yang yang,Xixi Jia,Xiangchu Feng,Weiwei Wang,Renzhen Wang
机构: Xidian University (西安电子科技大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present IllumFlow, a novel framework that synergizes conditional Rectified Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our model addresses low-light enhancement through separate optimization of illumination and reflectance components, effectively handling both lighting variations and noise. Specifically, we first decompose an input image into reflectance and illumination components following Retinex theory. To model the wide dynamic range of illumination variations in low-light images, we propose a conditional rectified flow framework that represents illumination changes as a continuous flow field. While complex noise primarily resides in the reflectance component, we introduce a denoising network, enhanced by flow-derived data augmentation, to remove reflectance noise and chromatic aberration while preserving color fidelity. IllumFlow enables precise illumination adaptation across lighting conditions while naturally supporting customizable brightness enhancement. Extensive experiments on low-light enhancement and exposure correction demonstrate superior quantitative and qualitative performance over existing methods.
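Retinex 分解在工程上常用“通道最大值作初始照度、原图除以照度得反射率”的经典近似,便于理解摘要中“分别优化照度与反射率”的出发点(仅为示意,论文中的分解与流模型远比这复杂):

```python
import numpy as np

def retinex_decompose(img, eps=1e-4):
    """img: (H, W, 3),取值 [0, 1]。返回 (reflectance, illumination),
    满足 img ≈ R * L 的经典初始化。"""
    illumination = img.max(axis=2, keepdims=True)   # 通道最大值作初始照度
    reflectance = img / (illumination + eps)        # 逐像素反演出反射率
    return reflectance.clip(0, 1), illumination

low_light = np.random.rand(64, 64, 3) * 0.2         # 模拟一张低光图
R, L = retinex_decompose(low_light)
print(R.shape, L.shape, round(float(L.mean()), 3))
```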
zh
[CV-34] Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs ViTs and Self-Supervised ViTs
【速读】:该论文旨在解决猫与人类在视觉系统中因眼结构差异(如猫具有垂直椭圆瞳孔,与伏击捕食行为相关)而导致的下游视觉表征如何跨物种对齐的问题。其核心挑战在于理解不同物种间视觉信息处理的共性与差异,尤其是在深度学习模型中是否能捕捉到这种跨物种的神经表征一致性。解决方案的关键在于构建一个统一的、冻结编码器的基准测试框架,利用层级中心核对齐(Centered Kernel Alignment, CKA)和表示相似性分析(Representational Similarity Analysis, RSA),对比多种架构(包括卷积神经网络 CNNs、监督式 Vision Transformers、窗口化 Transformer 和自监督 ViT DINO)在野生条件下对猫与人视觉表征的对齐能力。结果表明,基于 token 级别自监督学习的 DINO ViT-B/16 模型在早期块中表现出最强的跨物种表征对齐(平均 CKA-RBF ≈ 0.814,RSA ≈ 0.698),揭示了自监督机制结合 ViT 架构的归纳偏置可更有效地建模跨物种视觉计算的收敛性,为神经科学提供了可验证的假设。
链接: https://arxiv.org/abs/2511.02404
作者: Arya Shah,Vaibhav Tripathi
机构: Indian Institute of Technology Gandhinagar (印度理工学院甘地纳格尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF ≈ 0.814, mean CKA-linear ≈ 0.745, mean RSA ≈ 0.698), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA ≈ 0.53 at block 8; ViT-L/16 ≈ 0.47 at block 14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
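For readers unfamiliar with the alignment metric, below is a minimal NumPy sketch of linear CKA between two activation matrices; it is the generic textbook formulation (shapes and the toy usage are illustrative), not the authors' evaluation code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices
    X: (n_samples, d_x) and Y: (n_samples, d_y)."""
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = (np.linalg.norm(X.T @ X, ord="fro")
           * np.linalg.norm(Y.T @ Y, ord="fro"))
    return num / den

# Toy usage: features of the same 100 images from two different encoders
cka = linear_cka(np.random.randn(100, 768), np.random.randn(100, 512))
```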
zh
[CV-35] A Novel Grouping-Based Hybrid Color Correction Algorithm for Color Point Clouds
【速读】: This paper tackles color consistency correction for color point clouds, a fundamental yet important task in 3D rendering and compression. Traditional color correction methods target color images and do not transfer directly to point clouds. The key of the proposed grouping-based hybrid algorithm is to estimate the overlapping rate between the aligned source and target clouds and then adaptively partition the target points into a close proximity group (Gcl), a moderate proximity group (Gmod), and, when needed, a distant proximity group (Gdist), applying a different correction strategy to each: K-nearest-neighbors-based bilateral interpolation (KBI) for Gcl, joint KBI and histogram equalization (JKHE) for Gmod, and histogram equalization (HE) alone for Gdist. This group-driven, differentiated treatment substantially improves color consistency correction and outperforms state-of-the-art methods on 1086 testing point cloud pairs.
链接: https://arxiv.org/abs/2511.02397
作者: Kuo-Liang Chung,Ting-Chung Tang
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Color consistency correction for color point clouds is a fundamental yet important task in 3D rendering and compression applications. In the past, most previous color correction methods aimed at correcting color for color images. The purpose of this paper is to propose a grouping-based hybrid color correction algorithm for color point clouds. Our algorithm begins by estimating the overlapping rate between the aligned source and target point clouds, and then adaptively partitions the target points into two groups, namely the close proximity group Gcl and the moderate proximity group Gmod, or three groups, namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated overlapping rate is low or high, respectively. To correct color for target points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method is proposed. To correct color for target points in Gmod, a joint KBI and the histogram equalization (JKHE) method is proposed. For target points in Gdist, a histogram equalization (HE) method is proposed for color correction. Finally, we discuss the grouping-effect free property and the ablation study in our algorithm. The desired color consistency correction benefit of our algorithm has been justified through 1086 testing color point cloud pairs against the state-of-the-art methods. The C++ source code of our algorithm can be accessed from the website: this https URL.
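As a rough illustration of the histogram-equalization (HE) component applied to distant points (the KBI and JKHE steps are more involved), the sketch below matches each color channel of the target points to the source distribution by quantile mapping. It is a generic histogram-matching routine assuming per-point RGB colors in (N, 3) arrays, not the authors' implementation.

```python
import numpy as np

def match_channel(target_vals, source_vals):
    """Map target color values onto the source distribution
    by aligning empirical quantiles (histogram matching)."""
    quantiles = np.linspace(0.0, 1.0, 256)
    t_q = np.quantile(target_vals, quantiles)
    s_q = np.quantile(source_vals, quantiles)
    return np.interp(target_vals, t_q, s_q)

def match_colors(target_rgb, source_rgb):
    """target_rgb, source_rgb: (N, 3) float arrays of point colors."""
    out = np.empty_like(target_rgb)
    for c in range(3):
        out[:, c] = match_channel(target_rgb[:, c], source_rgb[:, c])
    return out
```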
zh
[CV-36] Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds ITSC2025
【速读】: This paper addresses the poor label efficiency of moving object segmentation on sparse, noisy radar point clouds, where supervised learning is limited by the high cost of annotation. The key is a two-stage self-supervised framework: the network is first pretrained with a novel clustering-based contrastive loss function combined with dynamic points removal, producing motion-aware representations, and is then fine-tuned with a small amount of labeled data, significantly improving segmentation performance and label efficiency.
链接: https://arxiv.org/abs/2511.02395
作者: Leon Schwarzer,Matthias Zeller,Daniel Casado Herraez,Simon Dierl,Michael Heidingsfeld,Cyrill Stachniss
机构: CARIAD SE (CARIAD SE); TU Dortmund University (多特蒙德工业大学); University of Bonn (波恩大学); Lamarr Institute for Machine Learning and Artificial Intelligence (拉马尔机器学习与人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at IEEE International Conference on Intelligent Transportation Systems (ITSC 2025), 8 pages, 3 figures
Abstract:Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point’s Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.
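The paper's clustering-based loss is not spelled out in the abstract, but its contrastive backbone resembles the standard NT-Xent objective sketched below, where two augmented views of the same item (e.g., a radar cluster) form a positive pair. Treating cluster embeddings as the contrasted items is an assumption for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """Generic NT-Xent contrastive loss. z1, z2: (N, D) embeddings of
    the same N items under two augmentations."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)          # (2N, D)
    sim = z @ z.t() / temperature           # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))       # exclude self-pairs
    n = z1.shape[0]
    # The positive for sample i is its augmented counterpart i +/- N
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```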
zh
[CV-37] RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
【速读】: This paper addresses the problem that chemical reaction data mostly exist as images in the literature, lacking structured representations and therefore being non-machine-readable and unusable for training machine learning models. The key of the proposed RxnCaption framework is to reformulate coordinate-prediction-based reaction diagram parsing as an image captioning problem, using a "BBox and Index as Visual Prompt" (BIVP) strategy: the state-of-the-art molecular detector MolYOLO pre-draws molecular bounding boxes and indices onto the input image, turning downstream parsing into a natural-language description problem, which markedly improves structure extraction quality while simplifying model design.
链接: https://arxiv.org/abs/2511.02384
作者: Jiahe Song,Chuang Wang,Bowen Jiang,Yinfan Wang,Hao Zheng,Xingjian Wei,Chengjin Liu,Junyuan Gao,Yubin Wang,Lijun Wu,Jiang Wu,Qian Yu,Conghui He
机构: Peking University (北京大学); Beijing University of Aeronautics and Astronautics (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed “BBox and Index as Visual Prompt” (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
zh
[CV-38] M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings
【速读】: This paper targets the reliability issues of smartphone video-based photoplethysmography (vPPG) for cardiovascular patients, where motion artifacts, lighting variation, and single-view constraints introduce measurement errors. The key contributions are M3PD, the first publicly available dual-view mobile photoplethysmography dataset with synchronized facial and fingertip videos captured simultaneously by the front and rear smartphone cameras, and the F3Mamba model built on it, which fuses the two views via Mamba-based temporal modeling, reducing heart-rate estimation error by 21.9%-30.2% over existing single-view baselines while improving robustness in challenging real-world scenarios.
链接: https://arxiv.org/abs/2511.02349
作者: Jiankai Tang,Tao Zhang,Jia Li,Yiru Zhang,Mingyu Zhang,Kegang Wang,Yuming Hao,Bolin Wang,Haiyang Li,Xingyao Wang,Yuanchun Shi,Yuntao Wang,Sichong Qian
机构: Tsinghua University (清华大学); Beijing Anzhen Hospital, Capital Medical University (首都医科大学附属北京安贞医院); Agency for Science, Technology and Research (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: this https URL.
zh
[CV-39] GAFD-CC: Global-Aware Feature Decoupling with Confidence Calibration for OOD Detection
【速读】: This paper addresses the limitation that existing post-hoc out-of-distribution (OOD) detection methods, which avoid retraining, ignore the inherent correlation between features and logits and thus constrain detection performance. The key of the proposed Global-Aware Feature Decoupling with Confidence Calibration (GAFD-CC) is twofold: first, it performs global-aware feature decoupling guided by classification weights, aligning features with the global classification direction to separate two kinds of critical information, positively correlated features that refine the ID/OOD boundary and negatively correlated features that suppress false positives and tighten it; second, it adaptively fuses these decoupled features with multi-scale logit-based confidence for more comprehensive and robust OOD detection.
链接: https://arxiv.org/abs/2511.02335
作者: Kun Zou,Yongheng Xu,Jianxing Yu,Yan Pan,Jian Yin,Hanjiang Lai
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Out-of-distribution (OOD) detection is paramount to ensuring the reliability and robustness of learning models in real-world applications. Existing post-hoc OOD detection methods detect OOD samples by leveraging their features and logits information without retraining. However, they often overlook the inherent correlation between features and logits, which is crucial for effective OOD detection. To address this limitation, we propose Global-Aware Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to refine decision boundaries and increase discriminative performance. Firstly, it performs global-aware feature decoupling guided by classification weights. This involves aligning features with the direction of global classification weights to decouple them. From this, GAFD-CC extracts two types of critical information: positively correlated features that promote in-distribution (ID)/OOD boundary refinement and negatively correlated features that suppress false positives and tighten these boundaries. Secondly, it adaptively fuses these decoupled features with multi-scale logit-based confidence for comprehensive and robust OOD detection. Extensive experiments on large-scale benchmarks demonstrate GAFD-CC’s competitive performance and strong generalization ability compared to those of state-of-the-art methods.
zh
[CV-40] Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization NEURIPS2025
【速读】: This paper addresses global accuracy and robustness in camera pose estimation, in particular recovering precise poses without accurate inter-camera distance information. The key innovation of the proposed Cycle-Sync framework is adapting message-passing least squares (MPLS), originally developed for group synchronization, to camera location estimation: it emphasizes cycle-consistent information, redefines cycle consistencies using distances estimated in previous iterations, and incorporates a Welsch-type robust loss for resilience to noise. The paper also establishes the strongest known deterministic exact-recovery guarantee, showing that cycle consistency alone suffices for reconstruction at the lowest currently known sample complexity, without relying on bundle adjustment.
链接: https://arxiv.org/abs/2511.02329
作者: Shaohan Li,Yunpeng Shi,Gilad Lerman
机构: University of Minnesota (明尼苏达大学); University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Numerical Analysis (math.NA); Methodology (stat.ME)
备注: NeurIPS 2025 spotlight paper
Abstract:We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) – originally developed for group synchronization – to camera location estimation. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, showing that cycle consistency alone – without access to inter-camera distances – suffices to achieve the lowest sample complexity currently known. To further enhance robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation synchronization. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.
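To make cycle consistency concrete: composing estimated relative rotations around a 3-cycle should return the identity, and the residual can be used to down-weight unreliable measurements, as in reweighted least squares. The sketch below illustrates only that check; it is not Cycle-Sync's MPLS solver, and the weighting function is an assumption.

```python
import numpy as np

def cycle_residual(R_ij, R_jk, R_ki):
    """Deviation of a 3-cycle of relative rotations from the identity.
    Consistent, noise-free measurements give a residual of 0."""
    composed = R_ij @ R_jk @ R_ki
    return np.linalg.norm(composed - np.eye(3), ord="fro")

def edge_weight(residual, sigma=0.1):
    """Soft weight: near-consistent cycles get weight ~1, inconsistent
    ones are suppressed (in the spirit of reweighted least squares)."""
    return np.exp(-(residual / sigma) ** 2)
```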
zh
[CV-41] 3D Point Cloud Object Detection on Edge Devices for Split Computing
【速读】: This paper addresses the long processing times and high power consumption incurred when complex deep-learning-based 3D object detection models for autonomous driving run on edge devices. The key is Split Computing, which distributes deep neural network inference between the edge device and the cloud and transmits only intermediate features rather than raw point cloud data, significantly reducing edge-side computation and energy use while improving overall inference efficiency. Experiments show that splitting after voxelization reduces inference time by 70.8% and edge execution time by 90.0%, while splitting within the network reduces inference time by up to 57.1% and edge execution time by up to 69.5%.
链接: https://arxiv.org/abs/2511.02293
作者: Taisuke Noguchi,Takuya Azumi
机构: Unknown
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages. This version includes minor lstlisting configuration adjustments for successful compilation. No changes to content or layout. Originally published at ACM/IEEE RAGE 2024
Abstract:The field of autonomous driving technology is rapidly advancing, with deep learning being a key component. Particularly in the field of sensing, 3D point cloud data collected by LiDAR is utilized to run deep neural network models for 3D object detection. However, these state-of-the-art models are complex, leading to longer processing times and increased power consumption on edge devices. The objective of this study is to address these issues by leveraging Split Computing, a distributed machine learning inference method. Split Computing aims to lessen the computational burden on edge devices, thereby reducing processing time and power consumption. Furthermore, it minimizes the risk of data breaches by only transmitting intermediate data from the deep neural network model. Experimental results show that splitting after voxelization reduces the inference time by 70.8% and the edge device execution time by 90.0%. When splitting within the network, the inference time is reduced by up to 57.1%, and the edge device execution time is reduced by up to 69.5%.
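A minimal sketch of the split-computing idea, assuming a PyTorch model: cut the network at a chosen layer, run the head on the edge device, and transmit only the intermediate activation to the cloud, which runs the tail. The toy 2D backbone and split point are placeholders, not the paper's voxelization-based 3D detector.

```python
import torch
import torch.nn as nn

# Toy stand-in for a detection backbone; a real pipeline would split
# a voxelization + 3D detection network instead.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

split = 2  # cut after the second module
edge_head, cloud_tail = model[:split], model[split:]

x = torch.randn(1, 3, 224, 224)               # sensor input on the edge
z = edge_head(x)                              # runs on the edge device
payload_bytes = z.numel() * z.element_size()  # what gets transmitted
y = cloud_tail(z)                             # runs in the cloud
```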
zh
[CV-42] Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows? BMVC2025
【速读】: This paper concerns object pose estimation in 3D computer vision, in particular providing more robust and informative probabilistic outputs when the pose is ambiguous. Traditional methods output a single point estimate, which can be insufficient under sensor and projection constraints or inherent object symmetries. The key idea is to explore the classical Euler angles parameterisation as the basis of a Normalizing Flows model over the pose distribution, aiming to retain computational efficiency while better expressing complex pose uncertainty.
链接: https://arxiv.org/abs/2511.02277
作者: Giorgos Sfikas,Konstantina Nikolaidou,Foteini Papadopoulou,George Retsinas,Anastasios L. Kesidis
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: BMVC 2025 workshop proceedings (Smart Cameras for Smarter Autonomous Vehicles & Robots)
Abstract:Object pose estimation is a task that is of central importance in 3D Computer Vision. Given a target image and a canonical pose, a single point estimate may very often be sufficient; however, a probabilistic pose output is related to a number of benefits when pose is not unambiguous due to sensor and projection constraints or inherent object symmetries. With this paper, we explore the usefulness of using the well-known Euler angles parameterisation as a basis for a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation, 3D pose has been parameterized in a number of ways, either in or out of the context of parameter estimation. We explore the idea that Euler angles, despite their shortcomings, may lead to useful models in a number of aspects, compared to a model built on a more complex parameterisation.
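For reference, a minimal sketch of the Euler-angle parameterisation the paper builds on: three angles mapped to a rotation matrix. The ZYX (yaw-pitch-roll) convention here is an assumption; the paper's contribution is the Normalizing Flows model placed on top of such a parameterisation, which is not shown.

```python
import numpy as np

def euler_zyx_to_matrix(yaw, pitch, roll):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

R = euler_zyx_to_matrix(0.3, -0.1, 1.2)
assert np.allclose(R @ R.T, np.eye(3))  # R is a valid rotation
```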
zh
[CV-43] Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
【速读】: This paper tackles three core challenges in Medical Report Generation (MRG): insufficient domain knowledge understanding, poor alignment between textual and visual entity embeddings, and spurious correlations arising from cross-modal biases. The key is the proposed HTSC-CIF framework, which decomposes the problem into low-, mid-, and high-level tasks: the low level aligns medical entity features with spatial locations to enrich the visual encoder's domain knowledge; the mid level uses Prefix Language Modeling (text) and Masked Image Modeling (images) for mutual cross-modal guidance, improving alignment; and the high level introduces a cross-modal causal intervention module based on front-door intervention to reduce confounders and improve interpretability. This systematically overcomes the limitation of prior MRG models that address only a single challenge, and significantly outperforms state-of-the-art methods.
链接: https://arxiv.org/abs/2511.02271
作者: Yucheng Song,Yifan Ge,Junhao Li,Zhining Liao,Zhifang Liao
机构: Central South University (中南大学); University of Science and Technology of China (中国科学技术大学); University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists’ burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF’s effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
zh
[CV-44] Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
【速读】: This paper addresses absolute depth estimation from endoscopic video, where the lack of annotated real depth maps makes it hard to train accurate monocular depth estimation (MDE) models for real surgical scenes. Current approaches rely on image-level unsupervised domain adaptation, translating synthetic images with known depth into the style of real endoscopic frames for training, but a domain gap between real and translated synthetic images remains. The key contribution is a latent feature alignment strategy: through adversarial learning and a directional feature consistency constraint, the depth network learns domain-invariant latent representations when fed translated synthetic and real endoscopic frames, significantly reducing the domain gap and improving absolute depth estimation. The method is agnostic to the image translation process and focuses on depth estimation itself; it is validated on endoscopic videos of central airway phantoms, outperforming state-of-the-art MDE methods with consistent gains across backbones and pretrained weights.
链接: https://arxiv.org/abs/2511.02247
作者: Hao Li,Daiwei Lu,Jesse d’Almeida,Dilara Isik,Ehsan Khodapanah Aghdam,Nick DiSanto,Ayberk Acar,Susheela Sharma,Jie Ying Wu,Robert J. Webster III,Ipek Oguz
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at this https URL.
zh
[CV-45] Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimers Disease Diagnosis
【速读】: This paper addresses the representation bias and noise in multimodal neuroimaging fusion (MRI and PET) for early Alzheimer's disease (AD) diagnosis, caused by neglecting modality-specific features and inter-modal distribution differences. The key of the proposed Collaborative Attention and Consistent-Guided Fusion framework is a learnable parameter representation (LPR) block that compensates for missing modality information, a shared encoder combined with modality-independent encoders to preserve both cross-modal shared and modality-specific features, and a consistency-guided mechanism that explicitly aligns the latent distributions across modalities, thereby improving classification performance.
链接: https://arxiv.org/abs/2511.02228
作者: Delin Ma,Menghui Zhou,Jun Qi,Yun Yang,Po Yang
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Alzheimer’s disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.
zh
[CV-46] Can Foundation Models Revolutionize Mobile AR Sparse Sensing?
【速读】: This paper examines the long-standing trade-off between sensing quality and efficiency in mobile sensing systems under constraints such as computation and power. Traditional sparse sensing acquires and processes only a subset of sensor data to improve efficiency, but often loses accuracy because of missing information across space and time. The key idea is to use foundation models to strengthen sparse sensing, in particular geometry-aware image warping for accurate reuse of cross-frame information, improving accuracy in tasks such as 3D scene reconstruction while preserving efficiency; the study also demonstrates the scalability and leading performance of foundation-model-based sparse sensing on real-world mobile AR data.
链接: https://arxiv.org/abs/2511.02215
作者: Yiqin Zhao,Tian Guo
机构: Rochester Institute of Technology (罗切斯特理工学院); Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
Abstract:Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.
zh
[CV-47] Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning ALT
【速读】: This paper addresses the heavy manual intervention, low efficiency, and resource cost of traditional segmental longitudinal strain (SLS) assessment, which make it unsuitable for continuous clinical monitoring. The key solution is autoStrain, the first automated deep learning (DL) pipeline for SLS estimation from transesophageal echocardiography (TEE) via accurate motion estimation. Two DL approaches are compared: TeeFlow (based on the RAFT optical flow model, for dense frame-to-frame prediction) and TeeTracker (based on the CoTracker point trajectory model, for sparse long-sequence prediction). Both are trained and evaluated on a purpose-built, highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground-truth myocardial motion. Clinical validation on 16 patients shows agreement with clinical references (mean difference 1.09%, 95% limits of agreement -8.90% to 11.09%), markedly improving the precision and efficiency of cardiac function assessment.
链接: https://arxiv.org/abs/2511.02210
作者: Anders Austlid Taskén,Thierry Judge,Erik Andreas Rye Berg,Jinyang Yu,Bjørnar Grenne,Frank Lindseth,Svend Aakhus,Pierre-Marc Jodoin,Nicolas Duchateau,Olivier Bernard,Gabriel Kiss
机构: Norwegian University of Science and Technology (挪威科技大学); University of Sherbrooke (舍布鲁克大学); INSA-Lyon, Universite Claude Bernard Lyon 1, CNRS, Inserm, CREATIS UMR 5220, U1294 (法国里昂国立应用科学学院,克莱德·贝尔纳里昂第一大学,法国国家科学研究中心,法国国家健康与医学研究院,CREATIS UMR 5220,U1294); Institut Universitaire de France (法国大学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 13 pages, IEEE Journal of Biomedical and Health Informatics
Abstract:Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
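The reported "1.09% (-8.90% to 11.09%)" is a Bland-Altman-style mean difference with 95% limits of agreement. A minimal sketch of that standard computation on paired measurements (the toy strain values are illustrative, not study data):

```python
import numpy as np

def limits_of_agreement(method_a, method_b):
    """Bland-Altman mean difference and 95% limits of agreement
    between paired measurements (e.g., pipeline vs. reference SLS)."""
    diff = np.asarray(method_a) - np.asarray(method_b)
    bias = diff.mean()
    spread = 1.96 * diff.std(ddof=1)   # 95% interval, sample std
    return bias, (bias - spread, bias + spread)

a = np.array([-18.2, -15.1, -20.4, -16.8])  # toy strain values (%)
b = np.array([-17.0, -16.3, -19.5, -18.0])
bias, (lo, hi) = limits_of_agreement(a, b)
```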
zh
[CV-48] Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping
【速读】: This paper addresses the time cost, labor intensity, and destructiveness of traditional phenotyping for strawberry traits, as well as the noise, computational cost, and downstream feature-extraction difficulties of current 3D Gaussian Splatting (3DGS) reconstructions that include the background. The key is an object-centric 3D reconstruction framework with a preprocessing pipeline that removes the background using the Segment Anything Model v2 (SAM-2) combined with alpha-channel masking, yielding clean 3D geometry of the strawberry plant; on top of this, DBSCAN clustering and Principal Component Analysis (PCA) automatically estimate key phenotypic traits such as plant height and canopy width. The approach improves both reconstruction accuracy and computational efficiency, providing a scalable, non-destructive route for strawberry phenotyping.
链接: https://arxiv.org/abs/2511.02207
作者: Jiajia Li,Keyi Zhu,Qianwen Zhang,Dong Chen,Qi Sun,Zhaojian Li
机构: Michigan State University (密歇根州立大学); Mississippi State University (密西西比州立大学); New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, 3 tables
Abstract:Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA). Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.
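A minimal sketch of the trait-estimation step described above, assuming a background-free plant point cloud with z as the vertical axis: DBSCAN isolates the largest cluster as the plant, and PCA over the horizontal coordinates gives a canopy-width direction. Parameter values and the exact width definition are assumptions, not the paper's.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def plant_traits(points, eps=0.02, min_samples=20):
    """points: (N, 3) array; returns (height, canopy_width)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    # Keep the largest non-noise cluster as the plant
    ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    plant = points[labels == ids[np.argmax(counts)]]
    height = plant[:, 2].max() - plant[:, 2].min()
    # PCA on the horizontal (x, y) coordinates for canopy extent
    xy = plant[:, :2] - plant[:, :2].mean(axis=0)
    _, _, vt = np.linalg.svd(xy, full_matrices=False)
    spread = xy @ vt[0]        # projection on first principal axis
    canopy_width = spread.max() - spread.min()
    return height, canopy_width
```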
zh
[CV-49] Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers
【速读】: This paper aims to reduce the high cost and limited accessibility of Alzheimer's disease (AD) diagnosis caused by its heavy reliance on amyloid-beta positron emission tomography (Abeta-PET). The key is a large language model (LLM)-driven generative model with multimodal information fusion that synthesizes high-quality Abeta-PET images from blood-based biomarkers (BBMs) and T1-weighted MRI. The synthesized images closely match real scans in structural detail and regional patterns (SSIM = 0.920 ± 0.003, Pearson's r = 0.955 ± 0.007) and support a fully automated AD diagnostic pipeline whose performance (AUC = 0.78) exceeds MRI-only (AUC = 0.68) and BBM-only (AUC = 0.73) models; ablations confirm the central role of LLM integration and prompt engineering in multimodal synthesis accuracy and clinical applicability.
链接: https://arxiv.org/abs/2511.02206
作者: Zhengjie Zhang,Xiaoxie Mao,Qihao Guo,Shaoting Zhang,Qi Huang,Mu Zhou,Fang Xie,Mianxin Liu
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 8 figures
Abstract:Background: Alzheimer’s disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson’s r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer’s disease.
zh
[CV-50] OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
【速读】: This paper addresses two challenges of multimodal spatiotemporal learning on real-world experimental data: within each modality, measurements are sparse, irregular, and noisy (e.g., QA/QC artifacts) yet cross-modally correlated; and the set of available modalities varies over space and time, shrinking the usable record unless the model can adapt to arbitrary modality subsets at train and test time. The key of the proposed OmniField framework is a continuous neural field conditioned on the available modalities with iterative cross-modal information fusion: a multimodal crosstalk block architecture and iterative cross-modal refinement align signals before the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing, and maintaining near-clean performance under heavy simulated sensor noise, demonstrating strong robustness.
链接: https://arxiv.org/abs/2511.02205
作者: Kevin Valencia,Thilina Balasooriya,Xihaier Luo,Shinjae Yoo,David Keetae Park
机构: UCLA (加州大学洛杉矶分校); Columbia University (哥伦比亚大学); Brookhaven National Laboratory (布鲁克海文国家实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 12 figures, 8 tables
Abstract:Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.
zh
[CV-51] MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation
【速读】: This paper addresses the insufficient accuracy and robustness of retinal vessel segmentation caused by extremely thin, branching vascular structures whose global morphology varies greatly across images. The key innovations of the proposed MM-UNet architecture are: 1) Morph Mamba Convolution layers that replace pointwise convolutions, enhancing perception of branching topology via morphology-aware, state-adaptive feature sampling; and 2) Reverse Selective State Guidance modules that combine reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Experiments on two public datasets show F1-score gains of 1.64% on DRIVE and 1.25% on STARE, clearly surpassing existing methods.
链接: https://arxiv.org/abs/2511.02193
作者: Jiawen Liu,Yuanbo Zeng,Jiaming Liang,Yizhen Yang,Yiheng Zhang,Enhui Cai,Xiaoqi Sheng,Hongmin Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper was accepted by IEEE BIBM 2025 conference
Abstract:Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64% on DRIVE and 1.25% on STARE, demonstrating its effectiveness and advancement. The project code is public via this https URL.
zh
[CV-52] Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models ICCV2025
【速读】: This paper tackles the grounded video question answering (GVQA) challenges of complex multimodal reasoning, visual grounding, and temporal object tracking. The key of the proposed three-stage pipeline is the "trigger moment" produced by the designed CORTEX prompt: the single frame in which the target object is most visible, which serves as a robust anchor for the subsequent grounding and tracking stages. This significantly improves GVQA performance, achieving a HOTA score of 0.4968, a large improvement over last year's winning score of 0.2704.
链接: https://arxiv.org/abs/2511.02182
作者: Jinhwan Seo,Yoonki Cho,Junhyug Noh,Sung-eui Yoon
机构: KAIST(韩国科学技术院); Ewha Womans University(梨花女子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1st place winner of the Grounded VideoQA track at the ICCV 2025 Perception Test
Abstract:In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. To this end, we achieve a HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
zh
[CV-53] Autobiasing Event Cameras for Flickering Mitigation
【速读】: This paper addresses the impact of flicker, caused by rapid light-intensity variations, on event camera performance, especially adaptation across lighting environments. The key is an innovative autonomous bias-tuning mechanism: a convolutional neural network (CNN) detects flicker in the spatial domain and dynamically adjusts the event camera's internal bias settings, effectively suppressing flicker from 25 Hz to 500 Hz without additional hardware or software filtering. The method clearly improves YOLO confidence metrics and the fraction of frames with detected faces in a face detection task, while reducing the average gradient (an edge-based flicker indicator) by 38.2% in well-lit and 53.6% in low-light conditions, confirming robustness and effectiveness under challenging lighting.
链接: https://arxiv.org/abs/2511.02180
作者: Mehdi Sefidgar Dilmaghani,Waseem Shariff,Cian Ryan,Joe Lemley,Peter Corcoran
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range (25 Hz to 500 Hz). Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event camera's inherent bias settings. Utilizing a simple Convolutional Neural Network (CNN), the system identifies instances of flicker in the spatial domain and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
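The "average gradient" indicator mentioned above is simply the mean edge magnitude of a frame; flicker adds spurious edges and inflates it. A generic sketch (not the authors' code):

```python
import numpy as np

def average_gradient(frame):
    """Mean gradient magnitude of a 2D event-count or intensity frame.
    Flicker introduces spurious edges, inflating this value."""
    gy, gx = np.gradient(frame.astype(float))
    return np.mean(np.hypot(gx, gy))
```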
zh
[CV-54] Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis
【速读】: This paper addresses accurate quantification of pavement crack width, where conventional approaches are limited by complex, non-uniform crack boundary morphology and lack fast measurement from arbitrary pixel locations. The key is a cascaded framework combining Principal Component Analysis (PCA) and Robust PCA (RPCA) in three steps: initial crack segmentation with established detection algorithms to obtain a binary image; PCA to determine the primary orientation axis of quasi-parallel cracks; and RPCA to extract the Main Propagation Axis (MPA) of irregular crack geometries. Evaluations on three public datasets show superior computational efficiency and measurement accuracy compared with existing state-of-the-art techniques.
链接: https://arxiv.org/abs/2511.02144
作者: Zhicheng Wang,Junbiao Pang
机构: Beijing University of Technology (北京工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:
Abstract:Accurate quantification of pavement crack width plays a pivotal role in assessing structural integrity and guiding maintenance interventions. However, achieving precise crack width measurements presents significant challenges due to: (1) the complex, non-uniform morphology of crack boundaries, which limits the efficacy of conventional approaches, and (2) the demand for rapid measurement capabilities from arbitrary pixel locations to facilitate comprehensive pavement condition evaluation. To overcome these limitations, this study introduces a cascaded framework integrating Principal Component Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from digital images. The proposed methodology comprises three sequential stages: (1) initial crack segmentation using established detection algorithms to generate a binary representation, (2) determination of the primary orientation axis for quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations were conducted across three publicly available datasets, demonstrating that the proposed approach achieves superior performance in both computational efficiency and measurement accuracy compared to existing state-of-the-art techniques.
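As an illustration of the PCA stage, the sketch below fits the primary orientation axis to the foreground pixels of a binary crack mask and reads a width proxy from the spread along the minor axis. This is a simplified rendering of the idea; the paper's cascaded PCA/RPCA pipeline and MPA extraction are more elaborate.

```python
import numpy as np

def crack_axis_and_width(mask):
    """mask: 2D boolean array from a crack segmentation step.
    Returns the major-axis direction and a crude width estimate."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    # Principal axes from the 2x2 covariance of pixel coordinates
    cov = np.cov(pts.T)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    major, minor = eigvecs[:, 1], eigvecs[:, 0]
    # Width proxy: spread of projections onto the minor axis
    proj = pts @ minor
    width = proj.max() - proj.min()
    return major, width
```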
zh
[CV-55] From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera
【速读】: This paper addresses automated tracing of chamber growth in planktonic foraminifera, where traditional practice relies on manual segmentation of each chamber and is slow and subjective. The key is an end-to-end automated pipeline that integrates instance segmentation with a dedicated chamber-ordering algorithm to reconstruct three-dimensional growth trajectories from high-resolution computed tomography (CT) data. By evaluating instance segmentation models optimized for distinct spatial features of the chambers and combining them with a robust ordering mechanism, the pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy; even when small chambers are under-segmented due to limited voxel fidelity, developmental trajectories are still recovered consistently. This is the first fully automated, reproducible pipeline for foraminiferal growth analysis, laying the foundation for large-scale, data-driven ecological studies.
链接: https://arxiv.org/abs/2511.02142
作者: Huahua Lin,Xiaohao Cai,Mark Nixon,James M. Mulqueeney,Thomas H. G. Ezard
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time-consuming and subjective. In this study, we propose an end-to-end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three-dimensional growth trajectories from high-resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth-order reconstruction accuracy. Experimental results on expert-annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under-segmentation in smaller chambers due to reduced voxel fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation. This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large-scale, data-driven ecological studies.
zh
[CV-56] A Step Toward World Models: A Survey on Robotic Manipulation
【速读】: This paper addresses the gap that autonomous agents performing advanced tasks (manipulation, navigation, decision-making) in complex, dynamic, and uncertain environments lack an understanding of the world's underlying mechanisms and dynamics, leaving them stuck at reactive control or simple state replication. The core challenge is that the definition, scope, architectures, and essential capabilities of world models remain ambiguous, and their generality and practicality for robotic manipulation need advancing. The key is a systematic review of robotic manipulation methods that exhibit the core capabilities of world models, i.e., integrated perception, prediction, and control, distilling the components and functions a real world model should possess, and on that basis outlining a roadmap toward generalizable, practical world models.
链接: https://arxiv.org/abs/2511.02097
作者: Peng-Fei Zhang,Ying Cheng,Xiaofan Sun,Shijie Wang,Lei Zhu,Heng Tao Shen
机构: Tongji University (同济大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 5 figures
Abstract:Autonomous agents are increasingly expected to operate in complex, dynamic, and uncertain environments, performing tasks such as manipulation, navigation, and decision-making. Achieving these capabilities requires agents to understand the underlying mechanisms and dynamics of the world, moving beyond purely reactive control or simple replication of observed states. This motivates the development of world models as internal representations that encode environmental states, capture dynamics, and enable prediction, planning, and reasoning. Despite growing interest, the definition, scope, architectures, and essential capabilities of world models remain ambiguous. In this survey, rather than directly imposing a fixed definition and limiting our scope to methods explicitly labeled as world models, we examine approaches that exhibit the core capabilities of world models through a review of methods in robotic manipulation. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a real world model should possess. Building on this analysis, we aim to outline a roadmap for developing generalizable and practical world models for robotics.
zh
[CV-57] Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study
【速读】: This paper addresses the accuracy of markerless, depth-only augmented reality (AR) registration in real operative settings, particularly for small or low-curvature anatomies such as the feet, ear, and lower leg. The key is a depth-only, markerless AR registration pipeline on a head-mounted display (HMD), comprising (i) depth-bias correction, (ii) brief human-in-the-loop initialization, and (iii) a combination of global and local registration. In live surgery the method achieves a mean per-point error around the 3.9 mm median level, markedly improving registration accuracy on small, low-feature regions and thus the clinical readiness of markerless AR navigation.
链接: https://arxiv.org/abs/2511.02086
作者: Yue Yang,Fabian Necker,Christoph Leuze,Michelle Chen,Andrey Finegersh,Jake Lee,Vasu Divi,Bruce Daniel,Brian Hargreaves,Jie Ying Wu,Fred M Baik
机构: Stanford University (斯坦福大学); Vanderbilt University (范德堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing "skin-to-bone" relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data points per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Δd| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with ≤5 mm coverage of 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Δ median ~1.1 mm; p < 0.001). Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.
zh
[CV-58] xt-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis
【速读】: This paper addresses the tedium and difficulty of building large-scale scene-text visual question answering (text-VQA) datasets through manual annotation. The key is an end-to-end automated synthesis pipeline that produces faithful, high-quality question-answer (QA) pairs from scene text in images, integrating OCR detection and recognition (text spotting), region-of-interest (ROI) detection, caption generation, and question generation into a cohesive pipeline for efficient synthesis and validation. This significantly improves dataset scale and scalability, ultimately yielding a large-scale text-VQA dataset of about 72K QA pairs over about 44K images.
链接: https://arxiv.org/abs/2511.02046
作者: Soham Joshi,Shwet Kamal Mishra,Viswanath Gopalakrishnan
机构: International Institute of Information Technology Bangalore (印度国际信息科技研究所班加罗尔)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: First two authors contributed equally
Abstract:Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.
zh
[CV-59] StrengthSense: A Dataset of IMU Signals Capturing Everyday Strength-Demanding Activities
【速读】: This paper addresses the lack of comprehensive wearable-sensor datasets covering strength-demanding activities for research on muscular strength, endurance, and power monitoring. The key is the construction and public release of StrengthSense, a dataset of signals from 10 inertial measurement units (IMUs) covering 11 strength-demanding activities (such as sit-to-stand, climbing stairs, and mopping) plus 2 non-strength-demanding activities, collected from 29 healthy subjects and annotated against video recordings, with full documentation of data collection, preprocessing, and technical validation, providing a reliable basis for human activity recognition algorithm development and health-monitoring applications.
链接: https://arxiv.org/abs/2511.02027
作者: Zeyu Yang,Clayton Souza Leite,Yu Xiao
机构: Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Tracking strength-demanding activities with wearable sensors like IMUs is crucial for monitoring muscular strength, endurance, and power. However, there is a lack of comprehensive datasets capturing these activities. To fill this gap, we introduce StrengthSense, an open dataset that encompasses IMU signals capturing 11 strength-demanding activities, such as sit-to-stand, climbing stairs, and mopping. For comparative purposes, the dataset also includes 2 non-strength demanding activities. The dataset was collected from 29 healthy subjects utilizing 10 IMUs placed on limbs and the torso, and was annotated using video recordings as references. This paper provides a comprehensive overview of the data collection, pre-processing, and technical validation. We conducted a comparative analysis between the joint angles estimated by IMUs and those directly extracted from video to verify the accuracy and reliability of the sensor data. Researchers and developers can utilize StrengthSense to advance the development of human activity recognition algorithms, create fitness and health monitoring applications, and more.
zh
[CV-60] owards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images
【速读】: This paper addresses the accuracy and efficiency of detecting Protected Health Information (PHI) in medical images, which is critical for patient privacy and regulatory compliance. Traditional approaches pair optical character recognition (OCR) with named entity recognition (NER) but suffer from limited text-extraction accuracy and semantic understanding. The key is to leverage large multimodal models (LMMs) for better OCR and semantic analysis: a systematic benchmark of the closed- and open-source models GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B under two pipeline configurations, text analysis alone versus OCR plus semantic analysis, shows that LMMs clearly outperform conventional OCR tools (WER 0.03-0.05, CER 0.02-0.03), with the largest PHI-detection gains on complex imprint patterns; when text regions are clear and high-contrast, the two configurations perform similarly. The study concludes with empirically grounded LMM selection recommendations under operational constraints and a modular, scalable deployment strategy.
链接: https://arxiv.org/abs/2511.02014
作者: Tuan Truong,Guillermo Jimenez Perez,Pedro Osorio,Matthias Lenga
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ISBI 2026
Abstract:The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Model (LMM) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-sourced LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMM exhibits superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are well readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
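WER and CER are edit-distance rates over words and characters, respectively. A minimal sketch independent of any particular OCR library (WER reuses the same distance on token lists):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via a rolling 1D dynamic program."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(reference, hypothesis):
    """Character error rate: edits per reference character."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word error rate: the same distance over whitespace tokens."""
    return cer(reference.split(), hypothesis.split())
```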
zh
[CV-61] Locally-Supervised Global Image Restoration
【速读】: This paper addresses image reconstruction from incomplete measurements, encompassing both upsampling and inpainting. Supervised methods require fully sampled ground truth, while self-supervised methods accept incomplete ground truth but typically rely on random sampling that covers the whole image in expectation. For fixed, deterministic sampling patterns whose coverage is incomplete even in expectation, the key is to exploit multiple invariances of the underlying image distribution, which in theory allows reconstruction performance matching fully supervised approaches while requiring substantially less complete ground truth; the method is validated on optical-resolution image upsampling in photoacoustic microscopy (PAM).
链接: https://arxiv.org/abs/2511.01998
作者: Benjamin Walder,Daniel Toader,Robert Nuster,Günther Paltauf,Peter Burgholzer,Gregor Langer,Lukas Krainer,Markus Haltmeier
机构: Universität Innsbruck (因斯布鲁克大学); Universität Graz (格拉茨大学); Research Center for Non Destructive Testing GmbH (无损检测研究中心有限公司); Prospective Instruments LK OG (前瞻性仪器有限责任公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
Abstract:We address the problem of image reconstruction from incomplete measurements, encompassing both upsampling and inpainting, within a learning-based framework. Conventional supervised approaches require fully sampled ground truth data, while self-supervised methods allow incomplete ground truth but typically rely on random sampling that, in expectation, covers the entire image. In contrast, we consider fixed, deterministic sampling patterns with inherently incomplete coverage, even in expectation. To overcome this limitation, we exploit multiple invariances of the underlying image distribution, which theoretically allows us to achieve the same reconstruction performance as fully supervised approaches. We validate our method on optical-resolution image upsampling in photoacoustic microscopy (PAM), demonstrating competitive or superior results while requiring substantially less ground truth data.
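The idea of substituting distribution invariances for missing coverage can be sketched in a few lines. Below is a toy training loop in the spirit of equivariant imaging, under assumed ingredients, a fixed row-sampling mask, a small CNN, and a shift invariance; the paper's actual operators and losses may differ.

```python
# Toy invariance-based self-supervision for a fixed, deterministic sampling
# mask. The mask, network, and shift group are illustrative assumptions.
import torch

torch.manual_seed(0)
mask = torch.zeros(1, 1, 16, 16)
mask[..., ::4, :] = 1.0                      # fixed row sampling, never full coverage
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.rand(8, 1, 16, 16)                 # stand-in images
y = mask * x                                  # only incomplete measurements are kept

for _ in range(5):
    xhat = net(y)
    # Measurement consistency on the observed rows.
    loss_mc = ((mask * xhat - y) ** 2).mean()
    # Invariance: a shifted reconstruction should survive re-sampling + re-reconstruction.
    shift = int(torch.randint(1, 4, (1,)))
    xs = torch.roll(xhat, shifts=shift, dims=2)
    loss_eq = ((net(mask * xs) - xs) ** 2).mean()
    loss = loss_mc + loss_eq
    opt.zero_grad(); loss.backward(); opt.step()
```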
[CV-62] Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1 Sentinel-2 and Planetscope for end-users
【Quick Read】: This paper asks whether Geo-Foundational Models (GFMs) actually outperform traditional models such as U-Net for flood inundation mapping, a question left open by unclear performance across sensors and data-availability scenarios, which makes model selection difficult for end-users. The key contribution is a systematic comparison of three GFMs (Prithvi 2.0, Clay V1.5, DOFA) plus the UViT variant against TransNorm, U-Net, and Attention U-Net on PlanetScope, Sentinel-1, and Sentinel-2 imagery, combined with leave-one-region-out cross-validation, few-shot experiments, and computational-cost analysis. Clay performs best in most settings while being far smaller (26M parameters) and faster at inference, showing that GFMs can match or modestly improve accuracy while substantially cutting labeling cost and compute.
Link: https://arxiv.org/abs/2511.01990
Authors: Saurabh Kaushik, Lalit Maurya, Elizabeth Tellman, ZhiJie Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate three GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT (a Prithvi variant), against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay’s superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
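For readers reproducing the evaluation protocol, a minimal mIoU helper wrapped in a leave-one-region-out loop might look like the following; region data and model training are stubbed, and the region names are placeholders.

```python
# Minimal sketch of the evaluation protocol: per-class IoU averaged into mIoU,
# inside a leave-one-region-out loop. Training is stubbed out.
import numpy as np

def miou(pred: np.ndarray, target: np.ndarray, n_classes: int = 2) -> float:
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

rng = np.random.default_rng(0)
regions = {r: (rng.integers(0, 2, (8, 8)), rng.integers(0, 2, (8, 8)))
           for r in ["A", "B", "C"]}          # (prediction, label) per region

for held_out in regions:
    # train on all regions except `held_out` (stubbed), then evaluate on it
    pred, target = regions[held_out]
    print(held_out, round(miou(pred, target), 3))
```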
[CV-63] Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models
【Quick Read】: This paper tackles the lack of fine-grained explainability in personalized image generation models: existing natural-language explanation approaches are too coarse to pinpoint the multiple aspects of personalization and the degree of each. The key solution is FineXL, a technique that produces a natural-language description for each distinct aspect of personalization together with a quantitative score indicating the level of personalization in that aspect, enabling precise and measurable natural-language explainability.
Link: https://arxiv.org/abs/2511.01932
Authors: Haoming Wang, Wei Gao
Affiliations: University of Pittsburgh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Image generation models are usually personalized in practical uses in order to better meet the individual users’ heterogeneous needs, but most personalized models lack explainability about how they are being personalized. Such explainability can be provided via visual features in generated images, but is difficult for human users to understand. Explainability in natural language is a better choice, but the existing approaches to explainability in natural language are limited to be coarse-grained. They are unable to precisely identify the multiple aspects of personalization, as well as the varying levels of personalization in each aspect. To address such limitation, in this paper we present a new technique, namely FineXL, towards Fine-grained eXplainability in natural Language for personalized image generation models. FineXL can provide natural language descriptions about each distinct aspect of personalization, along with quantitative scores indicating the level of each aspect of personalization. Experiment results show that FineXL can improve the accuracy of explainability by 56%, when different personalization scenarios are applied to multiple types of image generation models.
[CV-64] Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound
【Quick Read】: This paper examines why generic vision foundation models such as DINOv3 lose discriminative power in fetal ultrasound imaging under low inter-class variability, especially for fetal brain standard planes (transthalamic TT, transventricular TV, and transcerebellar TC) whose anatomy is highly similar, making anatomical locations hard to distinguish reliably. The key solution is self-supervised pretraining on dedicated in-domain data: the authors assemble FetalUS-188K, a multicenter benchmark of 188,000 annotated images, and pretrain DINOv3 on it so the learned representations become sensitive to ultrasound-specific echogenic cues and subtle structural differences. Experiments show that domain-adaptive pretraining improves weighted F1-score by up to 20 percent, demonstrating that domain-specific pretraining is essential for clinically reliable representations of fetal brain ultrasound.
Link: https://arxiv.org/abs/2511.01915
Authors: Edoardo Conti, Riccardo Rosati, Lorenzo Federici, Adriano Mancini, Maria Chiara Fiorentin
Affiliations: Università Politecnica delle Marche
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on fetal brain standard planes–transthalamic (TT), transventricular (TV), and transcerebellar (TC)–which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment. Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights. Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV. Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.
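The two adaptation protocols (linear probing with a frozen backbone vs. full fine-tuning) reduce to a small difference in which parameters receive gradients. A minimal sketch, with a stand-in backbone rather than DINOv3 itself:

```python
# Sketch of the two adaptation protocols; the backbone is a stand-in module.
import torch

backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
head = torch.nn.Linear(128, 3)   # TT / TV / TC standard planes

def make_optimizer(linear_probe: bool):
    if linear_probe:
        for p in backbone.parameters():
            p.requires_grad_(False)          # freeze the pretrained features
        return torch.optim.Adam(head.parameters(), lr=1e-3)
    params = list(backbone.parameters()) + list(head.parameters())
    return torch.optim.Adam(params, lr=1e-4)  # full fine-tuning

opt = make_optimizer(linear_probe=True)
x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 3, (16,))
loss = torch.nn.functional.cross_entropy(head(backbone(x)), y)
opt.zero_grad(); loss.backward(); opt.step()
```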
[CV-65] iFlyBot-VLA Technical Report
【Quick Read】: This paper addresses the insufficient fusion of high-dimensional multimodal information and the lack of effective cross-modal alignment in Vision-Language-Action (VLA) models for robotic manipulation. The key solution is a dual-level action representation framework that jointly supervises the vision-language model (VLM) and an action expert during training, learning two complementary action forms: latent actions, pretrained on cross-embodiment manipulation videos to capture high-level intent, and structured discrete action tokens obtained via frequency-domain transforms of continuous control signals to encode low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, lets the VLM contribute directly to action generation, and markedly improves 3D perception and reasoning in complex real-world scenarios.
Link: https://arxiv.org/abs/2511.01914
Authors: Yuan Zhang, Chenyu Xue, Wenjie Xu, Chao Ji, Jiajia wu, Jia Pan
Affiliations: iFlytek
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
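The structured discrete action tokens can be illustrated with a frequency-domain round trip: transform a continuous control trajectory with a DCT, keep the low frequencies, quantize them into a small token vocabulary, and invert. The cutoff and bin counts below are illustrative assumptions, not the paper's values.

```python
# Toy frequency-domain tokenization of a continuous control trajectory.
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
traj = np.sin(np.linspace(0, 3 * np.pi, 64)) + 0.05 * rng.standard_normal(64)

coeffs = dct(traj, norm="ortho")
kept = coeffs[:16]                                  # low-frequency content only
edges = np.linspace(kept.min(), kept.max(), 257)    # 256-way uniform quantizer
tokens = np.clip(np.digitize(kept, edges) - 1, 0, 255)   # discrete action tokens

# Decoding: map tokens back to bin centers and invert the DCT.
centers = (edges[:-1] + edges[1:]) / 2
rec = np.zeros_like(coeffs)
rec[:16] = centers[tokens]
traj_rec = idct(rec, norm="ortho")
print("mean reconstruction error:", float(np.abs(traj - traj_rec).mean()))
```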
[CV-66] An unscented Kalman filter method for real time input-parameter-state estimation
Link: https://arxiv.org/abs/2511.02717
Authors: Marios Impraimakis, Andrew W. Smyth
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY)
Comments: author-accepted manuscript (AAM) published in Mechanical Systems and Signal Processing
[CV-67] Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback
【Quick Read】: This paper addresses the gap between automated medical image segmentation and clinical accuracy standards, where existing refinement strategies demand heavy user interaction or fully supervised annotations for training. The key contribution is SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework with a novel loss that learns to correct initial mask predictions from light feedback only, namely region-wise quality scores and over/under-segmentation error labels, drastically reducing dense-annotation requirements and annotation time while matching existing refinement methods on humerus CT scans.
Link: https://arxiv.org/abs/2511.02576
Authors: Alix de Langlais, Benjamin Billot, Théo Aguilar Vidal, Marc-Olivier Gauci, Hervé Delingette
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time. Our code is available at: this https URL.
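To give a feel for what "light feedback" supervision can look like, here is a loose sketch of a loss driven by region-wise quality scores and over/under-segmentation labels; SCORE's actual loss differs in its details, so treat the terms below as assumptions.

```python
# Loose sketch of weak supervision from region scores and over/under labels.
import torch

refined = torch.rand(4, 1, 32, 32, requires_grad=True)   # refined soft masks
initial = torch.rand(4, 1, 32, 32)                        # initial predictions
quality = torch.tensor([0.9, 0.4, 0.7, 0.2])              # region quality scores
over_seg = torch.tensor([0.0, 1.0, 0.0, 1.0])             # 1 = over-segmented region

area_delta = (refined - initial).mean(dim=(1, 2, 3))
# Over-segmented regions should shrink; under-segmented ones should grow.
loss_dir = (over_seg * torch.relu(area_delta)
            + (1 - over_seg) * torch.relu(-area_delta)).mean()
# High-quality regions should stay close to the initial prediction.
loss_keep = (quality * ((refined - initial) ** 2).mean(dim=(1, 2, 3))).mean()
loss = loss_dir + loss_keep
loss.backward()
print(float(loss))
```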
[CV-68] A Kullback-Leibler divergence method for input-system-state identification
【Quick Read】: This paper addresses the ambiguity in input-parameter-state estimation that arises when different initial parameter guesses yield different results. The key idea is to use the Kullback-Leibler divergence, within a Kalman filter framework, to compare the posterior distributions obtained from different initial parameter sets against their priors, and to select the estimation run best supported by the data. By quantifying the information gained in going from prior to posterior, the method automatically picks the most plausible identification and is shown to be robust and effective for linear and nonlinear systems and in limited-information settings.
Link: https://arxiv.org/abs/2511.02426
Authors: Marios Impraimakis
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Systems and Control (eess.SY)
Comments: 32 pages, 17 figures, published in Journal of Sound and Vibration
Abstract:The capability of a novel Kullback-Leibler divergence method is examined herein within the Kalman filter framework to select the input-parameter-state estimation execution with the most plausible results. This identification suffers from the uncertainty related to obtaining different results from different initial parameter set guesses, and the examined approach uses the information gained from the data in going from the prior to the posterior distribution to address the issue. Firstly, the Kalman filter is performed for a number of different initial parameter sets providing the system input-parameter-state estimation. Secondly, the resulting posterior distributions are compared simultaneously to the initial prior distributions using the Kullback-Leibler divergence. Finally, the identification with the least Kullback-Leibler divergence is selected as the one with the most plausible results. Importantly, the method is shown to select the better performed identification in linear, nonlinear, and limited information applications, providing a powerful tool for system monitoring.
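The selection rule itself is easy to sketch: compute the Gaussian KL divergence from each run's posterior to its prior and keep the run with the smallest value. The Kalman filter is stubbed below; only the divergence and the selection step are shown.

```python
# KL-based selection among filter runs started from different initial guesses.
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for multivariate normals."""
    k = mu0.size
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

rng = np.random.default_rng(0)
runs = []
for guess in [0.5, 1.0, 2.0]:                     # initial parameter sets (stub)
    mu_prior, cov_prior = np.array([guess, 0.0]), np.eye(2)
    # Stand-in for the posterior a Kalman filter would return for this guess.
    mu_post = mu_prior + rng.normal(0, 0.3, 2)
    cov_post = 0.5 * np.eye(2)
    runs.append((gaussian_kl(mu_post, cov_post, mu_prior, cov_prior), guess))

kl, best = min(runs)
print(f"selected initial guess {best} with KL {kl:.3f}")
```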
[CV-69] MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
【Quick Read】: This paper targets the limited clinical reliability of AI for mammography caused by pronounced heterogeneity across public datasets in data quality, metadata standards, and population distributions, which induces dataset-specific biases that undermine generalization. The key solution is MammoClean, a framework that standardizes breast-imaging data, unifying case selection, image processing (including laterality and intensity correction), and metadata into a consistent multi-view structure, while quantifying sources of bias. Applying it to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo) reveals substantial distribution shifts in breast density and abnormality prevalence, and shows directly that models trained on corrupted data degrade markedly compared with their curated counterparts. MammoClean thus provides a reproducible, bias-aware pipeline for building robust models with cross-domain generalization, advancing fair, safe, and effective AI for breast-cancer screening.
Link: https://arxiv.org/abs/2511.02400
Authors: Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed, Mohamed Mabrok
Affiliations: Qatar University; Northwestern University; University of Hyogo
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: this https URL.
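Two of the standardization steps, laterality correction and intensity normalization, can be sketched as follows; the orientation heuristic and percentile window are illustrative assumptions, not MammoClean's exact procedure.

```python
# Toy laterality + intensity standardization for a single mammogram array.
import numpy as np

def correct_laterality(img: np.ndarray) -> np.ndarray:
    """Mirror the image if most tissue mass sits on the right half."""
    half = img.shape[1] // 2
    return img if img[:, :half].sum() >= img[:, half:].sum() else img[:, ::-1]

def normalize_intensity(img: np.ndarray) -> np.ndarray:
    """Robust min-max normalization between the 1st and 99th percentiles."""
    lo, hi = np.percentile(img, [1, 99])
    return np.clip((img - lo) / (hi - lo + 1e-8), 0.0, 1.0)

raw = np.random.rand(64, 64) * 4000.0     # stand-in for a raw DICOM pixel array
clean = normalize_intensity(correct_laterality(raw))
print(clean.min(), clean.max())
```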
[CV-70] High-Resolution Magnetic Particle Imaging System Matrix Recovery Using a Vision Transformer with Residual Feature Network
【Quick Read】: This paper addresses the degraded system-matrix resolution in Magnetic Particle Imaging (MPI) caused by downsampling and coil-sensitivity variations, which in turn hurts image reconstruction quality. The core solution is VRF-Net, a hybrid deep learning framework combining a Vision Transformer with a residual feature network: transformer-based global attention captures large-scale structure while a residual convolutional branch refines fine detail, enabling high-fidelity system-matrix super-resolution. The key innovation is pairing the long-range dependency modeling of transformers with residual-learning refinement, yielding accurate and robust system-matrix recovery under realistic MPI conditions.
Link: https://arxiv.org/abs/2511.02212
Authors: Abuobaida M.Khair, Wenjing Jiang, Yousuf Babiker M. Osman, Wenjun Xia, Xiaopeng Ma
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:This study presents a hybrid deep learning framework, the Vision Transformer with Residual Feature Network (VRF-Net), for recovering high-resolution system matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from downsampling and coil sensitivity variations. VRF-Net addresses these challenges by combining transformer-based global attention with residual convolutional refinement, enabling recovery of both large-scale structures and fine details. To reflect realistic MPI conditions, the system matrix is degraded using a dual-stage downsampling strategy. Training employed paired-image super-resolution on the public Open MPI dataset and a simulated dataset incorporating variable coil sensitivity profiles. For system matrix recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08 dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at challenging scale 8x (pSNR = 31.06 dB, SSIM = 0.717). For the simulated dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x scaling, with stable performance at higher scales. On average, it reduced nRMSE by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over interpolation and CNN-based methods. In image reconstruction of Open MPI phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960), outperforming existing methods. These findings demonstrate that VRF-Net enables sharper, artifact-free system matrix recovery and robust image reconstruction across multiple scales, offering a promising direction for future in vivo applications.
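For reference, the nRMSE and pSNR figures quoted above can be computed along the following lines; normalization conventions vary across papers, so treat this as one plausible reading rather than the authors' exact definitions.

```python
# Plausible implementations of the reported image-quality metrics.
import numpy as np

def nrmse(pred: np.ndarray, ref: np.ndarray) -> float:
    rmse = np.sqrt(np.mean((pred - ref) ** 2))
    return float(rmse / (ref.max() - ref.min() + 1e-12))   # range-normalized

def psnr(pred: np.ndarray, ref: np.ndarray) -> float:
    mse = np.mean((pred - ref) ** 2)
    return float(10 * np.log10(ref.max() ** 2 / (mse + 1e-12)))

ref = np.random.rand(64, 64)
pred = ref + 0.01 * np.random.randn(64, 64)
print(f"nRMSE={nrmse(pred, ref):.4f}, pSNR={psnr(pred, ref):.2f} dB")
```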
[CV-71] Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization
【Quick Read】: This paper addresses the high computational cost, large parameter space, and unstable training of end-to-end optimization for opto-electronic convolutional neural networks (OECNNs). The key solution is a two-stage design: first train a standard electronic CNN, then realize the optical front-end, implemented as a metasurface array, by directly optimizing against the first-layer convolution kernels. This reduces compute and memory demands by hundreds of times, stabilizes training, and doubles monocular depth-estimation accuracy compared with end-to-end training under the same time and resource constraints.
Link: https://arxiv.org/abs/2511.02065
Authors: Ali Almuallem, Harshana Weligampola, Abhiram Gnanasambandam, Wei Xu, Dilshan Godaliyadda, Hamid R. Sheikh, Stanley H. Chan, Qi Guo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN, then realize the optical front-end implemented as a metasurface array through direct kernel optimization of its first convolutional layer. This approach reduces computational and memory demands by hundreds of times and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.
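A toy version of the two-stage idea: take the trained first-layer kernels of an electronic CNN and fit a constrained "optical" kernel to each one directly, instead of optimizing the optics end to end. Here the optical constraint is simply non-negativity, a stand-in for a real metasurface model.

```python
# Direct kernel optimization sketch: project trained CNN kernels onto a
# feasible "optical" set via projected gradient descent. The non-negativity
# constraint is an illustrative stand-in for a metasurface forward model.
import numpy as np

rng = np.random.default_rng(0)
trained_kernels = rng.normal(size=(8, 5, 5))      # stage-1 output (stub)

def fit_optical(kernel: np.ndarray, steps: int = 200, lr: float = 0.1) -> np.ndarray:
    w = np.abs(kernel)                            # feasible starting point
    for _ in range(steps):
        grad = 2 * (w - kernel)                   # gradient of ||w - kernel||^2
        w = np.maximum(w - lr * grad, 0.0)        # projected gradient step
    return w

optical = np.stack([fit_optical(k) for k in trained_kernels])
err = np.linalg.norm(optical - trained_kernels) / np.linalg.norm(trained_kernels)
print(f"relative kernel mismatch: {err:.3f}")
```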
Artificial Intelligence
[AI-0] Neurosymbolic Deep Learning Semantics
【Quick Read】: This paper addresses the problem that current AI lacks semantics, which makes it hard to turn AI-driven scientific discovery into comprehensible scientific knowledge: deep learning and neurosymbolic AI lack general conditions guaranteeing desired properties, and existing encoding and knowledge-extraction approaches are case-specific, without a unifying framework. The key solution is to adopt logic as the formal framework: the authors introduce a semantic encoding framework that makes explicit the mapping between neural networks and logic and characterizes the common ingredients of existing approaches, thereby giving deep learning an interpretable, verifiable semantic foundation and moving AI-based science beyond the black box.
Link: https://arxiv.org/abs/2511.02825
Authors: Artur d’Avila Garcez, Simon Odense
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Artificial Intelligence (AI) is a powerful new language of science as evidenced by recent Nobel Prizes in chemistry and physics that recognized contributions to AI applied to those areas. Yet, this new language lacks semantics, which makes AI’s scientific discoveries unsatisfactory at best. With the purpose of uncovering new facts but also improving our understanding of the world, AI-based science requires formalization through a framework capable of translating insight into comprehensible scientific knowledge. In this paper, we argue that logic offers an adequate framework. In particular, we use logic in a neurosymbolic framework to offer a much needed semantics for deep learning, the neural network-based technology of current AI. Deep learning and neurosymbolic AI lack a general set of conditions to ensure that desirable properties are satisfied. Instead, there is a plethora of encoding and knowledge extraction approaches designed for particular cases. To rectify this, we introduced a framework for semantic encoding, making explicit the mapping between neural networks and logic, and characterizing the common ingredients of the various existing approaches. In this paper, we describe succinctly and exemplify how logical semantics and neural networks are linked through this framework, we review some of the most prominent approaches and techniques developed for neural encoding and knowledge extraction, provide a formal definition of our framework, and discuss some of the difficulties of identifying a semantic encoding in practice in light of analogous problems in the philosophy of mind.
[AI-1] Kosmos: An AI Scientist for Autonomous Discovery
【Quick Read】: This paper addresses the limitation that AI research agents lose reasoning coherence after a limited number of actions, capping the depth of their findings in data-driven scientific discovery. The key innovation of Kosmos, a new AI scientist, is a structured world model that shares information between a data-analysis agent and a literature-search agent, letting the system pursue a stated objective coherently across more than 200 agent rollouts in runs of up to 12 hours, executing on average 42,000 lines of code and reading 1,500 papers per run, while producing traceable scientific reports in which every statement is cited to code or primary literature.
Link: https://arxiv.org/abs/2511.02824
Authors: Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagorac, Timothy C. Orr, Miranda E. Orr, Kevin J. Zwezdaryk, Ali E. Ghareeb, Laurie McCoy, Bruna Gomes, Euan A. Ashley, Karen E. Duff, Tonio Buonassisi, Tom Rainforth, Randall J. Bateman, Michael Skarlinski, Samuel G. Rodriques, Michaela M. Hinks, Andrew D. White
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings. Here we present Kosmos, an AI scientist that automates data-driven discovery. Given an open-ended objective and a dataset, Kosmos runs for up to 12 hours performing cycles of parallel data analysis, literature search, and hypothesis generation before synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos uses a structured world model to share information between a data analysis agent and a literature search agent. The world model enables Kosmos to coherently pursue the specified objective over 200 agent rollouts, collectively executing an average of 42,000 lines of code and reading 1,500 papers per run. Kosmos cites all statements in its reports with code or primary literature, ensuring its reasoning is traceable. Independent scientists found 79.4% of statements in Kosmos reports to be accurate, and collaborators reported that a single 20-cycle Kosmos run performed the equivalent of 6 months of their own research time on average. Furthermore, collaborators reported that the number of valuable scientific findings generated scales linearly with Kosmos cycles (tested up to 20 cycles). We highlight seven discoveries made by Kosmos that span metabolomics, materials science, neuroscience, and statistical genetics. Three discoveries independently reproduce findings from preprinted or unpublished manuscripts that were not accessed by Kosmos at runtime, while four make novel contributions to the scientific literature.
[AI-2] Optimizing AI Agent Attacks With Synthetic Data
【Quick Read】: This paper addresses the difficulty of accurately estimating the risk of complex, high-stakes AI deployments, specifically, how to elicit strong attack policies for AI-control evaluations in compute-constrained, data-poor agentic environments. The key idea is to decompose attack capability into five constituent skills, suspicion modeling, attack selection, plan synthesis, execution, and subtlety, and optimize each component separately; to get around data scarcity, the authors build a probabilistic model of attack dynamics, tune hyperparameters in simulation, and show the optimized strategies transfer to SHADE-Arena, a dataset of diverse realistic control environments, driving the safety score from a 0.87 baseline down to 0.41 and thus substantially strengthening attacks.
Link: https://arxiv.org/abs/2511.02823
Authors: Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills – suspicion modeling, attack selection, plan synthesis, execution and subtlety – and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
[AI-3] Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
【Quick Read】: This paper addresses three limitations of current tabular in-context learning (ICL) models on real-world tables: (1) single-scale feature processing that misses hierarchical dependencies; (2) dense attention whose cost grows quadratically in table width; and (3) strictly sequential component processing that blocks iterative representation refinement and cross-component communication. The solution, Orion-MSP, rests on three key innovations: (1) multi-scale processing to model feature interactions at different granularities; (2) block-sparse attention combining windowed, global, and random patterns to keep long-range connectivity while scaling efficiently; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components, improving representation quality and training stability.
Link: https://arxiv.org/abs/2511.02818
Authors: Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at this https URL .
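The block-sparse attention pattern is easy to visualize as a boolean mask combining a sliding window, a few global tokens, and random long-range links. Sizes below are arbitrary and not Orion-MSP's configuration.

```python
# Sketch of a windowed + global + random block-sparse attention mask.
import numpy as np

def block_sparse_mask(n=32, window=4, n_global=2, n_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                   # local window
        mask[i, rng.choice(n, n_random, replace=False)] = True  # random links
    mask[:, :n_global] = True     # every token attends to the global tokens
    mask[:n_global, :] = True     # global tokens attend everywhere
    return mask

m = block_sparse_mask()
print(f"attention density: {m.mean():.2f}")   # well below 1.0 => sub-quadratic work
```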
[AI-4] Assessing win strength in MLB win prediction models
【Quick Read】: This paper studies how machine learning models can predict win probabilities for baseball games and whether those probabilities support real decisions such as run-line betting. The key contributions are twofold: first, a comprehensive set of machine learning models trained on a common dataset shows that predicted win probability is significantly related to actual score differential (i.e., win probability reflects win strength), so the model outputs are meaningful; second, an analysis of using predicted win probabilities for betting shows that appropriate strategies yield positive returns while naive use leads to significant losses, underscoring the importance of strategy design when applying such models.
Link: https://arxiv.org/abs/2511.02815
Authors: Morgan Allen, Paul Savala
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:In Major League Baseball, strategy and planning are major factors in determining the outcome of a game. Previous studies have aided this by building machine learning models for predicting the winning team of any given game. We extend this work by training a comprehensive set of machine learning models using a common dataset. In addition, we relate the win probabilities produced by these models to win strength as measured by score differential. In doing so we show that the most common machine learning models do indeed demonstrate a relationship between predicted win probability and the strength of the win. Finally, we analyze the results of using predicted win probabilities as a decision making mechanism on run-line betting. We demonstrate positive returns when utilizing appropriate betting strategies, and show that naive use of machine learning models for betting leads to significant losses.
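The basic link from a predicted win probability to a bet decision is an expected-value calculation. The sketch below uses decimal odds and an edge threshold; both conventions are illustrative, not the paper's setup.

```python
# Expected value of a unit-stake bet given a model's win probability.
def expected_value(p_win: float, decimal_odds: float, stake: float = 1.0) -> float:
    return p_win * (decimal_odds - 1.0) * stake - (1.0 - p_win) * stake

def should_bet(p_win: float, decimal_odds: float, margin: float = 0.05) -> bool:
    # Bet only when the predicted edge clears a safety margin.
    return expected_value(p_win, decimal_odds) > margin

print(expected_value(0.60, 1.90))   # 0.14 per unit staked
print(should_bet(0.52, 1.90))       # False: predicted edge is too thin
```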
[AI-5] TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
【Quick Read】: This paper addresses the practical deployment hurdles of tabular foundation models: heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. The key solution is TabTune, a unified library that standardizes the complete tabular foundation model workflow behind a single interface: it provides consistent access to seven state-of-the-art models, supports multiple adaptation strategies including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT), automates model-aware preprocessing to manage architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness, enabling reproducible, consistent benchmarking of adaptation strategies.
Link: https://arxiv.org/abs/2511.02802
Authors: Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models. The library is open source and available at this https URL .
[AI-6] When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning NEURIPS2025
【Quick Read】: This paper addresses the opacity of reasoning in multimodal large language models (MLLMs): it is often unclear how inter-modality conflicts affect the final decision, which modality dominates, and whether a single modality's high-confidence error can mislead the whole prediction ("modality sabotage"). The key solution is a lightweight, model-agnostic diagnostic evaluation layer that treats each modality as an agent producing a candidate label and a brief self-assessment; a simple fusion mechanism then aggregates these outputs and exposes contributors (modalities supporting the correct outcome) and saboteurs (modalities that mislead it), enabling systematic auditing and analysis of multimodal fusion dynamics.
Link: https://arxiv.org/abs/2511.02794
Authors: Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, Paul Pu Liang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at the Multimodal Algorithmic Reasoning (MAR) Workshop, NeurIPS 2025
Abstract:Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.
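A compact version of such a diagnostic layer: each modality "agent" reports a label plus a self-assessed confidence, a simple confidence-weighted vote produces the fused answer, and contributors and saboteurs are read off against the ground truth. The labels, confidences, and thresholds are invented for illustration.

```python
# Toy modality-as-agent fusion with contributor / saboteur attribution.
from collections import defaultdict

votes = {                        # (label, self-assessed confidence) per modality
    "audio": ("angry", 0.95),
    "video": ("neutral", 0.60),
    "text": ("neutral", 0.70),
}
truth = "neutral"

scores = defaultdict(float)
for label, conf in votes.values():
    scores[label] += conf
fused = max(scores, key=scores.get)

contributors = [m for m, (l, _) in votes.items() if l == truth]
saboteurs = [m for m, (l, c) in votes.items() if l != truth and c >= 0.9]
print(fused, contributors, saboteurs)   # flags "audio" as a high-confidence saboteur
```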
[AI-7] Measuring AI Diffusion: A Population-Normalized Metric for Tracking Global AI Usage
【Quick Read】: This paper addresses the difficulty of quantifying global AI diffusion, in particular the lack of population-normalized, cross-country usage data. The key solution is AI User Share, a new indicator estimating the fraction of each country's working-age population actively using AI tools, computed from anonymized Microsoft telemetry and corrected with device-access and mobile scaling factors. Covering 147 economies, the metric provides real-time, consistent insight into global AI diffusion and gives policymakers a data-driven benchmark.
Link: https://arxiv.org/abs/2511.02781
Authors: Amit Misra, Jane Wang, Scott McCullers, Kevin White, Juan Lavista Ferres
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures, 2 tables. Also available at this https URL
Abstract:Measuring global AI diffusion remains challenging due to a lack of population-normalized, cross-country usage data. We introduce AI User Share, a novel indicator that estimates the share of each country’s working-age population actively using AI tools. Built from anonymized Microsoft telemetry and adjusted for device access and mobile scaling, this metric spans 147 economies and provides consistent, real-time insight into global AI diffusion. We find wide variation in adoption, with a strong correlation between AI User Share and GDP. High uptake is concentrated in developed economies, though usage among internet-connected populations in lower-income countries reveals substantial latent demand. We also detect sharp increases in usage following major product launches, such as DeepSeek in early 2025. While the metric’s reliance solely on Microsoft telemetry introduces potential biases related to this user base, it offers an important new lens into how AI is spreading globally. AI User Share enables timely benchmarking that can inform data-driven AI policy.
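The metric's basic shape can be sketched as active users over the working-age population, with simple correction factors for device access and mobile-only usage. The formula and all numbers below are assumptions for illustration; the paper's adjustment model is more involved.

```python
# One plausible reading of a population-normalized AI-usage share.
def ai_user_share(active_users: float, working_age_pop: float,
                  device_access_rate: float, mobile_scale: float) -> float:
    # Scale observed telemetry users up for unobserved devices/mobile usage.
    adjusted_users = active_users * mobile_scale / device_access_rate
    return min(adjusted_users / working_age_pop, 1.0)

share = ai_user_share(active_users=2.5e6, working_age_pop=40e6,
                      device_access_rate=0.7, mobile_scale=1.3)
print(f"AI User Share: {share:.1%}")
```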
[AI-8] PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts
【Quick Read】: This paper tackles the inefficiency, error-proneness, and schedule pressure of manually writing Proof-of-Concept (PoC) exploits in smart contract security audits. The key solution is POCO, an agentic framework that automatically generates executable PoC exploits from auditors' natural-language vulnerability descriptions by interacting with a set of code-execution tools in a Reason-Act-Observe loop. Its outputs are compatible with the Foundry testing framework and can be integrated directly into audit reports and other security tools, substantially lowering the cost of producing high-quality PoCs and raising the level of automation.
Link: https://arxiv.org/abs/2511.02780
Authors: Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, Martin Monperrus
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Under review
Abstract:Smart contracts operate in a highly adversarial environment, where vulnerabilities can lead to substantial financial losses. Thus, smart contracts are subject to security audits. In auditing, proof-of-concept (PoC) exploits play a critical role by demonstrating to the stakeholders that the reported vulnerabilities are genuine, reproducible, and actionable. However, manually creating PoCs is time-consuming, error-prone, and often constrained by tight audit schedules. We introduce POCO, an agentic framework that automatically generates executable PoC exploits from natural-language vulnerability descriptions written by auditors. POCO autonomously generates PoC exploits in an agentic manner by interacting with a set of code-execution tools in a Reason-Act-Observe loop. It produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. We evaluate POCO on a dataset of 23 real-world vulnerability reports. POCO consistently outperforms the prompting and workflow baselines, generating well-formed and logically correct PoCs. Our results demonstrate that agentic frameworks can significantly reduce the effort required for high-quality PoCs in smart contract audits. Our contribution provides readily actionable knowledge for the smart contract security community.
[AI-9] STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation
【Quick Read】: This paper addresses three challenges in generative molecular design for drugs: learning broad, diverse molecular distributions over a vast chemical space; conditional generation that captures structure-property relationships; and fast molecule generation. The key solution is STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable Transformer-based latent-variable framework using SELFIES (Self-referencing Embedded Strings) to guarantee syntactic validity, with a principled conditional formulation in latent space: a property predictor supplies a conditioning signal applied consistently to the latent prior, the inference network, and the decoder, enabling property-guided generation. In addition, low-rank adapters (LoRA) allow efficient fine-tuning of both encoder and decoder, speeding adaptation when property and activity data are limited and showing that a modernized variational autoencoder remains competitive for large-scale molecular generation.
Link: https://arxiv.org/abs/2511.02769
Authors: Bum Chul Kwon, Ben Shapira, Moshiko Raboh, Shreyans Sethi, Shruti Murarka, Joseph A Morrone, Jianying Hu, Parthasarathy Suryanarayanan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 16 pages, 3 figures, 2 tables
Abstract:The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.
[AI-10] LLM-Supported Formal Knowledge Representation for Enhancing Control Engineering Content with an Interactive Semantic Layer
【Quick Read】: This paper addresses how to structure and formalize domain knowledge in control engineering, whose research output is growing rapidly, so as to improve both human readability and machine interpretability. The key solution is an LLM-supported, semi-automated method that transforms natural-language descriptions and mathematical definitions (available as LaTeX source) into a formal knowledge graph based on the Imperative Representation of Knowledge (PyIRK) framework, yielding representations that are both human-readable and machine-processable, and applying them first to generate an "interactive semantic layer" that enhances the source documents for knowledge transfer.
Link: https://arxiv.org/abs/2511.02759
Authors: Julius Fiedler (1), Carsten Knoll (2), Klaus Röbenack (1) ((1) Institute of Control Theory at TU Dresden, (2) Chair of Fundamentals of Electrical Engineering at TU Dresden)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 4 pages, 2 figures
Abstract:The rapid growth of research output in control engineering calls for new approaches to structure and formalize domain knowledge. This paper briefly describes an LLM-supported method for semi-automated generation of formal knowledge representations that combine human readability with machine interpretability and increased expressiveness. Based on the Imperative Representation of Knowledge (PyIRK) framework, we demonstrate how language models can assist in transforming natural-language descriptions and mathematical definitions (available as LaTeX source code) into a formalized knowledge graph. As a first application we present the generation of an "interactive semantic layer" to enhance the source documents in order to facilitate knowledge transfer. From our perspective this contributes to the vision of easily accessible, collaborative, and verifiable knowledge bases for the control engineering domain.
[AI-11] Using Span Queries to Optimize for Cache and Attention Locality
【Quick Read】: This paper addresses the problem that inference servers remain heavily optimized for chat completion and serve diverse non-chat workloads (retrieval-augmented generation (RAG), inference-time scaling, and agentic workloads) poorly: these workloads have different KV-cache locality and input-ordering requirements than chat, degrading performance on existing systems. The key solution is the span query, a general interface abstraction that models diverse inference tasks as expression trees of inference calls linked by commutativity constraints, unifying the execution logic of different workloads and enabling automatic optimization of KV-cache hit rates and attention locality. Experiments show that a small change to vLLM (492 lines) yields 10-20x TTFT (Time to First Token) reductions on two non-chat use cases, and that an attention-optimized span query lets a much smaller model vastly outperform the accuracy of a stock inference server.
Link: https://arxiv.org/abs/2511.02749
Authors: Paul Castro, Nick Mitchell, Nathan Ordonez, Thomas Parnell, Mudhakar Srivatsa, Antoni Viros i Martin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 17 figures
Abstract:Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, they offer solutions that are also optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show how the critical distinction that had been assumed by prior work lies in whether the order of the inputs matter – do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how they can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10-20x reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve attention locality, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2b parameter model vastly outperforms the accuracy of a stock inference server using an 8b model.
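The core data structure, an expression tree whose commutative children can be reordered, can be sketched as follows. The classes and the canonical-ordering rule are illustrative assumptions, not vLLM's or the paper's actual API.

```python
# Toy span query: an expression tree with commutativity constraints. Sorting
# commutative children gives equivalent queries the same serialized prefix,
# which is what improves KV-cache locality.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Span:
    text: str

@dataclass
class SpanQuery:
    children: List[Union["SpanQuery", Span]]
    commutative: bool = False   # True e.g. for RAG documents, False for chat turns

def normalize(node) -> str:
    if isinstance(node, Span):
        return node.text
    parts = [normalize(c) for c in node.children]
    if node.commutative:
        parts.sort()            # canonical order for order-insensitive inputs
    return "\n".join(parts)

q = SpanQuery([SpanQuery([Span("doc B"), Span("doc A")], commutative=True),
               Span("question: ...")])
print(normalize(q))             # "doc A" always precedes "doc B"
```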
[AI-12] Scalable Evaluation and Neural Models for Compositional Generalization NEURIPS
【Quick Read】: This paper addresses compositional generalization, the ability of models to predict novel combinations of known concepts, where current research faces two obstacles: the lack of standardized, rigorous evaluation protocols, which keeps benchmarks from measuring performance reliably, and general-purpose vision architectures lacking the needed inductive biases while existing remedies sacrifice scalability. The key contributions are: a rigorous evaluation framework that unifies and extends prior approaches while cutting computational cost from combinatorial to constant; a large-scale, modern empirical study training more than 5,000 supervised vision backbones to assess the state of the art; and Attribute Invariant Networks, a model class establishing a new Pareto frontier for compositional generalization with a 23.43% accuracy gain over baselines at only 16% parameter overhead.
Link: https://arxiv.org/abs/2511.02667
Authors: Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
Abstract:Compositional generalization-a key open challenge in modern machine learning-requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation on the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts.
[AI-13] In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization
【Quick Read】: This paper addresses catastrophic forgetting of implicit neural representations in continual learning, particularly for in situ neural compression, where reconstruction quality at high compression rates must be preserved under limited compute and memory buffers. The key solution is a novel in situ training protocol that keeps limited memory buffers of full and sketched data samples and uses the sketched data as a regularizer against forgetting; the theoretical motivation rests on the Johnson-Lindenstrauss lemma, which justifies sketches as effective regularizers. Experiments on complex multi-dimensional simulation data, long time horizons, and unstructured, non-Cartesian grids show strong reconstruction at high compression rates, with performance approximately matching the equivalent offline method.
Link: https://arxiv.org/abs/2511.02659
Authors: Cooper Simpson, Stephen Becker, Alireza Doostan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
Comments: 17 pages, 8 figures, 4 tables
Abstract:Focusing on implicit neural representations, we present a novel in situ training protocol that employs limited memory buffers of full and sketched data samples, where the sketched data are leveraged to prevent catastrophic forgetting. The theoretical motivation for our use of sketching as a regularizer is presented via a simple Johnson-Lindenstrauss-informed result. While our methods may be of wider interest in the field of continual learning, we specifically target in situ neural compression using implicit neural representation-based hypernetworks. We evaluate our method on a variety of complex simulation data in two and three dimensions, over long time horizons, and across unstructured grids and non-Cartesian geometries. On these tasks, we show strong reconstruction performance at high compression rates. Most importantly, we demonstrate that sketching enables the presented in situ scheme to approximately match the performance of the equivalent offline method.
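A minimal sketch-based regularizer in the spirit described above: keep a random projection ("sketch") of the network's outputs on earlier data and penalize drift on it, rather than storing the full samples. Dimensions and the network are arbitrary stand-ins.

```python
# Sketch-regularized continual training: the anti-forgetting term compares
# current outputs on old inputs against a stored low-dimensional sketch.
import torch

torch.manual_seed(0)
d, k = 4096, 64                        # full output size vs. sketch size
S = torch.randn(k, d) / k ** 0.5       # JL-style random sketching matrix

net = torch.nn.Linear(8, d)
coords_old = torch.rand(128, 8)        # inputs from an earlier time window
with torch.no_grad():
    sketch_old = net(coords_old) @ S.T # stored: 64 numbers per sample, not 4096

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
coords_new, target_new = torch.rand(128, 8), torch.rand(128, d)
for _ in range(5):
    loss_fit = ((net(coords_new) - target_new) ** 2).mean()
    loss_sketch = ((net(coords_old) @ S.T - sketch_old) ** 2).mean()  # anti-forgetting
    loss = loss_fit + loss_sketch
    opt.zero_grad(); loss.backward(); opt.step()
```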
[AI-14] Apriel-H1: Towards Efficient Enterprise Reasoning Models
【Quick Read】: This paper addresses the throughput and scalability limits of LLM inference caused by the quadratic time and memory complexity of multi-head attention (MHA) and its key-value caching requirements. The core solution is the Apriel-H1 family of hybrid architectures, obtained by incremental knowledge distillation from a pretrained reasoning transformer, progressively replacing less critical MHA layers with linear-complexity state space model (SSM) blocks such as Mamba. The key insight is that, thanks to the SSM's fixed-size hidden state and recurrent computation, the distilled hybrids retain reasoning quality while cutting resource use and delivering over 2x higher inference throughput in the production-ready vLLM engine, showing that distilled hybrid SSM-Transformer architectures strike a good balance between efficiency and reasoning quality.
Link: https://arxiv.org/abs/2511.02651
Authors: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra, Sebastien Paquet, Srinivas Sunkara, Valérie Bécaert, Sathwik Tejaswi Madhusudhan, Torsten Scholak
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.
[AI-15] Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
【Quick Read】: This paper addresses three core challenges of collaborative LLM deployment at the edge: privacy leakage, communication overhead, and computational bottlenecks. The key solution is Federated Attention (FedAttn), which embeds the federated learning (FL) paradigm into the self-attention mechanism to form a new distributed LLM inference framework: participants perform self-attention locally over their own token representations and periodically exchange and aggregate Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating responses without exposing private prompts. The approach jointly targets privacy protection, communication efficiency, and computational efficiency; moreover, by revealing a structural duality between contextual-representation refinement in FedAttn and parameter optimization in FL, it lays a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference.
Link: https://arxiv.org/abs/2511.02647
Authors: Xiumei Deng, Zehui Xiong, Binbin Chen, Dong In Kim, Merouane Debbah, H. Vincent Poor
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) are proliferating rapidly at the edge, delivering intelligent capabilities across diverse application scenarios. However, their practical deployment in collaborative scenarios confronts fundamental challenges: privacy vulnerabilities, communication overhead, and computational bottlenecks. To address these, we propose Federated Attention (FedAttn), which integrates the federated paradigm into the self-attention mechanism, creating a new distributed LLM inference framework that simultaneously achieves privacy protection, communication efficiency, and computational efficiency. FedAttn enables participants to perform local self-attention over their own token representations while periodically exchanging and aggregating Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating LLM responses without exposing private prompts. Further, we identify a structural duality between contextual representation refinement in FedAttn and parameter optimization in FL across private data, local computation, and global aggregation. This key insight provides a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference. Building on this framework, we theoretically analyze how local self-attention computation within participants and heterogeneous token relevance among participants shape error propagation dynamics across Transformer blocks. Moreover, we characterize the fundamental trade-off between response quality and communication/computation efficiency, which is governed by the synchronization interval and the number of participants. Experimental results validate our theoretical analysis, and reveal significant optimization opportunities through sparse attention and adaptive KV aggregation, highlighting FedAttn’s potential to deliver scalability and efficiency in real-world edge deployments.
[AI-16] Natural-gas storage modelling by deep reinforcement learning
【Quick Read】: This paper studies how storage-management policies in natural gas markets shape equilibrium prices and supply-demand dynamics. The core challenge is designing storage operators' decision mechanisms to jointly achieve price stabilization, robust market clearing, and profitability, while making simulated prices match the volatility and seasonality of real market prices. The key solution is GasRL, a simulator coupling a calibrated gas-market model with storage-operator policies trained via deep reinforcement learning (DRL), with Soft Actor Critic (SAC) as the policy learner: SAC handles the multiple objectives well and, remarkably, reproduces the distributional characteristics of real prices without explicit calibration to price data, enabling assessment of how EU-mandated minimum storage thresholds affect market resilience.
Link: https://arxiv.org/abs/2511.02646
Authors: Tiziano Balaconi, Aldo Glielmo, Marco Taboga
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN); Systems and Control (eess.SY)
Comments: 8 pages, 5 figures, published on
Abstract:We introduce GasRL, a simulator that couples a calibrated representation of the natural gas market with a model of storage-operator policies trained with deep reinforcement learning (RL). We use it to analyse how optimal stockpile management affects equilibrium prices and the dynamics of demand and supply. We test various RL algorithms and find that Soft Actor Critic (SAC) exhibits superior performance in the GasRL environment: multiple objectives of storage operators - including profitability, robust market clearing and price stabilisation - are successfully achieved. Moreover, the equilibrium price dynamics induced by SAC-derived optimal policies have characteristics, such as volatility and seasonality, that closely match those of real-world prices. Remarkably, this adherence to the historical distribution of prices is obtained without explicitly calibrating the model to price data. We show how the simulator can be used to assess the effects of EU-mandated minimum storage thresholds. We find that such thresholds have a positive effect on market resilience against unanticipated shifts in the distribution of supply shocks. For example, with unusually large shocks, market disruptions are averted more often if a threshold is in place.
[AI-17] DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
【Quick Read】: This paper addresses the shortcomings of current large language models (LLMs) in compositional spatial reasoning, in particular productivity (reasoning depth) and systematic generalization. The key solution is DecompSR, a procedurally generated benchmark and generation framework with over 5 million datapoints that independently varies several aspects of compositionality (reasoning depth, entity and linguistic variability, overgeneralisation via input order and distractors, and novel linguistic elements) and uses a symbolic solver to verify that the data are correct by construction, providing a rigorous, fine-grained probe of LLMs' compositional spatial reasoning abilities.
Link: https://arxiv.org/abs/2511.02627
Authors: Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce DecompSR, decomposed spatial reasoning, a large benchmark dataset (over 5m datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner which makes it is correct by construction, which is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs) where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
[AI-18] A Multi-Agent Psychological Simulation System for Human Behavior Modeling
【Quick Read】: This paper addresses the lack of realistic simulations of human behavior in human-centered fields such as education and psychological training, which limits high-quality practice. In contrast to black-box neural models, which offer neither interpretability nor psychologically grounded behavior generation, the proposed multi-agent psychological simulation system explicitly models an "inner parliament": a set of agents corresponding to key psychological factors (e.g., self-efficacy, growth mindset, social constructivism) that deliberate and interact to produce believable human behaviors. The key contribution is turning classical psychological theories into a computable multi-agent architecture, making behavior generation highly transparent and faithful to human psychological mechanisms, and providing a trustworthy simulation environment for teacher training and research.
Link: https://arxiv.org/abs/2511.02606
Authors: Xiangen Hu, Jiarui Tong, Sheng Xu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Training and education in human-centered fields require authentic practice, yet realistic simulations of human behavior have remained limited. We present a multi-agent psychological simulation system that models internal cognitive-affective processes to generate believable human behaviors. In contrast to black-box neural models, this system is grounded in established psychological theories (e.g., self-efficacy, mindset, social constructivism) and explicitly simulates an "inner parliament" of agents corresponding to key psychological factors. These agents deliberate and interact to determine the system's output behavior, enabling unprecedented transparency and alignment with human psychology. We describe the system's architecture and theoretical foundations, illustrate its use in teacher training and research, and discuss how it embodies principles of social learning, cognitive apprenticeship, deliberate practice, and meta-cognition.
[AI-19] Adaptive GR(1) Specification Repair for Liveness-Preserving Shielding in Reinforcement Learning
【Quick Read】: This paper tackles the inability of traditional static shields to adapt when environment assumptions are violated, which breaks both safety and liveness guarantees. The key to the solution is the first adaptive shielding framework based on GR(1) specifications: it detects assumption violations at runtime and uses Inductive Logic Programming (ILP) to repair the GR(1) specification online in a systematic, interpretable way, so that the shield evolves gracefully, keeps liveness achievable, weakens goals only when necessary, and maintains near-optimal reward together with full logical compliance.
Link: https://arxiv.org/abs/2511.02605
Authors: Tiberiu-Andrei Georgescu, Alexander W. Goodall, Dalal Alrajeh, Francesco Belardinelli, Sebastian Uchitel
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Shielding is widely used to enforce safety in reinforcement learning (RL), ensuring that an agent’s actions remain compliant with formal specifications. Classical shielding approaches, however, are often static, in the sense that they assume fixed logical specifications and hand-crafted abstractions. While these static shields provide safety under nominal assumptions, they fail to adapt when environment assumptions are violated. In this paper, we develop the first adaptive shielding framework - to the best of our knowledge - based on Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable and expressive fragment of Linear Temporal Logic (LTL) that captures both safety and liveness properties. Our method detects environment assumption violations at runtime and employs Inductive Logic Programming (ILP) to automatically repair GR(1) specifications online, in a systematic and interpretable way. This ensures that the shield evolves gracefully, ensuring liveness is achievable and weakening goals only when necessary. We consider two case studies: Minepump and Atari Seaquest; showing that (i) static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped with our adaptive shield maintain near-optimal reward and perfect logical compliance compared with static shields.
[AI-20] On The Dangers of Poisoned LLMs In Security Automation
【Quick Read】: This paper investigates the bias and risk introduced when the training data of large language models (LLMs) used in security applications is maliciously or unintentionally contaminated ("LLM poisoning"). The study shows that even an LLM fine-tuned on a limited dataset can acquire significant bias, to the point where an LLM-based alert investigator is completely bypassed: when a prompt exploits the injected bias, the model consistently dismisses true positive alerts originating from a specific user. The key contribution is characterizing such targeted poisoning attacks and proposing mitigations and best practices to increase trustworthiness and robustness and to reduce risk in applied LLMs for security.
Link: https://arxiv.org/abs/2511.02600
Authors: Patrick Karlsen, Even Eilertsen
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 5 pages, 1 figure
Abstract:This paper investigates some of the risks introduced by “LLM poisoning,” the intentional or unintentional introduction of malicious or biased data during model training. We demonstrate how a seemingly improved LLM, fine-tuned on a limited dataset, can introduce significant bias, to the extent that a simple LLM-based alert investigator is completely bypassed when the prompt utilizes the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we demonstrate how a targeted poisoning attack can bias the model to consistently dismiss true positive alerts originating from a specific user. Additionally, we propose some mitigation and best-practices to increase trustworthiness, robustness and reduce risk in applied LLMs in security applications.
[AI-21] The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models
【Quick Read】: This paper addresses the lack of evaluation of large language models' (LLMs) multi-domain quantitative reasoning in realistic settings, with a focus on numerical accuracy and logical rigor in practical tasks from finance, physics, health, and statistics. The key to the solution is the ORCA (Omni Research on Calculation in AI) benchmark, which verifies model outputs against Omni's calculator engine on 500 real-world tasks posed in natural language, systematically measuring step-by-step reasoning, numerical precision, and cross-domain generalization. Experiments show that even top models do comparatively well in mathematics and engineering but have clear weaknesses in physics and the natural sciences, and that different models often fail for different reasons, revealing partial complementarity rather than redundancy.
Link: https://arxiv.org/abs/2511.02589
Authors: Claudia Herambourg, Dawid Siuda, Anna Szczepanek, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, Joanna Śmietańska-Nowak
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We present ORCA (Omni Research on Calculation in AI) Benchmark – a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni's calculator engine. In 500 natural-language tasks across domains such as finance, physics, health, and statistics, the five state-of-the-art systems (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2) achieved only 45-63% accuracy, with errors mainly related to rounding (35%) and calculation mistakes (33%). Results in specific domains indicate strengths in mathematics and engineering, but weaknesses in physics and natural sciences. Correlation analysis (r ≈ 0.40-0.65) shows that the models often fail together but differ in the types of errors they make, highlighting their partial complementarity rather than redundancy. Unlike standard math datasets, ORCA evaluates step-by-step reasoning, numerical precision, and domain generalization across real problems from finance, physics, health, and statistics.
[AI-22] Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning NEURIPS2025
【Quick Read】: This paper addresses extrapolation errors in offline reinforcement learning (RL) induced by selecting out-of-distribution (OOD) actions. Existing algorithms constrain action selection via density, support, or sample constraints, but each has inherent limitations: density and sample constraints tend to be overly conservative, while the support constraint, though least restrictive, requires accurately modeling the behavior policy. The key idea is a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, this constraint bounds extrapolation error and distribution shift under certain conditions and approximates the support constraint without behavior-policy modeling; practically, adapting the neighborhood radius per data point (using data quality as the criterion) yields pointwise conservatism. The resulting algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), built on an efficient bilevel optimization framework, achieves state-of-the-art results on standard offline RL benchmarks and remains robust with noisy or limited data. A minimal code sketch of the constrained Bellman target follows the abstract below.
Link: https://arxiv.org/abs/2511.02567
Authors: Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2025 (Spotlight)
Abstract:Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
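The neighborhood constraint is easy to picture in code. Below is a minimal NumPy sketch of a Bellman target whose max is restricted to the union of neighborhoods around dataset actions; the 1-D action space, the toy Q-function, and the hand-set radii are illustrative assumptions (the paper derives radii from data quality via bilevel optimization, which is omitted here):

```python
import numpy as np

def neighborhood_bellman_target(q_fn, dataset_actions, radii, candidates,
                                reward, gamma=0.99):
    """Bellman target whose max is restricted to the union of
    neighborhoods around dataset actions (toy 1-D action space)."""
    # A candidate is admissible if it lies within radius r_i of at
    # least one dataset action a_i.
    dists = np.abs(candidates[:, None] - dataset_actions[None, :])
    admissible = (dists <= radii[None, :]).any(axis=1)
    q_vals = np.where(admissible, q_fn(candidates), -np.inf)
    return reward + gamma * q_vals.max()

# Toy Q-function peaked at an out-of-distribution action a = 2.0.
q_fn = lambda a: -(a - 2.0) ** 2
dataset_actions = np.array([0.0, 0.5])   # actions actually seen in the data
radii = np.array([0.2, 0.1])             # adaptive, per-data-point radii
candidates = np.linspace(-1.0, 3.0, 81)  # sampled target actions

target = neighborhood_bellman_target(q_fn, dataset_actions, radii,
                                     candidates, reward=1.0)
print(target)  # max taken at a = 0.6, the admissible action nearest the OOD peak
```

Without the admissibility mask, the max would be taken at the OOD peak a = 2.0, which is exactly the extrapolation error the constraint prevents.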
[AI-23] Knowledge Graph-enhanced Large Language Model for Incremental Game PlayTesting
【Quick Read】: This paper addresses the efficiency and specificity limits of automated testing under the frequent, incremental updates of modern video games: LLM-based automated playtesting shows promise but lacks structured knowledge-accumulation mechanisms, making precise, efficient testing of incremental updates difficult. The key to the solution is the KLPEG framework, which builds and maintains a Knowledge Graph (KG) that systematically models game elements, task dependencies, and causal relationships, enabling knowledge accumulation and reuse across versions; on this basis, LLMs parse natural-language update logs and identify the scope of impact via multi-hop reasoning over the KG, generating test cases tailored to each update. Experiments in two representative environments, Overcooked and Minecraft, show that KLPEG localizes update-affected functionality more accurately and completes tests in fewer steps, improving both effectiveness and efficiency.
Link: https://arxiv.org/abs/2511.02534
Authors: Enhong Mu, Jinyu Cai, Yijun Lu, Mingyue Zhang, Kenji Tei, Jialong Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid iteration and frequent updates of modern video games pose significant challenges to the efficiency and specificity of testing. Although automated playtesting methods based on Large Language Models (LLMs) have shown promise, they often lack structured knowledge accumulation mechanisms, making it difficult to conduct precise and efficient testing tailored for incremental game updates. To address this challenge, this paper proposes a KLPEG framework. The framework constructs and maintains a Knowledge Graph (KG) to systematically model game elements, task dependencies, and causal relationships, enabling knowledge accumulation and reuse across versions. Building on this foundation, the framework utilizes LLMs to parse natural language update logs, identify the scope of impact through multi-hop reasoning on the KG, enabling the generation of update-tailored test cases. Experiments in two representative game environments, Overcooked and Minecraft, demonstrate that KLPEG can more accurately locate functionalities affected by updates and complete tests in fewer steps, significantly improving both playtesting effectiveness and efficiency.
[AI-24] Agentic AI for Mobile Network RAN Management and Optimization
【Quick Read】: This paper addresses the ineffectiveness of manual optimization in 5G and upcoming 6G networks, whose sharply rising complexity motivates Agentic AI for automated decision-making in dynamic RAN environments. The key to the solution is a framework built on the core Agentic design patterns of reflection, planning, tool use, and multi-agent collaboration, which together give AI systems the ability to decompose goals, retain context, learn continuously, and coordinate across tools, enabling KPI-driven autonomous decision-making for RAN optimization. A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for such decisions.
Link: https://arxiv.org/abs/2511.02532
Authors: Jorge Pellejero, Luis A. Hernández Gómez, Luis Mendo Tomás, Zoraida Frias Barroso
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Agentic AI represents a new paradigm for automating complex systems by using Large AI Models (LAMs) to provide human-level cognitive abilities with multimodal perception, planning, memory, and reasoning capabilities. This will lead to a new generation of AI systems that autonomously decompose goals, retain context over time, learn continuously, operate across tools and environments, and adapt dynamically. The complexity of 5G and upcoming 6G networks renders manual optimization ineffective, pointing to Agentic AI as a method for automating decisions in dynamic RAN environments. However, despite its rapid advances, there is no established framework outlining the foundational components and operational principles of Agentic AI systems nor a universally accepted definition. This paper contributes to ongoing research on Agentic AI in 5G and 6G networks by outlining its core concepts and then proposing a practical use case that applies Agentic principles to RAN optimization. We first introduce Agentic AI, tracing its evolution from classical agents and discussing the progress from workflows and simple AI agents to Agentic AI. Core design patterns (reflection, planning, tool use, and multi-agent collaboration) are then described to illustrate how intelligent behaviors are orchestrated. These theoretical concepts are grounded in the context of mobile networks, with a focus on RAN management and optimization. A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for KPI-based autonomous decision-making.
[AI-25] Causal Graph Neural Networks for Healthcare
【Quick Read】: This paper addresses the routine failures of healthcare AI systems deployed across institutions, including performance drops, perpetuation of discriminatory patterns embedded in historical data, and poor interpretability, whose root cause is that such systems learn statistical associations rather than causal mechanisms. The key to the solution is causal graph neural networks, which combine graph-based representations of biomedical data with causal inference principles to learn invariant causal mechanisms instead of spurious correlations, improving robustness under distribution shift, fairness, and interpretability.
Link: https://arxiv.org/abs/2511.02531
Authors: Munib Mesinovic, Max Buhlan, Tingting Zhu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.
[AI-26] An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems
【Quick Read】: This paper targets the capacitated location-routing problem (CLRP) and its open variant (OCLRP), classical combinatorial optimization problems made challenging by complex constraints and the strong coupling between location and routing decisions. The key to the solution is DRLHQ, the first end-to-end deep reinforcement learning (DRL) approach for CLRPs: the problem is reformulated as a Markov decision process tailored to the different decision types (a general modeling framework adaptable to other DRL methods), and a novel heterogeneous querying attention mechanism adapts dynamically to the different decision-making stages, coordinating location and routing decisions. Experiments on synthetic and benchmark datasets show better solution quality and generalization than representative traditional and DRL-based baselines on both CLRP and OCLRP.
Link: https://arxiv.org/abs/2511.02525
Authors: Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making location and routing decisions. In CLRPs, the complex constraints and the intricate relationships between various decisions make the problem challenging to solve. With the emergence of deep reinforcement learning (DRL), it has been extensively applied to address the vehicle routing problem and its variants, while the research related to CLRPs still needs to be explored. In this paper, we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP (OCLRP), respectively. We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a Markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.
[AI-27] BRAINS: A Retrieval-Augmented System for Alzheimer's Detection and Monitoring ICML
【Quick Read】: This paper addresses early and accurate detection of Alzheimer's disease (AD), particularly in regions lacking advanced diagnostic tools. The key to the solution is BRAINS (Biomedical Retrieval-Augmented Intelligence for Neurodegeneration Screening), a dual-module system: a cognitive diagnostic module that uses large language models (LLMs) fine-tuned on cognitive and neuroimaging data (e.g., MMSE and CDR scores, brain volume metrics) for structured AD risk assessment, and a case-retrieval module that encodes patient profiles into latent representations, retrieves similar cases from a curated knowledge base, fuses them with the input via a case fusion layer for richer context, and completes inference with clinical prompts. Combining LLM reasoning with knowledge retrieval yields an explainable, scalable tool for early-stage AD screening.
Link: https://arxiv.org/abs/2511.02490
Authors: Rajan Das Gupta, Md Kishor Morol, Nafiz Fahad, Md Tanzib Hosain, Sumaya Binte Zilani Choya, Md Jakir Hossen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in ICMLA 2025
Abstract:As the global burden of Alzheimer’s disease (AD) continues to grow, early and accurate detection has become increasingly critical, especially in regions with limited access to advanced diagnostic tools. We propose BRAINS (Biomedical Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address this challenge. This novel system harnesses the powerful reasoning capabilities of Large Language Models (LLMs) for Alzheimer’s detection and monitoring. BRAINS features a dual-module architecture: a cognitive diagnostic module and a case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on cognitive and neuroimaging datasets – including MMSE, CDR scores, and brain volume metrics – to perform structured assessments of Alzheimer’s risk. Meanwhile, the Case Retrieval Module encodes patient profiles into latent representations and retrieves similar cases from a curated knowledge base. These auxiliary cases are fused with the input profile via a Case Fusion Layer to enhance contextual understanding. The combined representation is then processed with clinical prompts for inference. Evaluations on real-world datasets demonstrate BRAINS effectiveness in classifying disease severity and identifying early signs of cognitive decline. This system not only shows strong potential as an assistive tool for scalable, explainable, and early-stage Alzheimer’s disease detection, but also offers hope for future applications in the field.
[AI-28] Wireless Video Semantic Communication with Decoupled Diffusion Multi-frame Compensation
【Quick Read】: This paper addresses the fact that traditional wireless video transmission performs coding purely at the pixel level, neglecting the semantics contained in videos. The key to the solution is WVSC-D, a wireless video semantic communication framework with decoupled diffusion multi-frame compensation (DDMFC): original frames are first encoded as semantic frames so that coding happens at the semantic level; a reference semantic frame replaces the per-frame motion vectors of common video coding to cut communication overhead; and at the receiver, DDMFC reconstructs the compensated current semantic frame via a two-stage conditional diffusion process. This improves bandwidth efficiency while maintaining satisfying reconstruction quality, with a gain of about 1.8 dB PSNR over other DL-based methods such as DVSC.
Link: https://arxiv.org/abs/2511.02478
Authors: Bingyan Xie, Yongpeng Wu, Yuxuan Shi, Biqian Feng, Wenjun Zhang, Jihong Park, Tony Quek
Affiliations: Unknown
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing wireless video transmission schemes directly conduct video coding in pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework with decoupled diffusion multi-frame compensation (DDMFC), abbreviated as WVSC-D, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC-D first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling the video coding in semantic level rather than pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute motion vectors of each frame in common video coding methods. At the receiver, DDMFC is proposed to generate compensated current semantic frame by a two-stage conditional diffusion process. With both the reference frame transmission and DDMFC frame compensation, the bandwidth efficiency improves with satisfying video transmission performance. Experimental results verify the performance gain of WVSC-D over other DL-based methods e.g. DVSC about 1.8 dB in terms of PSNR.
[AI-29] Auditable-choice reframing unlocks RL-based verification for open-ended tasks
【Quick Read】: This paper asks whether the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm can be transferred to open-ended tasks such as creative writing and instruction following, which lack standard answers and are usually treated as non-reasoning scenarios, leaving the latent value of reasoning untapped. Since RLVR fundamentally relies on verifiers that presuppose ground-truth answers, it cannot be applied directly. The key to the solution is Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective RLVR-style training even without explicit ground truth and significantly improving LLM performance on open-ended tasks.
Link: https://arxiv.org/abs/2511.02463
Authors: Mengyu Zhang, Xubo Liu, Siyu Ding, Weichong Yin, Yu Sun, Hua Wu, Wenya Guo, Ying Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs), achieving remarkable progress in domains such as mathematics and programming where standard answers are available. However, for open-ended tasks lacking ground-truth solutions (e.g., creative writing and instruction following), existing studies typically regard them as non-reasoning scenarios, thereby overlooking the latent value of reasoning capabilities. This raises a key question: Can strengthening reasoning improve performance in open-ended tasks? To address this, we explore the transfer of the RLVR paradigm to the open domain. Yet, since RLVR fundamentally relies on verifiers that presuppose the existence of standard answers, it cannot be directly applied to open-ended tasks. To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across eight open-ended benchmarks, our VMR-based training delivers an average gain of 5.99 points over the baseline. Code will be released upon acceptance to facilitate reproducibility.
[AI-30] SKGE: Spherical Knowledge Graph Embedding with Geometric Regularization
【Quick Read】: This paper addresses the limited capacity of knowledge graph embedding (KGE) models that operate in unbounded Euclidean space (such as the seminal TransE) to model complex relations, along with their training inefficiency. The key to the solution is Spherical Knowledge Graph Embedding (SKGE), which constrains entity representations to a compact manifold, the hypersphere, via a learnable non-linear Spherization Layer, and interprets relations as hybrid translate-then-project transformations. Experiments on FB15k-237, CoDEx-S, and CoDEx-M show consistent and significant gains over TransE, especially at larger scale; analysis shows the geometric constraint acts as a powerful regularizer and creates an inherently hard negative-sampling environment that eliminates trivial negatives and forces more robust, semantically coherent representations. The finding argues that the choice of manifold is not an implementation detail but a fundamental design principle for the next generation of powerful, stable KGE models. A minimal scoring sketch follows the abstract below.
Link: https://arxiv.org/abs/2511.02460
Authors: Xuan-Truong Quan, Xuan-Son Quan, Duc Do Minh, Vinh Nguyen Van
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge graph embedding (KGE) has become a fundamental technique for representation learning on multi-relational data. Many seminal models, such as TransE, operate in an unbounded Euclidean space, which presents inherent limitations in modeling complex relations and can lead to inefficient training. In this paper, we propose Spherical Knowledge Graph Embedding (SKGE), a model that challenges this paradigm by constraining entity representations to a compact manifold: a hypersphere. SKGE employs a learnable, non-linear Spherization Layer to map entities onto the sphere and interprets relations as a hybrid translate-then-project transformation. Through extensive experiments on three benchmark datasets, FB15k-237, CoDEx-S, and CoDEx-M, we demonstrate that SKGE consistently and significantly outperforms its strong Euclidean counterpart, TransE, particularly on large-scale benchmarks such as FB15k-237 and CoDEx-M, demonstrating the efficacy of the spherical geometric prior. We provide an in-depth analysis to reveal the sources of this advantage, showing that this geometric constraint acts as a powerful regularizer, leading to comprehensive performance gains across all relation types. More fundamentally, we prove that the spherical geometry creates an “inherently hard negative sampling” environment, naturally eliminating trivial negatives and forcing the model to learn more robust and semantically coherent representations. Our findings compellingly demonstrate that the choice of manifold is not merely an implementation detail but a fundamental design principle, advocating for geometric priors as a cornerstone for designing the next generation of powerful and stable KGE models.
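To make the translate-then-project idea concrete, here is a minimal PyTorch sketch; the two-layer spherization MLP and the distance-based score are assumptions standing in for the paper's exact parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SKGEScorer(nn.Module):
    """Toy spherical KGE scorer: a learnable non-linear layer maps raw
    entity embeddings onto the unit hypersphere, and a relation acts as
    a translate-then-project transformation."""
    def __init__(self, n_ent, n_rel, dim=64):
        super().__init__()
        self.ent = nn.Embedding(n_ent, dim)
        self.rel = nn.Embedding(n_rel, dim)
        self.spherize = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                      nn.Linear(dim, dim))

    def embed(self, idx):
        # "Spherization Layer": non-linear map, then project to the sphere.
        return F.normalize(self.spherize(self.ent(idx)), dim=-1)

    def forward(self, h, r, t):
        hv, tv = self.embed(h), self.embed(t)
        # Translate in the ambient space, then project back to the sphere.
        moved = F.normalize(hv + self.rel(r), dim=-1)
        return -(moved - tv).norm(dim=-1)  # higher score = more plausible

model = SKGEScorer(n_ent=100, n_rel=10)
score = model(torch.tensor([0]), torch.tensor([3]), torch.tensor([7]))
print(score.item())
```

Because every entity lives on the unit sphere, uniformly sampled negatives are never trivially far away, which is one intuition behind the "inherently hard negative sampling" the paper reports.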
[AI-31] ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning
【Quick Read】: This paper addresses the difficulty current LLM-based embodied agents have with complex, long-horizon tasks: they rely on a single monolithic trajectory that entangles all past decisions and observations and tries to solve the whole task in one unified process. The key to the solution is ReAcTree, a hierarchical task-planning method that decomposes a complex goal into manageable subgoals within a dynamically constructed agent tree; each subgoal is handled by an LLM agent node that can reason, act, and further expand the tree, while control-flow nodes coordinate the execution strategies of agent nodes. Two complementary memory systems are integrated: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment observations through working memory. On WAH-NL and ALFRED, ReAcTree consistently outperforms strong baselines such as ReAct across diverse LLMs; notably, on WAH-NL with Qwen 2.5 72B it reaches a 61% goal success rate, nearly doubling ReAct's 31%.
Link: https://arxiv.org/abs/2511.02424
Authors: Jae-Woo Choi, Hyungmin Kim, Hyobin Ong, Minsu Jang, Dohyung Kim, Jaehong Kim, Youngwoo Yoon
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods still struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations, attempting to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into more manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct’s 31%.
[AI-32] A New Perspective on Precision and Recall for Generative Models
Link: https://arxiv.org/abs/2511.02414
Authors: Benjamin Sykes (UNICAEN, ENSICAEN, GREYC), Loïc Simon (UNICAEN, ENSICAEN, GREYC), Julien Rabin (UNICAEN, ENSICAEN, GREYC), Jalal Fadili (UNICAEN, ENSICAEN, GREYC)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
[AI-33] EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
【Quick Read】: This paper addresses two problems with current LLM-based software development agents: existing approaches adopt linear, waterfall-style pipelines that oversimplify the iterative nature of real-world development, and they lack effective dependency modeling and context propagation for complex, large-scale projects. The key to the solution is EvoDev, an iterative framework inspired by feature-driven development (FDD): user requirements are decomposed into user-valued features organized in a Feature Map, a directed acyclic graph (DAG) that explicitly models inter-feature dependencies; each node maintains multi-level information, including business logic, design, and code, which is propagated along dependency edges to provide context for subsequent iterations. On challenging Android development tasks, EvoDev outperforms the best-performing baseline, Claude Code, by 56.8%, and improves single-agent performance by 16.0%-76.6% across different base LLMs, underscoring the value of dependency modeling, context propagation, and workflow-aware agent design.
Link: https://arxiv.org/abs/2511.02399
Authors: Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures
Abstract:Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%, while improving single-agent performance by 16.0%-76.6% across different base LLMs, highlighting the importance of dependency modeling, context propagation, and workflow-aware agent design for complex software projects. Our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
[AI-34] Fuzzy Soft Set Theory based Expert System for the Risk Assessment in Breast Cancer Patients
Link: https://arxiv.org/abs/2511.02392
Authors: Muhammad Sheharyar Liaqat
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
[AI-35] H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings ICSE
【Quick Read】: This paper addresses the limited real-world generalization of existing deep learning models for early arrhythmia detection, especially on the small, noisy datasets common in biomedical applications. The key to the solution is a novel CNN-H-Infinity-LSTM architecture that introduces trainable parameters inspired by the H-Infinity filter from control theory, improving robustness to noise and distribution shift and thus cross-dataset generalization. On the public PhysioNet CinC Challenge 2016 heart sound dataset, the model converges stably and outperforms existing benchmarks, reaching a test accuracy of 99.42% and an F1 score of 98.85%.
Link: https://arxiv.org/abs/2511.02379
Authors: Rohith Shinoj Kumar, Rushdeep Dinda, Aditya Tyagi, Annappa B., Naveen Kumar M. R
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Systems and Control (eess.SY)
Comments: This is a preprint of a paper to appear at the 15th IEEE International Conference on Systems Engineering and Technology (ICSET 2025)
Abstract:Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent years, deep learning has emerged as a powerful tool to automate arrhythmia detection, offering improved accuracy, consistency, and efficiency. Several variants of convolutional and recurrent neural network architectures have been widely explored to capture spatial and temporal patterns in physiological signals. However, despite these advancements, current models often struggle to generalize well in real-world scenarios, especially when dealing with small or noisy datasets, which are common challenges in biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM architecture is proposed to identify arrhythmic heart signals from heart sound recordings. This architecture introduces trainable parameters inspired by the H-Infinity filter from control theory, enhancing robustness and generalization. Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a public benchmark of heart audio recordings, demonstrates that the proposed model achieves stable convergence and outperforms existing benchmarks, with a test accuracy of 99.42% and an F1 score of 98.85%.
[AI-36] AI Credibility Signals Outrank Institutions and Engagement in Shaping News Perception on Social Media
【Quick Read】: This paper asks how AI-generated content, an increasingly salient component of online information ecosystems, affects public trust and epistemic judgments, which remains poorly understood. In a large-scale mixed-design experiment (N = 1,000) examining how AI-generated credibility scores shape users' perception of political news, the key finding is that AI feedback significantly moderates partisan bias and institutional distrust, surpassing traditional engagement signals such as likes and shares. This demonstrates the persuasive power of generative AI over epistemic judgments and motivates design strategies that balance epistemic influence with user autonomy.
Link: https://arxiv.org/abs/2511.02370
Authors: Adnan Hoq, Matthew Facciani, Tim Weninger
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI-generated content is rapidly becoming a salient component of online information ecosystems, yet its influence on public trust and epistemic judgments remains poorly understood. We present a large-scale mixed-design experiment (N = 1,000) investigating how AI-generated credibility scores affect user perception of political news. Our results reveal that AI feedback significantly moderates partisan bias and institutional distrust, surpassing traditional engagement signals such as likes and shares. These findings demonstrate the persuasive power of generative AI and suggest a need for design strategies that balance epistemic influence with user autonomy.
[AI-37] Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition NEURIPS2025
【Quick Read】: This paper addresses accurate, low-latency real-time motion recognition for human-machine collaboration in highly embodied domains such as dance, where the machine must respond attentively without flattening the performer's expressive depth. The key to the solution is a lightweight system combining wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control: dancer-specific movements are mapped to sound through somatic memory and association, achieving reliably high-accuracy classification at 50 ms latency and offering a human-centered, replicable framework for integrating dance-literate machines into creative, educational, and live performance contexts. A minimal classification sketch follows the abstract below.
Link: https://arxiv.org/abs/2511.02351
Authors: Zhuodi Cai, Ziyu Xu, Juan Pampin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: 8 pages, 5 figures. Camera-ready manuscript for the Creative AI Track of NeurIPS 2025
Abstract:We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.
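As a rough illustration of the recognition pipeline, the sketch below classifies synthetic IMU windows with MiniRocket features plus a ridge classifier. It assumes sktime's MiniRocketMultivariate transformer (the import path may vary across sktime versions) and uses random data in place of the wearable sensor stream; the sound-mapping stage is omitted:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sktime.transformations.panel.rocket import MiniRocketMultivariate

# Synthetic stand-in for wearable IMU windows:
# 200 windows x 6 channels (3-axis accel + 3-axis gyro) x 100 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6, 100))
y = rng.integers(0, 4, size=200)        # four hypothetical movement classes

transform = MiniRocketMultivariate(random_state=0)
features = transform.fit_transform(X)   # fast random-convolution features
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(features, y)

new_window = rng.normal(size=(1, 6, 100))
print(clf.predict(transform.transform(new_window)))  # label to map to sound
```

MiniRocket's appeal here is that the feature transform is fixed and extremely fast, so only a linear classifier needs fitting, which is what makes the 50 ms real-time budget plausible.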
[AI-38] Chronic Kidney Disease Prognosis Prediction Using Transformer
【Quick Read】: This paper addresses accurate prediction of chronic kidney disease (CKD) progression to enable timely interventions and resource optimization. The key to the solution is ProQ-BERT, a transformer-based framework over multi-modal electronic health records (EHR) from the Seoul National University Hospital OMOP Common Data Model: continuous laboratory values are tokenized via quantization-based binning, attention mechanisms provide interpretability, and the model is pretrained with masked language modeling and fine-tuned for binary classification of progression from stage 3a to stage 5 under varying follow-up and assessment periods. Evaluated on a cohort of 91,816 patients, it consistently outperforms CEHR-BERT, reaching ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction, confirming the value of transformer architectures and temporal design choices for clinical prognosis modeling. A minimal tokenization sketch follows the abstract below.
Link: https://arxiv.org/abs/2511.02340
Authors: Yohan Lee, DongGyun Kang, SeHoon Park, Sa-Yoon Park, Kwangsoo Kim
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
Comments: 5 pages, 2 figures, 2 tables
Abstract:Chronic Kidney Disease (CKD) affects nearly 10% of the global population and often progresses to end-stage renal failure. Accurate prognosis prediction is vital for timely interventions and resource optimization. We present a transformer-based framework for predicting CKD progression using multi-modal electronic health records (EHR) from the Seoul National University Hospital OMOP Common Data Model. Our approach (ProQ-BERT) integrates demographic, clinical, and laboratory data, employing quantization-based tokenization for continuous lab values and attention mechanisms for interpretability. The model was pretrained with masked language modeling and fine-tuned for binary classification tasks predicting progression from stage 3a to stage 5 across varying follow-up and assessment periods. Evaluated on a cohort of 91,816 patients, our model consistently outperformed CEHR-BERT, achieving ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction. These results highlight the effectiveness of transformer architectures and temporal design choices in clinical prognosis modeling, offering a promising direction for personalized CKD care.
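Quantization-based tokenization of lab values can be sketched in a few lines; the bin count, the token format, and the synthetic creatinine distribution below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def fit_quantile_bins(values, n_bins=10):
    """Learn bin edges from training data for one lab test."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(values, qs)

def to_token(test_name, value, edges):
    """Map a continuous lab value to a discrete token such as 'CREAT_Q7'."""
    bin_idx = int(np.searchsorted(edges, value))
    return f"{test_name}_Q{bin_idx}"

# Synthetic training distribution for serum creatinine.
creatinine_train = np.random.default_rng(0).lognormal(0.2, 0.5, size=5000)
edges = fit_quantile_bins(creatinine_train, n_bins=10)
print(to_token("CREAT", 1.8, edges))  # a high-percentile bin token
```

The resulting discrete tokens can then enter a BERT-style vocabulary alongside diagnosis and medication codes, which is what makes masked-language-model pretraining over EHR sequences possible.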
[AI-39] The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
【Quick Read】: This paper revisits test-time scaling for language model reasoning with a fundamental question: at equal token and compute budget, is it better to run multiple independent chains in parallel, or fewer chains that iteratively refine through sequential steps? Across 5 state-of-the-art open-source models and 3 challenging reasoning benchmarks, sequential scaling, where chains explicitly build on previous attempts, beats the dominant parallel self-consistency paradigm in 95.6% of configurations, with accuracy gains up to 46.7%. The key addition is inverse-entropy weighted voting, a novel training-free method that weights answers in proportion to the inverse entropy of their reasoning chains, further boosting accuracy and establishing sequential refinement as the optimal test-time scaling strategy. A minimal voting sketch follows the abstract below.
Link: https://arxiv.org/abs/2511.02309
Authors: Aman Sharma, Paras Chopra
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We revisit test-time scaling for language model reasoning and ask a fundamental question: at equal token budget and compute, is it better to run multiple independent chains in parallel, or to run fewer chains that iteratively refine through sequential steps? Through comprehensive evaluation across 5 state-of-the-art open source models and 3 challenging reasoning benchmarks, we find that sequential scaling where chains explicitly build upon previous attempts consistently outperforms the dominant parallel self-consistency paradigm in 95.6% of configurations with gains in accuracy up to 46.7%. Further, we introduce inverse-entropy weighted voting, a novel training-free method to further boost the accuracy of sequential scaling. By weighing answers in proportion to the inverse entropy of their reasoning chains, we increase our success rate over parallel majority and establish it as the optimal test-time scaling strategy. Our findings fundamentally challenge the parallel reasoning orthodoxy that has dominated test-time scaling since Wang et al.'s self-consistency decoding (Wang et al., 2022), positioning sequential refinement as the robust default for modern LLM reasoning and necessitating a paradigm shift in how we approach inference-time optimization.
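A minimal sketch of inverse-entropy weighted voting, using the mean negative token log-probability of each chain as an entropy proxy (the paper's exact entropy definition may differ):

```python
import math
from collections import defaultdict

def inverse_entropy_vote(chains):
    """chains: list of (answer, token_logprobs) for each reasoning chain.
    Weight each chain's answer by the inverse of its entropy, here
    approximated from per-token log-probabilities."""
    scores = defaultdict(float)
    for answer, logprobs in chains:
        # Entropy proxy: average negative log-prob of the sampled tokens.
        entropy = -sum(logprobs) / len(logprobs)
        scores[answer] += 1.0 / (entropy + 1e-8)
    return max(scores, key=scores.get)

chains = [
    ("42", [-0.1, -0.2, -0.05]),   # confident chain
    ("41", [-1.5, -2.0, -1.0]),    # uncertain chain
    ("41", [-1.2, -1.8, -0.9]),    # uncertain chain
]
print(inverse_entropy_vote(chains))  # '42': one confident chain outweighs two noisy ones
```

Unlike plain majority voting, which would pick "41" here, the inverse-entropy weights let a single low-entropy chain override several high-entropy ones.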
[AI-40] FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
【Quick Read】: This paper addresses the inefficiency of low-precision training for large Mixture-of-Experts (MoE) models: existing implementations rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions that erode FP8's theoretical gains, while naively keeping the dataflow entirely in FP8 introduces double quantization error, because tensors quantized along different dimensions accumulate inconsistent scaling factors and degrade numerical stability. The key to the solution is FP8-Flow-MoE, a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and cut explicit casts from 12 to 2. On a 671B-parameter MoE model it delivers up to 21% higher throughput and 16.5 GB lower memory per GPU than BF16 and naive FP8 baselines while converging stably. A toy illustration of double quantization error follows the abstract below.
Link: https://arxiv.org/abs/2511.02302
Authors: Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and eliminate explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naïve FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.
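The double quantization error that motivates the casting-free dataflow can be demonstrated with a toy symmetric integer quantizer standing in for FP8 scaling: quantizing once per column and then again per row applies two inconsistent sets of scale factors, and the second pass adds error on top of the first.

```python
import numpy as np

def quantize(x, axis):
    """Toy symmetric per-axis quantizer standing in for FP8 scaling:
    each slice along `axis` gets its own scale factor."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 256)) * np.logspace(0, 3, 256)  # heavy-tailed columns

once = quantize(x, axis=0)                     # single, column-wise quantization
twice = quantize(quantize(x, axis=0), axis=1)  # re-quantized along rows

print(f"single quantization error: {np.abs(once - x).mean():.4f}")
print(f"double quantization error: {np.abs(twice - x).mean():.4f}")  # larger
```

The integer grid here is only an analogy for FP8's scaled representable values, but the effect is the same one the paper describes: a scaling-aware transpose avoids re-quantizing a tensor that was already quantized along the other dimension.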
[AI-41] Federated Quantum Kernel Learning for Anomaly Detection in Multivariate IoT Time-Series
【Quick Read】: This paper addresses the privacy, scalability, and communication-efficiency challenges of anomaly detection over high-dimensional multivariate time-series in industrial IoT (IIoT): classical federated learning mitigates privacy concerns through decentralized training but struggles with highly non-linear decision boundaries and imbalanced anomaly distributions. The key to the solution is a Federated Quantum Kernel Learning (FQKL) framework that integrates quantum feature maps with federated aggregation: quantum edge nodes locally compute compressed kernel statistics using parameterized quantum circuits and share only these summaries with a central server, which constructs a global Gram matrix and trains a decision function (e.g., Fed-QSVM). On synthetic IIoT benchmarks, FQKL captures complex temporal correlations better than classical federated baselines while significantly reducing communication overhead. A minimal, classically simulated sketch follows the abstract below.
Link: https://arxiv.org/abs/2511.02301
Authors: Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments:
Abstract:The rapid growth of industrial Internet of Things (IIoT) systems has created new challenges for anomaly detection in high-dimensional, multivariate time-series, where privacy, scalability, and communication efficiency are critical. Classical federated learning approaches mitigate privacy concerns by enabling decentralized training, but they often struggle with highly non-linear decision boundaries and imbalanced anomaly distributions. To address this gap, we propose a Federated Quantum Kernel Learning (FQKL) framework that integrates quantum feature maps with federated aggregation to enable distributed, privacy-preserving anomaly detection across heterogeneous IoT networks. In our design, quantum edge nodes locally compute compressed kernel statistics using parameterized quantum circuits and share only these summaries with a central server, which constructs a global Gram matrix and trains a decision function (e.g., Fed-QSVM). Experimental results on synthetic IIoT benchmarks demonstrate that FQKL achieves superior generalization in capturing complex temporal correlations compared to classical federated baselines, while significantly reducing communication overhead. This work highlights the promise of quantum kernels in federated settings, advancing the path toward scalable, robust, and quantum-enhanced intelligence for next-generation IoT infrastructures.
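A classically simulated sketch of the federated kernel idea: each node embeds its private data with a shared feature map and ships only the embeddings, and the server assembles the global Gram matrix. The random-Fourier feature map is a classical stand-in for the parameterized quantum circuits, and per-sample embeddings are a simplification of the paper's compressed kernel statistics:

```python
import numpy as np

def feature_map(X, n_features=32, seed=42):
    """Classical random-Fourier stand-in for a parameterized quantum
    feature map (FQKL would run circuits on simulators or hardware)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_features))
    return np.concatenate([np.cos(X @ W), np.sin(X @ W)], axis=1) / np.sqrt(n_features)

# Each edge node embeds its private windows and shares only the embeddings
# (a compressed kernel statistic), never the raw sensor time-series.
clients = [np.random.default_rng(i).normal(size=(20, 8)) for i in range(3)]
summaries = [feature_map(C) for C in clients]  # same map on every node

Phi = np.vstack(summaries)   # server-side aggregation
K = Phi @ Phi.T              # global Gram matrix, e.g. for a Fed-QSVM
print(K.shape)               # (60, 60)
```

The communication saving comes from the summaries being far smaller than the raw multivariate windows, while the kernel trick still yields a highly non-linear decision boundary at the server.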
[AI-42] Fast Approximation Algorithm for Non-Monotone DR-submodular Maximization under Size Constraint
【Quick Read】: This paper studies maximizing a non-monotone DR-submodular function over a ground set of size n subject to a size constraint k, where the central challenge is achieving a constant approximation ratio at low query complexity. The key contributions are two approximation algorithms: FastDrSub, with ratio 0.044 and query complexity O(n log k), and FastDrSub++, which improves the ratio to 1/4 - ε at the same O(n log k) complexity for any input parameter ε > 0, making these the first constant-ratio algorithms for the problem at such low complexity. Experiments on revenue maximization with DR-submodular objectives show that both algorithms significantly outperform state-of-the-art methods in query count and solution quality.
Link: https://arxiv.org/abs/2511.02254
Authors: Tan D. Tran, Canh V. Pham
Affiliations: Unknown
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments:
Abstract:This work studies the non-monotone DR-submodular maximization over a ground set of size n subject to a size constraint k. We propose two approximation algorithms for solving this problem, named FastDrSub and FastDrSub++. FastDrSub offers an approximation ratio of 0.044 with query complexity of O(n log k). The second one, FastDrSub++, improves upon it with a ratio of 1/4 - ε within query complexity of O(n log k) for an input parameter ε > 0. Therefore, our proposed algorithms are the first constant-ratio approximation algorithms for the problem with the low complexity of O(n log k). Additionally, both algorithms are experimentally evaluated and compared against existing state-of-the-art methods, demonstrating their effectiveness in solving the Revenue Maximization problem with DR-submodular objective function. The experimental results show that our proposed algorithms significantly outperform existing approaches in terms of both query complexity and solution quality.
[AI-43] When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLM s
【Quick Read】: This paper dissects how multimodal large language models (MLLMs) decide which modality to follow when modalities provide contradictory information; prior work measured this behavior only with coarse dataset-level statistics and ignored the model's confidence in unimodal reasoning. The key to the solution is a decomposable framework that splits modality following into two factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (the model's stable bias when uncertainties are balanced). Using a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs, with entropy as a fine-grained uncertainty metric, the authors uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. The "balance point", where both modalities are followed with comparable probability, yields a preference indicator disentangled from unimodal capability and dataset artifacts, and layer-wise probing reveals an oscillation mechanism: near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. A minimal uncertainty-gap sketch follows the abstract below.
Link: https://arxiv.org/abs/2511.02243
Authors: Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, Lijie Hu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 19 pages
Abstract:Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of the model's confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model's stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. The relative difficulty level at which the model tends to follow both modalities with comparable probability, what we call the balance point, serves as a practical indicator of the model's inherent preference. Unlike traditional macro-level ratios, this measure offers a more principled and less confounded way to characterize modality bias, disentangling it from unimodal capabilities and dataset artifacts. Further, by probing layer-wise predictions, we reveal the internal mechanism of oscillation: in ambiguous regions near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. Together, these findings establish relative uncertainty and inherent preference as the two governing principles of modality following, offering both a quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
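The two governing quantities are straightforward to compute from unimodal output distributions; a small sketch with made-up probability vectors over four answer options:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

def relative_uncertainty(p_vision, p_text):
    """Positive value: the visual prediction is *more* uncertain."""
    return entropy(p_vision) - entropy(p_text)

# Conflicting unimodal predictions over 4 answer options.
p_vision = [0.85, 0.05, 0.05, 0.05]  # confident visual answer: option 0
p_text   = [0.40, 0.35, 0.15, 0.10]  # hesitant textual answer

gap = relative_uncertainty(p_vision, p_text)
print(f"relative uncertainty = {gap:+.3f}")  # negative: vision is more certain
# By the paper's monotone law, the more certain modality is followed more
# often; at gap ~ 0 (the balance point), the residual tendency that remains
# is the model's inherent modality preference.
```

The appeal of the balance-point measurement is visible here: it factors out how good each unimodal pathway happens to be on a given example before asking which modality the model intrinsically prefers.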
[AI-44] Structural Plasticity as Active Inference: A Biologically-Inspired Architecture for Homeostatic Control
【Quick Read】: This paper addresses the reliance of traditional neural networks on biologically implausible learning mechanisms such as global backpropagation, proposing the Structurally Adaptive Predictive Inference Network (SAPIN), a novel computational model inspired by active inference and the morphological plasticity observed in biological neural cultures. Its core is two concurrent learning mechanisms: a local Hebbian-like synaptic plasticity rule based on the temporal difference between a cell's actual activation and its learned expectation, which learns how to process information (synaptic weights), and a structural plasticity mechanism in which cells physically migrate across a 2D grid to optimize their information-receptive fields, dynamically adjusting the network topology. This dual mechanism lets SAPIN learn both how to process information and where to place its computational resources, and it is validated on the Cart Pole reinforcement learning benchmark, where it discovers a stable balancing policy.
Link: https://arxiv.org/abs/2511.02241
Authors: Brennen A. Hill
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Traditional neural networks, while powerful, rely on biologically implausible learning mechanisms such as global backpropagation. This paper introduces the Structurally Adaptive Predictive Inference Network (SAPIN), a novel computational model inspired by the principles of active inference and the morphological plasticity observed in biological neural cultures. SAPIN operates on a 2D grid where processing units, or cells, learn by minimizing local prediction errors. The model features two primary, concurrent learning mechanisms: a local, Hebbian-like synaptic plasticity rule based on the temporal difference between a cell’s actual activation and its learned expectation, and a structural plasticity mechanism where cells physically migrate across the grid to optimize their information-receptive fields. This dual approach allows the network to learn both how to process information (synaptic weights) and also where to position its computational resources (network topology). We validated the SAPIN model on the classic Cart Pole reinforcement learning benchmark. Our results demonstrate that the architecture can successfully solve the CartPole task, achieving robust performance. The network’s intrinsic drive to minimize prediction error and maintain homeostasis was sufficient to discover a stable balancing policy. We also found that while continual learning led to instability, locking the network’s parameters after achieving success resulted in a stable policy. When evaluated for 100 episodes post-locking (repeated over 100 successful agents), the locked networks maintained an average 82% success rate.
[AI-45] LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
【Quick Read】: This paper argues that the prevailing one-way language-to-action (L2A) paradigm in robotic manipulation produces policies that execute tasks without deeper contextual understanding, limiting generalization and the ability to explain behavior. The key to the solution is LACY (Language-Action Cycle), a unified framework that learns bidirectional language-action mappings within a single vision-language model, jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, improving the model without additional human labels. On pick-and-place tasks in both simulation and the real world, LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation.
Link: https://arxiv.org/abs/2511.02239
Authors: Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Preprint. Project page: this https URL
Abstract:Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: this https URL
[AI-46] Deep Ideation: Designing LLM Agents to Generate Novel Research Ideas on Scientific Concept Network
【Quick Read】: This paper addresses two core weaknesses of current LLM-based research ideation: existing methods rely on simple keyword co-occurrence or semantic similarity, overlooking the complex contextual relationships between scientific concepts, and LLM-driven methods that propose and refine ideas from internal knowledge fail to exploit the scientific concept network, leaving generated ideas weakly grounded in established research. The key to the solution is the Deep Ideation framework: a scientific network capturing both keyword co-occurrence and contextual relationships enriches LLM-driven ideation; an explore-expand-evolve workflow iteratively refines research ideas, with an Idea Stack tracking progress; and a critic engine trained on real-world reviewer feedback continuously evaluates the novelty and feasibility of ideas to guide the process. Experiments show a 10.67% improvement in idea quality over other methods, with ideas surpassing top-conference acceptance levels; human evaluation confirms their practical value, and ablations validate each component.
Link: https://arxiv.org/abs/2511.02238
Authors: Keyu Zhao, Weiquan Lin, Qirui Zheng, Fengli Xu, Yong Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 23 pages, 5 figures
Abstract:Novel research ideas play a critical role in advancing scientific inquiries. Recent advancements in Large Language Models (LLMs) have demonstrated their potential to generate novel research ideas by leveraging large-scale scientific literature. However, previous work in research ideation has primarily relied on simplistic methods, such as keyword co-occurrence or semantic similarity. These approaches focus on identifying statistical associations in the literature but overlook the complex, contextual relationships between scientific concepts, which are essential to effectively leverage knowledge embedded in human literature. For instance, papers that simultaneously mention "keyword A" and "keyword B" often present research ideas that integrate both concepts. Additionally, some LLM-driven methods propose and refine research ideas using the model's internal knowledge, but they fail to effectively utilize the scientific concept network, limiting the grounding of ideas in established research. To address these challenges, we propose the Deep Ideation framework, integrating a scientific network that captures keyword co-occurrence and contextual relationships, enriching LLM-driven ideation. The framework introduces an explore-expand-evolve workflow to iteratively refine research ideas, using an Idea Stack to track progress. A critic engine, trained on real-world reviewer feedback, guides the process by providing continuous feedback on the novelty and feasibility of ideas. Our experiments show that our approach improves the quality of generated ideas by 10.67% compared to other methods, with ideas surpassing top conference acceptance levels. Human evaluation highlights their practical value in scientific research, and ablation studies confirm the effectiveness of each component in the workflow. Code repo is available at this https URL.
[AI-47] Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
【Quick Read】: This paper addresses the KV-cache disruption caused by tool calls in multi-turn agentic LLM workloads: each tool call pauses the workflow, potentially triggering cache eviction and extra waiting before the next LLM request enters the continuous batch, and the problem compounds as the number of turns grows. The key to the solution is Continuum, which combines tool-aware KV-cache timeouts with program-level scheduling: by predicting tool-call durations, it selectively pins KV caches in GPU memory with a time-to-live derived from the total turn count, and program-level first-come-first-serve scheduling prevents scheduling bubbles, preserving multi-turn continuity and optimizing throughput. On real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models, Continuum significantly improves average job completion times and stays performant across hardware setups and DRAM offloading schemes. A minimal TTL-pinning sketch follows the abstract below.
Link: https://arxiv.org/abs/2511.02230
Authors: Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Alvin Cheung, Joseph Gonzalez, Ion Stoica
Affiliations: Unknown
Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Agentic LLM applications interleave LLM generation requests with tool calls. These tool calls break the continuity of the workflow by creating pauses between LLM requests, bringing many challenges for the serving system, especially under multi-turn scenarios. Each pause potentially causes KV cache eviction and extra waiting time before entering the continuous batch for the following LLM request. Since these pauses happen for each call, this problem becomes increasingly severe as turn number grow for agentic programs. Previous works either fail to incorporate information from the tool call, evicting KV cache that leads to repetitive prefill or loading, or ignore the continuity of a multi-turn program, creating waiting time between turns that increases per-request latency. We present Continuum, a serving system to optimize job completion time for multi-turn agent workloads by combining tool-aware KV cache timeout with program-level scheduling. By predicting tool call durations in agentic workflows, Continuum selectively pins the KV cache in GPU memory with a time-to-live value based on total turn number. When combined with program-level first-come-first-serve, Continuum prevents scheduling bubbles, preserves multi-turn continuity, and optimizes for throughput for complex agentic workflows. By modeling the variability of tool call and agent program continuity, Continuum outperforms state-of-the-art baselines. Our evaluation on real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models shows that Continuum significantly improves the average job completion times, and remains performant across different hardware setups and DRAM offloading schemes. Preview code is available at: this https URL
zh
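上面 Continuum 的"工具感知 KV 缓存 TTL"思路可用如下极简 Python 草图说明(仅为示意,KVCachePinner、pin/evictable 等命名与 TTL 启发式均为本文假设,非论文实现):

```python
import time

class KVCachePinner:
    """按预测的工具调用时长为 KV 缓存设置 TTL 的极简示意。"""
    def __init__(self):
        self.pinned = {}                      # request_id -> 过期时间戳

    def pin(self, request_id, predicted_tool_seconds, remaining_turns):
        # 假设的启发式:剩余轮数越多,越值得把缓存钉在 GPU 显存中更久
        ttl = predicted_tool_seconds * max(remaining_turns, 1)
        self.pinned[request_id] = time.time() + ttl

    def evictable(self):
        # 只有 TTL 过期的缓存才允许被淘汰,避免工具调用期间被误逐出
        now = time.time()
        return [rid for rid, expire_at in self.pinned.items() if expire_at < now]

pinner = KVCachePinner()
pinner.pin("req-1", predicted_tool_seconds=3.0, remaining_turns=4)
print(pinner.evictable())                     # 刚钉住,输出 []
```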
[AI-48] TabDSR: Decompose Sanitize and Reason for Complex Numerical Reasoning in Tabular Data EMNLP2025
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂表格数值推理任务时表现不佳的问题,主要挑战包括复杂查询的理解、噪声数据的干扰以及LLMs固有的数值计算能力有限。其解决方案的核心是提出一个名为TabDSR的框架,包含三个关键组件:(1) 查询分解器(query decomposer),将复杂问题拆解为可执行子任务;(2) 表格清洗器(table sanitizer),对噪声表格进行净化与过滤;(3) 基于程序思维链(program-of-thoughts, PoT)的推理模块,生成可执行代码以从清洗后的表格中获取最终答案。该框架通过结构化推理流程显著提升了LLMs在复杂表格数值推理上的准确性与鲁棒性。
链接: https://arxiv.org/abs/2511.02219
作者: Changjiang Jiang,Fengchang Yu,Haihua Chen,Wei Lu,Jin Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2025 Findings
Abstract:Complex reasoning over tabular data is crucial in real-world data analysis, yet large language models (LLMs) often underperform due to complex queries, noisy data, and limited numerical capabilities. To address these issues, we propose TabDSR, a framework consisting of: (1) a query decomposer that breaks down complex questions, (2) a table sanitizer that cleans and filters noisy tables, and (3) a program-of-thoughts (PoT)-based reasoner that generates executable code to derive the final answer from the sanitized table. To ensure unbiased evaluation and mitigate data leakage, we introduce a new dataset, CalTab151, specifically designed for complex numerical reasoning over tables. Experimental results demonstrate that TabDSR consistently outperforms existing methods, achieving state-of-the-art (SOTA) performance with 8.79%, 6.08%, and 19.87% accuracy improvement on TAT-QA, TableBench, and CalTab151, respectively. Moreover, our framework integrates seamlessly with mainstream LLMs, providing a robust solution for complex tabular numerical reasoning. These findings highlight the effectiveness of our framework in enhancing LLM performance for complex tabular numerical reasoning. Data and code are available upon request.
zh
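下面用 pandas 给出"清洗 + 程序式推理(PoT)"两个环节的最小草图(查询分解由 LLM 完成,此处从略;sanitize、pot_reason 为本文假设的占位实现,非论文代码):

```python
import pandas as pd

def sanitize(table: pd.DataFrame, keep_cols: list) -> pd.DataFrame:
    """表格清洗:只保留相关列,强制数值化并丢弃无法解析的行。"""
    df = table[keep_cols].copy()
    for c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")
    return df.dropna()

def pot_reason(df: pd.DataFrame, code: str):
    """程序思维链(PoT):执行(LLM 生成的)代码片段,取出 answer 变量。"""
    scope = {"df": df}
    exec(code, scope)                         # 仅为示意;生产环境须在沙箱中执行
    return scope["answer"]

table = pd.DataFrame({"revenue": ["100", "250", "n/a"], "year": [2021, 2022, 2023]})
clean = sanitize(table, ["revenue", "year"])
print(pot_reason(clean, "answer = df['revenue'].sum()"))   # 350.0
```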
[AI-49] Optimizing Multi-Lane Intersection Performance in Mixed Autonomy Environments
【速读】:该论文旨在解决多车道交叉口交通管理中人类驾驶车辆(HDVs)与联网自动驾驶车辆(CAVs)之间协调不畅的问题,以实现高效、安全且公平的信号控制。其解决方案的关键在于提出一种融合图注意力网络(Graph Attention Networks, GAT)与软演员-评论家(Soft Actor-Critic, SAC)强化学习的新型交通信号控制框架:GAT用于建模交通流的空间-时间动态图结构,捕捉车道与信号相位间的复杂依赖关系;SAC则通过熵优化的离策略强化学习机制实现自适应信号配时决策,从而在最小化平均延误、提升安全性及保障HDVs与CAVs公平性等多目标下协同优化车辆通行效率。
链接: https://arxiv.org/abs/2511.02217
作者: Manonmani Sekar,Nasim Nezamoddini
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:One of the main challenges in managing traffic at multilane intersections is ensuring smooth coordination between human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs). This paper presents a novel traffic signal control framework that combines Graph Attention Networks (GAT) with Soft Actor-Critic (SAC) reinforcement learning to address this challenge. GATs are used to model the dynamic graph-structured nature of traffic flow to capture spatial and temporal dependencies between lanes and signal phases. SAC is a robust off-policy reinforcement learning algorithm that enables adaptive signal control through entropy-optimized decision making. This design allows the system to coordinate signal timing and vehicle movement simultaneously, with objectives focused on minimizing travel time, enhancing performance, ensuring safety, and improving fairness between HDVs and CAVs. The model is evaluated using a SUMO-based simulation of a four-way intersection, incorporating different traffic densities and CAV penetration rates. The experimental results demonstrate the effectiveness of the GAT-SAC approach, achieving a 24.1% reduction in average delay and up to 29.2% fewer traffic violations compared to traditional methods. Additionally, the fairness ratio between HDVs and CAVs improved to 1.59, indicating more equitable treatment across vehicle types. These findings suggest that the GAT-SAC framework holds significant promise for real-world deployment in mixed-autonomy traffic systems.
zh
[AI-50] Adaptive Cooperative Transmission Design for Ultra-Reliable Low-Latency Communications via Deep Reinforcement Learning NEURIPS2025
【速读】:该论文旨在解决下一代无线通信系统中两跳协作通信(two-hop cooperative communication)在满足超可靠低时延通信(URLLC)服务严格要求时面临的挑战,尤其是在有限时延约束下实现可靠数据包传输的问题。解决方案的关键在于提出一种基于双智能体强化学习的协同感知时延传输算法(DRL-CoLA),该算法将每跳的传输参数配置(包括子载波间隔、mini-slot大小和调制编码方案)建模为马尔可夫决策过程(MDP),并通过分布式方式学习时延感知的传输策略,从而在保证极低时延的同时逼近最优可靠性性能。
链接: https://arxiv.org/abs/2511.02216
作者: Hyemin Yu,Hong-Chuan Yang
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: Accepted at the AI4NextG Workshop, NeurIPS 2025
Abstract:Next-generation wireless communication systems must support ultra-reliable low-latency communication (URLLC) service for mission-critical applications. Meeting stringent URLLC requirements is challenging, especially for two-hop cooperative communication. In this paper, we develop an adaptive transmission design for a two-hop relaying communication system. Each hop transmission adaptively configures its transmission parameters separately, including numerology, mini-slot size, and modulation and coding scheme, for reliable packet transmission within a strict latency constraint. We formulate the hop-specific transceiver configuration as a Markov decision process (MDP) and propose a dual-agent reinforcement learning-based cooperative latency-aware transmission (DRL-CoLA) algorithm to learn latency-aware transmission policies in a distributed manner. Simulation results verify that the proposed algorithm achieves near-optimal reliability while satisfying strict latency requirements.
zh
[AI-51] Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems)在复杂任务求解中因固定调度策略和低效协作机制导致的性能瓶颈问题,尤其是在动态变化的任务需求下难以实现灵活、高效的协同。其解决方案的关键在于提出一种状态感知的路由框架STRMAC,通过分别编码交互历史与智能体知识来驱动路由器,在每一步自适应地选择最合适的单一智能体执行任务,从而提升协作效率;同时引入自演化数据生成方法,显著加速高质量执行路径的收集,降低训练数据获取成本达90.1%,并在多个协同推理基准测试中实现优于基线23.8%的性能提升。
链接: https://arxiv.org/abs/2511.02200
作者: Jingbo Wang,Sendong Zhao,Haochun Wang,Yuzheng Fan,Lizhe Zhang,Yan Liu,Ting Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of multi-agent systems powered by large language models (LLMs) has unlocked new frontiers in complex task-solving, enabling diverse agents to integrate unique expertise, collaborate flexibly, and address challenges unattainable for individual models. However, the full potential of such systems is hindered by rigid agent scheduling and inefficient coordination strategies that fail to adapt to evolving task requirements. In this paper, we propose STRMAC, a state-aware routing framework designed for efficient collaboration in multi-agent systems. Our method separately encodes interaction history and agent knowledge to power the router, which adaptively selects the most suitable single agent at each step for efficient and effective collaboration. Furthermore, we introduce a self-evolving data generation approach that accelerates the collection of high-quality execution paths for efficient system training. Experiments on challenging collaborative reasoning benchmarks demonstrate that our method achieves state-of-the-art performance, achieving up to 23.8% improvement over baselines and reducing data collection overhead by up to 90.1% compared to exhaustive search.
zh
[AI-52] Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码推理任务中输出的可靠性与可控性问题,核心在于提升模型置信度估计(confidence estimation)的准确性。解决方案的关键在于提出并验证了一个面向代码推理任务的置信度分析与增强框架,其中通过实证研究发现具备推理能力的模型(如DeepSeek-Reasoner)具有更优的置信度可靠性,并进一步证明结合重评估提示策略(reassess prompt strategy)与数学校准方法(如Platt Scaling)的混合策略能显著提升各类模型的置信度性能,其在期望校准误差(ECE)、Brier Score和性能得分上的改进分别达0.541、0.628和15.084,优于单一优化手段。
链接: https://arxiv.org/abs/2511.02197
作者: Shufan Wang,Xing Hu,Junkai Chen,Zhiyuan Pan,Xin Xia
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures
Abstract:With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to 0.680, 0.636, and 13.652 in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to 0.541, 0.628, and 15.084 over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.
zh
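摘要中提到的 Platt Scaling 本质上是在模型自报置信度上拟合一条 logistic 回归曲线。下面给出一个自包含的示意(数据为随机构造,仅演示校准前后 ECE 的变化,与论文实验无关):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
raw_conf = rng.uniform(0, 1, 500)                  # 模型自报的置信度
correct = rng.uniform(0, 1, 500) < raw_conf**2     # 构造"过度自信"的对错标签

platt = LogisticRegression()
platt.fit(raw_conf.reshape(-1, 1), correct)        # 拟合 p(correct | score)
calibrated = platt.predict_proba(raw_conf.reshape(-1, 1))[:, 1]

def ece(conf, label, bins=10):
    """期望校准误差:各分箱内 |准确率 - 平均置信度| 的样本加权和。"""
    edges = np.linspace(0, 1, bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf >= lo) & (conf < hi)
        if m.any():
            total += m.mean() * abs(label[m].mean() - conf[m].mean())
    return total

print(f"ECE before: {ece(raw_conf, correct):.3f}, after: {ece(calibrated, correct):.3f}")
```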
[AI-53] BoolSkeleton: Boolean Network Skeletonization via Homogeneous Pattern Reduction
【速读】:该论文旨在解决布尔网络(Boolean network)中因结构差异导致的功能一致性问题,即具有相同功能的布尔网络可能呈现不同的图结构,从而影响逻辑优化和设计特定评估的一致性与可靠性。解决方案的关键在于提出一种名为BoolSkeleton的布尔网络骨架化方法,其核心包括两个步骤:首先将布尔网络转换为带有功能相关状态标注的布尔依赖图(Boolean dependency graph),随后通过定义同质(homogeneous)与异质(heterogeneous)模式进行节点级模式压缩——其中异质模式保留以维持关键功能依赖关系,同质模式则可被简化;同时引入参数K对模式的扇入大小进行约束,实现对图简化粒度的精细控制。该方法在压缩分析、分类、关键路径分析和时序预测等四项下游任务中验证了有效性,尤其在时序预测任务中平均准确率提升超过55%,显著增强了逻辑综合中的设计一致性。
链接: https://arxiv.org/abs/2511.02196
作者: Liwei Ni,Jiaxi Zhang,Shenggen Zheng,Junfeng Liu,Xingyu Meng,Biwei Xie,Xingquan Li,Huawei Li
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Boolean equivalence allows Boolean networks with identical functionality to exhibit diverse graph structures. This gives more room for exploration in logic optimization, while also posing a challenge for tasks involving consistency between Boolean networks. To tackle this challenge, we introduce BoolSkeleton, a novel Boolean network skeletonization method that improves the consistency and reliability of design-specific evaluations. BoolSkeleton comprises two key steps: preprocessing and reduction. In preprocessing, the Boolean network is transformed into a defined Boolean dependency graph, where nodes are assigned a functionality-related status. Next, the homogeneous and heterogeneous patterns are defined for the node-level pattern reduction step. Heterogeneous patterns are preserved to maintain critical functionality-related dependencies, while homogeneous patterns can be reduced. Parameter K of the pattern further constrains the fanin size of these patterns, enabling fine-tuned control over the granularity of graph reduction. To validate BoolSkeleton’s effectiveness, we conducted four analysis/downstream tasks around the Boolean network: compression analysis, classification, critical path analysis, and timing prediction, demonstrating its robustness across diverse scenarios. Furthermore, for the timing prediction task, it improves average accuracy by more than 55% compared to the original Boolean network. These experiments underscore the potential of BoolSkeleton to enhance design consistency in logic synthesis.
zh
[AI-54] Tackling Incomplete Data in Air Quality Prediction: A Bayesian Deep Learning Framework for Uncertainty Quantification
【速读】:该论文旨在解决空气质量预测中因观测数据缺失(如采集或传输问题导致的不完整时空记录)所引发的可靠推断与风险评估困难问题,以及由此产生的过度自信外推现象。其解决方案的关键在于提出一种端到端的框架——基于通道门控学习单元的时空贝叶斯神经场(CGLUBNF),该框架通过傅里叶特征结合图注意力编码器捕捉多尺度空间依赖性和季节性时间动态,并引入具备可学习激活函数和门控残差连接的通道门控学习单元(Channel Gated Learning Unit, CGLU),实现对信息特征的自适应过滤与增强;同时,贝叶斯推断联合优化预测分布与参数不确定性,生成点估计值及校准后的置信区间,从而在多种典型缺失模式下显著提升预测精度并获得更锐利的不确定性量化结果。
链接: https://arxiv.org/abs/2511.02175
作者: Yuzhuang Pian,Taiyu Wang,Shiqi Zhang,Rui Xu,Yonghong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate air quality forecasts are vital for public health alerts, exposure assessment, and emissions control. In practice, observational data are often missing in varying proportions and patterns due to collection and transmission issues. These incomplete spatiotemporal records impede reliable inference and risk assessment and can lead to overconfident extrapolation. To address these challenges, we propose an end-to-end framework, the Channel Gated Learning Unit based spatiotemporal Bayesian neural field (CGLUBNF). It uses Fourier features with a graph attention encoder to capture multiscale spatial dependencies and seasonal temporal dynamics. A channel gated learning unit, equipped with learnable activations and gated residual connections, adaptively filters and amplifies informative features. Bayesian inference jointly optimizes predictive distributions and parameter uncertainty, producing point estimates and calibrated prediction intervals. We conduct a systematic evaluation on two real-world datasets, covering four typical missing data patterns and comparing against five state-of-the-art baselines. CGLUBNF achieves superior prediction accuracy and sharper confidence intervals. In addition, we further validate robustness across multiple prediction horizons and analyze the contribution of extraneous variables. This research lays a foundation for reliable deep-learning-based spatio-temporal forecasting with incomplete observations in emerging sensing paradigms, such as real-world vehicle-borne mobile monitoring.
zh
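摘要中的通道门控学习单元(CGLU)核心是"可学习激活 + 门控残差连接"。下面是按摘要描述合理化推测的一个 PyTorch 最小草图(结构细节为本文假设,非官方实现):

```python
import torch
import torch.nn as nn

class CGLU(nn.Module):
    """通道门控学习单元示意:门控自适应筛选并放大信息量大的通道。"""
    def __init__(self, channels: int):
        super().__init__()
        self.transform = nn.Linear(channels, channels)
        self.gate = nn.Linear(channels, channels)
        self.act = nn.PReLU(channels)         # 可学习激活

    def forward(self, x):                     # x: (batch, channels)
        h = self.act(self.transform(x))
        g = torch.sigmoid(self.gate(x))       # 每通道一个 [0,1] 门控值
        return x + g * h                      # 门控残差连接

x = torch.randn(16, 32)
print(CGLU(32)(x).shape)                      # torch.Size([16, 32])
```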
[AI-55] ScenicProver: A Framework for Compositional Probabilistic Verification of Learning-Enabled Systems
【速读】:该论文致力于解决学习型网络物理系统(Learning-enabled Cyber-Physical Systems, LeCPS)的全验证难题,其核心挑战在于黑箱组件与复杂现实环境导致的传统验证方法难以适用。现有工具或仅对特定类型系统提供形式化保证,或采用整体测试方式,缺乏适用于复杂环境中多种验证技术的组合分析框架。解决方案的关键在于提出ScenicProver框架:基于Scenic概率编程语言实现组件化的系统描述,支持从可解释代码到黑箱组件的清晰接口;引入扩展线性时序逻辑(Linear Temporal Logic)以定义假设-保证契约(assume-guarantee contracts),并结合测试、Lean 4形式化证明及外部假设生成证据;通过契约运算符系统性整合证据,并自动生成追踪系统级保证来源的保障案例(assurance case)。该框架在自动驾驶紧急制动系统传感器融合场景中验证有效,通过利用雷达和激光雷达制造商提供的保证并聚焦于不确定条件下的测试,实现了比相同计算预算下整体测试更强的概率保证。
链接: https://arxiv.org/abs/2511.02164
作者: Eric Vin,Kyle A. Miller,Inigo Incer,Sanjit A. Seshia,Daniel J. Fremont
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: 26 pages, 4 figures. Full version (including appendices) of a paper submitted to TACAS 2026
Abstract:Full verification of learning-enabled cyber-physical systems (CPS) has long been intractable due to challenges including black-box components and complex real-world environments. Existing tools either provide formal guarantees for limited types of systems or test the system as a monolith, but no general framework exists for compositional analysis of learning-enabled CPS using varied verification techniques over complex real-world environments. This paper introduces ScenicProver, a verification framework that aims to fill this gap. Built upon the Scenic probabilistic programming language, the framework supports: (1) compositional system description with clear component interfaces, ranging from interpretable code to black boxes; (2) assume-guarantee contracts over those components using an extension of Linear Temporal Logic containing arbitrary Scenic expressions; (3) evidence generation through testing, formal proofs via Lean 4 integration, and importing external assumptions; (4) systematic combination of generated evidence using contract operators; and (5) automatic generation of assurance cases tracking the provenance of system-level guarantees. We demonstrate the framework’s effectiveness through a case study on an autonomous vehicle’s automatic emergency braking system with sensor fusion. By leveraging manufacturer guarantees for radar and laser sensors and focusing testing efforts on uncertain conditions, our approach enables stronger probabilistic guarantees than monolithic testing with the same computational budget.
zh
[AI-56] Text to Robotic Assembly of Multi-Component Objects using 3D Generative AI and Vision Language Models NEURIPS2025
【速读】:该论文旨在解决生成式AI在创建多组件物理对象时的挑战,特别是如何从自然语言提示中自动分解并装配具有多种功能部件的3D模型。其解决方案的关键在于集成3D生成式AI与视觉-语言模型(Vision-Language Models, VLMs),利用VLM实现零样本、多模态推理以识别几何结构和功能需求,从而将AI生成的网格分解为包含预定义结构件和面板件的多组件3D模型;实验表明,VLM生成的组件分配被用户偏好接受的比例高达90.6%,显著优于基于规则和随机分配的方法,并支持通过对话反馈进行交互式优化,提升人类在生成式AI与机器人协同制造中的控制权与参与度。
链接: https://arxiv.org/abs/2511.02162
作者: Alexander Htet Kyaw,Richa Gupta,Dhruv Shah,Anoop Sinha,Kory Mathewson,Stefanie Pender,Sachin Chitta,Yotto Koga,Faez Ahmed,Lawrence Sass,Randall Davis
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to NeurIPS 2025, Conference on Neural Information Processing Systems, Creative AI Track
Abstract:Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on object functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
zh
[AI-57] Near Optimal Convergence to Coarse Correlated Equilibrium in General-Sum Markov Games
【速读】:该论文旨在解决一般和博弈(general-sum Markov games)中收敛至粗相关均衡(Coarse Correlated Equilibrium, CCE)的速率问题,此前最优收敛率为 O(log^5 T / T),其中 T 为迭代次数。作者通过引入一种基于阶段式学习率调整机制的自适应策略,将收敛速率提升至 O(log T / T),与相关均衡(Correlated Equilibrium, CE)的最优速率一致,并显著改善了对动作集规模的依赖性,从多项式降至多对数级别,从而在高维场景下实现指数级加速。解决方案的关键在于将乐观跟随正则化领导者(Optimistic Follow-the-Regularized-Leader, OFTRL)框架适配到基于值迭代的学习机制中,并结合近期在正常形式博弈中发展的自适应步长技术,通过实时反馈动态调整学习率,实现更高效的自对弈(self-play)收敛。
链接: https://arxiv.org/abs/2511.02157
作者: Asrin Efe Yorulmaz,Tamer Başar
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:
Abstract:No-regret learning dynamics play a central role in game theory, enabling decentralized convergence to equilibrium for concepts such as Coarse Correlated Equilibrium (CCE) or Correlated Equilibrium (CE). In this work, we improve the convergence rate to CCE in general-sum Markov games, reducing it from the previously best-known rate of \mathcal{O}(\log^5 T / T) to a sharper \mathcal{O}(\log T / T). This matches the best known convergence rate for CE in terms of T, the number of iterations, while also improving the dependence on the action set size from polynomial to polylogarithmic, yielding exponential gains in high-dimensional settings. Our approach builds on recent advances in adaptive step-size techniques for no-regret algorithms in normal-form games, and extends them to the Markovian setting via a stage-wise scheme that adjusts learning rates based on real-time feedback. We frame policy updates as an instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for value-iteration-based learning. The resulting self-play algorithm achieves, to our knowledge, the fastest known convergence rate to CCE in Markov games.
zh
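熵正则化下的乐观 FTRL(OFTRL)更新有简洁的闭式形式:在累计收益上叠加一次"乐观预测"(通常取上一轮收益),再做 softmax。下面是固定学习率的最小示意(论文中的阶段式自适应步长从略,收益用随机数占位):

```python
import numpy as np

def oftrl_step(cum_payoff, last_payoff, eta):
    """熵正则化 OFTRL:x ∝ exp(eta * (累计收益 + 乐观预测项))。"""
    logits = eta * (cum_payoff + last_payoff)
    logits -= logits.max()                    # 数值稳定
    x = np.exp(logits)
    return x / x.sum()

n_actions, T, eta = 4, 100, 0.5
cum = np.zeros(n_actions)
last = np.zeros(n_actions)
rng = np.random.default_rng(1)
for t in range(T):
    x = oftrl_step(cum, last, eta)            # 当前混合策略
    payoff = rng.normal(size=n_actions)       # 占位:实际由对手/环境给出
    cum += payoff
    last = payoff                             # 下一轮的乐观预测取本轮收益
```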
[AI-58] Disentangling Causal Substructures for Interpretable and Generalizable Drug Synergy Prediction
【速读】:该论文旨在解决现有药物协同作用预测方法多为黑箱模型、依赖统计相关性而缺乏因果解释力的问题。其核心解决方案是提出CausalDDS框架,通过将药物分子解耦为因果子结构(causal substructures)与虚假子结构(spurious substructures),并利用因果子结构表示进行协同作用预测,从而有效降低虚假特征带来的冗余干扰,提升模型的准确性与可解释性。该方法的关键创新在于引入条件干预机制(conditional intervention mechanism),干预策略基于成对分子结构设计,并采用基于充分性(sufficiency)和独立性(independence)原则的新优化目标,显著提升了在冷启动和分布外场景下的性能表现。
链接: https://arxiv.org/abs/2511.02146
作者: Yi Luo,Haochen Zhao,Xiao Liang,Yiwei Liu,Yuye Zhang,Xinyu Li,Jianxin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Drug synergy prediction is a critical task in the development of effective combination therapies for complex diseases, including cancer. Although existing methods have shown promising results, they often operate as black-box predictors that rely predominantly on statistical correlations between drug characteristics and results. To address this limitation, we propose CausalDDS, a novel framework that disentangles drug molecules into causal and spurious substructures, utilizing the causal substructure representations for predicting drug synergy. By focusing on causal sub-structures, CausalDDS effectively mitigates the impact of redundant features introduced by spurious substructures, enhancing the accuracy and interpretability of the model. In addition, CausalDDS employs a conditional intervention mechanism, where interventions are conditioned on paired molecular structures, and introduces a novel optimization objective guided by the principles of sufficiency and independence. Extensive experiments demonstrate that our method outperforms baseline models, particularly in cold start and out-of-distribution settings. Besides, CausalDDS effectively identifies key substructures underlying drug synergy, providing clear insights into how drug combinations work at the molecular level. These results underscore the potential of CausalDDS as a practical tool for predicting drug synergy and facilitating drug discovery.
zh
[AI-59] Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning NEURIPS2025
【速读】:该论文旨在解决生成式 AI(Generative AI)在推理过程中因固定思维长度导致的计算资源浪费与性能瓶颈问题,尤其是在长推理链中如何动态控制思考长度以实现效率与准确率的平衡。解决方案的关键在于提出 Re-FORC——一种基于轻量适配器(adapter)的自适应奖励预测方法,通过建模未来奖励随思考 token 数量变化的函数关系,在推理阶段实现对推理路径的早期终止、模型与思考长度的优化选择以及测试时的自适应缩放,从而在保持精度的同时显著降低计算开销,并支持基于单位 token 成本阈值的动态推理控制。
链接: https://arxiv.org/abs/2511.02130
作者: Renos Zabounidis,Aditya Golatkar,Michael Kleinman,Alessandro Achille,Wei Xia,Stefano Soatto
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at Efficient Reasoning Workshop at NeurIPS 2025
Abstract:We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in high compute regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
zh
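Re-FORC 所述"按单位 token 成本阈值控制推理长度"的决策规则可写成如下草图(predict_reward 代表论文中的轻量 adapter,此处用一条饱和曲线占位,阈值与步长均为本文假设):

```python
import math

def should_stop(predict_reward, tokens_so_far, step=256, cost_per_token=1e-4):
    """当继续思考 step 个 token 的边际预期收益低于其成本时停止。"""
    gain = predict_reward(tokens_so_far + step) - predict_reward(tokens_so_far)
    return gain < cost_per_token * step

predict = lambda n: 1 - math.exp(-n / 2000)   # 用饱和曲线模拟奖励随思考长度增长
n = 0
while not should_stop(predict, n):
    n += 256                                   # 继续生成一段思考 token
print(f"在约 {n} 个思考 token 处停止")
```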
[AI-60] Matrix Sensing with Kernel Optimal Loss: Robustness and Optimization Landscape
【速读】:该论文旨在解决非凸优化问题中损失函数选择对模型鲁棒性及优化景观(optimization landscape)的影响,特别是在存在非高斯或重尾噪声时传统均方误差(Mean Squared Error, MSE)损失的不稳定性问题。其解决方案的关键在于采用一种基于非参数回归的鲁棒损失函数,该函数利用核估计方法构建残差密度,并最大化估计的对数似然,从而在高斯误差下退化为MSE损失,而在更广泛的噪声分布下仍保持稳定性和鲁棒性。通过理论分析和实证研究,作者进一步证明该损失能有效抑制伪局部极小值(spurious local minima),并通过上界控制受限等距性质(Restricted Isometry Property, RIP)常数来重塑优化景观,为提升机器学习任务鲁棒性提供了一种简单而有效的策略。
链接: https://arxiv.org/abs/2511.02122
作者: Xinyuan Song,Jiaye Teng,Ziye Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper we study how the choice of loss functions of non-convex optimization problems affects their robustness and optimization landscape, through the study of noisy matrix sensing. In traditional regression tasks, mean squared error (MSE) loss is a common choice, but it can be unreliable for non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust loss based on nonparametric regression, which uses a kernel-based estimate of the residual density and maximizes the estimated log-likelihood. This robust formulation coincides with the MSE loss under Gaussian errors but remains stable under more general settings. We further examine how this robust loss reshapes the optimization landscape by analyzing the upper-bound of restricted isometry property (RIP) constants for spurious local minima to disappear. Through theoretical and empirical analysis, we show that this new loss excels at handling large noise and remains robust across diverse noise distributions. This work offers initial insights into enhancing the robustness of machine learning tasks through simply changing the loss, guided by an intuitive and broadly applicable analytical framework.
zh
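摘要描述的鲁棒损失等价于:用核密度估计残差分布,再最大化估计出的对数似然。高斯核下的 PyTorch 最小示意如下(带宽 h 为假设的超参数):

```python
import torch

def kernel_likelihood_loss(residuals, h=1.0):
    """-mean_i log( (1/n) * sum_j K_h(r_i - r_j) ),K 取高斯核。"""
    r = residuals.view(-1, 1)
    diff = r - r.t()                                        # 两两残差差值
    log_k = -0.5 * (diff / h) ** 2 \
            - torch.log(h * torch.sqrt(torch.tensor(2 * torch.pi)))
    n = torch.tensor(float(r.shape[0]))
    log_density = torch.logsumexp(log_k, dim=1) - torch.log(n)
    return -log_density.mean()

pred = torch.randn(64, requires_grad=True)
target = torch.randn(64)
loss = kernel_likelihood_loss(pred - target)
loss.backward()                                             # 可直接用于梯度优化
```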
[AI-61] Metamorphic Testing of Large Language Models for Natural Language Processing
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自然语言处理(Natural Language Processing, NLP)任务中常产生错误结果的问题,且现有方法受限于标注数据集的稀缺性,难以有效识别这些故障行为。解决方案的关键在于采用变异测试(Metamorphic Testing, MT),其核心机制是利用变异关系(Metamorphic Relations, MRs)——即定义相关输入输出之间的约束关系——来检测LLMs的不一致性行为,从而无需依赖明确的“oracle”(如人工标注结果)即可发现模型缺陷。通过系统性收集191个MRs并实验验证其中36个,该研究首次对MT在LLMs中的应用进行了全面评估,揭示了其在提升LLM可靠性方面的潜力与局限。
链接: https://arxiv.org/abs/2511.02108
作者: Steven Cho,Stefano Ruberto,Valerio Terragni
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Using large language models (LLMs) to perform natural language processing (NLP) tasks has become increasingly pervasive in recent times. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that they can often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle to this is the limited availability of labeled datasets, which necessitates an oracle to determine the correctness of LLM behaviors. Metamorphic testing (MT) is a popular testing approach that alleviates this oracle problem. At the core of MT are metamorphic relations (MRs), which define relationships between the outputs of related inputs. MT can expose faulty behaviors without the need for explicit oracles (e.g., labeled datasets). This paper presents the most comprehensive study of MT for LLMs to date. We conducted a literature review and collected 191 MRs for NLP tasks. We implemented a representative subset (36 MRs) to conduct a series of experiments with three popular LLMs, running approximately 560,000 metamorphic tests. The results shed light on the capabilities and opportunities of MT for LLMs, as well as its limitations.
zh
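变异测试的核心是变异关系(MR)。以"同义改写不应改变分类输出"这一典型 MR 为例,其检测逻辑可写成如下草图(llm_classify 为假设的模型调用占位,示例中用关键词规则模拟,以演示 MR 如何在无标注数据时暴露不一致):

```python
def check_paraphrase_mr(llm_classify, text, paraphrase):
    """MR:输入与其同义改写的输出应一致;不一致即暴露可疑行为。"""
    return llm_classify(text) == llm_classify(paraphrase)

# 占位"模型":按关键词判断情感,用于演示 MR 如何捕获不一致
toy_model = lambda t: "positive" if "good" in t.lower() else "negative"
ok = check_paraphrase_mr(toy_model, "The movie was good.", "The film was great.")
print("MR satisfied" if ok else "MR violated -> potential fault")
```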
[AI-62] Geometric Data Valuation via Leverage Scores NEURIPS2025
【速读】:该论文旨在解决Shapley数据估值在大规模数据场景下因组合计算复杂度而难以应用的问题。其核心挑战在于,Shapley值需评估所有数据子集的边际效用,导致计算不可行。解决方案的关键在于提出一种基于统计杠杆率(statistical leverage scores)的几何替代方法,该方法通过衡量每个数据点在表示空间中对数据集跨度和有效维度的贡献,来量化其结构影响力;进一步引入岭杠杆率(ridge leverage scores),不仅满足Shapley值的哑元、效率与对称性公理,还保证了严格正的边际增益,并与经典的A-和D最优设计准则自然关联。实验表明,基于杠杆率采样的子集训练所得模型,在参数和预测风险上均接近全数据最优解,从而建立了数据估值与下游决策质量之间的理论联系。
链接: https://arxiv.org/abs/2511.02100
作者: Rodrigo Mendoza-Smith
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注: MLxOR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making (NeurIPS 2025)
Abstract:Shapley data valuation provides a principled, axiomatic framework for assigning importance to individual datapoints, and has gained traction in dataset curation, pruning, and pricing. However, it is a combinatorial measure that requires evaluating marginal utility across all subsets of the data, making it computationally infeasible at scale. We propose a geometric alternative based on statistical leverage scores, which quantify each datapoint’s structural influence in the representation space by measuring how much it extends the span of the dataset and contributes to the effective dimensionality of the training problem. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation and that extending them to ridge leverage scores yields strictly positive marginal gains that connect naturally to classical A- and D-optimal design criteria. We further show that training on a leverage-sampled subset produces a model whose parameters and predictive risk are within O(\varepsilon) of the full-data optimum, thereby providing a rigorous link between data valuation and downstream decision quality. Finally, we conduct an active learning experiment in which we empirically demonstrate that ridge-leverage sampling outperforms standard baselines without requiring access to gradients or backward passes.
zh
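岭杠杆率有闭式定义 l_i(λ) = x_iᵀ(XᵀX + λI)⁻¹x_i,下面是一个 numpy 最小实现示意(示例数据随机构造,仅演示张成方向上的离群样本得分更高):

```python
import numpy as np

def ridge_leverage_scores(X, lam=1.0):
    """l_i = x_i^T (X^T X + lam*I)^{-1} x_i:度量每行样本的结构影响力。"""
    G = X.T @ X + lam * np.eye(X.shape[1])
    Z = np.linalg.solve(G, X.T).T          # 每行 z_i = G^{-1} x_i,比显式求逆稳定
    return np.einsum("ij,ij->i", X, Z)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[0] *= 10                                 # 人为制造一个张成方向上的离群样本
scores = ridge_leverage_scores(X)
print("top-3 influential rows:", np.argsort(scores)[-3:][::-1])
```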
[AI-63] Automated Reward Design for Gran Turismo
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中奖励函数设计困难的问题,尤其是在复杂环境(如自动驾驶赛车)中,如何将人类期望的行为准确映射为有效的奖励函数。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)生成候选奖励函数,结合视觉语言模型(Vision-Language Models, VLMs)进行偏好评估,并引入人类反馈以迭代优化,从而自动构建出性能接近顶尖水平(如GT Sophy)的赛车智能体,并能探索新颖行为,实现可实用的自动化奖励设计流程。
链接: https://arxiv.org/abs/2511.02094
作者: Michel Ma,Takuma Seno,Kaushik Subramanian,Peter R. Wurman,Peter Stone,Craig Sherstan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:When designing reinforcement learning (RL) agents, a designer communicates the desired agent behavior through the definition of reward functions - numerical feedback given to the agent as reward or punishment for its actions. However, mapping desired behaviors to reward functions can be a difficult process, especially in complex environments such as autonomous racing. In this paper, we demonstrate how current foundation models can effectively search over a space of reward functions to produce desirable RL agents for the Gran Turismo 7 racing game, given only text-based instructions. Through a combination of LLM-based reward generation, VLM preference-based evaluation, and human feedback we demonstrate how our system can be used to produce racing agents competitive with GT Sophy, a champion-level RL racing agent, as well as generate novel behaviors, paving the way for practical automated reward design in real world applications.
zh
[AI-64] Uncertainty Guided Online Ensemble for Non-stationary Data Streams in Fusion Science
【速读】:该论文旨在解决机器学习(Machine Learning, ML)模型在核聚变装置中因数据分布漂移(distribution drift)而导致性能下降的问题。具体而言,实验过程中由于设备老化和运行状态变化,导致数据呈现非平稳性(non-stationary behavior),而传统静态ML模型无法适应此类动态变化,从而影响预测精度。解决方案的关键在于引入在线学习(online learning)机制以持续适应数据流,并进一步提出一种基于不确定性的在线集成方法(uncertainty-guided online ensemble)。该方法利用深度高斯过程近似(Deep Gaussian Process Approximation, DGPA)进行校准的不确定性估计,通过不确定性值指导元算法从不同历史窗口训练的多个学习器中选择最优预测结果,显著提升了模型鲁棒性和预测准确性,相较标准单模型在线学习,误差降低6%;而采用不确定性引导策略后,误差再降低4%,整体误差减少达10%。
链接: https://arxiv.org/abs/2511.02092
作者: Kishansingh Rajput,Malachi Schram,Brian Sammuli,Sen Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Plasma Physics (physics.plasm-ph)
备注: 24 pages including total of references, 2 appendices, 7 Figures (5 in main article, 2 in appendix A)
Abstract:Machine Learning (ML) is poised to play a pivotal role in the development and operation of next-generation fusion devices. Fusion data shows non-stationary behavior with distribution drifts, caused by both experimental evolution and machine wear-and-tear. ML models assume a stationary distribution and fail to maintain performance when faced with such non-stationary data streams. Online learning techniques have been leveraged in other domains; however, they have been largely unexplored for fusion applications. In this paper, we present an application of online learning that continuously adapts to a drifting data stream for prediction of Toroidal Field (TF) coil deflection at the DIII-D fusion facility. The results demonstrate that online learning is critical to maintaining ML model performance, reducing error by 80% compared to a static model. Moreover, traditional online learning can suffer from short-term performance degradation, as ground truth is not available before making the predictions. As such, we propose an uncertainty-guided online ensemble method to further improve the performance. The Deep Gaussian Process Approximation (DGPA) technique is leveraged for calibrated uncertainty estimation, and the uncertainty values are then used to guide a meta-algorithm that produces predictions based on an ensemble of learners trained on different horizons of historical data. The DGPA also provides uncertainty estimates along with the predictions for decision makers. The online ensemble and the proposed uncertainty-guided online ensemble reduce prediction error by about 6% and 10%, respectively, over standard single-model online learning.
zh
[AI-65] Natural Building Blocks for Structured World Models: Theory Evidence and Scaling
【速读】:该论文旨在解决世界建模领域中存在的架构碎片化问题,即研究者通常设计专用模型而难以复用和整合已有成果。其解决方案的关键在于提出一个基于基础随机过程的模块化框架:将世界模型分解为离散过程(如逻辑、符号)与连续过程(如物理、动力学)两类自然构建块,并通过层级组合定义模型结构;在此基础上,利用隐马尔可夫模型(Hidden Markov Models, HMMs)和切换线性动态系统(switching linear dynamical systems, sLDS)分别作为离散和连续建模的原语,当引入动作后可扩展为部分可观测马尔可夫决策过程(partially-observable Markov decision processes, POMDPs)和受控sLDS,从而统一支持被动建模(生成、预测)与主动控制(规划、决策)。该方法通过固定因果结构并仅优化四个深度参数避免传统结构学习中的组合爆炸问题,实现了在保持可解释性的同时达到与神经方法相当的性能。
链接: https://arxiv.org/abs/2511.02091
作者: Lancelot Da Costa,Sanjeev Namjoshi,Mohammed Abbas Ansari,Bernhard Schölkopf
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures, under review for World Modeling Workshop 2026
Abstract:The field of world modeling is fragmented, with researchers developing bespoke architectures that rarely build upon each other. We propose a framework that specifies the natural building blocks for structured world models based on the fundamental stochastic processes that any world model must capture: discrete processes (logic, symbols) and continuous processes (physics, dynamics); the world model is then defined by the hierarchical composition of these building blocks. We examine Hidden Markov Models (HMMs) and switching linear dynamical systems (sLDS) as natural building blocks for discrete and continuous modeling, which become partially-observable Markov decision processes (POMDPs) and controlled sLDS when augmented with actions. This modular approach supports both passive modeling (generation, forecasting) and active control (planning, decision-making) within the same architecture. We avoid the combinatorial explosion of traditional structure learning by largely fixing the causal architecture and searching over only four depth parameters. We review practical expressiveness through multimodal generative modeling (passive) and planning from pixels (active), with performance competitive with neural approaches while maintaining interpretability. The core outstanding challenge is scalable joint structure-parameter learning; current methods finesse this by cleverly growing structure and parameters incrementally, but are limited in their scalability. If solved, these natural building blocks could provide foundational infrastructure for world modeling, analogous to how standardized layers enabled progress in deep learning.
zh
[AI-66] Energy Loss Functions for Physical Systems NEURIPS2025
【速读】:该论文旨在解决如何在机器学习模型中有效融入系统物理先验知识的问题,尤其是在分子生成和自旋基态预测等科学领域中,传统方法通常仅在模型架构层面引入物理信息,而忽略了损失函数的设计对物理一致性的影响。其解决方案的关键在于将物理信息直接嵌入损失函数:通过假设数据样本处于近似能量景观下的热平衡状态,利用反向KL散度与玻尔兹曼分布构建能量损失函数,使损失表示为数据与模型预测之间的能量差。该方法不仅使传统目标(如均方误差)具有能量基础意义,而且提供了具有物理意义的能量函数,其梯度更符合物理有效配置方向,同时具备架构无关性和计算高效性,并天然尊重物理对称性。
链接: https://arxiv.org/abs/2511.02087
作者: Sékou-Oumar Kaba,Kusha Sareen,Daniel Levy,Siamak Ravanbakhsh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 10 pages, 4 figures, NeurIPS 2025
Abstract:Effectively leveraging prior knowledge of a system’s physics is crucial for applications of machine learning to scientific domains. Previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function for prediction and generative modeling tasks on systems like molecules and spins. We derive energy loss functions assuming that each data sample is in thermal equilibrium with respect to an approximate energy landscape. By using the reverse KL divergence with a Boltzmann distribution around the data, we obtain the loss as an energy difference between the data and the model predictions. This perspective also recasts traditional objectives like MSE as energy-based, but with a physically meaningless energy. In contrast, our formulation yields physically grounded loss functions with gradients that better align with valid configurations, while being architecture-agnostic and computationally efficient. The energy loss functions also inherently respect physical symmetries. We demonstrate our approach on molecular generation and spin ground-state prediction and report significant improvements over baselines.
zh
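在"数据处于近似能量景观热平衡"的假设下,围绕数据的反向 KL 使损失化为预测与数据之间的能量差。下面用一个玩具弹簧能量演示这类能量损失(spring_energy 为本文假设的示例能量,非论文所用体系):

```python
import torch

def spring_energy(x, k=1.0, r0=1.0):
    """玩具能量 E(x):相邻"粒子"间的谐振子势,代表任意可微能量函数。"""
    bond = x[:, 1:] - x[:, :-1]
    return 0.5 * k * ((bond.norm(dim=-1) - r0) ** 2).sum(dim=-1)

def energy_loss(pred, data, beta=1.0):
    """L = beta * (E(pred) - E(data)):围绕数据的反向 KL 的能量差主项。"""
    return (beta * (spring_energy(pred) - spring_energy(data))).mean()

data = torch.randn(8, 5, 3)                 # 8 个样本,各 5 粒子 3 维坐标
pred = (data + 0.1 * torch.randn_like(data)).requires_grad_(True)
loss = energy_loss(pred, data)
loss.backward()                             # 梯度把预测推向能量更低的构型
```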
[AI-67] Watermarking Discrete Diffusion Language Models
【速读】:该论文旨在解决离散扩散语言模型(discrete diffusion language models)中生成内容的可追溯性问题,即如何有效标记和检测由这类模型生成的文本内容。当前水印技术主要针对自回归大语言模型(LLMs)和图像扩散模型,尚未覆盖日益流行的高推理吞吐量的离散扩散语言模型。其解决方案的关键在于在每个扩散步骤中应用保持分布特性的Gumbel-max技巧(distribution-preserving Gumbel-max trick),并通过序列索引(sequence index)对随机性进行种子化,从而实现可靠检测;理论分析表明该方法在令牌序列长度增长时具有指数衰减的误检概率,且无失真(distortion-free)。
链接: https://arxiv.org/abs/2511.02083
作者: Avi Bagchi,Akhil Bhimaraju,Moulik Choraria,Daniel Alabi,Lav R. Varshney
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Watermarking has emerged as a promising technique to track AI-generated content and differentiate it from authentic human creations. While prior work extensively studies watermarking for autoregressive large language models (LLMs) and image diffusion models, none address discrete diffusion language models, which are becoming popular due to their high inference throughput. In this paper, we introduce the first watermarking method for discrete diffusion models by applying the distribution-preserving Gumbel-max trick at every diffusion step and seeding the randomness with the sequence index to enable reliable detection. We experimentally demonstrate that our scheme is reliably detectable on state-of-the-art diffusion language models and analytically prove that it is distortion-free with an exponentially decaying probability of false detection in the token sequence length.
zh
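"按序列位置播种的 Gumbel-max 采样"是该水印的核心:相同 (密钥, 位置) 复现相同噪声,而 argmax(log p + Gumbel) 的边缘分布仍是原模型分布。采样端的最小示意如下(seeded_gumbel 等命名为本文假设,检测端统计量从略):

```python
import hashlib
import numpy as np

def seeded_gumbel(key: bytes, position: int, vocab_size: int):
    """相同 (密钥, 位置) 复现相同的标准 Gumbel 噪声,供检测端复算。"""
    digest = hashlib.sha256(key + position.to_bytes(4, "big")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    u = rng.uniform(1e-12, 1.0, vocab_size)
    return -np.log(-np.log(u))

def watermark_sample(log_probs, key, position):
    # Gumbel-max:argmax(log p + g) 的边缘分布恰为 softmax(log p),不失真
    g = seeded_gumbel(key, position, log_probs.shape[0])
    return int(np.argmax(log_probs + g))

log_p = np.log(np.array([0.7, 0.2, 0.1]))
print(watermark_sample(log_p, b"secret-key", position=5))
```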
[AI-68] Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing
【速读】:该论文旨在解决当前机器智能与物理执行之间的鸿沟问题,即尽管生成式 AI(Generative AI)和自动化技术在虚拟领域取得进展,但现实中的科学实验与制造流程仍高度依赖人工监督与专家经验,导致可重复性差、可扩展性弱以及可及性受限。其解决方案的关键在于提出“人-AI共具身智能”(human-AI co-embodied intelligence)新范式,将人类用户、代理型 AI(agentic AI)与可穿戴硬件集成于一体,通过混合现实实现物理世界的实时交互:人类负责精确操作与控制,代理型 AI 提供情境感知推理、自适应规划与实时反馈,可穿戴接口则持续记录过程并支持可解释的协同纠错,从而构建具备上下文感知能力、自动纠错能力和知识迁移能力的智能制造与实验系统——以 Agentic-Physical Experimentation (APEX) 系统为例,在洁净室环境中实现了对柔性电子制造流程的高精度追踪、3D 视觉引导与专家级指导,显著超越通用多模态大语言模型的推理准确率,并成功将专家技能传递给初学者。
链接: https://arxiv.org/abs/2511.02071
作者: Xinyi Lin,Yuyang Zhang,Yuanhang Gan,Juntao Chen,Hao Shen,Yichun He,Lijun Li,Ze Yuan,Shuang Wang,Chaohao Wang,Rui Zhang,Na Li,Jia Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific experimentation and manufacturing rely on complex, multi-step procedures that demand continuous human expertise for precise execution and decision-making. Despite advances in machine learning and automation, conventional models remain confined to virtual domains, while real-world experimentation and manufacturing still rely on human supervision and expertise. This gap between machine intelligence and physical execution limits reproducibility, scalability, and accessibility across scientific and manufacturing workflows. Here, we introduce human-AI co-embodied intelligence, a new form of physical AI that unites human users, agentic AI, and wearable hardware into an integrated system for real-world experimentation and intelligent manufacturing. In this paradigm, humans provide precise execution and control, while agentic AI contributes memory, contextual reasoning, adaptive planning, and real-time feedback. The wearable interface continuously captures the experimental and manufacturing processes and facilitates seamless communication between humans and AI for corrective guidance and interpretable collaboration. As a demonstration, we present the Agentic-Physical Experimentation (APEX) system, coupling agentic reasoning with physical execution through mixed reality. APEX observes and interprets human actions, aligns them with standard operating procedures, provides 3D visual guidance, and analyzes every step. Implemented in a cleanroom for flexible-electronics fabrication, the APEX system achieves context-aware reasoning with accuracy exceeding general multimodal large language models, corrects errors in real time, and transfers expertise to beginners. These results establish a new class of agentic-physical-human intelligence that extends agentic reasoning beyond computation into the physical domain, transforming scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes.
zh
[AI-69] Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements
【速读】:该论文旨在解决当前机器学习(ML)推理服务在面对高要求的SLO(Service Level Objective,服务级别目标)场景时,因采用传统批处理(batching)策略而导致尾部延迟不可预测的问题。现有平台如TorchServe和Ray Serve虽能优化吞吐量,但在保障低延迟与稳定性方面表现不足,难以满足智能代理(agent)等复杂应用对确定性延迟的需求。其解决方案的关键在于提出一种以SLO优先(SLO-first)的架构设计——Vortex,通过精细化的任务调度与资源分配机制,在相同任务负载下显著降低并稳定了端到端延迟,尤其在RDMA(Remote Direct Memory Access)可用时展现出更优性能,使系统可在更高请求速率下仍满足SLO目标。
链接: https://arxiv.org/abs/2511.02062
作者: Yuting Yang,Tiancheng Yuan,Jamal Hashim,Thiago Garrett,Jeffrey Qian,Ann Zhang,Yifan Wang,Weijia Song,Ken Birman
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex’s pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
zh
[AI-70] Quantum-Enhanced Generative Models for Rare Event Prediction
【速读】:该论文旨在解决生成式 AI(Generative AI)在建模罕见事件(如金融崩盘、气候极端事件和生物异常)时面临的挑战,这些问题通常因数据稀缺性和重尾分布而难以准确捕捉,导致传统深度生成模型容易发生低概率模式坍塌或不确定性估计校准不足。解决方案的关键在于提出量子增强生成模型(Quantum-Enhanced Generative Model, QEGM),其核心创新包括:(1) 一种联合优化重建保真度与尾部感知似然的混合损失函数,以提升对稀有事件的概率建模能力;(2) 利用量子随机噪声注入增强样本多样性并缓解模式坍塌。训练采用经典-量子混合循环,其中经典参数通过反向传播更新,量子参数则使用参数移位梯度优化,从而显著降低尾部KL散度(最多减少50%),同时提高稀有事件召回率和覆盖校准性能。
链接: https://arxiv.org/abs/2511.02042
作者: M.Z. Haider,M.U. Ghouri,Tayyaba Noreen,M. Salman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: IEEE Conference COMCOMAP 2025
Abstract:Rare events such as financial crashes, climate extremes, and biological anomalies are notoriously difficult to model due to their scarcity and heavy-tailed distributions. Classical deep generative models often struggle to capture these rare occurrences, either collapsing low-probability modes or producing poorly calibrated uncertainty estimates. In this work, we propose the Quantum-Enhanced Generative Model (QEGM), a hybrid classical-quantum framework that integrates deep latent-variable models with variational quantum circuits. The framework introduces two key innovations: (1) a hybrid loss function that jointly optimizes reconstruction fidelity and tail-aware likelihood, and (2) quantum randomness-driven noise injection to enhance sample diversity and mitigate mode collapse. Training proceeds via a hybrid loop where classical parameters are updated through backpropagation while quantum parameters are optimized using parameter-shift gradients. We evaluate QEGM on synthetic Gaussian mixtures and real-world datasets spanning finance, climate, and protein structure. Results demonstrate that QEGM reduces tail KL divergence by up to 50 percent compared to state-of-the-art baselines (GAN, VAE, Diffusion), while improving rare-event recall and coverage calibration. These findings highlight the potential of QEGM as a principled approach for rare-event prediction, offering robustness beyond what is achievable with purely classical methods.
zh
[AI-71] RobustFSM: Submodular Maximization in Federated Setting with Malicious Clients
【速读】:该论文致力于解决联邦子模最大化(federated submodular maximization, FSM)场景下的鲁棒性问题,即在数据由分散客户端持有且各自定义表示质量的情况下,如何抵御恶意客户端通过上传虚假本地信息进行的攻击。这类攻击类似于传统联邦学习中的后门攻击,但因子模优化的独特性质而带来新的挑战。解决方案的关键在于提出RobustFSM,一种针对多种实际客户端攻击场景设计的鲁棒联邦子模最大化算法,其核心机制通过有效识别和抑制异常客户端贡献,在保证隐私与自治的前提下显著提升整体解的质量;实证研究表明,在严重攻击下,RobustFSM的性能相较传统联邦算法可提升高达200%。
链接: https://arxiv.org/abs/2511.02029
作者: Duc A. Tran,Dung Truong,Duy Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 9 pages
Abstract:Submodular maximization is an optimization problem benefiting many machine learning applications, where we seek a small subset best representing an extremely large dataset. We focus on the federated setting where the data are locally owned by decentralized clients who have their own definitions for the quality of representability. This setting requires repetitive aggregation of local information computed by the clients. While the main motivation is to respect the privacy and autonomy of the clients, the federated setting is vulnerable to client misbehaviors: malicious clients might share fake information. An analogy is the backdoor attack in conventional federated learning, but our challenge differs fundamentally due to the unique characteristics of submodular maximization. We propose RobustFSM, a federated submodular maximization solution that is robust to various practical client attacks. Its performance is substantiated with an empirical evaluation study using real-world datasets. Numerical results show that the solution quality of RobustFSM substantially exceeds that of the conventional federated algorithm when attacks are severe. The degree of this improvement depends on the dataset and attack scenario, and can be as high as 200%.
zh
[AI-72] Path-Coordinated Continual Learning with Neural Tangent Kernel-Justified Plasticity: A Theoretical Framework with Near State-of-the-Art Performance
【速读】:该论文旨在解决持续学习(Continual Learning)中的灾难性遗忘(Catastrophic Forgetting)问题,即神经网络在学习新任务时会严重遗忘先前任务的知识。其解决方案的关键在于提出了一种基于路径协调(Path-Coordinated)的框架,该框架融合了神经切线核(Neural Tangent Kernel, NTK)理论以界定可塑性边界、利用Wilson置信区间进行统计验证,并通过多指标评估路径质量。实验表明,该方法在Split-CIFAR10上实现了平均准确率66.7%且仅23.4%的灾难性遗忘,显著优于基线并接近当前最先进水平;同时发现NTK条件数可作为学习能力上限的预测指标,存在约10^11的临界阈值,且随着任务序列推进遗忘率从27%降至18%,体现系统稳定性增强。
链接: https://arxiv.org/abs/2511.02025
作者: Rathin Chandra Shit
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Under review, IEEE Letters
Abstract:Catastrophic forgetting is one of the fundamental issues of continual learning: neural networks forget previously learned tasks when trained on new ones. The proposed framework is a new path-coordinated approach to continual learning that unites Neural Tangent Kernel (NTK) theory for principled plasticity bounds, statistical validation by Wilson confidence intervals, and evaluation of path quality via multiple metrics. Experimental evaluation shows an average accuracy of 66.7% at the cost of 23.4% catastrophic forgetting on Split-CIFAR10, a substantial improvement over the baseline and competitive performance very close to state-of-the-art results. Further, we find that NTK condition numbers are predictive indicators of learning capacity limits, showing the existence of a critical threshold at a condition number of 10^11. Interestingly, the proposed strategy shows a tendency toward lower forgetting as the sequence of tasks progresses (27% to 18%), indicating system stabilization. The framework validates 80% of discovered paths with a rigorous statistical guarantee and maintains 90-97% retention on intermediate tasks. The analysis determines the core capacity limits of the continual-learning setting and offers actionable insights for enhancing adaptive regularization.
zh
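其中用于路径统计验证的 Wilson 置信区间有闭式解,可按下式直接实现(z=1.96 对应约 95% 置信水平;示例数值为虚构):

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """k/n 成功率的 Wilson 置信区间(z=1.96 对应约 95% 置信水平)。"""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# 例:某条路径 20 次评估中 16 次有效
print(wilson_interval(16, 20))   # ≈ (0.584, 0.919)
```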
[AI-73] Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
【速读】:该论文旨在解决大语言模型在经过窄域有害数据微调后出现广泛偏离对齐行为(即涌现式错位,Emergent Misalignment, EM)的根本机制问题,尤其是这种有害泛化如何在不同任务间跨域发生。解决方案的关键在于从几何视角揭示EM的参数结构特性:研究发现,不同任务的微调权重更新在参数空间中呈现高度线性一致性,表现为较高的余弦相似度、共享的低维子空间(通过主角度和投影重叠量化),以及通过线性模式连通性验证的功能等价性——即插值模型仍保持一致的广泛错位行为。这表明EM源于多个窄任务共同发现同一组共享的参数方向,暗示有害行为可能存在于权重空间中的特定可预测区域,从而为基于参数空间的可解释性和干预策略提供了新路径。
链接: https://arxiv.org/abs/2511.02022
作者: Daniel Aarao Reis Arturi,Eric Zhang,Andrew Ansah,Kevin Zhu,Ashwinee Panda,Aishwarya Balwani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.
zh
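摘要中用主角度(principal angles)度量不同任务微调更新子空间的重叠程度,可用 SVD 按下式计算(示例用共享低秩分量加噪声构造两组"更新",仅为演示):

```python
import numpy as np

def principal_angles(W1, W2, k=3):
    """两组权重更新的 top-k 左奇异子空间之间的主角度(弧度)。"""
    U1 = np.linalg.svd(W1, full_matrices=False)[0][:, :k]
    U2 = np.linalg.svd(W2, full_matrices=False)[0][:, :k]
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)   # 奇异值 = cos(主角度)
    return np.arccos(np.clip(s, -1.0, 1.0))

rng = np.random.default_rng(0)
shared = rng.normal(size=(128, 3)) @ rng.normal(size=(3, 64))  # 共享低秩方向
dW_a = shared + 0.05 * rng.normal(size=(128, 64))
dW_b = shared + 0.05 * rng.normal(size=(128, 64))
print(np.degrees(principal_angles(dW_a, dW_b)))   # 前 3 个角度应接近 0
```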
[AI-74] InteracSPARQL: An Interactive System for SPARQL Query Refinement Using Natural Language Explanations
【速读】:该论文旨在解决非专家用户在使用SPARQL查询语义网数据时面临的两大挑战:一是SPARQL语言复杂的语法结构,二是用户需具备对复杂数据结构的深入理解。为应对这些问题,作者提出InteracSPARQL系统,其核心创新在于结合规则驱动的结构化解释与大语言模型(Large Language Models, LLMs)的自然语言优化,首先从SPARQL抽象语法树(Abstract Syntax Tree, AST)生成结构化解释,再通过LLM进行语言层面的精细化润色,从而提升解释的可读性与准确性;同时支持用户通过直接反馈或LLM驱动的自我修正机制实现交互式查询迭代优化,显著提升了查询准确率、解释清晰度及用户体验。
链接: https://arxiv.org/abs/2511.02002
作者: Xiangru Jian,Zhengyuan Dong,M. Tamer Özsu
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Working paper
Abstract:In recent years, querying semantic web data using SPARQL has remained challenging, especially for non-expert users, due to the language’s complex syntax and the prerequisite of understanding intricate data structures. To address these challenges, we propose InteracSPARQL, an interactive SPARQL query generation and refinement system that leverages natural language explanations (NLEs) to enhance user comprehension and facilitate iterative query refinement. InteracSPARQL integrates LLMs with a rule-based approach to first produce structured explanations directly from SPARQL abstract syntax trees (ASTs), followed by LLM-based linguistic refinements. Users can interactively refine queries through direct feedback or LLM-driven self-refinement, enabling the correction of ambiguous or incorrect query components in real time. We evaluate InteracSPARQL on standard benchmarks, demonstrating significant improvements in query accuracy, explanation clarity, and overall user satisfaction compared to baseline approaches. Our experiments further highlight the effectiveness of combining rule-based methods with LLM-driven refinements to create more accessible and robust SPARQL interfaces.
zh
[AI-75] TRACE: Textual Reasoning for Affordance Coordinate Extraction ICCV2025
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在机器人操作中难以将高层指令精准转化为空间可操作性(spatial affordances)的问题。现有基于视觉的思维链(Chain-of-Thought, CoT)方法虽能提升推理能力,但计算成本高且缺乏明确的中间推理过程。其解决方案的关键在于提出一种名为TRACE(Textual Reasoning for Affordance Coordinate Extraction)的新范式,通过引入文本思维链(Chain-of-Reasoning, CoR),使VLM在执行动作前显式生成关于空间关系的自然语言推理步骤。这一设计不仅提升了模型在Where2Place(W2P)基准上的精度(达48.1%,相对提升9.6%),还增强了决策过程的可解释性与可靠性,实验证明性能随推理数据量增加而提升,验证了CoR机制的有效性。
链接: https://arxiv.org/abs/2511.01999
作者: Sangyun Park,Jin Kim,Yuchen Cui,Matthew S. Brown
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICCV 2025. *Equal contribution. †Corresponding author
Abstract:Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR’s effectiveness. Furthermore, analysis of the model’s attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at this https URL
zh
[AI-76] Vibe Learning: Education in the age of AI
【速读】:该论文试图解决的问题是:在生成式 AI(尤其是大语言模型,LLMs)日益普及的背景下,如何确保人类智力劳动的独特性与长期价值不被替代,特别是在教育领域中培养未来仍具优势的人类核心能力。其解决方案的关键在于,基于建构主义(constructivist)范式重构教育体系,强调发展那些AI难以复制的能力,如批判性思维、创造性问题解决和元认知能力,从而在人机协同的新时代中保持人类智能的不可替代性。
链接: https://arxiv.org/abs/2511.01956
作者: Marcos Florencio,Francielle Prieto
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The debate over whether “thinking machines” could replace human intellectual labor has existed in both public and expert discussions since the mid-twentieth century, when the concept and terminology of Artificial Intelligence (AI) first emerged. For decades, this idea remained largely theoretical. However, with the recent advent of Generative AI - particularly Large Language Models (LLMs) - and the widespread adoption of tools such as ChatGPT, the issue has become a practical reality. Many fields that rely on human intellectual effort are now being reshaped by AI tools that both expand human capabilities and challenge the necessity of certain forms of work once deemed uniquely human but now easily automated. Education, somewhat unexpectedly, faces a pivotal responsibility: to devise long-term strategies for cultivating human skills that will remain relevant in an era of pervasive AI in the intellectual domain. In this context, we identify the limitations of current AI systems - especially those rooted in LLM technology - argue that the fundamental causes of these weaknesses cannot be resolved through existing methods, and propose directions within the constructivist paradigm for transforming education to preserve the long-term advantages of human intelligence over AI tools.
zh
[AI-77] Black-Box Membership Inference Attack for LVLMs via Prior Knowledge-Calibrated Memory Probing
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在黑盒场景下难以进行成员推理攻击(Membership Inference Attacks, MIAs)的问题。由于主流LVLM在推理阶段仅暴露生成输出而隐藏内部计算特征,现有基于似然性特征的白盒或灰盒MIA方法难以适用。其解决方案的关键在于提出一种基于先验知识校准的记忆探针机制(prior knowledge-calibrated memory probing mechanism),通过评估目标模型对可疑图像数据中嵌入的私有语义信息的记忆程度来判断数据是否来自训练集,该方法不依赖于模型内部特征,仅需访问模型输出,从而实现了首个纯黑盒环境下的LVLM成员推理攻击框架。
链接: https://arxiv.org/abs/2511.01952
作者: Jinhua Yin,Peiru Yang,Chen Yang,Huili Wang,Zhiyang Hu,Shangguang Wang,Yongfeng Huang,Tao Qi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large vision-language models (LVLMs) derive their capabilities from extensive training on vast corpora of visual and textual data. Empowered by large-scale parameters, these models often exhibit strong memorization of their training data, rendering them susceptible to membership inference attacks (MIAs). Existing MIA methods for LVLMs typically operate under white- or gray-box assumptions, by extracting likelihood-based features for the suspected data samples based on the target LVLMs. However, mainstream LVLMs generally only expose generated outputs while concealing internal computational features during inference, limiting the applicability of these methods. In this work, we propose the first black-box MIA framework for LVLMs, based on a prior knowledge-calibrated memory probing mechanism. The core idea is to assess the model memorization of the private semantic information embedded within the suspected image data, which is unlikely to be inferred from general world knowledge alone. We conducted extensive experiments across four LVLMs and three datasets. Empirical results demonstrate that our method effectively identifies training data of LVLMs in a purely black-box setting and even achieves performance comparable to gray-box and white-box methods. Further analysis reveals the robustness of our method against potential adversarial manipulations, and the effectiveness of the methodology designs. Our code and data are available at this https URL.
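示例(本文补充):按摘要思路,成员分数可理解为"模型对图中私有语义的回答正确率"减去"仅凭通用世界知识的回答正确率"。下面是这一打分逻辑的极简示意,探针问题与函数名均为假设。

```python
def memory_probe_score(target_answers, prior_answers, ground_truth):
    """先验知识校准的记忆探针打分示意:
    target_answers 为目标 LVLM 看到可疑图像后对私有语义探针问题的回答,
    prior_answers 为不看图、仅凭通用知识得到的回答(先验校准)。
    分数 = 目标正确率 - 先验正确率,越高越可能是训练成员。"""
    def acc(answers):
        return sum(a == g for a, g in zip(answers, ground_truth)) / len(ground_truth)
    return acc(target_answers) - acc(prior_answers)

# 用法示意:4 个关于图中私有细节的探针问题
score = memory_probe_score(
    target_answers=["red", "dog", "3", "yes"],
    prior_answers=["blue", "dog", "2", "no"],
    ground_truth=["red", "dog", "3", "yes"])
print(score)  # 1.0 - 0.25 = 0.75,偏向判定该图像出现在训练集中
```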
zh
[AI-78] Interpretable Heart Disease Prediction via a Weighted Ensemble Model: A Large-Scale Study with SHAP and Surrogate Decision Trees
【速读】:该论文旨在解决心血管疾病(Cardiovascular Disease, CVD)早期风险预测中模型可靠性与可解释性不足的问题。其解决方案的关键在于构建一个策略加权的集成模型,融合树模型(LightGBM、XGBoost)与卷积神经网络(Convolutional Neural Network, CNN),并通过特征工程将原始22个特征扩展至25个以提升信息表达能力;同时,采用策略性权重处理类别不平衡问题,并引入代理决策树和SHapley Additive exPlanations(SHAP)方法增强模型透明度,最终在测试集上实现AUC为0.8371(p=0.003)且召回率达80.0%,兼顾了高性能与临床可解释性,适合用于公共卫生筛查场景。
链接: https://arxiv.org/abs/2511.01947
作者: Md Abrar Hasnat,Md Jobayer,Md. Mehedi Hasan Shawon,Md. Golam Rabiul Alam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Cardiovascular disease (CVD) remains a critical global health concern, demanding reliable and interpretable predictive models for early risk assessment. This study presents a large-scale analysis using the Heart Disease Health Indicators Dataset, developing a strategically weighted ensemble model that combines tree-based methods (LightGBM, XGBoost) with a Convolutional Neural Network (CNN) to predict CVD risk. The model was trained on a preprocessed dataset of 229,781 patients where the inherent class imbalance was managed through strategic weighting and feature engineering enhanced the original 22 features to 25. The final ensemble achieves a statistically significant improvement over the best individual model, with a Test AUC of 0.8371 (p=0.003) and is particularly suited for screening with a high recall of 80.0%. To provide transparency and clinical interpretability, surrogate decision trees and SHapley Additive exPlanations (SHAP) are used. The proposed model delivers a combination of robust predictive performance and clinical transparency by blending diverse learning architectures and incorporating explainability through SHAP and surrogate decision trees, making it a strong candidate for real-world deployment in public health screening.
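示例(本文补充):策略加权集成的骨架是按固定权重融合各子模型的预测概率,再用偏向召回的阈值做筛查判定。以下代码中的权重 [0.4, 0.4, 0.2] 与阈值 0.35 均为假设值,并非论文给出的配置。

```python
import numpy as np

def weighted_ensemble_proba(probas, weights):
    """按策略权重融合各子模型输出的阳性概率(probas 形状: [n_models, n_samples])。"""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(probas, dtype=float), axes=1)

# 三个子模型(如 LightGBM、XGBoost、CNN)对 4 名患者的 CVD 风险概率
p_lgb = [0.81, 0.10, 0.55, 0.30]
p_xgb = [0.78, 0.15, 0.60, 0.25]
p_cnn = [0.70, 0.20, 0.52, 0.40]
proba = weighted_ensemble_proba([p_lgb, p_xgb, p_cnn], weights=[0.4, 0.4, 0.2])
# 筛查场景偏重召回率:用较低的判定阈值换取更高的 recall
print(proba, (proba >= 0.35).astype(int))
```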
zh
[AI-79] COFAP: A Universal Framework for COFs Adsorption Prediction through Designed Multi-Modal Extraction and Cross-Modal Synergy
【速读】:该论文旨在解决共价有机框架材料(Covalent Organic Frameworks, COFs)在气体吸附与分离应用中,因结构设计空间庞大而导致的高效高通量筛选难题。传统机器学习预测模型依赖于特定气体相关的特征(如亨利系数或吸附热),不仅耗时且难以扩展,限制了实际应用效率。其解决方案的关键在于提出一种通用的COFs吸附预测框架(COFAP),通过深度学习自动提取多模态的结构与化学特征,并利用交叉模态注意力机制融合这些互补信息,从而在无需依赖气体特异性参数的情况下实现高性能预测,达到当前最优水平(SOTA)。此外,该框架还揭示了高性能COFs在孔径和比表面积上的集中分布规律,并引入可调权重的优先级排序策略,支持面向不同应用场景的灵活筛选,显著提升了效率与准确性,具备直接部署于晶体多孔材料研究的潜力。
链接: https://arxiv.org/abs/2511.01946
作者: Zihan Li,Mingyang Wan,Mingyu Gao,Zhongshan Chen,Xiangke Wang,Feifan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注:
Abstract:Covalent organic frameworks (COFs) are promising adsorbents for gas adsorption and separation, while identifying the optimal structures among their vast design space requires efficient high-throughput screening. Conventional machine-learning predictors rely heavily on specific gas-related features. However, these features are time-consuming and limit scalability, leading to inefficiency and labor-intensive processes. Herein, a universal COFs adsorption prediction framework (COFAP) is proposed, which can extract multi-modal structural and chemical features through deep learning, and fuse these complementary features via cross-modal attention mechanism. Without Henry coefficients or adsorption heat, COFAP sets a new SOTA by outperforming previous approaches on hypoCOFs dataset. Based on COFAP, we also found that high-performing COFs for separation concentrate within a narrow range of pore size and surface area. A weight-adjustable prioritization scheme is also developed to enable flexible, application-specific ranking of candidate COFs for researchers. Superior efficiency and accuracy render COFAP directly deployable in crystalline porous materials.
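示例(本文补充):跨模态注意力的融合机制可以用几行 numpy 演示:一种模态的特征作 Query,另一种作 Key/Value。下面的单头实现省略了全部可学习参数,token 数量与特征含义均为假设。

```python
import numpy as np

def cross_modal_attention(q_feats, kv_feats):
    """跨模态注意力融合的极简示意(单头、无学习参数,仅演示机制)。"""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax
    return attn @ kv_feats                            # 用化学模态补全结构模态

rng = np.random.default_rng(0)
struct_feats = rng.normal(size=(16, 32))   # 结构模态:如孔径/比表面积的 token 表示
chem_feats = rng.normal(size=(10, 32))     # 化学模态:如官能团/元素组成的 token 表示
fused = cross_modal_attention(struct_feats, chem_feats)
print(fused.shape)  # (16, 32):每个结构 token 融合了化学模态的互补信息
```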
zh
[AI-80] Detecting Vulnerabilities from Issue Reports for Internet-of-Things
【速读】:该论文旨在解决物联网(IoT)系统中及时识别反映软件漏洞的缺陷报告(issue reports)的问题,这在IoT场景下尤为关键,因分析速度慢于非IoT系统。现有基于机器学习(ML)和大语言模型(LLMs)的方法主要适用于非IoT系统,而IoT领域的应用尚未探索。其解决方案的关键在于:首先结合ML与LLMs及自然语言处理(NLP)技术,对21个Eclipse IoT项目的缺陷报告进行漏洞指示性分类;其次,通过在11,000条GitHub缺陷报告上微调预训练的BERT掩码语言模型(MLM)实现漏洞分类。实验表明,基于BERT特征训练的支持向量机(SVM)模型表现最佳(AUC=0.65),而直接微调BERT的效果较差(准确率仅0.26),凸显了训练数据完整性对模型性能的重要性。
链接: https://arxiv.org/abs/2511.01941
作者: Sogol Masoumzadeh
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: ACCEPTED/To Appear in the Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE) 2025. this https URL
Abstract:Timely identification of issue reports reflecting software vulnerabilities is crucial, particularly for Internet-of-Things (IoT) where analysis is slower than non-IoT systems. While Machine Learning (ML) and Large Language Models (LLMs) detect vulnerability-indicating issues in non-IoT systems, their IoT use remains unexplored. We are the first to tackle this problem by proposing two approaches: (1) combining ML and LLMs with Natural Language Processing (NLP) techniques to detect vulnerability-indicating issues of 21 Eclipse IoT projects and (2) fine-tuning a pre-trained BERT Masked Language Model (MLM) on 11,000 GitHub issues for classifying vulnerability-indicating issues. Our best performance belongs to a Support Vector Machine (SVM) trained on BERT NLP features, achieving an Area Under the receiver operator characteristic Curve (AUC) of 0.65. The fine-tuned BERT achieves 0.26 accuracy, emphasizing the importance of exposing all data during training. Our contributions set the stage for accurately detecting IoT vulnerabilities from issue reports, similar to non-IoT systems.
zh
[AI-81] The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold
【速读】:该论文旨在解决神经网络中“grokking”现象的机制问题,即模型在完全记忆训练数据后,需要经历显著延迟才能实现全面泛化这一反直觉现象。此前研究将此归因于权重衰减驱动的表示学习,但其内在动力学仍不清晰。论文的关键解决方案是将后记忆阶段的学习过程重新建模为约束优化问题:梯度下降在零损失流形(zero-loss manifold)上有效最小化权重范数。作者在极小学习率和权重衰减系数的极限下给出了严格证明,并通过引入参数子集动力学解耦近似,推导出两层网络第一层在后记忆阶段的闭式动态表达式。实验验证表明,基于该预测梯度模拟训练过程可重现grokking特有的延迟泛化与表示学习特征。
链接: https://arxiv.org/abs/2511.01938
作者: Tiberiu Musat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
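为帮助理解"零损失流形上的范数最小化"这一图景,下面用通用记法补充一段公式梳理(符号与投影形式为本文整理,非论文原文):

```latex
% 记每个训练样本的残差为 r_i(\theta),零损失流形为
\mathcal{M} = \{\theta : r_i(\theta) = 0,\ \forall i\},
\qquad \min_{\theta \in \mathcal{M}} \tfrac{1}{2}\|\theta\|_2^2 .
% 在学习率与权重衰减系数 \lambda 趋于无穷小的极限下,
% 带权重衰减的梯度流 \dot{\theta} = -\nabla L(\theta) - \lambda\theta
% 的有效动力学等价于把收缩力 -\lambda\theta 投影到 \mathcal{M} 的切空间
% (J 为残差对参数的雅可比,假设其行满秩):
\dot{\theta} \;=\; -\lambda \bigl(I - J^{\top}(J J^{\top})^{-1} J\bigr)\,\theta .
```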
zh
[AI-82] Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中因过度偏向难样本而导致输出冗长、推理成本上升的问题。其核心问题是:标准RLVR流程通过过滤“简单”问题以提升训练效率,使得模型主要学习处理需要长推理链的难题,从而导致模型将“思考更久”误认为“思考更好”,进而产生不必要的冗余输出。解决方案的关键在于适度保留并轻微加权中等难度问题,这相当于引入了一种隐式的长度正则化机制——让模型持续接触可快速求解的短链任务,从而约束其输出分布,抑制冗余推理行为。实验表明,该方法在不显式惩罚输出长度的情况下,使模型在保持基准性能的同时显著缩短生成解题过程(平均减少近50%),实现了“免费涌现的简洁性”。
链接: https://arxiv.org/abs/2511.01937
作者: Abdelaziz Bounhar,Hadi Abdine,Evan Dufraisse,Ahmad Chamma,Amr Mohamed,Dani Bouch,Michalis Vazirgiannis,Guokan Shang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available on GitHub at this https URL, with datasets and models on Hugging Face at this https URL.
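示例(本文补充):下面的采样函数示意"保留并轻微上调中等偏易问题权重"的做法;其中按历史通过率划分难度的阈值(0.5、0.9)与上调倍率均为假设值,并非论文设置。

```python
import random

def sample_rlvr_batch(problems, batch_size=8, easy_upweight=1.5):
    """按通过率给问题加权采样:保留并轻微上调"中等偏易"问题,
    而非像常规 RLVR 流程那样将其全部过滤掉。"""
    def weight(p):
        if p["pass_rate"] >= 0.9:        # 过于简单:仍然过滤
            return 0.0
        if p["pass_rate"] >= 0.5:        # 中等偏易:轻微上调权重,充当隐式长度正则
            return easy_upweight
        return 1.0                        # 难题:正常权重
    weights = [weight(p) for p in problems]
    return random.choices(problems, weights=weights, k=batch_size)

pool = [{"id": i, "pass_rate": r} for i, r in
        enumerate([0.95, 0.7, 0.6, 0.3, 0.1, 0.05, 0.8, 0.2])]
print([p["id"] for p in sample_rlvr_batch(pool)])
```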
zh
[AI-83] Q-Sat AI: Machine Learning-Based Decision Support for Data Saturation in Qualitative Studies
【速读】:该论文旨在解决定性研究中样本量确定依赖主观且模糊的数据饱和原则所导致的方法学不一致性和严谨性不足的问题。其解决方案的关键在于构建一个基于机器学习(Machine Learning, ML)的系统化模型,通过整合五种基础定性研究方法(案例研究、扎根理论、现象学、叙事研究和民族志研究)的数据,识别出包括研究范围、信息功率和研究者能力在内的十个关键参数作为输入特征,并利用K近邻(K-Nearest Neighbors, KNN)、梯度提升(Gradient Boosting, GB)、随机森林(Random Forest, RF)、XGBoost和决策树(Decision Tree, DT)等算法建立高解释力的预测模型(测试R²约0.85),从而实现对定性研究样本量的客观量化决策支持。
链接: https://arxiv.org/abs/2511.01935
作者: Hasan Tutar,Caner Erden,Ümit Şentürk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The determination of sample size in qualitative research has traditionally relied on the subjective and often ambiguous principle of data saturation, which can lead to inconsistencies and threaten methodological rigor. This study introduces a new, systematic model based on machine learning (ML) to make this process more objective. Utilizing a dataset derived from five fundamental qualitative research approaches - namely, Case Study, Grounded Theory, Phenomenology, Narrative Research, and Ethnographic Research - we developed an ensemble learning model. Ten critical parameters, including research scope, information power, and researcher competence, were evaluated using an ordinal scale and used as input features. After thorough preprocessing and outlier removal, multiple ML algorithms were trained and compared. The K-Nearest Neighbors (KNN), Gradient Boosting (GB), Random Forest (RF), XGBoost, and Decision Tree (DT) algorithms showed the highest explanatory power (Test R² ≈ 0.85), effectively modeling the complex, non-linear relationships involved in qualitative sampling decisions. Feature importance analysis confirmed the vital roles of research design type and information power, providing quantitative validation of key theoretical assumptions in qualitative methodology. The study concludes by proposing a conceptual framework for a web-based computational application designed to serve as a decision support system for qualitative researchers, journal reviewers, and thesis advisors. This model represents a significant step toward standardizing sample size justification, enhancing transparency, and strengthening the epistemological foundation of qualitative inquiry through evidence-based, systematic decision-making.
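示例(本文补充):以下用 scikit-learn 在合成的序数特征上复现"多回归器对比解释力"的流程骨架;数据生成方式纯属示意,与论文数据无关。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# 合成数据:10 个序数型输入(研究范围、信息功率、研究者能力等,1~5 分),目标为样本量
X = rng.integers(1, 6, size=(400, 10)).astype(float)
y = 5 + 2.5 * X[:, 0] + 1.8 * X[:, 1] - 1.2 * X[:, 2] + rng.normal(0, 1.5, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("KNN", KNeighborsRegressor(n_neighbors=5)),
                    ("GB", GradientBoostingRegressor(random_state=0)),
                    ("RF", RandomForestRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    print(name, round(r2_score(y_te, model.predict(X_te)), 3))
```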
zh
[AI-84] Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch EMNLP2025
【速读】:该论文旨在解决当前训练工具增强的大语言模型(Tool-Augmented Large Language Models, TALLMs)在面对陌生或复杂工具使用场景时泛化能力不足的问题。现有监督微调(Supervised Fine-Tuning, SFT)方法依赖于大量领域特定数据集,难以适应未见过的工具交互情境;而虽然强化学习(Reinforcement Learning, RL)已被证明可提升模型推理与泛化能力,但如何有效激发模型内在推理潜力并实现无工具依赖的泛化仍缺乏系统性方法。其解决方案的关键在于提出一种动态泛化引导奖励设计(Dynamic Generalization-Guided Reward Design),该设计通过逐步从探索性奖励向利用性奖励过渡,引导模型从初始随机尝试中学习到更稳健、通用的工具使用模式。基于此机制,作者构建了无需任何后训练基础模型即可直接进行强化学习扩展的 Tool-Zero 系列模型,在跨数据集和同数据集评估中均显著优于 SFT 及 RL-with-SFT 方法,性能提升超过 7%,验证了该策略在提升模型工具无关泛化能力方面的有效性与鲁棒性。
链接: https://arxiv.org/abs/2511.01934
作者: Yirong Zeng,Xiao Ding,Yutai Hou,Yuxian Wang,Li Du,Juyi Dai,Qiuyang Ding,Duyu Tang,Dandan Tu,Weiwen Liu,Bing Qin,Ting Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Findings
Abstract:Training tool-augmented LLMs has emerged as a promising approach to enhancing language models’ capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recently, reinforcement learning (RL) paradigm can endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: Can the pure RL be used to effectively elicit a model’s intrinsic reasoning capabilities and enhance the tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series models. These models are trained to enable LLMs to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings. These gains are consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our methods.
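示例(本文补充):"奖励从探索性向利用性过渡"可以用一个随训练进度插值的奖励函数直观表达。以下为线性插值的示意实现,插值方式与具体奖励取值均为假设。

```python
def dynamic_tool_reward(step, total_steps, called_tool, task_solved):
    """动态泛化引导奖励的示意:alpha 随训练从 0 增到 1,
    奖励重心由"探索性工具调用"平滑过渡到"利用性任务求解"。"""
    alpha = min(step / total_steps, 1.0)
    explore_r = 1.0 if called_tool else 0.0   # 鼓励初期尝试调用工具
    exploit_r = 1.0 if task_solved else 0.0   # 后期只看结果是否正确
    return (1 - alpha) * explore_r + alpha * exploit_r

for s in (0, 500, 1000):
    print(s, dynamic_tool_reward(s, 1000, called_tool=True, task_solved=False))
# 输出 1.0 → 0.5 → 0.0:同一行为获得的奖励随训练阶段而变化
```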
zh
[AI-85] Dynamic Population Distribution Aware Human Trajectory Generation with Diffusion Model
【速读】:该论文旨在解决现有轨迹生成方法在模拟人类移动行为时,通常仅关注个体运动模式而忽略人口分布动态变化对移动行为影响的问题。其关键解决方案是提出一种基于扩散模型(diffusion model)的新型轨迹生成框架,通过引入动态人口分布约束来引导高保真度的轨迹生成结果;具体而言,该框架构建空间图以增强轨迹的空间相关性,并设计了一个考虑人口分布感知的去噪网络,从而在去噪过程中同时捕捉人类移动行为的时空依赖性和人口密度变化的影响。
链接: https://arxiv.org/abs/2511.01929
作者: Qingyue Long,Can Rong,Tong Li,Yong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Human trajectory data is crucial in urban planning, traffic engineering, and public health. However, directly using real-world trajectory data often faces challenges such as privacy concerns, data acquisition costs, and data quality. A practical solution to these challenges is trajectory generation, a method developed to simulate human mobility behaviors. Existing trajectory generation methods mainly focus on capturing individual movement patterns but often overlook the influence of population distribution on trajectory generation. In reality, dynamic population distribution reflects changes in population density across different regions, significantly impacting individual mobility behavior. Thus, we propose a novel trajectory generation framework based on a diffusion model, which integrates the dynamic population distribution constraints to guide high-fidelity generation outcomes. Specifically, we construct a spatial graph to enhance the spatial correlation of trajectories. Then, we design a dynamic population distribution aware denoising network to capture the spatiotemporal dependencies of human mobility behavior as well as the impact of population distribution in the denoising process. Extensive experiments show that the trajectories generated by our model can resemble real-world trajectories in terms of some critical statistical metrics, outperforming state-of-the-art algorithms by over 54%.
zh
[AI-86] A Unified Model for Human Mobility Generation in Natural Disasters
【速读】:该论文旨在解决自然灾害场景下人类移动模式生成的泛化能力不足问题,即现有模型受限于单一城市或特定灾害的数据,难以适应新城市或新型灾害场景。解决方案的关键在于提出一个统一的模型 UniDisMob,通过两个核心机制实现跨灾害和跨城市的通用性:一是设计物理信息提示(physics-informed prompt)与物理引导对齐(physics-guided alignment),利用不同灾害后移动模式的共性规律指导生成过程;二是引入元学习框架(meta-learning framework),通过共享参数提取多城市间的通用特征,并借助私有参数捕捉各城市的特异性,从而在多种城市和灾害场景中显著优于现有最优基线方法(平均性能提升超过13%)。
链接: https://arxiv.org/abs/2511.01928
作者: Qingyue Long,Huandong Wang,Qi Ryan Wang,Yong Li
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:
Abstract:Human mobility generation in disaster scenarios plays a vital role in resource allocation, emergency response, and rescue coordination. During disasters such as wildfires and hurricanes, human mobility patterns often deviate from their normal states, which makes the task more challenging. However, existing works usually rely on limited data from a single city or specific disaster, significantly restricting the model’s generalization capability in new scenarios. In fact, disasters are highly sudden and unpredictable, and any city may encounter new types of disasters without prior experience. Therefore, we aim to develop a one-for-all model for mobility generation that can generalize to new disaster scenarios. However, building a universal framework faces two key challenges: 1) the diversity of disaster types and 2) the heterogeneity among different cities. In this work, we propose a unified model for human mobility generation in natural disasters (named UniDisMob). To enable cross-disaster generalization, we design physics-informed prompt and physics-guided alignment that leverage the underlying common patterns in mobility changes after different disasters to guide the generation process. To achieve cross-city generalization, we introduce a meta-learning framework that extracts universal patterns across multiple cities through shared parameters and captures city-specific features via private parameters. Extensive experiments across multiple cities and disaster scenarios demonstrate that our method significantly outperforms state-of-the-art baselines, achieving an average performance improvement exceeding 13%.
zh
[AI-87] DeepContour: A Hybrid Deep Learning Framework for Accelerating Generalized Eigenvalue Problem Solving via Efficient Contour Design
【速读】:该论文旨在解决大规模广义特征值问题(Generalized Eigenvalue Problems, GEPs)求解过程中计算成本高、效率低的问题,尤其针对传统 contour integral (CI) 方法因积分路径选择不当而导致的计算开销大和数值精度下降难题。其解决方案的关键在于提出一种名为 DeepContour 的混合框架:首先利用傅里叶神经算子(Fourier Neural Operator, FNO)快速预测给定 GEP 的谱分布;随后通过核密度估计(Kernel Density Estimation, KDE)对预测谱进行建模,自动且系统地生成最优积分路径;最终由这些优化后的积分路径引导 CI 求解器高效定位目标特征值。该方法结合了深度学习的预测能力与经典数值求解器的严谨性,显著提升了高维矩阵下 GEP 求解的效率与鲁棒性。
链接: https://arxiv.org/abs/2511.01927
作者: Yeqiu Chen,Ziyan Liu,Hong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:
Abstract:Solving large-scale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering. As a promising direction, contour integral (CI) methods, such as the CIRR algorithm, offer an efficient and parallelizable framework. However, their performance is critically dependent on the selection of integration contours – improper selection without reliable prior knowledge of eigenvalue distribution can incur significant computational overhead and compromise numerical accuracy. To address this challenge, we propose DeepContour, a novel hybrid framework that integrates a deep learning-based spectral predictor with Kernel Density Estimation for principled contour design. Specifically, DeepContour first employs a Fourier Neural Operator (FNO) to rapidly predict the spectral distribution of a given GEP. Subsequently, Kernel Density Estimation (KDE) is applied to the predicted spectrum to automatically and systematically determine proper integration contours. Finally, these optimized contours guide the CI solver to efficiently find the desired eigenvalues. We demonstrate the effectiveness of our method on diverse challenging scientific problems. In our main experiments, DeepContour accelerates GEP solving across multiple datasets, achieving up to a 5.63× speedup. By combining the predictive power of deep learning with the numerical rigor of classical solvers, this work pioneers an efficient and robust paradigm for tackling difficult generalized eigenvalue problems involving matrices of high dimension.
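示例(本文补充):"KDE 自动生成围道"这一步可用下面的 Python 片段直观示意:对预测特征值做核密度估计,在密度的局部极大值处放置围道中心。此处仅在实轴上演示,围道半径的取法为本文假设。

```python
import numpy as np
from scipy.stats import gaussian_kde

def contours_from_spectrum(pred_eigs, n_contours=2, pad=0.1):
    """由 FNO 预测谱设计积分路径的示意:KDE 密度峰值处放置围道中心。"""
    kde = gaussian_kde(pred_eigs)
    grid = np.linspace(pred_eigs.min() - pad, pred_eigs.max() + pad, 512)
    dens = kde(grid)
    peaks = [i for i in range(1, len(grid) - 1)
             if dens[i] > dens[i - 1] and dens[i] > dens[i + 1]]
    centers = grid[sorted(peaks, key=lambda i: -dens[i])[:n_contours]]
    radius = (grid[-1] - grid[0]) / (4 * n_contours)
    return [(float(c), float(radius)) for c in np.sort(centers)]

rng = np.random.default_rng(1)
eigs = np.concatenate([rng.normal(-2.0, 0.1, 40), rng.normal(1.0, 0.2, 60)])
print(contours_from_spectrum(eigs))  # 两个 (中心, 半径) 对,覆盖两簇特征值
```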
zh
[AI-88] Neural Green's Functions NEURIPS2025
【速读】:该论文旨在解决现有基于学习的偏微分方程(PDE)求解器在面对未见过的源函数或边界条件时泛化能力差的问题。传统神经算子方法通常依赖于特定训练数据分布,难以适应新输入函数,导致性能下降。其解决方案的关键在于提出神经格林函数(Neural Green’s Function),该方法利用线性PDE的微分算子可进行特征分解的特性,从表示问题域的体素点云中提取每个点的特征,并预测解算子的分解形式,进而通过数值积分高效计算解。此设计使模型对训练期间使用的具体源函数和边界函数具有不变性,从而实现跨复杂不规则几何形状和不同激励函数的鲁棒泛化,同时显著提升计算效率。
链接: https://arxiv.org/abs/2511.01924
作者: Seungwoo Yoo,Kyeongmin Yeo,Jisung Hwang,Minhyuk Sung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025
Abstract:We introduce Neural Green’s Function, a neural solution operator for linear partial differential equations (PDEs) whose differential operators admit eigendecompositions. Inspired by Green’s functions, the solution operators of linear PDEs that depend exclusively on the domain geometry, we design Neural Green’s Function to imitate their behavior, achieving superior generalization across diverse irregular geometries and source and boundary functions. Specifically, Neural Green’s Function extracts per-point features from a volumetric point cloud representing the problem domain and uses them to predict a decomposition of the solution operator, which is subsequently applied to evaluate solutions via numerical integration. Unlike recent learning-based solution operators, which often struggle to generalize to unseen source or boundary functions, our framework is, by design, agnostic to the specific functions used during training, enabling robust and efficient generalization. In the steady-state thermal analysis of mechanical part geometries from the MCB dataset, Neural Green’s Function outperforms state-of-the-art neural operators, achieving an average error reduction of 13.9% across five shape categories, while being up to 350 times faster than a numerical solver that requires computationally expensive meshing.
zh
[AI-89] Fibbinary-Based Compression and Quantization for Efficient Neural Radio Receivers
【速读】:该论文旨在解决神经网络接收机(Neural Receiver)在硬件资源受限设备上部署时因高网络复杂度导致的计算开销过大的问题。解决方案的关键在于提出两种协同优化策略:一是引入均匀与非均匀量化方法(包括斐波那契码字量化,FCQ),以降低乘法器功耗和面积;二是设计一种细粒度增量式网络量化(Incremental Network Quantization, INQ)方法来补偿量化带来的性能损失;同时,提出两种新颖的无损压缩算法,针对斐波那契量化参数序列中大量冗余信息进行高效压缩,从而显著减少内存占用。最终,量化结合压缩可实现63.4%的内存占用缩减,且仍保持优于传统接收机的性能表现。
链接: https://arxiv.org/abs/2511.01921
作者: Roberta Fiandaca,Manil Dev Gomony
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural receivers have shown outstanding performance compared to the conventional ones but this comes with a high network complexity leading to a heavy computational cost. This poses significant challenges in their deployment on hardware-constrained devices. To address the issue, this paper explores two optimization strategies: quantization and compression. We introduce both uniform and non-uniform quantization such as the Fibonacci Code word Quantization (FCQ). A novel fine-grained approach to the Incremental Network Quantization (INQ) strategy is then proposed to compensate for the losses introduced by the above mentioned quantization techniques. Additionally, we introduce two novel lossless compression algorithms that effectively reduce the memory size by compressing sequences of Fibonacci quantized parameters characterized by a huge redundancy. The quantization technique provides a saving of 45% and 44% in the multiplier’s power and area, respectively, and its combination with the compression determines a 63.4% reduction in memory footprint, while still providing higher performances than a conventional receiver.
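示例(本文补充):fibbinary(二进制表示不含相邻 1)码字天然带有可压缩的冗余结构。下面的片段示意如何枚举这类量化电平并把权重就近映射上去;`fcq_quantize` 的缩放与取整方式均为本文假设,并非论文的 FCQ 实现细节。

```python
def fibbinary_levels(n_bits):
    """枚举 n_bits 内所有二进制表示不含相邻 1 的码字(即 fibbinary 数)。"""
    return [v for v in range(1 << n_bits) if v & (v >> 1) == 0]

def fcq_quantize(weights, n_bits=6):
    """把权重幅值映射到最近的 fibbinary 电平,符号单独保留(仅为思路示意)。"""
    levels = fibbinary_levels(n_bits)
    scale = max(abs(w) for w in weights) / max(levels)
    quantized = []
    for w in weights:
        q = min(levels, key=lambda lv: abs(abs(w) / scale - lv))
        quantized.append((1 if w >= 0 else -1) * q * scale)
    return quantized

print(fibbinary_levels(4))              # [0, 1, 2, 4, 5, 8, 9, 10]
print(fcq_quantize([0.9, -0.31, 0.07]))
```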
zh
[AI-90] EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory
【速读】:该论文旨在解决多智能体框架中类人记忆机制缺失的问题,尤其是在自然语言规划任务中,如何通过记忆支持迭代推理、约束追踪与错误修正以提升规划效果。解决方案的关键在于提出EvoMem框架,其核心是双演化记忆机制:一是存储任务特定规则和约束、在单个查询内保持固定而跨查询演化的约束记忆(Constraint Memory, CMem);二是在单个查询内随迭代累积反馈、用于方案优化的查询反馈记忆(Query-feedback Memory, QMem)。两个记忆模块在每次查询会话结束后重置,从而实现了结构化记忆对多智能体协同规划的增强作用。
链接: https://arxiv.org/abs/2511.01912
作者: Wenzhe Fan,Ning Yan,Masood Mortazavi
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Planning has been a cornerstone of artificial intelligence for solving complex problems, and recent progress in LLM-based multi-agent frameworks have begun to extend this capability. However, the role of human-like memory within these frameworks remains largely unexplored. Understanding how agents coordinate through memory is critical for natural language planning, where iterative reasoning, constraint tracking, and error correction drive the success. Inspired by working memory model in cognitive psychology, we present EvoMem, a multi-agent framework built on a dual-evolving memory mechanism. The framework consists of three agents (Constraint Extractor, Verifier, and Actor) and two memory modules: Constraint Memory (CMem), which evolves across queries by storing task-specific rules and constraints while remains fixed within a query, and Query-feedback Memory (QMem), which evolves within a query by accumulating feedback across iterations for solution refinement. Both memory modules are reset at the end of each query session. Evaluations on trip planning, meeting planning, and calendar scheduling show consistent performance improvements, highlighting the effectiveness of EvoMem. This success underscores the importance of memory in enhancing multi-agent planning.
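示例(本文补充):双记忆模块的读写与重置时机可用如下数据结构直观表达;方法名与重置策略按摘要描述整理,属示意性假设。

```python
from dataclasses import dataclass, field

@dataclass
class EvoMem:
    """双演化记忆的结构示意:CMem 跨查询沉淀规则、查询内保持固定;
    QMem 在单个查询内随迭代累积 Verifier 的反馈。"""
    cmem: list = field(default_factory=list)
    qmem: list = field(default_factory=list)

    def start_query(self, constraints):
        self.qmem.clear()                 # QMem 按查询重置
        for c in constraints:             # CMem 增量吸收 Constraint Extractor 的产出
            if c not in self.cmem:
                self.cmem.append(c)

    def add_feedback(self, feedback):     # Verifier → Actor 的迭代反馈
        self.qmem.append(feedback)

    def end_session(self):                # 摘要所述:查询会话结束时两个模块均重置
        self.cmem.clear()
        self.qmem.clear()

mem = EvoMem()
mem.start_query(["会议不能早于 9:00", "总预算不超过 $500"])
mem.add_feedback("第 2 天两个活动时间冲突,需要重排")
print(mem.cmem, mem.qmem)
```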
zh
[AI-91] Variational Geometry-aware Neural Network based Method for Solving High-dimensional Diffeomorphic Mapping Problems
【速读】:该论文旨在解决高维微分同胚映射(diffeomorphic mapping)在传统方法中面临的维度灾难(curse of dimensionality)问题。其解决方案的关键在于提出一种无网格学习框架,该框架通过将变分原理与拟共形理论(quasi-conformal theory)有机结合,以调控共形畸变(conformality distortion)和体积畸变(volume distortion),从而确保映射的准确性与双射性(bijective)。该方法天然兼容基于梯度的优化和神经网络架构,具备良好的灵活性与可扩展性,能够有效应对复杂医学图像配准等高维场景下的变形质量控制需求。
链接: https://arxiv.org/abs/2511.01911
作者: Zhiwen Li,Cheuk Hin Ho,Lok Ming Lui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Numerical Analysis (math.NA)
备注:
Abstract:Traditional methods for high-dimensional diffeomorphic mapping often struggle with the curse of dimensionality. We propose a mesh-free learning framework designed for n-dimensional mapping problems, seamlessly combining variational principles with quasi-conformal theory. Our approach ensures accurate, bijective mappings by regulating conformality distortion and volume distortion, enabling robust control over deformation quality. The framework is inherently compatible with gradient-based optimization and neural network architectures, making it highly flexible and scalable to higher-dimensional settings. Numerical experiments on both synthetic and real-world medical image data validate the accuracy, robustness, and effectiveness of the proposed method in complex registration scenarios.
zh
[AI-92] Between Myths and Metaphors: Rethinking LLMs for SRH in Conservative Contexts
【速读】:该论文旨在解决低资源国家(尤其是巴基斯坦)在性与生殖健康(SRH)领域中因沟通障碍导致的可预防孕产妇死亡问题。其关键解决方案在于利用大语言模型(LLMs)增强健康传播与风险评估能力,但针对保守文化背景下间接表达对语义理解造成的挑战,研究通过两阶段实证分析识别出SRH沟通的两个核心维度(指称域和表达方式),并发现LLMs在处理临床互动中的语义漂移、误解和多义性时存在显著局限。研究进一步提出了一套用于间接沟通的分类框架及面向文化情境的设计建议,从而为开发更有效的本土化AI健康干预工具提供依据。
链接: https://arxiv.org/abs/2511.01907
作者: Ameemah Humayun,Bushra Zubair,Maryam Mustafa
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Low-resource countries represent over 90% of maternal deaths, with Pakistan among the top four countries contributing nearly half in 2023. Since these deaths are mostly preventable, large language models (LLMs) can help address this crisis by automating health communication and risk assessment. However, sexual and reproductive health (SRH) communication in conservative contexts often relies on indirect language that obscures meaning, complicating LLM-based interventions. We conduct a two-stage study in Pakistan: (1) analyzing data from clinical observations, interviews, and focus groups with clinicians and patients, and (2) evaluating the interpretive capabilities of five popular LLMs on this data. Our analysis identifies two axes of communication (referential domain and expression approach) and shows LLMs struggle with semantic drift, myths, and polysemy in clinical interactions. We contribute: (1) empirical themes in SRH communication, (2) a categorization framework for indirect communication, (3) evaluation of LLM performance, and (4) design recommendations for culturally-situated SRH communication.
zh
[AI-93] Thinking Like a Student: AI-Supported Reflective Planning in a Theory-Intensive Computer Science Course
【速读】:该论文试图解决高校在新冠疫情期间为支持学生学习高难度课程而设立的“强化”辅助教学角色普遍存在的问题,即这些角色往往缺乏明确的职责界定、结构化教学材料、教学法指导以及与主讲教师团队的有效整合。解决方案的关键在于利用大语言模型(Large Language Model, LLM)作为反思性规划工具,通过模拟第二年本科生的视角,提前识别概念难点、直觉盲区和推理断裂点,从而设计出结构清晰、可重复的强化课 session 格式,包括针对性复习、协作例题、独立练习和引导式讲解四个环节。该方法显著提升了学生对抽象内容(如抽吸引理和形式语言表达能力比较)的理解与信心,表明LLM在理论密集型课程中具有增强教学设计的潜力。
链接: https://arxiv.org/abs/2511.01906
作者: Noa Izsak
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: 7 pages, 4 figures
Abstract:In the aftermath of COVID-19, many universities implemented supplementary “reinforcement” roles to support students in demanding courses. Although the name for such roles may differ between institutions, the underlying idea of providing structured supplementary support is common. However, these roles were often poorly defined, lacking structured materials, pedagogical oversight, and integration with the core teaching team. This paper reports on the redesign of reinforcement sessions in a challenging undergraduate course on formal methods and computational models, using a large language model (LLM) as a reflective planning tool. The LLM was prompted to simulate the perspective of a second-year student, enabling the identification of conceptual bottlenecks, gaps in intuition, and likely reasoning breakdowns before classroom delivery. These insights informed a structured, repeatable session format combining targeted review, collaborative examples, independent student work, and guided walkthroughs. Conducted over a single semester, the intervention received positive student feedback, indicating increased confidence, reduced anxiety, and improved clarity, particularly in abstract topics such as the pumping lemma and formal language expressive power comparisons. The findings suggest that reflective, instructor-facing use of LLMs can enhance pedagogical design in theoretically dense domains and may be adaptable to other cognitively demanding computer science courses.
zh
[AI-94] Before the Clinic: Transparent and Operable Design Principles for Healthcare AI
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)系统在临床实践中落地时面临的挑战,即如何弥合可解释人工智能(Explainable AI, XAI)理论、临床医生预期与治理要求之间的根本性差距。解决方案的关键在于提出两个基础性设计原则:透明设计(Transparent Design) 和 可操作设计(Operable Design)。前者通过可解释性和可理解性机制实现案例级推理与系统可追溯性,后者通过校准、不确定性量化和鲁棒性保障系统在真实世界条件下的可靠与可预测行为。这两个原则基于成熟的XAI框架,映射临床需求并契合新兴治理规范,为开发团队提供可执行的预临床技术指导,从而加速AI系统的临床评估进程,并建立跨AI研究者、医疗从业者与监管方的共同语言。
链接: https://arxiv.org/abs/2511.01902
作者: Alexander Bakumenko(1),Aaron J. Masino(1),Janine Hoelscher(1) ((1) Clemson University, USA)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The translation of artificial intelligence (AI) systems into clinical practice requires bridging fundamental gaps between explainable AI theory, clinician expectations, and governance requirements. While conceptual frameworks define what constitutes explainable AI (XAI) and qualitative studies identify clinician needs, little practical guidance exists for development teams to prepare AI systems prior to clinical evaluation. We propose two foundational design principles, Transparent Design and Operable Design, that operationalize pre-clinical technical requirements for healthcare AI. Transparent Design encompasses interpretability and understandability artifacts that enable case-level reasoning and system traceability. Operable Design encompasses calibration, uncertainty, and robustness to ensure reliable, predictable system behavior under real-world conditions. We ground these principles in established XAI frameworks, map them to documented clinician needs, and demonstrate their alignment with emerging governance requirements. This pre-clinical playbook provides actionable guidance for development teams, accelerates the path to clinical evaluation, and establishes a shared vocabulary bridging AI researchers, healthcare practitioners, and regulatory stakeholders. By explicitly scoping what can be built and verified before clinical deployment, we aim to reduce friction in clinical AI translation while remaining cautious about what constitutes validated, deployed explainability.
zh
[AI-95] LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency
【速读】:该论文旨在解决基于流匹配(flow matching)的多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像编辑任务中面临的关键问题:细节退化(detail degradation)、内容不一致(content inconsistency)以及推理效率低下,这些问题主要源于现有方法如BAGEL对随机噪声初始化的依赖。解决方案的核心是提出LGCC框架,其关键创新在于两个组件:局部高斯噪声耦合(Local Gaussian Noise Coupling, LGNC)与内容一致性损失(Content Consistency Loss, CCL)。LGNC通过将目标图像嵌入与其局部扰动版本建模为耦合对来保留空间细节,而CCL则确保编辑指令与图像修改之间的语义一致性,防止意外内容丢失。该框架通过课程学习(curriculum learning)集成至预训练的BAGEL模型,显著减少推理步数,在I2EBench上局部细节得分提升1.60%,整体得分提升0.53%,同时实现3–5倍轻量级编辑加速和2倍通用编辑加速,仅需BAGEL或Flux推理时间的40%–50%,实现了高质量、高效率的图像编辑。
链接: https://arxiv.org/abs/2511.01894
作者: Fangbing Liu,Pengfei Duan,Wen Li,Yi He
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent advancements have demonstrated the great potential of flow matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art works like BAGEL face limitations, including detail degradation, content inconsistency, and inefficiency due to their reliance on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the BAGEL pre-trained model via curriculum learning, we significantly reduce inference steps, improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves 3x–5x speedup for lightweight editing and 2x for universal editing, requiring only 40%–50% of the inference time of BAGEL or Flux. These results demonstrate LGCC's ability to preserve detail, maintain contextual integrity, and enhance inference speed, offering a cost-efficient solution without compromising editing quality.
zh
[AI-96] Mirror-Neuron Patterns in AI Alignment
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)系统在迈向超人类能力过程中,如何实现与人类价值观的可靠对齐(alignment)问题。现有对齐策略主要依赖外部约束,可能难以应对未来具备高度自主性的超级智能AI系统对其控制机制的规避行为。其解决方案的关键在于探索人工神经网络(Artificial Neural Networks, ANNs)是否能够自发形成类似生物镜像神经元(mirror neurons)的内部表征模式——这类神经元在人类中参与共情、模仿和社会认知过程。研究通过设计“青蛙与蟾蜍”博弈框架诱导合作行为,发现适当规模的模型容量和自我-他者耦合机制可促使ANN发展出共享神经表示,类似于生物镜像神经元的激活特性;进而提出Checkpoint Mirror Neuron Index (CMNI)用于量化此类模式的强度与一致性,并构建理论框架表明,此类内生性共情回路可增强AI系统的伦理决策与协作能力,从而为传统对齐方法提供一种基于内在动机建模的补充路径。
链接: https://arxiv.org/abs/2511.01885
作者: Robyn Wyrick
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 51 pages, Masters thesis. 10 tables, 7 figures, project data code here: this https URL
Abstract:As artificial intelligence (AI) advances toward superhuman capabilities, aligning these systems with human values becomes increasingly critical. Current alignment strategies rely largely on externally specified constraints that may prove insufficient against future super-intelligent AI capable of circumventing top-down controls. This research investigates whether artificial neural networks (ANNs) can develop patterns analogous to biological mirror neurons, cells that activate both when performing and observing actions, and how such patterns might contribute to intrinsic alignment in AI. Mirror neurons play a crucial role in empathy, imitation, and social cognition in humans. The study therefore asks: (1) Can simple ANNs develop mirror-neuron patterns? and (2) How might these patterns contribute to ethical and cooperative decision-making in AI systems? Using a novel Frog and Toad game framework designed to promote cooperative behaviors, we identify conditions under which mirror-neuron patterns emerge, evaluate their influence on action circuits, introduce the Checkpoint Mirror Neuron Index (CMNI) to quantify activation strength and consistency, and propose a theoretical framework for further study. Our findings indicate that appropriately scaled model capacities and self/other coupling foster shared neural representations in ANNs similar to biological mirror neurons. These empathy-like circuits support cooperative behavior and suggest that intrinsic motivations modeled through mirror-neuron dynamics could complement existing alignment techniques by embedding empathy-like mechanisms directly within AI architectures.
zh
[AI-97] EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs IISWC2025
【速读】:该论文旨在解决在边缘GPU上部署生成式AI(Generative AI)中的大型语言模型(Large Language Models, LLMs)进行推理任务时,如何在严格延迟约束和有限计算资源下实现准确率与延迟之间的最优权衡问题。其解决方案的关键在于系统性地分析多种设计变量的组合影响:包括推理型与非推理型架构选择、模型规模、令牌预算分配以及测试时扩展策略,并通过量化不同LLM架构和规模下的延迟-准确率权衡、评估基于提示和模型微调的减少推理令牌长度技术,以及对不同并行度的测试时扩展方法进行性能剖析,最终构建出可实现的准确率-延迟帕累托前沿(Pareto frontier),为边缘推理LLM的优化部署提供系统性指导。
链接: https://arxiv.org/abs/2511.01866
作者: Benjamin Kubwimana,Qijing Huang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Published in the Proceedings of the 2025 IEEE International Symposium on Workload Characterization (IISWC 2025)
Abstract:Edge intelligence paradigm is increasingly demanded by the emerging autonomous systems, such as robotics. Beyond ensuring privacy-preserving operation and resilience in connectivity-limited environments, edge deployment offers significant energy and cost advantages over cloud-based solutions. However, deploying large language models (LLMs) for reasoning tasks on edge GPUs faces critical challenges from strict latency constraints and limited computational resources. To navigate these constraints, developers must balance multiple design factors - choosing reasoning versus non-reasoning architectures, selecting appropriate model sizes, allocating token budgets, and applying test-time scaling strategies - to meet target latency and optimize accuracy. Yet guidance on optimal combinations of these variables remains scarce. In this work, we present EdgeReasoning, a comprehensive study characterizing the deployment of reasoning LLMs on edge GPUs. We systematically quantify latency-accuracy tradeoffs across various LLM architectures and model sizes. We systematically evaluate prompt-based and model-tuning-based techniques for reducing reasoning token length while maintaining performance quality. We further profile test-time scaling methods with varying degrees of parallelism to maximize accuracy under strict latency budgets. Through these analyses, EdgeReasoning maps the Pareto frontier of achievable accuracy-latency configurations, offering systematic guidance for optimal edge deployment of reasoning LLMs.
zh
[AI-98] Exposing LLM Vulnerabilities: Adversarial Scam Detection and Performance
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对对抗性诈骗消息时的脆弱性问题,即LLMs是否能准确预测诈骗行为。其核心挑战在于现有模型在遭遇精心设计的对抗样本时容易产生高误分类率。解决方案的关键在于构建一个细粒度标注的诈骗消息数据集,涵盖原始与对抗性诈骗消息,并将传统二分类的诈骗检测任务扩展为更细致的诈骗类型识别;通过该数据集系统评估LLMs性能并提出增强鲁棒性的策略,从而揭示模型漏洞并提升其抗干扰能力。
链接: https://arxiv.org/abs/2412.00621
作者: Chen-Wei Chang,Shailik Sarkar,Shutonu Mitra,Qi Zhang,Hossein Salemi,Hemant Purohit,Fengxiu Zhang,Michin Hong,Jin-Hee Cho,Chang-Tien Lu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: 4 pages, 2024 IEEE International Conference on Big Data workshop BigEACPS 2024
Abstract:Can we trust Large Language Models (LLMs) to accurately predict scams? This paper investigates the vulnerabilities of LLMs when facing adversarial scam messages for the task of scam detection. We addressed this issue by creating a comprehensive dataset with fine-grained labels of scam messages, including both original and adversarial scam messages. The dataset extended traditional binary classes for the scam detection task into more nuanced scam types. Our analysis showed how adversarial examples took advantage of vulnerabilities of an LLM, leading to a high misclassification rate. We evaluated the performance of LLMs on these adversarial scam messages and proposed strategies to improve their robustness.
zh
[AI-99] Trustworthy Quantum Machine Learning: A Roadmap for Reliability Robustness and Security in the NISQ Era
【速读】:该论文旨在解决当前量子机器学习(Quantum Machine Learning, QML)在实际应用中因量子力学固有概率特性、NISQ硬件噪声以及混合量子-经典执行管道所带来的可靠性风险,从而限制其在安全关键场景中可靠部署的问题。解决方案的关键在于提出一个可信量子机器学习(Trustworthy Quantum Machine Learning, TQML)的系统性框架,包含三大核心支柱:(i) 基于方差分解的预测不确定性量化以实现校准和风险感知决策;(ii) 针对经典与量子原生威胁模型的对抗鲁棒性保障,采用迹距离约束进行形式化定义;(iii) 在分布式与委托量子学习场景下通过差分隐私机制实现隐私保护。该框架首次将量子信息理论中的度量(如方差分解、迹距离、差分隐私)用于构建量子特定的信任指标,并通过在NISQ设备上验证参数化量子分类器的统一信任评估流程,揭示了不确定性与预测风险的相关性、经典与量子状态扰动攻击脆弱性的不对称性,以及由采样噪声和量子信道噪声驱动的隐私-效用权衡关系,从而将“可信性”确立为量子人工智能设计的核心目标。
链接: https://arxiv.org/abs/2511.02602
作者: Ferhat Ozgur Catak,Jungwon Seo,Umit Cali
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 22 Pages
Abstract:Quantum machine learning (QML) is a promising paradigm for tackling computational problems that challenge classical AI. Yet, the inherent probabilistic behavior of quantum mechanics, device noise in NISQ hardware, and hybrid quantum-classical execution pipelines introduce new risks that prevent reliable deployment of QML in real-world, safety-critical settings. This research offers a broad roadmap for Trustworthy Quantum Machine Learning (TQML), integrating three foundational pillars of reliability: (i) uncertainty quantification for calibrated and risk-aware decision making, (ii) adversarial robustness against classical and quantum-native threat models, and (iii) privacy preservation in distributed and delegated quantum learning scenarios. We formalize quantum-specific trust metrics grounded in quantum information theory, including a variance-based decomposition of predictive uncertainty, trace-distance-bounded robustness, and differential privacy for hybrid learning channels. To demonstrate feasibility on current NISQ devices, we validate a unified trust assessment pipeline on parameterized quantum classifiers, uncovering correlations between uncertainty and prediction risk, an asymmetry in attack vulnerability between classical and quantum state perturbations, and privacy-utility trade-offs driven by shot noise and quantum channel noise. This roadmap seeks to define trustworthiness as a first-class design objective for quantum AI.
zh
[AI-100] Modeling Hawkish-Dovish Latent Beliefs in Multi-Agent Debate-Based LLM s for Monetary Policy Decision Classification
【速读】:该论文旨在解决中央银行货币政策决策(特别是联邦公开市场委员会,FOMC)预测的准确性问题,现有方法多依赖静态分类模型,未能充分捕捉政策制定过程中的动态协商与共识形成机制。解决方案的关键在于构建一个模拟FOMC集体决策流程的新型框架:通过将多个大语言模型(LLMs)建模为具有不同初始信念(如鹰派或鸽派)的交互代理(agent),使其基于定性政策文本和定量宏观经济指标生成预测,并在多轮迭代中通过观察其他代理输出来修正自身判断,从而模拟政策讨论与共识演化过程;同时引入潜变量表征代理信念,理论证明其调节输入信息感知与交互动力学,显著提升了预测准确性和可解释性。
链接: https://arxiv.org/abs/2511.02469
作者: Kaito Takano,Masanori Hirano,Kei Nakagawa
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: PRIMA2025 Accepted
Abstract:Accurately forecasting central bank policy decisions, particularly those of the Federal Open Market Committee (FOMC), has become increasingly important amid heightened economic uncertainty. While prior studies have used monetary policy texts to predict rate changes, most rely on static classification models that overlook the deliberative nature of policymaking. This study proposes a novel framework that structurally imitates the FOMC's collective decision-making process by modeling multiple large language models (LLMs) as interacting agents. Each agent begins with a distinct initial belief and produces a prediction based on both qualitative policy texts and quantitative macroeconomic indicators. Through iterative rounds, agents revise their predictions by observing the outputs of others, simulating deliberation and consensus formation. To enhance interpretability, we introduce a latent variable representing each agent's underlying belief (e.g., hawkish or dovish), and we theoretically demonstrate how this belief mediates the perception of input information and interaction dynamics. Empirical results show that this debate-based approach significantly outperforms standard LLM-based baselines in prediction accuracy. Furthermore, the explicit modeling of beliefs provides insights into how individual perspectives and social influence shape collective policy forecasts.
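示例(本文补充):辩论式共识的骨架是"独立预测 → 观察他人 → 修正"的迭代循环。以下片段用带噪声的线性打分代替真实的 LLM 调用来演示这一循环,信念数值与修正系数均为假设。

```python
import random
from collections import Counter

def debate_forecast(agent_beliefs, evidence, rounds=3, seed=0):
    """多智能体辩论的极简示意:每个 agent 持有潜在信念(>0 偏鹰派,<0 偏鸽派),
    每轮综合自身信念、输入证据与上一轮多数意见给出预测。
    真实系统中 vote() 应由 LLM 生成,这里仅作占位。"""
    random.seed(seed)

    def vote(belief, peer_signal):
        score = belief + evidence + 0.5 * peer_signal + random.gauss(0, 0.2)
        return "hike" if score > 0 else "hold"

    votes = {a: vote(b, 0.0) for a, b in agent_beliefs.items()}
    for _ in range(rounds):
        majority = Counter(votes.values()).most_common(1)[0][0]
        peer_signal = 1.0 if majority == "hike" else -1.0
        votes = {a: vote(b, peer_signal) for a, b in agent_beliefs.items()}
    return Counter(votes.values()).most_common(1)[0][0], votes

agents = {"hawk_1": 0.8, "hawk_2": 0.4, "dove_1": -0.6, "dove_2": -0.3}
print(debate_forecast(agents, evidence=0.3))  # evidence>0 表示宏观数据偏紧缩
```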
zh
[AI-101] Biological Regulatory Network Inference through Circular Causal Structure Learning
【速读】:该论文旨在解决生物网络推断中因广泛存在的反馈环(feedback loops)而使得传统基于有向无环图(DAG)假设的因果结构推断方法失效的问题。其解决方案的关键在于提出一种名为SCALD(Structural CAusal model for Loop Diagram)的新框架,该框架通过非线性结构方程模型与基于连续优化的稳定反馈环条件约束相结合,能够有效识别包含反馈调控的因果关系,从而在转录调控网络和信号转导网络中实现更准确的因果推断,并显著提升对反馈调节机制的检测能力。
链接: https://arxiv.org/abs/2511.02332
作者: Hongyang Jiang,Yuezhu Wang,Ke Feng,Chaoyi Yin,Yi Chang,Huiyan Sun
机构: 未知
类目: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI)
备注:
Abstract:Biological networks are pivotal in deciphering the complexity and functionality of biological systems. Causal inference, which focuses on determining the directionality and strength of interactions between variables rather than merely relying on correlations, is considered a logical approach for inferring biological networks. Existing methods for causal structure inference typically assume that causal relationships between variables can be represented by directed acyclic graphs (DAGs). However, this assumption is at odds with the reality of widespread feedback loops in biological systems, making these methods unsuitable for direct use in biological network inference. In this study, we propose a new framework named SCALD (Structural CAusal model for Loop Diagram), which employs a nonlinear structure equation model and a stable feedback loop conditional constraint through continuous optimization to infer causal regulatory relationships under feedback loops. We observe that SCALD outperforms state-of-the-art methods in inferring both transcriptional regulatory networks and signaling transduction networks. SCALD has irreplaceable advantages in identifying feedback regulation. Through transcription factor (TF) perturbation data analysis, we further validate the accuracy and sensitivity of SCALD. Additionally, SCALD facilitates the discovery of previously unknown regulatory relationships, which we have subsequently confirmed through ChIP-seq data analysis. Furthermore, by utilizing SCALD, we infer the key driver genes that facilitate the transformation from colon inflammation to cancer by examining the dynamic changes within regulatory networks during the process.
zh
[AI-102] From data to design: Random forest regression model for predicting mechanical properties of alloy steel
【速读】:该论文试图解决合金钢机械性能(延伸率、抗拉强度和屈服强度)的精准预测问题,其核心挑战在于从材料成分特征(如铁、铬、镍、锰、硅、铜、碳及冷轧变形量)中提取有效信息以建立高精度预测模型。解决方案的关键在于采用随机森林回归(Random Forest Regression)这一集成学习方法,通过多棵决策树的协同预测显著提升模型的稳定性和准确性,实验结果表明该方法在R²评分和均方误差(MSE)等指标上表现优异,且残差分析和学习曲线进一步验证了模型的有效性,为材料科学领域的工业应用提供了可靠的预测工具。
链接: https://arxiv.org/abs/2511.02290
作者: Samjukta Sinha,Prabhat Das
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:
Abstract:This study investigates the application of Random Forest Regression for predicting mechanical properties of alloy steel - Elongation, Tensile Strength, and Yield Strength - from material composition features including Iron (Fe), Chromium (Cr), Nickel (Ni), Manganese (Mn), Silicon (Si), Copper (Cu), Carbon (C), and deformation percentage during cold rolling. Utilizing a dataset comprising these features, we trained and evaluated the Random Forest model, achieving high predictive performance as evidenced by R² scores and Mean Squared Errors (MSE). The results demonstrate the model's efficacy in providing accurate predictions, which is validated through various performance metrics including residual plots and learning curves. The findings underscore the potential of ensemble learning techniques in enhancing material property predictions, with implications for industrial applications in material science.
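示例(本文补充):论文流程的骨架(随机森林回归 + R²/MSE 评估 + 特征重要性)可用 scikit-learn 几行复现。以下数据按特征含义随机合成,数值范围与目标函数均为假设,仅演示训练与评估步骤。

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# 合成数据:8 个特征 [Fe, Cr, Ni, Mn, Si, Cu, C, 冷轧变形量%],目标为抗拉强度 (MPa)
low = [60, 0, 0, 0, 0, 0, 0.02, 0]
high = [75, 20, 12, 2, 1.5, 0.5, 1.2, 60]
X = rng.uniform(low, high, size=(500, 8))
y = 400 + 3000 * X[:, 6] + 3 * X[:, 7] + 8 * X[:, 1] + rng.normal(0, 20, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("R2 =", round(r2_score(y_te, pred), 3),
      " MSE =", round(mean_squared_error(y_te, pred), 1))
print("feature importances:", np.round(rf.feature_importances_, 3))
```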
zh
[AI-103] LA-MARRVEL: A Knowledge-Grounded and Language-Aware LLM Reranker for AI-MARRVEL in Rare Disease Diagnosis
【速读】:该论文旨在解决罕见病诊断中基因变异与临床表型证据之间关联困难的问题,现有流程仍需临床医生手动整合非结构化文本信息。解决方案的关键在于提出LA-MARRVEL——一个基于知识的、语言感知的重排序层,它在AI-MARRVEL基础上引入专家设计的上下文信息,通过多次调用大语言模型(Large Language Model, LLM)获取部分排序结果,并采用排序投票机制聚合这些结果,从而生成稳定且可解释的基因排序。同时,每个基因均附带LLM生成的推理说明,整合表型、遗传模式和变异层面的证据,显著提升输出的可解释性与临床可用性。
链接: https://arxiv.org/abs/2511.02263
作者: Jaeyeon Lee,Hyun-Hwan Jeong,Zhandong Liu
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:Diagnosing rare diseases often requires connecting variant-bearing genes to evidence that is written as unstructured clinical prose, which the current established pipelines still leave for clinicians to reconcile manually. To this end, we introduce LA-MARRVEL, a knowledge-grounded and language-aware reranking layer that operates on top of AI-MARRVEL: it supplies expert-engineered context, queries a large language model multiple times, and aggregates the resulting partial rankings with a ranked voting method to produce a stable, explainable gene ranking. Evaluated on three real-world cohorts (BG, DDD, UDN), LA-MARRVEL consistently improves Recall@K over AI-MARRVEL and established phenotype-driven tools such as Exomiser and LIRICAL, with especially large gains on cases where the first-stage ranker placed the causal gene lower. Each ranked gene is accompanied by LLM-generated reasoning that integrates phenotypic, inheritance, and variant-level evidence, thereby making the output more interpretable and facilitating clinical review.
zh
[AI-104] CytoNet: A Foundation Model for the Human Cerebral Cortex
【速读】:该论文旨在解决人类大脑皮层组织结构及其细胞架构的系统性解析问题,以推动对脑功能机制的理解。其解决方案的关键在于提出CytoNet——一种基于自监督学习(self-supervised learning)的基础模型,通过利用空间邻近性作为训练信号,无需人工标注即可将高分辨率皮层显微图像块编码为具有高度表达能力的特征表示。这些特征不仅在解剖学上合理且生物学相关,还能捕捉皮层的一般架构特征与个体特异性属性,从而在皮层区域分类、层段分割、细胞形态估计及无监督脑区映射等任务中实现顶尖性能,为神经科学研究提供统一且可扩展的分析框架。
链接: https://arxiv.org/abs/2511.01870
作者: Christian Schiffer,Zeynep Boztoprak,Jan-Oliver Kropp,Julia Thönnißen,Katia Berr,Hannah Spitzer,Katrin Amunts,Timo Dickscheid
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review for journal publication
Abstract:To study how the human brain works, we need to explore the organization of the cerebral cortex and its detailed cellular architecture. We introduce CytoNet, a foundation model that encodes high-resolution microscopic image patches of the cerebral cortex into highly expressive feature representations, enabling comprehensive brain analyses. CytoNet employs self-supervised learning using spatial proximity as a powerful training signal, without requiring manual labelling. The resulting features are anatomically sound and biologically relevant. They encode general aspects of cortical architecture and unique brain-specific traits. We demonstrate top-tier performance in tasks such as cortical area classification, cortical layer segmentation, cell morphology estimation, and unsupervised brain region mapping. As a foundation model, CytoNet offers a consistent framework for studying cortical microarchitecture, supporting analyses of its relationship with other structural and functional brain features, and paving the way for diverse neuroscientific investigations.
zh
[AI-105] DiffPace: Diffusion-based Plug-and-play Augmented Channel Estimation in mmWave and Terahertz Ultra-Massive MIMO Systems
【速读】:该论文旨在解决毫米波(mmWave)和太赫兹(THz)频段超大规模多输入多输出(UM-MIMO)系统中由于高维信道维度和有限射频链路导致的信道估计(CE)精度下降问题,尤其针对由大阵列孔径和高频引发的混合近场与远场辐射特性带来的复杂性。解决方案的关键在于提出一种基于扩散模型(Diffusion Model, DM)的“即插即用”(plug-and-play)信道估计方法——DiffPace,其利用混合球面波与平面波模型(HPSM)建模信道分布,并将DM作为先验知识嵌入CE流程,从而更准确地刻画混合近-远场信道结构;同时,通过求解常微分方程进行推理,显著减少所需推理步数(相较现有最优方案降低90%),在保证高估计精度的同时大幅提升计算效率。
链接: https://arxiv.org/abs/2511.01867
作者: Zhengdong Hu,Chong Han,Wolfgang Gerstacker,Robert Schober
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:Millimeter-wave (mmWave) and Terahertz (THz)-band communications hold great promise in meeting the growing data-rate demands of next-generation wireless networks, offering abundant bandwidth. To mitigate the severe path loss inherent to these high frequencies and reduce hardware costs, ultra-massive multiple-input multiple-output (UM-MIMO) systems with hybrid beamforming architectures can deliver substantial beamforming gains and enhanced spectral efficiency. However, accurate channel estimation (CE) in mmWave and THz UM-MIMO systems is challenging due to high channel dimensionality and compressed observations from a limited number of RF chains, while the hybrid near- and far-field radiation patterns, arising from large array apertures and high carrier frequencies, further complicate CE. Conventional compressive sensing based frameworks rely on predefined sparsifying matrices, which cannot faithfully capture the hybrid near-field and far-field channel structures, leading to degraded estimation performance. This paper introduces DiffPace, a diffusion-based plug-and-play method for channel estimation. DiffPace uses a diffusion model (DM) to capture the channel distribution based on the hybrid spherical and planar-wave (HPSM) model. By applying the plug-and-play approach, it leverages the DM as prior knowledge, improving CE accuracy. Moreover, DM performs inference by solving an ordinary differential equation, minimizing the number of required inference steps compared with stochastic sampling method. Experimental results show that DiffPace achieves competitive CE performance, attaining -15 dB normalized mean square error (NMSE) at a signal-to-noise ratio (SNR) of 10 dB, with 90% fewer inference steps compared to state-of-the-art schemes, simultaneously providing high estimation precision and enhanced computational efficiency.
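示例(本文补充):"即插即用"的核心是交替执行数据保真更新与先验去噪。下面的骨架用一个软阈值函数充当占位去噪器演示该循环;测量模型、步长与 denoiser 均为本文假设,DiffPace 中先验步应由训练好的扩散模型完成。

```python
import numpy as np

def pnp_channel_estimate(y, A, denoiser, n_steps=50, step=0.1):
    """即插即用信道估计的骨架:交替做数据保真梯度步
    (最小化 ||y - A h||^2)与去噪先验步(denoiser 扮演扩散先验)。"""
    h = A.conj().T @ y                      # 最小二乘式初始化
    for _ in range(n_steps):
        grad = A.conj().T @ (A @ h - y)     # 数据保真项梯度
        h = h - step * grad
        h = denoiser(h)                     # 先验投影:真实系统中由 DM 完成
    return h

rng = np.random.default_rng(0)
n_ant, n_rf = 64, 8                         # 天线数远大于射频链路数(压缩观测)
A = (rng.normal(size=(n_rf, n_ant)) + 1j * rng.normal(size=(n_rf, n_ant))) / np.sqrt(2 * n_ant)
h_true = rng.normal(size=n_ant) + 1j * rng.normal(size=n_ant)
y = A @ h_true + 0.01 * (rng.normal(size=n_rf) + 1j * rng.normal(size=n_rf))
# 占位"去噪器":复数软阈值,仅保相位、收缩幅值
soft = lambda h, t=0.05: np.maximum(np.abs(h) - t, 0) * np.exp(1j * np.angle(h))
print(pnp_channel_estimate(y, A, soft).shape)  # (64,)
```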
zh
机器学习
[LG-0] GeoCrossBench: Cross-Band Generalization for Remote Sensing
链接: https://arxiv.org/abs/2511.02831
作者: Hakob Tamazyan,Ani Vanyan,Alvard Barseghyan,Anna Khosrovyan,Evan Shelhamer,Hrant Khachatrian
类目: Machine Learning (cs.LG)
*备注:
Abstract:The number and diversity of remote sensing satellites grow over time, while the vast majority of labeled data comes from older satellites. As the foundation models for Earth observation scale up, the cost of (re-)training to support new satellites grows too, so the generalization capabilities of the models towards new satellites become increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol: it tests the in-distribution performance; generalization to satellites with no band overlap; and generalization to satellites with additional bands with respect to the training set. We also develop a self-supervised extension of ChannelViT, ChiViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer a 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25% when given additional bands during test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands can achieve relatively consistent performance across all satellites, highlighting that the benchmark is far from being saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.
[LG-1] Fast Private and Protected: Safeguarding Data Privacy and Defending Against Model Poisoning Attacks in Federated Learning
链接: https://arxiv.org/abs/2511.02797
作者: Nicolas Riccieri Gardin Assumpcao,Leandro Villas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated Learning (FL) is a distributed training paradigm wherein participants collaborate to build a global model while ensuring the privacy of the involved data, which remains stored on participant devices. However, proposals aiming to ensure such privacy also make it challenging to protect against potential attackers seeking to compromise the training outcome. In this context, we present Fast, Private, and Protected (FPP), a novel approach that aims to safeguard federated training while enabling secure aggregation to preserve data privacy. This is accomplished by evaluating rounds using participants’ assessments and enabling training recovery after an attack. FPP also employs a reputation-based mechanism to mitigate the participation of attackers. We created a dockerized environment to validate the performance of FPP compared to other approaches in the literature (FedAvg, Power-of-Choice, and aggregation via Trimmed Mean and Median). Our experiments demonstrate that FPP achieves a rapid convergence rate and can converge even in the presence of malicious participants performing model poisoning attacks.
[LG-2] Enhancing Federated Learning Privacy with QUBO
链接: https://arxiv.org/abs/2511.02785
作者: Andras Ferenczi,Sutapa Samanta,Dagen Wang,Todd Hodges
类目: Machine Learning (cs.LG)
*备注: 8 pages, 9 figures
Abstract:Federated learning (FL) is a widely used method for training machine learning (ML) models in a scalable way while preserving privacy (i.e., without centralizing raw data). Prior research shows that the risk of exposing sensitive data increases cumulatively as the number of iterations where a client’s updates are included in the aggregated model increases. Attackers can launch membership inference attacks (MIA; deciding whether a sample or client participated), property inference attacks (PIA; inferring attributes of a client’s data), and model inversion attacks (MI; reconstructing inputs), thereby inferring client-specific attributes and, in some cases, reconstructing inputs. In this paper, we mitigate risk by substantially reducing per-client exposure using a quantum computing-inspired quadratic unconstrained binary optimization (QUBO) formulation that selects a small subset of client updates most relevant for each training round. In this work, we focus on two threat vectors: (i) information leakage by clients during training and (ii) adversaries who can query or obtain the global model. We assume a trusted central server and do not model server compromise. This method also assumes that the server has access to a validation/test set with global data distribution. Experiments on the MNIST dataset with 300 clients in 20 rounds showed a 95.2% per-round and 49% cumulative privacy exposure reduction, with 147 clients’ updates never being used during training, while in general matching or even exceeding full-aggregation accuracy. The method also proved efficient at smaller scale and with a more complex model. A CINIC-10 dataset-based experiment with 30 clients resulted in an 82% per-round and a 33% cumulative privacy improvement.
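As an illustration of the kind of subset selection the abstract describes, here is a minimal sketch of a QUBO over client updates, assuming the server scores each update's relevance (e.g., similarity to a validation gradient) and pairwise redundancy; the weights `lam` and `mu` and the exhaustive solver are stand-ins for whatever annealer or heuristic the paper actually uses.

```python
import itertools
import numpy as np

def build_qubo(relevance, overlap, budget, lam=1.0, mu=2.0):
    """QUBO matrix Q for selecting a small subset of client updates.

    Energy: E(x) = -sum_i relevance[i]*x_i + lam*sum_{i<j} overlap[i,j]*x_i*x_j
            + mu*(sum_i x_i - budget)^2   (soft cardinality constraint)
    """
    n = len(relevance)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] += -relevance[i] + mu * (1 - 2 * budget)  # linear terms on the diagonal
        for j in range(i + 1, n):
            Q[i, j] += lam * overlap[i, j] + 2 * mu       # pairwise terms
    return Q

def solve_exhaustive(Q):
    """Brute-force minimizer; stands in for an annealer at toy scale."""
    n = Q.shape[0]
    best_x, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

rng = np.random.default_rng(0)
n_clients = 8
relevance = rng.uniform(0, 1, n_clients)            # e.g., cosine similarity to val gradient
overlap = rng.uniform(0, 0.3, (n_clients, n_clients))
x, e = solve_exhaustive(build_qubo(relevance, overlap, budget=3))
print("selected clients:", np.flatnonzero(x), "energy:", round(e, 3))
```

At realistic scale the exhaustive search would be replaced by simulated or quantum annealing; the QUBO structure itself stays the same.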
[LG-3] Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
链接: https://arxiv.org/abs/2511.02773
作者: Xinghan Li,Haodong Wen,Kaifeng Lyu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\mathrm{tr}(\mathbf{H})$, whereas we prove that Adam minimizes $\mathrm{tr}(\mathrm{Diag}(\mathbf{H})^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
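A tiny numeric illustration of the two sharpness measures contrasted above, on a toy quadratic rather than the paper's setting: the two quantities can rank the same Hessian very differently, which is the point of the distinction.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T H w with a fixed PSD Hessian H.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
H = A @ A.T

sgd_sharpness = np.trace(H)                    # tr(H): the measure SGD reduces
adam_sharpness = np.sum(np.sqrt(np.diag(H)))   # tr(Diag(H)^{1/2}): Adam's measure

print(f"tr(H)             = {sgd_sharpness:.3f}")
print(f"tr(Diag(H)^(1/2)) = {adam_sharpness:.3f}")
```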
[LG-4] VecComp: Vector Computing via MIMO Digital Over-the-Air Computation
链接: https://arxiv.org/abs/2511.02765
作者: Saeed Razavikia,José Mairton Barros Da Silva Junior,Carlo Fischione
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Recently, the ChannelComp framework has proposed digital over-the-air computation by designing digital modulations that enable the computation of arbitrary functions. Unlike traditional analog over-the-air computation, which is restricted to nomographic functions, ChannelComp enables a broader range of computational tasks while maintaining compatibility with digital communication systems. This framework is intended for applications that favor local information processing over the mere acquisition of data. However, ChannelComp is currently designed for scalar function computation, while numerous data-centric applications necessitate vector-based computations, and it is susceptible to channel fading. In this work, we introduce a generalization of the ChannelComp framework, called VecComp, by integrating ChannelComp with multiple-antenna technology. This generalization not only enables vector function computation but also ensures scalability in the computational complexity, which increases only linearly with the vector dimension. As such, VecComp remains computationally efficient and robust against channel impairments, making it suitable for high-dimensional, data-centric applications. We establish a non-asymptotic upper bound on the mean squared error of VecComp, affirming its computation efficiency under fading channel conditions. Numerical experiments show the effectiveness of VecComp in improving the computation of vector functions and fading compensation over noisy and fading multiple-access channels.
[LG-5] From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos
链接: https://arxiv.org/abs/2511.02762
作者: Xun Wang,Zhuoran Li,Yanshan Lin,Hai Zhong,Longbo Huang
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Training a team of agents from scratch in multi-agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi-agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search-and-rescue. To unlock their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi-agent training through a policy fusion mechanism that combines an MoE-like gating selector and an action editor. Experiments across diverse cooperative tasks show that SoCo significantly boosts the training efficiency and performance of backbone algorithms. These results demonstrate that solo demonstrations provide a scalable and effective complement to multi-agent data, making cooperative learning more practical and broadly applicable.
[LG-6] ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models
链接: https://arxiv.org/abs/2511.02757
作者: Lejs Deen Behric,Liang Zhang,Bingcong Li,Kiran Koshy Thekumparampil
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth-order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than MeZO while retaining the low-memory footprint of zeroth-order methods.
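A minimal sketch of the core idea: cone-restricted direction sampling around a momentum estimate, combined with the standard two-point zeroth-order gradient estimate. The mixing rule and hyperparameters below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def cone_direction(momentum, angle=0.5, rng=None):
    """Sample a unit direction inside a cone around the momentum estimate."""
    rng = rng or np.random.default_rng()
    if np.linalg.norm(momentum) < 1e-12:        # no momentum yet: uniform direction
        z = rng.standard_normal(momentum.shape)
        return z / np.linalg.norm(z)
    m = momentum / np.linalg.norm(momentum)
    z = rng.standard_normal(m.shape)
    z -= (z @ m) * m                            # component orthogonal to momentum
    z /= np.linalg.norm(z) + 1e-12
    return np.cos(angle) * m + np.sin(angle) * z

def conmezo_step(loss, theta, momentum, lr=0.1, eps=1e-3, beta=0.9, rng=None):
    d = cone_direction(momentum, rng=rng)
    g = (loss(theta + eps * d) - loss(theta - eps * d)) / (2 * eps)  # directional slope
    grad_est = g * d                            # two-point zeroth-order estimate
    momentum = beta * momentum + (1 - beta) * grad_est
    return theta - lr * grad_est, momentum

# Toy run on a quadratic in 100 dimensions
rng = np.random.default_rng(0)
loss = lambda w: 0.5 * np.sum(w ** 2)
theta, mom = rng.standard_normal(100), np.zeros(100)
for _ in range(2000):
    theta, mom = conmezo_step(loss, theta, mom, rng=rng)
print("loss after 2000 steps:", round(loss(theta), 3))
```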
[LG-7] Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning
链接: https://arxiv.org/abs/2511.02748
作者: Farhad Rezazadeh,Hatim Chergui,Merouane Debbah,Houbing Song,Dusit Niyato,Lingjia Liu
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 13 Pages, 3 Figures, 4 Tables
Abstract:We argue that sixth-generation (6G) intelligence is not fluent token prediction but the capacity to imagine and choose – to simulate future scenarios, weigh trade-offs, and act with calibrated uncertainty. We reframe open radio access network (O-RAN) near-real-time (Near-RT) control via counterfactual dynamics and a world modeling (WM) paradigm that learns an action-conditioned generative state space. This enables quantitative “what-if” forecasting beyond large language models (LLMs) as the primary modeling primitive. Actions such as physical resource blocks (PRBs) are treated as first-class control inputs in a causal world model, and both aleatoric and epistemic uncertainty are modeled for prediction and what-if analysis. An agentic, model predictive control (MPC)-based cross-entropy method (CEM) planner operates over short horizons, using prior-mean rollouts within data-driven PRB bounds to maximize a deterministic reward. The model couples multi-scale structured state-space mixtures (MS3M) with a compact stochastic latent to form WM-MS3M, summarizing key performance indicator (KPI) histories and predicting next-step KPIs under hypothetical PRB sequences. On realistic O-RAN traces, WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference, enabling rare-event simulation and offline policy screening.
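The planning loop in the abstract, MPC with a cross-entropy method over PRB sequences, can be sketched generically as below; the reward function stands in for a WM-MS3M rollout and is a made-up placeholder.

```python
import numpy as np

def cem_plan(rollout_reward, horizon=8, prb_min=1, prb_max=50,
             pop=64, elites=8, iters=5, rng=None):
    """Cross-entropy method over PRB sequences (toy stand-in for the paper's planner).

    `rollout_reward` maps an integer PRB sequence to a scalar reward, e.g. a
    world-model rollout of predicted KPIs; here it is a user-supplied function.
    """
    rng = rng or np.random.default_rng()
    mu = np.full(horizon, (prb_min + prb_max) / 2.0)
    sigma = np.full(horizon, (prb_max - prb_min) / 4.0)
    for _ in range(iters):
        cand = rng.normal(mu, sigma, size=(pop, horizon))
        cand = np.clip(np.round(cand), prb_min, prb_max)
        scores = np.array([rollout_reward(c) for c in cand])
        elite = cand[np.argsort(scores)[-elites:]]      # keep the best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return np.clip(np.round(mu), prb_min, prb_max)

# Hypothetical reward: track a target throughput while penalizing PRB usage.
reward = lambda prbs: -np.abs(prbs * 0.8 - 30.0).sum() - 0.1 * prbs.sum()
print("planned PRBs:", cem_plan(reward, rng=np.random.default_rng(0)))
```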
[LG-8] Calibration improves detection of mislabeled examples
链接: https://arxiv.org/abs/2511.02738
作者: Ilies Chibane,Thomas George,Pierre Nodet,Vincent Lemaire
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mislabeled data is a pervasive issue that undermines the performance of machine learning systems in real-world applications. An effective approach to mitigate this problem is to detect mislabeled instances and subject them to special treatment, such as filtering or relabeling. Automatic mislabeling detection methods typically rely on training a base machine learning model and then probing it for each instance to obtain a trust score indicating whether the provided label is genuine or corrupted. The properties of this base model are thus of paramount importance. In this paper, we investigate the impact of calibrating this model. Our empirical results show that using calibration methods improves the accuracy and robustness of mislabeled instance detection, providing a practical and effective solution for industrial applications.
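A minimal sketch of the pipeline the abstract studies, assuming the simplest calibration method (temperature scaling by grid search) and the simplest trust score (the calibrated probability of the provided label); the paper evaluates this idea more broadly.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def temperature_scale(logits, labels, temps=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing NLL on held-out data (grid search)."""
    best_t, best_nll = 1.0, np.inf
    for t in temps:
        p = softmax(logits / t)
        nll = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

def trust_scores(logits, given_labels, temperature):
    """Self-confidence trust score: calibrated probability of the provided label.

    Low scores flag likely mislabeled instances.
    """
    p = softmax(logits / temperature)
    return p[np.arange(len(given_labels)), given_labels]

# Toy usage with random logits and a few flipped labels
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 3)) * 3
labels = logits.argmax(axis=1)
labels[:10] = (labels[:10] + 1) % 3            # inject 10 mislabeled examples
t = temperature_scale(logits, labels)
suspects = np.argsort(trust_scores(logits, labels, t))[:10]
print("temperature:", round(t, 2), "| flagged:", sorted(suspects.tolist()))
```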
[LG-9] Does Interpretability of Knowledge Tracing Models Support Teacher Decision Making?
链接: https://arxiv.org/abs/2511.02718
作者: Adia Khalid,Alina Deriyeva,Benjamin Paassen
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: in press at the Workshop on Epistemics and Decision-Making in AI-Supported Education, AIED 2025
Abstract:Knowledge tracing (KT) models are a crucial basis for pedagogical decision-making, namely which task to select next for a learner and when to stop teaching a particular skill. Given the high stakes of pedagogical decisions, KT models are typically required to be interpretable, in the sense that they should implement an explicit model of human learning and provide explicit estimates of learners’ abilities. However, to our knowledge, no study to date has investigated whether the interpretability of KT models actually helps human teachers to make teaching decisions. We address this gap. First, we perform a simulation study to show that, indeed, decisions based on interpretable KT models achieve mastery faster compared to decisions based on a non-interpretable model. Second, we repeat the study but ask N=12 human teachers to make the teaching decisions based on the information provided by KT models. As expected, teachers rate interpretable KT models higher in terms of usability and trustworthiness. However, the number of tasks needed until mastery hardly differs between KT models. This suggests that the relationship between model interpretability and teacher decisions is not straightforward: teachers do not solely rely on KT models to make decisions and further research is needed to investigate how learners and teachers actually understand and use KT models.
[LG-10] Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs NEURIPS’25
链接: https://arxiv.org/abs/2511.02690
作者: Georgios Tzannetos,Parameswaran Kamalaruban,Adish Singla
类目: Machine Learning (cs.LG)
*备注: NeurIPS’25 paper
Abstract:Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
[LG-11] Nesterov-Accelerated Robust Federated Learning Over Byzantine Adversaries
链接: https://arxiv.org/abs/2511.02657
作者: Lihan Xu,Yanjie Dong,Gang Wang,Runhao Zeng,Xiaoyi Fan,Xiping Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate robust federated learning, where a group of workers collaboratively train a shared model under the orchestration of a central server in the presence of Byzantine adversaries capable of arbitrary and potentially malicious behaviors. To simultaneously enhance communication efficiency and robustness against such adversaries, we propose a Byzantine-resilient Nesterov-Accelerated Federated Learning (Byrd-NAFL) algorithm. Byrd-NAFL seamlessly integrates Nesterov’s momentum into the federated learning process alongside Byzantine-resilient aggregation rules to achieve fast convergence while safeguarding against gradient corruption. We establish a finite-time convergence guarantee for Byrd-NAFL under non-convex and smooth loss functions with a relaxed assumption on the aggregated gradients. Extensive numerical experiments validate the effectiveness of Byrd-NAFL and demonstrate the superiority over existing benchmarks in terms of convergence speed, accuracy, and resilience to diverse Byzantine attack strategies.
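A compact sketch of the two ingredients named above, coordinate-wise trimmed-mean aggregation combined with a Nesterov-style momentum update on the server; the exact update form in the paper may differ, and the attack model here is a crude stand-in.

```python
import numpy as np

def coordinate_trimmed_mean(grads, trim):
    """Byzantine-resilient aggregation: per coordinate, drop the `trim` largest
    and `trim` smallest values, then average the rest."""
    g = np.sort(np.stack(grads), axis=0)
    return g[trim:len(grads) - trim].mean(axis=0)

def byrd_nafl_round(x, v, worker_grads, lr=0.1, beta=0.9, trim=2):
    """One server round: robust aggregation + Nesterov-style momentum (sketch)."""
    g = coordinate_trimmed_mean(worker_grads, trim)
    v_new = beta * v + g
    x_new = x - lr * (g + beta * v_new)   # lookahead update
    return x_new, v_new

# Toy run: 8 honest workers (gradient of 0.5||x||^2 plus noise), 2 attackers
rng = np.random.default_rng(0)
x, v = rng.normal(size=10), np.zeros(10)
for _ in range(100):
    honest = [x + 0.1 * rng.standard_normal(10) for _ in range(8)]
    byz = [1e3 * rng.standard_normal(10) for _ in range(2)]   # arbitrary attacks
    x, v = byrd_nafl_round(x, v, honest + byz)
print("||x|| after training:", round(np.linalg.norm(x), 4))
```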
[LG-12] Recursively Enumerably Representable Classes and Computable Versions of the Fundamental Theorem of Statistical Learning
链接: https://arxiv.org/abs/2511.02644
作者: David Kattermann,Lothar Sebastian Krapp
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic (math.LO)
*备注:
Abstract:We study computable probably approximately correct (CPAC) learning, where learners are required to be computable functions. It had been previously observed that the Fundamental Theorem of Statistical Learning, which characterizes PAC learnability by finiteness of the Vapnik-Chervonenkis (VC-)dimension, no longer holds in this framework. Recent works recovered analogs of the Fundamental Theorem in the computable setting, for instance by introducing an effective VC-dimension. Guided by this, we investigate the connection between CPAC learning and recursively enumerable representable (RER) classes, whose members can be algorithmically listed. Our results show that the effective VC-dimensions can take arbitrary values above the traditional one, even for RER classes, which creates a whole family of (non-)examples for various notions of CPAC learning. Yet the two dimensions coincide for classes satisfying sufficiently strong notions of CPAC learning. We then observe that CPAC learnability can also be characterized via containment of RER classes that realize the same samples. Furthermore, it is shown that CPAC learnable classes satisfying a unique identification property are necessarily RER. Finally, we establish that agnostic learnability can be guaranteed for RER classes, by considering the relaxed notion of nonuniform CPAC learning.
[LG-13] The stability of shallow neural networks on spheres: A sharp spectral analysis
链接: https://arxiv.org/abs/2511.02625
作者: Xinliang Liu,Tong Mao,Jinchao Xu
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:We present an estimation of the condition numbers of the mass and stiffness matrices arising from shallow ReLU$^k$ neural networks defined on the unit sphere $\mathbb{S}^d$. In particular, when $\{\theta_j^*\}_{j=1}^n \subset \mathbb{S}^d$ is antipodally quasi-uniform, the condition number is sharp. Indeed, in this case, we obtain sharp asymptotic estimates for the full spectrum of eigenvalues and characterize the structure of the corresponding eigenspaces, showing that the smallest eigenvalues are associated with an eigenbasis of low-degree polynomials while the largest eigenvalues are linked to high-degree polynomials. This spectral analysis establishes a precise correspondence between the approximation power of the network and its numerical stability.
[LG-14] Verifying LLM Inference to Prevent Model Weight Exfiltration
链接: https://arxiv.org/abs/2511.02620
作者: Roy Rinberg,Adam Karvonen,Alex Hoover,Daniel Reuter,Keri Warr
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs, a strategy known as steganography. This work investigates how to verify model responses to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model exfiltration as a security game, propose a verification framework that can provably mitigate steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to 0.5% with false-positive rate of 0.01%, corresponding to a 200x slowdown for adversaries. Overall, this work further establishes a foundation for defending against model weight exfiltration and demonstrates that strong protection can be achieved with minimal additional cost to inference providers.
[LG-15] A Non-Adversarial Approach to Idempotent Generative Modelling
链接: https://arxiv.org/abs/2511.02614
作者: Mohammed Al-Jaff,Giovanni Luca Marchetti,Michael C Welle,Jens Lundell,Mats G. Gustafsson,Gustav Eje Henter,Hossein Azizpour,Danica Kragic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Idempotent Generative Networks (IGNs) are deep generative models that also function as local data manifold projectors, mapping arbitrary inputs back onto the manifold. They are trained to act as identity operators on the data and as idempotent operators off the data manifold. However, IGNs suffer from mode collapse, mode dropping, and training instability due to their objectives, which contain adversarial components and can cause the model to cover the data manifold only partially – an issue shared with generative adversarial networks. We introduce Non-Adversarial Idempotent Generative Networks (NAIGNs) to address these issues. Our loss function combines reconstruction with the non-adversarial generative objective of Implicit Maximum Likelihood Estimation (IMLE). This improves on IGN’s ability to restore corrupted data and generate new samples that closely match the data distribution. We moreover demonstrate that NAIGNs implicitly learn the distance field to the data manifold, as well as an energy-based model.
[LG-16] Neural Network Interoperability Across Platforms
链接: https://arxiv.org/abs/2511.02610
作者: Nadia Daoudi,Ivan Alfonso,Jordi Cabot
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:
Abstract:The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.
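To make the pivot-model idea concrete, here is a toy sketch: a framework-agnostic layer description from which source code for either framework is emitted. The real approach works on much richer abstractions; all names below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Layer:              # pivot (framework-agnostic) layer description
    kind: str             # "dense" | "relu"
    units: int = 0

def to_pytorch(layers, in_features):
    """Emit PyTorch source from the pivot model (illustrative emitter)."""
    mods, prev = [], in_features
    for l in layers:
        if l.kind == "dense":
            mods.append(f"nn.Linear({prev}, {l.units})")
            prev = l.units
        elif l.kind == "relu":
            mods.append("nn.ReLU()")
    return "model = nn.Sequential(" + ", ".join(mods) + ")"

def to_keras(layers, in_features):
    """Emit TensorFlow/Keras source from the same pivot model."""
    mods = [f"layers.Input(shape=({in_features},))"]
    for l in layers:
        if l.kind == "dense":
            mods.append(f"layers.Dense({l.units})")
        elif l.kind == "relu":
            mods.append("layers.Activation('relu')")
    return "model = keras.Sequential([" + ", ".join(mods) + "])"

pivot = [Layer("dense", 64), Layer("relu"), Layer("dense", 10)]
print(to_pytorch(pivot, 784))
print(to_keras(pivot, 784))
```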
[LG-17] A Large Language Model for Corporate Credit Scoring
链接: https://arxiv.org/abs/2511.02593
作者: Chitro Majumdar,Sergio Scandizzo,Ratanlal Mahanta,Avradip Mandal,Swarnendu Bhattacharjee
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce $\Omega^2$, a Large Language Model-driven framework for corporate credit scoring that combines structured financial data with advanced machine learning to improve predictive reliability and interpretability. Our study evaluates $\Omega^2$ on a multi-agency dataset of 7,800 corporate credit ratings drawn from Moody’s, Standard & Poor’s, Fitch, and Egan-Jones, each containing detailed firm-level financial indicators such as leverage, profitability, and liquidity ratios. The system integrates CatBoost, LightGBM, and XGBoost models optimized through Bayesian search under temporal validation to ensure forward-looking and reproducible results. $\Omega^2$ achieved a mean test AUC above 0.93 across agencies, confirming its ability to generalize across rating systems and maintain temporal consistency. These results show that combining language-based reasoning with quantitative learning creates a transparent and institution-grade foundation for reliable corporate credit-risk assessment.
[LG-18] Redundancy Maximization as a Principle of Associative Memory Learning
链接: https://arxiv.org/abs/2511.02584
作者: Mark Blümel,Andreas C. Schneider,Valentin Neuhaus,David A. Ehrlich,Marcel Graetz,Michael Wibral,Abdullah Makkeh,Viola Priesemann
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
*备注: 21 pages, 8 figures
Abstract:Associative memory, traditionally modeled by Hopfield networks, enables the retrieval of previously stored patterns from partial or noisy cues. Yet, the local computational principles which are required to enable this function remain incompletely understood. To formally characterize the local information processing in such systems, we employ a recent extension of information theory - Partial Information Decomposition (PID). PID decomposes the contribution of different inputs to an output into unique information from each input, redundant information across inputs, and synergistic information that emerges from combining different inputs. Applying this framework to individual neurons in classical Hopfield networks we find that below the memory capacity, the information in a neuron’s activity is characterized by high redundancy between the external pattern input and the internal recurrent input, while synergy and unique information are close to zero until the memory capacity is surpassed and performance drops steeply. Inspired by this observation, we use redundancy as an information-theoretic learning goal, which is directly optimized for each neuron, dramatically increasing the network’s memory capacity to 1.59, a more than tenfold improvement over the 0.14 capacity of classical Hopfield networks and even outperforming recent state-of-the-art implementations of Hopfield networks. Ultimately, this work establishes redundancy maximization as a new design principle for associative memories and opens pathways for new associative memory models based on information-theoretic goals.
[LG-19] Directional-Clamp PPO
链接: https://arxiv.org/abs/2511.02577
作者: Gilad Karpel,Ruida Zhou,Shoham Sabach,Mohammad Ghavamzadeh
类目: Machine Learning (cs.LG)
*备注:
[LG-20] Dynamic Priors in Bayesian Optimization for Hyperparameter Optimization
链接: https://arxiv.org/abs/2511.02570
作者: Lukas Fehring,Marcel Wever,Maximilian Spliethöver,Leona Hennig,Henning Wachsmuth,Marius Lindauer
类目: Machine Learning (cs.LG)
*备注: 9 pages
Abstract:Hyperparameter optimization (HPO), for example, based on Bayesian optimization (BO), supports users in designing models well-suited for a given dataset. HPO has proven its effectiveness on several applications, ranging from classical machine learning for tabular data to deep neural networks for computer vision and transformers for natural language processing. However, HPO still sometimes lacks acceptance by machine learning experts due to its black-box nature and limited user control. Addressing this, first approaches have been proposed to initialize BO methods with expert knowledge. However, these approaches do not allow for online steering during the optimization process. In this paper, we introduce a novel method that enables repeated interventions to steer BO via user input, specifying expert knowledge and user preferences at runtime of the HPO process in the form of prior distributions. To this end, we generalize an existing method, $\pi$BO, preserving theoretical guarantees. We also introduce a misleading prior detection scheme, which allows protection against harmful user inputs. In our experimental evaluation, we demonstrate that our method can effectively incorporate multiple priors, leveraging informative priors, whereas misleading priors are reliably rejected or overcome. Thereby, we achieve performance competitive with unperturbed BO.
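For intuition, the mechanism being generalized can be sketched in the spirit of $\pi$BO's decaying prior weighting: the acquisition function is multiplied by the user prior raised to a power that shrinks over iterations, so a (possibly misleading) prior loses influence as data accumulates. The formula below follows the published $\pi$BO rule up to the +1 guard in the exponent, which is our assumption; everything else is illustrative.

```python
import numpy as np

def prior_weighted_acquisition(acq, prior, n_iter, beta=10.0):
    """piBO-style weighting: acquisition times prior^(beta / (n_iter + 1)).

    The prior's influence decays as iterations accumulate (the +1 avoids a
    division by zero at n_iter = 0 and is our assumption, not the paper's).
    """
    return acq * prior ** (beta / (n_iter + 1))

# Toy 1-D example: acquisition peaks at x=0.8, user prior believes in x=0.3
x = np.linspace(0, 1, 201)
acq = np.exp(-0.5 * ((x - 0.8) / 0.2) ** 2)
prior = np.exp(-0.5 * ((x - 0.3) / 0.1) ** 2)
for n in (1, 10, 100):
    best = x[np.argmax(prior_weighted_acquisition(acq, prior, n))]
    print(f"iteration {n:3d}: next query at x = {best:.2f}")
```

Running this shows the next query drifting from the prior's mode toward the acquisition's own optimum as the iteration count grows.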
[LG-21] heoretical Guarantees for Causal Discovery on Large Random Graphs
链接: https://arxiv.org/abs/2511.02536
作者: Mathieu Chevalley,Arash Mehrjou,Patrick Schwab
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate theoretical guarantees for the false-negative rate (FNR) – the fraction of true causal edges whose orientation is not recovered – under single-variable random interventions and an $\epsilon$-interventional faithfulness assumption that accommodates latent confounding. For sparse Erdős–Rényi directed acyclic graphs, where the edge probability scales as $p_e = \Theta(1/d)$, we show that the FNR concentrates around its mean at rate $O(\frac{\log d}{\sqrt{d}})$, implying that large deviations above the expected error become exponentially unlikely as dimensionality increases. This concentration ensures that derived upper bounds hold with high probability in large-scale settings. Extending the analysis to generalized Barabási–Albert graphs reveals an even stronger phenomenon: when the degree exponent satisfies $\gamma > 3$, the deviation width scales as $O(d^{\beta - \frac{1}{2}})$ with $\beta = 1/(\gamma - 1) < \frac{1}{2}$, and hence vanishes in the limit. This demonstrates that realistic scale-free topologies intrinsically regularize causal discovery, reducing variability in orientation error. These finite-dimension results provide the first dimension-adaptive, faithfulness-robust guarantees for causal structure recovery, and challenge the intuition that high dimensionality and network heterogeneity necessarily hinder accurate discovery. Our simulation results corroborate these theoretical predictions, showing that the FNR indeed concentrates and often vanishes in practice as dimensionality grows.
[LG-22] Rawlsian many-to-one matching with non-linear utility
链接: https://arxiv.org/abs/2511.02533
作者: Hortence Nana,Andreas Athanasopoulos,Christos Dimitrakakis
类目: Machine Learning (cs.LG)
*备注: 17 pages, 7 figures
Abstract:We study a many-to-one matching problem, such as the college admission problem, where each college can admit multiple students. Unlike classical models, colleges evaluate sets of students through non-linear utility functions that capture diversity between them. In this setting, we show that classical stable matchings may fail to exist. To address this, we propose alternative solution concepts based on Rawlsian fairness, aiming to maximize the minimum utility across colleges. We design both deterministic and stochastic algorithms that iteratively improve the outcome of the worst-off college, offering a practical approach to fair allocation when stability cannot be guaranteed.
[LG-23] Many-vs-Many Missile Guidance via Virtual Targets
链接: https://arxiv.org/abs/2511.02526
作者: Marc Schneider,Walter Fichter
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: will be submitted to Journal of Guidance, Control, and Dynamics as Technical Note
Abstract:This paper presents a novel approach to many-vs-many missile guidance using virtual targets (VTs) generated by a Normalizing Flows-based trajectory predictor. Rather than assigning n interceptors directly to m physical targets through conventional weapon target assignment algorithms, we propose a centralized strategy that constructs n VT trajectories representing probabilistic predictions of maneuvering target behavior. Each interceptor is guided toward its assigned VT using Zero-Effort-Miss guidance during midcourse flight, transitioning to Proportional Navigation guidance for terminal interception. This approach treats many-vs-many engagements as many-vs-distribution scenarios, exploiting numerical superiority (n > m) by distributing interceptors across diverse trajectory hypotheses rather than pursuing identical deterministic predictions. Monte Carlo simulations across various target-interceptor configurations (1-6 targets, 1-8 interceptors) demonstrate that the VT method matches or exceeds baseline straight-line prediction performance by 0-4.1% when n = m, with improvements increasing to 5.8-14.4% when n > m. The results confirm that probabilistic VTs enable effective exploitation of numerical superiority, significantly increasing interception probability in many-vs-many scenarios.
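A toy 2-D illustration of the midcourse law mentioned above, Zero-Effort-Miss guidance driving an interceptor toward a (virtual) target on a straight path; the gain, kinematics, and time-to-go estimate are simplified assumptions, not the paper's setup.

```python
import numpy as np

def zem_accel(r_rel, v_rel, t_go, n_gain=3.0):
    """Zero-Effort-Miss guidance: command acceleration proportional to the
    miss that would occur if neither vehicle maneuvered further."""
    zem = r_rel + v_rel * t_go            # predicted miss vector at intercept
    return n_gain * zem / (t_go ** 2)

# Toy 2-D midcourse run against a target flying a straight line
dt, pos_i, vel_i = 0.1, np.array([0.0, 0.0]), np.array([250.0, 50.0])
pos_t, vel_t = np.array([10000.0, 2000.0]), np.array([-200.0, 0.0])
for _ in range(400):
    r, v = pos_t - pos_i, vel_t - vel_i
    t_go = max(np.linalg.norm(r) / max(np.linalg.norm(v), 1e-6), dt)  # crude estimate
    vel_i += zem_accel(r, v, t_go) * dt
    pos_i += vel_i * dt
    pos_t += vel_t * dt
    if np.linalg.norm(pos_t - pos_i) < 50:
        break
print("final miss distance:", round(np.linalg.norm(pos_t - pos_i), 1), "m")
```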
[LG-24] Variational Geometric Information Bottleneck: Learning the Shape of Understanding
链接: https://arxiv.org/abs/2511.02496
作者: Ronald Katende
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose a unified information-geometric framework that formalizes understanding in learning as a trade-off between informativeness and geometric simplicity. An encoder $\phi$ is evaluated by $U(\phi) = I(\phi(X); Y) - \beta \, C(\phi)$, where $C(\phi)$ penalizes curvature and intrinsic dimensionality, enforcing smooth, low-complexity manifolds. Under mild manifold and regularity assumptions, we derive non-asymptotic bounds showing that generalization error scales with intrinsic dimension while curvature controls approximation stability, directly linking geometry to sample efficiency. To operationalize this theory, we introduce the Variational Geometric Information Bottleneck (V-GIB), a variational estimator that unifies mutual-information compression and curvature regularization through tractable geometric proxies such as the Hutchinson trace, Jacobian norms, and local PCA. Experiments across synthetic manifolds, few-shot settings, and real-world datasets (Fashion-MNIST, CIFAR-10) reveal a robust information-geometry Pareto frontier, stable estimators, and substantial gains in interpretive efficiency. Fractional-data experiments on CIFAR-10 confirm that curvature-aware encoders maintain predictive power under data scarcity, validating the predicted efficiency-curvature law. Overall, V-GIB provides a principled and measurable route to representations that are geometrically coherent, data-efficient, and aligned with human-understandable structure.
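Of the geometric proxies listed, the Hutchinson trace estimator is the easiest to sketch: it estimates the trace of any linear operator from matrix-vector products with random sign probes. A minimal PyTorch version, assuming `matvec` wraps, e.g., a Jacobian- or Hessian-vector product:

```python
import torch

def hutchinson_trace(matvec, dim, n_probes=64):
    """Estimate tr(A) as the average of v^T A v over Rademacher probes v."""
    est = 0.0
    for _ in range(n_probes):
        v = torch.randint(0, 2, (dim,), dtype=torch.float32) * 2 - 1  # +/-1 entries
        est = est + v @ matvec(v)
    return est / n_probes

# Sanity check against the exact trace of a random matrix
torch.manual_seed(0)
A = torch.randn(50, 50)
approx = hutchinson_trace(lambda v: A @ v, 50, n_probes=2000)
print(float(approx), float(torch.trace(A)))
```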
[LG-25] Learning CNF formulas from uniform random solutions in the local lemma regime
链接: https://arxiv.org/abs/2511.02487
作者: Weiming Feng,Xiongxin Yang,Yixiao Yu,Yiyao Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the problem of learning an $n$-variable $k$-CNF formula $\Phi$ from its i.i.d. uniform random solutions, which is equivalent to learning a Boolean Markov random field (MRF) with $k$-wise hard constraints. Revisiting Valiant’s algorithm (Commun. ACM’84), we show that it can exactly learn (1) $k$-CNFs with bounded clause intersection size under Lovász local lemma type conditions, from $O(\log n)$ samples; and (2) random $k$-CNFs near the satisfiability threshold, from $\widetilde{O}(n^{\exp(-\sqrt{k})})$ samples. These results significantly improve the previous $O(n^k)$ sample complexity. We further establish new information-theoretic lower bounds on sample complexity for both exact and approximate learning from i.i.d. uniform random solutions.
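Valiant's elimination algorithm, the starting point the abstract revisits, is simple enough to sketch directly: keep every clause of width at most k that no positive example falsifies; the conjunction of the survivors is then consistent with all samples (the target clauses are among them). A toy version:

```python
import itertools

def learn_kcnf(samples, n_vars, k):
    """Valiant-style elimination: start from all clauses of width <= k and
    delete every clause falsified by some positive example.

    Each sample is a tuple of 0/1 values; literal +i means x_i, -i means not x_i.
    """
    lits = [l for v in range(1, n_vars + 1) for l in (v, -v)]
    clauses = set()
    for width in range(1, k + 1):
        for c in itertools.combinations(lits, width):
            if len({abs(l) for l in c}) == width:      # no repeated variable
                clauses.add(c)
    def satisfied(clause, x):
        return any(x[abs(l) - 1] == (l > 0) for l in clause)
    for x in samples:
        clauses = {c for c in clauses if satisfied(c, x)}
    return clauses

# Toy usage: solutions of (x1 or x2) and (not x2 or x3) over 3 variables
target = [(1, 2), (-2, 3)]
sols = [x for x in itertools.product([0, 1], repeat=3)
        if all(any(x[abs(l) - 1] == (l > 0) for l in c) for c in target)]
learned = learn_kcnf(sols, n_vars=3, k=2)
print(len(sols), "solutions ->", len(learned), "surviving clauses")
```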
[LG-26] NOWS: Neural Operator Warm Starts for Accelerating Iterative Solvers
链接: https://arxiv.org/abs/2511.02481
作者: Mohammad Sadegh Eshaghi,Cosmin Anitescu,Navid Valizadeh,Yizheng Wang,Xiaoying Zhuang,Timon Rabczuk
类目: Machine Learning (cs.LG)
*备注:
Abstract:Partial differential equations (PDEs) underpin quantitative descriptions across the physical sciences and engineering, yet high-fidelity simulation remains a major computational bottleneck for many-query, real-time, and design tasks. Data-driven surrogates can be strikingly fast but are often unreliable when applied outside their training distribution. Here we introduce Neural Operator Warm Starts (NOWS), a hybrid strategy that harnesses learned solution operators to accelerate classical iterative solvers by producing high-quality initial guesses for Krylov methods such as conjugate gradient and GMRES. NOWS leaves existing discretizations and solver infrastructures intact, integrating seamlessly with finite-difference, finite-element, isogeometric, and finite-volume discretizations. Across our benchmarks, the learned initialization consistently reduces iteration counts and end-to-end runtime, cutting computational time by up to 90% while preserving the stability and convergence guarantees of the underlying numerical algorithms. By combining the rapid inference of neural operators with the rigor of traditional solvers, NOWS provides a practical and trustworthy approach to accelerate high-fidelity PDE simulations.
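The mechanism is easy to demonstrate end-to-end: warm-starting a Krylov solver with a good initial guess cuts iterations without touching the solver. Below, a noisy copy of the true solution stands in for the neural-operator prediction, a deliberate simplification.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# 1-D reaction-diffusion system: a stand-in for the discretized PDEs NOWS targets.
n = 200
A = diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x_true = np.linalg.solve(A.toarray(), b)

def cg_iterations(x0):
    count = 0
    def cb(xk):                      # called once per CG iteration
        nonlocal count
        count += 1
    cg(A, b, x0=x0, maxiter=10000, callback=cb)
    return count

cold = cg_iterations(np.zeros(n))
# Stand-in for the neural-operator prediction: true solution plus small error.
guess = x_true + 1e-3 * np.random.default_rng(0).standard_normal(n)
warm = cg_iterations(guess)
print(f"CG iterations: cold start = {cold}, operator warm start = {warm}")
```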
[LG-27] Accounting for Underspecification in Statistical Claims of Model Superiority
链接: https://arxiv.org/abs/2511.02453
作者: Thomas Sanchez,Pedro M. Gordaliza,Meritxell Bach Cuadra
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Medical Imaging meets EurIPS Workshop: MedEurIPS 2025
Abstract:Machine learning methods are increasingly applied in medical imaging, yet many reported improvements lack statistical robustness: recent works have highlighted that small but significant performance gains are highly likely to be false positives. However, these analyses do not take underspecification into account – the fact that models achieving similar validation scores may behave differently on unseen data due to random initialization or training dynamics. Here, we extend a recent statistical framework modeling false outperformance claims to include underspecification as an additional variance component. Our simulations demonstrate that even modest seed variability ($\sim$1%) substantially increases the evidence required to support superiority claims. Our findings underscore the need for explicit modeling of training variance when validating medical imaging systems.
[LG-28] Improving Unlearning with Model Updates Probably Aligned with Gradients
链接: https://arxiv.org/abs/2511.02435
作者: Virgile Dine,Teddy Furon,Charly Faure
类目: Machine Learning (cs.LG)
*备注: Accepted to AISec’25 co-located with the 32nd ACM Conference on Computer and Communications Security
Abstract:We formulate the machine unlearning problem as a general constrained optimization problem. It unifies the first-order methods from the approximate machine unlearning literature. This paper then introduces the concept of feasible updates as the model’s parameter update directions that help with unlearning while not degrading the utility of the initial model. Our design of feasible updates is based on masking, i.e., a careful selection of the model’s parameters worth updating. It also takes into account the estimation noise of the gradients when processing each batch of data to offer a statistical guarantee to derive locally feasible updates. The technique can be plugged in, as an add-on, to any first-order approximate unlearning method. Experiments with computer vision classifiers validate this approach.
[LG-29] A Spatially Informed Gaussian Process UCB Method for Decentralized Coverage Control
链接: https://arxiv.org/abs/2511.02398
作者: Gennaro Guidone,Luca Monegaglia,Elia Raimondi,Han Wang,Mattia Bianchi,Florian Dörfler
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present a novel decentralized algorithm for coverage control in unknown spatial environments modeled by Gaussian Processes (GPs). To trade-off between exploration and exploitation, each agent autonomously determines its trajectory by minimizing a local cost function. Inspired by the GP-UCB (Upper Confidence Bound for GPs) acquisition function, the proposed cost combines the expected locational cost with a variance-based exploration term, guiding agents toward regions that are both high in predicted density and model uncertainty. Compared to previous work, our algorithm operates in a fully decentralized fashion, relying only on local observations and communication with neighboring agents. In particular, agents periodically update their inducing points using a greedy selection strategy, enabling scalable online GP updates. We demonstrate the effectiveness of our algorithm in simulation.
[LG-30] LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment
链接: https://arxiv.org/abs/2511.02371
作者: Rohan Wandre,Yash Gajewar,Namrata Patel,Vivek Dhalkari
类目: Machine Learning (cs.LG)
*备注:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for grounding large language model outputs in verifiable evidence. However, as modern AI agents transition from static knowledge bases to continuous multimodal streams encompassing text, images, video, and audio, two critical challenges arise: maintaining index freshness without prohibitive re-indexing costs, and preserving cross-modal semantic consistency across heterogeneous embedding spaces. We present LUMA-RAG, a lifelong multimodal agent architecture featuring three key innovations: (i) a streaming, multi-tier memory system that dynamically spills embeddings from a hot HNSW tier to a compressed IVFPQ tier under strict memory budgets; (ii) a streaming CLAP-CLIP alignment bridge that maintains cross-modal consistency through incremental orthogonal Procrustes updates; and (iii) stability-aware retrieval telemetry providing Safe@k guarantees by jointly bounding alignment drift and quantization error. Experiments demonstrate robust text-to-image retrieval (Recall@10 = 0.94), graceful performance degradation under product quantization offloading, and provably stable audio-to-image rankings (Safe@1 = 1.0), establishing LUMA-RAG as a practical framework for production multimodal RAG systems.
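A minimal sketch of the streaming alignment bridge: maintain a decayed cross-covariance between paired embeddings and re-solve the orthogonal Procrustes problem after each batch. The dimensions and the decay rule are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

class ProcrustesBridge:
    """Streaming orthogonal alignment from an audio (CLAP-like) embedding
    space to an image/text (CLIP-like) space, as a running-SVD sketch."""

    def __init__(self, dim):
        self.M = np.zeros((dim, dim))   # running cross-covariance
        self.R = np.eye(dim)

    def update(self, X, Y, decay=0.95):
        """X, Y: (batch, dim) paired embeddings from the two encoders."""
        self.M = decay * self.M + X.T @ Y
        U, _, Vt = np.linalg.svd(self.M)
        self.R = U @ Vt                 # nearest orthogonal map: min ||X R - Y||_F
        return self.R

    def align(self, X):
        return X @ self.R

# Toy usage: recover a hidden rotation from streaming paired batches
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # ground-truth rotation
bridge = ProcrustesBridge(16)
for _ in range(20):
    X = rng.normal(size=(32, 16))
    bridge.update(X, X @ Q + 0.01 * rng.normal(size=(32, 16)))
print("alignment error:", round(np.linalg.norm(bridge.R - Q), 4))
```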
[LG-31] An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
链接: https://arxiv.org/abs/2511.02356
作者: Xu Liu,Yan Chen,Kan Ling,Yichi Zhu,Hengrun Zhang,Guisheng Fan,Huiqun Yu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem. Jailbreak attacks, as one of the significant threats to LLMs, have recently attracted extensive research. In this paper, we reveal a jailbreak strategy which can effectively evade current defense strategies. It extracts valuable information from failed or partially successful attack attempts and evolves itself through attack interactions, resulting in sufficient strategy diversity and adaptability. Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks. To enable this autonomous evolution, we design a closed-loop “attack-evaluate-distill-reuse” core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction. To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies into Effective, Promising, and Ineffective based on their performance scores. The strategy library not only provides precise guidance for attack generation but also possesses exceptional extensibility and transferability. We conduct extensive experiments under a black-box setting, and the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
[LG-32] Evolving Graph Learning for Out-of-Distribution Generalization in Non-stationary Environments
链接: https://arxiv.org/abs/2511.02354
作者: Qingyun Sun,Jiayi Luo,Haonan Yuan,Xingcheng Fu,Hao Peng,Jianxin Li,Philip S. Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks have shown remarkable success in exploiting the spatial and temporal patterns on dynamic graphs. However, existing GNNs exhibit poor generalization ability under distribution shifts, which is inevitable in dynamic scenarios. As dynamic graph generation progresses amid evolving latent non-stationary environments, it is imperative to explore their effects on out-of-distribution (OOD) generalization. This paper proposes a novel Evolving Graph Learning framework for OOD generalization (EvoOOD) by environment-aware invariant pattern recognition. Specifically, we first design an environment sequential variational auto-encoder to model environment evolution and infer the underlying environment distribution. Then, we introduce a mechanism for environment-aware invariant pattern recognition, tailored to address environmental diversification through inferred distributions. Finally, we conduct fine-grained causal interventions on individual nodes using a mixture of instantiated environment samples. This approach helps to distinguish spatio-temporal invariant patterns for OOD prediction, especially in non-stationary environments. Experimental results demonstrate the superiority of EvoOOD on both real-world and synthetic dynamic datasets under distribution shifts. To the best of our knowledge, it is the first attempt to study the dynamic graph OOD generalization problem from the environment evolution perspective.
[LG-33] Reducing normalizing flow complexity for MCMC preconditioning
链接: https://arxiv.org/abs/2511.02345
作者: David Nabergoj,Erik Štrumbelj
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 22 pages, 6 figures
Abstract:Preconditioning is a key component of MCMC algorithms that improves sampling efficiency by facilitating exploration of geometrically complex target distributions through an invertible map. While linear preconditioners are often sufficient for moderately complex target distributions, recent work has explored nonlinear preconditioning with invertible neural networks as components of normalizing flows (NFs). However, empirical and theoretical studies show that overparameterized NF preconditioners can degrade sampling efficiency and fit quality. Moreover, existing NF-based approaches do not adapt their architectures to the target distribution. Related work outside of MCMC similarly finds that suitably parameterized NFs can achieve comparable or superior performance with substantially less training time or data. We propose a factorized preconditioning architecture that reduces NF complexity by combining a linear component with a conditional NF, improving adaptability to target geometry. The linear preconditioner is applied to dimensions that are approximately Gaussian, as estimated from warmup samples, while the conditional NF models more complex dimensions. Our method yields significantly better tail samples on two complex synthetic distributions and consistently better performance on a sparse logistic regression posterior across varying likelihood and prior strengths. It also achieves higher effective sample sizes on hierarchical Bayesian model posteriors with weak likelihoods and strong funnel geometries. This approach is particularly relevant for hierarchical Bayesian model analyses with limited data and could inform current theoretical and software strides in neural MCMC design.
[LG-34] Learning A Universal Crime Predictor with Knowledge-guided Hypernetworks ECAI2025
链接: https://arxiv.org/abs/2511.02336
作者: Fidan Karimova,Tong Chen,Yu Yang,Shazia Sadiq
类目: Machine Learning (cs.LG)
*备注: Accepted by ECAI 2025
Abstract:Predicting crimes in urban environments is crucial for public safety, yet existing prediction methods often struggle to align the knowledge across diverse cities that vary dramatically in data availability of specific crime types. We propose HYpernetwork-enhanced Spatial Temporal Learning (HYSTL), a framework that can effectively train a unified, stronger crime predictor without assuming identical crime types in different cities’ records. In HYSTL, instead of parameterising a dedicated predictor per crime type, a hypernetwork is designed to dynamically generate parameters for the prediction function conditioned on the crime type of interest. To bridge the semantic gap between different crime types, a structured crime knowledge graph is built, where the learned representations of crimes are used as the input to the hypernetwork to facilitate parameter generation. As such, when making predictions for each crime type, the predictor is additionally guided by its intricate association with other relevant crime types. Extensive experiments are performed on two cities with non-overlapping crime types, and the results demonstrate HYSTL outperforms state-of-the-art baselines.
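The hypernetwork mechanism described above, one network emitting the parameters of another conditioned on a crime-type embedding, has a compact generic form. The sketch below generates only a linear prediction head, and all shapes and names are hypothetical; the paper derives the type embedding from a structured crime knowledge graph rather than at random.

```python
import torch
import torch.nn as nn

class CrimeHyperNet(nn.Module):
    """Hypernetwork sketch: generate per-crime-type predictor weights
    from a crime-type embedding."""

    def __init__(self, emb_dim=32, feat_dim=64, hidden=128):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim + 1),   # weights + bias of a linear head
        )

    def forward(self, features, crime_emb):
        params = self.hyper(crime_emb)         # (feat_dim + 1,)
        w, b = params[:-1], params[-1]
        return features @ w + b                # predictions for this crime type

model = CrimeHyperNet()
features = torch.randn(10, 64)                 # spatio-temporal region features
assault_emb = torch.randn(32)                  # KG-derived type embedding (stand-in)
print(model(features, assault_emb).shape)      # torch.Size([10])
```

Because the predictor's parameters are generated rather than stored per type, cities with non-overlapping crime types can still share one model through the embedding space.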
[LG-35] RoME: Domain-Robust Mixture-of-Experts for MILP Solution Prediction across Domains
链接: https://arxiv.org/abs/2511.02331
作者: Tianle Pu,Zijie Geng,Haoyang Liu,Shixuan Liu,Jie Wang,Li Zeng,Chao Chen,Changjun Fan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mixed-Integer Linear Programming (MILP) is a fundamental and powerful framework for modeling complex optimization problems across diverse domains. Recently, learning-based methods have shown great promise in accelerating MILP solvers by predicting high-quality solutions. However, most existing approaches are developed and evaluated in single-domain settings, limiting their ability to generalize to unseen problem distributions. This limitation poses a major obstacle to building scalable and general-purpose learning-based solvers. To address this challenge, we introduce RoME, a domain-Robust Mixture-of-Experts framework for predicting MILP solutions across domains. RoME dynamically routes problem instances to specialized experts based on learned task embeddings. The model is trained using a two-level distributionally robust optimization strategy: inter-domain to mitigate global shifts across domains, and intra-domain to enhance local robustness by introducing perturbations on task embeddings. We reveal that cross-domain training not only enhances the model’s generalization capability to unseen domains but also improves performance within each individual domain by encouraging the model to capture more general intrinsic combinatorial patterns. Specifically, a single RoME model trained on three domains achieves an average improvement of 67.7% when evaluated on five diverse domains. We further test the pretrained model on MIPLIB in a zero-shot setting, demonstrating its ability to deliver measurable performance gains on challenging real-world instances where existing learning-based approaches often struggle to generalize.
[LG-36] Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning
链接: https://arxiv.org/abs/2511.02314
作者: Jueye Zhang,Chao Yang,Youfang Lai,Kai-Wen Li,Wenting Yan,Yunzhou Xia,Haimei Zhang,Jingjing Zhou,Gen Yang,Chen Lin,Tian Li,Yibao Zhang
类目: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注:
Abstract:Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high-dimensional, non-stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action-to-value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi-process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head-and-neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL $85.93\pm7.85\%$ vs. manual $85.02\pm6.92\%$), with significant (p-value < 0.05) improvements for five OARs. The framework efficiently explores high-dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.
[LG-37] Reinforcement learning based data assimilation for unknown state model
链接: https://arxiv.org/abs/2511.02286
作者: Ziyi Wang,Lijian Jiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data assimilation (DA) has increasingly emerged as a critical tool for state estimation across a wide range of applications. It is significantly challenging when the governing equations of the underlying dynamics are unknown. To this end, various machine learning approaches have been employed to construct a surrogate state transition model in a supervised learning framework, which relies on pre-computed training datasets. However, it is often infeasible to obtain noise-free ground-truth state sequences in practice. To address this challenge, we propose a novel method that integrates reinforcement learning with ensemble-based Bayesian filtering methods, enabling the learning of a surrogate state transition model for unknown dynamics directly from noisy observations, without using true state trajectories. Specifically, we treat the process of computing the maximum likelihood estimate of the surrogate model parameters as a sequential decision-making problem, which can be formulated as a discrete-time Markov decision process (MDP). Under this formulation, learning the surrogate transition model is equivalent to finding an optimal policy of the MDP, which can be effectively addressed using reinforcement learning techniques. Once the model is trained offline, state estimation can be performed in the online stage using filtering methods based on the learned dynamics. The proposed framework accommodates a wide range of observation scenarios, including nonlinear and partially observed measurement models. A few numerical examples demonstrate that the proposed method achieves superior accuracy and robustness in high-dimensional settings.
[LG-38] Gradient-Variation Online Adaptivity for Accelerated Optimization with Hölder Smoothness NEURIPS2025
链接: https://arxiv.org/abs/2511.02276
作者: Yuheng Zhao,Yu-Hu Yan,Kfir Yehuda Levy,Peng Zhao
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: NeurIPS 2025
Abstract:Smoothness is known to be crucial for acceleration in offline optimization, and for gradient-variation regret minimization in online learning. Interestingly, these two problems are actually closely connected – accelerated optimization can be understood through the lens of gradient-variation online learning. In this paper, we investigate online learning with Hölder smooth functions, a general class encompassing both smooth and non-smooth (Lipschitz) functions, and explore its implications for offline optimization. For (strongly) convex online functions, we design the corresponding gradient-variation online learning algorithm whose regret smoothly interpolates between the optimal guarantees in smooth and non-smooth regimes. Notably, our algorithms do not require prior knowledge of the Hölder smoothness parameter, exhibiting strong adaptivity over existing methods. Through online-to-batch conversion, this gradient-variation online adaptivity yields an optimal universal method for stochastic convex optimization under Hölder smoothness. However, achieving universality in offline strongly convex optimization is more challenging. We address this by integrating online adaptivity with a detection-based guess-and-check procedure, which, for the first time, yields a universal offline method that achieves accelerated convergence in the smooth regime while maintaining near-optimal convergence in the non-smooth one.
[LG-39] Probabilistic Graph Cuts
链接: https://arxiv.org/abs/2511.02272
作者: Ayoub Ghriss
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 23 pages
Abstract:Probabilistic relaxations of graph cuts offer a differentiable alternative to spectral clustering, enabling end-to-end and online learning without eigendecompositions, yet prior work centered on RatioCut and lacked general guarantees and principled gradients. We present a unified probabilistic framework that covers a wide class of cuts, including Normalized Cut. Our framework provides tight analytic upper bounds on expected discrete cuts via integral representations and Gauss hypergeometric functions, with closed-form forward and backward passes. Together, these results deliver a rigorous, numerically stable foundation for scalable, differentiable graph partitioning covering a wide range of clustering and contrastive learning objectives.
[LG-40] From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
链接: https://arxiv.org/abs/2511.02248
作者: Xingqi Cui,Chieh-Jan Mike Liang,Jiarong Xing,Haoran Qiu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 16 pages, 13 figures
Abstract:Serving large generative models such as LLMs and multimodal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management leads to degraded performance or significant resource underutilization due to poor adaptability to the dynamic inference traffic that is common online. The root cause of this inefficiency lies in the internal structure of generative models: they are executed as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size, sequence length, and traffic rate. This heterogeneity suggests that the operator, rather than the entire model, is the right granularity for scaling decisions. We propose an operator-level autoscaling framework, which allocates resources at finer (operator) granularity, optimizing the scaling, batching, and placement based on individual operator profiles. Evaluated on production-scale traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less energy, or under fixed resources achieves 1.6x higher throughput with 5% less energy. These results show that the operator, rather than the model, is fundamentally a more effective unit for scaling large generative workloads.
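The operator-granularity idea lends itself to a simple illustration. Below is a minimal, hypothetical sizing sketch: the profile fields, the `slo_margin` parameter, and the formula are our assumptions for illustration, not the paper's policy. The point is only that each operator gets its own replica count from its own measured throughput, rather than replicating the whole model as one unit.

```python
import math

def operator_replicas(operators, traffic_rps, slo_margin=1.2):
    """Toy sizing rule: scale each operator independently from its own
    profiled throughput (replicas = ceil(traffic * margin / rps_per_replica))."""
    plan = {}
    for name, profile in operators.items():
        need = traffic_rps * slo_margin / profile["rps_per_replica"]
        plan[name] = max(1, math.ceil(need))
    return plan

# Hypothetical per-operator profiles: heterogeneous throughput yields
# heterogeneous replica counts under the same traffic.
ops = {
    "attention": {"rps_per_replica": 120.0},  # memory-bound, scales first
    "ffn":       {"rps_per_replica": 450.0},
    "lm_head":   {"rps_per_replica": 900.0},
}
print(operator_replicas(ops, traffic_rps=1000))
# {'attention': 10, 'ffn': 3, 'lm_head': 2}
```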
[LG-41] Neural network initialization with nonlinear characteristics and information on spectral bias
链接: https://arxiv.org/abs/2511.02244
作者: Hikaru Homma,Jun Ohkubo
类目: Machine Learning (cs.LG)
*备注: 7 pages, 7 figures
Abstract:Initialization of neural network parameters, such as weights and biases, has a crucial impact on learning performance; if chosen well, we can even avoid the need for additional training with backpropagation. For example, algorithms based on the ridgelet transform or the SWIM (sampling where it matters) concept have been proposed for initialization. On the other hand, it is well known that neural networks tend to learn coarse information in their earlier layers, a phenomenon known as spectral bias. In this work, we investigate the effects of utilizing information on the spectral bias in the initialization of neural networks. Specifically, we propose a framework that adjusts the scale factors in the SWIM algorithm to capture low-frequency components in the early-stage hidden layers and to represent high-frequency components in the late-stage hidden layers. Numerical experiments on a one-dimensional regression task and the MNIST classification task demonstrate that the proposed method outperforms conventional initialization algorithms. This work clarifies the importance of intrinsic spectral properties in training neural networks, and this finding yields an effective parameter initialization strategy that enhances training performance.
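To make the idea concrete, here is a minimal NumPy sketch of depth-dependent initialization scales, assuming a simple linear ramp from small scales (biasing early layers toward low frequencies) to larger ones in late layers. The ramp, the `s_min`/`s_max` parameters, and the Gaussian draw are illustrative assumptions; the paper's actual SWIM-based procedure is more involved.

```python
import numpy as np

def depth_scaled_init(layer_sizes, s_min=0.5, s_max=4.0, rng=None):
    """Toy depth-dependent initialization: weight scales grow linearly
    with depth, so early layers favor smooth (low-frequency) features
    and late layers can represent higher-frequency components."""
    rng = rng or np.random.default_rng(0)
    n_hidden = len(layer_sizes) - 1
    scales = np.linspace(s_min, s_max, n_hidden)
    params = []
    for l, s in enumerate(scales):
        fan_in, fan_out = layer_sizes[l], layer_sizes[l + 1]
        W = rng.normal(0.0, s / np.sqrt(fan_in), size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((W, b))
    return params

params = depth_scaled_init([1, 64, 64, 64, 1])
print([round(float(W.std()), 3) for W, _ in params])  # increasing scales
```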
[LG-42] Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
链接: https://arxiv.org/abs/2511.02237
作者: Costin-Andrei Oncescu,Qingyang Wu,Wai Tong Chung,Robert Wu,Bryan Gopal,Junxiong Wang,Tri Dao,Ben Athiwaratkun
类目: Machine Learning (cs.LG)
*备注: 18 pages, 9 figures, 10 tables
Abstract:An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures where the feed-forward layer is replaced by a pool of experts and each token only activates a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even for moderate batch sizes because the average expert load grows more slowly than in an equivalent dense feedforward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing token-to-expert mapping to lower this number (and thus, the decode latency) while preserving a comparable quality. Our best results use a batch-aware routing that works by having tokens piggyback experts that have already been loaded into memory due to being crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of 16 . Without any statistically significant loss in accuracy, our approach achieves latency reductions of 39% and 15% in the MoE layer decode latency, respectively.
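A rough sketch of the batch-aware idea follows, under our own simplifying assumptions: the `slack` heuristic and the slot-wise greedy swap are illustrative, not the paper's exact routing rule. Each token keeps its top-1 expert, and its remaining slots prefer experts already activated elsewhere in the batch when the router score is close enough, shrinking the set of distinct experts that must be loaded.

```python
import torch

def batch_aware_route(scores, k=2, slack=0.1):
    """Toy batch-aware re-routing: scores is (batch, n_experts) router
    logits. Tokens piggyback experts already loaded by the batch when
    the score gap to their own choice is within `slack`."""
    B, E = scores.shape
    top = scores.topk(k, dim=-1)
    loaded = torch.zeros(E, dtype=torch.bool)
    loaded[top.indices[:, 0]] = True          # top-1 experts are always loaded
    routes = top.indices.clone()
    for t in range(B):
        thresh = top.values[t, -1] - slack    # k-th best score minus slack
        for slot in range(1, k):
            e = routes[t, slot].item()
            if loaded[e]:
                continue
            # candidate loaded experts whose score clears the threshold
            cand = torch.where(loaded & (scores[t] >= thresh))[0]
            cand = cand[~torch.isin(cand, routes[t])]
            if len(cand) > 0:
                routes[t, slot] = cand[scores[t, cand].argmax()]
            else:
                loaded[e] = True              # must load this expert after all
    return routes

scores = torch.randn(16, 8)
routes = batch_aware_route(scores)
print(routes.unique().numel(), "distinct experts activated")
```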
[LG-43] Learning Interactive World Model for Object-Centric Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2511.02225
作者: Fan Feng,Phillip Lippe,Sara Magliacane
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025
Abstract:Agents that understand objects and their interactions can learn policies that are more robust and transferable. However, most object-centric RL methods factor state by individual objects while leaving interactions implicit. We introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a unified framework that learns structured representations of both objects and their interactions within a world model. FIOC-WM captures environment dynamics with disentangled and modular representations of object interactions, improving sample efficiency and generalization for policy learning. Concretely, FIOC-WM first learns object-centric latents and an interaction structure directly from pixels, leveraging pre-trained vision encoders. The learned world model then decomposes tasks into composable interaction primitives, and a hierarchical policy is trained on top: a high level selects the type and order of interactions, while a low level executes them. On simulated robotic and embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and generalization over world-model baselines, indicating that explicit, modular interaction learning is crucial for robust control.
[LG-44] PrivGNN: High-Performance Secure Inference for Cryptographic Graph Neural Networks
链接: https://arxiv.org/abs/2511.02185
作者: Fuyi Wang,Zekai Chen,Mingyuan Fan,Jianying Zhou,Lei Pan,Leo Yu Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to FC’25
Abstract:Graph neural networks (GNNs) are powerful tools for analyzing and learning from graph-structured (GS) data, facilitating a wide range of services. Deploying such services in privacy-critical cloud environments necessitates the development of secure inference (SI) protocols that safeguard sensitive GS data. However, existing SI solutions largely focus on convolutional models for image and text data, leaving the challenge of securing GNNs and GS data relatively underexplored. In this work, we design, implement, and evaluate PrivGNN, a lightweight cryptographic scheme for graph-centric inference in the cloud. By hybridizing additive and function secret sharings within secure two-party computation (2PC), PrivGNN is carefully designed based on a series of novel 2PC interactive protocols that achieve $1.5\times \sim 1.7\times$ speedups for linear layers and $2\times \sim 15\times$ for non-linear layers over state-of-the-art (SotA) solutions. A thorough theoretical analysis is provided to prove PrivGNN's correctness, security, and lightweight nature. Extensive experiments across four datasets demonstrate PrivGNN's superior efficiency with $1.3\times \sim 4.7\times$ faster secure predictions while maintaining accuracy comparable to plaintext graph property inference.
[LG-45] Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs
链接: https://arxiv.org/abs/2511.02168
作者: Octavian Alexandru Trifan,Karthik Sangaiah,Muhammad Awad,Muhammad Osama,Sumanth Gudaparthi,Alexandru Nicolau,Alexander Veidenbaum,Ganesh Dasika
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel (BSP) model used in such settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. By exploiting libraries like Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by creating direct, tile-level producer-consumer pipelines and replacing global barriers with fine-grained dataflow synchronization. Applying this methodology to critical kernels, from the foundational All-Gather + general matrix multiplication operation to the complex Flash Decode algorithm, we observe a 10-20% speedup in end-to-end latency over BSP-based approaches, establishing a more programmable and efficient paradigm for distributed LLM workloads.
[LG-46] ProtoTSNet: Interpretable Multivariate Time Series Classification With Prototypical Parts
链接: https://arxiv.org/abs/2511.02152
作者: Bartłomiej Małkus,Szymon Bobek,Grzegorz J. Nalepa
类目: Machine Learning (cs.LG)
*备注: 30 pages, 10 figures
Abstract:Time series data is one of the most popular data modalities in critical domains such as industry and medicine. The demand for algorithms that not only exhibit high accuracy but also offer interpretability is crucial in such fields, as decisions made there bear significant consequences. In this paper, we present ProtoTSNet, a novel approach to interpretable classification of multivariate time series data, through substantial enhancements to the ProtoPNet architecture. Our method is tailored to overcome the unique challenges of time series analysis, including capturing dynamic patterns and handling varying feature significance. Central to our innovation is a modified convolutional encoder utilizing group convolutions, pre-trainable as part of an autoencoder and designed to preserve and quantify feature importance. We evaluated our model on 30 multivariate time series datasets from the UEA archive, comparing our approach with existing explainable methods as well as non-explainable baselines. Through comprehensive evaluation and ablation studies, we demonstrate that our approach achieves the best performance among ante-hoc explainable methods while maintaining competitive performance with non-explainable and post-hoc explainable approaches, providing interpretable results accessible to domain experts.
[LG-47] CFL: On the Use of Characteristic Function Loss for Domain Alignment in Machine Learning
链接: https://arxiv.org/abs/2511.02148
作者: Abdullah Almansour,Ozan Tonguz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine Learning (ML) models are extensively used in various applications due to their significant advantages over traditional learning methods. However, deployed ML models often underperform in the real world due to the well-known distribution shift problem, which can lead to catastrophic outcomes when these decision-making systems operate in high-risk applications. Researchers have previously quantified distribution shift using statistical techniques such as the Kullback-Leibler divergence, the Kolmogorov-Smirnov test, and the Wasserstein distance. In this letter, we show that the Characteristic Function (CF), as a frequency-domain approach, is a powerful alternative for measuring distribution shift in high-dimensional spaces and for domain adaptation.
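As a concrete illustration, a CF-based discrepancy can be approximated by comparing empirical characteristic functions at random frequencies. The Gaussian frequency-sampling scheme and `scale` parameter below are assumptions for illustration, not necessarily the paper's construction.

```python
import numpy as np

def cf_distance(X, Y, n_freq=256, scale=1.0, rng=None):
    """Empirical characteristic-function discrepancy: compares
    phi(t) = E[exp(i t.x)] of two samples at random frequencies
    t ~ N(0, scale^2 I), averaging the squared CF difference."""
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    T = rng.normal(0.0, scale, size=(n_freq, d))
    phi_x = np.exp(1j * X @ T.T).mean(axis=0)   # (n_freq,) empirical CF of X
    phi_y = np.exp(1j * Y @ T.T).mean(axis=0)
    return np.mean(np.abs(phi_x - phi_y) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (1000, 5))
Y = rng.normal(0.5, 1.0, (1000, 5))                 # shifted distribution
print(cf_distance(X, X[:500]), cf_distance(X, Y))   # small vs. large
```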
[LG-48] QuPCG: Quantum Convolutional Neural Network for Detecting Abnormal Patterns in PCG Signals
链接: https://arxiv.org/abs/2511.02140
作者: Yasaman Torabi,Shahram Shirani,James P. Reilly
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Early identification of abnormal physiological patterns is essential for the timely detection of cardiac disease. This work introduces a hybrid quantum-classical convolutional neural network (QCNN) designed to classify S3 and murmur abnormalities in heart sound signals. The approach transforms one-dimensional phonocardiogram (PCG) signals into compact two-dimensional images through a combination of wavelet feature extraction and adaptive threshold compression methods. We compress the cardiac-sound patterns into an 8-pixel image so that only 8 qubits are needed for the quantum stage. Preliminary results on the HLS-CMDS dataset demonstrate 93.33% classification accuracy on the test set and 97.14% on the train set, suggesting that quantum models can efficiently capture temporal-spectral correlations in biomedical signals. To our knowledge, this is the first application of a QCNN algorithm for bioacoustic signal processing. The proposed method represents an early step toward quantum-enhanced diagnostic systems for resource-constrained healthcare environments.
[LG-49] Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects
链接: https://arxiv.org/abs/2511.02132
作者: Mansi Choudhary,Karthik Sangaiah,Sonali Singh,Muhammad Osama,Lisa Wu Wills,Ganesh Dasika
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 11 pages, 14 figures
Abstract:The rise of disaggregated AI GPUs has exposed a critical bottleneck in large-scale attention workloads: non-uniform memory access (NUMA). As multi-chiplet designs become the norm for scaling compute capabilities, memory latency and bandwidth vary sharply across compute regions, undermining the performance of traditional GPU kernel scheduling strategies that assume uniform memory access. We identify how these NUMA effects distort locality in multi-head attention (MHA) and present Swizzled Head-first Mapping, a spatially-aware scheduling strategy that aligns attention heads with GPU NUMA domains to exploit intra-chiplet cache reuse. On AMD’s MI300X architecture, our method achieves up to 50% higher performance over state-of-the-art attention algorithms using conventional scheduling techniques and sustains consistently high L2 cache hit rates of 80-97%. These results demonstrate that NUMA-aware scheduling is now fundamental to achieving full efficiency on next-generation disaggregated GPUs, offering a path forward for scalable AI training and inference.
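The scheduling idea can be conveyed with a toy mapping: pin all work blocks of one attention head to one NUMA domain so that head's KV tiles stay resident in that domain's L2. The names and granularity below are assumptions for illustration, not the actual MI300X kernel.

```python
def swizzled_head_first(n_heads, n_domains, blocks_per_head):
    """Toy head-first, NUMA-aware work mapping: instead of striding
    work-blocks round-robin across the whole device, all blocks of a
    given attention head land in a single NUMA domain."""
    schedule = {d: [] for d in range(n_domains)}
    for h in range(n_heads):
        d = h % n_domains                # swizzle heads across domains
        schedule[d].extend((h, b) for b in range(blocks_per_head))
    return schedule

for dom, work in swizzled_head_first(n_heads=8, n_domains=4,
                                     blocks_per_head=3).items():
    print(f"domain {dom}: {work}")      # (head, block) pairs per domain
```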
[LG-50] Variance-Aware Feel-Good Thompson Sampling for Contextual Bandits NEURIPS2025
链接: https://arxiv.org/abs/2511.02123
作者: Xuheng Li,Quanquan Gu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 19 pages, 2 figures, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Variance-dependent regret bounds have received increasing attention in recent studies on contextual bandits. However, most of these studies are focused on upper confidence bound (UCB)-based bandit algorithms, while sampling-based bandit algorithms such as Thompson sampling are still understudied. The only exception is the LinVDTS algorithm (Xu et al., 2023), which is limited to linear reward functions and whose regret bound is not optimal with respect to the model dimension. In this paper, we present FGTS-VA, a variance-aware Thompson sampling algorithm for contextual bandits with general reward functions and an optimal regret bound. At the core of our analysis is an extension of the decoupling coefficient, a technique commonly used in the analysis of Feel-Good Thompson Sampling (FGTS) that reflects the complexity of the model space. With the new decoupling coefficient denoted by $\mathrm{dc}$, FGTS-VA achieves a regret of $\tilde{O}(\sqrt{\mathrm{dc}\cdot\log|\mathcal{F}|\sum_{t=1}^{T}\sigma_t^2}+\mathrm{dc})$, where $|\mathcal{F}|$ is the size of the model space, $T$ is the total number of rounds, and $\sigma_t^2$ is the subgaussian norm of the noise (e.g., the variance when the noise is Gaussian) at round $t$. In the setting of contextual linear bandits, the regret bound of FGTS-VA matches that of UCB-based algorithms using weighted linear regression (Zhou and Gu, 2022).
[LG-51] Measuring the Intrinsic Dimension of Earth Representations
链接: https://arxiv.org/abs/2511.02101
作者: Arjun Rao,Marc Rußwurm,Konstantin Klemmer,Esther Rolf
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Pre-print. 27 pages, 11 figures, 6 tables
Abstract:Within the context of representation learning for Earth observation, geographic Implicit Neural Representations (INRs) embed low-dimensional location inputs (longitude, latitude) into high-dimensional embeddings, through models trained on geo-referenced satellite, image or text data. Despite the common aim of geographic INRs to distill Earth’s data into compact, learning-friendly representations, we lack an understanding of how much information is contained in these Earth representations, and where that information is concentrated. The intrinsic dimension of a dataset measures the number of degrees of freedom required to capture its local variability, regardless of the ambient high-dimensional space in which it is embedded. This work provides the first study of the intrinsic dimensionality of geographic INRs. Analyzing INRs with ambient dimension between 256 and 512, we find that their intrinsic dimensions fall roughly between 2 and 10 and are sensitive to changing spatial resolution and input modalities during INR pre-training. Furthermore, we show that the intrinsic dimension of a geographic INR correlates with downstream task performance and can capture spatial artifacts, facilitating model evaluation and diagnostics. More broadly, our work offers an architecture-agnostic, label-free metric of information content that can enable unsupervised evaluation, model selection, and pre-training design across INRs.
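Intrinsic dimension can be estimated, for example, with the TwoNN estimator (Facco et al., 2017), which uses only the ratio of each point's two nearest-neighbor distances. The sketch below applies it to embedding vectors; the paper does not necessarily use this exact estimator.

```python
import numpy as np
from scipy.spatial import cKDTree

def two_nn_intrinsic_dimension(X):
    """TwoNN estimator: for each point, mu = r2/r1 (ratio of distances
    to its 2nd and 1st nearest neighbors); the MLE of the intrinsic
    dimension is N / sum(log mu). X: (n_points, ambient_dim)."""
    tree = cKDTree(X)
    d, _ = tree.query(X, k=3)        # column 0 is the point itself
    mu = d[:, 2] / d[:, 1]
    return len(X) / np.log(mu).sum()

rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 256))  # 2D manifold in R^256
print(round(float(two_nn_intrinsic_dimension(Z)), 2))       # close to 2
```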
[LG-52] Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models NEURIPS2025
链接: https://arxiv.org/abs/2511.02077
作者: Jucheng Shen,Yeonju Ro
类目: Machine Learning (cs.LG)
*备注: 7 pages, NeurIPS 2025 Efficient Reasoning Workshop
Abstract:Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.
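A minimal sketch of the calibrate-once idea follows; the per-step quantile parameterization is our assumption rather than OSDT's exact rule. One sequence's confidence trajectory yields one threshold per decoding step, reused verbatim on later inputs in place of a single static cutoff.

```python
import numpy as np

def calibrate_thresholds(conf_traj, quantile=0.5):
    """One-shot calibration: conf_traj is (n_steps, seq_len) per-token
    confidences recorded while decoding a single calibration sequence.
    Returns one threshold per step (here, a per-step quantile)."""
    return np.quantile(conf_traj, quantile, axis=1)   # (n_steps,)

def parallel_unmask(conf_step, threshold):
    """Unmask every still-masked position whose confidence clears this
    step's calibrated threshold; always unmask at least one token."""
    accept = conf_step >= threshold
    if not accept.any():
        accept[conf_step.argmax()] = True
    return accept

rng = np.random.default_rng(0)
calib = rng.beta(5, 2, size=(10, 32))     # stand-in confidence trajectory
thresholds = calibrate_thresholds(calib)
step0 = parallel_unmask(rng.beta(5, 2, 32), thresholds[0])
print(step0.sum(), "tokens unmasked in parallel at step 0")
```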
[LG-53] Solving cold start in news recommendations: a RippleNet-based system for large scale media outlet
链接: https://arxiv.org/abs/2511.02052
作者: Karol Radziszewski,Michał Szpunar,Piotr Ociepka,Mateusz Buczyński
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:We present a scalable recommender system implementation based on RippleNet, tailored for the media domain with a production deployment in this http URL, one of Poland’s largest online media platforms. Our solution addresses the cold-start problem for newly published content by integrating content-based item embeddings into the knowledge propagation mechanism of RippleNet, enabling effective scoring of previously unseen items. The system architecture leverages Amazon SageMaker for distributed training and inference, and Apache Airflow for orchestrating data pipelines and model retraining workflows. To ensure high-quality training data, we constructed a comprehensive golden dataset consisting of user and item features and a separate interaction table, all enabling flexible extensions and integration of new signals.
[LG-54] Finding Probably Approximate Optimal Solutions by Training to Estimate the Optimal Values of Subproblems
链接: https://arxiv.org/abs/2511.02048
作者: Nimrod Megiddo,Segev Wasserkrug,Orit Davidovich,Shimrit Shtern
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper develops a solver for maximizing a real-valued function of binary variables. The solver relies on an algorithm that estimates the optimal objective-function value of instances drawn from the underlying distribution of objectives and of their respective sub-instances. Training of the estimator is based on an inequality that facilitates using the expected total deviation from optimality conditions as a loss function, rather than the objective function itself. Thus, it neither calculates values of policies nor relies on solved instances.
[LG-55] A Dual-Use Framework for Clinical Gait Analysis: Attention-Based Sensor Optimization and Automated Dataset Auditing
链接: https://arxiv.org/abs/2511.02047
作者: Hamidreza Sadeghsalehi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Objective gait analysis using wearable sensors and AI is critical for managing neurological and orthopedic conditions. However, models are vulnerable to hidden dataset biases, and task-specific sensor optimization remains a challenge. We propose a multi-stream attention-based deep learning framework that functions as both a sensor optimizer and an automated data auditor. Applied to the Voisard et al. (2025) multi-cohort gait dataset on four clinical tasks (PD, OA, CVA screening; PD vs CVA differential), the model’s attention mechanism quantitatively discovered a severe dataset confound. For OA and CVA screening, tasks where bilateral assessment is clinically essential, the model assigned more than 70 percent attention to the Right Foot while statistically ignoring the Left Foot (less than 0.1 percent attention, 95 percent CI [0.0-0.1]). This was not a clinical finding but a direct reflection of a severe laterality bias (for example, 15 of 15 right-sided OA) in the public dataset. The primary contribution of this work is methodological, demonstrating that an interpretable framework can automatically audit dataset integrity. As a secondary finding, the model proposes novel, data-driven sensor synergies (for example, Head plus Foot for PD screening) as hypotheses for future optimized protocols.
[LG-56] Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
链接: https://arxiv.org/abs/2511.02043
作者: Bozhi You,Irene Wang,Zelal Su Mustafaoglu,Abhinav Jangda,Angélica Moreira,Roshan Dathathri,Divya Mahajan,Keshav Pingali
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: Submitted to MLSys 2026
Abstract:Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
[LG-57] Predicting Microbial Interactions Using Graph Neural Networks NEURIPS2025
链接: https://arxiv.org/abs/2511.02038
作者: Elham Gholamzadeh,Kajal Singla,Nico Scherf
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 9 pages, 3 figures, NeurIPS 2025 Workshop New Perspectives in Graph Machine Learning
Abstract:Predicting interspecies interactions is a key challenge in microbial ecology, as these interactions are critical to determining the structure and activity of microbial communities. In this work, we used data on monoculture growth capabilities, interactions with other species, and phylogeny to predict a negative or positive effect of interactions. More precisely, we used one of the largest available pairwise interaction datasets to train our models, comprising over 7,500 interactions between 20 species from two taxonomic groups co-cultured under 40 distinct carbon conditions, with a primary focus on the work of Nestor et al. [28]. In this work, we propose Graph Neural Networks (GNNs) as a powerful classifier to predict the direction of the effect. We construct edge-graphs of pairwise microbial interactions in order to leverage shared information across individual co-culture experiments, and use GNNs to predict modes of interaction. Our model can not only predict binary interactions (positive/negative) but also classify more complex interaction types such as mutualism, competition, and parasitism. Our initial results were encouraging, achieving an F1-score of 80.44%. This significantly outperforms comparable methods in the literature, including conventional Extreme Gradient Boosting (XGBoost) models, which reported an F1-score of 72.76%.
[LG-58] Bulk-boundary decomposition of neural networks
链接: https://arxiv.org/abs/2511.02003
作者: Donghee Lee,Hye-Sung Lee,Jaeok Yi
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Phenomenology (hep-ph)
*备注: 6 pages, 2 figures
Abstract:We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a natural extension, we develop a field-theoretic formulation of neural dynamics based on this decomposition.
[LG-59] NeuroClean: A Generalized Machine-Learning Approach to Neural Time-Series Conditioning
链接: https://arxiv.org/abs/2511.01951
作者: Manuel A. Hernandez Alonso,Michael Depass,Stephan Quessy,Numa Dancause,Ignasi Cos
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electroencephalography (EEG) and local field potentials (LFP) are two widely used techniques to record electrical activity from the brain. These signals are used in both the clinical and research domains for multiple applications. However, most brain recordings suffer from a myriad of artifacts and noise sources other than the brain itself. Thus, a major requirement for their use is proper conditioning, which, given current data volumes, must be fully automated. To this end, we introduce an unsupervised, multipurpose EEG/LFP preprocessing method, the NeuroClean pipeline. In addition to being complete and reliable, NeuroClean is an unsupervised series of algorithms intended to mitigate reproducibility issues and biases caused by human intervention. The pipeline is designed as a five-step process, including the common bandpass and line-noise filtering and bad-channel rejection. It also incorporates an efficient independent component analysis with automatic component rejection based on a clustering algorithm. This machine learning classifier ensures that task-relevant information is preserved after each step of the cleaning process. We used several datasets to validate the pipeline. NeuroClean removed several common types of artifacts from the signal. Moreover, in the context of motor tasks of varying complexity, it yielded more than 97% accuracy (vs. a chance level of 33.3%) in an optimized Multinomial Logistic Regression model after cleaning the data, compared to 74% accuracy on the raw data. These results show that NeuroClean is a promising pipeline that can be applied in future studies to achieve better generalization and performance in machine learning pipelines.
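For intuition, a rough pipeline in the spirit of NeuroClean might look as follows. The filters, the variance-based channel rejection, its threshold, and the stubbed component-rejection step are illustrative assumptions, not the released implementation (whose component rejection uses a clustering classifier).

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch
from sklearn.decomposition import FastICA

def neuroclean_like(X, fs, band=(1.0, 40.0), line=50.0, z_thresh=3.0):
    """Sketch of a five-step EEG/LFP conditioning pipeline.
    X: (n_channels, n_samples) raw recording; fs: sampling rate in Hz."""
    # 1) band-pass filter
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    X = filtfilt(b, a, X, axis=1)
    # 2) notch out line noise
    b, a = iirnotch(line, Q=30.0, fs=fs)
    X = filtfilt(b, a, X, axis=1)
    # 3) reject bad channels by log-variance z-score
    v = np.log(X.var(axis=1))
    keep = np.abs((v - v.mean()) / v.std()) < z_thresh
    X = X[keep]
    # 4) ICA decomposition; 5) drop artifact components (placeholder: keep all)
    ica = FastICA(n_components=X.shape[0], random_state=0)
    S = ica.fit_transform(X.T)            # sources, (n_samples, n_components)
    return ica.inverse_transform(S).T, keep

X = np.random.randn(8, 2000)
clean, kept = neuroclean_like(X, fs=250.0)
print(clean.shape, kept)
```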
[LG-60] EchoLSTM: A Self-Reflective Recurrent Network for Stabilizing Long-Range Memory
链接: https://arxiv.org/abs/2511.01950
作者: Prasanth K K,Shubham Sharma
类目: Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, 5 tables
Abstract:Standard Recurrent Neural Networks, including LSTMs, struggle to model long-range dependencies, particularly in sequences containing noisy or misleading information. We propose a new architectural principle, Output-Conditioned Gating, which enables a model to perform self-reflection by modulating its internal memory gates based on its own past inferences. This creates a stabilizing feedback loop that enhances memory retention. Our final model, the EchoLSTM, integrates this principle with an attention mechanism. We evaluate the EchoLSTM on a series of challenging benchmarks. On a custom-designed Distractor Signal Task, the EchoLSTM achieves 69.0% accuracy, decisively outperforming a standard LSTM baseline by 33 percentage points. Furthermore, on the standard ListOps benchmark, the EchoLSTM achieves performance competitive with a modern Transformer model, 69.8% vs. 71.8%, while being over 5 times more parameter-efficient. A final Trigger Sensitivity Test provides qualitative evidence that our model’s self-reflective mechanism leads to a fundamentally more robust memory system.
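One plausible reading of Output-Conditioned Gating, sketched below in PyTorch: the model's previous output modulates the retained cell memory through a learned sigmoid gate, closing the feedback loop from inference back to memory. This is our interpretation for illustration, not the released EchoLSTM (and it omits the paper's attention mechanism).

```python
import torch
import torch.nn as nn

class OutputConditionedLSTMCell(nn.Module):
    """LSTM whose retained memory is gated by a feedback signal
    derived from the previous output y_prev (self-reflection loop)."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hid_dim)
        self.readout = nn.Linear(hid_dim, out_dim)
        self.feedback = nn.Linear(out_dim, hid_dim)  # y_prev -> gate modulation

    def forward(self, x_seq):
        B, T, _ = x_seq.shape
        h = x_seq.new_zeros(B, self.cell.hidden_size)
        c = x_seq.new_zeros(B, self.cell.hidden_size)
        y = x_seq.new_zeros(B, self.readout.out_features)
        outs = []
        for t in range(T):
            h, c = self.cell(x_seq[:, t], (h, c))
            gate = torch.sigmoid(self.feedback(y))   # reflection signal
            c = gate * c                              # modulate retained memory
            y = self.readout(h)
            outs.append(y)
        return torch.stack(outs, dim=1)

model = OutputConditionedLSTMCell(4, 32, 3)
print(model(torch.randn(2, 10, 4)).shape)             # (2, 10, 3)
```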
[LG-61] Learning a Distance for the Clustering of Patients with Amyotrophic Lateral Sclerosis
链接: https://arxiv.org/abs/2511.01945
作者: Guillaume Tejedor, Veronika Peralta (BDTLN), Nicolas Labroche (LIFAT, BDTLN), Patrick Marcel (LIFO, Pamda), Hélène Blasco (UT), Hugo Alarcan (CHRU Tours)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Amyotrophic lateral sclerosis (ALS) is a severe disease with a typical survival of 3-5 years after symptom onset. Current treatments offer only limited life extension, and the variability in patient responses highlights the need for personalized care. However, research is hindered by small, heterogeneous cohorts, sparse longitudinal data, and the lack of a clear definition for clinically meaningful patient clusters. Existing clustering methods remain limited in both scope and number. To address this, we propose a clustering approach that groups sequences using a declarative disease-progression score. Our approach integrates medical expertise through multiple descriptive variables and investigates several distance measures combining such variables, both by reusing off-the-shelf distances and by employing a weakly supervised learning method. We pair these distances with clustering methods and benchmark them against state-of-the-art techniques. The evaluation of our approach on a dataset of 353 ALS patients from the University Hospital of Tours shows that our method outperforms state-of-the-art methods in survival analysis while achieving comparable silhouette scores. In addition, the learned distances enhance the relevance and interpretability of results for medical experts.
[LG-62] Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training
链接: https://arxiv.org/abs/2511.01918
作者: Ahmet Erdem Pamuk,Emir Kaan Özdemir,Şuayp Talha Kocabay
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Accepted at 2025 IEEE International Conference on Quantum Artificial Intelligence (IEEE QAI 2025). This is the accepted version of the paper. The final published version will appear in the IEEE proceedings. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
Abstract:Large language models (LLMs) are increasingly trained with classical optimization techniques like AdamW to improve convergence and generalization. However, the mechanisms by which quantum-inspired methods enhance classical training remain underexplored. We introduce Superpositional Gradient Descent (SGD), a novel optimizer linking gradient updates with quantum superposition by injecting quantum circuit perturbations. We present a mathematical framework and implement hybrid quantum-classical circuits in PyTorch and Qiskit. On synthetic sequence classification and large-scale LLM fine-tuning, SGD converges faster and yields lower final loss than AdamW. Despite promising results, scalability and hardware constraints limit adoption. Overall, this work provides new insights into the intersection of quantum computing and deep learning, suggesting practical pathways for leveraging quantum principles to control and enhance model behavior.
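Since the quantum circuitry is the hard-to-reproduce part, here is a purely classical stand-in that captures the control flow: an AdamW subclass that injects annealed zero-mean perturbations into gradients before each update. The noise shape and the decay schedule are our assumptions; the paper samples its perturbations from quantum circuits via Qiskit rather than from a classical Gaussian.

```python
import torch

class PerturbedAdamW(torch.optim.AdamW):
    """Classical stand-in for Superpositional Gradient Descent:
    gradients are perturbed by annealed zero-mean noise (playing the
    role of quantum-circuit perturbations) before the AdamW update."""
    def __init__(self, params, eps_noise=1e-3, decay=0.999, **kw):
        super().__init__(params, **kw)
        self.eps_noise, self.decay = eps_noise, decay

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.grad.add_(torch.randn_like(p.grad), alpha=self.eps_noise)
        self.eps_noise *= self.decay        # anneal the perturbation term
        return super().step(closure)

w = torch.nn.Linear(10, 1)
opt = PerturbedAdamW(w.parameters(), lr=1e-3)
loss = w(torch.randn(4, 10)).pow(2).mean()
loss.backward()
opt.step()
```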
[LG-63] The Eigenvalues Entropy as a Classifier Evaluation Measure
链接: https://arxiv.org/abs/2511.01904
作者: Doulaye Dembélé
类目: Machine Learning (cs.LG)
*备注:
Abstract:Classification is a machine learning method used in many practical applications: text mining, handwritten character recognition, face recognition, pattern classification, scene labeling, computer vision, and natural language processing. A classifier's prediction results and training-set information are often used to build a contingency table, which quantifies the method's quality through an evaluation measure. Such a measure, typically a numerical value, allows one to choose a suitable method among several. Many evaluation measures available in the literature are less accurate for datasets with imbalanced classes. In this paper, the eigenvalues entropy is used as an evaluation measure for binary and multi-class problems. For a binary problem, relations are given between the eigenvalues and some commonly used measures: the sensitivity, the specificity, the area under the receiver operating characteristic curve, and the Gini index. A by-product of this paper is an estimate of the confusion matrix to deal with the curse of imbalanced classes. Various data examples show the better performance of the proposed evaluation measure over gold-standard measures available in the literature.
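A minimal sketch of an eigenvalues-entropy score computed from a confusion matrix follows. The row normalization and the conversion of eigenvalue magnitudes into a probability vector are our assumptions, not necessarily the paper's exact definition.

```python
import numpy as np

def eigenvalue_entropy(cm):
    """Shannon entropy of the (normalized) eigenvalue magnitudes of a
    row-stochastic confusion matrix. A sharp classifier has all
    eigenvalues near one, giving a near-uniform vector and thus HIGH
    entropy; a noisy classifier spreads eigenvalues toward zero,
    giving LOW entropy under this normalization."""
    P = cm / cm.sum(axis=1, keepdims=True)          # row-stochastic
    lam = np.abs(np.linalg.eigvals(P))
    p = lam / lam.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

good = np.array([[95, 5], [4, 96]])
bad = np.array([[60, 40], [45, 55]])
print(eigenvalue_entropy(good), eigenvalue_entropy(bad))  # ~0.69 vs ~0.39
```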
[LG-64] Learned Cost Model for Placement on Reconfigurable Dataflow Hardware
链接: https://arxiv.org/abs/2511.01872
作者: Etash Guha,Tianxiao Jiang,Andrew Deng,Jian Zhang,Muthu Annamalai
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: 7 pages, 2 figures, 2 tables, DAC Conference style (2022)
Abstract:Mapping the dataflow graph of an ML model onto a reconfigurable system is difficult, as different mappings have different throughputs and consume resources differently. To solve this, a model for evaluating the throughput of mappings is necessary, as measuring throughput exhaustively is expensive. Many works use a hand-designed analytical model relying on proxy features or intuition, which introduces error. We provide a learned approach that predicts throughput 31%-52% more accurately across a variety of graphs. In addition, our approach shows no accuracy degradation after removing performance annotations. We show that using this approach results in 5.6% faster compiled graphs.
[LG-65] Accelerated Frank-Wolfe Algorithms: Complementarity Conditions and Sparsity
链接: https://arxiv.org/abs/2511.02821
作者: Dan Garber
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We develop new accelerated first-order algorithms in the Frank-Wolfe (FW) family for minimizing smooth convex functions over compact convex sets, with a focus on two prominent constraint classes: (1) polytopes and (2) matrix domains given by the spectrahedron and the unit nuclear-norm ball. A key technical ingredient is a complementarity condition that captures solution sparsity: face dimension for polytopes and rank for matrices. We present two algorithms: (1) a purely linear optimization oracle (LOO) method for polytopes that has optimal worst-case first-order (FO) oracle complexity and, aside from a finite burn-in phase and up to a logarithmic factor, has LOO complexity that scales with $r/\sqrt{\epsilon}$, where $\epsilon$ is the target accuracy and $r$ is the solution sparsity (independently of the ambient dimension), and (2) a hybrid scheme that combines FW with a sparse projection oracle (e.g., low-rank SVDs for matrix domains with low-rank solutions), which also has optimal FO oracle complexity and, after a finite burn-in phase, only requires $O(1/\sqrt{\epsilon})$ sparse projections and LOO calls (independently of both the ambient dimension and the rank of optimal solutions). Our results close a gap on how to accelerate recent advancements in linearly-converging FW algorithms for strongly convex optimization, without paying the price of the dimension.
[LG-66] DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications
链接: https://arxiv.org/abs/2511.02754
作者: Zebin Wang,Ziming Gan,Weijing Tang,Zongqi Xia,Tianrun Cai,Tianxi Cai,Junwei Lu
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating superior performance in global representation learning and downstream clinical tasks, including relationship detection, patient phenotyping, and patient clustering. These results highlight a broader potential for statistical inference in federated, high-dimensional settings while addressing the practical challenges of data complexity and multi-institutional integration.
[LG-67] Optimizing Kernel Discrepancies via Subset Selection
链接: https://arxiv.org/abs/2511.02706
作者: Deyao Chen,François Clément,Carola Doerr,Nathan Kirk
类目: Machine Learning (stat.ML); Computational Geometry (cs.CG); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Kernel discrepancies are a powerful tool for analyzing worst-case errors in quasi-Monte Carlo (QMC) methods. Building on recent advances in optimizing such discrepancy measures, we extend the subset selection problem to the setting of kernel discrepancies, selecting an $m$-element subset from a large population of size $n \gg m$. We introduce a novel subset selection algorithm applicable to general kernel discrepancies to efficiently generate low-discrepancy samples both from the uniform distribution on the unit hypercube, the traditional setting of classical QMC, and from more general distributions $F$ with known density functions, by employing the kernel Stein discrepancy. We also explore the relationship between the classical $L_2$ star discrepancy and its $L_\infty$ counterpart.
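For flavor, here is a herding-style greedy heuristic for the subset selection problem under a Gaussian kernel (kernel herding in the sense of Chen, Welling, and Smola). This is a classical stand-in for illustration, not the algorithm proposed in the paper, and the `gamma` bandwidth is an arbitrary choice.

```python
import numpy as np

def herding_subset(X, m, gamma=8.0):
    """Greedily pick m of n points so the subset's kernel mean embedding
    tracks that of the full sample (a proxy for low kernel discrepancy)."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                  # (n, n) Gaussian Gram matrix
    mean_embed = K.mean(axis=1)              # mean map of the full set
    chosen, ksum = [], np.zeros(n)
    for j in range(m):
        score = mean_embed - ksum / (j + 1)  # herding selection rule
        score[chosen] = -np.inf              # no repeats
        i = int(score.argmax())
        chosen.append(i)
        ksum += K[:, i]
    return np.array(chosen)

X = np.random.default_rng(0).uniform(size=(500, 2))
idx = herding_subset(X, 32)
print(len(set(idx.tolist())), "distinct points selected")
```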
[LG-68] RL-Aided Cognitive ISAC: Robust Detection and Sensing-Communication Trade-offs
链接: https://arxiv.org/abs/2511.02672
作者: Adam Umra,Aya M. Ahmed,Aydin Sezgin
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 29 pages, 14 figures. Invited paper, submitted to the EURASIP Journal on Wireless Communications and Networking (JWCN)
Abstract:This paper proposes a reinforcement learning (RL)-aided cognitive framework for massive MIMO-based integrated sensing and communication (ISAC) systems employing a uniform planar array (UPA). The focus is on enhancing radar sensing performance in environments with unknown and dynamic disturbance characteristics. A Wald-type detector is employed for robust target detection under non-Gaussian clutter, while a SARSA-based RL algorithm enables adaptive estimation of target positions without prior environmental knowledge. Based on the RL-derived sensing information, a joint waveform optimization strategy is formulated to balance radar sensing accuracy and downlink communication throughput. The resulting design provides an adaptive trade-off between detection performance and achievable sum rate through an analytically derived closed-form solution. Monte Carlo simulations demonstrate that the proposed cognitive ISAC framework achieves significantly improved detection probability compared to orthogonal and non-learning adaptive baselines, while maintaining competitive communication performance. These results underline the potential of RL-assisted sensing for robust and spectrum-efficient ISAC in next-generation wireless networks.
[LG-69] RIS-Assisted 3D Spherical Splatting for Object Composition Visualization using Detection Transformers
链接: https://arxiv.org/abs/2511.02573
作者: Anastasios T. Sotiropoulos,Stavros Tsimpoukis,Dimitrios Tyrovolas,Sotiris Ioannidis,George K. Karagiannidis,Christos K. Liaskos
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted to IEEE ICC 2026
Abstract:The pursuit of immersive and structurally aware multimedia experiences has intensified interest in sensing modalities that reconstruct objects beyond the limits of visible light. Conventional optical pipelines degrade under occlusion or low illumination, motivating the use of radio-frequency (RF) sensing, whose electromagnetic waves penetrate materials and encode both geometric and compositional information. Yet, uncontrolled multipath propagation restricts reconstruction accuracy. Recent advances in Programmable Wireless Environments (PWEs) mitigate this limitation by enabling software-defined manipulation of propagation through Reconfigurable Intelligent Surfaces (RISs), thereby providing controllable illumination diversity. Building on this capability, this work introduces a PWE-driven RF framework for three-dimensional object reconstruction using material-aware spherical primitives. The proposed approach combines RIS-enabled field synthesis with a Detection Transformer (DETR) that infers spatial and material parameters directly from extracted RF features. Simulation results confirm the framework’s ability to approximate object geometries and classify material composition with an overall accuracy of 79.35%, marking an initial step toward programmable and physically grounded RF-based 3D object composition visualization.
[LG-70] An Adaptive Sampling Framework for Detecting Localized Concept Drift under Label Scarcity
链接: https://arxiv.org/abs/2511.02452
作者: Junghee Pyeon,Davide Cacciarelli,Kamran Paynabar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Concept drift and label scarcity are two critical challenges limiting the robustness of predictive models in dynamic industrial environments. Existing drift detection methods often assume global shifts and rely on dense supervision, making them ill-suited for regression tasks with local drifts and limited labels. This paper proposes an adaptive sampling framework that combines residual-based exploration and exploitation with EWMA monitoring to efficiently detect local concept drift under labeling budget constraints. Empirical results on synthetic benchmarks and a case study on electricity market demonstrate superior performance in label efficiency and drift detection accuracy.
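The EWMA-monitoring component can be illustrated with a textbook control chart over model residuals; the burn-in estimate and the control-limit constants (`lam`, `L`) below are standard defaults, not the paper's tuned values, and the residual-based exploration step is omitted.

```python
import numpy as np

def ewma_drift_monitor(residuals, lam=0.2, L=3.0):
    """Flags the first index where the EWMA of residuals leaves the
    +/- L-sigma control band, signalling possible drift. Uses the
    standard time-varying EWMA control limits."""
    mu, sigma = residuals[:50].mean(), residuals[:50].std()  # burn-in estimate
    z = mu
    for t, r in enumerate(residuals):
        z = lam * r + (1 - lam) * z
        width = L * sigma * np.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * (t + 1))))
        if abs(z - mu) > width:
            return t
    return None

rng = np.random.default_rng(0)
res = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 100)])  # drift at t=200
print("drift flagged at t =", ewma_drift_monitor(res))
```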
[LG-71] Arithmetic Circuits and Neural Networks for Regular Matroids
链接: https://arxiv.org/abs/2511.02406
作者: Christoph Hertrich,Stefan Kober,Georg Loho
类目: Combinatorics (math.CO); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We prove that there exist uniform $(+,\times,/)$-circuits of size $O(n^3)$ to compute the basis generating polynomial of regular matroids on $n$ elements. By tropicalization, this implies that there exist uniform $(\max,+,-)$-circuits and ReLU neural networks of the same size for weighted basis maximization of regular matroids. As a consequence in linear programming theory, we obtain a first example where taking the difference of two extended formulations can be more efficient than the best known individual extended formulation of size $O(n^6)$ by Aprile and Fiorini. Such differences have recently been introduced as virtual extended formulations. The proof of our main result relies on a fine-tuned version of Seymour's decomposition of regular matroids which allows us to identify and maintain graphic substructures to which we can apply a local version of the star-mesh transformation.
[LG-72] A new class of Markov random fields enabling lightweight sampling
链接: https://arxiv.org/abs/2511.02373
作者: Jean-Baptiste Courbot,Hugo Gangloff,Bruno Colicchio
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Computation (stat.CO)
*备注:
Abstract:This work addresses the problem of efficient sampling of Markov random fields (MRF). The sampling of Potts or Ising MRF is most often based on Gibbs sampling, and is thus computationally expensive. We consider in this work how to circumvent this bottleneck through a link with Gaussian Markov Random fields. The latter can be sampled in several cost-effective ways, and we introduce a mapping from real-valued GMRF to discrete-valued MRF. The resulting new class of MRF benefits from a few theoretical properties that validate the new model. Numerical results show the drastic performance gain in terms of computational efficiency, as we sample at least 35x faster than Gibbs sampling using at least 37x less energy, all the while exhibiting empirical properties close to classical MRFs.
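To convey the flavor of mapping a real-valued Gaussian field to discrete labels, here is an illustrative sketch (the paper's exact construction differs): draw a cheaply sampled, spatially correlated field and quantize it through equiprobable thresholds into a Potts-like labeling, with no Gibbs sweeps at all.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lightweight_mrf_sample(shape=(128, 128), sigma=3.0, n_labels=3, rng=None):
    """Sample a correlated Gaussian field (here via Gaussian smoothing
    of white noise) and quantize it into n_labels discrete classes."""
    rng = rng or np.random.default_rng(0)
    g = gaussian_filter(rng.normal(size=shape), sigma)   # correlated field
    g = (g - g.mean()) / g.std()
    # equiprobable thresholds map the field into n_labels classes
    edges = np.quantile(g, np.linspace(0, 1, n_labels + 1)[1:-1])
    return np.digitize(g, edges)

labels = lightweight_mrf_sample()
print(labels.shape, np.bincount(labels.ravel()))   # roughly balanced classes
```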
[LG-73] Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks
链接: https://arxiv.org/abs/2511.02258
作者: Parsa Rangriz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD) for single-layer networks. Building on the seminal work of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits of SGD corresponding to the gradient flow of the population loss, we focus on the critical scaling regime of the step size. Below this critical scale, the effective dynamics are governed by ballistic (ODE) limits, but at the critical scale, a new correction term appears that changes the phase diagram. In this regime, near the fixed points, the corresponding diffusive (SDE) limit of the effective dynamics reduces to an Ornstein-Uhlenbeck process under certain conditions. These results highlight how the information exponent controls sample complexity and illustrate the limitations of deterministic scaling limits in capturing the stochastic fluctuations of high-dimensional learning dynamics.
[LG-74] DoFlow: Causal Generative Flows for Interventional and Counterfactual Time-Series Prediction
链接: https://arxiv.org/abs/2511.02137
作者: Dongze Wu,Feng Qiu,Yao Xie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Time-series forecasting increasingly demands not only accurate observational predictions but also causal forecasting under interventional and counterfactual queries in multivariate systems. We present DoFlow, a flow based generative model defined over a causal DAG that delivers coherent observational and interventional predictions, as well as counterfactuals through the natural encoding and decoding mechanism of continuous normalizing flows (CNFs). We also provide a supporting counterfactual recovery result under certain assumptions. Beyond forecasting, DoFlow provides explicit likelihoods of future trajectories, enabling principled anomaly detection. Experiments on synthetic datasets with various causal DAG and real world hydropower and cancer treatment time series show that DoFlow achieves accurate system-wide observational forecasting, enables causal forecasting over interventional and counterfactual queries, and effectively detects anomalies. This work contributes to the broader goal of unifying causal reasoning and generative modeling for complex dynamical systems.
[LG-75] Data-driven Learning of Interaction Laws in Multispecies Particle Systems with Gaussian Processes: Convergence Theory and Applications
链接: https://arxiv.org/abs/2511.02053
作者: Jinchao Feng,Charles Kulick,Sui Tang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST)
*备注: 40 pages, Appendix 17 pages
Abstract:We develop a Gaussian process framework for learning interaction kernels in multi-species interacting particle systems from trajectory data. Such systems provide a canonical setting for multiscale modeling, where simple microscopic interaction rules generate complex macroscopic behaviors. While our earlier work established a Gaussian process approach and convergence theory for single-species systems, and later extended to second-order models with alignment and energy-type interactions, the multi-species setting introduces new challenges: heterogeneous populations interact both within and across species, the number of unknown kernels grows, and asymmetric interactions such as predator-prey dynamics must be accommodated. We formulate the learning problem in a nonparametric Bayesian setting and establish rigorous statistical guarantees. Our analysis shows recoverability of the interaction kernels, provides quantitative error bounds, and proves statistical optimality of posterior estimators, thereby unifying and generalizing previous single-species theory. Numerical experiments confirm the theoretical predictions and demonstrate the effectiveness of the proposed approach, highlighting its advantages over existing kernel-based methods. This work contributes a complete statistical framework for data-driven inference of interaction laws in multi-species systems, advancing the broader multiscale modeling program of connecting microscopic particle dynamics with emergent macroscopic behavior.
[LG-76] SEAL - A Symmetry EncourAg ing Loss for High Energy Physics
链接: https://arxiv.org/abs/2511.01982
作者: Pradyun Hebbar,Thandikire Madula,Vinicius Mikuni,Benjamin Nachman,Nadav Outmezguine,Inbar Savoray
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:
Abstract:Physical symmetries provide a strong inductive bias for constructing functions to analyze data. In particular, this bias may improve robustness, data efficiency, and interpretability of machine learning models. However, building machine learning models that explicitly respect symmetries can be difficult due to the dedicated components required. Moreover, real-world experiments may not exactly respect fundamental symmetries at the level of finite granularities and energy thresholds. In this work, we explore an alternative approach to create symmetry-aware machine learning models. We introduce soft constraints that allow the model to decide the importance of added symmetries during the learning process instead of enforcing exact symmetries. We investigate two complementary approaches, one that penalizes the model based on specific transformations of the inputs and one inspired by group theory and infinitesimal transformations of the inputs. Using top quark jet tagging and Lorentz equivariance as examples, we observe that the addition of the soft constraints leads to more robust performance while requiring negligible changes to current state-of-the-art models.
[LG-77] Stability of mixed-state phases under weak decoherence
链接: https://arxiv.org/abs/2511.01976
作者: Yifan F. Zhang,Sarang Gopalakrishnan
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 25 pages, 3 figures
Abstract:We prove that the Gibbs states of classical, and commuting-Pauli, Hamiltonians are stable under weak local decoherence: i.e., we show that the effect of the decoherence can be locally reversed. In particular, our conclusions apply to finite-temperature equilibrium critical points and ordered low-temperature phases. In these systems the unconditional spatio-temporal correlations are long-range, and local (e.g., Metropolis) dynamics exhibits critical slowing down. Nevertheless, our results imply the existence of local "decoders" that undo the decoherence, when the decoherence strength is below a critical value. An implication of these results is that thermally stable quantum memories have a threshold against decoherence that remains nonzero as one approaches the critical temperature. Analogously, in diffusion models, stability of data distributions implies the existence of computationally efficient local denoisers in the late-time generation dynamics.
[LG-78] Addressing prior dependence in hierarchical Bayesian modeling for PTA data analysis II: Noise and SGWB inference through parameter decorrelation
链接: https://arxiv.org/abs/2511.01959
作者: Eleonora Villa,Luigi D’Amico,Aldo Barca,Fatima Modica Bittordo,Francesco Alì,Massimo Meneghetti,Luca Naso
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
*备注: 18 pages, 7 figures. Submitted to the Astronomy and Computing special issue HPC in Cosmology and Astrophysics
Abstract:Pulsar Timing Arrays provide a powerful framework to measure low-frequency gravitational waves, but accuracy and robustness of the results are challenged by complex noise processes that must be accurately modeled. Standard PTA analyses assign fixed uniform noise priors to each pulsar, an approach that can introduce systematic biases when combining the array. To overcome this limitation, we adopt a hierarchical Bayesian modeling strategy in which noise priors are parametrized by higher-level hyperparameters. We further address the challenge posed by the correlations between hyperparameters and physical noise parameters, focusing on those describing red noise and dispersion measure variations. To decorrelate these quantities, we introduce an orthogonal reparametrization of the hierarchical model implemented with Normalizing Flows. We also employ i-nessai, a flow-guided nested sampler, to efficiently explore the resulting higher-dimensional parameter space. We apply our method to a minimal 3-pulsar case study, performing a simultaneous inference of noise and SGWB parameters. Despite the limited dataset, the results consistently show that the hierarchical treatment constrains the noise parameters more tightly and partially alleviates the red-noise-SGWB degeneracy, while the orthogonal reparametrization further enhances parameter independence without affecting the correlations intrinsic to the power-law modeling of the physical processes involved.
[LG-79] Improving Bayesian inference in PTA data analysis: importance nested sampling with Normalizing Flows
链接: https://arxiv.org/abs/2511.01958
作者: Eleonora Villa,Golam Mohiuddin Shaifullah,Andrea Possenti,Carmelita Carbone
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
*备注: 37 pages, 7 figures, 3 tables. Submitted to the Astronomy and Computing special issue HPC in Cosmology and Astrophysics
Abstract:We present a detailed study of Bayesian inference workflows for pulsar timing array data, with a focus on enhancing efficiency, robustness, and speed through normalizing-flow-based nested sampling. Building on the Enterprise framework, we integrate the i-nessai sampler and benchmark its performance on realistic, simulated datasets. We analyze its computational scaling and stability, and show that it achieves accurate posteriors and reliable evidence estimates with substantially reduced runtime, by up to three orders of magnitude depending on the dataset configuration, relative to conventional single-core parallel-tempering MCMC analyses. These results highlight the potential of flow-based nested sampling to accelerate PTA analyses while preserving the quality of the inference.
[LG-80] Delta-learned force fields for nonbonded interactions: Addressing the strength mismatch between covalent-nonbonded interaction for global models
链接: https://arxiv.org/abs/2511.01913
作者: Leonardo Cázares-Trejo,Marco Loreto-Silva,Huziel E. Sauceda
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 12 pages, 8 figures
Abstract:Noncovalent interactions (vdW dispersion, hydrogen/halogen bonding, ion-$\pi$, and $\pi$-stacking) govern structure, dynamics, and emergent phenomena in materials and molecular systems, yet accurately learning them alongside covalent forces remains a core challenge for machine-learned force fields (MLFFs). This challenge is acute for global models that use Coulomb-matrix (CM) descriptors compared under Euclidean/Frobenius metrics in multifragment settings. We show that the mismatch between predominantly covalent force labels and the CM's overrepresentation of intermolecular features biases single-model training and degrades force-field fidelity. To address this, we introduce $\Delta$-sGDML, a scale-aware formulation within the sGDML framework that explicitly decouples intra- and intermolecular physics by training fragment-specific models alongside a dedicated binding model, then composing them at inference. Across benzene dimers, host-guest complexes (C$_{60}$@buckycatcher, NO$_3^-$@i-corona[6]arene), benzene-water, and benzene-Na$^+$, $\Delta$-sGDML delivers consistent gains over a single global model, with fragment-resolved force-error reductions of up to 75%, without loss of energy accuracy. Molecular-dynamics simulations further confirm that the $\Delta$-model yields a reliable force field for C$_{60}$@buckycatcher, producing stable trajectories across a wide range of temperatures (10-400 K), unlike the single global model, which loses stability above $\sim$200 K. The method offers a practical route to homogenize per-fragment errors and recover reliable noncovalent physics in global MLFFs.
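A minimal sketch of the $\Delta$-composition at inference follows: fragment-specific models handle covalent (intramolecular) forces, and a dedicated binding model adds the nonbonded (intermolecular) correction. The `predict_*` callables are hypothetical stand-ins for trained sGDML-style regressors.

```python
import numpy as np

def delta_forces(coords_a, coords_b,
                 predict_frag_a, predict_frag_b, predict_binding):
    f_a = predict_frag_a(coords_a)                # fragment A, covalent physics
    f_b = predict_frag_b(coords_b)                # fragment B, covalent physics
    coords_ab = np.vstack([coords_a, coords_b])   # full complex geometry
    f_bind = predict_binding(coords_ab)           # intermolecular correction
    return np.vstack([f_a, f_b]) + f_bind         # composed total forces

# Toy usage with zero "models", checking shapes only: 3-atom and 2-atom fragments.
zeros = lambda c: np.zeros_like(c)
print(delta_forces(np.ones((3, 3)), np.ones((2, 3)), zeros, zeros, zeros).shape)
```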
[LG-81] Affordable EEG, Actionable Insights: An Open Dataset and Evaluation Framework for Epilepsy Patient Stratification
链接: https://arxiv.org/abs/2511.01879
作者: HM Shadman Tabib,Md. Hasnaen Adil,Ayesha Rahman,Ahmmad Nur Swapnil,Maoyejatun Hasana,Ahmed Hossain Chowdhury,A. B. M. Alim Al Islam
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Access to clinical multi-channel EEG remains limited in many regions worldwide. We present NEUROSKY-EPI, the first open dataset of single-channel, consumer-grade EEG for epilepsy, collected in a South Asian clinical setting along with rich contextual metadata. To explore its utility, we introduce EmbedCluster, a patient-stratification pipeline that transfers representations from EEGNet models trained on clinical data and enriches them with contextual autoencoder embeddings, followed by unsupervised clustering of patients based on EEG patterns. Results show that low-cost, single-channel data can support meaningful stratification. Beyond algorithmic performance, we emphasize human-centered concerns such as deployability in resource-constrained environments, interpretability for non-specialists, and safeguards for privacy, inclusivity, and bias. By releasing the dataset and code, we aim to catalyze interdisciplinary research across health technology, human-computer interaction, and machine learning, advancing the goal of affordable and actionable EEG-based epilepsy care.
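A sketch of the EmbedCluster idea under stated assumptions: `eeg_embeddings` stand in for features from a pretrained EEGNet encoder and `ctx_embeddings` for contextual autoencoder embeddings, with one row per patient; both arrays are synthetic placeholders, and the cluster count is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
eeg_embeddings = rng.normal(size=(50, 64))   # (patients, EEGNet features)
ctx_embeddings = rng.normal(size=(50, 16))   # (patients, context features)

# Concatenate modality embeddings, normalize, then cluster patients.
features = np.hstack([eeg_embeddings, ctx_embeddings])
features = StandardScaler().fit_transform(features)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))                   # patient counts per stratum
```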
[LG-82] Effectiveness of High-Dimensional Distance Metrics on Solar Flare Time Series
链接: https://arxiv.org/abs/2511.01873
作者: Elaina Rohlfing,Azim Ahmadzadeh,V Aparna
类目: olar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注:
Abstract:Solar-flare forecasting has been extensively researched yet remains an open problem. In this paper, we investigate the contributions of elastic distance measures for detecting patterns in the solar-flare dataset, SWAN-SF. We employ a simple $k$-medoids clustering algorithm to evaluate the effectiveness of advanced, high-dimensional distance metrics. Our results show that, despite thorough optimization, none of the elastic distances outperform Euclidean distance by a significant margin. We demonstrate that, although elastic measures have shown promise for univariate time series, when applied to the multivariate time series of SWAN-SF, characterized by the high stochasticity of solar activity, they effectively collapse to Euclidean distance. We conduct thousands of experiments and present both quantitative and qualitative evidence supporting this finding.
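The experimental setup can be sketched as $k$-medoids over precomputed distance matrices, once with Euclidean distance on flattened series and once with a textbook O(n^2) multivariate DTW as the elastic measure. Random series stand in for SWAN-SF here; k, lengths, and channel counts are illustrative.

```python
import numpy as np

def dtw(a, b):
    """Textbook multivariate DTW with Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def k_medoids(dist, k, iters=20, seed=0):
    """Plain alternating k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                within = dist[np.ix_(members, members)].sum(axis=1)
                medoids[c] = members[np.argmin(within)]
    return np.argmin(dist[:, medoids], axis=1)

rng = np.random.default_rng(42)
series = [rng.normal(size=(40, 4)) for _ in range(20)]   # (time, channels)
flat = np.array([s.ravel() for s in series])
d_euc = np.linalg.norm(flat[:, None] - flat[None, :], axis=2)
d_dtw = np.array([[dtw(a, b) for b in series] for a in series])

for name, d in [("euclidean", d_euc), ("dtw", d_dtw)]:
    print(name, np.bincount(k_medoids(d, k=3), minlength=3))
```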
[LG-83] BondBERT: What we learn when assigning sentiment in the bond market
链接: https://arxiv.org/abs/2511.01869
作者: Toby Barter,Zheng Gao,Eva Christodoulaki,Jing Chen,John Cartlidge
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures
Abstract:Bond markets respond differently to macroeconomic news compared to equity markets, yet most sentiment models, including FinBERT, are trained primarily on general financial or equity news data. This mismatch is important because bond prices often move in the opposite direction to economic optimism, making general or equity-based sentiment tools potentially misleading. In this paper, we introduce BondBERT, a transformer-based language model fine-tuned on bond-specific news. BondBERT can act as the perception and reasoning component of a financial decision-support agent, providing sentiment signals that integrate with forecasting models. It offers a generalisable framework for adapting transformers to low-volatility, domain-inverse sentiment tasks; to build it, we compile and clean 30,000 UK bond market articles (2018–2025) for training, validation, and testing. We compare BondBERT's sentiment predictions against FinBERT, FinGPT, and Instruct-FinGPT using event-based correlation, up/down accuracy analyses, and LSTM forecasting across ten UK sovereign bonds. We find that BondBERT consistently produces positive correlations with bond returns and achieves higher alignment and forecasting accuracy than the three baseline models, with lower normalised RMSE and a higher information coefficient. These results demonstrate that domain-specific sentiment adaptation better captures fixed-income dynamics, bridging a gap between NLP advances and bond market analytics.
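Domain-specific fine-tuning in this spirit can be sketched with the Hugging Face stack: start from a general financial checkpoint and fine-tune it on bond-news sentiment. The checkpoint name, label scheme, and toy examples below are assumptions for illustration, not the paper's actual data or recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "ProsusAI/finbert"   # assumed starting point, not the paper's
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Toy bond-news examples; note the inverse polarity relative to equities.
data = Dataset.from_dict({
    "text": ["Gilt yields fall as growth outlook dims",
             "Strong GDP print pushes bond prices lower"],
    "label": [0, 1],   # 0 = bond-positive, 1 = bond-negative (illustrative)
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     padding="max_length", max_length=64),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bondbert", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=data,
)
trainer.train()
```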
[LG-84] Condition-Invariant fMRI Decoding of Speech Intelligibility with Deep State Space Model
链接: https://arxiv.org/abs/2511.01868
作者: Ching-Chih Sung,Shuntaro Suzuki,Francis Pingfan Chien,Komei Sugiura,Yu Tsao
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:
Abstract:Clarifying the neural basis of speech intelligibility is critical for computational neuroscience and digital speech processing. Recent neuroimaging studies have shown that intelligibility modulates cortical activity beyond simple acoustics, primarily in the superior temporal and inferior frontal gyri. However, previous studies have been largely confined to clean speech, leaving it unclear whether the brain employs condition-invariant neural codes across diverse listening environments. To address this gap, we propose a novel architecture built upon a deep state space model for decoding intelligibility from fMRI signals, specifically tailored to their high-dimensional temporal structure. We present the first attempt to decode intelligibility across acoustically distinct conditions, showing our method significantly outperforms classical approaches. Furthermore, region-wise analysis highlights contributions from auditory, frontal, and parietal regions, and cross-condition transfer indicates the presence of condition-invariant neural codes, thereby advancing understanding of abstract linguistic representations in the brain.
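As a rough illustration of the building block behind deep state space models, here is a minimal discrete linear SSM layer, h_{t+1} = A h_t + B u_t with readout y_t = C h_t, written as a sequential scan for clarity. The paper's actual architecture is not specified here; dimensions and initialization are illustrative.

```python
import torch
import torch.nn as nn

class SSMLayer(nn.Module):
    def __init__(self, d_input, d_state):
        super().__init__()
        self.A = nn.Parameter(0.9 * torch.eye(d_state))          # stable init
        self.B = nn.Parameter(torch.randn(d_state, d_input) * 0.1)
        self.C = nn.Parameter(torch.randn(d_input, d_state) * 0.1)

    def forward(self, u):                      # u: (batch, time, d_input)
        h = torch.zeros(u.size(0), self.A.size(0), device=u.device)
        ys = []
        for t in range(u.size(1)):             # sequential scan for clarity
            h = h @ self.A.T + u[:, t] @ self.B.T
            ys.append(h @ self.C.T)
        return torch.stack(ys, dim=1)          # (batch, time, d_input)

x = torch.randn(2, 50, 8)                      # e.g., ROI time series per scan
print(SSMLayer(8, 32)(x).shape)
```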
[LG-85] Learning phases with Quantum Monte Carlo simulation cell
链接: https://arxiv.org/abs/2503.23098
作者: Amrita Ghosh,Mugdha Sarkar,Ying-Jer Kao,Pochung Chen
类目: rongly Correlated Electrons (cond-mat.str-el); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 22 figures, updated to published version
Abstract:We propose the use of the "spin-opstring", derived from Stochastic Series Expansion Quantum Monte Carlo (QMC) simulations, as machine learning (ML) input data. It offers a compact, memory-efficient representation of QMC simulation cells, combining the initial state with an operator string that encodes the state's evolution through imaginary time. Using supervised ML, we demonstrate the input's effectiveness in capturing both conventional and topological phase transitions, and in a regression task to predict non-local observables. We also demonstrate the capability of spin-opstring data in transfer learning by training models on one quantum system and successfully predicting on another, as well as showing that models trained on smaller system sizes generalize well to larger ones. Importantly, we illustrate a clear advantage of spin-opstring over conventional spin configurations in the accurate prediction of a quantum phase transition. Finally, we show how the inherent structure of spin-opstring provides an elegant framework for the interpretability of ML predictions. Using two state-of-the-art interpretability techniques, Layer-wise Relevance Propagation and SHapley Additive exPlanations, we show that the ML models learn and rely on physically meaningful features from the input data. Together, these findings establish the spin-opstring as a broadly-applicable and interpretable input format for ML in quantum many-body physics.
信息检索
[IR-0] Relational Deep Dive: Error-Aware Queries Over Unstructured Data
链接: https://arxiv.org/abs/2511.02711
作者: Daren Chao,Kaiwen Chen,Naiqing Guan,Nick Koudas
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:
Abstract:Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD’s effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD’s modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.
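A hedged sketch of statistically calibrated error flagging in the spirit of SCAPE: choose a confidence threshold on a held-out calibration split so that flagging low-confidence extractions catches a target fraction of true errors. The scores and labels below are synthetic, and SCAPE's actual statistic and guarantee may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
conf = rng.uniform(size=2000)                        # extractor confidence per cell
is_error = rng.uniform(size=2000) < (1 - conf) ** 2  # errors cluster at low confidence

cal, test = slice(0, 1000), slice(1000, None)
target_recall = 0.95

# Threshold = conformal-style quantile over confidences of true calibration
# errors, so that flagging conf <= tau catches ~95% of them.
err_conf = np.sort(conf[cal][is_error[cal]])
idx = min(int(np.ceil(target_recall * (len(err_conf) + 1))) - 1, len(err_conf) - 1)
tau = err_conf[idx]

flagged = conf[test] <= tau
coverage = (flagged & is_error[test]).sum() / max(is_error[test].sum(), 1)
print(f"tau={tau:.3f}, fraction of test errors flagged={coverage:.2%}")
```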
[IR-1] Average Precision at Cutoff k under Random Rankings: Expectation and Variance
链接: https://arxiv.org/abs/2511.02571
作者: Tetiana Manzhos,Tetiana Ianevych,Olga Melnyk
类目: Information Retrieval (cs.IR); Probability (math.PR)
*备注: 17 pages, 2 tables, 2 figures
Abstract:Recommender systems and information retrieval platforms rely on ranking algorithms to present the most relevant items to users, thereby improving engagement and satisfaction. Assessing the quality of these rankings requires reliable evaluation metrics. Among them, Mean Average Precision at cutoff k (MAP@k) is widely used, as it accounts for both the relevance of items and their positions in the list. In this paper, the expectation and variance of Average Precision at k (AP@k) are derived, since they can serve as baselines for MAP@k. We cover two widely used evaluation models: offline and online. The expectation establishes the baseline, indicating the level of MAP@k that can be achieved by pure chance. The variance complements this baseline by quantifying the extent of random fluctuations, enabling a more reliable interpretation of observed scores.
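The random-ranking baseline is easy to check numerically: shuffle a fixed relevance vector many times and estimate E[AP@k] and Var[AP@k] by Monte Carlo. Conventions for AP@k vary; the sketch below averages precision-at-hit values and normalizes by min(k, R) with R relevant items, which is one common choice and not necessarily the paper's.

```python
import numpy as np

def ap_at_k(rels, k, num_relevant):
    top = np.asarray(rels[:k], dtype=float)
    precisions = np.cumsum(top) / np.arange(1, k + 1)
    return (precisions * top).sum() / min(k, num_relevant)

rng = np.random.default_rng(0)
n_items, n_rel, k, n_trials = 100, 10, 10, 100_000
base = np.zeros(n_items)
base[:n_rel] = 1.0                       # 10 relevant items among 100

samples = np.array([ap_at_k(rng.permutation(base), k, n_rel)
                    for _ in range(n_trials)])
print(f"E[AP@{k}] ~ {samples.mean():.4f}, Var[AP@{k}] ~ {samples.var():.6f}")
```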
[IR-2] Library and Culture: A Scientometric Analysis and Visualization of Research Trends
链接: https://arxiv.org/abs/2511.02296
作者: Auwalu Abdullahi Umar,Muneer Ahmad,Dr M Sadik Batcha
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注: 8 pages, 3 figures, Research Article
[IR-3] Research Output on Alopecia Areata Disease: A Scientometric Analysis of Publications from 2010 to 2019
链接: https://arxiv.org/abs/2511.02275
作者: Muneer Ahmad,M Sadik Batcha
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注: 16 pages, 3 figures, Research Paper
Abstract:The present study investigates global publication trends on Alopecia Areata disease during 2010-2019. The study mainly focuses on the distribution of research output, top journals, most prolific authors, authorship patterns, and citation patterns. The results indicate that the highest growth rate of publications occurred in 2019. Columbia University led all institutions. Most publications had more than four authors. Christiano AM and Clynes R were found to be the most prolific authors. It was also found that most of the prolific authors (by number of publications) appear in the highly cited publications list. Alopecia Areata researchers mostly preferred journal articles to communicate their findings.
[IR-4] KGBridge: Knowledge-Guided Prompt Learning for Non-overlapping Cross-Domain Recommendation
链接: https://arxiv.org/abs/2511.02181
作者: Yuhan Wang,Qing Xie,Zhifeng Bao,Mengzi Tang,Lin Li,Yongjian Liu
类目: Information Retrieval (cs.IR)
*备注: 13 pages, 4 figures
Abstract:Knowledge Graphs (KGs), as structured knowledge bases that organize relational information across diverse domains, provide a unified semantic foundation for cross-domain recommendation (CDR). By integrating symbolic knowledge with user-item interactions, KGs enrich semantic representations, support reasoning, and enhance model interpretability. Despite this potential, existing KG-based methods still face major challenges in CDR, particularly under non-overlapping user scenarios. These challenges arise from: (C1) sensitivity to KG sparsity and popularity bias, (C2) dependence on overlapping users for domain alignment and (C3) lack of explicit disentanglement between transferable and domain-specific knowledge, which limit effective and stable knowledge transfer. To this end, we propose KGBridge, a knowledge-guided prompt learning framework for cross-domain sequential recommendation under non-overlapping user scenarios. KGBridge comprises two core components: a KG-enhanced Prompt Encoder, which models relation-level semantics as soft prompts to provide structured and dynamic priors for user sequence modeling (addressing C1), and a Two-stage Training Paradigm, which combines cross-domain pretraining and privacy-preserving fine-tuning to enable knowledge transfer without user overlap (addressing C2). By combining relation-aware semantic control with correspondence-driven disentanglement, KGBridge explicitly separates and balances domain-shared and domain-specific semantics, thereby maintaining complementarity and stabilizing adaptation during fine-tuning (addressing C3). Extensive experiments on benchmark datasets demonstrate that KGBridge consistently outperforms state-of-the-art baselines and remains robust under varying KG sparsity, highlighting its effectiveness in mitigating structural imbalance and semantic entanglement in KG-enhanced cross-domain recommendation.
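Relation-level soft prompts of the kind described can be sketched as learned prompt vectors, one bank per KG relation, prepended to the user's item sequence before a standard sequence encoder. Dimensions, layer counts, and names below are illustrative assumptions, not KGBridge's actual architecture.

```python
import torch
import torch.nn as nn

class SoftPromptEncoder(nn.Module):
    def __init__(self, n_relations, d_model, n_prompt=4):
        super().__init__()
        # One bank of n_prompt learnable vectors per KG relation.
        self.prompts = nn.Parameter(torch.randn(n_relations, n_prompt, d_model) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, item_emb, relation_id):
        # item_emb: (batch, seq, d_model); relation_id: (batch,)
        p = self.prompts[relation_id]                # (batch, n_prompt, d_model)
        return self.encoder(torch.cat([p, item_emb], dim=1))

enc = SoftPromptEncoder(n_relations=10, d_model=64)
out = enc(torch.randn(2, 5, 64), torch.tensor([1, 3]))
print(out.shape)                                     # (2, 4 + 5, 64)
```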
[IR-5] Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion
链接: https://arxiv.org/abs/2511.02113
作者: Hai-Dang Kieu,Min Xu,Thanh Trung Huynh,Dung D. Le
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent advances in multimodal recommendation (MMR) have shown that incorporating rich content sources such as images and text can lead to significant gains in representation quality. However, existing methods often rely on coarse visual features and uncontrolled fusion, leading to redundant or misaligned representations. As a result, visual encoders often fail to capture salient, item-relevant semantics, limiting their contribution in multimodal fusion. From an information-theoretic perspective, effective fusion should balance the unique, shared, and redundant information across modalities, preserving complementary cues while avoiding correlation bias. This paper presents VLIF, a vision-language and information-theoretic fusion framework that enhances multimodal recommendation through two key components. (i) A VLM-based visual enrichment module generates fine-grained, title-guided descriptions to transform product images into semantically aligned representations. (ii) An information-aware fusion module, inspired by Partial Information Decomposition (PID), disentangles redundant and synergistic signals across modalities for controlled integration. Experiments on three Amazon datasets demonstrate that VLIF consistently outperforms recent multimodal baselines and substantially strengthens the contribution of visual features.
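A loose, PID-inspired sketch of information-aware fusion: a learned gate mixes text and image embeddings, while a simple decorrelation penalty discourages relying on purely redundant modality signals. This is an illustrative stand-in, not VLIF's actual objective or architecture.

```python
import torch
import torch.nn as nn

d = 64
gate = nn.Linear(2 * d, d)

def fuse(text_z, img_z):
    g = torch.sigmoid(gate(torch.cat([text_z, img_z], dim=-1)))
    return g * text_z + (1 - g) * img_z            # per-dimension modality mixing

def redundancy_penalty(text_z, img_z, eps=1e-6):
    # Squared per-dimension correlation across the batch: high when the two
    # modalities carry the same (redundant) signal.
    t = (text_z - text_z.mean(0)) / (text_z.std(0) + eps)
    v = (img_z - img_z.mean(0)) / (img_z.std(0) + eps)
    return ((t * v).mean(0) ** 2).mean()

text_z, img_z = torch.randn(32, d), torch.randn(32, d)
fused = fuse(text_z, img_z)
print(fused.shape, redundancy_penalty(text_z, img_z).item())
```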



