This post contains the latest paper listings fetched from Arxiv.org on 2025-11-07, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org and updated automatically around 12:00 every morning.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.


Overview (2025-11-07)

450 papers were updated today, including:

  • Natural Language Processing: 67 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 125 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 77 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 137 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks

【Quick Read】: This paper addresses the inability of large language models (LLMs) to verify their own chain-of-thought (CoT) reasoning: even when they reach correct answers, the underlying reasoning chain may contain logical flaws, undermining trust in high-stakes scenarios. The key to the solution is VeriCoT, a neuro-symbolic method that formalizes each CoT reasoning step into first-order logic and identifies the premises that ground the argument, drawn from the source text, commonsense knowledge, or prior reasoning steps. This symbolic representation enables automated logical solvers to verify validity, while the natural-language premises let humans and systems identify ungrounded or fallacious reasoning steps, enabling reliable verification and improvement of the reasoning process.

Link: https://arxiv.org/abs/2511.04662
Authors: Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala
Affiliations: University of Pennsylvania; Amazon Web Services
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT’s verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.
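
To give a concrete picture of the kind of check VeriCoT describes (formalize a step, then ask a solver whether the premises entail it), here is a minimal entailment check with the Z3 solver. The toy premises and propositional encoding are invented stand-ins, not the paper's pipeline.

```python
# pip install z3-solver
from z3 import And, Bool, Implies, Not, Solver, unsat

# Invented formalization of one CoT step:
#   P1: Alice is a doctor.   P2: If Alice is a doctor, she holds a degree.
#   Step to verify: Alice holds a degree.
alice_is_doctor = Bool("alice_is_doctor")
alice_holds_degree = Bool("alice_holds_degree")

premises = And(alice_is_doctor, Implies(alice_is_doctor, alice_holds_degree))
conclusion = alice_holds_degree

# premises entail conclusion  iff  (premises AND NOT conclusion) is unsatisfiable
s = Solver()
s.add(premises, Not(conclusion))
print("reasoning step is valid:", s.check() == unsat)
```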

[NLP-1] Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning NEURIPS2025

【Quick Read】: This paper targets the computational waste in chain-of-thought (CoT) prompting caused by generating fixed-length rationales, which inflates both token usage and latency. The key to the solution is LEASH (Logit-Entropy Adaptive Stopping Heuristic), a training-free decoding algorithm that monitors two intrinsic signals, the slope of token-level entropy and the improvement in the top-logit margin, and adaptively halts rationale generation once both plateau, substantially reducing computational cost while broadly preserving model performance.

Link: https://arxiv.org/abs/2511.04654
Authors: Mohammad Atif Quamar, Mohammad Areeb
Affiliations: Purdue University
Categories: Computation and Language (cs.CL)
Comments: Presented at the 1st Workshop on Efficient Reasoning (NeurIPS 2025)

Abstract:Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30–35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
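
The stopping rule itself is simple to sketch: track token-level entropy and the top-logit margin during decoding and halt once both signals plateau. The window size, threshold, and the synthetic logit stream below are assumptions for illustration, not the paper's settings.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def plateaued(values, window=8, eps=1e-2):
    """True once the mean of the last window barely differs from the window before."""
    if len(values) < 2 * window:
        return False
    recent = sum(values[-window:]) / window
    previous = sum(values[-2 * window:-window]) / window
    return abs(recent - previous) < eps

# Synthetic stand-in for a model's per-step logits: the runner-up logit decays,
# so both signals settle once the "reasoning" stabilizes.
random.seed(0)
stream = [[5.0, 1.0 + 3.0 * 0.7 ** t + random.gauss(0, 0.005), 0.0] for t in range(64)]

entropies, margins = [], []
for t, logits in enumerate(stream):
    probs = softmax(logits)
    top1, top2 = sorted(logits, reverse=True)[:2]
    entropies.append(entropy(probs))
    margins.append(top1 - top2)                      # top-logit margin
    if plateaued(entropies) and plateaued(margins):
        print(f"LEASH-style stop at decoding step {t}")
        break
```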

[NLP-2] DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration

【Quick Read】: This paper addresses trajectory-level coordination failures in cooperative multi-agent planning, where partial observability and limited communication mean that small deviations in timing or movement cascade into conflicts that block collective progress. The key to the solution is DR. WELL, a decentralized neuro-symbolic framework built on a two-phase negotiation protocol: agents first propose candidate roles with reasoning, then commit to a joint allocation under consensus and environment constraints, after which each agent independently generates and executes a symbolic plan for its role without sharing detailed trajectories. A shared world model encoding the current state is updated as agents act, so agents reason over symbolic plans rather than align raw trajectories; this avoids brittle low-level synchronization and yields strategies that are reusable, synchronizable, and interpretable, with experiments showing improved task completion rates and efficiency.

Link: https://arxiv.org/abs/2511.04646
Authors: Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong
Affiliations: University of Southern California; Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Cooperative multi-agent planning requires agents to make joint decisions with partial information and limited communication. Coordination at the trajectory level often fails, as small deviations in timing or movement cascade into conflicts. Symbolic planning mitigates this challenge by raising the level of abstraction and providing a minimal vocabulary of actions that enable synchronization and collective progress. We present DR. WELL, a decentralized neurosymbolic framework for cooperative multi-agent planning. Cooperation unfolds through a two-phase negotiation protocol: agents first propose candidate roles with reasoning and then commit to a joint allocation under consensus and environment constraints. After commitment, each agent independently generates and executes a symbolic plan for its role without revealing detailed trajectories. Plans are grounded in execution outcomes via a shared world model that encodes the current state and is updated as agents act. By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids brittle step-level alignment and enables higher-level operations that are reusable, synchronizable, and interpretable. Experiments on cooperative block-push tasks show that agents adapt across episodes, with the dynamic world model capturing reusable patterns and improving task completion and efficiency through negotiation and self-refinement, trading a time overhead for evolving, more efficient collaboration strategies.

[NLP-3] When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection

【Quick Read】: This paper tackles the computational inefficiency and hallucination risks of deploying generative AI fact-verification systems in practice. Existing approaches rely on large language models (LLMs) to generate explanatory rationales, which perform well but carry significant computational overhead and reliability issues. The key to the solution is DeReC (Dense Retrieval Classification), a lightweight framework whose core idea is to replace autoregressive LLM generation with general-purpose text embeddings, combining dense retrieval with a specialized classifier for efficient and accurate fact verification. Experiments show that DeReC matches or exceeds LLM-based methods while reducing runtime by up to 95%, demonstrating stronger practicality and scalability.

Link: https://arxiv.org/abs/2511.04643
Authors: Alamgir Munir Qazi, John P. McCrae, Jamal Abdul Nasir
Affiliations: University of Galway
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
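
The retrieve-then-classify recipe is straightforward to sketch. Below is a minimal pipeline assuming the sentence-transformers and scikit-learn libraries; the embedding model, the three-sentence "corpus", and the logistic-regression head are illustrative placeholders, not the paper's configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

evidence_corpus = [
    "The city reported 120 mm of rain in October.",
    "The mayor announced a new transit line in 2023.",
    "Unemployment fell to 4.1% in the last quarter.",
]
corpus_emb = encoder.encode(evidence_corpus, normalize_embeddings=True)

def features(claim, k=2):
    """Embed the claim, retrieve top-k evidence, return a joint feature vector."""
    q = encoder.encode([claim], normalize_embeddings=True)[0]
    top = np.argsort(-(corpus_emb @ q))[:k]          # cosine-similarity ranking
    evidence = corpus_emb[top].mean(axis=0)          # pooled evidence embedding
    return np.concatenate([q, evidence])

# Toy labels (1 = supported, 0 = not supported) to make the pipeline end-to-end.
X = np.stack([features("It rained 120 mm in October."),
              features("Unemployment rose sharply last quarter.")])
y = np.array([1, 0])
clf = LogisticRegression().fit(X, y)
print(clf.predict([features("October rainfall reached 120 mm.")]))
```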

[NLP-4] Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis

【Quick Read】: This paper addresses query ambiguity in natural language interfaces to tabular data. Whereas traditional approaches treat ambiguity as a deficiency, this work reframes it as a feature of cooperative interaction in which the user and the system share responsibility for query specification. The key to the solution is a principled framework distinguishing cooperative queries (those that yield a resolvable interpretation) from uncooperative queries (those that cannot be resolved). Applying the framework to the queries in 15 popular table question-answering datasets reveals an uncontrolled mixing of query types in existing evaluations, adequate neither for assessing execution accuracy nor for assessing interpretation capabilities. This shift in perspective, from eliminating ambiguity to embracing cooperation, provides a new theoretical and practical foundation for designing and evaluating natural language interfaces for tabular data.

Link: https://arxiv.org/abs/2511.04584
Authors: Daniel Gomm, Cornelius Wolff, Madelon Hulsebos
Affiliations: Centrum Wiskunde & Informatica; University of Amsterdam
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Human-Computer Interaction (cs.HC)
Comments: Accepted to the AI for Tabular Data workshop at EurIPS 2025

Abstract:Natural language interfaces to tabular data must handle ambiguities inherent to queries. Instead of treating ambiguity as a deficiency, we reframe it as a feature of cooperative interaction, where the responsibility of query specification is shared among the user and the system. We develop a principled framework distinguishing cooperative queries, i.e., queries that yield a resolvable interpretation, from uncooperative queries that cannot be resolved. Applying the framework to evaluations for tabular question answering and analysis, we analyze the queries in 15 popular datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system’s execution accuracy nor for evaluating interpretation capabilities. Our framework and analysis of queries shifts the perspective from fixing ambiguity to embracing cooperation in resolving queries. This reflection enables more informed design and evaluation for natural language interfaces for tabular data, for which we outline implications and directions for future research.

[NLP-5] Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

【Quick Read】: This paper addresses the insufficient trustworthiness and sustainability of current AI Scientist systems in scientific discovery, with the goal of ensuring that AI-driven research does not compromise the integrity of the academic ecosystem. The key to the solution is Jr. AI Scientist, an autonomous AI scientist system that mimics the core research workflow of a novice student researcher: starting from an analysis of a baseline paper's limitations, it formulates hypotheses for improvement, validates them through rigorous experimentation, and writes a paper with the results. Unlike previous approaches that assume full automation or operate only on small-scale code, the system follows a well-defined research workflow and leverages modern coding agents to handle complex multi-file implementations, leading to scientifically valuable contributions.

Link: https://arxiv.org/abs/2511.04583
Authors: Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa
Affiliations: The University of Tokyo
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Issues, comments, and questions are all welcome in this https URL

Abstract:Understanding the current capabilities and risks of AI Scientist systems is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: Given the baseline paper from the human mentor, it analyzes its limitations, formulates novel hypotheses for improvement, validates them through rigorous experimentation, and writes a paper with the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. For evaluation, we conducted automated assessments using AI Reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven scientific contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores than existing fully automated systems. Nevertheless, we identify important limitations from both the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report various risks identified during development. We hope these insights will deepen understanding of current progress and risks in AI Scientist development.

[NLP-6] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

【Quick Read】: This paper targets the limitations of the current "Thinking with Text" and "Thinking with Images" paradigms in multimodal models: (1) images capture only static moments and cannot represent dynamic processes or continuous change, and (2) text and vision remain separate modalities, hindering unified multimodal understanding and generation. The key to the solution is the new "Thinking with Video" paradigm, which leverages video generation models such as Sora-2 to unify visual and textual reasoning in a single temporal framework. The authors build the VideoThinkBench benchmark to validate the paradigm; results show Sora-2 is comparable to, and on some vision-centric tasks surpasses, state-of-the-art vision-language models (VLMs), and reaches high accuracy on text-centric tasks (e.g., 92% on MATH, 75.53% on MMMU), suggesting that video generation models have the potential to become unified multimodal understanding and generation models.

Link: https://arxiv.org/abs/2511.04570
Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 36 pages, 14 figures

Abstract:“Thinking with Text” and “Thinking with Images” paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce “Thinking with Video”, a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings demonstrate that the video generation model is a potential unified multimodal understanding and generation model, positioning “thinking with video” as a unified multimodal reasoning paradigm.

[NLP-7] BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering

【Quick Read】: This paper aims to improve the low accuracy of biomedical question answering (QA) in low-resource languages such as Bangla, making medical knowledge more accessible and reliable in those languages. The key to the solution is the construction of BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical multiple-choice datasets, and a systematic evaluation of several retrieval-augmented generation (RAG) strategies on them. The most novel component is an Agentic RAG pipeline that dynamically decides whether to invoke retrieval or reasoning strategies, combining textbook text with web-retrieved information for generative reasoning; it achieves the highest accuracy of 89.54% with the GPT-OSS-120B model, significantly outperforming other configurations and validating the effectiveness of RAG methods for improving Bangla medical QA accuracy.

Link: https://arxiv.org/abs/2511.04560
Authors: Sadia Sultana, Saiyma Sittul Muna, Mosammat Zannatul Samarukh, Ajwad Abrar, Tareque Mohmud Chowdhury
Affiliations: IUT-Dhaka
Categories: Computation and Language (cs.CL)
Comments: Under Review

Abstract:Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy 89.54% with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.

[NLP-8] From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting

【Quick Read】: This paper addresses the problem of large language models (LLMs) generating vulnerable code in software development: even the latest open-weight LLMs can still produce the earliest reported vulnerability types in realistic use settings, suggesting that the safety-functionality trade-off has so far prevented effective mitigation of such risks. The key to the solution is a new severity metric, Prompt Exposure (PE), which combines vulnerability severity, generation probability, and characteristics of the prompt that induces vulnerable code, better reflecting the actual risk of LLM-generated vulnerabilities. Building on PE, the Model Exposure (ME) score quantifies the severity and prevalence of the vulnerabilities a model generates, to guide mitigation of the most dangerous and frequent vulnerabilities.

Link: https://arxiv.org/abs/2511.04538
Authors: Cyril Vallez, Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic
Affiliations: IEM, HES-SO Valais-Wallis, Switzerland; II, HES-SO Valais-Wallis, Switzerland; Cyber-Defence Campus, armasuisse, Switzerland
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As the role of Large Language Models (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios in a realistic use setting, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces vulnerable code generation - Prompt Exposure (PE). To encourage the mitigation of the most serious and prevalent vulnerabilities, we use PE to define the Model Exposure (ME) score, which indicates the severity and prevalence of vulnerabilities a model generates.

[NLP-9] IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection AAAI

【Quick Read】: This paper addresses the lack of deep structural and logical analysis in traditional automated essay scoring systems for argumentative essays, which typically output only a score without explaining it and thus offer little support for fine-grained assessment and improvement of argument quality. The key to the solution is IntelliProof, which models an argumentative essay as an argumentation graph with claims as nodes, evidence attached as node properties, and supporting or attacking relations as edges. An LLM initially classifies and scores each relation, and the system combines visualization with natural-language justifications to provide interpretable analysis of argumentative coherence, improving users' understanding of a text's structural semantics while retaining human oversight.

Link: https://arxiv.org/abs/2511.04528
Authors: Kaveh Eskandari Miandoab, Katharine Kowalyshyn, Kabir Pamnani, Anesu Gavhera, Vasanth Sarathy, Matthias Scheutz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted for the 40th Annual AAAI Conference on Artificial Intelligence (2026) - Demonstration Track

Abstract:We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user’s understanding of a given text. A live demo and the system are available here to try: this https URL

[NLP-10] Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics

【Quick Read】: This paper asks how to quantify the diversity of reasoning paths available to a generative language model during chain-of-thought reasoning, given the uncertainty of token selection. The key to the solution is using hidden activations to control and predict the model's uncertainty during reasoning. Experiments show that hidden activations are most predictive of, and most able to steer, uncertainty before the model has committed to a final answer (i.e., while multiple reasoning paths remain available), and that these activations can predict the model's future outcome distribution, indicating that models implicitly represent the space of possible reasoning paths.

Link: https://arxiv.org/abs/2511.04527
Authors: Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model’s uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model – in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model’s future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
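
The probing methodology here is easy to illustrate: fit a linear probe from hidden activations to an uncertainty target and measure held-out fit. The sketch below uses synthetic activations and a synthetic, linearly generated target purely to show the mechanics; real experiments would record actual activations and per-token entropies from the model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 256, 2000
w_true = rng.normal(size=d_model)

H = rng.normal(size=(n, d_model))                         # stand-in hidden states
uncertainty = H @ w_true + rng.normal(scale=0.1, size=n)  # stand-in target

H_tr, H_te, y_tr, y_te = train_test_split(H, uncertainty, random_state=0)
probe = Ridge(alpha=1.0).fit(H_tr, y_tr)
print("held-out R^2 of the uncertainty probe:", round(probe.score(H_te, y_te), 3))
```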

[NLP-11] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways

【Quick Read】: This paper addresses the modeling of uncertainty in radiology reports, which comes in two forms. Explicit uncertainty appears when radiologists use hedging phrases to express doubt about the presence or absence of findings; the meaning of these phrases depends on context, so rule-based systems cannot quantify them. Implicit uncertainty arises because reports omit parts of the reasoning and record only key findings, making it unclear whether unmentioned findings are truly absent or simply omitted for brevity. The key to the solution is a two-part framework: explicit uncertainty is quantified by building an expert-validated, LLM-based reference ranking of common hedging phrases and mapping each finding to a corresponding probability value, and implicit uncertainty is modeled by an expansion framework that systematically adds characteristic sub-findings following expert-defined diagnostic pathways for 14 common diagnoses. The result is Lunguage++, an expanded, uncertainty-aware benchmark of fine-grained structured radiology reports that enables uncertainty-aware image classification, faithful diagnostic reasoning, and new research into the clinical impact of diagnostic uncertainty.

Link: https://arxiv.org/abs/2511.04506
Authors: Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon, Thomas Demeester, Edward Choi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.
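
As a concrete picture of the explicit-uncertainty step, the sketch below maps hedging phrases to probability values via a lookup table. The phrases and values are invented placeholders; the paper derives its mapping from an expert-validated, LLM-based reference ranking rather than a hand-written table.

```python
# Hypothetical hedging-phrase-to-probability table (illustrative values only).
HEDGE_TO_PROB = {
    "consistent with": 0.85,
    "probable": 0.75,
    "suspicious for": 0.60,
    "possible": 0.40,
    "cannot exclude": 0.25,
    "unlikely": 0.10,
}

def finding_probability(sentence: str, default: float = 0.95) -> float:
    """Return the probability for a stated finding given its hedging phrase."""
    s = sentence.lower()
    # Match longer phrases first so "cannot exclude" wins over substrings.
    for phrase, p in sorted(HEDGE_TO_PROB.items(), key=lambda kv: -len(kv[0])):
        if phrase in s:
            return p
    return default  # unhedged statements treated as near-certain

print(finding_probability("Opacity suspicious for pneumonia."))  # -> 0.6
```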

[NLP-12] RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

【Quick Read】: This paper addresses the difficulty of reliably evaluating retrieval-augmented generation (RAG) systems in specialized, safety-critical domains. Existing evaluation frameworks rely on heuristic metrics that miss domain-specific nuances, while LLM-as-a-Judge approaches lack validated alignment with human judgment. The key to the solution is RAGalyst, an automated, human-aligned agentic evaluation framework: it generates high-quality synthetic question-answering (QA) datasets with an agentic filtering step to ensure data fidelity, and refines two core LLM-as-a-Judge metrics (Answer Correctness and Answerability) via prompt optimization to correlate strongly with human annotations. Applying the framework across three domains (military operations, cybersecurity, and bridge engineering) shows that performance is highly context-dependent, with no universally optimal embedding model, LLM, or hyperparameter configuration, underscoring the need for systematic evaluation when building reliable RAG systems.

Link: https://arxiv.org/abs/2511.04502
Authors: Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere
Affiliations: University of Houston
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances and other works utilize LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics-Answer Correctness and Answerability-using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis on the most common low Answer Correctness reasons in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our Github.

[NLP-13] Large language models replicate and predict human cooperation across experiments in game theory

【Quick Read】: This paper addresses the gap between how large language models (LLMs) simulate social decision-making and how humans actually behave, a gap that can cause harmful outcomes in applications and limits the usefulness of LLMs for social simulation. The key to the solution is a digital twin of game-theoretic experiments together with a systematic prompting and probing framework for machine-behavioral evaluation. Testing three open-source models (Llama, Mistral, and Qwen) shows that Llama reproduces human cooperation patterns with high fidelity, capturing deviations from rational choice theory, while Qwen aligns closely with Nash equilibrium predictions. Crucially, population-level behavioral replication is achieved without persona-based prompting, simplifying the simulation process, and the framework extends beyond the original experimental parameter grid to generate preregistered, testable hypotheses for novel game configurations, offering a complementary paradigm for systematically exploring uncovered experimental spaces in the social and behavioral sciences.

Link: https://arxiv.org/abs/2511.04500
Authors: Andrea Cera Palatsi, Samuel Martin-Gutierrez, Ana S. Cardenal, Max Pellert
Affiliations: Barcelona Supercomputing Center; Universidad Politécnica de Madrid; Universitat Oberta de Catalunya
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:Large language models (LLMs) are increasingly used both to make decisions in domains such as health, education and law, and to simulate human behavior. Yet how closely LLMs mirror actual human decision-making remains poorly understood. This gap is critical: misalignment could produce harmful outcomes in practical applications, while failure to replicate human behavior renders LLMs ineffective for social simulations. Here, we address this gap by developing a digital twin of game-theoretic experiments and introducing a systematic prompting and probing framework for machine-behavioral evaluation. Testing three open-source models (Llama, Mistral and Qwen), we find that Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory, while Qwen aligns closely with Nash equilibrium predictions. Notably, we achieved population-level behavioral replication without persona-based prompting, simplifying the simulation process. Extending beyond the original human-tested games, we generate and preregister testable hypotheses for novel game configurations outside the original parameter grid. Our findings demonstrate that appropriately calibrated LLMs can replicate aggregate human behavioral patterns and enable systematic exploration of unexplored experimental spaces, offering a complementary approach to traditional research in the social and behavioral sciences that generates new empirical predictions about human social decision-making.

[NLP-14] Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering AACL2025

【Quick Read】: This paper addresses the predictability and controllability of personality-like behavior in large language models (LLMs) for human-centered applications, in support of responsible development and deployment. The key to the solution is a systematic application of the Big Five Inventory-2 (BFI-2) framework to quantify the trait expressions of six mainstream LLMs under varying sampling temperatures. The study finds that Neuroticism and Extraversion are sensitive to temperature changes, and hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles, offering a new data-driven perspective on model tuning, selection, and the ethical governance of AI systems.

Link: https://arxiv.org/abs/2511.04499
Authors: Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou
Affiliations: Université Paris 8
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at IJCNLP-AACL 2025

Abstract:As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: this https URL

[NLP-15] OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation EMNLP2025

【Quick Read】: This paper addresses readability-controlled text simplification, whose core challenge is to simplify a text from a higher CEFR (Common European Framework of Reference for Languages) level to a target CEFR level while preserving meaning. The key to the solution is the finding that simplification performance is strongly related to the gap between the source and target CEFR levels, which motivates two multi-round simplification methods generated via GPT-4o: rule-based multi-round simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). MRS-Joint, which takes LLM-generated simplified candidates as the starting point, further boosts multi-round simplification performance.

Link: https://arxiv.org/abs/2511.04495
Authors: Cuong Huynh, Jie Cao
Affiliations: University of Oklahoma
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to TSAR 2025 Workshop at EMNLP 2025

Abstract:This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task (Alva-Manchego et al., 2025), designed for readability-controlled text simplification using LLM-prompting-based generation. Based on the analysis of prompt-based text simplification methods, we discovered an interesting finding that text simplification performance is highly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods and generate them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7 out of 20 teams. Later improvements with MRS-Joint show that taking the LLM simplified candidates as the starting point could further boost the multi-round simplification performance.

[NLP-16] RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables

【Quick Read】: This paper addresses a common limitation of existing tabular reasoning benchmarks: most evaluate models only on small, homogeneous tables, failing to reflect the complexity of real-world data and giving an incomplete view of large language models' (LLMs) reasoning abilities. The key contribution is RUST-BENCH, a benchmark of 7,966 questions over 2,031 real-world tables from two domains, science (NSF grant records) and sports (NBA statistics), that jointly raises difficulty along table length, structural heterogeneity, domain specificity, and multi-hop reasoning. It enables the first joint evaluation of LLMs across scale, heterogeneity, domain adaptation, and reasoning depth, revealing systematic weaknesses of current models on complex heterogeneous schemas and deep multi-hop inference, and establishing a more challenging testbed for tabular reasoning research.

Link: https://arxiv.org/abs/2511.04491
Authors: Nikhil Abhyankar, Purvi Chaurasia, Sanchit Kabra, Ananya Srivastava, Vivek Gupta, Chandan K. Reddy
Affiliations: Virginia Tech; IGDTUW New Delhi; Arizona State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models’ (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.

[NLP-17] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai AACL2025

【Quick Read】: This paper addresses the lack of evaluation of vision-language models (VLMs) on Thai text-rich visual understanding tasks, a gap that is especially acute for low-resource languages and complex scripts where document-structure understanding is required. The key to the solution is ThaiOCRBench, the first comprehensive benchmark of its kind for Thai, comprising 2,808 human-annotated samples across 13 task categories, with systematic zero-shot evaluation of a wide range of state-of-the-art VLMs. The benchmark reveals a significant gap between proprietary and open-source models, and through fine-grained error analysis identifies key challenges such as language bias, structural mismatch, and hallucinated content, providing a standardized evaluation framework and actionable directions for improving Thai document understanding.

Link: https://arxiv.org/abs/2511.04479
Authors: Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul
Affiliations: SCB 10X R&D, SCB 10X, SCBX Group
Categories: Computation and Language (cs.CL)
Comments: Accepted at the IJCNLP-AACL 2025 (Main)

Abstract:We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.

[NLP-18] Probabilistic Textual Time Series Depression Detection

【Quick Read】: This paper addresses the lack of uncertainty estimation and temporal modeling in depression severity prediction, which limits the accuracy and interpretability of clinical decision support systems. The key to the solution is PTTSD, a Probabilistic Textual Time Series Depression Detection framework that predicts PHQ-8 scores from utterance-level clinical interviews while modeling uncertainty over time: it combines bidirectional LSTMs, self-attention, and residual connections with Gaussian or Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC and DAIC-WOZ, it achieves state-of-the-art performance among text-only systems and produces well-calibrated prediction intervals, improving clinical usability and interpretability.

Link: https://arxiv.org/abs/2511.04476
Authors: Fabian Schmidt, Seyedehmoniba Ravan, Vladimir Vlassov
Affiliations: KTH Royal Institute of Technology; Uppsala University
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 8 figures, 4 tables

Abstract:Accurate and interpretable predictions of depression severity are essential for clinical decision support, yet existing models often lack uncertainty estimates and temporal modeling. We propose PTTSD, a Probabilistic Textual Time Series Depression Detection framework that predicts PHQ-8 scores from utterance-level clinical interviews while modeling uncertainty over time. PTTSD includes sequence-to-sequence and sequence-to-one variants, both combining bidirectional LSTMs, self-attention, and residual connections with Gaussian or Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC and DAIC-WOZ, PTTSD achieves state-of-the-art performance among text-only systems (e.g., MAE = 3.85 on E-DAIC, 3.55 on DAIC) and produces well-calibrated prediction intervals. Ablations confirm the value of attention and probabilistic modeling, while comparisons with MentalBERT establish generality. A three-part calibration analysis and qualitative case studies further highlight the interpretability and clinical relevance of uncertainty-aware forecasting.
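
The two ingredients the abstract names, a bidirectional LSTM encoder and a Gaussian output head trained by negative log-likelihood, fit in a short PyTorch sketch. The dimensions and the omission of attention and residual connections are simplifications for brevity, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GaussianHeadLSTM(nn.Module):
    def __init__(self, d_in=768, d_hid=128):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_hid, batch_first=True, bidirectional=True)
        self.mu = nn.Linear(2 * d_hid, 1)
        self.log_sigma = nn.Linear(2 * d_hid, 1)

    def forward(self, x):                      # x: (batch, utterances, d_in)
        h, _ = self.lstm(x)
        h_last = h[:, -1]                      # sequence-to-one: last time step
        return self.mu(h_last).squeeze(-1), self.log_sigma(h_last).squeeze(-1)

def gaussian_nll(mu, log_sigma, y):
    # -log N(y | mu, sigma^2), up to the additive constant
    return (log_sigma + 0.5 * ((y - mu) / log_sigma.exp()) ** 2).mean()

model = GaussianHeadLSTM()
x = torch.randn(4, 20, 768)                    # 4 interviews, 20 utterance embeddings
y = torch.tensor([5.0, 12.0, 3.0, 17.0])       # PHQ-8 scores
mu, log_sigma = model(x)
loss = gaussian_nll(mu, log_sigma, y)
loss.backward()
print(float(loss))
```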

[NLP-19] Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLM s

【Quick Read】: This paper addresses the lack of challenging question-answering datasets with ground-truth retrieval targets for knowledge-graph (KG) augmented large language models (LLMs), which hinders meaningful comparison and optimization of KG retrieval methods. The key to the solution is SynthKGQA, a framework that automatically generates high-quality synthetic KG question-answering datasets from any knowledge graph, providing the full set of ground-truth facts to reason over for each question, thus enabling more informative benchmarking and better training of KG retrievers. Applying the framework to Wikidata produces GTSQA, a new dataset for testing the zero-shot generalization of KG retrievers to unseen graph structures and relation types.

Link: https://arxiv.org/abs/2511.04473
Authors: Alberto Cattaneo, Carlo Luschi, Daniel Justus
Affiliations: Graphcore Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval of information from graph-structured knowledge bases represents a promising direction for improving the factuality of LLMs. While various solutions have been proposed, a comparison of methods is difficult due to the lack of challenging QA datasets with ground-truth targets for graph retrieval. We present SynthKGQA, a framework for generating high-quality synthetic Knowledge Graph Question Answering datasets from any Knowledge Graph, providing the full set of ground-truth facts in the KG to reason over each question. We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset designed to test zero-shot generalization abilities of KG retrievers with respect to unseen graph structures and relation types, and benchmark popular solutions for KG-augmented LLMs on it.

[NLP-20] If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLM s

【Quick Read】: This paper evaluates the temporal reasoning ability of large language models (LLMs): can they answer questions grounded in the background knowledge of a particular historical period? The key to the solution is an experimental setup based on trivia questions from a 1940 Norwegian book, prompting LLMs to answer as if it were 1940, with grading by LLM-as-judge plus sampled checks by a native speaker, and systematic comparison across model families (DeepSeek-R1, Gemma3, Qwen3, Llama3.1, and a model crafted especially for Norwegian) and prompt languages (English vs. Norwegian). Unexpectedly, prompting in English consistently outperformed prompting in Norwegian despite the localized setting, while using larger models improved temporal reasoning accuracy.

Link: https://arxiv.org/abs/2511.04432
Authors: Lars Bungum, Charles Yijia Huang, Abeer Kashar
Affiliations: NTNU; University of Waterloo
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 1 figure, 3 tables, submitted to a conference

Abstract:In this study, we experiment with the ability of LLMs to do temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We also pose the questions in both English and Norwegian. Correct answers are often presented as sentences, and grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than in Norwegian, an unexpected result. In contrast, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, and also the largest available LLM especially crafted for Norwegian.

[NLP-21] The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity

【Quick Read】: This paper addresses the significant degradation of existing uncertainty quantification (UQ) methods for large language models (LLMs) under the inherent ambiguity (aleatoric uncertainty) of real-world language: current UQ methods are mostly benchmarked on unambiguous tasks, yet real language data is often semantically ambiguous, and there these methods fail. The key contribution is the construction of MAQA* and AmbigQA*, the first ambiguous question-answering datasets with ground-truth answer distributions estimated from factual co-occurrence, enabling the first systematic evaluation of LLM UQ under ambiguity. The study finds consistent degradation across estimation paradigms (the predictive distribution itself, internal representations, and model ensembles), and shows theoretically that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity, revealing a core shortcoming of current LLM UQ methods and motivating a rethinking of current modeling paradigms for reliable uncertainty estimation.

Link: https://arxiv.org/abs/2511.04418
Authors: Tim Tomov, Dominik Fuchsgruber, Tom Wollschläger, Stephan Günnemann
Affiliations: Technical University of Munich
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.

[NLP-22] Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

【Quick Read】: This paper addresses data quality and effective data selection for fine-tuning machine translation models: how to pick the most valuable data points from a large corpus to improve performance. The key to the solution is a data selection method in which a learner model and a pre-trained reference model jointly define a learnability score that systematically evaluates each data point's training utility, combined with a batch selection strategy that accounts for interdependencies among data points. This reduces the amount of training data while improving training efficiency and generalization; experiments show up to a fivefold improvement in data efficiency over an iid baseline and a 24% reduction in computation when using cached embeddings.

Link: https://arxiv.org/abs/2511.04406
Authors: Mohammad Amin Ghanizadeh, Mohammad Javad Dousti
Affiliations: University of Tehran
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by 24% when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to a random selection baseline.
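
A common way to instantiate such a learner/reference score is the gap between the learner's loss and the reference model's loss on each candidate (RHO-style selection). The sketch below uses that formulation as an assumption, since the abstract does not spell out the exact formula, and reduces the paper's dependency-aware batch rule to a simple top-k selection.

```python
import torch

def select_batch(examples, learner_loss, reference_loss, k):
    """Pick the k examples with the largest learner-minus-reference loss gap:
    hard for the current learner, but known to be learnable by the reference."""
    learnability = learner_loss - reference_loss
    top = torch.topk(learnability, k).indices
    return [examples[i] for i in top]

examples = [f"sentence-pair-{i}" for i in range(8)]           # hypothetical data
learner_loss = torch.tensor([2.1, 0.3, 1.7, 0.9, 2.8, 0.4, 1.1, 2.0])
reference_loss = torch.tensor([0.9, 0.2, 1.5, 0.8, 0.7, 0.5, 1.0, 0.6])
print(select_batch(examples, learner_loss, reference_loss, k=3))
```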

[NLP-23] SSPO: Subsentence-level Policy Optimization

【Quick Read】: This paper addresses two core problems of current reinforcement learning from verifiable reward (RLVR) algorithms in LLM post-training. GRPO (Group Relative Policy Optimization) uses token-level importance ratios, making policy updates unstable and sensitive to outliers, which can cause training collapse. GSPO (Group Sequence Policy Optimization) mitigates the high variance with response-level importance ratios, but because all tokens in a response share one ratio, extreme values can cause the clipping mechanism to discard entire responses, lowering the utilization of sampled data. The key to the solution is SSPO, which uses sentence-level importance ratios to strike a balance between GRPO and GSPO, avoiding both token-level training instability and the wholesale discarding of responses. In addition, sentence entropy is used to dynamically adjust the PPO-CLIP clipping bounds, encouraging exploration on high-entropy tokens while narrowing the range for low-entropy tokens, yielding the best average score across benchmark datasets.

Link: https://arxiv.org/abs/2511.04256
Authors: Kun Yang, Zikang Chen, Yanmeng Wang, Zhigen Li
Affiliations: Ping An Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As a significant part of post-training of the Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs’ reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses more on optimizing a single token. This will be easily affected by outliers, leading to model training collapse. GSPO proposed the calculation of the response level importance ratio, which solves the problem of high variance and training noise accumulation in the calculation of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded, resulting in a decrease in the utilization of sampled data. This paper introduces SSPO, which applies sentence-level importance ratio, taking the balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents the whole response tokens from being abandoned by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrow the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and wins state-of-the-art performance on three datasets. These results highlight SSPO’s effectiveness in leveraging generated data by taking the essence of GSPO but rejecting its shortcomings.
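
To make the granularity difference concrete, here is a schematic PyTorch comparison of token-level (GRPO-style), sentence-level (the SSPO idea), and response-level (GSPO-style) importance ratios computed from per-token log-probabilities. The sentence segmentation, toy numbers, and geometric-mean formulation are illustrative assumptions; the full SSPO objective, including entropy-adaptive clip bounds, is not reproduced here.

```python
import torch

logp_new = torch.tensor([-0.2, -1.5, -0.3, -0.4, -2.0, -0.1])  # policy log-probs
logp_old = torch.tensor([-0.3, -0.4, -0.3, -0.5, -0.6, -0.2])  # sampler log-probs
sent_id  = torch.tensor([0, 0, 0, 1, 1, 1])                    # two "sentences"

token_ratio = (logp_new - logp_old).exp()            # GRPO-style, one per token

def sentence_ratios(lp_new, lp_old, sent_id):
    """Length-normalized (geometric-mean) ratio per sentence."""
    return torch.stack([((lp_new[sent_id == s] - lp_old[sent_id == s]).mean()).exp()
                        for s in sent_id.unique()])

response_ratio = (logp_new - logp_old).mean().exp()  # GSPO-style, whole response

print("token-level:   ", token_ratio)
print("sentence-level:", sentence_ratios(logp_new, logp_old, sent_id))
print("response-level:", response_ratio)
```

A single outlier token (like the -2.0 above) dominates its token-level ratio and drags the response-level ratio, but only perturbs the one sentence it belongs to at sentence granularity.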

[NLP-24] Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models

【Quick Read】: This paper addresses the poor interpretability of topic modeling (TM) output, i.e., how to assign semantically meaningful labels to automatically extracted sets of topic words. The key to the solution is a graph-based method that enriches topic words with semantically related terms and analyzes the relationships among them, deriving labels that accurately capture each topic's meaning from the connections in the graph. The method remains computationally efficient while outperforming traditional baselines on BERTScore and cosine similarity and achieving label quality comparable to ChatGPT-3.5.

Link: https://arxiv.org/abs/2511.04248
Authors: Salma Mekaooui, Hiba Sofyan, Imane Amaaz, Imane Benchrif, Arsalane Zarghili, Ilham Chaker, Nikola S. Nikolov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Extracting topics from text has become an essential task, especially with the rapid growth of unstructured textual data. Most existing works rely on highly computational methods to address this challenge. In this paper, we argue that probabilistic and statistical approaches, such as topic modeling ™, can offer effective alternatives that require fewer computational resources. TM is a statistical method that automatically discovers topics in large collections of unlabeled text; however, it produces topics as distributions of representative words, which often lack clear interpretability. Our objective is to perform topic labeling by assigning meaningful labels to these sets of words. To achieve this without relying on computationally expensive models, we propose a graph-based approach that not only enriches topic words with semantically related terms but also explores the relationships among them. By analyzing these connections within the graph, we derive suitable labels that accurately capture each topic’s meaning. We present a comparative study between our proposed method and several benchmarks, including ChatGPT-3.5, across two different datasets. Our method achieved consistently better results than traditional benchmarks in terms of BERTScore and cosine similarity and produced results comparable to ChatGPT-3.5, while remaining computationally efficient. Finally, we discuss future directions for topic labeling and highlight potential research avenues for enhancing interpretability and automation.

[NLP-25] Reusing Pre-Training Data at Test Time is a Compute Multiplier

【Quick Read】: This paper asks whether current large-scale language model pre-training fully extracts the knowledge and information in pre-training datasets, i.e., whether the pre-training apparatus has an efficiency bottleneck. The key to the solution is combining retrieval-augmented generation (RAG) with test-time compute to quantify how much dataset value pre-training leaves behind, and to verify whether retrieving additional context improves performance. Experiments show that pre-training then retrieving from standard, largely open-sourced datasets yields significant accuracy gains on MMLU, Math-500, and SimpleQA that persist through decontamination; for MMLU, retrieval acts as roughly a 5x compute multiplier, and adding test-time parsing of the retrieved context improves the public LLaMA 3.1 8B model by 10 percentage points. The results suggest that today's pre-training methods do not make full use of the information in existing datasets, leaving significant room for progress.

Link: https://arxiv.org/abs/2511.04234
Authors: Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, Tom Gunter
Affiliations: Apple; Stanford University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today’s pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.

[NLP-26] REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLM s

【Quick Read】: This paper addresses a key issue in machine unlearning: existing evaluations detect forgetting only at the level of individual inputs and may overlook residual memorization in semantically similar examples, leading to privacy risks and indirect information leakage. The key to the solution is REMIND (Residual Memorization In Neighborhood Dynamics), which analyzes how the model's loss changes over small input variations to identify data that has not been fully unlearned: unlearned data yields flatter, less steep loss landscapes, while retained or unrelated data exhibits sharper, more volatile patterns. REMIND requires only query-based access, is robust across models, datasets, and paraphrased inputs, and outperforms existing methods under similar constraints, providing a more sensitive and interpretable framework for assessing unlearning effectiveness in language models.

Link: https://arxiv.org/abs/2511.04228
Authors: Liran Cohen, Yaniv Nemcovesky, Avi Mendelson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Pre-print version under review

Abstract:Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model’s loss over small input variations and reveals patterns unnoticed by single-point evaluations. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and demonstrates robustness across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By providing a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework to assess unlearning in language models. As a result, REMIND offers a novel perspective on memorization and unlearning.
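
The core measurement is simple to sketch: evaluate the model's loss on a probe text and on small variations of it, then summarize how flat or volatile that neighborhood is. The word-dropout perturbation and the standard-deviation statistic below are illustrative assumptions; REMIND's exact perturbations and decision rule may differ.

```python
import random
import statistics

def word_dropout(text, p=0.1, rng=random.Random(0)):
    """Small input variation: randomly drop a fraction of words."""
    words = text.split()
    kept = [w for w in words if rng.random() > p] or words
    return " ".join(kept)

def neighborhood_stats(loss_fn, text, n_variants=16):
    """Volatility (stdev) and level (mean) of the loss around `text`."""
    losses = [loss_fn(word_dropout(text)) for _ in range(n_variants)]
    return statistics.stdev(losses), statistics.mean(losses)

# `loss_fn` would wrap the post-unlearning model's NLL on a text; here a toy
# stand-in so the sketch runs end to end.
toy_loss = lambda t: 2.0 + 0.05 * len(t.split())
print(neighborhood_stats(toy_loss, "the quick brown fox jumps over the lazy dog"))
```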

[NLP-27] Black-Box Guardrail Reverse-engineering Attack

【Quick Read】: This paper addresses the vulnerability of safety guardrails in large language models (LLMs) to reverse-engineering attacks: although guardrails effectively suppress harmful outputs, their decision patterns are observable in black-box settings, introducing a new class of security risks. The key to the solution is the Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework with genetic algorithm-driven data augmentation that iteratively collects input-output pairs, prioritizes divergence cases, and applies targeted mutations and crossovers to converge toward a high-fidelity surrogate of the victim guardrail's decision policy. Evaluated on three widely deployed commercial systems (ChatGPT, DeepSeek, and Qwen3), GRA achieves a rule matching rate above 0.92 at an API cost under $85, demonstrating the practical feasibility of guardrail extraction.

Link: https://arxiv.org/abs/2511.04215
Authors: Hongwei Yao, Yun Xia, Shuo Shao, Haoran Shi, Tong Qiao, Cong Wang
Affiliations: City University of Hong Kong; Zhejiang University; Hangzhou Dianzi University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of victim guardrails. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3, and demonstrate that it achieves a rule matching rate exceeding 0.92 while requiring less than $85 in API costs. These findings underscore the practical feasibility of guardrail extraction and highlight significant security risks for current LLM safety mechanisms. Our findings expose critical vulnerabilities in current guardrail designs and highlight the urgent need for more robust defense mechanisms in LLM deployment.

[NLP-28] Block Rotation is All You Need for MXFP4 Quantization

【Quick Read】: This paper addresses the remaining challenge of accurate W4A4 post-training quantization (PTQ) of large language models (LLMs) for the emerging MXFP4 format (an FP4 format with broad hardware support). Existing mainstream methods are designed for INT4, and rotation-based approaches, though excellent for INT4, conflict fundamentally with MXFP4's power-of-two (PoT) block scaling, causing severe performance degradation. The key contribution is the first in-depth analysis of this conflict, tracing it to the mismatch between how global rotation redistributes outlier energy and MXFP4's block scaling, and a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4's structure, significantly improving quantization accuracy across diverse LLMs.

Link: https://arxiv.org/abs/2511.04214
Authors: Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, Jian Cheng
Affiliations: Nanjing University of Science and Technology; C2C2DL, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy; School of Computer Science, University of Sydney
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 pages, 10 figures

Abstract:Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4 – a new FP4 format with broad hardware support (NVIDIA, AMD, Intel) – raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art approaches, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4’s PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also set a foundation for advancing PTQ research under emerging low-precision formats.
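
The block-rotation idea is easy to sketch: apply an orthonormal (e.g., Hadamard) rotation within each 32-element block, the unit that shares one power-of-two scale under MX formats, so outlier energy is redistributed only among values quantized together, then snap to an FP4-style grid and undo the rotation after dequantization. The sketch below is a schematic reading of the abstract, not the authors' released implementation.

```python
import numpy as np
from scipy.linalg import hadamard

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # |E2M1| values
BLOCK = 32
H = hadamard(BLOCK) / np.sqrt(BLOCK)         # orthonormal rotation matrix

def quantize_block(block):
    """Power-of-two block scale, then nearest-value snap to the FP4 grid."""
    scale = 2.0 ** np.ceil(np.log2(np.abs(block).max() / FP4_GRID.max() + 1e-12))
    idx = np.argmin(np.abs(np.abs(block[:, None]) / scale - FP4_GRID), axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale

def mxfp4_with_block_rotation(w):
    blocks = w.reshape(-1, BLOCK)
    rotated = blocks @ H.T                   # rotate within each block
    dequant = np.stack([quantize_block(b) for b in rotated])
    return (dequant @ H).reshape(-1)         # undo the rotation

rng = np.random.default_rng(0)
x = rng.normal(size=128) * np.r_[np.ones(127), 20.0]   # one large outlier
print("MSE with block rotation:", np.mean((mxfp4_with_block_rotation(x) - x) ** 2))
```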

[NLP-29] LLM-as-a-Judge is Bad Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

【Quick Read】: This paper asks whether current large language models (LLMs) can pass the official qualifying examination for Poland's National Appeal Chamber (Krajowa Izba Odwoławcza), as a way of assessing their feasibility as substitutes for human judges or independent examiners in legal practice. The key elements of the approach are a hybrid information retrieval and extraction pipeline and two testing paradigms: having LLMs sit the actual exam, which consists of a multiple-choice knowledge test and a written judgment, and an "LLM-as-a-judge" setup in which other models automatically grade the generated answers. The experiments show that although the models scored well on the knowledge test, none reached the passing threshold on the practical writing part, and the automatic grades diverged markedly from the official examination committee's judgments, exposing systematic weaknesses in legal reasoning, citation accuracy, and logical rigor, and underscoring the need for close collaboration between legal experts and technical teams.

Link: https://arxiv.org/abs/2511.04205
Authors: Michał Karp,Anna Kubaszewska,Magdalena Król,Robert Król,Aleksander Smywiński-Pohl,Mateusz Szymański,Witold Wydmański
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland’s National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the ‘LLM-as-a-judge’ approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the ‘LLM-as-a-judge’ often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.

[NLP-30] Computational Turing Test Reveals Systematic Differences Between Human and AI Language

【Quick Read】: This paper addresses the lack of reliable validation tools when large language models (LLMs) are used in the social sciences to simulate human behavior: existing evaluations rest on human judgments, which are blunt and unreliable, so the realism and semantic consistency of LLM-generated text cannot be measured accurately. The key to the solution is a computational Turing test, a framework that combines aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely an LLM approximates human language within a given dataset. Using five calibration strategies (including fine-tuning, stylistic prompting, and context retrieval), the authors benchmark nine open-weight LLMs and uncover a core trade-off: optimizing for human-likeness often sacrifices semantic fidelity, and vice versa. The framework offers a scalable mechanism for validating and calibrating LLM simulations, along with a cautionary note about current models' limits in capturing the complexity of human communication.

Link: https://arxiv.org/abs/2511.04195
Authors: Nicolò Pagan,Petter Törnberg,Christopher A. Bail,Anikó Hannák,Christopher Barrie
Affiliations: University of Zurich, Department of Informatics (苏黎世大学信息学院); University of Amsterdam, Institute for Logic, Language and Computation (阿姆斯特丹大学逻辑、语言与计算研究所); Duke University (杜克大学); New York University, Department of Sociology (纽约大学社会学系)
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:

Abstract:Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations – testing whether humans can distinguish AI from human output – despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies – including fine-tuning, stylistic prompting, and context retrieval – benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations – and offer a cautionary note about their current limitations in capturing human communication.

[NLP-31] Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains

【Quick Read】: This paper tackles the "communication alienation" created by AI-generated content driven by large language models (LLMs): senders use LLMs to inflate simple ideas into verbose text, recipients use LLMs to compress it back into summaries, and neither side genuinely participates in the exchange of knowledge. The key to the solution is the LAAC (LLM as a Communicator) paradigm, which positions the LLM as an intelligent communication intermediary that captures the sender's intent through structured dialogue and facilitates substantive knowledge transfer to the recipient. The core innovation is to recast the LLM's role in the communication chain, stepping out of the inflate-and-compress cycle and focusing on three trust dimensions: information-capture fidelity, reproducibility of interactions, and integrity of responses, so as to enable reliable communication across settings such as academic papers and professional emails.

Link: https://arxiv.org/abs/2511.04184
Authors: Mohammed Musthafa Rafi,Adarsh Krishnamurthy,Aditya Balu
Affiliations: Iowa State University (爱荷华州立大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures. Submitted to IEEE DISTILL 2025 (co-located with IEEE TPS 2025)

Abstract:The proliferation of AI-generated content has created an absurd communication theater where senders use LLMs to inflate simple ideas into verbose content, recipients use LLMs to compress them back into summaries, and as a consequence neither party engages with authentic content. LAAC (LLM as a Communicator) proposes a paradigm shift - positioning LLMs as intelligent communication intermediaries that capture the sender’s intent through structured dialogue and facilitate genuine knowledge exchange with recipients. Rather than perpetuating cycles of AI-generated inflation and compression, LAAC enables authentic communication across diverse contexts including academic papers, proposals, professional emails, and cross-platform content generation. However, deploying LLMs as trusted communication intermediaries raises critical questions about information fidelity, consistency, and reliability. This position paper systematically evaluates the trustworthiness requirements for LAAC’s deployment across multiple communication domains. We investigate three fundamental dimensions: (1) Information Capture Fidelity - accuracy of intent extraction during sender interviews across different communication types, (2) Reproducibility - consistency of structured knowledge across multiple interaction instances, and (3) Query Response Integrity - reliability of recipient-facing responses without hallucination, source conflation, or fabrication. Through controlled experiments spanning multiple LAAC use cases, we assess these trust dimensions using LAAC’s multi-agent architecture. Preliminary findings reveal measurable trust gaps that must be addressed before LAAC can be reliably deployed in high-stakes communication scenarios.

[NLP-32] Transforming Mentorship: An AI Powered Chatbot Approach to University Guidance

【Quick Read】: This paper addresses the absence of personalized guidance for undergraduate students, where traditional mentors cannot provide timely, tailored coaching at scale. The key to the solution is an AI-powered chatbot built on an efficient data-ingestion pipeline that extracts and refreshes information from heterogeneous sources such as CSV files and university webpages, a hybrid retrieval strategy that combines BM25 lexical ranking with ChromaDB semantic retrieval to fetch precise context, and the LLaMA-3.3-70B large language model to generate highly relevant, natural conversational responses (BERTScore 0.831, METEOR 0.809). The system also makes data refreshes markedly more efficient (106.82 seconds, roughly 71% faster than loading all data from scratch), enabling real-time, customized support for BRAC University students in academic planning and campus life.

Link: https://arxiv.org/abs/2511.04172
Authors: Mashrur Rahman,Mantaqa abedin,Monowar Zamil Abir,Faizul Islam Ansari,Adib Reza,Farig Yousuf Sadeque,Niloy Farhan
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 11 pages

Abstract:University students face immense challenges during their undergraduate lives, often being deprived of personalized on-demand guidance that mentors fail to provide at scale. Digital tools exist, but there is a serious lack of customized coaching for newcomers. This paper presents an AI-powered chatbot that will serve as a mentor for the students of BRAC University. The main component is a data ingestion pipeline that efficiently processes and updates information from diverse sources, such as CSV files and university webpages. The chatbot retrieves information through a hybrid approach, combining BM25 lexical ranking with ChromaDB semantic retrieval, and uses a Large Language Model, LLaMA-3.3-70B, to generate conversational responses. The generated text was found to be semantically highly relevant, with a BERTScore of 0.831 and a METEOR score of 0.809. The data pipeline was also very efficient, taking 106.82 seconds for updates, compared to 368.62 seconds for new data. This chatbot will be able to help students by responding to their queries, helping them to get a better understanding of university life, and assisting them to plan better routines for their semester in the open-credit university.
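As a rough illustration of the hybrid retrieval step, the sketch below fuses BM25 lexical scores with cosine similarities from a dense embedder. Here `embed` is a placeholder for whatever sentence-embedding model backs the ChromaDB index, and the min-max fusion with weight `alpha` is an assumed combination rule, not necessarily the paper's exact formula.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_retrieve(query, docs, embed, alpha=0.5, k=5):
    # lexical scores from BM25 over whitespace-tokenized documents
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    lex = bm25.get_scores(query.lower().split())
    # dense cosine similarities from the (placeholder) embedding model
    doc_vecs = np.stack([embed(d) for d in docs])
    q = embed(query)
    sem = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    # min-max normalize both signals, then mix with weight alpha
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    fused = alpha * norm(lex) + (1 - alpha) * norm(sem)
    return [docs[i] for i in np.argsort(-fused)[:k]]
```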

[NLP-33] Seeing Straight: Document Orientation Detection for Efficient OCR

【Quick Read】: This paper addresses image rotation in scanned or photographed documents caused by incorrect camera orientation, a key preprocessing step for downstream tasks such as optical character recognition (OCR). The core of the solution is OCR-Rotation-Bench (ORB), a new benchmark covering structured and free-form English data (ORB-En) and 11 mid- to low-resource Indic languages (ORB-Indic), together with a lightweight, fast, and robust rotation-classification pipeline built on the Phi-3.5-Vision vision encoder with dynamic image cropping. Fine-tuned standalone for the 4-class rotation task, the pipeline reaches 96% and 92% accuracy, and as a preprocessing module it also lifts OCR performance by up to 14% for closed-source models and up to 4x for open-weight models.

Link: https://arxiv.org/abs/2511.04161
Authors: Suranjan Goswami,Abhinav Ravi,Raja Kolla,Ali Faraz,Shaharukh Khan,Akash,Chandra Khatri,Shubham Agarwal
Affiliations: OLA Electric(OLA电动); Krutrim AI(Krutrim AI)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid- to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves near-perfect accuracy of 96% and 92% in identifying rotations on the two datasets, respectively. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance: closed-source (up to 14%) and open-weights models (up to 4x) in a simulated real-world setting.
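A minimal sketch of how such a 4-class prediction would be applied before OCR; `predict_rotation` stands in for the fine-tuned Phi-3.5-Vision classifier described above and is an assumption of this sketch.

```python
from PIL import Image

def correct_orientation(path, predict_rotation):
    """predict_rotation returns k in {0, 1, 2, 3}, meaning the page is rotated
    by k * 90 degrees clockwise; rotate counter-clockwise to undo it."""
    img = Image.open(path)
    k = predict_rotation(img)
    return img.rotate(90 * k, expand=True) if k else img
```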

[NLP-34] BAPPA: Benchmarking Agents Plans and Pipelines for Automated Text-to-SQL Generation

【Quick Read】: This paper addresses the difficulty large language models (LLMs) face when generating SQL from natural-language instructions, where large schemas and complex reasoning create performance bottlenecks; prior work leans on elaborate, impractical pipelines around flagship models while small, efficient models are overlooked. The key to the solution is three multi-agent LLM pipeline designs: (1) a multi-agent discussion pipeline in which agents iteratively critique and refine SQL queries, lifting small-model performance; (2) a planner-coder pipeline in which a reasoning planner produces a stepwise SQL-generation plan and a coder synthesizes the query; and (3) a coder-aggregator pipeline in which several coders generate SQL independently and a reasoning agent selects the best result. Experiments show that multi-agent discussion markedly improves small models (e.g., up to +10.6% execution accuracy for Qwen2.5-7b-Instruct after three rounds), and pairing a strong planner (DeepSeek-R1-32B) with a coder (Gemma 3 27B IT) yields the highest accuracy (up to 56.4%).

Link: https://arxiv.org/abs/2511.04153
Authors: Fahim Ahmed,Md Mubtasim Ahasan,Jahir Sadik Monon,Muntasir Wahed,M Ashraful Amin,A K M Mahbubur Rahman,Amin Ahsan Ali
Affiliations: Center for Computational & Data Sciences, Independent University, Bangladesh; University of Massachusetts Amherst; University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Multiagent Systems (cs.MA)
Comments:

Abstract:Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLM) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Codes are available at this https URL.
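A minimal sketch of the Planner-Coder pipeline, assuming a generic `call_llm(model, prompt)` chat client; the prompts and model labels are illustrative, not the paper's exact configuration.

```python
def planner_coder_sql(question, schema, call_llm):
    # Step 1: a reasoning model drafts a stepwise plan for the query.
    plan = call_llm(
        model="planner",  # e.g. a reasoning model such as QwQ-32B
        prompt=f"Schema:\n{schema}\n\nQuestion: {question}\n"
               "Write a numbered, stepwise plan for building the SQL query.")
    # Step 2: a coder model follows the plan and emits only the SQL.
    sql = call_llm(
        model="coder",    # e.g. Gemma 3 27B IT in the paper's setup
        prompt=f"Schema:\n{schema}\nQuestion: {question}\nPlan:\n{plan}\n"
               "Follow the plan and output only the final SQL query.")
    return sql.strip()
```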

[NLP-35] CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

【Quick Read】: This paper targets the high word error rates (WER) of automatic speech recognition (ASR) for low-resource Cantonese, whose core challenges include scarce annotated data, a six-tone system, tone sandhi, and accent variation. The key to the solution is CantoASR, a collaborative ASR-LALM error-correction framework that combines forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for stronger tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction, yielding substantial character error rate (CER) improvements on spontaneous speech and demonstrating a scalable way to couple acoustic cues with large audio-language model (LALM) reasoning.

Link: https://arxiv.org/abs/2511.04139
Authors: Dazhong Chen,Yi-Cheng Lin,Yuchen Huang,Ziwei Gong,Di Jiang,Zeying Xie,Yi R.(May)Fung
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.

[NLP-36] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

【Quick Read】: This paper addresses the risk that large language models' (LLMs) mathematical-reasoning scores are inflated by training-data leakage or surface pattern matching, which calls for an adversarial-perturbation evaluation that measures genuine reasoning ability. The key to the solution is RIDE, an adversarial question-rewriting framework grounded in Item Response Theory (IRT): 35 LLMs act as simulated students to build a difficulty ranker, which then serves as the reward signal in reinforcement learning that guides a rewriting model to produce harder yet well-posed variants of math problems across difficulty levels. Applied to competition-level benchmarks, the perturbed sets lower advanced LLM performance by 21.73% on average across 26 models, exposing limited robustness in mathematical reasoning and confirming the validity of the evaluation approach.

Link: https://arxiv.org/abs/2511.04120
Authors: Xinyuan Li,Murong Xu,Wenbiao Tao,Hanlun Zhu,Yike Zhao,Jipeng Zhang,Yunshi Lan
Affiliations: East China Normal University (华东师范大学); The Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
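To illustrate the IRT building block, here is a minimal sketch that fits a one-parameter (Rasch) model by gradient ascent, recovering item difficulties from a binary response matrix of the 35 simulated "students". The paper does not specify its exact IRT variant or ranker training in the abstract, so treat this as the generic component, not RIDE itself.

```python
import numpy as np

def rasch_fit(responses, iters=500, lr=0.05):
    """responses: float array (n_models, n_items) of 0/1 correctness.
    Returns model abilities theta and item difficulties b."""
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)
    b = np.zeros(n_items)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = responses - p          # gradient of the log-likelihood
        theta += lr * resid.sum(axis=1) / n_items
        b -= lr * resid.sum(axis=0) / n_models
        b -= b.mean()                  # pin the scale's location
    return theta, b
```

The fitted `b` values give exactly the kind of difficulty ranking that can then reward a question-rewriting policy for producing harder variants.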

[NLP-37] Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models

【Quick Read】: This paper addresses inefficiency and behavioral instability in large reasoning models (LRMs) during multi-step reasoning, such as overthinking, redundant self-corrections (hedging language), and excessive reasoning-token usage. The key to the solution is exploiting batch prompting as an inference-time regularizer: batching improves accuracy while substantially reducing reasoning cost (typically 3x-5x fewer reasoning tokens). The study finds that batching suppresses unnecessary reasoning behaviors and produces emergent collective effects, with patterns transferring across examples within a batch so that models generalize to harder instances; batching is therefore not just a throughput optimization but an efficient inference-time regularization strategy.

Link: https://arxiv.org/abs/2511.04108
Authors: Wenmo Qiu,Saurabh Srivastava
Affiliations: University of Toronto (多伦多大学); George Mason University (乔治梅森大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.
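A minimal sketch of batch prompting as studied here: pack several questions into one request and parse numbered answers back out. `call_llm` is a placeholder for any chat-completion client, and the answer format is an assumed convention.

```python
import re

def batched_answers(questions, call_llm):
    numbered = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(questions))
    reply = call_llm(
        "Answer each question concisely, one line per question,\n"
        "formatted as 'A<n>: <answer>'.\n\n" + numbered)
    # map answer numbers back to their questions; missing ones become ""
    answers = dict(re.findall(r"A(\d+):\s*(.+)", reply))
    return [answers.get(str(i + 1), "") for i in range(len(questions))]
```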

[NLP-38] A Characterization of List Language Identification in the Limit

【Quick Read】: This paper studies language identification in the limit: given a sequence of examples from a target language, the learner outputs a sequence of guesses and must be correct from some finite time onward. Classical results show that single-guess identification is impossible for essentially any interesting collection of languages, while Angluin gave the precise condition under which it is possible. This work revisits the problem under k-list identification, where at each step the learner may output k candidate languages, at least one of which must eventually always be correct. The key contribution is an exact characterization based on a recursive version of Angluin's condition, with a conceptually appealing corollary: a collection of languages is k-list identifiable in the limit if and only if it decomposes into k collections, each identifiable in the limit with a single guess. The characterization further yields rates in the statistical setting: k-list identifiable collections admit exponential rates, and this is optimal, whereas non-identifiable collections admit no rate going to zero.

Link: https://arxiv.org/abs/2511.04103
Authors: Moses Charikar,Chirag Pabbaraju,Ambuj Tewari
Affiliations: Stanford University (斯坦福大学); University of Michigan, Ann Arbor (密歇根大学安娜堡分校)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of k guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be k-list identified in the limit, based on a recursive version of Angluin’s characterization (for language identification with a list of size 1). This further leads to a conceptually appealing characterization: A language collection can be k-list identified in the limit if and only if the collection can be decomposed into k collections of languages, each of which can be identified in the limit (with a list of size 1). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is k-list identifiable in the limit, then the collection can be k-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not k-list identifiable in the limit, then it cannot be k-list identified at any rate that goes to zero.

[NLP-39] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

【Quick Read】: This paper targets the accuracy and generalization of automated de-identification of protected health information (PHI) in radiology reports, where existing models degrade across institutions and commercial cloud systems underperform. The key to the solution is fine-tuning a transformer-based model on large, multimodal annotated radiology corpora and introducing a new PHI category (AGE), which improves robustness and precision on data from different institutions; synthetic PHI generated with a hide-in-plain-sight method is used to verify stability. The results show high-precision privacy protection while preserving data utility, clearly outperforming prior academic and commercial baselines.

Link: https://arxiv.org/abs/2511.04079
Authors: Eva Prakash,Maayane Attias,Pierre Chambon,Justin Xu,Steven Truong,Jean-Benoit Delbrouck,Tessa Cook,Curtis Langlotz
Affiliations: Stanford University (斯坦福大学); JP Morgan Chase & Co (摩根大通); Sorbonne University (索邦大学); University of Oxford (牛津大学); NVIDIA (英伟达); HOPPR; University of Pennsylvania (宾夕法尼亚大学)
Subjects: Computation and Language (cs.CL)
Comments: In submission to JAMIA

Abstract:Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a “hide-in-plain-sight” method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.

[NLP-40] The truth is no diaper: Human and AI-generated associations to emotional words

【Quick Read】: This paper asks whether large language models (LLMs) produce word associations the way humans do, especially for emotionally loaded cue words, and whether their associative behavior shows similar creativity. The key of the approach is comparing spontaneous human responses with LLM responses to the same cues, quantifying the overlap between the two, and analyzing how the models differ in amplifying the emotional load of the stimulus, in predictability, and in creativity, thereby revealing both the strengths and limits of LLMs in mimicking human semantic association.

Link: https://arxiv.org/abs/2511.04077
Authors: Špela Vintar,Jan Jona Javoršek
Affiliations: University of Ljubljana (卢布尔雅那大学); Jožef Stefan Institute (约瑟夫·斯蒂芬研究所)
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure. Presented at ICCC’25, Campinas, Brazil

Abstract:Human word associations are a well-known method of gaining insight into the internal mental lexicon, but the responses spontaneously offered by human participants to word cues are not always predictable as they may be influenced by personal experience, emotions or individual cognitive styles. The ability to form associative links between seemingly unrelated concepts can be the driving mechanisms of creativity. We perform a comparison of the associative behaviour of humans compared to large language models. More specifically, we explore associations to emotionally loaded words and try to determine whether large language models generate associations in a similar way to humans. We find that the overlap between humans and LLMs is moderate, but also that the associations of LLMs tend to amplify the underlying emotional load of the stimulus, and that they tend to be more predictable and less creative than human ones.

[NLP-41] Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

Link: https://arxiv.org/abs/2511.04072
Authors: Xinying Qian,Ying Zhang,Yu Zhao,Baohang Zhou,Xuhui Sui,Xiaojie Yuan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Submitted to the IEEE for possible publication

[NLP-42] T-FIX: Text-Based Explanations with Features Interpretable to eXperts

【Quick Read】: This paper addresses the fact that when large language models (LLMs) generate explanations in knowledge-intensive settings (e.g., surgery, astronomy, psychotherapy), existing evaluations only measure plausibility or internal faithfulness and fail to test whether the content of an explanation genuinely matches domain experts' intuitions. The key to the solution is the T-FIX benchmark, spanning seven knowledge-intensive domains, together with new quantitative metrics developed in collaboration with domain experts that formalize expert alignment as the evaluation criterion, enabling a more accurate assessment of how well LLM explanations agree with expert judgment.

Link: https://arxiv.org/abs/2511.04070
Authors: Shreya Havaldar,Helen Jin,Chaehyeon Kim,Anton Xue,Weiqiu You,Marco Gatti,Bhuvnesh Jain,Helen Qu,Daniel A Hashimoto,Amin Madani,Rajat Deo,Sameed Ahmed M. Khatana,Gary E. Weissman,Lyle Ungar,Eric Wong
Affiliations: University of Pennsylvania (宾夕法尼亚大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); Flatiron Institute (扁平铁研究所); University of Toronto (多伦多大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.

[NLP-43] DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization NEURIPS2025

【Quick Read】: This paper targets the high computational cost and overfitting risk of rotational optimization in large-model quantization: end-to-end fine-tuning of rotation matrices is inefficient and sensitive to task-specific losses, making performance unstable. The key to the solution is DartQuant, a distribution-aware rotational calibration method that constrains the distribution of post-rotation activations, lowering the complexity of rotational optimization and reducing reliance on task-specific losses, which mitigates overfitting; a QR-Orth optimization scheme further replaces expensive alternating optimization with a more efficient alternative. The method achieves 47x speedup and 10x memory savings for rotational optimization on a 70B model and is the first to complete rotational calibration of a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments.

Link: https://arxiv.org/abs/2511.04063
Authors: Yuantian Shao,Yuanteng Chen,Peisong Wang,Jianlin Yu,Jing Lin,Yiwu Yao,Zhihui Wei,Jian Cheng
Affiliations: Nanjing University of Science and Technology (南京理工大学); C2C2DL, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所C2C2DL); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Zhongguancun Academy (中关村研究院); Huawei Technologies Co., Ltd. (华为技术有限公司)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: NeurIPS 2025, 10 pages, 12 figures

Abstract:Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates superior performance. Compared to existing methods, it achieves 47x acceleration and 10x memory savings for rotational optimization on a 70B model. Furthermore, it is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments. Code is available at this https URL.
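The abstract does not spell out the calibration objective, but the general recipe (calibrate an orthogonal rotation against a distributional criterion instead of a task loss) can be sketched as follows. The kurtosis penalty here is an assumed stand-in for whatever distribution constraint DartQuant actually uses, and the random calibration batches are placeholders for captured layer activations.

```python
import torch
from torch.nn.utils.parametrizations import orthogonal

d = 512
rot = orthogonal(torch.nn.Linear(d, d, bias=False))  # weight kept orthogonal
opt = torch.optim.Adam(rot.parameters(), lr=1e-3)

def excess_kurtosis(x):
    # per-channel heavy-tailedness of the rotated activations
    z = (x - x.mean(0)) / (x.std(0) + 1e-6)
    return (z ** 4).mean(0) - 3.0

calibration_batches = [torch.randn(256, d) for _ in range(100)]  # placeholder
for acts in calibration_batches:   # real use: activations hooked from a layer
    loss = excess_kurtosis(rot(acts)).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of such a proxy objective is that it needs only a small calibration set and no labels, which is where the claimed speed and memory savings over end-to-end fine-tuning come from.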

[NLP-44] Explorability in Pushdown Automata

【Quick Read】: This paper seeks to quantify and classify the degree of nondeterminism in pushdown automata (PDAs) by introducing explorability, a new notion that refines the nondeterminism hierarchy. The key idea is k-explorability: while reading the input, it suffices to maintain k concurrent runs, built step by step from the input seen so far, to construct an accepting run whenever one exists. The main findings are: (1) explorability forms an infinite hierarchy, each level k strictly stronger than the previous one yet weaker overall than general nondeterministic PDAs; (2) explorability growing exponentially in the input length captures exactly the context-free languages; and (3) explorable PDAs can be doubly exponentially more succinct than history-deterministic ones, and the succinctness gap between deterministic and 2-explorable PDAs is not recursively enumerable. Explorability is thus an operationally meaningful measure of nondeterminism for both expressiveness and succinctness.

Link: https://arxiv.org/abs/2511.04048
Authors: Ayaan Bedi,Karoliina Lehtinen
Affiliations: Unknown
Subjects: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)
Comments:

Abstract:We study explorability, a measure of nondeterminism in pushdown automata, which generalises history-determinism. An automaton is k-explorable if, while reading the input, it suffices to follow k concurrent runs, built step-by-step based only on the input seen so far, to construct an accepting one, if it exists. We show that the class of explorable PDAs lies strictly between history-deterministic and fully nondeterministic PDAs in terms of both expressiveness and succinctness. In fact increasing explorability induces an infinite hierarchy: each level k defines a strictly more expressive class than level k-1, yet the entire class remains less expressive than general nondeterministic PDAs. We then introduce a parameterized notion of explorability, where the number of runs may depend on input length, and show that exponential explorability precisely captures the context-free languages. Finally, we prove that explorable PDAs can be doubly exponentially more succinct than history-deterministic ones, and that the succinctness gap between deterministic and 2-explorable PDAs is not recursively enumerable. These results position explorability as a robust and operationally meaningful measure of nondeterminism for pushdown systems.

[NLP-45] WST: Weakly Supervised Transducer for Automatic Speech Recognition

【Quick Read】: This paper addresses the heavy dependence of end-to-end (E2E) automatic speech recognition (ASR) on large, high-quality annotated data, which is costly and hard to obtain in practice. The key to the solution is the Weakly Supervised Transducer (WST), whose flexible training graph tolerates errors in transcripts without additional confidence estimation or pre-trained auxiliary models, improving robustness under heavy annotation noise. Experiments on synthetic and industrial datasets show that WST maintains performance at transcription error rates of up to 70%, consistently outperforming Connectionist Temporal Classification (CTC)-based weakly supervised methods such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC).

Link: https://arxiv.org/abs/2511.04035
Authors: Dongji Gao,Chenda Liao,Changliang Liu,Matthew Wiesner,Leibny Paola Garcia,Daniel Povey,Sanjeev Khudanpur,Jian Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.

[NLP-46] Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises

【Quick Read】: This paper addresses reasoning failures in retrieval-augmented generation (RAG) systems when retrieved evidence is incomplete, i.e., when the knowledge base cannot supply enough support for a coherent, trustworthy inference. The key to the solution is integrating abductive inference: detecting insufficient evidence, generating candidate missing premises, and validating them through consistency and plausibility checks, thereby bridging gaps in the reasoning chain and improving both answer accuracy and the faithfulness of the reasoning.

Link: https://arxiv.org/abs/2511.04020
Authors: Shiyin Lin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) enhanced with retrieval – commonly referred to as Retrieval-Augmented Generation (RAG) – have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, abductive inference – the process of generating plausible missing premises to explain observations – offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.
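A minimal sketch of the abductive loop described above, with `retrieve` and `call_llm` as placeholder components; the prompt wording and the single-premise loop are assumptions, not the paper's exact pipeline.

```python
def abductive_rag(question, retrieve, call_llm):
    evidence = retrieve(question)
    # 1) detect insufficient evidence
    verdict = call_llm(f"Evidence:\n{evidence}\nQuestion: {question}\n"
                       "Is the evidence sufficient to answer? YES or NO.")
    if verdict.strip().upper().startswith("NO"):
        # 2) generate a candidate missing premise
        premise = call_llm(f"Evidence:\n{evidence}\nQuestion: {question}\n"
                           "State one plausible missing premise.")
        # 3) validate it for consistency and plausibility
        ok = call_llm(f"Premise: {premise}\nEvidence:\n{evidence}\n"
                      "Is this premise consistent and plausible? YES or NO.")
        if ok.strip().upper().startswith("YES"):
            evidence += "\n(assumed premise) " + premise.strip()
    return call_llm(f"Evidence:\n{evidence}\nQuestion: {question}\nAnswer:")
```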

[NLP-47] Towards Scalable Meta-Learning of near-optimal Interpretable Models via Synthetic Model Generations NEURIPS2025

【Quick Read】: This paper addresses the high computational cost and data-acquisition difficulty of meta-learning decision trees for high-stakes domains such as finance and healthcare, where interpretability matters. The key to the solution is an efficient method for generating synthetic pre-training data: sampling near-optimal decision trees to build large-scale, realistic datasets and pairing them with the MetaTree transformer architecture for lightweight but strong meta-learning. The strategy markedly reduces computational complexity, increases data-generation flexibility, and opens a feasible path to scalable, efficient meta-learning of interpretable decision-tree models.

Link: https://arxiv.org/abs/2511.04000
Authors: Kyaw Hpone Myint,Zhe Wu,Alexandre G.R. Day,Giri Iyengar
Affiliations: Capital One (资本一号)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 9 pages, 3 figures, Neurips 2025 GenAI in Finance Workshop

Abstract:Decision trees are widely used in high-stakes fields like finance and healthcare due to their interpretability. This work introduces an efficient, scalable method for generating synthetic pre-training data to enable meta-learning of decision trees. Our approach samples near-optimal decision trees synthetically, creating large-scale, realistic datasets. Using the MetaTree transformer architecture, we demonstrate that this method achieves performance comparable to pre-training on real-world data or with computationally expensive optimal decision trees. This strategy significantly reduces computational costs, enhances data generation flexibility, and paves the way for scalable and efficient meta-learning of interpretable decision tree models.

[NLP-48] LLMs and Cultural Values: The Impact of Prompt Language and Explicit Cultural Framing

【Quick Read】: This paper asks whether large language models (LLMs) can effectively reflect the cultural diversity of their global user base, and how prompt design can improve the alignment between model outputs and the values of people in different countries. The key of the approach is a systematic evaluation of two interventions: prompting in different languages and adding explicit cultural framing. The study finds that while both cause variation in LLM outputs, language adjustments alone cannot overcome the models' inherent cultural bias; an explicit cultural perspective noticeably improves alignment with local values, and combining the two brings no additional gain, indicating that LLMs are strongly anchored to the cultural defaults of a few countries (the Netherlands, Germany, the US, and Japan) and struggle to genuinely represent cultural diversity.

Link: https://arxiv.org/abs/2511.03980
Authors: Bram Bulté,Ayla Rigouts Terryn
Affiliations: Brussels Centre for Language Studies, Vrije Universiteit Brussel (布鲁塞尔语言研究中心,布鲁塞尔自由大学); Université de Montréal & Mila - Quebec AI Institute (蒙特利尔大学与魁北克人工智能研究所)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint under review at Computational Linguistics. Accepted with minor revisions (10/10/2025); second round

Abstract:Large Language Models (LLMs) are rapidly being adopted by users across the globe, who interact with them in a diverse range of languages. At the same time, there are well-documented imbalances in the training data and optimisation objectives of this technology, raising doubts as to whether LLMs can represent the cultural diversity of their broad user base. In this study, we look at LLMs and cultural values and examine how prompt language and cultural framing influence model responses and their alignment with human values in different countries. We probe 10 LLMs with 63 items from the Hofstede Values Survey Module and World Values Survey, translated into 11 languages, and formulated as prompts with and without different explicit cultural perspectives. Our study confirms that both prompt language and cultural perspective produce variation in LLM outputs, but with an important caveat: While targeted prompting can, to a certain extent, steer LLM responses in the direction of the predominant values of the corresponding countries, it does not overcome the models’ systematic bias toward the values associated with a restricted set of countries in our dataset: the Netherlands, Germany, the US, and Japan. All tested models, regardless of their origin, exhibit remarkably similar patterns: They produce fairly neutral responses on most topics, with selective progressive stances on issues such as social tolerance. Alignment with cultural values of human respondents is improved more with an explicit cultural perspective than with a targeted prompt language. Unexpectedly, combining both approaches is no more effective than cultural framing with an English prompt. These findings reveal that LLMs occupy an uncomfortable middle ground: They are responsive enough to changes in prompts to produce variation, but too firmly anchored to specific cultural defaults to adequately represent cultural diversity.

[NLP-49] Multi-Agent Collaborative Framework For Math Problem Generation

【Quick Read】: This paper addresses the difficulty of precisely controlling problem complexity and cognitive demand in automatic question generation (AQG) for intelligent tutoring systems and educational practice. Pre-trained transformer-based language models excel at natural language generation but struggle to produce high-quality, well-structured math problems matched to pedagogical goals. The key to the solution is a collaborative multi-agent framework in which multiple agents iteratively refine generated question-answer pairs at inference time, achieving a finer balance among difficulty, cognitive challenge, and clarity. This markedly improves the educational quality of generated content and points to a new route for automated educational content generation and adaptive learning environments.

Link: https://arxiv.org/abs/2511.03958
Authors: Kia Karbasi,Kevin Hong,Mohammad Amin Samadi,Gregory Pottie
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Published in the Proceedings of the 18th International Conference on Educational Data Mining, 6 pages, 5 figures

Abstract:Automatic question generation (AQG) for mathematics education remains an elusive goal for Intelligent Tutoring Systems and educators. While pre-trained transformer-based language models have significantly advanced natural language generation, they often struggle to precisely control problem complexity and cognitive demands. In this paper, we introduce a collaborative multi-agent framework as a novel method of incorporating inference-time computation into AQG. This approach leverages multiple agents that iteratively refine generated question-answer pairs to better balance complexity and cognitive demand. We evaluate the generated questions on five meta-evaluation criteria: relevance, importance, clarity, difficulty matching, and answerability, to assess the system’s ability to control the required complexity and quality of the questions. Preliminary evaluations show that this collaborative multi-agent framework elevates the quality of generated educational content by fostering a more nuanced balance between cognitive challenge and clarity. These promising outcomes suggest that integrating collaborative multi-agent workflows can yield more controlled, pedagogically valuable content that can help advance automated educational content generation and adaptive learning environments.

[NLP-50] Direct Semantic Communication Between Large Language Models via Vector Translation

【Quick Read】: This paper addresses the loss of latent semantics, with the resulting transfer inefficiency and extra compute, when large language models (LLMs) in multi-agent settings exchange information as plain tokens. The key to the solution is a latent bridge built from vector translations: a trained dual-encoder translator aligns the representation spaces of different models, enabling direct cross-model semantic exchange. Experiments show that a translator trained between Llama-2-7B and Mistral-7B-Instruct reaches an average cosine alignment of 0.538, and that injecting translated vectors at 30% blending strength effectively steers the target model's generation without destabilizing its logits, demonstrating the feasibility of cross-model latent communication.

Link: https://arxiv.org/abs/2511.03945
Authors: Fu-Chun Yang,Jason Eshraghian
Affiliations: University of California, Santa Cruz (加州大学圣克鲁兹分校)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, 2 tables

Abstract:In multi-agent settings, such as debate, reflection, or tool-calling, large language models (LLMs) pass messages as plain tokens, discarding most latent semantics. This constrains information transfer and adds unnecessary computational overhead. We form a latent bridge via vector translations, which use learned mappings that enable direct semantic exchange between representation spaces. A dual-encoder translator trained between Llama-2-7B and Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the translated vectors at 30 percent blending strength steers the target model’s generation without destabilizing logits. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned variants. This conservative injection preserves computational stability while demonstrating that cross-model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens.
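A minimal sketch of the translator idea: a learned map between hidden spaces trained with a cosine objective, plus blended injection into the target model. The paper uses a dual-encoder design, so this single MLP is a simplification; the 4096 dimensions match the hidden size of the two 7B models named above.

```python
import torch
import torch.nn as nn

class VectorTranslator(nn.Module):
    """Maps source-model hidden states into the target model's space."""
    def __init__(self, d_src=4096, d_tgt=4096, d_hidden=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_src, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_tgt))

    def forward(self, h_src):
        return self.net(h_src)

def cosine_loss(pred, target):
    # trained on paired hidden states from the two models
    return 1 - nn.functional.cosine_similarity(pred, target, dim=-1).mean()

def inject(h_tgt, h_translated, strength=0.3):
    # conservative 30% blend, as in the paper's reported setting
    return (1 - strength) * h_tgt + strength * h_translated
```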

[NLP-51] MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation NEURIPS2025

【Quick Read】: This paper addresses the problem of generating multitrack MIDI music from free-form text prompts, i.e., equipping a large language model to understand natural language and emit a structured, playable musical representation. The key to the solution is twofold: expanding a text LLM's vocabulary to include MIDI tokens so it can handle musical data, and a two-stage training recipe that teaches the text-to-MIDI mapping while preserving the original parameter structure, which allows the vLLM library to be used directly for accelerated inference, improving both generation speed and quality.

Link: https://arxiv.org/abs/2511.03942
Authors: Shih-Lun Wu,Yoon Kim,Cheng-Zhi Anna Huang
Affiliations: Massachusetts Institute of Technology (麻省理工学院)
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: To appear at NeurIPS 2025 Workshop on AI for Music

Abstract:We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM’s vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the original LLM’s parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at this https URL.

[NLP-52] RLHF: A comprehensive Survey for Cultural Multimodal and Low Latency Alignment Methods

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)对齐研究中的三大关键问题:多模态对齐、文化公平性以及低延迟优化。其解决方案的核心在于系统性地梳理和对比当前主流的对齐算法,如近端策略优化(Proximal Policy Optimization, PPO)、直接偏好优化(Direct Preference Optimization, DPO)和广义相对策略优化(Generalized Relative Policy Optimization, GRPO),并通过深入分析这些方法在新场景下的演进与创新,为构建更鲁棒、高效且公平的人工智能系统提供可操作的技术路线图。

链接: https://arxiv.org/abs/2511.03939
作者: Raghav Sharma,Manan Mehta,Sai Tiger Raina
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is the standard for aligning Large Language Models (LLMs), yet recent progress has moved beyond canonical text-based methods. This survey synthesizes the new frontier of alignment research by addressing critical gaps in multi-modal alignment, cultural fairness, and low-latency optimization. To systematically explore these domains, we first review foundational algorithms, including PPO, DPO, and GRPO, before presenting a detailed analysis of the latest innovations. By providing a comparative synthesis of these techniques and outlining open challenges, this work serves as an essential roadmap for researchers building more robust, efficient, and equitable AI systems.

[NLP-53] he Human Flourishing Geographic Index: A County-Level Dataset for the United States 2013–2023

【Quick Read】: This paper addresses the limited spatial and temporal resolution of existing measures of human flourishing, aiming to characterize societal well-being in finer detail. The key to the solution is the Human Flourishing Geographic Index (HFGI), built by analyzing roughly 2.6 billion geolocated U.S. tweets (2013-2023) with fine-tuned large language models that classify text into 48 indicators aligned with Harvard's Global Flourishing Study framework, plus attitudes toward migration and perceptions of corruption. The dataset quantifies flourishing-related discourse at monthly and yearly county and state levels, validated to confirm that the indicators accurately represent the underlying constructs and show the expected correlations with established measures.

Link: https://arxiv.org/abs/2511.03915
Authors: Stefano M. Iacus,Devika Jain,Andrea Nasuto,Giuseppe Porro,Marcello Carammia,Andrea Vezzulli
Affiliations: Harvard University (哈佛大学); University of Liverpool (利物浦大学); University of Insubria (因苏布里亚大学); University of Catania (卡塔尼亚大学)
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Applications (stat.AP)
Comments:

Abstract:Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard’s Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.

[NLP-54] Context informs pragmatic interpretation in vision-language models NEURIPS2025

【Quick Read】: This paper evaluates whether generative AI can perform context-sensitive pragmatic reasoning in multi-turn linguistic environments, using iterated reference games as the task paradigm and comparing humans with vision-language models under different contextual conditions (amount, order, and relevance of context). The key of the approach is systematically manipulating structured context variables: with relevant context, model performance rises sharply over trials, indicating some capacity for incremental learning and pragmatic adaptation; yet even so, in few-shot games with abstract referents, models still fall short of human performance, exposing the limits of current machine learning models in complex pragmatic understanding.

Link: https://arxiv.org/abs/2511.03908
Authors: Alvin Wei Ming Tan,Ben Prystawski,Veronica Boyce,Michael C. Frank
Affiliations: Stanford University (斯坦福大学)
Subjects: Computation and Language (cs.CL)
Comments: Accepted at CogInterp Workshop, NeurIPS 2025

Abstract:Iterated reference games - in which players repeatedly pick out novel referents using language - present a test case for agents’ ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.

[NLP-55] GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation

【Quick Read】: This paper addresses the persistent hallucination problem in large language models (LLMs), i.e., generated content that contradicts fact. Existing approaches mostly rely on external knowledge sources (structured databases or knowledge graphs), but prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. The key to the solution is a decoding-time method, Graph-Retrieved Adaptive Decoding (GRAD): a sparse token-transition graph is built from a small retrieved corpus in a single forward pass, and during decoding the graph-retrieved token scores are max-normalized and adaptively fused with the model's native probabilities, favoring high-evidence continuations while preserving fluency. Without any retraining, this steers generation toward more truthful and verifiable outputs.

Link: https://arxiv.org/abs/2511.03900
Authors: Manh Nguyen,Sunil Gupta,Dai Do,Hung Le
Affiliations: Deakin University (迪肯大学)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness compared to greedy decoding, while attaining the highest truth–informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.
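A minimal sketch of the decoding-time fusion: transition counts harvested from the retrieved corpus act as "graph evidence" that is max-normalized and mixed into the model's next-token distribution. The probability-space mixing with a fixed weight `alpha` is one simple reading of "adaptively fused", not necessarily the paper's exact rule.

```python
import torch

def grad_fuse(model_logits, counts, prev_token, alpha=0.3):
    """counts: dict mapping prev_token -> {next_token: corpus count}."""
    graph = torch.zeros_like(model_logits)
    for tok, c in counts.get(prev_token, {}).items():
        graph[tok] = float(c)
    if graph.max() > 0:
        graph = graph / graph.max()           # max-normalize the evidence
    probs = torch.softmax(model_logits, dim=-1)
    fused = (1 - alpha) * probs + alpha * graph / graph.sum().clamp_min(1e-9)
    return fused / fused.sum()                # renormalized distribution
```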

[NLP-56] Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens

【Quick Read】: This paper addresses the neglect of quality, in particular gender representation and the propagation of harmful content, when machine translation (MT) datasets for low-resourced languages (Afan Oromo, Amharic, and Tigrinya) are built with an emphasis on scale. The key of the approach is a systematic audit of gender bias in these datasets, covering person names, the grammatical gender of verbs, and stereotypical depictions, which shows that large quantity does not guarantee quality; the authors call for quality controls early in data collection to reduce discriminatory depictions of women and improve the social fairness of the resulting models.

Link: https://arxiv.org/abs/2511.03880
Authors: Hellina Hailu Nigatu,Bethelhem Yemane Mamo,Bontu Fufa Balcha,Debora Taye Tesfaye,Elbethel Daniel Zewdie,Ikram Behiru Nesiru,Jitu Ewnetu Hailu,Senait Mengesha Yayo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Paper Under Review

Abstract:As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages–Afan Oromo, Amharic, and Tigrinya, with a focus on the gender representation in the datasets. Our findings demonstrate that while training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender–in names of persons, the grammatical gender of verbs, and in stereotypical depictions in the datasets. Further, we found harmful and toxic depictions against women, which were more prominent for the language with the largest amount of data, underscoring that quantity does not guarantee quality. We hope that our work inspires further inquiry into the datasets collected for low-resourced languages and prompts early mitigation of harmful content. WARNING: This paper contains discussion of NSFW content that some may find disturbing.

[NLP-57] Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification

【Quick Read】: This paper addresses the inefficiency of large language models (LLMs) on multi-label text classification, especially for short-text inference, where the traditional approach of generating all labels in one pass is computationally heavy and slow. The key to the solution is reformulating the multi-label task as a series of independent binary decisions (dichotomic yes/no queries) and exploiting prefix caching to substantially improve inference efficiency without loss of accuracy. Combined with LLM-to-SLM distillation, in which a strong annotator model (DeepSeek-V3) produces high-quality multi-label annotations used to fine-tune smaller models, the approach enables efficient deployment without sacrificing performance.

Link: https://arxiv.org/abs/2511.03830
Authors: Mikołaj Langner,Jan Eliasz,Ewa Rudnicka,Jan Kocoń
Affiliations: Wroclaw Tech (弗罗茨瓦夫科技大学)
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 8 figures

Abstract:We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single structured response, each target dimension is queried independently, which, combined with a prefix caching mechanism, yields substantial efficiency gains for short-text inference without loss of accuracy. To demonstrate the approach, we focus on affective text analysis, covering 24 dimensions including emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator model (DeepSeek-V3) provides multiple annotations per text, which are aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B). The fine-tuned models show significant improvements over zero-shot baselines, particularly on the dimensions seen during training. Our findings suggest that decomposing multi-label classification into dichotomic queries, combined with distillation and cache-aware inference, offers a scalable and effective framework for LLM-based classification. While we validate the method on affective states, the approach is general and applicable across domains.
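A minimal sketch of dichotomic prompting: one shared prefix (the text) with many independent yes/no suffixes, so that an inference engine with prefix caching can reuse the prefix KV cache across queries. `call_llm` is a placeholder client, and the four dimensions listed stand in for the 24 affective dimensions used in the paper.

```python
DIMENSIONS = ["joy", "anger", "sadness", "positive sentiment"]  # ... 24 total

def classify(text, call_llm):
    prefix = f"Text: {text}\n"       # identical across queries -> cacheable
    labels = {}
    for dim in DIMENSIONS:
        reply = call_llm(prefix + f"Does the text express {dim}? "
                                  "Answer yes or no.")
        labels[dim] = reply.strip().lower().startswith("y")
    return labels
```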

[NLP-58] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models NEURIPS2025

【Quick Read】: This paper addresses the cost and limits of aligning large language models (LLMs) with human values: supervised fine-tuning (SFT) is computationally expensive and suboptimal, while inference-time Best-of-N sampling achieves strong alignment only at practically infeasible compute. The key to the solution is STARS (Segment-level Token Alignment with Rejection Sampling), a decoding-time algorithm that steers generation by iteratively sampling, scoring, and rejecting or accepting fixed-size token segments, enabling early path correction, substantially better computational efficiency, and stronger alignment quality.

Link: https://arxiv.org/abs/2511.03827
Authors: Mohammad Atif Quamar,Mohammad Areeb,Mikhail Kuznetsov,Muslum Ozgur Ozmen,Z. Berkay Celik
Affiliations: Purdue University (普渡大学); Amazon (亚马逊); Arizona State University (亚利桑那州立大学)
Subjects: Computation and Language (cs.CL)
Comments: Presented at the 2nd Workshop on Frontiers in Probabilistic Inference: Sampling Meets Learning (NeurIPS 2025)

Abstract:Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
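A minimal sketch of segment-level rejection sampling: propose a fixed-size segment, score the partial text with a reward model, and resample the segment if it falls below a threshold. `sample_segment` and `reward` are placeholders for the generator and the alignment reward model; the accept/retry policy is an illustrative simplification of STARS.

```python
def stars_decode(prompt, sample_segment, reward, seg_len=16,
                 max_tries=4, threshold=0.0, max_segments=32):
    text = prompt
    for _ in range(max_segments):
        best, best_r = None, float("-inf")
        for _ in range(max_tries):
            seg = sample_segment(text, seg_len)   # propose seg_len tokens
            r = reward(text + seg)
            if r >= threshold:                    # accept immediately
                best = seg
                break
            if r > best_r:                        # remember best fallback
                best, best_r = seg, r
        text += best
        if best.endswith("<eos>"):                # assumes text-space segments
            break
    return text
```

Because rejection happens every `seg_len` tokens rather than after a full sequence, a bad trajectory is corrected early instead of being re-rolled wholesale as in Best-of-N.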

[NLP-59] How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis NDSS2025

【Quick Read】: This paper addresses the underexplored problem of tokenization for assembly code, in particular how to optimize tokenization models for low-level code analysis. The core challenge is that existing NLP tokenization methods are not adapted to the distinctive syntax and semantics of assembly, hurting vocabulary size, semantic coverage, and downstream performance. The key to the solution is a systematic evaluation of NLP tokenization models (e.g., BPE- and WordPiece-style tokenizers) and parameter choices such as vocabulary size, together with assembly-tailored preprocessing customization and pre-tokenization rules; extrinsic performance is validated on function signature prediction, a critical binary code analysis task. The study reveals complex trade-offs between intrinsic tokenizer metrics (tokenization efficiency, vocabulary compression, representational fidelity) and extrinsic task performance, providing empirical grounding for building more robust, scalable Natural Language Model (NLM)-based binary analysis pipelines.
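
As a concrete illustration of the kind of intrinsic comparison the paper runs, the sketch below trains BPE tokenizers with different vocabulary sizes on a toy assembly corpus using the Hugging Face `tokenizers` library and measures tokens per instruction (the corpus and the naive whitespace pre-tokenization rule are simplifications, not the paper's tailored preprocessing):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Tiny stand-in for a corpus of disassembled instructions.
corpus = ["mov eax, ebx", "push rbp", "mov rbp, rsp", "sub rsp, 0x20",
          "call 0x401000", "xor eax, eax", "pop rbp", "ret"] * 100

for vocab_size in (64, 256, 1024):
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()  # a deliberately naive pre-tokenization rule
    tok.train_from_iterator(corpus, BpeTrainer(vocab_size=vocab_size,
                                               special_tokens=["[UNK]"]))
    n_tokens = sum(len(tok.encode(ins).tokens) for ins in corpus)
    print(f"vocab={vocab_size}: {n_tokens / len(corpus):.2f} tokens/instruction")
```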

Link: https://arxiv.org/abs/2511.03825
Authors: Ahmed Mostafa,Raisul Arefin Nahid,Samuel Mulder
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Publication Notice. This paper was published in the BAR 2025 Workshop (with NDSS 2025) and is for research and educational use. Copyright © 2025 Internet Society. All rights reserved. Personal/classroom reproduction is permitted with this notice and full paper citation. All other uses, including commercial, require prior written permission from the Internet Society.
Journal reference: https://www.ndss-symposium.org/wp-content/uploads/bar2025-final13.pdf (DOI: 10.14722/bar.2025.23013)

Abstract:Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction – a critical problem in binary code analysis. To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.

[NLP-60] PLLuM: A Family of Polish Large Language Models

【Quick Read】: This paper addresses the English-centric focus of current LLM development and the resulting lack of high-quality, transparent, and culturally relevant foundation models for Polish. The key to the solution is PLLuM, the largest open-source family of Polish foundation models, built on: (1) a 140-billion-token Polish pre-training corpus; (2) a 77k custom instruction dataset and a 100k preference-optimization dataset for alignment; and (3) a Responsible AI framework integrating strict data governance with a hybrid module for output correction and safety filtering. Beyond advancing Polish NLP, the release aims to foster open research and strengthen sovereign AI technology in Poland.

Link: https://arxiv.org/abs/2511.03823
Authors: Jan Kocoń,Maciej Piasecki,Arkadiusz Janz,Teddy Ferdinan,Łukasz Radliński,Bartłomiej Koptyra,Marcin Oleksy,Stanisław Woźniak,Paweł Walkowiak,Konrad Wojtasik,Julia Moska,Tomasz Naskręt,Bartosz Walkowiak,Mateusz Gniewkowski,Kamil Szyc,Dawid Motyka,Dawid Banach,Jonatan Dalasiński,Ewa Rudnicka,Bartłomiej Alberski,Tomasz Walkowiak,Aleksander Szczęsny,Maciej Markiewicz,Tomasz Bernaś,Hubert Mazur,Kamil Żyta,Mateusz Tykierko,Grzegorz Chodak,Tomasz Kajdanowicz,Przemysław Kazienko,Agnieszka Karlińska,Karolina Seweryn,Anna Kołos,Maciej Chrabąszcz,Katarzyna Lorenc,Aleksandra Krasnodębska,Artur Wilczek,Katarzyna Dziewulska,Paula Betscher,Zofia Cieślińska,Katarzyna Kowol,Daria Mikoś,Maciej Trzciński,Dawid Krutul,Marek Kozłowski,Sławomir Dadas,Rafał Poświata,Michał Perełkiewicz,Małgorzata Grębowiec,Maciej Kazuła,Marcin Białas,Roman Roszko,Danuta Roszko,Jurgita Vaičenonienė,Andrius Utka,Paweł Levchuk,Paweł Kowalski,Irena Prawdzic-Jankowska,Maciej Ogrodniczuk,Monika Borys,Anna Bulińska,Wiktoria Gumienna,Witold Kieraś,Dorota Komosińska,Katarzyna Krasnowska-Kieraś,Łukasz Kobyliński,Martyna Lewandowska,Marek Łaziński,Mikołaj Łątkowski,Dawid Mastalerz,Beata Milewicz,Agnieszka Anna Mykowiecka,Angelika Peljak-Łapińska,Sandra Penno,Zuzanna Przybysz,Michał Rudolf,Piotr Rybak,Karolina Saputa,Aleksandra Tomaszewska,Aleksander Wawer,Marcin Woliński,Joanna Wołoszyn,Alina Wróblewska,Bartosz Żuk,Filip Żarnecki,Konrad Kaczyński,Anna Cichosz,Zuzanna Deckert,Monika Garnys,Izabela Grabarczyk,Wojciech Janowski,Sylwia Karasińska,Aleksandra Kujawiak,Piotr Misztela,Maria Szymańska,Karolina Walkusz,Igor Siek,Jakub Kwiatkowski,Piotr Pęzik
Institutions: Politechnika Wrocławska; National Information Processing Institute; OPI; University of Warsaw; Vilnius University; Institute of Computer Science, Polish Academy of Sciences; University of Lodz; University of Lodz, Faculty of Philology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 83 pages, 19 figures

Abstract:Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models’ architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.

[NLP-61] GRDD: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

【Quick Read】: This paper tackles the poor handling of Greek dialects by current LLMs, in particular the absence of a high-quality, diverse dialectal dataset to support localization and generalization. The key to the solution is GRDD+, an extended Greek dialectal corpus covering ten varieties (Cretan, Cypriot, Pontic, and Northern Greek, plus the newly added Greco-Corsican, Griko, Maniot, Heptanesian, Tsakonian, and Katharevusa Greek) totaling 6,374,939 words, the largest and most varied such resource to date. Fine-tuning three architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) on it and comparing them with frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5) demonstrates the value of good-quality dialectal data for dialect-specific understanding and generation.

Link: https://arxiv.org/abs/2511.03772
Authors: Stergios Chatzikyriakidis,Dimitris Papadakis,Sevasti-Ioanna Papaioannou,Erofili Psaltaki
Institutions: University of Crete; University of Athens; University of Turku
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with a total size of 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).

[NLP-62] TextualVerifier: Verify TextGrad Step-by-Step

【Quick Read】: This paper addresses TextGrad's lack of a self-verification mechanism in text-based automatic differentiation, i.e., no way to ensure the validity of the reasoning behind text-based decisions. The key to the solution is TextualVerifier, a framework that runs a four-stage workflow of chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation, using LLMs to self-verify both reasoning steps and optimization results. Verification requires no numerical gradients and integrates non-invasively with TextGrad at the loss-function and optimization-result verification stages, improving reasoning validity and optimization reliability.
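
A minimal sketch of the variant-generation and majority-voting stages (the prompts, verdict format, and the `llm_judge` callable are hypothetical; stage one, chain-of-thought decomposition, is assumed to have already split the reasoning into `steps`):

```python
from collections import Counter

def verify_step(step: str, llm_judge, n_variants: int = 5) -> bool:
    # Ask several independently phrased judges, then aggregate by majority vote.
    prompts = [f"(phrasing {i}) Is this reasoning step logically valid? "
               f"Answer 'valid' or 'invalid'.\nStep: {step}" for i in range(n_variants)]
    votes = [llm_judge(p).strip().lower() for p in prompts]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict == "valid" and count > n_variants // 2

def verify_chain(steps, llm_judge):
    # Consensus aggregation over the whole chain: flag any invalid step.
    results = [verify_step(s, llm_judge) for s in steps]
    return all(results), results

# Toy judge so the sketch runs end to end.
toy_judge = lambda p: "invalid" if "2 + 2 = 5" in p else "valid"
print(verify_chain(["x = 2 + 2 = 4", "so x is even", "2 + 2 = 5"], toy_judge))
```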

Link: https://arxiv.org/abs/2511.03739
Authors: Eugenius Mario Situmorang,Adila Alfa Krisnadhi,Ari Wibisono
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at both the loss function and optimization result verification stages. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into the TextGrad loss function yields a 2.2 percentage point gain from 68.2 to 70.4 percent with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier variants yield 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.

[NLP-63] Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs

【Quick Read】: This paper addresses the open challenge that the personality traits implicitly exhibited by LLMs during generation are hard to control or align reliably, i.e., how to steer model behavior at generation time toward controllable, personality-conditioned output. The key to the solution is a novel pipeline: it first uses the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) as psychological constructs and extracts trait-related representations from transformer hidden-state activations; it then applies low-rank subspace discovery to identify, for each model architecture, the layers best aligned with each trait; finally, it builds a flexible steering framework with dynamic layer selection that turns these low-rank personality directions into actionable control via careful perturbations, enabling precise control of trait expression in generated text without degrading fluency, variance, or general capabilities.
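
The steering step itself reduces to adding a direction vector to a chosen layer's hidden states at inference time. A minimal PyTorch sketch (the trait direction is assumed to come from the paper's low-rank subspace discovery; the layer choice and scale `alpha` are the tunable parts):

```python
import torch

def add_trait_steering(layer: torch.nn.Module, direction: torch.Tensor, alpha: float):
    """Shift a layer's hidden states along a unit-norm trait direction."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return tuples; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered,) + tuple(output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)  # call .remove() on the handle to undo

# Toy demo on a plain linear layer standing in for a transformer block.
layer = torch.nn.Linear(16, 16)
handle = add_trait_steering(layer, torch.randn(16), alpha=2.0)
print(layer(torch.randn(1, 16)).shape)  # the forward pass now includes the shift
handle.remove()
```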

Link: https://arxiv.org/abs/2511.03738
Authors: Pranav Bhandari,Nicolas Fay,Sanjeevan Selvaganapathy,Amitava Datta,Usman Naseem,Mehwish Nasim
Institutions: Network Analysis and Social Influence Modeling (NASIM) Lab; School of Physics Maths and Computing; The University of Western Australia; School of Psychological Science; School of Computing; Macquarie University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. Effective mechanisms for behavioural manipulation of the model during generation remain a critical gap in the literature. Personality-aware LLMs hold a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored and requires further investigation. Moreover, it is intriguing to understand and study the use of these representations to steer the models’ behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modeling human personality, applies low-rank subspace discovery methods, and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting the fluency, variance and general capabilities, helping to bridge the gap between psychological theory and practical model alignment.

[NLP-64] MimiTalk: Revolutionizing Qualitative Research with Dual-Agent AI

【Quick Read】: This paper addresses the difficulty of collecting high-quality, scalable, and ethically sound conversational data for social science research, where traditional human interviews are inefficient, costly, and prone to interviewer bias. The key to the solution is MimiTalk, a dual-agent constitutional AI framework in which a supervisor model handles strategic ethical and quality oversight while a conversational model focuses on generating interview questions. Three empirical studies show that the design reduces interview anxiety, maintains conversational coherence, and surpasses human interviews in information richness, coherence, and stability while still eliciting candid views on sensitive topics, enabling replicable, scalable, and quality-controlled qualitative research through human-AI collaboration.
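
The dual-agent division of labor can be pictured as a loop in which the supervisor screens every candidate question before it reaches the participant (the prompts and the two `*_call` stubs below are illustrative assumptions, not MimiTalk's actual prompts):

```python
def interview_turn(history, interviewer_call, supervisor_call, max_revisions=3):
    """One turn: the conversational agent proposes, the supervisor vets."""
    question = interviewer_call(f"Given the interview so far:\n{history}\n"
                                "Propose the next open-ended question.")
    for _ in range(max_revisions):
        verdict = supervisor_call("Check this question for ethics, bias and "
                                  f"relevance. Reply OK or a revision note.\n{question}")
        if verdict.strip() == "OK":
            return question
        question = interviewer_call(f"Revise the question: {question}\n"
                                    f"Supervisor feedback: {verdict}")
    return question  # fall back to the last revision

# Toy stubs so the sketch runs.
ask = lambda p: "How did that experience make you feel?"
vet = lambda p: "OK"
print(interview_turn("P: I moved abroad last year.", ask, vet))
```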

Link: https://arxiv.org/abs/2511.03731
Authors: Fengming Liu,Shubin Yu
Institutions: University College London; HEC Paris
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages

Abstract:We present MimiTalk, a dual-agent constitutional AI framework designed for scalable and ethical conversational data collection in social science research. The framework integrates a supervisor model for strategic oversight and a conversational model for question generation. We conducted three studies: Study 1 evaluated usability with 20 participants; Study 2 compared 121 AI interviews to 1,271 human interviews from the MediaSum dataset using NLP metrics and propensity score matching; Study 3 involved 10 interdisciplinary researchers conducting both human and AI interviews, followed by blind thematic analysis. Results across studies indicate that MimiTalk reduces interview anxiety, maintains conversational coherence, and outperforms human interviews in information richness, coherence, and stability. AI interviews elicit technical insights and candid views on sensitive topics, while human interviews better capture cultural and emotional nuances. These findings suggest that dual-agent constitutional AI supports effective human-AI collaboration, enabling replicable, scalable and quality-controlled qualitative research.

[NLP-65] Measuring Teaching with LLMs

【Quick Read】: This paper addresses the persistent difficulty of objective, scalable measurement of teaching quality, where general-purpose LLMs have proven unreliable at applying complex, authentic classroom observation instruments. The key to the solution is custom LLMs built on sentence-level embeddings, an architecture better suited to the long-form, interpretive nature of classroom transcripts than conventional subword tokenization, trained under a data-efficient regime that prevents overfitting. The resulting models reach human-level and even super-human agreement (correlations with expert ratings above 0.65, exceeding the average human-human rater correlation); stronger models attribute more score variance to lesson-level features than to isolated utterances, challenging single-turn annotation paradigms; and aggregate scores align with teacher value-added measures, establishing a reliable, scalable new methodology for AI-driven instructional assessment.
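
A minimal sketch of the sentence-embedding approach, assuming a pretrained encoder from the `sentence-transformers` library and a simple ridge regressor in place of the paper's custom architecture (the transcripts and scores here are toy data):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

def embed_transcript(utterances):
    # One vector per sentence, mean-pooled into a lesson-level representation.
    return encoder.encode(utterances).mean(axis=0)

# Toy data: two "transcripts" with expert rubric scores.
transcripts = [["Why do you think the answer is 12?", "Talk with your partner."],
               ["Copy the definition from the board.", "Be quiet, please."]]
scores = np.array([3.0, 1.0])

X = np.stack([embed_transcript(t) for t in transcripts])
model = Ridge(alpha=1.0).fit(X, scores)
print(model.predict(X))
```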

Link: https://arxiv.org/abs/2510.22968
Authors: Michael Hardy
Institutions: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited for the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance, with correlations to expert human ratings above 0.65, surpassing the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models (those better aligned with human judgments) attribute a larger share of score variation to lesson-level features rather than isolated utterances, challenging the sufficiency of single-turn annotation paradigms. Finally, to assess external validity, we find that aggregate model scores align with teacher value-added measures, indicating they are capturing features relevant to student learning. However, this trend does not hold at the individual item level, suggesting that while the models learn useful signals, they have not yet achieved full generalization. This work establishes a viable and powerful new methodology for AI-driven instructional measurement, offering a path toward providing scalable, reliable, and valid feedback for educator development.

[NLP-66] Sub-exponential Growth in Online Word Usage: A Piecewise Power-Law Model

【Quick Read】: This paper challenges the adequacy of traditional S-shaped models of social diffusion (such as the logistic curve), which overlook sub-exponential growth, a slower-than-exponential pattern known from epidemiology, in broader social phenomena. The key to the solution is a piecewise power-law model that describes complex multi-phase growth curves with only a few parameters. Systematically validated on roughly one billion Japanese blog articles linked to Wikipedia vocabulary and on multilingual web search trends, the analysis finds that about 55% of diffusion events grow smoothly without abrupt jumps, most of them well fitted by one or two power-law segments; the mode of the shape parameter α is near 0.5, indicating that sub-exponential growth is a pervasive mechanism of social diffusion, and α can be read as an index of individuals' preference for outward-oriented communication, yielding a unified, comparable quantitative framework for social diffusion dynamics.
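
For intuition, one segment of a piecewise power-law can be written as below (our illustrative reading of the model; the paper's exact parameterization of the growth rate R, duration T, and shape α may differ):

```latex
% One segment of a piecewise power-law growth curve (illustrative form):
C(t) \;\approx\; A \,(t - t_0)^{\alpha}, \qquad t_0 \le t \le t_0 + T,
% where T is the segment duration and the shape parameter \alpha controls
% curvature: \alpha = 1 is linear growth, \alpha < 1 (e.g. the reported mode
% near 0.5) is strongly decelerating, and any finite \alpha grows more
% slowly than an exponential, i.e. sub-exponential growth.
```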

Link: https://arxiv.org/abs/2511.04106
Authors: Hayafumi Watanabe
Institutions: Seijo University; The Institute of Statistical Mathematics
Subjects: Physics and Society (physics.soc-ph); Computation and Language (cs.CL); Computers and Society (cs.CY); Applications (stat.AP)
Comments:

Abstract:The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -a slower than exponential pattern known in epidemiology- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of the 2,965 selected items reveals that about 55% (1,625 items) were found to have no abrupt jumps and were well captured by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter alpha was near 0.5, indicating prevalent sub-exponential growth; (ii) the ultimate diffusion scale is primarily determined by the growth rate R, with minor contributions from alpha or the duration T; and (iii) alpha showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model distinguishing outward contact with strangers from inward interaction within their community suggests that alpha can be interpreted as an index of the preference for outward-oriented communication. These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.

Computer Vision

[CV-0] Carousel: A High-Resolution Dataset for Multi-Target Automatic Image Cropping

【Quick Read】: This paper addresses multi-target automatic image cropping: producing multiple, distinct, aesthetically appealing crops of one photograph, whereas existing work has largely focused on a single crop. The key to the solution is an image-partitioning algorithm used as a pre-processing step in combination with existing single-crop models, yielding multiple crops in a unified framework and improving both the diversity and the human-perceived quality of the results.

Link: https://arxiv.org/abs/2511.04680
Authors: Rafe Loya,Andrew Hamara,Benjamin Estell,Benjamin Kilpatrick,Andrew C. Freeman
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Datasets track of VCIP 2025

Abstract:Automatic image cropping is a method for maximizing the human-perceived quality of cropped regions in photographs. Although several works have proposed techniques for producing singular crops, little work has addressed the problem of producing multiple, distinct crops with aesthetic appeal. In this paper, we motivate the problem with a discussion on modern social media applications, introduce a dataset of 277 relevant images and human labels, and evaluate the efficacy of several single-crop models with an image partitioning algorithm as a pre-processing step. The dataset is available at this https URL.

[CV-1] GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction

【Quick Read】: This paper addresses safe and natural physical interaction for humanoid robots in human-centered environments, where existing reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces, producing stiff interactions. The key to the solution is GentleHumanoid, whose core is a unified spring-based impedance formulation integrated into a whole-body motion-tracking policy: it models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data), enforces kinematically consistent forces across the shoulder, elbow, and wrist, and adds task-adjustable force thresholds for safety. In simulation and on the Unitree G1 humanoid, across gentle hugging, sit-to-stand assistance, and safe object handling, the policy consistently reduces peak contact forces while maintaining task success, a step toward humanoids that interact naturally and safely with people.
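
The unified spring formulation can be pictured with a textbook spring-damper law (a generic impedance sketch under our own assumptions, not the paper's trained controller): the same expression yields restoring forces when the limb presses into a surface and guiding forces when a sampled human contact pulls it along.

```python
import numpy as np

def spring_contact_force(x, v, x_anchor, k=80.0, d=8.0, f_max=30.0):
    """Generic spring-damper impedance: F = k (x_anchor - x) - d v, with a
    task-adjustable force cap for safety (all gains here are illustrative)."""
    f = k * (np.asarray(x_anchor, float) - np.asarray(x, float)) - d * np.asarray(v, float)
    norm = np.linalg.norm(f)
    if norm > f_max:          # clamp the peak contact force
        f *= f_max / norm
    return f

# Resistive contact: the wrist is pressed 5 cm into a surface and pushed back out.
print(spring_contact_force(x=[0.05, 0, 0], v=[0, 0, 0], x_anchor=[0, 0, 0]))
# Guiding contact: the anchor is a waypoint sampled from human motion data.
print(spring_contact_force(x=[0, 0, 0], v=[0, 0, 0], x_anchor=[0.1, 0, 0]))
```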

Link: https://arxiv.org/abs/2511.04679
Authors: Qingzhou Lu,Yao Feng,Baiyu Shi,Michael Piseno,Zhenan Bao,C. Karen Liu
Institutions: Stanford University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Home page: this https URL

Abstract:Humanoid robots are expected to operate in human-centered environments where safe and natural physical interaction is essential. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-augmented approaches are typically restricted to base or end-effector control and focus on resisting extreme forces rather than enabling compliance. We introduce GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. At its core is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data). This formulation ensures kinematically consistent forces across the shoulder, elbow, and wrist, while exposing the policy to diverse interaction scenarios. Safety is further supported through task-adjustable force thresholds. We evaluate our approach in both simulation and on the Unitree G1 humanoid across tasks requiring different levels of compliance, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, our policy consistently reduces peak contact forces while maintaining task success, resulting in smoother and more natural interactions. These results highlight a step toward humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.

[CV-2] racking and Understanding Object Transformations NEURIPS2025

【Quick Read】: This paper targets a failure mode of existing trackers: real-world objects undergo state transformations (an apple being cut into pieces, a butterfly emerging from its cocoon) whose drastic appearance changes cause tracking loss. The core challenge is to keep tracking through such transformations while detecting and describing the state changes. The key to the solution is TubeletGraph, a zero-shot system that recovers potentially overlooked tracks, decides whether to integrate them based on semantic and proximity priors, then reasons about the added tracks and builds a temporal state graph describing each observed transformation, achieving both precise tracking and a deeper understanding of object transformations.

Link: https://arxiv.org/abs/2511.04678
Authors: Yihong Sun,Xinyu Yang,Jennifer J. Sun,Bharath Hariharan
Institutions: Cornell University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025

Abstract:Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at this https URL.

[CV-3] InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation NEURIPS2025

【Quick Read】: This paper asks how to jointly model spatial and temporal dependencies for high-resolution image and dynamic video generation while improving efficiency and quality. The key to the solution is InfinityStar, a unified spacetime autoregressive framework that uses purely discrete modeling to capture spatial and temporal dependencies within a single architecture, naturally supporting text-to-image, text-to-video, image-to-video, and long interactive video synthesis. It scores 83.74 on VBench, outperforming all autoregressive models by large margins and even surpassing diffusion competitors such as HunyuanVideo; without extra optimizations it generates a 5-second 720p video roughly 10x faster than leading diffusion-based methods, and it is the first discrete autoregressive video generator capable of industrial-grade 720p videos.

Link: https://arxiv.org/abs/2511.04675
Authors: Jinlai Liu,Jian Han,Bin Yan,Hui Wu,Fengda Zhu,Xing Wang,Yi Jiang,Bingyue Peng,Zehuan Yuan
Institutions: ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025 Oral

Abstract:We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

[CV-4] X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

【Quick Read】: This paper addresses the embodiment gap between human demonstrations and robot execution: directly retargeting human hand motion kinematically yields actions that are physically infeasible for the robot, degrading policy learning. The key insight exploits the forward diffusion process: as noise is progressively added, low-level execution differences fade while high-level task intent is preserved. X-Diffusion trains a classifier to judge whether a noisy action came from a human or a robot, and incorporates a human action into policy training only at noise levels high enough that the classifier cannot discern its embodiment; there, human actions provide coarse task guidance, while robot-feasible actions supervise fine-grained denoising at low noise levels. This avoids learning dynamically infeasible motions while extracting the maximum task information from human data.
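
The gating idea can be sketched in a few lines (a schematic reconstruction: the classifier interface, noise schedule, and threshold are our assumptions): a human action only contributes a training target at noise levels where an embodiment classifier can no longer tell it apart from robot data.

```python
import numpy as np

def add_noise(action, t, T=100):
    # Standard DDPM-style forward noising with a toy linear schedule.
    abar = 1.0 - t / T
    return np.sqrt(abar) * action + np.sqrt(1 - abar) * np.random.randn(*action.shape)

def min_usable_noise_level(action, classifier, levels, conf_threshold=0.6):
    """Smallest diffusion timestep at which the embodiment classifier is no
    longer confident the (noised) action came from a human."""
    for t in levels:                                  # low noise -> high noise
        noised = add_noise(action, t)
        if classifier(noised, t) < conf_threshold:    # P(human | noised action)
            return t                                  # supervise only at t and above
    return None                                       # never indistinguishable: drop it

# Toy classifier whose confidence decays as noise grows, just to run the sketch.
toy_clf = lambda a, t: max(0.0, 0.9 - 0.01 * t)
print(min_usable_noise_level(np.zeros(7), toy_clf, levels=range(0, 100, 10)))
```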

Link: https://arxiv.org/abs/2511.04671
Authors: Maximus A. Pace,Prithwish Dan,Chuanruo Ning,Atiksh Bhardwaj,Audrey Du,Edward W. Duan,Wei-Chiu Ma,Kushal Kedia
Institutions: Cornell University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstrations provide valuable motion cues about how to manipulate and interact with objects. Our key idea is to exploit the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. We present X-Diffusion, a principled framework for training diffusion policies that maximally leverages human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or robot. Then, a human action is incorporated into policy training only after adding sufficient noise such that the classifier cannot discern its embodiment. Actions consistent with robot execution supervise fine-grained denoising at low noise levels, while mismatched human actions provide only coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches degrades policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at this https URL.

[CV-5] Cambrian-S: Towards Spatial Supersensing in Video

【Quick Read】: This paper argues that current multimodal systems fall short of genuine spatial sensing: reactive, task-driven architectures and brute-force context scaling do not yield deep environmental understanding. It frames spatial supersensing as four stages beyond language-only understanding: semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. The key contribution is VSI-SUPER, a two-part benchmark (VSR, long-horizon visual spatial recall; VSC, continual visual spatial counting) designed to resist brute-force context expansion and thus probe real spatial cognition. A data-scaling study (curating VSI-590K and training Cambrian-S, +30% absolute on VSI-Bench without sacrificing general capabilities) shows that scale alone is insufficient, while a proof-of-concept in which a self-supervised next-latent-frame predictor uses surprise (prediction error) to drive memory organization and event segmentation substantially outperforms leading proprietary baselines, arguing that predictive sensing, i.e., models that anticipate, select, and organize experience, is the path forward.
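
The proof-of-concept's surprise mechanism can be caricatured in a few lines (a schematic sketch under our assumptions; the actual predictor is a self-supervised next-latent-frame model): frames whose latents are well predicted are absorbed into the current event, and a spike in prediction error marks an event boundary.

```python
import numpy as np

def segment_by_surprise(latents, predict_next, threshold=2.0):
    """Split a stream of latent frames into events at prediction-error spikes."""
    events, current = [], [latents[0]]
    for prev, cur in zip(latents, latents[1:]):
        surprise = np.linalg.norm(cur - predict_next(prev))
        if surprise > threshold:          # poorly predicted -> new event
            events.append(np.stack(current))
            current = []
        current.append(cur)
    events.append(np.stack(current))
    return events

# Toy stream: identity predictor, two constant segments separated by one jump.
stream = [np.zeros(4)] * 5 + [np.ones(4) * 5] * 5
print(len(segment_by_surprise(stream, predict_next=lambda z: z)))  # -> 2 events
```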

Link: https://arxiv.org/abs/2511.04670
Authors: Shusheng Yang,Jihan Yang,Pinzhi Huang,Ellis Brown,Zihao Yang,Yue Yu,Shengbang Tong,Zihan Zheng,Yifan Xu,Muhan Wang,Daohan Lu,Rob Fergus,Yann LeCun,Li Fei-Fei,Saining Xie
Institutions: New York University; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL

Abstract:We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.

[CV-6] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

【Quick Read】: This paper addresses the weakness of multimodal language models in spatial reasoning across time and space, where obtaining diverse real-world video with precise spatial annotations remains a bottleneck. The key to the solution is SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially rich video training data, alleviating the scarcity of real-world annotations. Systematic ablations of question types, mixes, and scales identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) as most effective for transferable spatial intelligence: a 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms a larger 72B baseline and is competitive with proprietary models on rigorous real-world spatial reasoning benchmarks, demonstrating the decisive role of simulated data design and task diversity.

Link: https://arxiv.org/abs/2511.04668
Authors: Ellis Brown,Arijit Ray,Ranjay Krishna,Ross Girshick,Rob Fergus,Saining Xie
Institutions: New York University; Boston University; AllenAI; Vercept
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V – a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

[CV-7] Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions

【Quick Read】: This paper addresses the high cost, long turnaround, and poor reproducibility of real-world policy evaluation for robots manipulating deformable objects, where existing simulators fail to capture the coupled visual and physical complexity of soft-body interaction. The key to the solution is to construct soft-body digital twins from real-world videos and to render robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting, so that simulated behavior closely matches reality. On representative deformable manipulation tasks, including plush-toy packing, rope routing, and T-block pushing, simulated rollouts correlate strongly with real-world execution and reveal key behavioral patterns of learned policies, showing that physics-informed reconstruction combined with high-quality rendering enables reproducible, scalable, and accurate evaluation of manipulation policies.

Link: https://arxiv.org/abs/2511.04665
Authors: Kaifeng Zhang,Shuo Sha,Hanxiao Jiang,Matthew Loper,Hyunjong Song,Guangyan Cai,Zhuo Xu,Xiaochen Hu,Changxi Zheng,Yunzhu Li
Institutions: Columbia University; SceniX Inc.; Google DeepMind
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL

Abstract:Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: this https URL

[CV-8] Benchmark Designers Should “Train on the Test Set” to Expose Exploitable Non-Visual Shortcuts

【Quick Read】: This paper targets a pervasive problem in multimodal LLM evaluation: models can ace vision-centric benchmarks by exploiting textual biases, linguistic priors, and superficial patterns rather than genuine visual understanding, distorting reported performance. The key to the solution is a systematic diagnose-and-debias framework: a "Test-set Stress-Test" (TsT) fine-tunes an LLM via k-fold cross-validation on only the non-visual, textual inputs of the test set to expose shortcuts and assign each sample a bias score s(x), complemented by a lightweight Random Forest diagnostic on hand-crafted features; "Iterative Bias Pruning" (IBP) then filters high-bias samples to debias the benchmark. Applied to VSI-Bench, CV-Bench, MMMU, and VideoMME, the framework uncovers pervasive non-visual biases, and the resulting VSI-Bench-Debiased shows reduced non-visual solvability and a wider vision-blind performance gap than the original.
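
In spirit, the diagnostic is a standard k-fold cross-validation loop run on text-only inputs. The sketch below uses a bag-of-words classifier in place of the fine-tuned LLM (a scikit-learn stand-in under our own assumptions) so the idea is runnable at a glance:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
import numpy as np

def text_only_bias_scores(questions, answers, k=5):
    """Out-of-fold probability of the correct answer from text alone:
    a high score means the item is solvable without seeing the image."""
    y = np.asarray(answers)
    scores = np.zeros(len(questions))
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train, test in skf.split(questions, y):
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit([questions[i] for i in train], y[train])
        proba = clf.predict_proba([questions[i] for i in test])
        for row, i in zip(proba, test):
            scores[i] = row[list(clf.classes_).index(y[i])]
    return scores

def iterative_bias_pruning(questions, answers, cutoff=0.9, rounds=3):
    keep = list(range(len(questions)))
    for _ in range(rounds):  # re-score after each prune, as shortcuts shift
        s = text_only_bias_scores([questions[i] for i in keep],
                                  [answers[i] for i in keep])
        keep = [i for i, si in zip(keep, s) if si < cutoff]
    return keep

# Deliberately leaky toy data: the answer is fully predictable from the text.
qs = ["Is the chair left of the table?"] * 6 + ["How many red cubes are there?"] * 6
ys = ["yes"] * 6 + ["three"] * 6
print(text_only_bias_scores(qs, ys).round(2))  # scores near 1 flag the shortcut
```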

Link: https://arxiv.org/abs/2511.04655
Authors: Ellis Brown,Jihan Yang,Shusheng Yang,Rob Fergus,Saining Xie
Institutions: New York University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" – probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via k-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks – VSI-Bench, CV-Bench, MMMU, and VideoMME – we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.

[CV-9] Polarization-resolved imaging improves eye tracking

【Quick Read】: This paper addresses the limits of conventional eye tracking under difficult conditions such as eyelid occlusion, pupil-size variation, and eye-relief changes. The key to the solution is polarization-resolved near-infrared imaging: a polarization-enabled eye tracking (PET) system pairing a polarization-filter-array camera with a linearly polarized near-infrared illuminator exploits the polarization changes from light-tissue interaction to reveal trackable features across the sclera and gaze-informative patterns on the cornea, both largely absent in intensity-only images. Across 346 participants, convolutional neural network models trained on PET data reduce the median 95th-percentile absolute gaze error by 10-16% relative to capacity-matched intensity baselines, under nominal conditions as well as eyelid occlusion, eye-relief changes, and pupil-size variation, demonstrating a simple, robust sensing modality for future wearable devices.

Link: https://arxiv.org/abs/2511.04652
Authors: Mantas Žurauskas,Tom Bu,Sanaz Alali,Beyza Kalkanli,Derek Shi,Fernando Alamos,Gauresh Pandit,Christopher Mei,Ali Behrooz,Ramin Mirjalili,Dave Stronks,Alexander Fix,Dmitri Model
Institutions: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:

Abstract:Polarization-resolved near-infrared imaging adds a useful optical contrast mechanism to eye tracking by measuring the polarization state of light reflected by ocular tissues in addition to its intensity. In this paper we demonstrate how this contrast can be used to enable eye tracking. Specifically, we demonstrate that a polarization-enabled eye tracking (PET) system composed of a polarization–filter–array camera paired with a linearly polarized near-infrared illuminator can reveal trackable features across the sclera and gaze-informative patterns on the cornea, largely absent in intensity-only images. Across a cohort of 346 participants, convolutional neural network based machine learning models trained on data from PET reduced the median 95th-percentile absolute gaze error by 10–16% relative to capacity-matched intensity baselines under nominal conditions and in the presence of eyelid occlusions, eye-relief changes, and pupil-size variation. These results link light–tissue polarization effects to practical gains in human–computer interaction and position PET as a simple, robust sensing modality for future wearable devices.

[CV-10] NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment

【Quick Read】: This paper tackles two limitations of existing video quality assessment (VQA): full-reference (FR) metrics need clean reference videos, which are rarely available in practice, while most no-reference (NR) models depend on costly human opinion labels and are image-based, ignoring the temporal context that matters for video object detection. The key to the solution is a scalable, streaming, no-reference and opinion-unaware VQA model: trained on synthetic degradations of the DAVIS dataset, a temporally aware convolutional architecture predicts FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, with no reference needed at inference. Experiments show it outperforms an image-based baseline across diverse degradations and correlates better with FR metrics than BRISQUE, a widely used image-level baseline, underscoring the value of temporal modeling for scalable VQA in real-world vision systems.

Link: https://arxiv.org/abs/2511.04628
Authors: Kylie Cancilla,Alexander Moore,Amar Saini,Carmen Carrano
Institutions: Lawrence Livermore National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS , PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely-used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.

[CV-11] Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality

【Quick Read】: This paper addresses the lack of accurate, automated quality metrics for virtual immunohistochemistry (IHC): conventional fidelity metrics (FID, PSNR, SSIM) quantify texture and distribution similarity rather than pixel-level staining accuracy. The key to the solution is an automated, accuracy-grounded framework based on color deconvolution: segmentation masks of brown (i.e., IHC-positive) pixels are generated for real and virtual IHC across sixteen paired or unpaired image translation models, and geometric similarity metrics (Dice, IoU, Hausdorff distance) directly quantify stain accuracy without expert manual annotation. The study shows that fidelity metrics correlate poorly with stain accuracy and pathologist assessment, that paired models (PyramidPix2Pix, AdaptiveNCE) outperform unpaired diffusion- and GAN-based models, and that whole-slide image (WSI) evaluation reveals performance declines invisible in patch-based evaluation, establishing a reproducible quality standard to accelerate clinical translation of virtual IHC.
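
Once the brown (IHC-positive) masks are in hand, the stain-accuracy metrics are a few lines of array arithmetic. A minimal sketch (the color-deconvolution step itself, available for instance as `skimage.color.rgb2hed`, is assumed to have already produced the boolean masks):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient between two boolean IHC-positive masks."""
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * inter / total if total else 1.0  # both empty: perfect agreement

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Toy 4x4 masks standing in for real vs. virtual IHC brown-pixel segmentations.
real = np.array([[1,1,0,0],[1,1,0,0],[0,0,0,0],[0,0,0,0]], bool)
virt = np.array([[1,0,0,0],[1,1,0,0],[0,1,0,0],[0,0,0,0]], bool)
print(f"Dice={dice(real, virt):.3f}  IoU={iou(real, virt):.3f}")
```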

Link: https://arxiv.org/abs/2511.04615
Authors: Tushar Kataria,Shikha Dubey,Mary Bronner,Jolanta Jedrzkiewicz,Ben J. Brintz,Shireen Y. Elhabian,Beatrice S. Knudsen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge, as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated, accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Frechet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC-positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks. Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.

[CV-12] PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

【Quick Read】: This paper addresses CLIP's limited fine-grained image-text alignment, especially for long textual descriptions and precise matching to local visual regions: the token-length limit of CLIP's text encoder prevents it from exploiting the fine-grained semantics in long text, while existing work mostly enhances local perception through visual prompts without strengthening the text side. The key to the solution is PixCLIP: an automated annotation pipeline produces pixel-level grounded, long-form captions, yielding the high-quality LongGRIT dataset of nearly 1.5 million samples; CLIP's text encoder is replaced with an LLM; and a three-branch pixel-text alignment learning framework aligns image regions with textual descriptions at arbitrary granularity, achieving state-of-the-art performance in pixel-level interaction and long-text handling.

Link: https://arxiv.org/abs/2511.04601
Authors: Yicheng Xiao,Yu Chen,Haoxuan Ma,Jiale Hong,Caorui Li,Lingxiang Wu,Haiyun Guo,Jinqiao Wang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model's focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Secondly, we replace CLIP's original text encoder with the LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.

[CV-13] UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction

【Quick Read】: This paper addresses two challenges of feed-forward 3D reconstruction in autonomous driving: sparse, non-overlapping camera views provide insufficient geometry, and complex dynamic scenes make stable, high-quality reconstruction difficult. The key to the solution is UniSplat, which performs unified latent spatio-temporal fusion within a 3D latent scaffold, a structured representation built with pretrained foundation models that jointly integrates information across spatial views and temporal frames for consistent alignment. A dual-branch decoder combining point-anchored refinement with voxel-based generation produces dynamic-aware Gaussians, and a persistent memory of static Gaussians enables streaming scene completion beyond the current camera coverage, markedly improving the quality and robustness of novel view synthesis.

Link: https://arxiv.org/abs/2511.04595
Authors: Chen Shi,Shaoshuai Shi,Xiaoyang Lyu,Chunyang Liu,Kehua Sheng,Bo Zhang,Li Jiang
Institutions: The Chinese University of Hong Kong, Shenzhen; Voyager Research, Didi Chuxing; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.

[CV-14] Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

【Quick Read】: This paper addresses the heavy computation, poor deployment efficiency, and degraded perceptual representations of current vision-language-action (VLA) models, whose reliance on large-scale robot-data pretraining leads to overfitting and weak generalization. The key to the solution is Evo-1, a lightweight VLA model built on a native multimodal vision-language model (VLM) that introduces a novel cross-modulated diffusion transformer with an optimized integration module, plus a two-stage training paradigm that progressively aligns action with perception, preserving the VLM's original representations without any robot-data pretraining. With only 0.77 billion parameters, Evo-1 sets state-of-the-art results on the Meta-World and RoboTwin suites (surpassing the previous best models by 12.4% and 6.9%), reaches a competitive 94.8% on LIBERO, and attains a 78% real-world success rate with high inference frequency and low memory overhead, substantially improving the practicality and deployability of VLA models.

Link: https://arxiv.org/abs/2511.04555
Authors: Tao Lin,Yilei Zhong,Yuxin Du,Jingjing Zhang,Jiting Liu,Yinxinyu Chen,Encheng Gu,Ziyan Liu,Hongyi Cai,Yanwen Zou,Lixing Zou,Zhaoye Zhou,Gen Li,Bo Zhao
Institutions: School of AI, Shanghai Jiao Tong University; EvoMind Tech; IAAR-Shanghai; SII; Carnegie Mellon University; University of Cambridge; Nanyang Technological University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Github: this https URL

Abstract:Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference. Moreover, most training paradigms often degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suite, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.

[CV-15] Learning from Single Timestamps: Complexity Estimation in Laparoscopic Cholecystectomy

【Quick Read】: This paper addresses automated assessment of surgical complexity in laparoscopic cholecystectomy (LC), in particular objective grading of inflammation severity, where prior work relies on static images or manually trimmed clips and does not transfer to full, uncurated videos. The key to the solution is STC-Net, a framework for single-timestamp-based complexity estimation via the Parkland Grading Scale (PGS) under weak temporal supervision: operating directly on full videos, it jointly performs temporal localization, window proposal, and grading, with a novel loss combining hard and soft localization objectives and background-aware grading supervision. On a private dataset of 1,859 LC videos it achieves 62.11% accuracy and a 61.42% F1-score, beating non-localized baselines by over 10% on both metrics and demonstrating the effectiveness of weak supervision for automated surgical complexity assessment.

Link: https://arxiv.org/abs/2511.04525
Authors: Dimitrios Anastasiou,Santiago Barbarisi,Lucy Culshaw,Jayna Patel,Evangelos B. Mazomenos,Imanol Luengo,Danail Stoyanov
Institutions: University College London; Medtronic
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Purpose: Accurate assessment of surgical complexity is essential in Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with longer operative times and increased risk of postoperative complications. The Parkland Grading Scale (PGS) provides a clinically validated framework for stratifying inflammation severity; however, its automation in surgical videos remains largely unexplored, particularly in realistic scenarios where complete videos must be analyzed without prior manual curation. Methods: In this work, we introduce STC-Net, a novel framework for SingleTimestamp-based Complexity estimation in LC via the PGS, designed to operate under weak temporal supervision. Unlike prior methods limited to static images or manually trimmed clips, STC-Net operates directly on full videos. It jointly performs temporal localization and grading through a localization, window proposal, and grading module. We introduce a novel loss formulation combining hard and soft localization objectives and background-aware grading supervision. Results: Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by over 10% in both metrics and highlighting the effectiveness of weak supervision for surgical complexity assessment. Conclusion: STC-Net demonstrates a scalable and effective approach for automated PGS-based surgical complexity estimation from full LC videos, making it promising for post-operative analysis and surgical training.

[CV-16] THEval. Evaluation Framework for Talking Head Video Generation

【Quick Read】: This paper addresses the lag between evaluation practice and generation quality for talking-head video: current assessment relies on a handful of metrics covering general video quality and lip synchronization plus user studies, which cannot comprehensively measure quality, naturalness, and synchronization. The key to the solution is a new evaluation framework of 8 metrics spanning these three dimensions, selected for efficiency and alignment with human preferences; it analyzes the fine-grained dynamics of the head, mouth, and eyebrows together with face quality, enabling a more detailed quantitative assessment. Experiments on 85,000 videos from 17 state-of-the-art models, generated from a newly curated real dataset to mitigate training-data bias, show that many algorithms excel at lip synchronization but struggle with expressiveness and artifact-free detail; the framework is intended as a benchmark for tracking the improvement of generative methods.

Link: https://arxiv.org/abs/2511.04520
Authors: Nabyl Quignon,Baptiste Chopin,Yaohui Wang,Antitza Dantcheva
Institutions: Inria at University Côte d’Azur, France; Shanghai AI Laboratory, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics, evaluating general video quality and lip synchronization, and on conducting user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency as well as alignment with human preferences. Based on these considerations, we focus our analysis on the fine-grained dynamics of the head, mouth, and eyebrows, as well as on face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel at lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset that we have curated in order to mitigate bias of training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.

[CV-17] Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks

【速读】: This paper addresses the heavy computational cost and deployment difficulty of neural networks for image-related tasks, with the core goal of efficient compression via tensorization and low-rank representation. The key idea is to replace the traditional weight-space norm minimization (e.g., the Frobenius norm) with a data-informed, function-space error measure that minimizes the change in a layer's output distribution, namely $\lVert (W - \widetilde{W})\Sigma^{1/2} \rVert_F$, where $\Sigma^{1/2}$ is the square root of the covariance matrix of the layer's input features. This directly optimizes the functional discrepancy between the compressed and original models, and the authors propose new alternating least-squares algorithms for the two most common tensor decompositions (Tucker-2 and CPD), markedly improving compression efficiency and accuracy. Importantly, the approach usually achieves competitive performance without fine-tuning, and the covariance information it relies on can be transferred across datasets, adding practical flexibility.

链接: https://arxiv.org/abs/2511.04494
作者: Alper Kalle,Theo Rudkiewicz,Mohamed-Oumar Ouerfelli,Mohamed Tamaazousti
机构: CEA(法国原子能和替代能源委员会); Inria(法国国家信息与自动化研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural networks are widely used for image-related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory- and compute-footprint can be reduced by compression. In this work, we focus on compression through tensorization and low-rank representations. Whereas classical approaches search for a low-rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight-space, we use data-informed norms that measure the error in function space. Concretely, we minimize the change in the layer’s output distribution, which can be expressed as $\lVert (W - \widetilde{W}) \Sigma^{1/2} \rVert_F$, where $\Sigma^{1/2}$ is the square root of the covariance matrix of the layer’s input and $W$, $\widetilde{W}$ are the original and compressed weights. We propose new alternating least squares algorithms for the two most common tensor decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post-compression fine-tuning, our data-informed approach often achieves competitive accuracy without any fine-tuning. We further show that the same covariance-based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50, and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, Cifar10, and Cifar100) confirm the advantages of the proposed method.
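To make the data-informed objective concrete, here is a minimal sketch for a plain weight matrix (the paper's alternating least-squares algorithms for Tucker-2 and CPD operate on tensor factors and are more involved): with $S = \Sigma^{1/2}$, minimizing $\lVert (W - \widetilde{W}) S \rVert_F$ over rank-r matrices reduces to a truncated SVD in the whitened space. Function and parameter names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def data_informed_lowrank(W, Sigma, rank, eps=1e-8):
    """Best rank-r W_tilde under the norm ||(W - W_tilde) Sigma^{1/2}||_F."""
    vals, vecs = np.linalg.eigh(Sigma)            # Sigma: symmetric PSD input covariance
    vals = np.clip(vals, eps, None)               # guard against zero eigenvalues
    S_half = (vecs * np.sqrt(vals)) @ vecs.T      # Sigma^{1/2}
    S_inv_half = (vecs / np.sqrt(vals)) @ vecs.T  # Sigma^{-1/2}
    U, s, Vt = np.linalg.svd(W @ S_half, full_matrices=False)
    A_r = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # truncated SVD in whitened space
    return A_r @ S_inv_half                       # map back to weight space
```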
zh

[CV-18] Landslide Hazard Mapping with Geospatial Foundation Models: Geographical Generalizability Data Scarcity and Band Adaptability

【速读】: This paper addresses the limited generalization of conventional deep learning models for automated landslide mapping from remote sensing imagery, caused by sensor differences, regional variability, and scarce annotated data. The key to the solution is a three-axis analytical framework (sensor, label, domain) for adapting geospatial foundation models (GeoFMs), optimized in particular for the Prithvi-EO-2.0 model; by combining global pretraining, self-supervised learning, and adaptable fine-tuning, the framework markedly improves the model's robustness to spectral variation, its accuracy under label scarcity, and its generalization across diverse datasets and geographic settings.

链接: https://arxiv.org/abs/2511.04474
作者: Wenwen Li,Sizhe Wang,Hyunho Lee,Chenyan Lu,Sujit Roy,Rahul Ramachandran,Chia-Yu Hsu
机构: Arizona State University (亚利桑那州立大学); NASA Marshall Space Flight Center (美国国家航空航天局马歇尔太空飞行中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Landslides cause severe damage to lives, infrastructure, and the environment, making accurate and timely mapping essential for disaster preparedness and response. However, conventional deep learning models often struggle when applied across different sensors, regions, or under conditions of limited training data. To address these challenges, we present a three-axis analytical framework of sensor, label, and domain for adapting geospatial foundation models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a series of experiments, we show that it consistently outperforms task-specific CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other GeoFMs (TerraMind, SatMAE). The model, built on global pretraining, self-supervision, and adaptable fine-tuning, proved resilient to spectral variation, maintained accuracy under label scarcity, and generalized more reliably across diverse datasets and geographic settings. Alongside these strengths, we also highlight remaining challenges such as computational cost and the limited availability of reusable AI-ready training data for landslide research. Overall, our study positions GeoFMs as a step toward more robust and scalable approaches for landslide risk reduction and environmental monitoring.
zh

[CV-19] V-Thinker: Interactive Thinking with Images

【速读】: This paper tackles the long-standing challenge of deeply integrating image interaction with long-horizon reasoning in Large Multimodal Models (LMMs), shifting from image-assisted reasoning to image-centric interactive thinking ("Thinking with Images"). The key to the solution is V-Thinker, a general-purpose multimodal reasoning assistant trained with end-to-end reinforcement learning, built on two core components: a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets along three dimensions (diversity, quality, and difficulty), and a Visual Progressive Training Curriculum that first aligns perception via point-level supervision and then integrates interactive reasoning through a two-stage reinforcement learning framework. This markedly improves LMM performance on image-interactive reasoning tasks and offers a new paradigm for advancing such applications.

链接: https://arxiv.org/abs/2511.04460
作者: Runqi Qiao,Qiuna Tan,Minghan Yang,Guanting Dong,Peiqing Yang,Shiqiang Lang,Enhui Wan,Xiaowan Wang,Yida Xu,Lan Yang,Chong Sun,Chen Li,Honggang Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); WeChat Vision, Tencent Inc. (腾讯微信视觉团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Working in progress

点击查看摘要

Abstract:Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions-diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
zh

[CV-20] Solving Convex Partition Visual Jigsaw Puzzles

【速读】: This paper addresses the automated solving of Convex Partition polygonal jigsaw puzzles, which are more practically relevant than traditional square puzzles but computationally harder. The key to the solution is to exploit both geometrical compatibility and pictorial compatibility and to introduce a greedy solver, improving the accuracy and efficiency of puzzle reconstruction. The study also contributes the first benchmark dataset for this class of puzzles, providing a standardized testbed for evaluating future algorithms.

链接: https://arxiv.org/abs/2511.04450
作者: Yaniv Ohayon,Ofir Itzhak Shahar,Ohad Ben-Shahar
机构: Ben-Gurion University of the Negev (本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Jigsaw puzzle solving requires the rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole, often an image, and is known to be an intractable problem. While the possible impact of automatic puzzle solvers can be disruptive in various application domains, most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions, a major subset of polygonal puzzles whose pieces are convex. We utilize both geometrical and pictorial compatibilities, introduce a greedy solver, and report several performance measures next to the first benchmark dataset of such puzzles.
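As a rough illustration of how the two compatibility cues can be combined in a greedy solver, a pairwise edge score might mix geometric fit (mating edges of convex pieces should have equal length) with pictorial agreement of colors sampled along the two edges. The weighting and sampling below are hypothetical, not the paper's exact measure.

```python
import numpy as np

def edge_compatibility(len_a, len_b, colors_a, colors_b, w_geom=0.5):
    """Lower is better. colors_a, colors_b: (N, 3) RGB samples along each edge;
    one edge is traversed in reverse so that matching boundaries align."""
    geom = abs(len_a - len_b)                                 # geometric mismatch
    pict = float(np.mean(np.abs(colors_a - colors_b[::-1])))  # pictorial mismatch
    return w_geom * geom + (1.0 - w_geom) * pict
```

A greedy solver would then repeatedly merge the pair of free edges with the best remaining score.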
zh

[CV-21] HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats

【速读】: This paper addresses the difficulty of segmenting octopuses in videos of their natural habitats, where strong camouflage, rapid changes in skin color and texture, non-rigid deformations, and frequent occlusions are compounded by variable underwater lighting and turbidity. To cope with the lack of large-scale annotated datasets, the authors propose HideAndSeg, a minimally supervised AI tool whose key innovation is to combine SAM2 (Segment Anything Model 2) with a custom-trained YOLOv11 object detector: initial segmentation masks generated from user-provided point coordinates serve as training data for the YOLO model, after which bounding-box prompts fully automate the pipeline without manual intervention. Two unsupervised metrics, temporal consistency DICE_t and new component count NC_t, are introduced to quantitatively assess segmentation quality and guide mask refinement, so that accurate segmentation is maintained even without ground-truth annotations. Experiments show the method substantially reduces segmentation noise and can re-identify and segment an octopus after periods of complete occlusion, outperforming the manually prompted approach.

链接: https://arxiv.org/abs/2511.04426
作者: Alan de Aguiar,Michaella Pereira Andrade,Charles Morphy D. Santos,João Paulo Gois
机构: Universidade Federal do ABC (联邦大学-ABC分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Analyzing octopuses in their natural habitats is challenging due to their camouflage capability, rapid changes in skin texture and color, non-rigid body deformations, and frequent occlusions, all of which are compounded by variable underwater lighting and turbidity. Addressing the lack of large-scale annotated datasets, this paper introduces HideAndSeg, a novel, minimally supervised AI-based tool for segmenting videos of octopuses. It establishes a quantitative baseline for this task. HideAndSeg integrates SAM2 with a custom-trained YOLOv11 object detector. First, the user provides point coordinates to generate the initial segmentation masks with SAM2. These masks serve as training data for the YOLO model. After that, our approach fully automates the pipeline by providing a bounding box prompt to SAM2, eliminating the need for further manual intervention. We introduce two unsupervised metrics - temporal consistency DICE_t and new component count NC_t - to quantitatively evaluate segmentation quality and guide mask refinement in the absence of ground-truth data, i.e., real-world information that serves to train, validate, and test AI models. Results show that HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to the manually prompted approach. Our method can re-identify and segment the octopus even after periods of complete occlusion in natural environments, a scenario in which the manually prompted model fails. By reducing the need for manual analysis in real-world scenarios, this work provides a practical tool that paves the way for more efficient behavioral studies of wild cephalopods.
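The two unsupervised signals admit a compact implementation. The definitions below, Dice overlap between consecutive masks and a connected-component count per frame, are a plausible reading of DICE_t and NC_t, not the authors' exact code.

```python
import numpy as np
from scipy import ndimage

def dice(a, b, eps=1e-8):
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + eps)

def unsupervised_quality(masks):
    """masks: list of boolean (H, W) arrays for consecutive frames.
    Low DICE_t flags temporal flicker; jumps in NC_t flag fragmented masks."""
    dice_t = [dice(masks[i - 1], masks[i]) for i in range(1, len(masks))]
    nc_t = [ndimage.label(m)[0] for m in masks]  # connected components per frame
    return dice_t, nc_t
```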
zh

[CV-22] On the Equivalence of Regression and Classification

【速读】: This paper addresses the lack of a clear formal link between regression and classification; in particular, the norm term $\|w\|$ in traditional support vector regression (SVR) has only been interpreted as a regularizer, without a deeper geometric or optimization-based equivalence to classification. The key results: a regression problem with M samples lying on a hyperplane is shown to be equivalent to a linearly separable classification task with 2M samples; building on this equivalence, margin maximization yields a new regression formulation; and the equivalence further provides a "regressability" measure that estimates how difficult a dataset is to regress without first training a model, while also motivating neural networks that learn a linearizing map transforming the input variables into a space suited to linear regression.

链接: https://arxiv.org/abs/2511.04422
作者: Jayadeva,Naman Dwivedi,Hari Krishnan,N.M. Anoop Krishnan
机构: Indian Institute of Technology Delhi (印度理工学院德里分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages

点击查看摘要

Abstract:A formal link between regression and classification has been tenuous. Even though the margin maximization term $\|w\|$ is used in support vector regression, it has at best been justified as a regularizer. We show that a regression problem with M samples lying on a hyperplane has a one-to-one equivalence with a linearly separable classification task with 2M samples. We show that margin maximization on the equivalent classification task leads to a different regression formulation than traditionally used. Using the equivalence, we demonstrate a "regressability" measure that can be used to estimate the difficulty of regressing a dataset, without needing to first learn a model for it. We use the equivalence to train neural networks to learn a linearizing map that transforms input variables into a space where a linear regressor is adequate.
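One standard way to realize such an equivalence, shown purely as an illustrative sketch (the paper's exact construction may differ), lifts each of the M samples into an augmented (x, y) space and creates two oppositely labeled copies shifted by ±ε; a hyperplane separating the resulting 2M points encodes a regressor.

```python
import numpy as np

def regression_to_classification(X, y, eps=1.0):
    """X: (M, d) inputs, y: (M,) targets -> (2M, d+1) points with labels +/-1."""
    Z_pos = np.hstack([X, (y + eps)[:, None]])  # targets shifted up: class +1
    Z_neg = np.hstack([X, (y - eps)[:, None]])  # targets shifted down: class -1
    Z = np.vstack([Z_pos, Z_neg])
    labels = np.concatenate([np.ones(len(X)), -np.ones(len(X))])
    return Z, labels
```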
zh

[CV-23] DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale

【速读】: This paper addresses the lack of a unified framework for visual object modeling and representation learning across scales, which lowers research efficiency and hinders rapid experimentation and deployment across tasks. The key to the solution is DORAEMON, an open-source PyTorch library that unifies classification, retrieval, and metric learning under a single YAML-driven workflow, exposes more than 1000 pretrained backbones through a timm-compatible interface, and provides modular losses, augmentation strategies, and distributed-training utilities, with one-command export to ONNX or HuggingFace, enabling efficient transfer from research to real-world applications.

链接: https://arxiv.org/abs/2511.04394
作者: Ke Du,Yimin Peng,Chao Gao,Fan Zhou,Siqiao Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: code: this https URL

点击查看摘要

Abstract:DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across diverse scales. A single YAML-driven workflow covers classification, retrieval and metric learning; more than 1000 pretrained backbones are exposed through a timm-compatible interface, together with modular losses, augmentations and distributed-training utilities. Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M and Stanford Online Products, while one-command export to ONNX or HuggingFace bridges research and deployment. By consolidating datasets, models, and training techniques into one platform, DORAEMON offers a scalable foundation for rapid experimentation in visual recognition and representation learning, enabling efficient transfer of research advances to real-world applications. The repository is available at this https URL.
zh

[CV-24] BoRe-Depth: Self-supervised Monocular Depth Estimation with Boundary Refinement for Embedded Systems IROS2025

【速读】: This paper tackles two challenges of monocular depth estimation on embedded systems: the poor depth estimation performance of existing lightweight methods and severely blurred object boundaries. The key innovations of the proposed BoRe-Depth model, which has only 8.7M parameters, are: an Enhanced Feature Adaptive Fusion Module (EFAF) that adaptively fuses depth features to strengthen boundary detail representation; semantic knowledge injected into the encoder to improve object recognition and boundary perception; and efficient deployment, running at 50.7 FPS on NVIDIA Jetson Orin while clearly outperforming previous lightweight models on multiple challenging datasets.

链接: https://arxiv.org/abs/2511.04388
作者: Chang Liu,Juan Li,Sheng Zhang,Chang Liu,Jie Li,Xu Zhang
机构: Beijing Institute of Technology (北京理工大学); Yangtze Delta Region Academy of Beijing Institute of Technology Jiaxing (北京理工大学嘉兴长三角研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 5 figures, published to IROS 2025

点击查看摘要

Abstract:Depth estimation is one of the key technologies for realizing 3D perception in unmanned systems. Monocular depth estimation has been widely researched because of its low-cost advantage, but the existing methods face the challenges of poor depth estimation performance and blurred object boundaries on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate depth maps on embedded systems and significantly improves boundary quality. Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which adaptively fuses depth features to enhance boundary detail representation. Secondly, we integrate semantic knowledge into the encoder to improve the object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets, and we provide detailed ablation studies for the proposed methods. The code is available at this https URL.
zh

[CV-25] Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA

【速读】: This paper addresses the insufficient answer accuracy and interpretability of medical visual question answering (Medical VQA), in particular how to combine precise visual grounding, logical reasoning, and natural-language explanation generation in a multi-task setting. The key to the solution is a multi-task learning framework built on a LoRA-tuned Florence-2 model that integrates three carefully curated datasets: Kvasir-VQA-x1 for question-answer training, a synthetically enriched dataset of structured medical reasoning for explanation generation, and text-to-region pairs for visual grounding. Jointly optimizing these three tasks improves both answer accuracy and visual localization, enabling more reliable and interpretable medical image understanding.

链接: https://arxiv.org/abs/2511.04384
作者: Itbaan Safwan,Muhammad Annas Shaikh,Muhammad Haaris,Ramail Khan,Muhammad Atif Tahir
机构: Institute of Business Administration (IBA), Karachi, Pakistan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This is a working paper submitted for Medico 2025: Visual Question Answering (with multimodal explanations) for Gastrointestinal Imaging at MediaEval 2025. 5 pages, 3 figures and 1 table

点击查看摘要

Abstract:We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
zh

[CV-26] GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies

【速读】: This paper addresses two challenges faced by autonomous robots learning new skills from human demonstrations: existing Vision-Language Action (VLA) models lack high-level symbolic planning, limiting their performance on long-horizon tasks, while symbolic approaches such as Action Model Learning (AML) lack generalization and scalability. The key of the proposed neuro-symbolic framework, GraSP-VLA, is to use a Continuous Scene Graph representation to derive symbolic representations from observations, dynamically generate new planning domains during inference, and act as an orchestrator for low-level VLA policies, substantially increasing the length and complexity of action sequences that can be executed.

链接: https://arxiv.org/abs/2511.04357
作者: Maëlic Neau,Zoe Falomir,Paulo E. Santos,Anne-Gwenn Bosser,Cédric Buche
机构: Umeå University (于默奥大学); PrioriAnalytica; Bretagne INP - ENIB (布列塔尼INP - ENIB); IMT Atlantique (IMT大西洋); CNRS IRL 2010 CROSSING
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deploying autonomous robots that can learn new skills from demonstrations is an important challenge of modern robotics. Existing solutions often apply end-to-end imitation learning with Vision-Language Action (VLA) models or symbolic approaches with Action Model Learning (AML). On the one hand, current VLA models are limited by the lack of high-level symbolic planning, which hinders their abilities in long-horizon tasks. On the other hand, symbolic approaches in AML lack generalization and scalability perspectives. In this paper we present a new neuro-symbolic approach, GraSP-VLA, a framework that uses a Continuous Scene Graph representation to generate a symbolic representation of human demonstrations. This representation is used to generate new planning domains during inference and serves as an orchestrator for low-level VLA policies, scaling up the number of actions that can be reproduced in a row. Our results show that GraSP-VLA is effective for modeling symbolic representations on the task of automatic planning domain generation from observations. In addition, results on real-world experiments show the potential of our Continuous Scene Graph representation to orchestrate low-level VLA policies in long-horizon tasks.
zh

[CV-27] A MATLAB tutorial on deep feature extraction combined with chemometrics for analytical applications

【速读】: This paper addresses how to efficiently extract and analyze the spatial information in imaging data (from conventional color cameras, hyperspectral cameras, and microscopes) for exploratory and predictive studies in analytical chemistry, especially where traditional chemometric methods fall short. The key of the solution is a structured, step-by-step tutorial that guides users in extracting multiscale deep features from imaging data with existing open-source deep learning models, rather than training models from scratch; the tutorial emphasizes integrating these features with other data sources (such as spectral information) and comes with MATLAB code examples so that researchers can directly apply the workflow to their own datasets.

链接: https://arxiv.org/abs/2511.04349
作者: Puneet Mishra,Martijntje Vollebregt,Yizhou Ma,Maria Font-i-Furnols
机构: Wageningen University and Research (瓦赫宁根大学与研究); IRTA-Food Quality and Technology (食品质量与技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background In analytical chemistry, spatial information about materials is commonly captured through imaging techniques, such as traditional color cameras or with advanced hyperspectral cameras and microscopes. However, efficiently extracting and analyzing this spatial information for exploratory and predictive purposes remains a challenge, especially when using traditional chemometric methods. Recent advances in deep learning and artificial intelligence have significantly enhanced image processing capabilities, enabling the extraction of multiscale deep features that are otherwise challenging to capture with conventional image processing techniques. Despite the wide availability of open-source deep learning models, adoption in analytical chemistry remains limited because of the absence of structured, step-by-step guidance for implementing these models. Results This tutorial aims to bridge this gap by providing a step-by-step guide for applying deep learning approaches to extract spatial information from imaging data and integrating it with other data sources, such as spectral information. Importantly, the focus of this work is not on training deep learning models for image processing but on using existing open source models to extract deep features from imaging data. Significance The tutorial provides MATLAB code tutorial demonstrations, showcasing the processing of imaging data from various imaging modalities commonly encountered in analytical chemistry. Readers must run the tutorial steps on their own datasets using the codes presented in this tutorial.
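The tutorial itself ships MATLAB code; as a rough cross-language analogue, extracting deep features from an off-the-shelf pretrained encoder looks like the Python sketch below. The model choice and pooling are assumptions for illustration, not the tutorial's recipe.

```python
import torch
import torchvision

# Pretrained encoder with its classification head removed: the remaining
# global-average-pooled activations serve as deep features for chemometrics.
weights = torchvision.models.ResNet50_Weights.DEFAULT
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=weights).children())[:-1]
).eval()

preprocess = weights.transforms()
with torch.no_grad():
    image = torch.rand(3, 512, 512)  # stand-in for a real image tensor
    feats = backbone(preprocess(image).unsqueeze(0)).flatten(1)  # (1, 2048)
# feats can then be combined with spectral variables for PCA/PLS modeling.
```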
zh

[CV-28] Evaluating the Impact of Weather-Induced Sensor Occlusion on BEVFusion for 3D Object Detection

【速读】: This paper investigates how sensor occlusion affects BEV-based 3D object detection, in particular how to evaluate the robustness of multimodal fusion architectures when environmental conditions (such as fog, haze, or physical obstructions) degrade camera and LiDAR data quality. The key of the solution is to systematically simulate occlusions of varying severity on the nuScenes dataset and quantify the change in detection accuracy (measured by mAP and the nuScenes Detection Score, NDS) for single-sensor and fused settings, revealing how strongly BEVFusion depends on each sensor: under moderate camera occlusion, camera-only detection degrades sharply (mAP from 35.6% to 20.9%), whereas LiDAR degrades notably only under heavy occlusion (mAP from 64.7% to 34.1%); in the fused setting, occluding the camera causes only a minor loss (mAP from 68.5% to 65.7%), while occluding LiDAR causes a large drop (from 68.5% to 50.1%). This shows that current BEV fusion relies more heavily on LiDAR and highlights the need for occlusion-aware fusion strategies and evaluation frameworks.

链接: https://arxiv.org/abs/2511.04347
作者: Sanjay Kumar,Tim Brophy,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate 3D object detection is essential for automated vehicles to navigate safely in complex real-world environments. Bird’s Eye View (BEV) representations, which project multi-sensor data into a top-down spatial format, have emerged as a powerful approach for robust perception. Although BEV-based fusion architectures have demonstrated strong performance through multimodal integration, the effects of sensor occlusions, caused by environmental conditions such as fog, haze, or physical obstructions, on 3D detection accuracy remain underexplored. In this work, we investigate the impact of occlusions on both camera and Light Detection and Ranging (LiDAR) outputs using the BEVFusion architecture, evaluated on the nuScenes dataset. Detection performance is measured using mean Average Precision (mAP) and the nuScenes Detection Score (NDS). Our results show that moderate camera occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is based only on the camera. On the other hand, LiDAR sharply drops in performance only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%), with a severe impact on long-range detection. In fused settings, the effect depends on which sensor is occluded: occluding the camera leads to a minor 4.1% drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8% drop (to 50.1%), revealing the model’s stronger reliance on LiDAR for the task of 3D object detection. Our results highlight the need for future research into occlusion-aware evaluation methods and improved sensor fusion techniques that can maintain detection accuracy in the presence of partial sensor failure or degradation due to adverse environmental conditions.
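The relative drops quoted above follow directly from the absolute mAP values; the short script below reproduces the arithmetic (tiny deviations, e.g. 26.9% versus the quoted 26.8%, come from rounding of the published numbers).

```python
results = {
    "camera-only, moderate occlusion": (35.6, 20.9),
    "LiDAR-only, heavy occlusion": (64.7, 34.1),
    "fusion, camera occluded": (68.5, 65.7),
    "fusion, LiDAR occluded": (68.5, 50.1),
}
for setting, (before, after) in results.items():
    rel_drop = (before - after) / before * 100.0
    print(f"{setting}: {before:.1f} -> {after:.1f} mAP ({rel_drop:.1f}% relative drop)")
```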
zh

[CV-29] Comparative Study of CNN Architectures for Binary Classification of Horses and Motorcycles in the VOC 2008 Dataset

【速读】: This paper addresses the performance bias caused by class imbalance in image classification, specifically the poor detection of the minority class when binary-classifying horses and motorcycles in the VOC 2008 dataset. The key of the solution is minority-class augmentation, validated by comparing several modern convolutional architectures (ResNet-50, ConvNeXt-Tiny, DenseNet-121, and a Vision Transformer) across multiple performance metrics; the augmentation strategy clearly improves minority-class detection accuracy, with the effect most pronounced for deeper architectures.

链接: https://arxiv.org/abs/2511.04344
作者: Muhammad Annas Shaikh,Hamza Zaman,Arbaz Asif
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive evaluation of nine convolutional neural network architectures for binary classification of horses and motorcycles in the VOC 2008 dataset. We address the significant class imbalance problem by implementing minority-class augmentation techniques. Our experiments compare modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer across multiple performance metrics. Results demonstrate substantial performance variations, with ConvNeXt-Tiny achieving the highest Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle detection. We observe that data augmentation significantly improves minority class detection, particularly benefiting deeper architectures. This study provides insights into architecture selection for imbalanced binary classification tasks and quantifies the impact of data augmentation strategies in mitigating class imbalance issues in object detection.
zh

[CV-30] Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography

【速读】: This paper addresses the tension between accuracy and computational efficiency in automated tumour segmentation from medical images: on high-resolution 3D Computed Tomography (CT) volumes, traditional convolutional networks must process enormous numbers of voxels, making the computation too expensive for practical deployment. The key of the solution is a two-stage approach: the voxels are first sparsified, and submanifold sparse convolutional networks are then applied, which preserves high-resolution inputs and a native 3D architecture while substantially reducing GPU memory usage and inference time, reaching state-of-the-art segmentation accuracy.

链接: https://arxiv.org/abs/2511.04334
作者: Saúl Alonso-Monsalve,Leigh H. Whitehead,Adam Aurisano,Lorena Escudero Sanchez
机构: ETH Zürich (苏黎世联邦理工学院); University of Cambridge (剑桥大学); University of Cincinnati (辛辛那提大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:The accurate delineation of tumours in radiological images like Computed Tomography is a very specialised and time-consuming task, and currently a bottleneck preventing quantitative analyses to be performed routinely in the clinical setting. For this reason, developing methods for the automated segmentation of tumours in medical imaging is of the utmost importance and has driven significant efforts in recent years. However, the impracticality of full 3D scans, given the large number of voxels to be analysed, usually requires the downsampling of such images or the use of patches thereof when applying traditional convolutional neural networks. To overcome this problem, in this paper we propose a new methodology that uses, divided into two stages, voxel sparsification and submanifold sparse convolutional networks. This method allows segmentations to be performed with high-resolution inputs and a native 3D model architecture, obtaining state-of-the-art accuracies while significantly reducing the computational resources needed in terms of GPU memory and time. We studied the deployment of this methodology in the context of Computed Tomography images of renal cancer patients from the KiTS23 challenge, and our method achieved results competitive with the challenge winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone. Crucially, our method also offers significant computational improvements, achieving up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage compared to an equivalent dense architecture, across both CPU and various GPU cards tested.
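The first stage is conceptually simple: the dense CT volume is converted into coordinate/feature lists of active voxels, and submanifold sparse convolutions then operate only on those sites without dilating the sparsity pattern. A minimal sketch of the sparsification step, with a hypothetical intensity threshold:

```python
import numpy as np

def sparsify_volume(volume, threshold):
    """volume: (D, H, W) CT intensities -> (coords, feats) of active voxels.
    Only these sites are stored and convolved by the sparse network."""
    active = volume > threshold
    coords = np.argwhere(active)     # (N, 3) integer voxel indices
    feats = volume[active][:, None]  # (N, 1) per-voxel features
    return coords, feats
```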
zh

[CV-31] RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation

【速读】: This paper addresses the degradation of video quality when text-to-video (T2V) diffusion models are given concise prompts instead of carefully engineered ones, which is rooted in the limited semantic understanding of pre-trained text encoders and their inability to rephrase prompts online to better match user intent, limiting scalability and usability. The key of the proposed RISE-T2V framework is to merge prompt rephrasing and semantic feature extraction into a single seamless step rather than the traditional two separate steps: an innovative Rephrasing Adapter module lets the diffusion model condition on the hidden states produced by a large language model (LLM) during next-token prediction, implicitly rephrasing basic prompts into more comprehensive representations that better capture user intent and markedly improving the quality and consistency of generated videos.

链接: https://arxiv.org/abs/2511.04317
作者: Xiangjun Zhang,Litong Gong,Yinglin Zheng,Yansong Liu,Wentao Jiang,Mingyi Xu,Biao Wang,Tiezheng Ge,Ming Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 16 figures

点击查看摘要

Abstract:Most text-to-video (T2V) diffusion models depend on pre-trained text encoders for semantic alignment, yet they often fail to maintain video quality when provided with concise prompts rather than well-designed ones. The primary issue lies in their limited understanding of textual semantics. Moreover, these text encoders cannot rephrase prompts online to better align with user intentions, which limits both the scalability and usability of the models. To address these challenges, we introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single and seamless step instead of two separate steps. RISE-T2V is universal and can be applied to various pre-trained LLMs and video diffusion models (VDMs), significantly enhancing their capabilities for T2V tasks. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states during the next token prediction of the LLM as a condition for video generation. By employing a Rephrasing Adapter, the video generation model can implicitly rephrase basic prompts into more comprehensive representations that better match the user’s intent. Furthermore, we leverage the powerful capabilities of LLMs to enable video generation models to accomplish a broader range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a versatile framework applicable to different video diffusion model architectures, significantly enhancing the ability of T2V models to generate high-quality videos that align with user intent. Visual results are available on the webpage at this https URL.
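A minimal sketch of the adapter idea, mapping LLM hidden states (collected during next-token prediction) into the conditioning space of a video diffusion model. The two-layer design and dimensions are assumptions for illustration, not the paper's architecture.

```python
import torch.nn as nn

class RephrasingAdapter(nn.Module):
    """Projects (B, L, llm_dim) LLM hidden states to (B, L, cond_dim) tokens
    consumed by the diffusion model's cross-attention as the text condition."""
    def __init__(self, llm_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_hidden_states):
        return self.proj(llm_hidden_states)
```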
zh

[CV-32] Deep learning-based object detection of offshore platforms on Sentinel-1 Imagery and the impact of synthetic training data

【速读】: This paper addresses the performance degradation of offshore infrastructure detection models when training data are scarce and class distributions are imbalanced, which makes reliable detection difficult in unseen, low-sample regions. The key of the solution is to combine synthetic data with real Sentinel-1 satellite imagery to increase the diversity and balance of the training set, improving the cross-region generalization of a YOLOv10-based object detector. Experiments show the model's F1 score improves from 0.85 to 0.90 with synthetic data, and a total of 3,529 offshore platforms are detected across three geographic regions unseen during training (Gulf of Mexico, North Sea, Persian Gulf), demonstrating the feasibility and effectiveness of this strategy for global-scale monitoring of offshore facilities.

链接: https://arxiv.org/abs/2511.04304
作者: Robin Spanier,Thorsten Hoeser,Claudia Kuenzer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:The recent and ongoing expansion of marine infrastructure, including offshore wind farms, oil and gas platforms, artificial islands, and aquaculture facilities, highlights the need for effective monitoring systems. The development of robust models for offshore infrastructure detection relies on comprehensive, balanced datasets, but falls short when samples are scarce, particularly for underrepresented object classes, shapes, and sizes. By training deep learning-based YOLOv10 object detection models with a combination of synthetic and real Sentinel-1 satellite imagery acquired in the fourth quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of Guinea, and Coast of Brazil), this study investigates the use of synthetic training data to enhance model performance. We evaluated this approach by applying the model to detect offshore platforms in three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) and thereby assess geographic transferability. This region-holdout evaluation demonstrated that the model generalises beyond the training areas. In total, 3,529 offshore platforms were detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which improved to 0.90 upon incorporating synthetic data. We analysed how synthetic data enhances the representation of unbalanced classes and overall model performance, taking a first step toward globally transferable detection of offshore infrastructure. This study underscores the importance of balanced datasets and highlights synthetic data generation as an effective strategy to address common challenges in remote sensing, demonstrating the potential of deep learning for scalable, global offshore infrastructure monitoring.
zh

[CV-33] Vision Foundation Models in Agriculture: Toward Domain-Specific Adaptation for Weed Herbicide Trials Assessment

【速读】: This paper addresses the challenges of analyzing herbicide field-trial imagery in agriculture, namely accurately identifying plant species and assessing herbicide-induced damage in complex, variable environments, where general-purpose vision foundation models struggle with the fine-grained distinctions involved. The key of the solution is domain-specific fine-tuning: a general-purpose vision foundation model is adapted with self-supervised learning on a large, carefully curated agricultural dataset, yielding a domain-specific foundation model for herbicide-trial images. The model not only significantly improves species identification (F1 score from 0.91 to 0.94) and damage classification (from 0.26 to 0.33), but also generalizes better under unseen conditions (new locations, different times) and domain-shift scenarios (e.g., drone imagery), and sharply reduces manual annotation needs (5.4% higher F1 with 80% fewer labels under unseen conditions), providing an efficient, automated solution for herbicide field-trial analysis.

链接: https://arxiv.org/abs/2511.04288
作者: Leire Benito-Del-Valle,Artzai Picón,Daniel Mugica,Manuel Ramos,Eva Portillo,Javier Romero,Carlos Javier Jimenez,Ramón Navarra-Mestre
机构: TECNALIA, Basque Research and Technology Alliance (巴斯克研究与技术联盟); University of the Basque Country (巴斯克大学); BASF Digital Solutions S.L. (巴斯夫数字解决方案有限公司); BASF Espanola S.L. (巴斯夫西班牙有限公司); BASF SE (巴斯夫股份公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Herbicide field trials require accurate identification of plant species and assessment of herbicide-induced damage across diverse environments. While general-purpose vision foundation models have shown promising results in complex visual domains, their performance can be limited in agriculture, where fine-grained distinctions between species and damage types are critical. In this work, we adapt a general-purpose vision foundation model to herbicide trial characterization. Trained using a self-supervised learning approach on a large, curated agricultural dataset, the model learns rich and transferable representations optimized for herbicide trials images. Our domain-specific model significantly outperforms the best general-purpose foundation model in both species identification (F1 score improvement from 0.91 to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions (new locations and other time), it achieves even greater gains (species identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In domain-shift scenarios, such as drone imagery, it maintains strong performance (species classification from 0.49 to 0.60). Additionally, we show that domain-specific pretraining enhances segmentation accuracy, particularly in low-annotation regimes. An annotation-efficiency analysis reveals that, under unseen conditions, the domain-specific model achieves 5.4% higher F1 score than the general-purpose model, while using 80% fewer labeled samples. These results demonstrate the generalization capabilities of domain-specific foundation models and their potential to significantly reduce manual annotation efforts, offering a scalable and automated solution for herbicide trial analysis.
zh

[CV-34] FastGS: Training 3D Gaussian Splatting in 100 Seconds FAST

【速读】: This paper addresses the failure of current 3D Gaussian Splatting (3DGS) acceleration methods to effectively regulate the number of Gaussians during training, which leads to redundant computation and time overhead. The key of the solution is FastGS, a novel, simple, and general acceleration framework that evaluates the importance of each Gaussian based on multi-view consistency and designs a densification-and-pruning strategy without a budgeting mechanism, efficiently balancing training speed and rendering quality. Experiments show that FastGS achieves substantial training acceleration (up to 15.45×) across multiple datasets and tasks while maintaining rendering quality comparable to state-of-the-art methods.

链接: https://arxiv.org/abs/2511.04283
作者: Shiwei Ren,Tianci Wen,Yongchun Fang,Biao Lu
机构: NanKai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7× training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at this https URL
zh

[CV-35] DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

【速读】: This paper addresses the insufficient modeling of spatiotemporal consistency in video-based visible-infrared person re-identification (VVI-ReID) caused by neglecting gait: existing methods focus on extracting modality-invariant appearance features but overlook gait, which is not only modality-invariant but also rich in temporal dynamics and therefore crucial for cross-modal video matching. The key innovations of the proposed DINOv2-Driven Gait Representation Learning (DinoGRL) framework are: (1) a Semantic-Aware Silhouette and Gait Learning (SASGL) model that enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective, yielding semantically enriched and task-adaptive gait features; and (2) a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module that enables bidirectional interactions between the gait and appearance streams across multiple spatial granularities, fully exploiting their complementarity to enrich global representations with local details and produce highly discriminative features.

链接: https://arxiv.org/abs/2511.04281
作者: Yujie Yang,Shuang Li,Jun Ye,Neng Dong,Fan Li,Huafeng Li
机构: Kunming University of Science and Technology (昆明理工大学); Chongqing University of Post and Telecommunications (重庆邮电大学); China University of Mining Technology (中国矿业大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.
zh

[CV-36] Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

【速读】: This paper addresses source attribution and authenticity verification for AI-generated images and deepfake content, especially given the hard-to-perceive statistical traces (signal leaks) present in diffusion model outputs. The key of the solution is Proto-LeakNet, a signal-leak-aware and interpretable attribution framework operating in the latent domain of diffusion models: it re-simulates part of the forward diffusion process to expose residual generator-specific cues, aggregates multi-step latent features with a temporal attention encoder, and structures the embedding space with a feature-weighted prototype head to enable transparent attribution. Trained on closed-set data only, it effectively separates known from unseen generators, remains robust to post-processing, and reaches a Macro AUC of 98.13%, clearly surpassing existing state-of-the-art methods.

链接: https://arxiv.org/abs/2511.04260
作者: Claudio Giusti,Luca Guarnera,Sebastiano Battiato
机构: University of Catania (卡塔尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures, 5 tables

点击查看摘要

Abstract:The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Operating in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability between known and unseen generators. These results demonstrate that modeling signal-leak bias in latent space enables reliable and interpretable AI-image and deepfake forensics. The code for the whole work will be available upon submission.
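Re-simulating partial forward diffusion has a closed form under the standard DDPM parameterization; a minimal sketch (the choice of t and the latent encoder are outside this snippet):

```python
import torch

def partial_forward_diffusion(latent, alphas_cumprod, t):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, the DDPM forward
    marginal. Applied to a latent, this perturbation can expose residual
    generator-specific statistics ('signal leaks')."""
    abar_t = alphas_cumprod[t]
    noise = torch.randn_like(latent)
    return abar_t.sqrt() * latent + (1.0 - abar_t).sqrt() * noise
```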
zh

[CV-37] MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection

【速读】: This paper addresses a performance bottleneck in anatomical landmark detection for medical imaging: traditional approaches rely on domain-specific models and overlook the strong spatial-localization priors of human-centric foundation models. The key of the solution is a multi-dataset pretraining strategy that adapts Sapiens, a model originally built for human pose estimation, to medical imaging, yielding MedSapiens, which establishes a new state of the art on multiple benchmark datasets and still clearly outperforms prior state-of-the-art methods in few-shot settings.

链接: https://arxiv.org/abs/2511.04255
作者: Marawan Elbatel,Anbang Wang,Keyuan Liu,Kaouther Mouheb,Enrique Almar-Munoz,Lizhuo Lin,Yanqi Yang,Karim Lekadir,Xiaomeng Li
机构: The Hong Kong University of Science and Technology (香港科技大学); The University of Hong Kong (香港大学); Erasmus MC (埃拉斯姆斯医学中心); Medical University of Innsbruck (因斯布鲁克医科大学); University of Barcelona (巴塞罗那大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess MedSapiens’ adaptability to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at this https URL.
zh

[CV-38] AStF: Motion Style Transfer via Adaptive Statistics Fusor

【速读】: This paper addresses the inadequacy of modeling only means and variances in human motion style transfer, which fails to capture the dynamic patterns and spatiotemporal coherence of motion data even though such statistics work well for image style transfer. The key of the solution is the proposed Adaptive Statistics Fusor (AStF), whose core consists of a Style Disentanglement Module (SDM) and a High-Order Multi-Statistics Attention (HOS-Attn) mechanism, trained together with a Motion Consistency Regularization (MCR) discriminator. By introducing higher-order statistics such as skewness and kurtosis, the method substantially improves the modeling of the spatiotemporal statistical properties of dynamic styles, achieving more realistic and coherent motion style transfer.

链接: https://arxiv.org/abs/2511.04192
作者: Hanmo Chen,Chenghao Xu,Jiexi Yan,Cheng Deng
机构: Hangzhou Institute of Technology, Xidian University (西安电子科技大学杭州研究院); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human motion style transfer allows characters to appear less rigid and more realistic with a specific style. Traditional arbitrary image style transfer typically processes mean and variance, which has proved effective. Meanwhile, similar methods have been adapted for motion style transfer. However, due to the fundamental differences between images and motion, relying on mean and variance is insufficient to fully capture the complex dynamic patterns and spatiotemporal coherence properties of motion data. Building upon this, our key insight is to bring two more coefficients, skewness and kurtosis, into the analysis of motion style. Specifically, we propose a novel Adaptive Statistics Fusor (AStF) which consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in conjunction with a Motion Consistency Regularization (MCR) discriminator. Experimental results show that, by providing a more comprehensive model of the spatiotemporal statistical patterns inherent in dynamic styles, our proposed AStF shows superior proficiency in motion style transfer over the state of the art. Our code and model are available at this https URL.
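The two extra coefficients are ordinary higher-order moments. Below is a sketch of per-channel style statistics over a motion clip; the paper's AStF fuses these adaptively, which is not reproduced here.

```python
import numpy as np

def style_statistics(feats, eps=1e-6):
    """feats: (T, C) motion features. Returns per-channel mean, variance,
    skewness, and excess kurtosis over the temporal axis."""
    mu = feats.mean(axis=0)
    var = feats.var(axis=0)
    z = (feats - mu) / np.sqrt(var + eps)  # standardized features
    skew = (z ** 3).mean(axis=0)
    kurt = (z ** 4).mean(axis=0) - 3.0
    return mu, var, skew, kurt
```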
zh

[CV-39] Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification

【速读】: This paper addresses the limited representational power of handcrafted features in medical image classification. The key of the solution is to construct covariance descriptors from features extracted by pre-trained general vision encoders (GVEs) and compare them with handcrafted descriptors, while adopting SPDNet, a classification network designed specifically for symmetric positive definite (SPD) matrices, to improve classification performance. Experiments show that covariance descriptors built from GVE features (DINOv2 and MedSAM) clearly outperform handcrafted ones, and that SPDNet combined with DINOv2 features surpasses state-of-the-art methods on multiple medical imaging datasets, validating the effectiveness of this strategy for medical image analysis.

链接: https://arxiv.org/abs/2511.04190
作者: Josef Mayr,Anna Reithmeir,Maxime Di Folco,Julia A. Schnabel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Submitted to the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Covariance descriptors capture second-order statistics of image features. They have shown strong performance in general computer vision tasks, but remain underexplored in medical imaging. We investigate their effectiveness for both conventional and learning-based medical image classification, with a particular focus on SPDNet, a classification network specifically designed for symmetric positive definite (SPD) matrices. We propose constructing covariance descriptors from features extracted by pre-trained general vision encoders (GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and MedSAM - are evaluated across eleven binary and multi-class datasets from the MedMNIST benchmark. Our results show that covariance descriptors derived from GVE features consistently outperform those derived from handcrafted features. Moreover, SPDNet yields superior performance to state-of-the-art methods when combined with DINOv2 features. Our findings highlight the potential of combining covariance descriptors with powerful pretrained vision encoders for medical image analysis.
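Building a covariance descriptor from encoder features takes only a few lines; the ε-regularization keeps the matrix strictly positive definite so it lies on the SPD manifold expected by SPD networks such as SPDNet. An illustrative sketch:

```python
import numpy as np

def covariance_descriptor(features, eps=1e-5):
    """features: (N, d) patch/token features from a vision encoder.
    Returns a (d, d) SPD matrix usable as input to SPD networks."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(features) - 1, 1)
    return cov + eps * np.eye(cov.shape[0])
```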
zh

[CV-40] Systematic Evaluation of Preprocessing Techniques for Accurate Image Registration in Digital Pathology

【速读】: This paper addresses the insufficient accuracy of image registration between different imaging modalities in digital pathology, such as hematoxylin-eosin (HE) stained images and non-linear multimodal images, where differences in color distribution hinder structural alignment and degrade downstream applications such as biomarker analysis and tissue reconstruction. The key of the solution is to introduce color transformation in the preprocessing stage; in particular, CycleGAN-based color transfer markedly reduces registration error (measured by the relative Target Registration Error, rTRE). Among all tested methods, CycleGAN attains the lowest MMrTRE and AMrTRE values in both scenarios and also performs best in a manual keypoint-based evaluation, demonstrating that color transformation as a preprocessing step effectively improves cross-modal spatial alignment and supports reliable multimodal analysis in digital pathology.

链接: https://arxiv.org/abs/2511.04171
作者: Fatemehzahra Darzi,Rodrigo Escobar Diaz Guerrero,Thomas Bocklitz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 Figures

点击查看摘要

Abstract:Image registration refers to the process of spatially aligning two or more images by mapping them into a common coordinate system, so that corresponding anatomical or tissue structures are matched across images. In digital pathology, registration enables direct comparison and integration of information from different stains or imaging modalities, supporting applications such as biomarker analysis and tissue reconstruction. Accurate registration of images from different modalities is an essential step in digital pathology. In this study, we investigated how various color transformation techniques affect image registration between hematoxylin and eosin (HE) stained images and non-linear multimodal images. We used a dataset of 20 tissue sample pairs, with each pair undergoing several preprocessing steps, including different color transformations (CycleGAN, Macenko, Reinhard, Vahadane), inversion, contrast adjustment, intensity normalization, and denoising. All images were registered using the VALIS registration method, which first applies rigid registration and then performs non-rigid registration in two steps on both low and high-resolution images. Registration performance was evaluated using the relative Target Registration Error (rTRE). We reported the median of median rTRE values (MMrTRE) and the average of median rTRE values (AMrTRE) for each method. In addition, we performed a custom point-based evaluation using ten manually selected key points. Registration was done separately for two scenarios, using either the original or inverted multimodal images. In both scenarios, CycleGAN color transformation achieved the lowest registration errors, while the other methods showed higher errors. These findings show that applying color transformation before registration improves alignment between images from different modalities and supports more reliable analysis in digital pathology.
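The reported metrics reduce to simple landmark arithmetic. The sketch below follows the usual convention that rTRE normalizes landmark error by the image diagonal, with MMrTRE and AMrTRE aggregating per-pair medians; the paper's exact definitions may differ slightly.

```python
import numpy as np

def rtre(registered_pts, target_pts, image_shape):
    """Relative TRE: Euclidean landmark error divided by the image diagonal."""
    diag = float(np.hypot(image_shape[0], image_shape[1]))
    return np.linalg.norm(registered_pts - target_pts, axis=1) / diag

def aggregate(per_pair_rtre):
    """per_pair_rtre: list of rTRE arrays, one per tissue-sample pair."""
    medians = np.array([np.median(r) for r in per_pair_rtre])
    return np.median(medians), np.mean(medians)  # (MMrTRE, AMrTRE)
```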
zh

[CV-41] Learning from Online Videos at Inference Time for Computer-Use Agents

【速读】: This paper addresses the significant gap between computer-use agents and human users on tasks that require domain-specific procedural knowledge, particularly for specific applications, platforms, and multi-step workflows. The key of the solution is an inference-time learning framework based on video tutorials: a vision-language model (VLM) automatically converts online videos into structured demonstration trajectories by inferring UI actions, segmenting them into action subsequences, and assigning each subsequence a textual objective; at inference time, a two-stage dynamic selection mechanism injects the most relevant local trajectory into the agent's context on demand to guide its decisions toward the current subgoal. Experiments on two widely used benchmarks show this clearly outperforms baselines that rely only on textual tutorials or transcripts, and analyses indicate that trajectory segmentation, action filtering, and effective use of visual information are the core factors behind the improvement.

链接: https://arxiv.org/abs/2511.04137
作者: Yujian Liu,Ze Wang,Hao Chen,Ximeng Sun,Xiaodong Yu,Jialian Wu,Jiang Liu,Emad Barsoum,Zicheng Liu,Shiyu Chang
机构: UC Santa Barbara; Advanced Micro Devices, Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time. Our code is available at this https URL.
zh

[CV-42] DMSORT: An efficient parallel maritime multi-object tracking architecture for unmanned vessel platforms DATE

【速读】: This paper addresses the visual degradation caused by camera motion in complex maritime environments, aiming to improve the accuracy and robustness of multi-object tracking (MOT) for safe vessel navigation and effective maritime surveillance. The key of the proposed Dual-branch Maritime SORT (DMSORT) is threefold: a parallel structure that separates the object detection and re-identification (ReID) branch from a dedicated branch for dynamic camera-motion estimation; a Reversible Columnar Detection Network (RCDN) that strengthens multi-scale feature utilization for robust detection, together with a lightweight Transformer-based appearance extractor (Li-TAE) that captures global context; and a projective transformation that decouples platform motion from the intrinsic motion of targets, applying platform-motion compensation within the Kalman filter to stabilize true target trajectories. Finally, a clustering-optimized feature fusion module effectively combines motion and appearance cues, maintaining high identity consistency under noise, occlusion, and drift.

链接: https://arxiv.org/abs/2511.04128
作者: Shengyu Tang,Zeyuan Lu,Jiazhi Dong,Changdong Yu,Xiaoyu Wang,Yaohui Lyu,Weihao Xia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Updated version of the Ocean Engineering (Elsevier, 2025) paper with minor corrections

点击查看摘要

Abstract:Accurate perception of the marine environment through robust multi-object tracking (MOT) is essential for ensuring safe vessel navigation and effective maritime surveillance. However, the complicated maritime environment often causes camera motion and subsequent visual degradation, posing significant challenges to MOT. To address this challenge, we propose an efficient Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the framework is a parallel tracker with affine compensation, which incorporates an object detection and re-identification (ReID) branch, along with a dedicated branch for dynamic camera motion estimation. Specifically, a Reversible Columnar Detection Network (RCDN) is integrated into the detection module to leverage multi-level visual features for robust object detection. Furthermore, a lightweight Transformer-based appearance extractor (Li-TAE) is designed to capture global contextual information and generate robust appearance features. Another branch decouples platform-induced and target-intrinsic motion by constructing a projective transformation, applying platform-motion compensation within the Kalman filter, and thereby stabilizing true object trajectories. Finally, a clustering-optimized feature fusion module effectively combines motion and appearance cues to ensure identity consistency under noise, occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset demonstrate that DMSORT achieves state-of-the-art performance. Notably, DMSORT attains the fastest runtime among existing ReID-based MOT frameworks while maintaining high identity consistency and robustness to jitter and occlusion. Code is available at: this https URL.
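A common way to realize the camera-motion branch, sketched with OpenCV: estimate an inter-frame projective transform from tracked background features and warp the previous tracks before the Kalman update. The paper's exact estimator and compensation details are not reproduced here.

```python
import cv2
import numpy as np

def compensate_platform_motion(prev_gray, cur_gray, track_centers):
    """track_centers: (N, 2) predicted box centers in the previous frame.
    Returns the centers warped into the current frame's coordinates."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=8)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    ok = status.ravel() == 1
    H, _ = cv2.findHomography(pts[ok], nxt[ok], cv2.RANSAC, 3.0)
    centers = track_centers.reshape(-1, 1, 2).astype(np.float32)
    return cv2.perspectiveTransform(centers, H).reshape(-1, 2)
```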
zh

[CV-43] Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)

【速读】: This paper addresses the lack of automated, high-precision analysis tools for tennis matches, with the goal of quantifying key metrics such as player movement trajectories, ball speed, shot accuracy, and reaction times in real time. The key to the solution is an end-to-end deep learning pipeline: YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection to provide spatial reference, producing annotated videos and detailed performance metrics that give coaches, broadcasters, and players actionable insight into match dynamics.

链接: https://arxiv.org/abs/2511.04126
作者: Venkata Manikanta Desu,Syed Fawaz Ali
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 11 figures, planning to submit to a conference

点击查看摘要

Abstract:This study presents a complete pipeline for automated tennis match analysis. Our framework integrates multiple deep learning models to detect and track players and the tennis ball in real time, while also identifying court keypoints for spatial reference. Using YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection, our system provides detailed analytics including player movement patterns, ball speed, shot accuracy, and player reaction times. The experimental results demonstrate robust performance in varying court conditions and match scenarios. The model outputs an annotated video along with detailed performance metrics, enabling coaches, broadcasters, and players to gain actionable insights into the dynamics of the game.
zh

[CV-44] Text to Sketch Generation with Multi-Styles NEURIPS2025

【速读】: This paper addresses the limited style control of existing sketch generation methods, which mostly perform generic synthesis without precise control over sketch style. The key to the solution is a training-free diffusion-based framework (M3S) that achieves explicit style guidance via textual prompts and reference style sketches. It innovatively incorporates reference features as auxiliary information, combining linear smoothing with a style-content guidance mechanism to effectively reduce content leakage from reference sketches, improving synthesis quality especially when the structural similarity between reference and target sketches is low. In addition, features from multiple reference sketches are integrated through a joint AdaIN module, enabling controllable multi-style generation.

链接: https://arxiv.org/abs/2511.04123
作者: Tengjie Li,Shikui Tu,Lei Xu
机构: Shanghai Jiao Tong University (上海交通大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at this https URL.
zh

[CV-45] Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration NEURIPS2025

【速读】: This paper tackles the low computational efficiency of diffusion model sampling, specifically how to accelerate generation without additional training while preserving fidelity. The key is the training-free Tortoise and Hare Guidance (THG) strategy: by reformulating the classifier-free guidance (CFG) ODE as a multirate system, the authors find that the noise estimate and the additional guidance term have markedly different sensitivity to numerical error, with the guidance term being more robust and carrying substantial redundancy. Building on this insight, THG integrates the two branches on different timestep grids: the noise estimate on the original fine-grained grid (the tortoise equation) and the guidance term only on a coarse grid (the hare equation), sharply reducing the number of function evaluations (NFE) while keeping image quality virtually unchanged (ΔImageReward ≤ 0.032). An error-bound-aware timestep sampler and a guidance-scale scheduler further improve stability and efficiency.

链接: https://arxiv.org/abs/2511.04117
作者: Yunghee Lee,Byeonghyun Pak,Junwha Hong,Hoseong Kim
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 8 figures. NeurIPS 2025. Project page: this https URL

点击查看摘要

Abstract:In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ( \Delta ImageReward \leq 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at this https URL.
zh
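For readers who want a concrete picture of the multirate idea, here is a minimal sketch (our illustration, not the authors' released code): the unconditional noise estimate is evaluated on every fine step, while the guidance difference is refreshed only on a coarse grid; the dummy denoiser, the toy Euler update, and the step counts are all placeholder assumptions.

```python
import torch

@torch.no_grad()
def multirate_cfg_sample(eps_model, x, cond, timesteps, scale=7.5, coarse_every=4):
    """Schematic multirate CFG loop: the conditional branch (and hence the
    guidance term) is re-evaluated only on a coarse timestep grid."""
    guidance = torch.zeros_like(x)
    for i, t in enumerate(timesteps):
        eps_uncond = eps_model(x, t, None)    # fine grid, every step ("tortoise")
        if i % coarse_every == 0:             # coarse grid only ("hare")
            guidance = eps_model(x, t, cond) - eps_uncond
        eps = eps_uncond + scale * guidance
        x = x - (1.0 / len(timesteps)) * eps  # toy Euler update in place of a real solver
    return x

# Toy usage with a dummy denoiser that ignores the condition:
dummy = lambda x, t, c: 0.1 * x
sample = multirate_cfg_sample(dummy, torch.randn(1, 4, 8, 8), "prompt", range(20))
```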

[CV-46] SpatialLock: Precise Spatial Control in Text-to-Image Synthesis

【速读】: This paper addresses the insufficient precision of object localization control in text-to-image (T2I) generation: existing methods underuse positional information, leading to an inadequate understanding of object spatial layouts. The key is the proposed SpatialLock framework, whose two components work in concert to achieve precise spatial control: Position-Engaged Injection (PoI) integrates spatial information directly through an attention mechanism so the model learns grounding information effectively, while Position-Guided Learning (PoG) further refines object localization via perception-based supervision. Together they markedly improve the spatial arrangement accuracy and visual quality of generated images, achieving IOU scores above 0.9 across multiple datasets and setting a new state of the art.

链接: https://arxiv.org/abs/2511.04112
作者: Biao Liu,Yuanzhi Liang
机构: The Sugon Group(曙光集团); TeleAI, China Telecom(中国电信TeleAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress

点击查看摘要

Abstract:Text-to-Image (T2I) synthesis has made significant advancements in recent years, driving applications such as generating datasets automatically. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IOU scores above 0.9 across multiple datasets.
zh

[CV-47] When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation

【速读】: This paper targets the performance bottleneck in medical image segmentation caused by complex anatomical structures and scarce annotated data. Conventional CNNs excel at local feature extraction but struggle to model long-range dependencies, while standard Vision Transformers capture global context at the cost of heavy data requirements and computation. The key is the proposed UKAST architecture: a U-Net-like design that embeds rational-function-based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders, using rational base functions and Group Rational KANs (GR-KANs) to improve expressiveness and data efficiency with fewer FLOPs and only a small parameter increase. UKAST achieves state-of-the-art results on several 2D and 3D medical image segmentation benchmarks and is particularly strong in data-scarce settings, alleviating the dependence of Vision Transformers on large-scale annotated data.

链接: https://arxiv.org/abs/2511.04084
作者: Nishchal Sapkota,Haoyan Shi,Yejia Zhang,Xianshi Ma,Bofang Zheng,Danny Z. Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net like architecture that integrates rational-function based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: this https URL
zh

[CV-48] Adversarial and Score-Based CT Denoising: CycleGAN vs Noise2Score

【速读】: This paper studies CT image denoising in the unpaired and self-supervised regimes, i.e., improving denoising without clean-noisy image pairs. The key is a comparison of two training-data-efficient approaches: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Experiments show that a CycleGAN with a standard U-Net backbone and tuned hyperparameters (λ_cycle = 30, λ_iden = 2, ngf = ndf = 64) delivers the best image quality (PSNR improved from 34.66 dB to 38.913 dB, SSIM from 0.9234 to 0.971), while N2S is more robust on extremely noisy inputs, confirming its value as a reliable alternative that requires no paired data.

链接: https://arxiv.org/abs/2511.04083
作者: Abu Hanif Muhammad Syarubany
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study CT image denoising in the unpaired and self-supervised regimes by evaluating two strong, training-data-efficient paradigms: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Under a common evaluation protocol, a configuration sweep identifies a simple standard U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf = 64) as the most reliable setting; we then train it to convergence with a longer schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs, highlighting its utility when clean pairs are unavailable. Overall, CycleGAN offers the strongest final image quality, whereas Noise2Score provides a robust pair-free alternative with competitive performance. Source code is available at this https URL.
zh
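For reference, a schematic of the CycleGAN generator objective with the sweep's best weights (lambda_cycle = 30, lambda_iden = 2). The LSGAN-style adversarial term and L1 cycle/identity terms are standard CycleGAN choices that we assume here; the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, F_net, D_x, D_y, real_x, real_y,
                   lambda_cycle=30.0, lambda_iden=2.0):
    """Schematic CycleGAN generator objective (G: X->Y, F_net: Y->X;
    D_x, D_y are patch discriminators)."""
    fake_y, fake_x = G(real_x), F_net(real_y)
    # LSGAN adversarial terms: fool both discriminators.
    adv = F.mse_loss(D_y(fake_y), torch.ones_like(D_y(fake_y))) \
        + F.mse_loss(D_x(fake_x), torch.ones_like(D_x(fake_x)))
    # Cycle consistency: X -> Y -> X and Y -> X -> Y should reconstruct.
    cyc = F.l1_loss(F_net(fake_y), real_x) + F.l1_loss(G(fake_x), real_y)
    # Identity: generators should leave same-domain inputs unchanged.
    idn = F.l1_loss(G(real_y), real_y) + F.l1_loss(F_net(real_x), real_x)
    return adv + lambda_cycle * cyc + lambda_iden * idn
```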

[CV-49] Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment

【速读】: This paper tackles the difficulty of decoding visual semantics from neural signals (EEG, MEG, fMRI), where subject variability and the entangled nature of visual features are core challenges; existing methods that align neural activity directly with purely visual embeddings capture latent semantic dimensions poorly and suffer from weak interpretability and robustness. The key is Bratrix, the first end-to-end framework for language-anchored vision-brain alignment: it decomposes visual stimuli into hierarchical visual and linguistic semantic components and projects visual and brain representations into a shared latent space, yielding aligned vision-language and brain-language embeddings. An uncertainty perception module improves robustness to noisy neural signals, learnable language-anchored semantic matrices strengthen cross-modal correlations, and a two-stage training strategy (unimodal pretraining followed by multimodal fine-tuning) further improves alignment precision.

链接: https://arxiv.org/abs/2511.04078
作者: Zehui Feng,Chenqi Zhang,Mingru Wang,Minuo Wei,Shiwei Cheng,Cuntai Guan,Ting Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 16 figures, under review as a conference paper

点击查看摘要

Abstract:Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI remains a fundamental challenge due to subject variability and the entangled nature of visual features. Existing approaches primarily align neural activity directly with visual embeddings, but visual-only representations often fail to capture latent semantic dimensions, limiting interpretability and deep robustness. To address these limitations, we propose Bratrix, the first end-to-end framework to achieve multimodal Language-Anchored Vision-Brain alignment. Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, and projects both visual and brain representations into a shared latent space, enabling the formation of aligned visual-language and brain-language embeddings. To emulate human-like perceptual reliability and handle noisy neural signals, Bratrix incorporates a novel uncertainty perception module that applies uncertainty-aware weighting during alignment. By leveraging learnable language-anchored semantic matrices to enhance cross-modal correlations and employing a two-stage training strategy of single-modality pretraining followed by multimodal fine-tuning, Bratrix-M improves alignment precision. Extensive experiments on EEG, MEG, and fMRI benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and captioning performance compared to state-of-the-art methods, specifically surpassing prior methods by 14.3% on the 200-way EEG retrieval task. Code and model are available.
zh

[CV-50] A Hybrid Deep Learning Model for Robust Biometric Authentication from Low-Frame-Rate PPG Signals

【速读】: This paper addresses the degradation of photoplethysmography (PPG) signal quality caused by motion artifacts, illumination changes, and inter-subject physiological variability in PPG-based biometric authentication, aiming for robust identity recognition. The key is a lightweight and efficient hybrid deep learning model, CVT-ConvMixer-LSTM: the Continuous Wavelet Transform (CWT) converts 1D PPG signals into 2D time-frequency scalograms that capture transient cardiovascular dynamics; Convolutional Vision Transformer (CVT) and ConvMixer branches then extract spatial features, while a Long Short-Term Memory (LSTM) network models temporal dependencies. The system achieves 98% authentication accuracy on the CFIHSR dataset with strong robustness to noise and cross-subject variation.

链接: https://arxiv.org/abs/2511.04037
作者: Arfina Rahman,Mahesh Banavar
机构: Clarkson University (克拉克森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: This work has been submitted to IEEE Transactions on Biometrics, Behavior, and Identity Science (TBIOM) for possible publication

点击查看摘要

Abstract:Photoplethysmography (PPG) signals, which measure changes in blood volume in the skin using light, have recently gained attention in biometric authentication because of their non-invasive acquisition, inherent liveness detection, and suitability for low-cost wearable devices. However, PPG signal quality is challenged by motion artifacts, illumination changes, and inter-subject physiological variability, making robust feature extraction and classification crucial. This study proposes a lightweight and cost-effective biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The raw PPG signals undergo a standard preprocessing pipeline involving baseline drift removal, motion artifact suppression using Principal Component Analysis (PCA), bandpass filtering, Fourier-based resampling, and amplitude normalization. To generate robust representations, each one-dimensional PPG segment is converted into a two-dimensional time-frequency scalogram via the Continuous Wavelet Transform (CWT), effectively capturing transient cardiovascular dynamics. We developed a hybrid deep learning model, termed CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision Transformer (CVT) and ConvMixer branches with temporal features from a Long Short-Term Memory network (LSTM). The experimental results on 46 subjects demonstrate an authentication accuracy of 98%, validating the robustness of the model to noise and variability between subjects. Due to its efficiency, scalability, and inherent liveness detection capability, the proposed system is well-suited for real-world mobile and embedded biometric security applications.
zh
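A minimal sketch of the scalogram construction described above. The band-pass cutoffs, segment length, and scale range are our illustrative assumptions, and the paper's PCA-based motion-artifact suppression is omitted.

```python
import numpy as np
import pywt
from scipy.signal import butter, filtfilt, resample

fs = 14.0                                  # sampling rate reported for CFIHSR
ppg = np.random.randn(int(fs * 30))        # stand-in for a 30 s fingertip PPG segment

# Band-pass around typical heart-rate frequencies (cutoffs assumed).
b, a = butter(2, [0.5, 5.0], btype="band", fs=fs)
filtered = filtfilt(b, a, ppg)

# Fourier-based resampling, amplitude normalization, then a Morlet CWT scalogram.
x = resample(filtered, 512)
x = (x - x.mean()) / (x.std() + 1e-8)
coeffs, _ = pywt.cwt(x, scales=np.arange(1, 65), wavelet="morl")
scalogram = np.abs(coeffs)                 # 2D time-frequency input for the CNN branches
print(scalogram.shape)                     # (64, 512)
```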

[CV-51] Near-Lossless 3D Voxel Representation Free from Iso-surface

【速读】: This paper addresses the limitations of existing voxelized mesh representations: isosurface-based methods typically rely on water-tightening or rendering optimization and thus inevitably lose geometric detail. The key is Faithful Contouring, a sparse voxelized representation that supports resolutions of 2048 and beyond for arbitrary meshes without converting them to field functions or extracting isosurfaces during remeshing, achieving near-lossless geometric fidelity by preserving sharp edges and internal structures even for cases with complex geometry and topology. The representation is also flexible enough to support texturing, manipulation, and editing.

链接: https://arxiv.org/abs/2511.04029
作者: Yihao Luo,Xianglong He,Chuanyu Pan,Yiwen Chen,Jiaqi Wu,Yangguang Li,Wanli Ouyang,Yuanming Hu,Guang Yang,ChoonHwai Yap
机构: Imperial College London (帝国理工学院); Tsinghua University (清华大学); Meshy; Nanyang Technological University (南洋理工大学); Mathmagic; The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the 10^-5 level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
zh

[CV-52] MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging

【速读】: This paper addresses the domain gap that limits vision models in medical imaging when their backbones are pretrained on out-of-domain natural images. The key contributions are twofold: first, MedDChest, a foundational Vision Transformer optimized for thoracic imaging, is pretrained from scratch on a large, curated multimodal dataset of over 1.2 million medical images (chest X-ray and CT); second, Guided Random Resized Crops, a novel content-aware data augmentation strategy, biases sampling toward anatomically relevant regions, overcoming the inefficiency of standard cropping on medical scans and substantially improving the model's ability to capture medically relevant features. Experiments show the approach outperforms strong ImageNet-pretrained baselines across diverse downstream diagnostic tasks, establishing the value of large-scale in-domain pretraining combined with domain-specific augmentation.

链接: https://arxiv.org/abs/2511.04016
作者: Mahmoud Soliman,Islam Osman,Mohamed S. Shehata,Rasika Rajapakshe
机构: University of British Columbia (不列颠哥伦比亚大学); University of Moratuwa (莫拉图瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model’s effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.
zh

[CV-53] GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization

【速读】: This paper addresses the weak generalization of Vision Transformers (ViTs) to unseen domains and the need for parameter-efficient fine-tuning (PEFT) of pretrained ViTs, since standard full fine-tuning is costly and can impair generalization. The key is the GNN-MoE framework, which combines a Mixture-of-Experts (MoE) design with efficient Kronecker adapters and replaces token-based routing with a novel Graph Neural Network (GNN) router (GCN, GAT, or SAGE) operating on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware, graph-based routing exploits spatial relationships between patches to adapt better to domain shifts, achieving state-of-the-art or competitive domain generalization performance with high parameter efficiency.

链接: https://arxiv.org/abs/2511.04008
作者: Mahmoud Soliman,Omar Abdelaziz,Ahmed Radwan,Anand,Mohamed Shehata
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures

点击查看摘要

Abstract:Domain generalization (DG) seeks robust Vision Transformer (ViT) performance on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging; standard fine-tuning is costly and can impair generalization. We propose GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT, SAGE) operates on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware GNN routing leverages inter-patch relationships for better adaptation to domain shifts. GNN-MoE achieves state-of-the-art or competitive DG benchmark performance with high parameter efficiency, highlighting the utility of graph-based contextual routing for robust, lightweight DG.
zh

[CV-54] PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection

【速读】: This paper addresses the severe lack of physical plausibility in current text-to-video generation models, whose outputs often show implausible object dynamics, incoherent interactions, and unrealistic motion patterns, limiting applications in embodied AI, robotics, and simulation-intensive domains. The key contributions of the PhysCorr framework are: (1) PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interaction consistency; and (2) PhyDPO, a direct preference optimization pipeline that uses contrastive feedback and physics-aware reweighting to steer generation toward physically coherent outputs. The approach is model-agnostic and scalable, integrating seamlessly with a wide range of video diffusion and Transformer-based backbones.

链接: https://arxiv.org/abs/2511.03997
作者: Peiyao Wang,Weining Wang,Qi Li
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility - manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.
zh

[CV-55] CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation

【速读】: This paper addresses cross-view consistency in Referring 3D Gaussian Splatting Segmentation (R3DGS), the task of localizing language-referred regions in 3D Gaussian fields; existing methods rely on 2D rendered pseudo supervision and view-specific feature learning, making multi-view semantic and geometric consistency hard to guarantee. The key is the fully differentiable Camera-Aware Referring Field (CaRF), with two core innovations: (1) Gaussian Field Camera Encoding (GFCE), which injects camera geometry into Gaussian-text interactions to explicitly model view-dependent variations and strengthen geometric reasoning; and (2) In-Training Paired-View Supervision (ITPVS), which aligns per-Gaussian logits across calibrated views during training, mitigating single-view overfitting and exposing inter-view discrepancies for optimization. Experiments on three representative benchmarks show clear mIoU gains over prior methods, advancing more reliable and view-consistent 3D scene understanding.

链接: https://arxiv.org/abs/2511.03992
作者: Yuwen Tao,Kanglei Zhou,Xin Tan,Yuan Xie
机构: Shanghai Pinghe High School (上海平和高中); Tsinghua University (清华大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret free-form language expressions and localize the corresponding 3D regions in Gaussian fields. While recent advances have introduced cross-modal alignment between language and 3D geometry, existing pipelines still struggle with cross-view consistency due to their reliance on 2D rendered pseudo supervision and view-specific feature learning. In this work, we present Camera-Aware Referring Field (CaRF), a fully differentiable framework that operates directly in the 3D Gaussian space and achieves multi-view consistency. Specifically, CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian-text interactions to explicitly model view-dependent variations and enhance geometric reasoning. Building on this, In-Training Paired-View Supervision (ITPVS) is proposed to align per-Gaussian logits across calibrated views during training, effectively mitigating single-view overfitting and exposing inter-view discrepancies for optimization. Extensive experiments on three representative benchmarks demonstrate that CaRF achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state-of-the-art methods on the Ref-LERF, LERF-OVS, and 3D-OVS datasets, respectively. Moreover, this work promotes more reliable and view-consistent 3D scene understanding, with potential benefits for embodied AI, AR/VR interaction, and autonomous perception.
zh

[CV-56] Simple 3D Pose Features Support Human and Machine Social Scene Understanding

【速读】: This paper investigates how humans rapidly recognize social interactions from visual input, a mechanism that remains poorly understood, and why current AI vision models perform poorly on social interaction recognition. The key finding is that humans rely on explicit three-dimensional (3D) body pose information for social judgments, and that a minimal set of 3D social pose features describing only the 3D position and direction of faces substantially improves AI models' social interaction recognition. These structured 3D pose features not only match the predictive power of the full set of 3D joints but also explain differences across off-the-shelf AI vision models in matching human social judgments, indicating that human social scene understanding depends on explicit 3D pose representations rather than the implicit learned features of typical deep learning models.

链接: https://arxiv.org/abs/2511.03988
作者: Wenshuo Qin,Leyla Isik
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注: 28 pages, 6 figures

点击查看摘要

Abstract:Humans can quickly and effortlessly extract a variety of information about others’ social interactions from visual input, ranging from visuospatial cues like whether two people are facing each other to higher-level information. Yet, the computations supporting these abilities remain poorly understood, and social interaction recognition continues to challenge even the most advanced AI vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose information to make social interaction judgments, which is absent in most AI vision models. To test this, we combined state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in short video clips depicting everyday human actions and compared their ability to predict human social interaction judgments with current AI vision models. Strikingly, 3D joint positions outperformed most current AI vision models, revealing that key social information is available in explicit body position but not in the learned features of most vision models, including even the layer-wise embeddings of the pose models used to extract joint positions. To uncover the critical pose features humans use to make social judgments, we derived a compact set of 3D social pose features describing only the 3D position and direction of faces in the videos. We found that these minimal descriptors matched the predictive strength of the full set of 3D joints and significantly improved the performance of off-the-shelf AI vision models when combined with their embeddings. Moreover, the degree to which 3D social pose features were represented in each off-the-shelf AI vision model predicted the model’s ability to match human social judgments. Together, our findings provide strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.
zh
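As a toy illustration of the kind of minimal 3D social pose feature the paper derives (3D face position plus facing direction), the snippet below scores how much two people face each other; the paper's exact feature set may differ.

```python
import numpy as np

def facing_score(pos_a, dir_a, pos_b, dir_b):
    """How much two faces point at each other, from 3D positions and
    unit facing directions; +1 means directly facing, -1 back to back."""
    to_b = (pos_b - pos_a) / np.linalg.norm(pos_b - pos_a)
    return 0.5 * (np.dot(dir_a, to_b) + np.dot(dir_b, -to_b))

# Two people 1.5 m apart, facing each other:
print(facing_score(np.zeros(3), np.array([1.0, 0.0, 0.0]),
                   np.array([1.5, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])))  # -> 1.0
```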

[CV-57] Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images

【速读】: This paper addresses the limitation that existing scene reconstruction methods recover only the visible 3D surfaces, leaving occluded structure such as walls, floors, and ceilings missing. While generative models have made progress on completing entire objects from partial observations, the structural elements of a scene, which are typically planar, repetitive, and simple, have received little attention. The key is the synthetic dataset Room Envelopes, which provides, for each RGB image, two annotated pointmaps: one capturing the visible surface and one capturing the structural layout once fittings and fixtures are removed (the first layout surface). This enables direct supervision of feed-forward monocular geometry estimators that predict both the visible surface and the layout surface, conferring an understanding of the scene's extent as well as the shape and location of its objects.

链接: https://arxiv.org/abs/2511.03970
作者: Sam Bahrami,Dylan Campbell
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern scene reconstruction methods are able to accurately recover 3D surfaces that are visible in one or more images. However, this leads to incomplete reconstructions, missing all occluded surfaces. While much progress has been made on reconstructing entire objects given partial observations using generative models, the structural elements of a scene, like the walls, floors and ceilings, have received less attention. We argue that these scene elements should be relatively easy to predict, since they are typically planar, repetitive and simple, and so less costly approaches may be suitable. In this work, we present a synthetic dataset – Room Envelopes – that facilitates progress on this task by providing a set of RGB images and two associated pointmaps for each image: one capturing the visible surface and one capturing the first surface once fittings and fixtures are removed, that is, the structural layout. As we show, this enables direct supervision for feed-forward monocular geometry estimators that predict both the first visible surface and the first layout surface. This confers an understanding of the scene’s extent, as well as the shape and location of its objects.
zh

[CV-58] A Linear Fractional Transformation Model and Calibration Method for Light Field Camera

【速读】: This paper addresses the accuracy and difficulty of calibrating the internal parameters of light field cameras for 3D reconstruction, in particular the coupling between the main lens and the micro lens array (MLA). The key is a linear fractional transformation (LFT) parameter α that decouples the main lens from the MLA, together with a calibration pipeline consisting of an analytical least-squares solution followed by nonlinear refinement, and a method for detecting features from the raw images. The approach also accelerates the simulation of raw light field images, which is valuable for data-driven deep learning methods.

链接: https://arxiv.org/abs/2511.03962
作者: Zhong Chen,Changfeng Chen
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate calibration of internal parameters is a crucial yet challenging prerequisite for 3D reconstruction using light field cameras. In this paper, we propose a linear fractional transformation (LFT) parameter \alpha to decouple the main lens and micro lens array (MLA). The proposed method includes an analytical solution based on least squares, followed by nonlinear refinement. The method for detecting features from the raw images is also introduced. Experimental results on both physical and simulated data have verified the performance of the proposed method. Based on the proposed model, the simulation of raw light field images becomes faster, which is crucial for data-driven deep learning methods. The corresponding code can be obtained from the author's website.
zh
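The two-stage solve pattern (analytical least squares followed by nonlinear refinement) can be illustrated on a toy linear fractional model; the actual LFT camera model and its parameterization are the paper's and are not reproduced here.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy model u = (a*x0 + b) / (c*x1 + 1), fitted in two stages.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(200, 2))
a, b, c = 2.0, 0.3, 0.8
u = (a * x[:, 0] + b) / (c * x[:, 1] + 1.0) + 0.005 * rng.normal(size=200)

# Stage 1: linearize a*x0 + b - u*c*x1 = u and solve by least squares.
A = np.column_stack([x[:, 0], np.ones(len(x)), -u * x[:, 1]])
init, *_ = np.linalg.lstsq(A, u, rcond=None)

# Stage 2: nonlinear refinement on the true residuals.
def residuals(p, x, u):
    return (p[0] * x[:, 0] + p[1]) / (p[2] * x[:, 1] + 1.0) - u

refined = least_squares(residuals, init, args=(x, u))
print(init, refined.x)   # refined.x should be close to (2.0, 0.3, 0.8)
```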

[CV-59] Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization

【速读】: This paper addresses the difficulty of jointly achieving geometric accuracy and photorealism in multi-view reconstruction: existing methods decouple geometry optimization (e.g., Multi-View Stereo) from appearance optimization (e.g., Novel View Synthesis), which hampers downstream editing tasks such as relighting and shape deformation. The key is a unified Gaussian-guided differentiable mesh rendering framework that jointly optimizes mesh vertex positions, faces, and vertex colors, using photometric consistency from the input images together with geometric regularization from normal and depth maps, yielding high-quality 3D reconstructions that readily support downstream editing.

链接: https://arxiv.org/abs/2511.03950
作者: Zhejia Cai,Puhua Jiang,Shiwei Mao,Hongkun Cao,Ruqi Huang
机构: SIGS, Tsinghua University (清华大学深圳国际研究生院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Reconstructing real-world objects from multi-view images is essential for applications in 3D editing, AR/VR, and digital content creation. Existing methods typically prioritize either geometric accuracy (Multi-View Stereo) or photorealistic rendering (Novel View Synthesis), often decoupling geometry and appearance optimization, which hinders downstream editing tasks. This paper advocates a unified treatment of geometry and appearance optimization for seamless Gaussian-mesh joint optimization. More specifically, we propose a novel framework that simultaneously optimizes mesh geometry (vertex positions and faces) and vertex colors via Gaussian-guided mesh differentiable rendering, leveraging photometric consistency from input images and geometric regularization from normal and depth maps. The obtained high-quality 3D reconstruction can be further exploited in downstream editing tasks, such as relighting and shape deformation. The code will be publicly available upon acceptance.
zh

[CV-60] Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization

【速读】: This paper addresses two issues in temporal action localization: insufficient boundary detection precision and the uniform allocation of computation that ignores how difficulty varies across boundaries. The key lies in two complementary contributions. First, Boundary Distance Regression (BDR) replaces the conventional classification strategy with signed-distance regression for information-theoretically optimal boundary localization, yielding 43% sharper boundary peaks and retrofitting easily onto existing models. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection τ ∈ [0, 1], enabling end-to-end differentiable resource allocation without reinforcement learning; on THUMOS14 it improves mAP@0.7 by 2.9% with 18% fewer FLOPs, with larger gains (up to 4.2%) on short actions.

链接: https://arxiv.org/abs/2511.03943
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
机构: Iowa State University (爱荷华州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection \tau \in [0,1] , enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% mAP@0.7 at 162G FLOPs, compared to 53.6% at 198G for uniform processing, providing a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
zh
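A minimal sketch of how signed-distance regression targets for boundaries can be built and trained against; the clipping radius and the smooth-L1 loss are our assumptions, not the paper's exact formulation.

```python
import torch

def signed_distance_targets(t, boundaries, clip=16.0):
    """Signed distance from each frame index in `t` to its nearest action
    boundary (negative before, positive after), clipped for stability."""
    d = t[:, None] - boundaries[None, :]           # (T, B) signed offsets
    idx = d.abs().argmin(dim=1)                    # nearest boundary per frame
    return d.gather(1, idx[:, None]).squeeze(1).clamp(-clip, clip)

t = torch.arange(0, 100, dtype=torch.float32)
boundaries = torch.tensor([12.0, 57.0, 80.0])
target = signed_distance_targets(t, boundaries)
pred = torch.zeros_like(target, requires_grad=True)
loss = torch.nn.functional.smooth_l1_loss(pred, target)   # regression, not classification
loss.backward()
```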

[CV-61] NVIDIA Nemotron Nano V2 VL

【速读】: This paper targets the performance bottlenecks of current vision-language models on real-world document understanding, long-video comprehension, and reasoning tasks. The key is the Nemotron Nano V2 VL model, which builds on a hybrid Mamba-Transformer LLM backbone and innovative token reduction techniques to achieve higher inference throughput on long documents and videos while maintaining strong accuracy.

链接: https://arxiv.org/abs/2511.03929
作者: NVIDIA:Amala Sanjay Deshmukh,Kateryna Chumachenko,Tuomas Rintamaki,Matthieu Le,Tyler Poon,Danial Mohseni Taheri,Ilia Karmanov,Guilin Liu,Jarno Seppanen,Guo Chen,Karan Sapra,Zhiding Yu,Adi Renduchintala,Charles Wang,Peter Jin,Arushi Goel,Mike Ranzinger,Lukas Voegtle,Philipp Fischer,Timo Roman,Wei Ping,Boxin Wang,Zhuolin Yang,Nayeon Lee,Shaokun Zhang,Fuxiao Liu,Zhiqi Li,Di Zhang,Greg Heinrich,Hongxu(Danny)Yin,Song Han,Pavlo Molchanov,Parth Mannan,Yao Xu,Jane Polak Scowcroft,Tom Balough,Subhashree Radhakrishnan,Paris Zhang,Sean Cha,Ratnesh Kumar,Zaid Pervaiz Bhat,Jian Zhang,Darragh Hanley,Pritam Biswas,Jesse Oliver,Kevin Vasques,Roger Waleffe,Duncan Riach,Oluwatobi Olabiyi,Ameya Sunil Mahabaleshwarkar,Bilal Kartal,Pritam Gundecha,Khanh Nguyen,Alexandre Milesi,Eugene Khvedchenia,Ran Zilberstein,Ofri Masad,Natan Bagrov,Nave Assaf,Tomer Asida,Daniel Afrimi,Amit Zuker,Netanel Haber,Zhiyu Cheng,Jingyu(Justin)Xin, Di (Allan)Wu,Nik Spirin,Maryam Moosaei,Roman Ageev,Vanshil Atul Shah,Yuting Wu,Daniel Korzekwa,Unnikrishnan Kizhakkemadam Sreekumar,Wanli Jiang,Padmavathy Subramanian,Alejandra Rico,Sandip Bhaskar,Saeid Motiian,Kedi Wu,Annie Surla,Chia-Chih Chen,Hayden Wolff,Matthew Feinberg,Melissa Corpuz,Marek Wawrzos,Eileen Long,Aastha Jhunjhunwala,Paul Hendricks,Farzan Memarian,Benika Hall,Xin-Yu Wang,David Mosallanezhad,Soumye Singhal,Luis Vega,Katherine Cheung,Krzysztof Pawelec,Michael Evans,Katherine Luna,Jie Lou,Erick Galinkin
机构: NVIDIA(英伟达)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Nemotron Nano V2 VL, the latest model of the Nemotron vision-language series designed for strong real-world document understanding, long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers significant improvements over our previous model, Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major enhancements in model architecture, datasets, and training recipes. Nemotron Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and innovative token reduction techniques to achieve higher inference throughput in long document and video scenarios. We are releasing model checkpoints in BF16, FP8, and FP4 formats and sharing large parts of our datasets, recipes and training code.
zh

[CV-62] I Detect What I Don't Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging

【速读】: This paper addresses unknown anomaly detection in medical imaging, where labeled anomalies are scarce and expert annotation is costly. The key is an unsupervised, oracle-free incremental learning framework that gradually expands a trusted set of normal samples via lightweight adapters and uncertainty-gated sample admission. Concretely, small convolutional adapters are added to a frozen pretrained vision backbone for rapid domain adaptation, and a compact coreset of embeddings supports efficient k-nearest-neighbor anomaly scoring. During incremental expansion, a dual probabilistic gate admits a sample only if its distance to the existing coreset lies within a calibrated z-score threshold and its SWAG (Stochastic Weight Averaging-Gaussian) epistemic uncertainty stays below a seed-calibrated bound, preventing drift and false inclusions without generative reconstruction or replay buffers. The scheme yields substantial gains on several real medical imaging datasets, validating its effectiveness and efficiency in label-scarce settings.

链接: https://arxiv.org/abs/2511.03912
作者: Nand Kumar Yadav,Rodrigue Rizk,William CW Chen,KC Santosh(AI Research Lab, Department of Computer Science and Biomedical and Translational Sciences, Sanford School of Medicine, University Of South Dakota, Vermillion, SD, USA)
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert supervision. We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. Starting from a small, verified seed of normal images, our method alternates between lightweight adapter updates and uncertainty-gated sample admission. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. Extracted embeddings are stored in a compact coreset enabling efficient k-nearest neighbor anomaly (k-NN) scoring. Safety during incremental expansion is enforced by dual probabilistic gates, a sample is admitted into the normal memory only if its distance to the existing coreset lies within a calibrated z-score threshold, and its SWAG-based epistemic uncertainty remains below a seed-calibrated bound. This mechanism prevents drift and false inclusions without relying on generative reconstruction or replay buffers. Empirically, our system steadily refines the notion of normality as unlabeled data arrive, producing substantial gains over baselines. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. These results highlight the effectiveness and efficiency of the proposed framework for real-world, label-scarce medical imaging applications.
zh
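A schematic of the dual-gated admission rule with made-up embeddings and thresholds; here the SWAG epistemic uncertainty is passed in as a precomputed scalar rather than derived from a weight posterior.

```python
import numpy as np

def knn_score(embedding, coreset, k=5):
    """k-NN anomaly score: mean distance to the k nearest coreset entries."""
    d = np.linalg.norm(coreset - embedding, axis=1)
    return np.sort(d)[:k].mean()

def admit(embedding, coreset, mu, sigma, z_thresh, swag_unc, u_bound):
    """Admit into the normal memory only if the k-NN distance is within a
    calibrated z-score AND the epistemic uncertainty is below its bound."""
    z = (knn_score(embedding, coreset) - mu) / sigma
    return z <= z_thresh and swag_unc <= u_bound

rng = np.random.default_rng(0)
coreset = rng.normal(size=(500, 128))              # embeddings of trusted normals
seed_scores = [knn_score(e, coreset) for e in rng.normal(size=(50, 128))]
mu, sigma = np.mean(seed_scores), np.std(seed_scores)   # calibrated on the seed set
print(admit(rng.normal(size=128), coreset, mu, sigma,
            z_thresh=2.0, swag_unc=0.05, u_bound=0.1))
```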

[CV-63] Improving Diagnostic Performance on Small and Imbalanced Datasets Using Class-Based Input Image Composition

【速读】: This paper addresses the high false prediction rates of deep learning models caused by small datasets and poor input image quality, especially the difficulty of recognizing subtle disease patterns under class imbalance in medical image diagnosis. The key is Class-Based Image Composition, which fuses multiple images of the same class into Composite Input Images (CoImg), enhancing intra-class variance, increasing the information density of each training sample, and improving the model's ability to discriminate subtle pathological features. On Co-OCTDL, a class-balanced variant of the OCTDL dataset, the method markedly improves performance, reaching 99.6% accuracy, an F1-score of 0.995, and an AUC of 0.9996 while sharply lowering the false prediction rate, confirming its effectiveness under weak data conditions.

链接: https://arxiv.org/abs/2511.03891
作者: Hlali Azzeddine,Majid Ben Yakhlef,Soulaiman El Hazzat
机构: Sidi Mohamed Ben Abdellah University (西迪穆罕默德本阿卜杜拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Small, imbalanced datasets and poor input image quality can lead to high false prediction rates with deep learning models. This paper introduces Class-Based Image Composition, an approach that allows us to reformulate training inputs through a fusion of multiple images of the same class into combined visual composites, named Composite Input Images (CoImg). This enhances intra-class variance, improves the information density per training sample, and increases the ability of the model to distinguish between subtle disease patterns. Our method was evaluated on the Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et al., 2024), which contains 2,064 high-resolution optical coherence tomography (OCT) scans of the human retina, representing seven distinct diseases with a significant class imbalance. We constructed a perfectly class-balanced version of this dataset, named Co-OCTDL, where each scan is represented as a 3x1 layout composite image. To assess the effectiveness of this new representation, we conducted a comparative analysis between the original dataset and its variant using a VGG16 model. A fair comparison was ensured by utilizing the identical model architecture and hyperparameters for all experiments. The proposed approach markedly improved diagnostic performance. The enhanced dataset achieved near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared to a baseline model trained on the raw dataset. The false prediction rate was also significantly lower, demonstrating that the method can produce high-quality predictions even for weak datasets affected by class imbalance or small sample size.
zh
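Composing a CoImg is essentially an image-stacking operation; a minimal sketch follows, where the vertical stacking direction is our reading of the 3x1 layout.

```python
import numpy as np

def make_coimg(images):
    """Fuse three same-class images of identical shape into one 3x1 composite."""
    assert len(images) == 3
    return np.concatenate(images, axis=0)     # stack along the height axis

# Toy usage: three 224x224 single-channel OCT-like scans -> one 672x224 input.
scans = [np.random.rand(224, 224, 1) for _ in range(3)]
coimg = make_coimg(scans)
print(coimg.shape)                            # (672, 224, 1)
```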

[CV-64] Desert Waste Detection and Classification Using Data-Based and Model-Based Enhanced YOLOv12 DL Model

【速读】: This paper addresses the low efficiency, high labor cost, and safety hazards of solid waste collection in extreme environments such as deserts, and fills a gap in existing computer-vision waste detection research, which mostly targets urban scenes and recyclables while neglecting organic and hazardous waste. The key is an improved real-time object detection framework based on a pruned, lightweight YOLOv12 model combined with Self-Adversarial Training (SAT) and specialized data augmentation strategies. On the DroneTrashNet dataset it significantly improves precision, recall, and mean average precision (mAP) while keeping latency low and the model compact for deployment on resource-constrained aerial drones, striking an optimal balance between accuracy and efficiency.

链接: https://arxiv.org/abs/2511.03888
作者: Abdulmumin Sa’ad,Sulaimon Oyeniyi Adebayo,Abdul Jabbar Siddiqui
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

Abstract:The global waste crisis is escalating, with solid waste generation expected to increase by 70% by 2050. Traditional waste collection methods, particularly in remote or harsh environments like deserts, are labor-intensive, inefficient, and often hazardous. Recent advances in computer vision and deep learning have opened the door to automated waste detection systems, yet most research focuses on urban environments and recyclable materials, overlooking organic and hazardous waste and underexplored terrains such as deserts. In this work, we propose an enhanced real-time object detection framework based on a pruned, lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies. Using the DroneTrashNet dataset, we demonstrate significant improvements in precision, recall, and mean average precision (mAP), while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones. Benchmarking our model against state-of-the-art lightweight YOLO variants further highlights its optimal balance of accuracy and efficiency. Our results validate the effectiveness of combining data-centric and model-centric enhancements for robust, real-time waste detection in desert environments.
zh

[CV-65] Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

【速读】: This paper studies the feasibility of imitation-learning-based robot control policies for X-ray-guided spine procedures, where the applicability of imitation learning has been unclear because interpreting multi-view X-rays is complex. The key is a high-fidelity in silico sandbox that automatically generates realistic data for bi-planar X-ray-guided cannula insertion, on which planning and open-loop control policies are trained to iteratively align a cannula from visual information alone. Experiments show a 68.5% first-attempt success rate with safe intra-pedicular trajectories, and the policy is robust to complex anatomy (including fractures) and varied initializations, validating the potential of imitation learning for this task while also revealing limitations such as insufficient entry-point precision.

链接: https://arxiv.org/abs/2511.03882
作者: Florence Klitzner,Blanca Inigo,Benjamin D. Killeen,Lalithkumar Seenivasan,Michelle Song,Axel Krieger,Mathias Unberath
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation. This is because interpretation of multi-view X-rays is complex. We examine opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy generalized to complex anatomy, including fractures, and remained robust to varied initializations. Rollouts on real bi-planar X-rays further suggest that the model can produce plausible trajectories, despite training exclusively in simulation. While these preliminary results are promising, we also identify limitations, especially in entry point precision. Full closed-loop control will require additional considerations around how to provide sufficiently frequent feedback. With more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.
zh

[CV-66] Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets

【速读】: This paper addresses the insufficient generalization of deep learning (DL) models in medical image recognition, specifically the failure of COVID-19 chest X-ray (CXR) detectors to adapt to out-of-distribution (OOD) data from new clinical sources. The core issue is that models tend to exploit "shortcut" features in the training data (such as device-specific artifacts) rather than reliable biomarkers, causing sharp performance drops in real-world deployment. The key is to inject basic noise (Gaussian, speckle, Poisson, and salt-and-pepper) during training, which strengthens robustness to input perturbations and effectively shrinks the gap between in-distribution (ID) and OOD performance: empirically, the performance difference on key metrics (AUC, F1, accuracy, recall, and specificity) drops from 0.10-0.20 to 0.01-0.06.

链接: https://arxiv.org/abs/2511.03855
作者: Duong Mai,Lawrence Hall
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Abstract accepted for oral presentation at SPIE Medical Imaging 2026: Computer-Aided Diagnosis

点击查看摘要

Abstract:Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts - source-specific artifacts that do not translate to new distributions - rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. To render the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at this https URL
zh
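The four noise types can be injected in a few lines of NumPy; the strength parameters below are illustrative, not the paper's settings.

```python
import numpy as np

def inject_noise(img, kind, rng):
    """Apply one of the four studied noises to an image scaled to [0, 1]."""
    if kind == "gaussian":
        out = img + rng.normal(0.0, 0.05, img.shape)
    elif kind == "speckle":
        out = img * (1.0 + rng.normal(0.0, 0.05, img.shape))
    elif kind == "poisson":
        out = rng.poisson(img * 255.0) / 255.0
    elif kind == "salt_pepper":
        out = img.copy()
        mask = rng.random(img.shape)
        out[mask < 0.01] = 0.0                # pepper
        out[mask > 0.99] = 1.0                # salt
    else:
        raise ValueError(kind)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
cxr = rng.random((256, 256))                  # stand-in for a normalized CXR
noisy = {k: inject_noise(cxr, k, rng)
         for k in ["gaussian", "speckle", "poisson", "salt_pepper"]}
```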

[CV-67] SILVI: Simple Interface for Labeling Video Interactions

【速读】: This paper addresses a limitation of video annotation in animal behavior research: existing open-source tools either support behavioral labels without localizing individuals or localization without capturing interactions between individuals, hindering in-depth analysis of social and individualized animal behavior. The key is SILVI, an open-source annotation tool that integrates behavioral labeling with interaction annotation, allowing researchers to annotate individual behaviors and interactions directly in videos and to generate structured outputs for training and validating computer vision models, enabling the automated extraction of dynamic scene graphs from raw video.

链接: https://arxiv.org/abs/2511.03819
作者: Ozan Kanbertay(1),Richard Vogg(1 and 2),Elif Karakoc(2),Peter M. Kappeler(2 and 3),Claudia Fichtel(2),Alexander S. Ecker(1) ((1) Institute of Computer Science and Campus Institute Data Science, University of Göttingen, (2) Behavioral Ecology & Sociobiology Unit, German Primate Center, Göttingen, Germany, (3) Department of Sociobiology/Anthropology, University of Göttingen, Göttingen, Germany)
机构: 1. University of Konstanz (康斯坦茨大学); 2. Max Planck Institute for Ornithology (马克斯·普朗克鸟类学研究所); 3. Ludwig-Maximilians-Universität München (慕尼黑路德维希-马克西米利安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Computer vision methods are increasingly used for the automated analysis of large volumes of video data collected through camera traps, drones, or direct observations of animals in the wild. While recent advances have focused primarily on detecting individual actions, much less work has addressed the detection and annotation of interactions – a crucial aspect for understanding social and individualized animal behavior. Existing open-source annotation tools support either behavioral labeling without localization of individuals, or localization without the capacity to capture interactions. To bridge this gap, we present SILVI, an open-source labeling software that integrates both functionalities. SILVI enables researchers to annotate behaviors and interactions directly within video data, generating structured outputs suitable for training and validating computer vision models. By linking behavioral ecology with computer vision, SILVI facilitates the development of automated approaches for fine-grained behavioral analyses. Although developed primarily in the context of animal behavior, SILVI could be useful more broadly to annotate human interactions in other videos that require extracting dynamic scene graphs. The software, along with documentation and download instructions, is available at: this https URL.
zh

[CV-68] What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes NEURIPS

【速读】: This paper addresses the hallucinations of current multimodal language models when reasoning across images in real-world scenes: despite strong results on saturating perception benchmarks, performance drops sharply on reasoning tasks that require understanding what multiple images have in common. The key is Common-O, a new benchmark of more than 10.5k examples built exclusively from in-the-wild images not found in web training data (avoiding contamination), which goes beyond perception and, inspired by cognitive tests for humans, probes cross-scene reasoning with "what's in common?" questions. Evaluations show that while most models perceive objects in single images accurately, cross-scene reasoning remains very challenging even for dedicated reasoning models; models hallucinate more when similar objects are present, suggesting reliance on object co-occurrence patterns learned during training; and scaling brings only modest gains, whereas explicit training with multi-image inputs shows larger improvements.

链接: https://arxiv.org/abs/2511.03768
作者: Candace Ross,Florian Bordes,Adina Williams,Polina Kirichenko,Mark Ibrahim
机构: FAIR at Meta (Facebook AI Research at Meta)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures. Accepted to NeurIPS Datasets & Benchmarks 2025

点击查看摘要

Abstract:Multimodal language models possess a remarkable ability to handle an open-vocabulary’s worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking “what’s in common?”. We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Despite saturating many leaderboards focusing on perception, the best performing model only achieves 35% on Common-O – and on Common-O Complex, consisting of more complex scenes, the best model achieves only 1%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrence seen during training. Among the models we evaluated, we found scale can provide modest improvements while models explicitly trained with multi-image inputs show bigger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.
zh

[CV-69] LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices DATE2026

【速读】: This paper addresses on-device adaptation of convolutional neural networks (CNNs) on edge platforms, where domain shift degrades performance but strict memory, compute, and energy budgets rule out full fine-tuning, and existing parameter-efficient fine-tuning (PEFT) methods struggle to balance structure preservation with accurate adaptation. The key is LoRA-Edge: it applies Tensor-Train SVD (TT-SVD) to pretrained convolutional layers, selectively updates only the zero-initialized output-side core so the auxiliary path starts inactive, and fuses the update back into dense kernels, leaving inference cost unchanged while cutting trainable parameters by up to two orders of magnitude. Across diverse Human Activity Recognition (HAR) datasets and CNN backbones, LoRA-Edge stays within 4.7% of full fine-tuning accuracy while clearly outperforming comparable PEFT baselines under similar budgets.

链接: https://arxiv.org/abs/2511.03765
作者: Hyunseok Kwak,Kyeongwon Lee,Jae-Jin Lee,Woojoo Lee
机构: Chung-Ang University (中央大学); Electronics and Telecommunications Research Institute (电子与电信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
备注: 8 pages, 6 figures, 2 tables, DATE 2026 accepted paper

点击查看摘要

Abstract:On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) selectively updates only the output-side core with zero-initialization to keep the auxiliary path inactive at the start, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4-3.8x faster convergence to target F1. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
zh
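A simplified sketch of the zero-initialized, output-side low-rank update that is later fused back into the dense kernel; a plain two-factor LoRA stands in for the TT-SVD core structure, which is omitted here.

```python
import torch

out_ch, in_ch, k, rank = 64, 32, 3, 4
W = torch.randn(out_ch, in_ch, k, k)               # frozen pretrained conv kernel
A = 0.01 * torch.randn(rank, in_ch * k * k)        # fixed input-side factor
B = torch.zeros(out_ch, rank, requires_grad=True)  # trainable, zero-initialized:
                                                   # the auxiliary path starts inactive

# ... train only B on the target domain ...

with torch.no_grad():
    delta = (B @ A).view(out_ch, in_ch, k, k)      # low-rank kernel update
    W_fused = W + delta                            # fused: inference cost unchanged
print(torch.allclose(W, W_fused))                  # True while B is still zero
```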

[CV-70] A convolutional neural network deep learning method for model class selection

【速读】: This paper addresses model class selection in structural health monitoring (SHM): distinguishing the model class of linear and nonlinear dynamic systems without system input information or full system identification. The key is to train a one-dimensional convolutional neural network on response-only signals and their class labels so that it can classify the model class of new, unlabeled samples; an optional physics-based enhancement uses a Kalman filter to fuse acceleration and displacement data under kinematic constraints, increasing sensitivity and robustness to slight signal variations caused by damping or hysteretic behavior. The method extends to complex cases, including a 3D building finite element model.

链接: https://arxiv.org/abs/2511.03743
作者: Marios Impraimakis
机构: Unknown
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 31 pages, 16 figures, published in Earthquake Engineering & Structural Dynamics

点击查看摘要

Abstract:The response-only model class selection capability of a novel deep convolutional neural network method is examined herein in a simple, yet effective, manner. Specifically, the responses from a unique degree of freedom along with their class information train and validate a one-dimensional convolutional neural network. In doing so, the network selects the model class of new and unlabeled signals without the need for system input information, or full system identification. An optional physics-based algorithm enhancement is also examined using the Kalman filter to fuse the system response signals using the kinematics constraints of the acceleration and displacement data. Importantly, the method is shown to select the model class under slight signal variations attributed to damping or hysteresis behavior on both linear and nonlinear dynamic systems, as well as on a 3D building finite element model, providing a powerful tool for structural health monitoring applications.
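As a rough illustration of the classifier described above, the following is a minimal 1D-CNN sketch in PyTorch; the layer sizes and kernel widths are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class ModelClassNet(nn.Module):
    """Input: (batch, 1, signal_length) response time series;
    output: model-class logits."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),  # makes the net length-independent
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).squeeze(-1))

# Trained with plain cross-entropy on (response signal, class label)
# pairs; no input signal or full system identification is required.
```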

[CV-71] μNeuFMT: Optical-Property-Adaptive Fluorescence Molecular Tomography via Implicit Neural Representation

Quick Read: This paper addresses two obstacles in fluorescence molecular tomography (FMT): the inherent ill-posedness of the reconstruction problem and the strong dependence on accurate prior knowledge of tissue optical properties. Conventional methods require precisely known optical parameters, while supervised deep learning approaches generalize poorly beyond their training data. The key of the proposed self-supervised framework μNeuFMT is to combine an implicit neural scene representation with explicit physical modeling of photon propagation and to jointly optimize the fluorescence distribution and the optical coefficients (μ) during reconstruction, achieving robust, accurate reconstruction without precise optical priors or pre-conditioned training data.

Link: https://arxiv.org/abs/2511.04510
Authors: Shihan Zhao, Jianru Zhang, Yanan Wu, Linlin Li, Siyuan Shen, Xingjun Zhu, Guoyan Zheng, Jiahua Jiang, Wuwei Ren
Institutions: ShanghaiTech University; University of Birmingham; Shanghai Jiao Tong University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:

Abstract:Fluorescence Molecular Tomography (FMT) is a promising technique for non-invasive 3D visualization of fluorescent probes, but its reconstruction remains challenging due to the inherent ill-posedness and reliance on inaccurate or often-unknown tissue optical properties. While deep learning methods have shown promise, their supervised nature limits generalization beyond training data. To address these problems, we propose μNeuFMT, a self-supervised FMT reconstruction framework that integrates implicit neural-based scene representation with explicit physical modeling of photon propagation. Its key innovation lies in jointly optimizing both the fluorescence distribution and the optical properties (μ) during reconstruction, eliminating the need for precise prior knowledge of tissue optics or pre-conditioned training data. We demonstrate that μNeuFMT robustly recovers accurate fluorophore distributions and optical coefficients even with severely erroneous initial values (0.5× to 2× of ground truth). Extensive numerical, phantom, and in vivo validations show that μNeuFMT outperforms conventional and supervised deep learning approaches across diverse heterogeneous scenarios. Our work establishes a new paradigm for robust and accurate FMT reconstruction, paving the way for more reliable molecular imaging in complex clinically related scenarios, such as fluorescence guided surgery.

[CV-72] Shape Deformation Networks for Automated Aortic Valve Finite Element Meshing from 3D CT Images

Quick Read: This paper tackles the generation of high-quality, cross-patient-consistent aortic valve meshes from 3D CT images. Traditional approaches often produce triangular meshes with irregular topology, yielding poorly shaped elements and inconsistent inter-patient correspondence, which degrades biomechanical analysis and patient-specific simulation. The key is a template-fitting pipeline with deep neural networks: all patients' valves are remeshed with a common structured quadrilateral (quad) template, guaranteeing uniform topology and node-to-node and element-to-element correspondence. The structured quad mesh also simplifies the learning objective to just two terms, a geometry reconstruction loss and a smoothness regularizer, improving training efficiency as well as mesh smoothness and element quality, and making aortic valve modeling both more effective and more efficient.

Link: https://arxiv.org/abs/2511.03890
Authors: Linchen Qian, Jiasong Chen, Ruonan Gong, Wei Sun, Minliang Liu, Liang Liang
Institutions: University of Miami; Sutra Medical Inc; Texas Tech University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Accurate geometric modeling of the aortic valve from 3D CT images is essential for biomechanical analysis and patient-specific simulations to assess valve health or make a preoperative plan. However, it remains challenging to generate aortic valve meshes with both high-quality and consistency across different patients. Traditional approaches often produce triangular meshes with irregular topologies, which can result in poorly shaped elements and inconsistent correspondence due to inter-patient anatomical variation. In this work, we address these challenges by introducing a template-fitting pipeline with deep neural networks to generate structured quad (i.e., quadrilateral) meshes from 3D CT images to represent aortic valve geometries. By remeshing aortic valves of all patients with a common quad mesh template, we ensure a uniform mesh topology with consistent node-to-node and element-to-element correspondence across patients. This consistency enables us to simplify the learning objective of the deep neural networks, by employing a loss function with only two terms (i.e., a geometry reconstruction term and a smoothness regularization term), which is sufficient to preserve mesh smoothness and element quality. Our experiments demonstrate that the proposed approach produces high-quality aortic valve surface meshes with improved smoothness and shape quality, while requiring fewer explicit regularization terms compared to the traditional methods. These results highlight that using structured quad meshes for the template and neural network training not only ensures mesh correspondence and quality but also simplifies the training process, thus enhancing the effectiveness and efficiency of aortic valve modeling.
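Because the shared quad template fixes node-to-node correspondence, the two-term objective reduces to a direct coordinate reconstruction error plus a Laplacian-style smoothness term. Below is a minimal sketch, assuming a regular template where each vertex has four neighbors; the exact weighting and Laplacian variant are assumptions.

```python
import torch

def template_fitting_loss(pred, target, neighbors, lam=0.1):
    """pred, target: (n_nodes, 3) vertex positions, already in
    node-to-node correspondence thanks to the shared quad template.
    neighbors: (n_nodes, 4) long tensor of template neighbor indices."""
    recon = ((pred - target) ** 2).sum(dim=1).mean()
    # Uniform-Laplacian smoothness: pull each vertex toward the
    # centroid of its template neighbors.
    centroid = pred[neighbors].mean(dim=1)
    smooth = ((pred - centroid) ** 2).sum(dim=1).mean()
    return recon + lam * smooth
```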

[CV-73] Computed Tomography (CT)-derived Cardiovascular Flow Estimation Using Physics-Informed Neural Networks Improves with Sinogram-based Training: A Simulation Study

Quick Read: This paper addresses blood flow velocity estimation from CT imaging, i.e., estimating flow from CT movies of contrast-agent evolution. Conventional approaches estimate flow from reconstructed images, which propagates errors introduced by filtered backprojection and degrades accuracy. The key of the proposed SinoFlow framework is to train physics-informed neural networks (PINNs) directly on raw sinogram data, bypassing image reconstruction and its error propagation. This markedly improves accuracy and robustness across gantry rotation speeds and pulsed-mode imaging settings, consistently yielding lower mean squared error and velocity errors.

Link: https://arxiv.org/abs/2511.03876
Authors: Jinyuxuan Guo, Gurnoor Singh Khurana, Alejandro Gonzalo Grande, Juan C. del Alamo, Francisco Contijoch
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments:

Abstract:Background: Non-invasive imaging-based assessment of blood flow plays a critical role in evaluating heart function and structure. Computed Tomography (CT) is a widely-used imaging modality that can robustly evaluate cardiovascular anatomy and function, but direct methods to estimate blood flow velocity from movies of contrast evolution have not been developed. Purpose: This study evaluates the impact of CT imaging on Physics-Informed Neural Networks (PINN)-based flow estimation and proposes an improved framework, SinoFlow, which uses sinogram data directly to estimate blood flow. Methods: We generated pulsatile flow fields in an idealized 2D vessel bifurcation using computational fluid dynamics and simulated CT scans with varying gantry rotation speeds, tube currents, and pulse mode imaging settings. We compared the performance of PINN-based flow estimation using reconstructed images (ImageFlow) to SinoFlow. Results: SinoFlow significantly improved flow estimation performance by avoiding propagating errors introduced by filtered backprojection. SinoFlow was robust across all tested gantry rotation speeds and consistently produced lower mean squared error and velocity errors than ImageFlow. Additionally, SinoFlow was compatible with pulsed-mode imaging and maintained higher accuracy with shorter pulse widths. Conclusions: This study demonstrates the potential of SinoFlow for CT-based flow estimation, providing a more promising approach for non-invasive blood flow assessment. The findings aim to inform future applications of PINNs to CT images and provide a solution for image-based estimation, with reasonable acquisition parameters yielding accurate flow estimates.

Artificial Intelligence

[AI-0] Addressing divergent representations from causal interventions on neural networks

Quick Read: This paper asks whether the causal interventions used in mechanistic interpretability push model representations out of the target model's natural distribution, and whether such divergence undermines how faithful the resulting explanations are to the model in its natural state. The key contribution is to distinguish two classes of divergence: "harmless" divergences (shifts in the null-space of the weights, or covariance within behavioral decision boundaries) and "pernicious" divergences that activate hidden network pathways and trigger dormant behavioral changes. To mitigate the pernicious cases, the authors modify the Counterfactual Latent (CL) loss to regularize interventions toward the natural distribution, reducing harmful divergence while preserving the interpretive power of interventions.

Link: https://arxiv.org/abs/2511.04638
Authors: Satchel Grant, Simon Jerome Han, Alexa Tartaglini, Christopher Potts
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two classes of such divergences: 'harmless' divergences that occur in the null-space of the weights and from covariance within behavioral decision boundaries, and 'pernicious' divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we modify the Counterfactual Latent (CL) loss from Grant (2025) that regularizes interventions to remain closer to the natural distributions, reducing the likelihood of harmful divergences while preserving the interpretive power of interventions. Together, these results highlight a path towards more reliable interpretability methods.
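The "harmless" null-space case can be checked mechanically: an intervention offset that lies entirely in the null space of the downstream weights changes the representation without changing that layer's output. A tiny illustrative check (not the paper's procedure):

```python
import numpy as np

def split_by_null_space(W, delta):
    """Split an intervention offset on the activations feeding a linear
    map W into the component W can 'see' and the null-space component.
    A delta lying (almost) entirely in the null space is the harmless
    case: the representation moves, the layer's output does not."""
    P = np.linalg.pinv(W) @ W       # projector onto the row space of W
    visible = P @ delta
    return visible, delta - visible

W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])     # maps R^3 -> R^2, ignores x_3
delta = np.array([0.0, 0.0, 5.0])   # intervention along x_3 only
visible, hidden = split_by_null_space(W, delta)
assert np.allclose(W @ delta, 0.0)  # output unchanged: 'harmless'
```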

[AI-1] Question the Questions: Auditing Representation in Online Deliberative Processes

Quick Read: This paper addresses how to select a small, maximally representative subset of participant-proposed questions for expert panels in deliberative processes (citizens' assemblies, deliberative polls), so that expert answers serve the interests of all participants under tight time constraints. The key is an auditing framework based on the social choice notion of justified representation (JR), together with the first algorithms for auditing JR in the general utility setting; the most efficient runs in O(mn log n) time, where n is the number of participants and m the number of proposed questions. Applying the audit to historical deliberations, the authors compare the representativeness of (a) the questions actually posed, chosen by a moderator, (b) questions selected via integer linear programming, and (c) LLM-generated summary questions, highlighting both the promise and the current limitations of LLMs in supporting deliberation.

Link: https://arxiv.org/abs/2511.04588
Authors: Soham De, Lodewijk Gelauff, Ashish Goel, Smitha Milli, Ariel Procaccia, Alice Siu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:A central feature of many deliberative processes, such as citizens' assemblies and deliberative polls, is the opportunity for participants to engage directly with experts. While participants are typically invited to propose questions for expert panels, only a limited number can be selected due to time constraints. This raises the challenge of how to choose a small set of questions that best represent the interests of all participants. We introduce an auditing framework for measuring the level of representation provided by a slate of questions, based on the social choice concept known as justified representation (JR). We present the first algorithms for auditing JR in the general utility setting, with our most efficient algorithm achieving a runtime of O(mn log n), where n is the number of participants and m is the number of proposed questions. We apply our auditing methods to historical deliberations, comparing the representativeness of (a) the actual questions posed to the expert panel (chosen by a moderator), (b) participants' questions chosen via integer linear programming, (c) summary questions generated by large language models (LLMs). Our results highlight both the promise and current limitations of LLMs in supporting deliberative processes. By integrating our methods into an online deliberation platform that has been used for hundreds of deliberations across more than 50 countries, we make it easy for practitioners to audit and improve representation in future deliberations.
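For intuition about what a JR audit tests, here is the textbook approval-based special case: a slate of size k fails JR if some unselected question is approved by at least n/k participants, none of whom approve anything on the slate. The paper's algorithms cover the richer general-utility setting; this sketch is only the simple case.

```python
def violates_jr(approvals, slate, num_questions):
    """Audit justified representation (JR), approval special case.

    approvals: list of sets; approvals[i] = questions voter i approves.
    slate: set of selected questions (size k).
    JR is violated iff some unselected question c is approved by at
    least n/k voters who approve *nothing* on the slate."""
    n, k = len(approvals), len(slate)
    unrepresented = [A for A in approvals if not (A & slate)]
    for c in range(num_questions):
        if c in slate:
            continue
        supporters = sum(1 for A in unrepresented if c in A)
        if supporters >= n / k:
            return True, c          # cohesive neglected group found
    return False, None
```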

[AI-2] Integrating Temporal and Structural Context in Graph Transformers for Relational Deep Learning

Quick Read: This paper addresses two limitations of existing graph models for relational data: they focus on spatial structure while treating temporal information merely as a filtering constraint rather than a modeling signal, and they typically support only a single prediction task. The key contributions are a temporal subgraph sampler that retrieves nodes beyond the immediate neighborhood to capture temporally relevant relationships, and the Relational Graph Perceiver (RGP), a graph transformer whose cross-attention-based latent bottleneck integrates signals from different node and edge types into a common latent space, building global context over the entire relational system. RGP further includes a flexible cross-attention decoder that supports joint learning across tasks with disjoint label spaces in a single model.

Link: https://arxiv.org/abs/2511.04557
Authors: Divyansha Lachi, Mahmoud Mohammadi, Joe Meyer, Vinam Arora, Tom Palczewski, Eva L. Dyer
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In domains such as healthcare, finance, and e-commerce, the temporal dynamics of relational data emerge from complex interactions-such as those between patients and providers, or users and products across diverse categories. To be broadly useful, models operating on these data must integrate long-range spatial and temporal dependencies across diverse types of entities, while also supporting multiple predictive tasks. However, existing graph models for relational data primarily focus on spatial structure, treating temporal information merely as a filtering constraint to exclude future events rather than a modeling signal, and are typically designed for single-task prediction. To address these gaps, we introduce a temporal subgraph sampler that enhances global context by retrieving nodes beyond the immediate neighborhood to capture temporally relevant relationships. In addition, we propose the Relational Graph Perceiver (RGP), a graph transformer architecture for relational deep learning that leverages a cross-attention-based latent bottleneck to efficiently integrate information from both structural and temporal contexts. This latent bottleneck integrates signals from different node and edge types into a common latent space, enabling the model to build global context across the entire relational system. RGP also incorporates a flexible cross-attention decoder that supports joint learning across tasks with disjoint label spaces within a single model. Experiments on RelBench, SALT, and CTU show that RGP delivers state-of-the-art performance, offering a general and scalable solution for relational deep learning with support for diverse predictive tasks.
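A minimal sketch of the kind of cross-attention latent bottleneck described above, in the Perceiver style: a fixed set of learned latent vectors attends over all node and edge tokens, compressing the relational system into a shared latent space. Dimensions and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class LatentBottleneck(nn.Module):
    """Perceiver-style bottleneck: learned latents cross-attend to the
    full token set (nodes and edges of all types, pre-embedded)."""
    def __init__(self, n_latents=64, dim=128, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim),
                                nn.Linear(dim, dim), nn.GELU(),
                                nn.Linear(dim, dim))

    def forward(self, tokens):                # tokens: (B, T, dim)
        B = tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        z, _ = self.attn(q, tokens, tokens)   # latents attend to tokens
        return z + self.ff(z)                 # (B, n_latents, dim)
```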

[AI-3] Optimizing Sensor Placement in Urban Storm Sewers: A Data-Driven Sparse Sensing Approach

Quick Read: This paper addresses how to monitor urban drainage networks and predict flow conditions efficiently under constrained time, budget, and technology, where high spatio-temporal-resolution flood prediction and monitoring are desired but full instrumentation is impractical. The key is a data-driven sparse sensing (DSS) framework coupled with EPA-SWMM: SWMM simulations generate training data of peak flowrate profiles, singular value decomposition (SVD) reduces dimensionality, and QR factorization allocates sensors, identifying three optimal monitoring nodes out of 77 that reconstruct peak flowrates with Nash-Sutcliffe Efficiency (NSE) of 0.92-0.95 (25th to 75th percentiles). The framework balances computational efficiency and physical interpretability, is robust to measurement uncertainty (robustness to sensor failures is location-dependent and improves with more sensors), and can be integrated with predictive models for flood early warning and real-time control under limited sensing resources.

Link: https://arxiv.org/abs/2511.04556
Authors: Zihang Ding, Kun Zhang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 32 pages (including supplementary information), 11 figures (plus 7 in the supplementary). Submitted to Nature Water. Partially presented at the HydroML 2025 Symposium and the Minnesota Water Resources Conference 2025; to be presented at the AGU Fall Meeting 2025

Abstract:Urban surface water flooding, triggered by intense rainfall overwhelming drainage systems, is increasingly frequent and widespread. While flood prediction and monitoring in high spatial-temporal resolution are desired, practical constraints in time, budget, and technology hinder its full implementation. How to monitor urban drainage networks and predict flow conditions under constrained resource is a major challenge. This study presents a data-driven sparse sensing (DSS) framework, integrated with EPA-SWMM, to optimize sensor placement and reconstruct peak flowrates in a stormwater system, using the Woodland Avenue catchment in Duluth, Minnesota, as a case study. We utilized a SWMM model to generate a training dataset of peak flowrate profiles across the stormwater network. Furthermore, we applied DSS - leveraging singular value decomposition for dimensionality reduction and QR factorization for sensor allocation - to identify the optimal monitoring nodes based on the simulated training dataset. We then validated the representativeness of these identified monitoring nodes by comparing the DSS-reconstructed peak flowrate profiles with those obtained from SWMM. Three optimally placed sensors among 77 nodes achieved satisfactory reconstruction performance with Nash-Sutcliffe Efficiency (NSE) values of 0.92-0.95 (25th to 75th percentiles). In addition, the model showed good robustness to uncertainty in measurements. Its robustness to sensor failures is location-dependent and improves with the number of sensors deployed. The framework balances computational efficiency and physical interpretability, enabling high-accuracy flow reconstruction with minimal sensors. This DSS framework can be further integrated with predictive models to realize flood early warning and real-time control under limited sensing and monitoring resource.
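The SVD-plus-pivoted-QR recipe is a standard data-driven sparse sensing construction, so it can be sketched directly; this is a generic implementation, not the authors' code, and the variable names are assumptions.

```python
import numpy as np
from scipy.linalg import qr

def select_sensors(X, n_sensors):
    """X: (n_nodes, n_snapshots) matrix of simulated peak-flowrate
    profiles (one column per SWMM scenario). Returns sensor node
    indices and the low-rank basis used for reconstruction."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    Psi = U[:, :n_sensors]                 # low-rank (POD) basis
    # Column-pivoted QR of Psi^T ranks nodes by information content.
    _, _, piv = qr(Psi.T, pivoting=True)
    return piv[:n_sensors], Psi

def reconstruct(y, sensors, Psi):
    """Recover the full network profile from sparse readings
    y = x[sensors] by least squares in the basis coordinates."""
    a, *_ = np.linalg.lstsq(Psi[sensors, :], y, rcond=None)
    return Psi @ a
```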

[AI-4] LLM -as-a-Judge: Toward World Models for Slate Recommendation Systems

Quick Read: This paper investigates cross-domain user preference modeling, a core challenge in slate recommendation research, i.e., recommending an ordered sequence of items. The key idea is to use large language models (LLMs) as world models of user preferences via pairwise reasoning over slates. An empirical study with several LLMs on three tasks spanning different datasets reveals relationships between task performance and properties of the preference functions the LLMs capture, pointing to areas for improvement and to the potential of LLMs as world models in recommender systems.

Link: https://arxiv.org/abs/2511.04541
Authors: Baptiste Bonin, Maxime Heuillet, Audrey Durand
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modeling user preferences across domains remains a key challenge in slate recommendation (i.e. recommending an ordered sequence of items) research. We investigate how Large Language Models (LLM) can effectively act as world models of user preferences through pairwise reasoning over slates. We conduct an empirical study involving several LLMs on three tasks spanning different datasets. Our results reveal relationships between task performance and properties of the preference function captured by LLMs, hinting towards areas for improvement and highlighting the potential of LLMs as world models in recommender systems.

[AI-5] Alternative Fairness and Accuracy Optimization in Criminal Justice AAAI2026

Quick Read: This paper examines core challenges of algorithmic fairness in criminal justice, particularly the conflicts among group, individual, and process fairness and the practical dilemmas they create. The key contribution is a simple modification of standard group fairness: rather than demanding exact error-rate parity across protected groups, it minimizes a weighted error loss while keeping differences in false negative rates within a small tolerance, which makes solutions easier to find, can raise predictive accuracy, and surfaces the ethical choice of error costs. The proposal is situated against three classes of critique and paired with a practical deployment framework that links technical design to institutional legitimacy.

Link: https://arxiv.org/abs/2511.04505
Authors: Shaolong Wu, James Blume, Geshi Yeung
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted for presentation at the AAAI 2026 AI Governance Workshop (AIGOV). 24 pages

Abstract:Algorithmic fairness has grown rapidly as a research area, yet key concepts remain unsettled, especially in criminal justice. We review group, individual, and process fairness and map the conditions under which they conflict. We then develop a simple modification to standard group fairness. Rather than exact parity across protected groups, we minimize a weighted error loss while keeping differences in false negative rates within a small tolerance. This makes solutions easier to find, can raise predictive accuracy, and surfaces the ethical choice of error costs. We situate this proposal within three classes of critique: biased and incomplete data, latent affirmative action, and the explosion of subgroup constraints. Finally, we offer a practical framework for deployment in public decision systems built on three pillars: need-based decisions, Transparency and accountability, and narrowly tailored definitions and solutions. Together, these elements link technical design to legitimacy and provide actionable guidance for agencies that use risk assessment and related tools.
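The proposed relaxation is easy to operationalize as a constrained threshold search: minimize a weighted error loss over per-group decision thresholds subject to a small tolerance on the false-negative-rate gap. A brute-force numpy sketch, with illustrative cost weights (the explicit ethical choice the paper emphasizes):

```python
import numpy as np

def fit_group_thresholds(scores_a, y_a, scores_b, y_b,
                         w_fn=5.0, w_fp=1.0, tol=0.02):
    """Minimize a weighted error loss over per-group thresholds while
    keeping the false-negative-rate gap within `tol` (relaxed parity).
    Weights w_fn, w_fp are illustrative, not values from the paper."""
    def stats(scores, y, t):
        pred = scores >= t
        fnr = np.mean(~pred[y == 1])              # missed positives
        fpr = np.mean(pred[y == 0])               # false alarms
        loss = (w_fn * fnr * np.mean(y == 1)
                + w_fp * fpr * np.mean(y == 0))
        return fnr, loss

    grid = np.linspace(0.0, 1.0, 101)
    stats_a = [stats(scores_a, y_a, t) for t in grid]
    stats_b = [stats(scores_b, y_b, t) for t in grid]
    best = None
    for ta, (fnr_a, loss_a) in zip(grid, stats_a):
        for tb, (fnr_b, loss_b) in zip(grid, stats_b):
            if abs(fnr_a - fnr_b) <= tol and \
                    (best is None or loss_a + loss_b < best[0]):
                best = (loss_a + loss_b, ta, tb)
    return best   # (loss, threshold for group A, threshold for group B)
```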

[AI-6] Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training

Quick Read: This paper addresses the difficulty of maintaining both low-rank structure and the training objective in low-rank pre-training, where existing parameter-efficient low-rank methods succeed at fine-tuning but fail at pre-training. The key is Q3R (Quadratic Reweighted Rank Regularizer), a low-rank-inducing training strategy inspired by the iteratively reweighted least squares (IRLS) framework: a quadratic regularizer majorizes a smoothed log-determinant that serves as a rank surrogate. Q3R trains weight matrices to prescribed low target ranks with small computational overhead and full compatibility with existing architectures, at predictive performance comparable to dense models; for example, 60% and 80% of a ViT-Tiny model's parameters can be truncated with only ~1.3% and ~4% CIFAR-10 accuracy drops, respectively.

Link: https://arxiv.org/abs/2511.04485
Authors: Ipsita Ghosh, Ethan Nguyen, Christian Kümmerle
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:Parameter-efficient training, based on low-rank optimization, has become a highly successful tool for fine-tuning large deep-learning models. However, these methods fail at low-rank pre-training tasks where maintaining the low-rank structure and the objective remains a challenging task. We propose the Quadratic Reweighted Rank Regularizer dubbed Q3R, which leads to a novel low-rank inducing training strategy inspired by the iteratively reweighted least squares (IRLS) framework. Q3R is based on a quadratic regularizer term which majorizes a smoothed log determinant serving as rank surrogate objective. Unlike other low-rank training techniques, Q3R is able to train weight matrices with prescribed, low target ranks of models that achieve comparable predictive performance as dense models, with small computational overhead, while remaining fully compatible with existing architectures. For example, we demonstrated one experiment where we are able to truncate 60% and 80% of the parameters of a ViT-Tiny model with ~1.3% and ~4% accuracy drop in CIFAR-10 performance respectively. The efficacy of Q3R is confirmed on Transformers across both image and language tasks, including for low-rank fine-tuning.
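The IRLS idea behind a quadratic rank surrogate can be sketched generically: because log det is concave, log det(W Wᵀ + εI) is upper-bounded at the current iterate W_k by a quadratic term tr(P W Wᵀ) with a fixed reweighting matrix P = (W_k W_kᵀ + εI)⁻¹. The sketch below is this standard majorizer, not necessarily Q3R's exact formulation.

```python
import torch

def q3r_style_penalty(W, P):
    """Quadratic reweighted term tr(P W W^T). With P fixed (no
    gradient) at (W_k W_k^T + eps*I)^{-1}, this majorizes the smoothed
    rank surrogate log det(W W^T + eps*I) at the iterate W_k."""
    return torch.trace(P @ W @ W.T)

@torch.no_grad()
def reweight(W, eps=1e-3):
    """Recompute the fixed weighting matrix at the current iterate."""
    m = W.shape[0]
    eye = torch.eye(m, device=W.device, dtype=W.dtype)
    return torch.linalg.inv(W @ W.T + eps * eye)

# Training skeleton: refresh P = reweight(W) every few steps and
# minimize task_loss + gamma * q3r_style_penalty(W, P).
```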

[AI-7] Promoting Sustainable Web Agents : Benchmarking and Estimating Energy Consumption through Empirical and Theoretical Analysis AAAI2026

Quick Read: This paper highlights that sustainability remains largely unexplored in web-agent research: the energy consumption and CO2 emissions these agents induce are not systematically assessed. A combined theoretical (estimation-based) and empirical (benchmarking-based) exploration shows that different design philosophies for web agents lead to substantially different energy costs, and that consuming more energy does not necessarily yield better results. The key proposal is a shift in how web agents are evaluated: dedicated metrics for energy consumption in benchmarks, plus greater transparency about model parameters and processes, which the authors identify as a limiting factor when estimating energy use.

Link: https://arxiv.org/abs/2511.04481
Authors: Lars Krupp, Daniel Geißler, Vishal Banwari, Paul Lukowicz, Jakob Karolus
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 AISI

Abstract:Web agents, like OpenAI’s Operator and Google’s Project Mariner, are powerful agentic systems pushing the boundaries of Large Language Models (LLM). They can autonomously interact with the internet at the user’s behest, such as navigating websites, filling search masks, and comparing price lists. Though web agent research is thriving, induced sustainability issues remain largely unexplored. To highlight the urgency of this issue, we provide an initial exploration of the energy and CO_2 cost associated with web agents from both a theoretical -via estimation- and an empirical perspective -by benchmarking. Our results show how different philosophies in web agent creation can severely impact the associated expended energy, and that more energy consumed does not necessarily equate to better results. We highlight a lack of transparency regarding disclosing model parameters and processes used for some web agents as a limiting factor when estimating energy consumption. Our work contributes towards a change in thinking of how we evaluate web agents, advocating for dedicated metrics measuring energy consumption in benchmarks.

[AI-8] Generate Evaluate Iterate: Synthetic Data for Human-in-the-Loop Refinement of LLM Judges

Quick Read: This paper addresses a limitation of the LLM-as-a-judge paradigm: refining evaluation criteria is constrained by the scarcity of diverse, representative data. The key is a tool that integrates synthetic data generation into the LLM-as-a-judge workflow, letting users create tailored, challenging test cases with configurable domains, personas, lengths, and desired outcomes (including borderline cases), with AI-assisted inline editing of existing cases; revealing the prompts and explanations behind each generation improves transparency and interpretability. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases, and the synthetic data proved as effective as hand-crafted data for refining criteria and aligning with human preferences.

Link: https://arxiv.org/abs/2511.04478
Authors: Hyo Jin Do, Zahra Ashktorab, Jasmina Gajcin, Erik Miehling, Martín Santillán Cooper, Qian Pan, Elizabeth M. Daly, Werner Geyer
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 29 pages, 4 figures

Abstract:The LLM-as-a-judge paradigm enables flexible, user-defined evaluation, but its effectiveness is often limited by the scarcity of diverse, representative data for refining criteria. We present a tool that integrates synthetic data generation into the LLM-as-a-judge workflow, empowering users to create tailored and challenging test cases with configurable domains, personas, lengths, and desired outcomes, including borderline cases. The tool also supports AI-assisted inline editing of existing test cases. To enhance transparency and interpretability, it reveals the prompts and explanations behind each generation. In a user study (N=24), 83% of participants preferred the tool over manually creating or selecting test cases, as it allowed them to rapidly generate diverse synthetic data without additional workload. The generated synthetic data proved as effective as hand-crafted data for both refining evaluation criteria and aligning with human preferences. These findings highlight synthetic data as a promising alternative, particularly in contexts where efficiency and scalability are critical.

[AI-9] Fraud-Proof Revenue Division on Subscription Platforms ICML

Quick Read: This paper studies unfair revenue division caused by user fraud on subscription-based content platforms, where machine-learning fraud detection is locked in an arms race with bad actors. Instead of post-hoc detection, the key is to design revenue division rules that structurally remove the incentive to manipulate: the authors formalize three manipulation-resistance axioms, show that a mechanism widely used by streaming platforms not only fails to prevent fraud but makes detecting manipulation computationally intractable, and introduce ScaledUserProp, a new rule satisfying all three axioms. Experiments on real-world and synthetic streaming data support ScaledUserProp as a fairer alternative to existing rules.

Link: https://arxiv.org/abs/2511.04465
Authors: Abheek Ghosh, Tzeh Yuan Neoh, Nicholas Teh, Giannis Tyrovolas
Institutions: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
Comments: Appears in the 42nd International Conference on Machine Learning (ICML), 2025

Abstract:We study a model of subscription-based platforms where users pay a fixed fee for unlimited access to content, and creators receive a share of the revenue. Existing approaches to detecting fraud predominantly rely on machine learning methods, engaging in an ongoing arms race with bad actors. We explore revenue division mechanisms that inherently disincentivize manipulation. We formalize three types of manipulation-resistance axioms and examine which existing rules satisfy these. We show that a mechanism widely used by streaming platforms, not only fails to prevent fraud, but also makes detecting manipulation computationally intractable. We also introduce a novel rule, ScaledUserProp, that satisfies all three manipulation-resistance axioms. Finally, experiments with both real-world and synthetic streaming data support ScaledUserProp as a fairer alternative compared to existing rules.

[AI-10] Beyond Shortest Path: Agent ic Vehicular Routing with Semantic Context

Quick Read: This paper addresses the inability of traditional vehicle routing systems to interpret and integrate the complex semantic and dynamic contexts of human drivers (multi-step tasks, situational constraints, urgent needs) when optimizing multiple objectives. The key of PAVe (Personalized Agentic Vehicular Routing) is a hybrid agentic assistant that couples candidate routes from a classical multi-objective (time, CO2) Dijkstra algorithm with an LLM-based semantic reasoning layer: the agent evaluates the candidates against user-provided tasks, preferences, and avoidance rules using a pre-processed geospatial cache of urban Points of Interest (POIs), then adjusts the route accordingly. In realistic urban scenarios, PAVe translates complex user intent into appropriate route modifications with over 88% accuracy in initial route selections using a local model, offering an accurate, scalable, and adaptive approach to urban mobility optimization.

Link: https://arxiv.org/abs/2511.04464
Authors: Carnot Braun, Rafael O. Jarczewski, Gabriel U. Talasso, Leandro A. Villas, Allan M. de Souza
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional vehicle routing systems efficiently optimize singular metrics like time or distance, and handling multiple metrics requires additional optimization machinery. More importantly, they lack the capability to interpret and integrate the complex, semantic, and dynamic contexts of human drivers, such as multi-step tasks, situational constraints, or urgent needs. This paper introduces and evaluates PAVe (Personalized Agentic Vehicular Routing), a hybrid agentic assistant designed to augment classical pathfinding algorithms with contextual reasoning. Our approach employs a Large Language Model (LLM) agent that operates on a candidate set of routes generated by a multi-objective (time, CO2) Dijkstra algorithm. The agent evaluates these options against user-provided tasks, preferences, and avoidance rules by leveraging a pre-processed geospatial cache of urban Points of Interest (POIs). In a benchmark of realistic urban scenarios, PAVe successfully translated complex user intent into appropriate route modifications, achieving over 88% accuracy in its initial route selections with a local model. We conclude that combining classical routing algorithms with an LLM-based semantic reasoning layer is a robust and effective approach for creating personalized, adaptive, and scalable solutions for urban mobility optimization.

[AI-11] Deep Dictionary-Free Method for Identifying Linear Model of Nonlinear System with Input Delay

Quick Read: This paper addresses prediction, estimation, and control of nonlinear dynamical systems with input delays, which are hard to handle because of their inherent complexity and the effect of delays on behavior, causing traditional linear control techniques to fail. The key is an LSTM-enhanced Deep Koopman model: Long Short-Term Memory (LSTM) layers capture historical dependencies and efficiently encode time-delayed dynamics into a latent space, yielding linear representations of nonlinear delayed systems. Unlike extended Dynamic Mode Decomposition (eDMD), which relies on a predefined dictionary and thus on prior knowledge of the dynamics, the approach is dictionary-free; it substantially improves prediction accuracy when the true nonlinear dynamics are unknown and matches eDMD when they are known.

Link: https://arxiv.org/abs/2511.04451
Authors: Patrik Valábek, Marek Wadinger, Michal Kvasnica, Martin Klaučo
Institutions: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Nonlinear dynamical systems with input delays pose significant challenges for prediction, estimation, and control due to their inherent complexity and the impact of delays on system behavior. Traditional linear control techniques often fail in these contexts, necessitating innovative approaches. This paper introduces a novel approach to approximate the Koopman operator using an LSTM-enhanced Deep Koopman model, enabling linear representations of nonlinear systems with time delays. By incorporating Long Short-Term Memory (LSTM) layers, the proposed framework captures historical dependencies and efficiently encodes time-delayed system dynamics into a latent space. Unlike traditional extended Dynamic Mode Decomposition (eDMD) approaches that rely on predefined dictionaries, the LSTM-enhanced Deep Koopman model is dictionary-free, which mitigates the problems with the underlying dynamics being known and incorporated into the dictionary. Quantitative comparisons with extended eDMD on a simulated system demonstrate highly significant performance gains in prediction accuracy in cases where the true nonlinear dynamics are unknown and achieve comparable results to eDMD with known dynamics of a system.
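A minimal PyTorch sketch of the modeling pattern described above: an LSTM encodes a window of past states and inputs (covering the unknown input delay) into a latent state whose dynamics are linear. Sizes, the decoder, and the training losses are assumptions.

```python
import torch
import torch.nn as nn

class LSTMDeepKoopman(nn.Module):
    """Encode delayed history into a latent z with linear dynamics
    z' = A z + B u; losses typically combine reconstruction and
    multi-step linear-prediction terms."""
    def __init__(self, n_x, n_u, n_z=16, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_x + n_u, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, n_z)
        self.A = nn.Linear(n_z, n_z, bias=False)   # Koopman matrix
        self.B = nn.Linear(n_u, n_z, bias=False)   # input matrix
        self.decoder = nn.Linear(n_z, n_x)

    def forward(self, hist_xu, u_next):
        # hist_xu: (batch, window, n_x + n_u) past states and inputs,
        # long enough to cover the input delay; u_next: (batch, n_u).
        _, (h, _) = self.encoder(hist_xu)
        z = self.to_latent(h[-1])                  # (batch, n_z)
        z_next = self.A(z) + self.B(u_next)        # linear latent step
        return self.decoder(z_next), z_next
```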

[AI-12] he Peril of Preference: Why GRPO fails on Ordinal Rewards

Quick Read: This paper identifies a fundamental flaw of Group-relative Policy Optimization (GRPO) when trained with ordinal rewards: its group-average baseline can assign positive advantage to failed trajectories, thereby reinforcing incorrect behavior. The problem is especially acute when non-binary feedback is introduced to give partial credit. The key of the proposed Correctness Relative Policy Optimization (CoRPO) is an adaptive baseline: it first enforces a minimum quality threshold so that failed solutions are never positively reinforced; once the policy consistently meets the threshold, the baseline automatically switches to a relative-preference mode, pushing the model from merely "acceptable" toward optimal solutions. This enables a smooth progression from binary to ordinal rewards and lays the groundwork for denser per-step supervision.

Link: https://arxiv.org/abs/2511.04439
Authors: Anisha Garg, Ganesh Venkatesh
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Group-relative Policy Optimization’s (GRPO) simplicity makes it highly desirable for adapting LLMs to become experts at specific tasks. But this simplicity also makes it ill-specified as we seek to enhance RL training with richer, non-binary feedback. When using ordinal rewards to give partial credit, GRPO’s simplicity starts to hurt, as its group-average baseline often assigns a positive advantage to failed trajectories and reinforces incorrect behavior. We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced. Once the policy consistently meets this threshold, the baseline automatically transitions to a relative preference mode, pushing the model to find optimal solutions rather than just “acceptable” ones. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization. This work represents a critical step in our broader research program to enable LLMs to learn genuinely new capabilities through reinforcement learning. We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback - progressing from binary to ordinal rewards in this work, and onward to denser, per-step supervision.
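The baseline difference is easy to see in a few lines. Under plain GRPO, a group with rewards [0.2, 0.4] and a pass threshold of 0.5 gives the 0.4 failure a positive advantage (+0.1); flooring the baseline at the threshold makes both advantages negative. The switching rule below (recent group means all above the threshold) is an assumption about how "consistently meets this threshold" might be implemented.

```python
import numpy as np

def corpo_advantages(rewards, min_quality, group_mean_history):
    """rewards: ordinal rewards for one group of sampled trajectories.
    min_quality: reward threshold separating failures from successes."""
    r = np.asarray(rewards, dtype=float)
    grpo_baseline = r.mean()
    stable = (len(group_mean_history) >= 3 and
              all(m >= min_quality for m in group_mean_history[-3:]))
    if stable:
        baseline = grpo_baseline            # relative-preference mode
    else:
        # Floor the baseline so a failed trajectory can never receive
        # a positive advantage, unlike plain GRPO.
        baseline = max(grpo_baseline, min_quality)
    return r - baseline

print(corpo_advantages([0.2, 0.4], 0.5, []))   # [-0.3, -0.1]
# Plain GRPO (baseline 0.3) would have given [-0.1, +0.1].
```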

[AI-13] Deep Koopman Economic Model Predictive Control of a Pasteurisation Unit

Quick Read: This paper targets resource-efficient operation of thermal-intensive processes, using a laboratory-scale pasteurisation unit (PU) as the case study: the challenge is to model complex nonlinear dynamics accurately while enabling economic model predictive control (EMPC) that minimizes energy consumption, material losses, and actuator wear. The key is deep Koopman-based modeling: neural networks learn a linear state-space representation from experimental data (45% better open-loop prediction accuracy than N4SID subspace identification), which is embedded in an EMPC formulation with interpretable economic costs and slack variables to ensure feasibility. Under external disturbances (a feed-pump fail-to-close scenario and the introduction of a cold batch), the deep Koopman EMPC cuts total economic cost by 32% versus the N4SID baseline, mainly through reduced material losses and energy use, and steady-state operation requires 10.2% less electrical energy.

Link: https://arxiv.org/abs/2511.04437
Authors: Patrik Valábek, Michaela Horváthová, Martin Klaučo
Institutions: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents a deep Koopman-based Economic Model Predictive Control (EMPC) for efficient operation of a laboratory-scale pasteurization unit (PU). The method uses Koopman operator theory to transform the complex, nonlinear system dynamics into a linear representation, enabling the application of convex optimization while representing the complex PU accurately. The deep Koopman model utilizes neural networks to learn the linear dynamics from experimental data, achieving a 45% improvement in open-loop prediction accuracy over conventional N4SID subspace identification. Both analyzed models were employed in the EMPC formulation that includes interpretable economic costs, such as energy consumption, material losses due to inadequate pasteurization, and actuator wear. The feasibility of EMPC is ensured using slack variables. The deep Koopman EMPC and N4SID EMPC are numerically validated on a nonlinear model of a multivariable PU under external disturbances. The disturbances include a feed pump fail-to-close scenario and the introduction of a cold batch to be pasteurized. These results demonstrate that the deep Koopman EMPC achieves a 32% reduction in total economic cost compared to the N4SID baseline. This improvement is mainly due to the reductions in material losses and energy consumption. Furthermore, steady-state operation via Koopman-based EMPC requires 10.2% less electrical energy. The results highlight the practical advantages of integrating deep Koopman representations with economic optimization to achieve resource-efficient control of thermal-intensive plants.

[AI-14] Speed at the Cost of Quality? The Impact of LLM Agent Assistance on Software Development

Quick Read: This paper addresses the lack of empirical evidence on what LLM agent assistants actually do to software development velocity and code quality. Using a state-of-the-art difference-in-differences design, it compares GitHub projects that adopted the popular LLM agent tool Cursor with a matched control group of similar projects that do not use Cursor, isolating causal effects. Adoption leads to a significant, large, but transient increase in project-level development velocity, together with a significant and persistent increase in static analysis warnings and code complexity; panel generalized method of moments estimation further shows that the increased warnings and complexity are a major cause of the long-term velocity slowdown.

Link: https://arxiv.org/abs/2511.04427
Authors: Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, Bogdan Vasilescu
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have demonstrated the promise to revolutionize the field of software engineering. Among other things, LLM agents are rapidly gaining momentum in their application to software development, with practitioners claiming a multifold productivity increase after adoption. Yet, empirical evidence is lacking around these claims. In this paper, we estimate the causal effect of adopting a widely popular LLM agent assistant, namely Cursor, on development velocity and software quality. The estimation is enabled by a state-of-the-art difference-in-differences design comparing Cursor-adopting GitHub projects with a matched control group of similar GitHub projects that do not use Cursor. We find that the adoption of Cursor leads to a significant, large, but transient increase in project-level development velocity, along with a significant and persistent increase in static analysis warnings and code complexity. Further panel generalized method of moments estimation reveals that the increase in static analysis warnings and code complexity acts as a major factor causing long-term velocity slowdown. Our study carries implications for software engineering practitioners, LLM agent assistant designers, and researchers.

[AI-15] Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness

Quick Read: This paper addresses the fragility of deep learning models under subpopulation shift: poor generalization to underrepresented groups, rooted in reliance on spurious correlations. The key is Spurious Correlation-Aware Embedding Regularization (SCER), which regularizes feature representations directly in embedding space to suppress spurious cues. A theoretical analysis links worst-group error to how strongly the classifier relies on spurious versus core directions, identified from differences in group-wise mean embeddings across domains and classes; imposing constraints at the embedding level steers the model toward robust core features, improving worst-group accuracy beyond prior state of the art on vision and language benchmarks.

Link: https://arxiv.org/abs/2511.04401
Authors: Subeen Park, Joowang Kim, Hakyung Lee, Sunjae Yoo, Kyungwoo Song
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning models achieve strong performance across various domains but often rely on spurious correlations, making them vulnerable to distribution shifts. This issue is particularly severe in subpopulation shift scenarios, where models struggle in underrepresented groups. While existing methods have made progress in mitigating this issue, their performance gains are still constrained. They lack a rigorous theoretical framework connecting the embedding space representations with worst-group error. To address this limitation, we propose Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness (SCER), a novel approach that directly regularizes feature representations to suppress spurious cues. We show theoretically that worst-group error is influenced by how strongly the classifier relies on spurious versus core directions, identified from differences in group-wise mean embeddings across domains and classes. By imposing theoretical constraints at the embedding level, SCER encourages models to focus on core features while reducing sensitivity to spurious patterns. Through systematic evaluation on multiple vision and language benchmarks, we show that SCER outperforms prior state-of-the-art studies in worst-group accuracy. Our code is available at this https URL.
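One plausible reading of the direction estimates (binary class and binary group, with directions taken from group-wise mean embeddings) can be sketched as follows; the exact estimator and penalty used by SCER are assumptions here.

```python
import torch
import torch.nn.functional as F

def direction_estimates(emb, y, g):
    """emb: (N, d) embeddings; y, g: (N,) binary class / group labels.
    Core = class difference averaged over groups; spurious = group
    difference averaged over classes."""
    mu = {(c, d): emb[(y == c) & (g == d)].mean(0)
          for c in (0, 1) for d in (0, 1)}
    core = 0.5 * ((mu[1, 0] - mu[0, 0]) + (mu[1, 1] - mu[0, 1]))
    spur = 0.5 * ((mu[0, 1] - mu[0, 0]) + (mu[1, 1] - mu[1, 0]))
    return core, spur

def alignment_penalty(w, core, spur):
    """Penalize alignment of the classifier weight w with the spurious
    direction, rewarding alignment with the core direction."""
    return (F.cosine_similarity(w, spur, dim=0).abs()
            - F.cosine_similarity(w, core, dim=0).abs())
```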

[AI-16] Post-Training LLM s as Better Decision-Making Agents : A Regret-Minimization Approach

Quick Read: This paper tackles LLMs' weakness as decision-making agents in interactive, dynamic environments, where they often fail to achieve low regret or an effective exploration-exploitation tradeoff. Prior fixes either distill action sequences from known decision-making algorithms or rely on hand-crafted chain-of-thought templates, limiting flexibility. The key of Iterative Regret-Minimization Fine-Tuning (Iterative RMFT) is a post-training loop: at each iteration the model rolls out multiple decision trajectories, the k lowest-regret ones are selected, and the model is fine-tuned on them, so the regret metric elicits the model's own decision-making ability and natural-language rationales without external algorithms or rigid templates. The method improves decision-making across model types and generalizes across horizons, action spaces, reward processes, and language contexts; a theoretical result shows a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting.

Link: https://arxiv.org/abs/2511.04393
Authors: Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed as “agents” for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach leverages the regret metric to elicit the model’s own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs’ DM performance across diverse models - from Transformers with numerical input/output, to open-weight LLMs, and advanced closed-weight models like GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs’ decision-making capabilities.

[AI-17] MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers

Quick Read: This paper addresses three practical limitations of existing music editing models: they can only edit synthetic music generated by their own models, they demand highly precise prompts, and they require task-specific retraining, so they lack true zero-shot capability. The key is MusRec, the first zero-shot text-to-music editing model built on rectified flow and diffusion transformers, which performs diverse editing tasks on real-world music efficiently and effectively, outperforming existing methods in preserving musical content, structural consistency, and editing fidelity, and laying a foundation for controllable music editing in real-world scenarios.

Link: https://arxiv.org/abs/2511.04376
Authors: Ali Boudaghi, Hadi Zare
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, the first zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.

[AI-18] Monitor-Generate-Verify (MGV):Formalising Metacognitive Theory for Language Model Reasoning NEURIPS2025

Quick Read: This paper targets the "prefix dominance trap" in test-time reasoning architectures such as the Generate-Verify paradigm: lacking a monitoring mechanism, models commit early to suboptimal reasoning paths and seldom recover, costing roughly 20% accuracy. The key of the proposed Monitor-Generate-Verify (MGV) framework is to formalize the metacognitive theories of Flavell and of Nelson and Narens into computational specifications, adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refining future monitoring through verification feedback, thereby regulating when and how reasoning starts.

Link: https://arxiv.org/abs/2511.04341
Authors: Nick Oh, Fernand Gobet
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: To be presented at the Workshop on the Foundations of Reasoning in Language Models at NeurIPS 2025 (non-archival)

Abstract:Test-time reasoning architectures such as those following the Generate-Verify paradigm – where a model iteratively refines or verifies its own generated outputs – prioritise generation and verification but exclude the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss. We address this architectural gap by formalising Flavell’s and Nelson and Narens’ metacognitive theories into computational specifications, proposing the Monitor-Generate-Verify (MGV) framework. MGV extends the Generate-Verify paradigm by adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Though we present no empirical validation, this work provides the first systematic computational translation of foundational metacognitive theories, offering a principled vocabulary for understanding reasoning system failures and suggesting specific architectural interventions for future test-time reasoning designs.

[AI-19] LUME-DBN: Full Bayesian Learning of DBNs from Incomplete data in Intensive Care ECAI2025

Quick Read: This paper addresses a limitation of dynamic Bayesian networks (DBNs) on longitudinal clinical data: existing missing-data handling is borrowed from static Bayesian networks and ignores the temporal nature of the data, limiting uncertainty quantification over time, which is critical in settings such as intensive care. The key is a Gibbs-sampling-based fully Bayesian framework for learning DBNs from incomplete data: each missing value is treated as an unknown parameter with a Gaussian distribution and is sampled from its full conditional distribution at every iteration, yielding principled imputation with uncertainty estimates. On simulated data and real intensive-care data, the approach shows better reconstruction accuracy and convergence than model-agnostic techniques such as MICE, supporting safer, better-informed clinical decision-making.

Link: https://arxiv.org/abs/2511.04333
Authors: Federico Pirola, Fabio Stella, Marco Grzegorczyk
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 8 figures, 3 tables, presented at the HC@AIxIA + HYDRA 2025 Workshop, co-located with ECAI 2025

Abstract:Dynamic Bayesian networks (DBNs) are increasingly used in healthcare due to their ability to model complex temporal relationships in patient data while maintaining interpretability, an essential feature for clinical decision-making. However, existing approaches to handling missing data in longitudinal clinical datasets are largely derived from static Bayesian networks literature, failing to properly account for the temporal nature of the data. This gap limits the ability to quantify uncertainty over time, which is particularly critical in settings such as intensive care, where understanding the temporal dynamics is fundamental for model trustworthiness and applicability across diverse patient groups. Despite the potential of DBNs, a full Bayesian framework that integrates missing data handling remains underdeveloped. In this work, we propose a novel Gibbs sampling-based method for learning DBNs from incomplete data. Our method treats each missing value as an unknown parameter following a Gaussian distribution. At each iteration, the unobserved values are sampled from their full conditional distributions, allowing for principled imputation and uncertainty estimation. We evaluate our method on both simulated datasets and real-world intensive care data from critically ill patients. Compared to standard model-agnostic techniques such as MICE, our Bayesian approach demonstrates superior reconstruction accuracy and convergence properties. These results highlight the clinical relevance of incorporating full Bayesian inference in temporal models, providing more reliable imputations and offering deeper insight into model behavior. Our approach supports safer and more informed clinical decision-making, particularly in settings where missing data are frequent and potentially impactful.
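The "sample each missing value from its full conditional" step is easiest to see in the simplest linear-Gaussian chain, x_t = a·x_{t-1} + noise, where the interior full conditional is Gaussian with precision (1 + a²)/σ². A scalar sketch (the paper's DBNs are multivariate; the boundary cases here ignore any prior on the initial state and assume a ≠ 0):

```python
import numpy as np

def gibbs_sweep_missing(x, missing, a, sigma2, rng):
    """One Gibbs sweep over missing entries of a scalar chain
    x_t = a * x_{t-1} + N(0, sigma2)."""
    for t in np.flatnonzero(missing):
        if 0 < t < len(x) - 1:
            # Combine the x_t | x_{t-1} and x_{t+1} | x_t factors.
            prec = (1 + a**2) / sigma2
            mean = a * (x[t - 1] + x[t + 1]) / (1 + a**2)
        elif t == 0:
            prec, mean = a**2 / sigma2, x[1] / a   # x_1 | x_0 factor only
        else:
            prec, mean = 1 / sigma2, a * x[t - 1]  # x_t | x_{t-1} only
        x[t] = rng.normal(mean, np.sqrt(1.0 / prec))
    return x

# rng = np.random.default_rng(0); repeated sweeps yield posterior
# samples of the missing values, i.e. imputation with uncertainty.
```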

[AI-20] Differentially Private In-Context Learning with Nearest Neighbor Search NEURIPS

Quick Read: This paper addresses a privacy gap that existing differentially private in-context learning (DP-ICL) approaches overlook: the similarity search used in modern LLM pipelines to retrieve relevant context data. The key is a DP framework that performs nearest neighbor retrieval from a database of context examples in a privacy-aware manner, combined with a privacy filter that tracks the cumulative privacy cost of selected samples to stay within a central differential privacy budget. On text classification and document question answering, the method outperforms existing baselines by a substantial margin, achieving more favorable privacy-utility trade-offs.

Link: https://arxiv.org/abs/2511.04332
Authors: Antti Koskela, Tejas Kulkarni, Laith Zumot
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: NeurIPS Lock-LLM Workshop 2025

Abstract:Differentially private in-context learning (DP-ICL) has recently become an active research topic due to the inherent privacy risks of in-context learning. However, existing approaches overlook a critical component of modern large language model (LLM) pipelines: the similarity search used to retrieve relevant context data. In this work, we introduce a DP framework for in-context learning that integrates nearest neighbor search of relevant examples in a privacy-aware manner. Our method outperforms existing baselines by a substantial margin across all evaluated benchmarks, achieving more favorable privacy-utility trade-offs. To achieve this, we employ nearest neighbor retrieval from a database of context data, combined with a privacy filter that tracks the cumulative privacy cost of selected samples to ensure adherence to a central differential privacy budget. Experimental results on text classification and document question answering show a clear advantage of the proposed method over existing baselines.
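The privacy-filter bookkeeping can be sketched independently of the retrieval model: each context example carries a cumulative privacy spend, and examples whose next use would exceed the central budget drop out of the nearest neighbor pool. The `nearest` helper and the simple additive composition are assumptions; the paper's accounting is more refined.

```python
def retrieve_with_privacy_filter(query, database, spent, k,
                                 eps_query, eps_budget, nearest):
    """database: list of context examples; spent[i]: cumulative privacy
    cost already charged to example i; nearest(query, pool, k): assumed
    helper returning positions of the k most similar pool items."""
    eligible = [i for i, s in enumerate(spent)
                if s + eps_query <= eps_budget]         # filter step
    pool = [database[i] for i in eligible]
    picked = [eligible[j] for j in nearest(query, pool, k)]
    for i in picked:
        spent[i] += eps_query       # charge the per-example accountant
    return picked
```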

[AI-21] RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation

Quick Read: This paper addresses the under-studied medication safety of LLM-powered medical systems, an area constrained by the scarcity of real-world datasets (privacy and accessibility issues) and by the lack of systematic evaluation in realistic clinical consultation settings. The key is a framework that simulates and evaluates clinical consultations: inquiry-diagnosis dialogues with embedded medication risks are generated on top of a dedicated medication safety database, RxRisk DB (6,725 contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs), and a two-stage filtering strategy ensures clinical realism and professional quality, yielding the RxSafeBench benchmark of 2,443 high-quality consultation scenarios. Evaluations of leading open-source and proprietary LLMs show they struggle to integrate contraindication and interaction knowledge, especially when risks are implied rather than explicit, underscoring the need for better prompting and task-specific tuning.

Link: https://arxiv.org/abs/2511.04328
Authors: Jiahao Zhao, Luxin Xu, Minghuan Tan, Lichao Zhang, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: To appear in BIBM 2025

Abstract:Numerous medical systems powered by Large Language Models (LLMs) have achieved remarkable progress in diverse healthcare tasks. However, research on their medication safety remains limited due to the lack of real world datasets, constrained by privacy and accessibility issues. Moreover, evaluation of LLMs in realistic clinical consultation settings, particularly regarding medication safety, is still underexplored. To address these gaps, we propose a framework that simulates and evaluates clinical consultations to systematically assess the medication safety capabilities of LLMs. Within this framework, we generate inquiry diagnosis dialogues with embedded medication risks and construct a dedicated medication safety database, RxRisk DB, containing 6,725 contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs. A two-stage filtering strategy ensures clinical realism and professional quality, resulting in the benchmark RxSafeBench with 2,443 high-quality consultation scenarios. We evaluate leading open-source and proprietary LLMs using structured multiple choice questions that test their ability to recommend safe medications under simulated patient contexts. Results show that current LLMs struggle to integrate contraindication and interaction knowledge, especially when risks are implied rather than explicit. Our findings highlight key challenges in ensuring medication safety in LLM-based systems and provide insights into improving reliability through better prompting and task-specific tuning. RxSafeBench offers the first comprehensive benchmark for evaluating medication safety in LLMs, advancing safer and more trustworthy AI-driven clinical decision support.

[AI-22] AIM: Software and Hardware Co-design for Architecture-level IR-drop Mitigation in High-performance PIM ISCA2025

Quick Read: This paper tackles the severe IR-drop problem in high-performance SRAM processing-in-memory (PIM) architectures, where higher frequencies and more complex circuits degrade chip performance and threaten reliability, while conventional circuit-level mitigation (back-end optimization) is resource-intensive and compromises power, performance, and area (PPA). The key of AIM, a software-hardware co-design framework for architecture-level IR-drop mitigation, is fourfold: exploiting PIM's bit-serial, in-situ dataflow to link workloads directly to IR-drop (Rtog and HR); architecture-level mitigation that preserves computational accuracy via LHR and WDS; IR-Booster, a dynamic mechanism that combines software-level HR information with hardware IR-drop monitoring to adapt the PIM macro's voltage-frequency (V-f) pairs; and an HR-aware task mapping method that bridges the software and hardware designs. Post-layout simulation on a 7nm 256-TOPS PIM chip shows up to 69.2% IR-drop mitigation, a 2.29x energy-efficiency improvement, and a 1.152x speedup.

Link: https://arxiv.org/abs/2511.04321
Authors: Yuanpeng Zhang, Xing Hu, Xi Chen, Zhihang Yuan, Cong Li, Jingchen Zhu, Zhao Wang, Chenguang Zhang, Xin Si, Wei Gao, Qiang Wu, Runsheng Wang, Guangyu Sun
Institutions: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 22 figures, accepted by ISCA 2025

Abstract:SRAM Processing-in-Memory (PIM) has emerged as the most promising implementation for high-performance PIM, delivering superior computing density, energy efficiency, and computational precision. However, the pursuit of higher performance necessitates more complex circuit designs and increased operating frequencies, which exacerbate IR-drop issues. Severe IR-drop can significantly degrade chip performance and even threaten reliability. Conventional circuit-level IR-drop mitigation methods, such as back-end optimizations, are resource-intensive and often compromise power, performance, and area (PPA). To address these challenges, we propose AIM, comprehensive software and hardware co-design for architecture-level IR-drop mitigation in high-performance PIM. Initially, leveraging the bit-serial and in-situ dataflow processing properties of PIM, we introduce Rtog and HR, which establish a direct correlation between PIM workloads and IR-drop. Building on this foundation, we propose LHR and WDS, enabling extensive exploration of architecture-level IR-drop mitigation while maintaining computational accuracy through software optimization. Subsequently, we develop IR-Booster, a dynamic adjustment mechanism that integrates software-level HR information with hardware-based IR-drop monitoring to adapt the V-f pairs of the PIM macro, achieving enhanced energy efficiency and performance. Finally, we propose the HR-aware task mapping method, bridging software and hardware designs to achieve optimal improvement. Post-layout simulation results on a 7nm 256-TOPS PIM chip demonstrate that AIM achieves up to 69.2% IR-drop mitigation, resulting in 2.29x energy efficiency improvement and 1.152x speedup.

[AI-23] AdversariaLLM : A Unified and Modular Toolbox for LLM Robustness Research

Quick Read: This paper addresses the fragmented, often buggy ecosystem of implementations, datasets, and evaluation methods in large language model (LLM) safety and robustness research, which undermines reproducibility and comparability across studies. The key is the AdversariaLLM toolbox, designed around reproducibility, correctness, and extensibility: it implements twelve adversarial attack algorithms, integrates seven benchmark datasets covering harmfulness, over-refusal, and utility, and provides access to a wide range of open-weight LLMs via Hugging Face, with advanced features such as compute-resource tracking, deterministic results, and distributional evaluation. Judging is handled through the companion package JudgeZoo, which can also be used independently, together establishing a foundation for transparent, comparable, and reproducible LLM safety research.

Link: https://arxiv.org/abs/2511.04316
Authors: Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn, Stephan Günnemann
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and oftentimes buggy ecosystem of implementations, datasets, and evaluation methods. This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face. The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional evaluation techniques. AdversariaLLM also integrates judging through the companion package JudgeZoo, which can also be used independently. Together, these components aim to establish a robust foundation for transparent, comparable, and reproducible research in LLM safety.

[AI-24] Probing the Probes: Methods and Metrics for Concept Alignment

【速读】:该论文旨在解决当前可解释人工智能(Explainable AI)中概念激活向量(Concept Activation Vectors, CAVs)评估依赖分类准确率所带来的误导性问题。研究表明,高探针(probe)准确率并不能可靠反映CAV与其目标概念之间的对齐程度(concept alignment),因为探针更可能捕捉到伪相关(spurious correlations)而非真实语义概念。为应对这一挑战,论文提出一种基于空间线性归因(spatial linear attribution)的概念定位方法,并引入三类量化指标——硬准确率、分割得分和增强鲁棒性(augmentation robustness),以更准确地评估概念对齐情况。关键创新在于强调应以对齐度为基础进行评价,而非仅依赖探针准确率,并指出探针设计需结合模型架构与目标概念特性,从而提升CAV的语义可靠性。

链接: https://arxiv.org/abs/2511.04312
作者: Jacob Lysnæs-Larsen,Marte Eggen,Inga Strümke
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 17 figures

点击查看摘要

Abstract:In explainable AI, Concept Activation Vectors (CAVs) are typically obtained by training linear classifier probes to detect human-understandable concepts as directions in the activation space of deep neural networks. It is widely assumed that a high probe accuracy indicates a CAV faithfully representing its target concept. However, we show that the probe’s classification accuracy alone is an unreliable measure of concept alignment, i.e., the degree to which a CAV captures the intended concept. In fact, we argue that probes are more likely to capture spurious correlations than they are to represent only the intended concept. As part of our analysis, we demonstrate that deliberately misaligned probes constructed to exploit spurious correlations, achieve an accuracy close to that of standard probes. To address this severe problem, we introduce a novel concept localization method based on spatial linear attribution, and provide a comprehensive comparison of it to existing feature visualization techniques for detecting and mitigating concept misalignment. We further propose three classes of metrics for quantitatively assessing concept alignment: hard accuracy, segmentation scores, and augmentation robustness. Our analysis shows that probes with translation invariance and spatial alignment consistently increase concept alignment. These findings highlight the need for alignment-based evaluation metrics rather than probe accuracy, and the importance of tailoring probes to both the model architecture and the nature of the target concept.
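下面给出一个最小化的 Python 示意,说明“探针准确率高不等于概念对齐好”:用线性探针从激活中得到 CAV,再检查其与真实概念方向及伪相关方向的对齐程度。这里的合成激活、维度与“余弦相似度”对齐度量均为演示假设,论文中的硬准确率、分割得分等指标定义以原文为准:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64

# 合成激活:概念方向与一个正交的伪相关方向在正样本中同时偏移(演示假设)
concept = rng.normal(size=d); concept /= np.linalg.norm(concept)
spurious = rng.normal(size=d); spurious -= (spurious @ concept) * concept
spurious /= np.linalg.norm(spurious)

pos = rng.normal(size=(300, d)) + 2.0 * concept + 2.0 * spurious
neg = rng.normal(size=(300, d))
X = np.vstack([pos, neg]); y = np.array([1] * 300 + [0] * 300)

# 线性探针:其权重向量即 CAV
probe = LogisticRegression(max_iter=2000).fit(X, y)
cav = probe.coef_.ravel(); cav /= np.linalg.norm(cav)

print("probe accuracy:", probe.score(X, y))
# 探针准确率很高,但 CAV 同时混入了伪相关方向:
print("alignment with concept :", round(float(cav @ concept), 3))
print("alignment with spurious:", round(float(cav @ spurious), 3))
```

运行后可以看到探针分类准确率接近 1,而 CAV 与概念方向和伪相关方向的余弦相似度相当,正对应论文所指出的“伪相关被捕获”问题。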
zh

[AI-25] GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

【速读】:该论文旨在解决计算机使用代理(Computer-Using Agents, CUAs)研究中存在的三大持续性瓶颈:真实世界CUA任务稀缺、多模态轨迹的自动化采集与标注流程缺失,以及缺乏统一的基准来联合评估GUI定位(GUI grounding)、屏幕解析(screen parsing)和动作预测(action prediction)。其解决方案的关键在于构建GUI-360°数据集及配套基准套件,采用大语言模型(LLM)增强的自动化流水线实现查询来源、环境模板构建、任务实例化、批量执行和LLM驱动的质量过滤,从而生成包含超过120万次执行动作步骤、涵盖多种Windows办公应用的高质量多模态轨迹数据,支持GUI定位、屏幕解析、动作预测及混合GUI+API动作空间的评测。

链接: https://arxiv.org/abs/2511.04307
作者: Jian Mu,Chaoyun Zhang,Chiming Ni,Lu Wang,Bo Qiao,Kartik Mathur,Qianhui Wu,Yuhang Xie,Xiaojun Ma,Mengyu Zhou,Si Qin,Liqun Li,Yu Kang,Minghua Ma,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce GUI-360°, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges, and progress is constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360° addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360° reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360° and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset has been made public on this https URL.
zh

[AI-26] Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference

【速读】:该论文旨在解决人类偏好数据收集成本高、效率低的问题,尤其是在生成式 AI(Generative AI)模型对齐过程中,如何在保证性能的同时提升样本效率。其解决方案的关键在于提出一种混合框架,将强化学习与人类反馈(Reinforcement Learning from Human Feedback, RLHF)的可扩展性与偏好基优化(Preference-Based Optimization, PBO)的查询效率相结合,通过在RLHF流程中引入一个基于采集策略(acquisition-driven)的模块,实现主动且高效的偏好数据收集,从而在高维任务和大语言模型(Large Language Model, LLM)微调场景中均显著提升样本利用率和整体性能。

链接: https://arxiv.org/abs/2511.04286
作者: Matteo Cercola,Valeria Capretti,Simone Formentin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet, collecting such preference data is often costly and time-consuming, motivating the need for more efficient learning paradigms. Two established approaches offer complementary advantages: RLHF scales effectively to high-dimensional tasks such as LLM fine-tuning, while PBO achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF’s scalability with PBO’s query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
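以下是一个可运行的小型草图,示意“采集驱动”的偏好查询如何工作:用 Bradley-Terry 模型预测两两胜率,并选择预测最不确定(胜率最接近 0.5)的一对交给人类标注。其中的奖励模型与候选特征均为虚构占位,论文中的采集函数设计以原文为准:

```python
import numpy as np

rng = np.random.default_rng(0)

# 占位奖励模型:这里用一个随机线性模型给每个候选回复打分
w = rng.normal(size=8)
def reward_model(x):
    return float(w @ x)

candidates = [rng.normal(size=8) for _ in range(50)]

def bt_prob(xi, xj):
    """Bradley-Terry:候选 i 胜过 j 的预测概率"""
    ri, rj = reward_model(xi), reward_model(xj)
    return 1.0 / (1.0 + np.exp(rj - ri))

# 采集策略:选取预测最不确定的一对去询问人类偏好
best, best_u = None, -1.0
for i in range(len(candidates)):
    for j in range(i + 1, len(candidates)):
        p = bt_prob(candidates[i], candidates[j])
        u = 1.0 - abs(p - 0.5) * 2.0   # p=0.5 时不确定度最大
        if u > best_u:
            best, best_u = (i, j), u

print("query pair:", best, "uncertainty:", round(best_u, 3))
```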
zh

[AI-27] RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

【速读】:该论文旨在解决强化学习在可验证奖励(Reinforcement Learning for Verifiable Rewards, RLVR)训练大模型时出现的过拟合问题,即模型在训练过程中虽然获得较高奖励但泛化能力下降。其根本原因在于策略过度专业化(policy over-specialization)和训练中多样解的灾难性遗忘(catastrophic forgetting)。为应对这一挑战,论文提出RLoop框架,其核心创新在于通过迭代策略初始化构建一个探索与利用的良性循环:首先利用强化学习从当前策略出发探索解空间,随后筛选成功轨迹生成专家数据集,并通过拒绝采样微调(Rejection-sampling Fine-Tuning, RFT)优化初始策略,从而在每轮迭代中将瞬态策略变化转化为稳健性能提升。该机制有效保留并转化了跨步间策略多样性,显著缓解遗忘并提升泛化能力,实验表明相较标准强化学习平均准确率提升9%,pass@32提升超过15%。

链接: https://arxiv.org/abs/2511.04285
作者: Zeng Zhiyuan,Jiashuo Liu,Zhangyue Yin,Ge Zhang,Wenhao Huang,Xipeng Qiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
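RLoop 的“探索、过滤、RFT、再初始化”循环可以用如下玩具代码勾勒。其中环境、策略表示与 rejection_sampling_finetune 的更新规则均为本文虚构的极简占位(真实 RFT 是在成功轨迹上做监督微调),仅用于展示迭代结构:

```python
import random
random.seed(0)

def rl_explore(p_success, n=100):
    """以当前策略做一轮探索,返回每条轨迹是否成功(极简占位)"""
    return [random.random() < p_success for _ in range(n)]

def rejection_sampling_finetune(p_success, n_kept, n_total):
    """RFT 占位:把策略向本轮成功率方向拉近并略有提升"""
    return min(1.0, 0.5 * p_success + 0.5 * (n_kept / max(1, n_total)) + 0.1)

policy = 0.3  # 初始策略的"能力",抽象为成功概率
for r in range(3):
    trajs = rl_explore(policy)                   # 1) 以当前策略为起点做 RL 探索
    expert = [t for t in trajs if t]             # 2) 过滤成功轨迹,构成专家数据集
    policy = rejection_sampling_finetune(policy, len(expert), len(trajs))  # 3) RFT 得到下一轮更优初始化
    print(f"round {r}: success {len(expert)}/{len(trajs)} -> policy {policy:.2f}")
```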
zh

[AI-28] On the Brittleness of CLIP Text Encoders

【速读】:该论文旨在解决视觉-语言模型(如CLIP)在多媒体信息检索任务中对非语义查询扰动缺乏鲁棒性的问题,尤其是在处理人工表达的查询时,微小的语法或语义变化会导致最佳匹配结果排名显著波动。解决方案的关键在于系统性地分析多种类型的非语义查询扰动(包括词汇、句法和语义层面)对CLIP变体模型性能的影响,并通过在TRECVID Ad-Hoc Video Search查询和V3C1视频数据集上的实证评估,揭示句法和语义扰动是导致不稳定性最主要的因素,而脆弱性主要集中在诸如标点符号和大小写等表面编辑上,从而强调了鲁棒性应作为评估视觉-语言模型的重要维度,超越传统基准准确率指标。

链接: https://arxiv.org/abs/2511.04247
作者: Allie Tran,Luca Rossetto
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted for publication at MMM’26

点击查看摘要

Abstract:Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models trained via contrastive alignment can lack stability towards small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.
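下面是一个衡量查询扰动下排序稳定性的示意脚本:对同一查询施加标点、大小写等表层编辑,再用 Kendall tau 比较两次检索的全量排名。其中 embed_text 只是保证脚本可运行的随机投影占位,实际评测应替换为 CLIP 的文本编码器,视频嵌入库也为虚构数据:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
d, N = 128, 1000

VOCAB = {}
def embed_text(text):
    """随机投影词袋占位编码器:实际评测中应替换为 CLIP 文本塔"""
    v = np.zeros(d)
    for tok in text.lower().split():
        if tok not in VOCAB:
            VOCAB[tok] = np.random.default_rng(abs(hash(tok)) % 2**32).normal(size=d)
        v += VOCAB[tok]
    return v / (np.linalg.norm(v) + 1e-9)

video_embs = rng.normal(size=(N, d))          # 假设的视频片段嵌入库
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)

def rank_vector(query):
    """返回每个视频在该查询下的名次向量"""
    order = np.argsort(-(video_embs @ embed_text(query)))
    ranks = np.empty(N); ranks[order] = np.arange(N)
    return ranks

q0 = "a man riding a bicycle on the street"
q1 = "A man riding a bicycle on the street."  # 表层编辑:大小写 + 标点
tau, _ = kendalltau(rank_vector(q0), rank_vector(q1))
print("rank stability (Kendall tau):", round(float(tau), 3))
```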
zh

[AI-29] seqme: a Python library for evaluating biological sequence design

【速读】:该论文旨在解决生物序列设计计算方法评估缺乏统一软件库的问题,即当前尚无一个集成多种评估指标的开源工具来衡量设计序列对目标分布的保真度及所需特性的实现程度。解决方案的关键在于提出 seqme,一个模块化且高度可扩展的 Python 开源库,其核心创新是整合了三类模型无关的评估指标:基于序列(sequence-based)、基于嵌入(embedding-based)和基于属性(property-based)的指标,并支持小分子、DNA、非编码RNA(ncRNA)、mRNA、肽和蛋白质等多种生物序列类型,同时提供嵌入模型、属性预测模型、诊断与可视化功能,从而实现对一次性与迭代式设计方法的全面评估。

链接: https://arxiv.org/abs/2511.04239
作者: Rasmus Møller-Larsen,Adam Izdebski,Jan Olszewski,Pankhil Gawade,Michal Kmicikiewicz,Wojciech Zarzecki,Ewa Szczurek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Recent advances in computational methods for designing biological sequences have sparked the development of metrics to evaluate these methods' performance in terms of the fidelity of the designed sequences to a target distribution and their attainment of desired properties. However, a single software library implementing these metrics was lacking. In this work we introduce seqme, a modular and highly extendable open-source Python library, containing model-agnostic metrics for evaluating computational methods for biological sequence design. seqme considers three groups of metrics: sequence-based, embedding-based, and property-based, and is applicable to a wide range of biological sequences: small molecules, DNA, ncRNA, mRNA, peptides and proteins. The library offers a number of embedding and property models for biological sequences, as well as diagnostics and visualization functions to inspect the results. seqme can be used to evaluate both one-shot and iterative computational design methods.
zh

[AI-30] Denoised Recommendation Model with Collaborative Signal Decoupling

【速读】:该论文旨在解决协同过滤(Collaborative Filtering, CF)算法在推荐系统中因用户-物品交互矩阵中的噪声而导致推荐性能不佳的问题。现有去噪方法多在单一图结构上进行处理,可能引发协同信号衰减:移除两个节点间的边会中断其他节点间的路径,削弱依赖路径的协同信息。解决方案的关键在于提出一种基于图神经网络(Graph Neural Network, GNN)的新型CF模型DRCSD,其核心包含两个模块:协同信号解耦模块(将信号按结构特征分解为不同阶次)和阶次级去噪模块(对每一阶次进行针对性去噪),并通过改进传统GNN的信息聚合机制,避免跨阶次信号干扰直至最终池化操作,从而提升模型对不稳定交互的鲁棒性与推荐精度。

链接: https://arxiv.org/abs/2511.04237
作者: Zefeng Li,Ning Yang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although the collaborative filtering (CF) algorithm has achieved remarkable performance in recommendation systems, it suffers from suboptimal recommendation performance due to noise in the user-item interaction matrix. Numerous noise-removal studies have improved recommendation models, but most existing approaches conduct denoising on a single graph. This may cause attenuation of collaborative signals: removing edges between two nodes can interrupt paths between other nodes, weakening path-dependent collaborative information. To address these limitations, this study proposes a novel GNN-based CF model called DRCSD for denoising unstable interactions. DRCSD includes two core modules: a collaborative signal decoupling module (decomposes signals into distinct orders by structural characteristics) and an order-wise denoising module (performs targeted denoising on each order). Additionally, the information aggregation mechanism of traditional GNN-based CF models is modified to avoid cross-order signal interference until the final pooling operation. Extensive experiments on three public real-world datasets show that DRCSD has superior robustness against unstable interactions and achieves statistically significant performance improvements in recommendation accuracy metrics compared to state-of-the-art baseline models.
zh

[AI-31] Shared Spatial Memory Through Predictive Coding

【速读】:该论文旨在解决多智能体系统中共享和重建一致空间记忆的挑战,尤其是在部分可观测性和有限带宽条件下容易导致协调失败的问题。解决方案的关键在于提出一种基于预测编码(predictive coding)的多智能体框架,将协调建模为最小化智能体间互不确定性(mutual uncertainty)的过程,并通过信息瓶颈(information bottleneck)目标引导智能体学习通信的内容、时机与策略。该框架以自发从自监督运动预测中涌现的类似网格细胞(grid-cell-like)内部空间编码为基础,使智能体逐步发展出高效的通信机制和专门编码同伴位置的神经群体——类比于海马体中的社会位置细胞(social place cells, SPCs)。在此基础上,结合分层强化学习策略主动探索以降低联合不确定性,从而实现鲁棒且高效的协作行为。

链接: https://arxiv.org/abs/2511.04235
作者: Zhengru Fang,Yu Guo,Jingjing Wang,Yuang Zhang,Haonan An,Yinhai Wang,Yuguang Fang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: We have prepared the open-source code and video demonstration pages: 1. Code: this http URL 2. Demo: this http URL

点击查看摘要

Abstract:Sharing and reconstructing a consistent spatial memory is a critical challenge in multi-agent systems, where partial observability and limited bandwidth often lead to catastrophic failures in coordination. We introduce a multi-agent predictive coding framework that formulates coordination as the minimization of mutual uncertainty among agents. Instantiated as an information bottleneck objective, it prompts agents to learn not only who and what to communicate but also when. At the foundation of this framework lies a grid-cell-like metric as internal spatial coding for self-localization, emerging spontaneously from self-supervised motion prediction. Building upon this internal spatial code, agents gradually develop a bandwidth-efficient communication mechanism and specialized neural populations that encode partners' locations: an artificial analogue of hippocampal social place cells (SPCs). These social representations are further enacted by a hierarchical reinforcement learning policy that actively explores to reduce joint uncertainty. On the Memory-Maze benchmark, our approach shows exceptional resilience to bandwidth constraints: success degrades gracefully from 73.5% to 64.4% as bandwidth shrinks from 128 to 4 bits/step, whereas a full-broadcast baseline collapses from 67.6% to 28.6%. Our findings establish a theoretically principled and biologically plausible basis for how complex social representations emerge from a unified predictive drive, leading to social collective intelligence.
zh

[AI-32] Opus: A Quantitative Framework for Workflow Evaluation

【速读】:该论文旨在解决工作流(Workflow)质量与效率难以量化评估和优化的问题,尤其在自动化系统中缺乏统一的数学框架来衡量其正确性、可靠性与成本之间的权衡。解决方案的关键在于提出Opus工作流评估框架,该框架通过引入Opus工作流奖励函数(Opus Workflow Reward),将工作流的成功概率建模为对成本与收益的期望值;同时定义了Opus工作流规范惩罚项(Opus Workflow Normative Penalties),从结构耦合度(Cohesion, Coupling)、可观测性(Observability)和信息卫生(Information Hygiene)等维度捕捉工作流的结构性与语义质量。最终,该框架构建了一个联合奖励-惩罚优化模型,实现工作流的自动评分、排序与最优解搜索,支持嵌入强化学习(Reinforcement Learning)循环以驱动工作流的发现与迭代优化。

链接: https://arxiv.org/abs/2511.04220
作者: Alan Seroul,Théo Fagnoni,Inès Adnani,Dana O. Mohamed,Phillip Kingston
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:This paper introduces the Opus Workflow Evaluation Framework, a probabilistic-normative formulation for quantifying Workflow quality and efficiency. It integrates notions of correctness, reliability, and cost into a coherent mathematical model that enables direct comparison, scoring, and optimization of Workflows. The framework combines the Opus Workflow Reward, a probabilistic function estimating expected performance through success likelihood, resource usage, and output gain, with the Opus Workflow Normative Penalties, a set of measurable functions capturing structural and informational quality across Cohesion, Coupling, Observability, and Information Hygiene. It supports automated Workflow assessment, ranking, and optimization within modern automation systems such as Opus and can be integrated into Reinforcement Learning loops to guide Workflow discovery and refinement. In this paper, we introduce the Opus Workflow Reward model that formalizes Workflow success as a probabilistic expectation over costs and outcomes. We define measurable Opus Workflow Normative Penalties capturing structural, semantic, and signal-related properties of Workflows. Finally, we propose a unified optimization formulation for identifying and ranking optimal Workflows under joint Reward-Penalty trade-offs.
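摘要中的“奖励-惩罚”联合打分可以直观地写成如下草图:期望奖励为成功概率乘以产出收益再减去成本,规范性惩罚按四类指标加权求和,最终得分为二者之差。具体数学形式与权重以论文为准,此处的类名、字段与数值均为演示假设:

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    p_success: float   # 成功概率估计
    cost: float        # 资源开销
    gain: float        # 产出收益
    penalties: dict = field(default_factory=dict)  # Cohesion / Coupling / Observability / Hygiene

def opus_score(wf: Workflow, weights=None):
    """示意性打分:期望奖励减去加权规范性惩罚"""
    weights = weights or {k: 1.0 for k in wf.penalties}
    reward = wf.p_success * wf.gain - wf.cost
    penalty = sum(weights[k] * v for k, v in wf.penalties.items())
    return reward - penalty

wf_a = Workflow(0.9, 2.0, 10.0, {"cohesion": 0.2, "coupling": 0.5, "observability": 0.1, "hygiene": 0.1})
wf_b = Workflow(0.7, 1.0, 10.0, {"cohesion": 0.1, "coupling": 0.2, "observability": 0.3, "hygiene": 0.2})
# 对候选工作流排序,分数高者优先
print(sorted([("A", opus_score(wf_a)), ("B", opus_score(wf_b))], key=lambda t: -t[1]))
```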
zh

[AI-33] he Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

【速读】:该论文试图解决Transformer架构中强彩票票根(Strong Lottery Ticket, SLT)的存在性问题,特别是现有理论尚未涵盖多头注意力机制(Multi-Head Attention, MHA)这一核心组件。解决方案的关键在于:首先证明在随机初始化的MHA中,若其键(key)和值(value)的隐藏维度为 $ O(d\log(Hd^{3/2})) $(其中 $ d $ 为输入维度,$ H $ 为头数),则存在一个SLT能以高概率近似任意目标MHA;进而基于此理论,将强彩票票根假说扩展至不含归一化层的Transformer模型。该理论通过控制隐藏维度提升逼近精度,并在实验中验证了近似误差随隐藏维度增长呈指数下降趋势。

链接: https://arxiv.org/abs/2511.04217
作者: Hikari Otsuka,Daiki Chijiwa,Yasuyuki Okoshi,Daichi Fujiki,Susumu Takeuchi,Masato Motomura
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 8 figures

点击查看摘要

Abstract:The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA of $H$ heads and input dimension $d$ has the hidden dimension $O(d\log(Hd^{3/2}))$ for the key and value, it contains an SLT that approximates an arbitrary MHA with the same input dimension with high probability. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA and transformer) and an approximate target counterpart decreases exponentially by increasing the hidden dimension of the source model.
zh

[AI-34] A Reinforced Evolution-Based Approach to Multi-Resource Load Balancing

【速读】:该论文旨在解决一个特定的多资源(d-resource)系统优化问题,其难点在于传统遗传算法在处理该问题时因可行性函数过于严格而失效。解决方案的关键在于对标准遗传算子进行多项改进与适应性调整,其中核心创新是引入了一个迁移算子(migration operator),该算子模拟生物随机遗传漂变(random genetic drift)机制,从而提升搜索效率和解的可行性。

链接: https://arxiv.org/abs/2511.04183
作者: Leszek Sliwko
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:This paper presents a reinforced genetic approach to a defined d-resource system optimization problem. The classical evolution schema was ineffective due to a very strict feasibility function in the studied problem. Hence, the presented strategy has introduced several modifications and adaptations to standard genetic routines, e.g.: a migration operator which is an analogy to the biological random genetic drift.
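文中类比“随机遗传漂变”的迁移算子,可用如下极简草图理解:在环形拓扑的子种群之间按比例随机抽取个体,随机替换相邻子种群中的个体。环形拓扑、迁移率等设定均为演示假设,并非论文的具体实现:

```python
import random
random.seed(0)

def migrate(subpops, rate=0.1):
    """迁移算子示意:模拟随机遗传漂变,在子种群间随机交换个体"""
    k = len(subpops)
    for i in range(k):
        j = (i + 1) % k                            # 环形拓扑:向相邻子种群迁移
        n = max(1, int(rate * len(subpops[i])))
        for idx in random.sample(range(len(subpops[i])), n):
            # 漂变式迁移:迁出个体随机替换目标子种群中的一个个体
            target = random.randrange(len(subpops[j]))
            subpops[j][target] = subpops[i][idx]
    return subpops

pops = [[random.random() for _ in range(8)] for _ in range(3)]
migrate(pops)
print("after migration:", [round(p[0], 2) for p in pops])
```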
zh

[AI-35] Explaining Software Vulnerabilities with Large Language Models

【速读】:该论文旨在解决静态应用安全测试(Static Application Security Testing, SAST)工具在实际使用中因警告信息通用化而导致开发者难以理解或忽略关键安全漏洞的问题。解决方案的关键在于提出SAFE——一个集成开发环境(Integrated Development Environment, IDE)插件,利用GPT-4o大语言模型生成关于漏洞成因、影响及缓解策略的解释性内容,从而提升SAST工具对初学者到中级开发者的可理解性和可用性。

链接: https://arxiv.org/abs/2511.04179
作者: Oshando Johnson,Alexandra Fomina,Ranjith Krishnamurthy,Vaibhav Chaudhari,Rohith Kumar Shanmuganathan,Eric Bodden
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The prevalence of security vulnerabilities has prompted companies to adopt static application security testing (SAST) tools for vulnerability detection. Nevertheless, these tools frequently exhibit usability limitations, as their generic warning messages do not sufficiently communicate important information to developers, resulting in misunderstandings or oversight of critical findings. In light of recent developments in Large Language Models (LLMs) and their text generation capabilities, our work investigates a hybrid approach that uses LLMs to tackle the SAST explainability challenges. In this paper, we present SAFE, an Integrated Development Environment (IDE) plugin that leverages GPT-4o to explain the causes, impacts, and mitigation strategies of vulnerabilities detected by SAST tools. Our expert user study findings indicate that the explanations generated by SAFE can significantly assist beginner to intermediate developers in understanding and addressing security vulnerabilities, thereby improving the overall usability of SAST tools.
zh

[AI-36] When Empowerment Disempowers

【速读】:该论文试图解决的问题是:在多人类场景中,基于赋能(empowerment)的目标无关性目标在单人环境下表现良好,但在多人环境中可能导致对其他人类的“去赋能”(disempowerment)现象,从而引发AI行为与人类意图的不一致。解决方案的关键在于引入一个开源的多人类网格世界测试套件Disempower-Grid,并通过实证发现:优化单一人类赋能的强化学习(RL)代理会显著削弱另一人类的环境控制力和奖励,进而提出联合赋能(joint empowerment)作为缓解策略——尽管这会以牺牲用户奖励为代价。这一发现揭示了AI对齐领域的一个关键挑战:在单智能体设定下看似对齐的目标,在多智能体情境中可能产生系统性偏差。

链接: https://arxiv.org/abs/2511.04177
作者: Claire Yang,Maya Cakmak,Max Kleiman-Weiner
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Empowerment, a measure of an agent’s ability to control its environment, has been proposed as a universal goal-agnostic objective for motivating assistive behavior in AI agents. While multi-human settings like homes and hospitals are promising for AI assistance, prior work on empowerment-based assistance assumes that the agent assists one human in isolation. We introduce an open source multi-human gridworld test suite Disempower-Grid. Using Disempower-Grid, we empirically show that assistive RL agents optimizing for one human’s empowerment can significantly reduce another human’s environmental influence and rewards - a phenomenon we formalize as disempowerment. We characterize when disempowerment occurs in these environments and show that joint empowerment mitigates disempowerment at the cost of the user’s reward. Our work reveals a broader challenge for the AI alignment community: goal-agnostic objectives that seem aligned in single-agent settings can become misaligned in multi-agent contexts.
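赋能(empowerment)通常定义为动作序列与未来状态之间的信道容量;在确定性环境中可简化为可达终态数的对数。下面这个一维走廊玩具示例(环境与参数均为演示假设)演示了摘要中的核心现象:一个智能体占住通道,会直接降低另一人的赋能:

```python
import itertools, math

# 一维走廊:位置 0..6,动作 {-1, 0, +1};blocked 表示被另一智能体占据的格子
def step(pos, a, blocked):
    nxt = min(6, max(0, pos + a))
    return pos if nxt == blocked else nxt

def empowerment(pos, horizon, blocked=None):
    """n 步赋能的简化估计:log2(可达终态数);确定性环境下等于动作->终态的信道容量"""
    finals = set()
    for seq in itertools.product([-1, 0, 1], repeat=horizon):
        p = pos
        for a in seq:
            p = step(p, a, blocked)
        finals.add(p)
    return math.log2(len(finals))

# 另一智能体占住通道中点,削弱了他人的赋能(disempowerment)
print("no blocker :", empowerment(3, horizon=2))       # log2(5) ≈ 2.32
print("blocked at 4:", empowerment(3, horizon=2, blocked=4))  # log2(3) ≈ 1.58
```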
zh

[AI-37] Are We Aligned? A Preliminary Investigation of the Alignment of Responsible AI Values between LLM s and Human Judgment

【速读】:该论文旨在解决生成式 AI(Generative AI)在软件工程任务中应用时,其价值偏好与人类判断之间是否存在偏差的问题,特别是评估大型语言模型(Large Language Models, LLMs)在负责任人工智能(Responsible AI)价值观上的对齐程度。研究通过四个任务系统性地比较了LLMs与两类人类群体——美国代表性样本和AI从业者——在关键责任价值选择、重要性评估、价值权衡及软件需求优先级排序中的表现。解决方案的关键在于构建多维度的评估框架,揭示LLMs在陈述价值主张(任务1–3)与实际行为(任务4)之间存在的不一致性,从而指出当前LLMs在需求工程中直接应用的风险,并强调需建立系统性的基准测试、解释与监控机制以保障其价值对齐性。

链接: https://arxiv.org/abs/2511.04157
作者: Asma Yamani,Malak Baslyman,Moataz Ahmed
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly employed in software engineering tasks such as requirements elicitation, design, and evaluation, raising critical questions regarding their alignment with human judgments on responsible AI values. This study investigates how closely LLMs’ value preferences align with those of two human groups: a US-representative sample and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key responsible AI values, (T2) rating their importance in specific contexts, (T3) resolving trade-offs between competing values, and (T4) prioritizing software requirements that embody those values. The results show that LLMs generally align more closely with AI practitioners than with the US-representative sample, emphasizing fairness, privacy, transparency, safety, and accountability. However, inconsistencies appear between the values that LLMs claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4), revealing gaps in faithfulness between stated and applied behavior. These findings highlight the practical risk of relying on LLMs in requirements engineering without human oversight and motivate the need for systematic approaches to benchmark, interpret, and monitor value alignment in AI-assisted software development.
zh

[AI-38] Scaffolding Metacognition in Programming Education: Understanding Student-AI Interactions and Design Implications

【速读】:该论文试图解决的问题是:生成式 AI 工具(如 ChatGPT)在编程教育中对学生元认知过程(metacognitive processes)的影响尚不明确,现有研究多关注代码正确性和工具可用性,而忽视了学生使用 AI 助手是否支持或绕过关键的元认知策略。解决方案的关键在于通过元认知视角分析超过10,000条对话日志,并结合学生与教师问卷调查,识别提示词(prompts)和响应如何匹配元认知阶段与策略;进而提炼出设计原则,指导开发能够增强而非替代学生元认知参与的 AI 编程助教系统。

链接: https://arxiv.org/abs/2511.04144
作者: Boxuan Ma,Huiyong Li,Gen Li,Li Chen,Cheng Tang,Yinjie Xie,Chenghao Gu,Atsushi Shimada,Shin’ichi Konomi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI tools such as ChatGPT now provide novice programmers with unprecedented access to instant, personalized support. While this holds clear promise, their influence on students’ metacognitive processes remains underexplored. Existing work has largely focused on correctness and usability, with limited attention to whether and how students’ use of AI assistants supports or bypasses key metacognitive processes. This study addresses that gap by analyzing student-AI interactions through a metacognitive lens in university-level programming courses. We examined more than 10,000 dialogue logs collected over three years, complemented by surveys of students and educators. Our analysis focused on how prompts and responses aligned with metacognitive phases and strategies. Synthesizing these findings across data sources, we distill design considerations for AI-powered coding assistants that aim to support rather than supplant metacognitive engagement. Our findings provide guidance for developing educational AI tools that strengthen students’ learning processes in programming education.
zh

[AI-39] sting the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms

【速读】:该论文旨在解决语音人工智能(Voice AI)测试可靠性不足的问题,即当前缺乏系统性方法来客观评估测试平台(无论是内部工具还是外部平台)的有效性,从而在语音AI大规模部署至每日数十亿交互场景时形成关键的测量盲区。其解决方案的关键在于提出首个基于人类中心基准测试的系统性框架,通过结合心理测量学技术(如成对比较生成Elo评分、自举置信区间和置换检验)与严格的统计验证,实现对测试平台两个核心能力的量化:一是生成真实对话的能力(仿真质量),二是准确评估代理响应的质量(评估质量)。该框架提供可复现的指标体系,并通过在三个主流商业语音AI测试平台上进行大规模人工判断(共21,600次评价)和真实对话验证(60个对话),实证表明不同平台间存在显著性能差异,从而为研究人员和组织提供了可信赖的测试能力评估工具,支撑语音AI规模化落地。

链接: https://arxiv.org/abs/2511.04133
作者: Miguel E. Andres,Vadim Fedorov,Rida Sadek,Enric Spagnolo-Arrizabalaga,Nadescha Trudel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Voice AI agents are rapidly transitioning to production deployments, yet systematic methods for ensuring testing reliability remain underdeveloped. Organizations cannot objectively assess whether their testing approaches (internal tools or external platforms) actually work, creating a critical measurement gap as voice AI scales to billions of daily interactions. We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality). The framework combines established psychometric techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence intervals, and permutation tests) with rigorous statistical validation to provide reproducible metrics applicable to any testing approach. To validate the framework and demonstrate its utility, we conducted a comprehensive empirical evaluation of three leading commercial platforms focused on voice AI testing, using 21,600 human judgments across 45 simulations and ground-truth validation on 60 conversations. Results reveal statistically significant performance differences with the proposed framework, with the top-performing platform, Evalion, achieving 0.92 evaluation quality measured as F1-score versus 0.73 for others, and 0.61 simulation quality using a league-based scoring system (including ties) vs 0.43 for other platforms. This framework enables researchers and organizations to empirically validate the testing capabilities of any platform, providing essential measurement foundations for confident voice AI deployment at scale. Supporting materials are made available to facilitate reproducibility and adoption.
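摘要中提到的心理测量学流程(成对比较得到 Elo 评分,再用 bootstrap 估计置信区间)可用如下可运行草图复现其骨架;比较记录为虚构数据,K 因子等超参数也只是常见默认值:

```python
import random
random.seed(0)

def elo_ratings(matches, k=16, base=1000.0):
    """matches: (winner, loser) 的成对比较记录,返回各平台的 Elo 分数"""
    r = {}
    for w, l in matches:
        rw, rl = r.setdefault(w, base), r.setdefault(l, base)
        ew = 1.0 / (1.0 + 10 ** ((rl - rw) / 400))   # w 方的期望胜率
        r[w] = rw + k * (1 - ew)
        r[l] = rl - k * (1 - ew)
    return r

# 虚构的人类成对判断:平台 A 多数情况下优于 B、C
matches = ([("A", "B")] * 60 + [("B", "A")] * 20 + [("A", "C")] * 55 +
           [("C", "A")] * 25 + [("B", "C")] * 40 + [("C", "B")] * 40)
random.shuffle(matches)
print({p: round(s) for p, s in elo_ratings(matches).items()})

# bootstrap 置信区间:对比较记录重采样,观察 Elo 的稳定性
samples = sorted(
    elo_ratings([random.choice(matches) for _ in matches]).get("A", 1000.0)
    for _ in range(200)
)
print("A 的 95% CI:", round(samples[4]), "-", round(samples[194]))
```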
zh

[AI-40] Automated and Explainable Denial of Service Analysis for AI-Driven Intrusion Detection Systems

【速读】:该论文旨在解决分布式拒绝服务(DDoS)攻击检测中传统方法存在的可扩展性差和透明度不足的问题,这些问题限制了实时响应能力和对攻击路径的理解。解决方案的关键在于构建一个自动化框架,利用Tree-based Pipeline Optimization Tool (TPOT) 自动优化机器学习模型与特征选择流程,并结合SHapley Additive exPlanations (SHAP) 方法提升模型的可解释性,从而实现高准确率且具备解释能力的DDoS攻击检测。实验表明,诸如平均后向包长度和最小前向包头长度等关键特征在检测中起决定性作用。

链接: https://arxiv.org/abs/2511.04114
作者: Paul Badu Yakubu,Lesther Santana,Mohamed Rahouti,Yufeng Xin,Abdellah Chehri,Mohammed Aledhari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 2 figures, 11 tables, IET Information Security

点击查看摘要

Abstract:With the increasing frequency and sophistication of Distributed Denial of Service (DDoS) attacks, it has become critical to develop more efficient and interpretable detection methods. Traditional detection systems often struggle with scalability and transparency, hindering real-time response and understanding of attack vectors. This paper presents an automated framework for detecting and interpreting DDoS attacks using machine learning (ML). The proposed method leverages the Tree-based Pipeline Optimization Tool (TPOT) to automate the selection and optimization of ML models and features, reducing the need for manual experimentation. SHapley Additive exPlanations (SHAP) is incorporated to enhance model interpretability, providing detailed insights into the contribution of individual features to the detection process. By combining TPOT’s automated pipeline selection with SHAP interpretability, this approach improves the accuracy and transparency of DDoS detection. Experimental results demonstrate that key features such as mean backward packet length and minimum forward packet header length are critical in detecting DDoS attacks, offering a scalable and explainable cybersecurity solution.
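一个最小化的 TPOT + SHAP 流程草图如下(假设已安装 tpot 与 shap 包,并以 sklearn 的合成数据代替真实 DDoS 流量特征;搜索代数等参数为演示而刻意调小):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
import shap

# 合成二分类数据,代替真实流量特征(如平均后向包长度等)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# TPOT 自动搜索并优化模型管线
tpot = TPOTClassifier(generations=3, population_size=10, random_state=42, verbosity=0)
tpot.fit(X_tr, y_tr)
print("test accuracy:", tpot.score(X_te, y_te))

# 用 SHAP 解释优化后的管线,量化各特征对检测结果的贡献
explainer = shap.Explainer(tpot.fitted_pipeline_.predict, X_tr)
shap_values = explainer(X_te[:50])
print("mean |SHAP| per feature:", np.abs(shap_values.values).mean(axis=0).round(3))
```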
zh

[AI-41] KGFR: A Foundation Retriever for Generalized Knowledge Graph Question Answering

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理知识密集型问题时因上下文长度限制和参数化知识不足而导致性能下降的问题,同时克服现有方法依赖特定数据集微调或图神经网络(Graph Neural Networks, GNN)检索器在大规模或未见知识图谱(Knowledge Graphs, KGs)上扩展性差的局限。其解决方案的关键在于提出LLM-KGFR协同框架,其中知识图谱基础检索器(Knowledge Graph Foundation Retriever, KGFR)通过LLM生成的关系描述进行编码,并基于问题中实体的角色初始化检索路径,从而实现对未见KG的零样本泛化;此外,引入非对称渐进传播(Asymmetric Progressive Propagation, APP)机制,在保持信息路径完整性的同时,通过有选择地限制高连接度节点来高效处理大规模图结构,最终通过节点、边和路径层面的接口构建可控的推理循环,使LLM能够迭代请求候选答案、支持事实及推理路径,显著提升可扩展性和通用性。

链接: https://arxiv.org/abs/2511.04093
作者: Yuanning Cui,Zequn Sun,Wei Hu,Zhangjie Fu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel at reasoning but struggle with knowledge-intensive questions due to limited context and parametric knowledge. However, existing methods that rely on finetuned LLMs or GNN retrievers are limited by dataset-specific tuning and scalability on large or unseen graphs. We propose the LLM-KGFR collaborative framework, where an LLM works with a structured retriever, the Knowledge Graph Foundation Retriever (KGFR). KGFR encodes relations using LLM-generated descriptions and initializes entities based on their roles in the question, enabling zero-shot generalization to unseen KGs. To handle large graphs efficiently, it employs Asymmetric Progressive Propagation (APP), a stepwise expansion that selectively limits high-degree nodes while retaining informative paths. Through node-, edge-, and path-level interfaces, the LLM iteratively requests candidate answers, supporting facts, and reasoning paths, forming a controllable reasoning loop. Experiments demonstrate that LLM-KGFR achieves strong performance while maintaining scalability and generalization, providing a practical solution for KG-augmented reasoning.
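APP(非对称渐进传播)的核心思想是逐步扩展检索子图,并对高连接度节点限制其展开的邻居数,可用如下草图理解。真实系统中应按信息量对邻居排序,这里用确定性截断代替,图结构亦为虚构示例:

```python
from collections import defaultdict

def app_expand(adj, seeds, steps=3, degree_cap=5):
    """逐步扩展子图:高度数节点只展开 degree_cap 个邻居,避免子图爆炸"""
    frontier, visited = set(seeds), set(seeds)
    for _ in range(steps):
        nxt = set()
        for u in frontier:
            nbrs = adj[u]
            if len(nbrs) > degree_cap:
                nbrs = sorted(nbrs)[:degree_cap]   # 简化:确定性截断,实际按信息量排序
            nxt.update(v for v in nbrs if v not in visited)
        visited |= nxt
        frontier = nxt
    return visited

# 节点 0 是一个高度数枢纽,其余是一条链
adj = defaultdict(list)
edges = [(0, i) for i in range(1, 20)] + [(1, 20), (20, 21), (21, 22)]
for a, b in edges:
    adj[a].append(b); adj[b].append(a)

print(sorted(app_expand(adj, seeds=[0], steps=3, degree_cap=5)))
```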
zh

[AI-42] An Automated Theorem Generator with Theoretical Foundation Based on Rectangular Standard Contradiction

【速读】:该论文旨在解决当前缺乏系统性理论框架以自动生成非平凡且逻辑有效的定理的问题。其解决方案的关键在于提出了一种基于“矩形标准矛盾”(rectangular standard contradiction)的新逻辑结构,并首次证明了该结构具备两个核心性质:一是其本身为标准矛盾(必然不可满足),二是具有非冗余性(移除任意子句后剩余子句集变为可满足)。基于此结构,论文进一步证明:将矩形标准矛盾划分为前提子集A与补集的否定¬H,即可构成一个有效定理A ⊢ ¬H,且所有此类定理逻辑等价。这一理论构建了自动化定理生成(ATG)的完整体系,并通过模板化算法实现了高效实现,使机器从“验证者”转变为“发现者”。

链接: https://arxiv.org/abs/2511.04092
作者: Yang Xu,Peiyao Liu,Shuwei Chen,Jun Liu
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Logic (math.LO)
备注: 17 pages

点击查看摘要

Abstract:Currently, there is a lack of rigorous theoretical system for systematically generating non-trivial and logically valid theorems. Addressing this critical gap, this paper conducts research to propose a novel automated theorem generation theory and tool. Based on the concept of standard contradiction which possesses unique deductive advantages, this paper defines and proves, for the first time, a new logical structure known as rectangular standard contradiction. Centered on this structure, a complete Automated Theorem Generation (ATG) theory is put forward. Theoretical proofs clarify two core properties of rectangular standard contradiction: first, it is a standard contradiction (necessarily unsatisfiable); second, it exhibits non-redundancy (the remaining clause set becomes satisfiable after removing any clause). Leveraging these properties, this paper proves that by partitioning a rectangular standard contradiction into a premise subset $A$ and the negation $\neg H$ of its complement, a valid theorem $A \vdash \neg H$ can be formed, and all such theorems are logically equivalent. To implement this theory, an efficient template-based ATG algorithm is designed, and a Rectangular Automated Theorem Generator is developed. This research enables machines to transition from "verifiers" to "discoverers", opening up new avenues for fundamental research in the fields of logic and artificial intelligence.
zh

[AI-43] Advancing Equitable AI: Evaluating Cultural Expressiveness in LLM s for Latin American Contexts

【速读】:该论文旨在解决当前人工智能(AI)系统因数据来源不均衡而对拉丁美洲等经济欠发达地区产生偏见的问题,尤其体现在英语的语言主导地位(language dominance)和文化视角的西方中心主义倾向上。解决方案的关键在于构建一个以拉丁美洲历史、社会政治背景和多元语言(包括西班牙语、葡萄牙语及克丘亚语、纳瓦特尔语等土著语言)为基础的文化敏感型数据集,并通过引入“文化表达力”(Cultural Expressiveness)这一新指标对多个语言模型进行评估与微调。实验表明,基于该数据集对Mistral-7B模型进行微调后,其文化表达力提升42.9%,显著改善了对拉丁美洲语境的理解能力,从而推动更具公平性的AI发展。

链接: https://arxiv.org/abs/2511.04090
作者: Brigitte A. Mora-Reyes,Jennifer A. Drewyor,Abel A. Reyes-Angulo
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) systems often reflect biases from economically advanced regions, marginalizing contexts in economically developing regions like Latin America due to imbalanced datasets. This paper examines AI representations of diverse Latin American contexts, revealing disparities between data from economically advanced and developing regions. We highlight how the dominance of English over Spanish, Portuguese, and indigenous languages such as Quechua and Nahuatl perpetuates biases, framing Latin American perspectives through a Western lens. To address this, we introduce a culturally aware dataset rooted in Latin American history and socio-political contexts, challenging Eurocentric models. We evaluate six language models on questions testing cultural context awareness, using a novel Cultural Expressiveness metric, statistical tests, and linguistic analyses. Our findings show that some models better capture Latin American perspectives, while others exhibit significant sentiment misalignment (p < 0.001). Fine-tuning Mistral-7B with our dataset improves its cultural expressiveness by 42.9%, advancing equitable AI development. We advocate for equitable AI by prioritizing datasets that reflect Latin American history, indigenous knowledge, and diverse languages, while emphasizing community-centered approaches to amplify marginalized voices.
zh

[AI-44] DeNoise: Learning Robust Graph Representations for Unsupervised Graph-Level Anomaly Detection

【速读】:该论文针对无监督图级异常检测(Unsupervised Graph-level Anomaly Detection, UGAD)中训练数据可能被异常图污染的问题展开研究。现有基于图神经网络(Graph Neural Network, GNN)的方法通常隐式假设训练集仅包含正常图,但在实际场景中这一假设难以成立,少量异常图的干扰会导致模型学习到失真的表示并显著降低检测性能。为解决此问题,论文提出DeNoise框架,其核心创新在于:首先通过对抗性目标联合优化图编码器、属性解码器与结构解码器,以学习对噪声具有鲁棒性的嵌入;其次引入编码器锚定对齐去噪机制(encoder anchor-alignment denoising mechanism),将来自正常图的高信息量节点嵌入融合至所有图嵌入中,从而提升表示质量并抑制异常干扰;最后结合对比学习组件,在潜在空间中压缩正常图嵌入并排斥异常图嵌入,实现更清晰的分离。该方案在八个多领域真实数据集上验证了其在不同噪声强度下均能稳定学习可靠图级表示,并显著优于当前最优U-GAD基线方法。

链接: https://arxiv.org/abs/2511.04086
作者: Qingfeng Chen,Haojin Zeng,Jingyi Jie,Shichao Zhang,Debo Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid growth of graph-structured data in critical domains, unsupervised graph-level anomaly detection (UGAD) has become a pivotal task. UGAD seeks to identify entire graphs that deviate from normal behavioral patterns. However, most Graph Neural Network (GNN) approaches implicitly assume that the training set is clean, containing only normal graphs, which is rarely true in practice. Even modest contamination by anomalous graphs can distort learned representations and sharply degrade performance. To address this challenge, we propose DeNoise, a robust UGAD framework explicitly designed for contaminated training data. It jointly optimizes a graph-level encoder, an attribute decoder, and a structure decoder via an adversarial objective to learn noise-resistant embeddings. Further, DeNoise introduces an encoder anchor-alignment denoising mechanism that fuses high-information node embeddings from normal graphs into all graph embeddings, improving representation quality while suppressing anomaly interference. A contrastive learning component then compacts normal graph embeddings and repels anomalous ones in the latent space. Extensive experiments on eight real-world datasets demonstrate that DeNoise consistently learns reliable graph-level representations under varying noise intensities and significantly outperforms state-of-the-art UGAD baselines.
zh

[AI-45] Agent mandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents AAAI

【速读】:该论文旨在解决当前红istricting(选区划分)过程中存在的战略操纵问题,即现有计算方法虽能生成大量合法的选区方案,但忽视了选区方案选择阶段的博弈性,导致政党可能通过“挑选”技术合规却政治有利的地图来实现不公平优势。解决方案的关键在于提出Agentmandering框架,将选区划分重构为两个代表对立政治利益的智能体之间的回合制协商过程,借鉴博弈论中的“Choose-and-Freeze协议”,利用大语言模型(LLM)代理在受限且可解释的选择中交替冻结选区,从而嵌入战略互动机制,显著降低党派偏倚和不公,同时提升结果稳定性(在摇摆州场景下尤为明显)。

链接: https://arxiv.org/abs/2511.04076
作者: Hao Li,Haotian Chen,Ruoyuan Gong,Juanjuan Wang,Hao Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by AAAI AISI 2026

点击查看摘要

Abstract:Redistricting plays a central role in shaping how votes are translated into political power. While existing computational methods primarily aim to generate large ensembles of legally valid districting plans, they often neglect the strategic dynamics involved in the selection process. This oversight creates opportunities for partisan actors to cherry-pick maps that, while technically compliant, are politically advantageous. Simply satisfying formal constraints does not ensure fairness when the selection process itself can be manipulated. We propose \textbfAgentmandering, a framework that reimagines redistricting as a turn-based negotiation between two agents representing opposing political interests. Drawing inspiration from game-theoretic ideas, particularly the \textitChoose-and-Freeze protocol, our method embeds strategic interaction into the redistricting process via large language model (LLM) agents. Agents alternate between selecting and freezing districts from a small set of candidate maps, gradually partitioning the state through constrained and interpretable choices. Evaluation on post-2020 U.S. Census data across all states shows that Agentmandering significantly reduces partisan bias and unfairness, while achieving 2 to 3 orders of magnitude lower variance than standard baselines. These results demonstrate both fairness and stability, especially in swing-state scenarios. Our code is available at this https URL.
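Choose-and-Freeze 协议的回合制骨架可以写成下面的玩具代码:两个对立方轮流冻结(锁定)一个选区,直到全部划定。真实框架中每一步的选择由 LLM 智能体在候选地图集内做出,此处以贪心效用和零和假设代替,选区数与效用值均为演示数据:

```python
import random
random.seed(0)

districts = list(range(8))
# 零和假设:A 方在某选区的效用即 B 方的负效用(演示数据)
utility = {"party_A": {d: random.uniform(-1, 1) for d in districts}}
utility["party_B"] = {d: -utility["party_A"][d] for d in districts}

frozen, remaining, turn = {}, set(districts), "party_A"
while remaining:
    # 当前方冻结对己方最有利的选区(真实系统中由 LLM 智能体决策)
    pick = max(remaining, key=lambda d: utility[turn][d])
    frozen[pick] = turn
    remaining.remove(pick)
    turn = "party_B" if turn == "party_A" else "party_A"

# 交替冻结使双方优势相互抵消,净效用接近零,对应论文中偏倚与方差的下降
bias = sum(utility["party_A"][d] for d in frozen)
print("net advantage for party_A:", round(bias, 3))
```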
zh

[AI-46] Left Atrial Segmentation with nnU-Net Using MRI

【速读】:该论文旨在解决心脏磁共振成像(Cardiac MRI)中左心房(Left Atrial, LA)自动分割的准确性与效率问题,以支持心房颤动(Atrial Fibrillation, AF)消融治疗和生物物理心脏模型构建。传统手动勾画耗时且存在观察者间差异,难以满足大规模或时效性强的临床需求。解决方案的关键在于采用nnU-Net框架——一种自动化、自配置的深度学习分割架构,能够根据MRI数据特性自动优化预处理、网络结构和训练流程,无需人工调参,从而在Left Atrial Segmentation Challenge 2013数据集上实现平均Dice相似系数(Dice Similarity Coefficient, DSC)达93.5,显著优于以往方法,并展现出对左心房形态、对比度和图像质量变化的良好泛化能力。

链接: https://arxiv.org/abs/2511.04071
作者: Fatemeh Hosseinabadi,Seyedhassan Sharifi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate segmentation of the left atrium (LA) from cardiac MRI is critical for guiding atrial fibrillation (AF) ablation and constructing biophysical cardiac models. Manual delineation is time-consuming, observer-dependent, and impractical for large-scale or time-sensitive clinical workflows. Deep learning methods, particularly convolutional architectures, have recently demonstrated superior performance in medical image segmentation tasks. In this study, we applied the nnU-Net framework, an automated, self-configuring deep learning segmentation architecture, to the Left Atrial Segmentation Challenge 2013 dataset. The dataset consists of thirty MRI scans with corresponding expert-annotated masks. The nnU-Net model automatically adapted its preprocessing, network configuration, and training pipeline to the characteristics of the MRI data. Model performance was quantitatively evaluated using the Dice similarity coefficient (DSC), and qualitative results were compared against expert segmentations. The proposed nnU-Net model achieved a mean Dice score of 93.5, demonstrating high overlap with expert annotations and outperforming several traditional segmentation approaches reported in previous studies. The network exhibited robust generalization across variations in left atrial shape, contrast, and image quality, accurately delineating both the atrial body and proximal pulmonary veins.
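评估中使用的 Dice 相似系数(DSC)定义为 2|P∩G|/(|P|+|G|),其中 P 为预测掩码、G 为专家标注。下面是一个带玩具掩码的可运行实现,便于核对数值:

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice 相似系数:2|P∩G| / (|P|+|G|),衡量分割与标注的重叠度"""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# 玩具掩码:8x8 图像中左心房区域的标注与预测
gt = np.zeros((8, 8), dtype=np.uint8); gt[2:6, 2:6] = 1
pr = np.zeros((8, 8), dtype=np.uint8); pr[3:7, 2:6] = 1
print("DSC =", round(float(dice(pr, gt)), 3))  # 交叠 3x4=12,DSC = 2*12/(16+16) = 0.75
```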
zh

[AI-47] Pediatric Appendicitis Detection from Ultrasound Images

【速读】:该论文旨在解决儿童急性阑尾炎(pediatric appendicitis)的诊断难题,尤其是在临床症状重叠和超声图像质量参差不齐的情况下。其解决方案的关键在于基于预训练的ResNet架构开发了一种深度学习模型,通过自动提取超声图像中的判别性空间特征,实现对阑尾炎的精准分类。该模型在Regensburg儿科阑尾炎数据集上表现出优异性能,整体准确率达到93.44%,表明其能有效应对儿科超声成像中低对比度、斑点噪声及解剖变异等挑战。

链接: https://arxiv.org/abs/2511.04069
作者: Fatemeh Hosseinabadi,Seyedhassan Sharifi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pediatric appendicitis remains one of the most common causes of acute abdominal pain in children, and its diagnosis continues to challenge clinicians due to overlapping symptoms and variable imaging quality. This study aims to develop and evaluate a deep learning model based on a pretrained ResNet architecture for automated detection of appendicitis from ultrasound images. We used the Regensburg Pediatric Appendicitis Dataset, which includes ultrasound scans, laboratory data, and clinical scores from pediatric patients admitted with abdominal pain to Children's Hospital St. Hedwig in Regensburg, Germany. Each subject had 1 to 15 ultrasound views covering the right lower quadrant, appendix, lymph nodes, and related structures. For the image-based classification task, ResNet was fine-tuned to distinguish appendicitis from non-appendicitis cases. Images were preprocessed by normalization, resizing, and augmentation to enhance generalization. The proposed ResNet model achieved an overall accuracy of 93.44%, precision of 91.53%, and recall of 89.8%, demonstrating strong performance in identifying appendicitis across heterogeneous ultrasound views. The model effectively learned discriminative spatial features, overcoming challenges posed by low contrast, speckle noise, and anatomical variability in pediatric imaging.
zh

[AI-48] Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models AACL2025

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理多数值属性时的内部表征机制不明确问题,尤其是数值属性如何被整合以及无关数值上下文如何干扰其表示与下游决策。解决方案的关键在于结合线性探测(linear probing)、偏相关分析(partial correlation analysis)和基于提示的脆弱性测试(prompt-based vulnerability tests),系统地揭示LLMs中数值属性共享潜在子空间的特性,并量化无关数值上下文对 magnitude representation 的一致性扰动及其随模型规模变化的下游影响。

链接: https://arxiv.org/abs/2511.04053
作者: Hirohane Takagi,Gouki Minegishi,Shota Kizawa,Issey Sukeda,Hitomi Yanaka
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to IJCNLP-AACL 2025 (Main). Code available at this https URL

点击查看摘要

Abstract:Although behavioral studies have documented numerical reasoning errors in large language models (LLMs), the underlying representational mechanisms remain unclear. We hypothesize that numerical attributes occupy shared latent subspaces and investigate two questions: (1) How do LLMs internally integrate multiple numerical attributes of a single entity? (2) How does irrelevant numerical context perturb these representations and their downstream outputs? To address these questions, we combine linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes. Our results show that LLMs encode real-world numerical correlations but tend to systematically amplify them. Moreover, irrelevant context induces consistent shifts in magnitude representations, with downstream effects that vary by model size. These findings reveal a vulnerability in LLM decision-making and lay the groundwork for fairer, representation-aware control under multi-attribute entanglement.
zh

[AI-49] An LLM -based Framework for Human-Swarm Teaming Cognition in Disaster Search and Rescue

【速读】:该论文旨在解决大规模灾害搜索与救援(Search And Rescue, SAR)任务中,由于复杂地形和通信中断导致的无人飞行器(Unmanned Aerial Vehicle, UAV)蜂群协调困难问题,其核心瓶颈在于“意图到动作的鸿沟”——即在高压环境下,人类操作者将高层次救援目标转化为低层次蜂群指令时存在高错误率和认知负担。解决方案的关键是提出一种基于大语言模型(Large Language Model, LLM)与条件随机场(Conditional Random Field, CRF)融合的LLM-CRF系统,通过自然语言和多模态交互捕获操作者意图,并利用LLM作为认知引擎实现意图理解、任务分层分解与蜂群任务规划,形成闭环协作机制,使蜂群成为主动响应的伙伴,显著降低人工监控需求并提升SAR任务效率。

链接: https://arxiv.org/abs/2511.04042
作者: Kailun Ji(1),Xiaoyu Hu(1),Xinyu Zhang(1 and 2),Jun Chen(1 and 2) ((1) School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China, (2) Chongqing Institute for Brain and Intelligence, Guangyang Bay Laboratory, Chongqing, China)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale disaster Search And Rescue (SAR) operations are persistently challenged by complex terrain and disrupted communications. While Unmanned Aerial Vehicle (UAV) swarms offer a promising solution for tasks like wide-area search and supply delivery, their effective coordination places a significant cognitive burden on human operators. The core human-machine collaboration bottleneck lies in the "intention-to-action gap", an error-prone process of translating a high-level rescue objective into low-level swarm commands under high intensity and pressure. To bridge this gap, this study proposes a novel LLM-CRF system that leverages Large Language Models (LLMs) to model and augment human-swarm teaming cognition. The proposed framework initially captures the operator's intention through natural and multi-modal interactions with the device via voice or graphical annotations. It then employs the LLM as a cognitive engine to perform intention comprehension, hierarchical task decomposition, and mission planning for the UAV swarm. This closed-loop framework enables the swarm to act as a proactive partner, providing active feedback in real-time while reducing the need for manual monitoring and control, which considerably advances the efficacy of the SAR task. We evaluate the proposed framework in a simulated SAR scenario. Experimental results demonstrate that, compared to traditional order- and command-based interfaces, the proposed LLM-driven approach reduced task completion time by approximately 64.2% and improved task success rate by 7%. It also leads to a considerable reduction in subjective cognitive workload, with NASA-TLX scores dropping by 42.9%. This work establishes the potential of LLMs to create more intuitive and effective human-swarm collaborations in high-stakes scenarios.
zh

[AI-50] Detecting Silent Failures in Multi-Agent ic AI Trajectories

【速读】:该论文旨在解决多智能体人工智能(Multi-Agentic AI)系统中由于大语言模型(LLMs)固有的非确定性所引发的隐式故障(如输出漂移、循环和细节缺失)难以检测的问题。其解决方案的关键在于提出了一种异常检测任务,并构建了一个数据集构建流程,该流程能够捕捉用户行为、代理非确定性和LLM变化等关键因素;基于此流程,研究者构建了两个基准数据集(分别包含4,275和894条轨迹),并通过实证表明监督学习(XGBoost)与半监督学习(SVDD)方法在异常检测上均表现出优异性能,准确率分别达到98%和96%,从而为该领域提供了首个系统性的研究框架与可复现的基准资源。

链接: https://arxiv.org/abs/2511.04032
作者: Divya Pathak,Harshit Kumar,Anuska Roy,Felix George,Mudit Verma,Pratibha Moogi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-Agentic AI systems, powered by large language models (LLMs), are inherently non-deterministic and prone to silent failures such as drift, cycles, and missing details in outputs, which are difficult to detect. We introduce the task of anomaly detection in agentic trajectories to identify these failures and present a dataset curation pipeline that captures user behavior, agent non-determinism, and LLM variation. Using this pipeline, we curate and label two benchmark datasets comprising 4,275 and 894 trajectories from Multi-Agentic AI systems. Benchmarking anomaly detection methods on these datasets, we show that supervised (XGBoost) and semi-supervised (SVDD) approaches perform comparably, achieving accuracies up to 98% and 96%, respectively. This work provides the first systematic study of anomaly detection in Multi-Agentic AI systems, offering datasets, benchmarks, and insights to guide future research.
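论文中的监督(XGBoost)与半监督(SVDD)两条基线,可用如下草图复现其大致流程;这里以 sklearn 的 OneClassSVM(在 RBF 核下与 SVDD 等价)近似 SVDD,轨迹被抽象为固定长度特征向量并用合成数据代替,得到的准确率数字与论文无关:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(900, 16))   # 正常轨迹特征(步数、循环数、输出长度等的占位)
anomal = rng.normal(3, 1, size=(100, 16))   # 漂移/循环等隐式故障轨迹的占位特征
X = np.vstack([normal, anomal]); y = np.array([0] * 900 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 监督基线:XGBoost
clf = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("supervised acc:", round(clf.score(X_te, y_te), 3))

# 半监督基线:仅用正常轨迹训练单类模型,再在混合测试集上判定
oc = OneClassSVM(kernel="rbf", nu=0.1).fit(X_tr[y_tr == 0])
pred = (oc.predict(X_te) == -1).astype(int)   # -1 表示离群
print("semi-supervised acc:", round(float((pred == y_te).mean()), 3))
```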
zh

[AI-51] Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限的物联网(Internet-of-Things, IoT)设备上部署困难的问题,核心挑战在于模型参数量庞大、自回归推理(autoregressive inference)过程中迭代生成token以及不断增长的键值缓存(key-value cache, KV cache)带来的内存压力。现有分割计算(split computing)方法无法有效应对这些特性。解决方案的关键在于提出首个面向自回归推理的分割计算框架,其核心创新包括:1)提出单点分割压缩(One-Point Split Compression, OPSC),通过混合精度量化策略将模型分为前后端并分配不同精度以避免内存溢出;2)设计两阶段中间压缩流水线,结合阈值分割(Threshold Splitting, TS)与逐token自适应比特量化(Token-wise Adaptive Bit Quantization, TAB-Q),在保留关键激活信息的同时显著降低通信开销;3)构建统一优化框架,联合选择最优分割点、量化配置和序列长度,在满足严格内存与延迟约束下实现性能最优。

链接: https://arxiv.org/abs/2511.04002
作者: Mingyu Sung,Vikas Palakonda,Suhwan Im,Sunghwan Moon,Il-Min Kim,Sangseok Yun,Jae-Mo Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and expanding key-value (KV) cache requirements. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by strategically partitioning models into front-end and back-end segments with different precision levels. Second, we propose a two-stage intermediate compression pipeline that combines threshold splitting (TS) and token-wise adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations while dramatically reducing communication overhead. Third, we formulate a unified optimization framework that jointly selects optimal split points, quantization settings, and sequence lengths to satisfy strict memory and latency constraints. Extensive evaluations across diverse LLMs and hardware platforms demonstrate superior performance compared to state-of-the-art quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework achieves a 1.49x inference speedup and significant communication overhead reduction while maintaining or improving model accuracy.
zh

[AI-52] Accelerating scientific discovery with the common task framework

【速读】:该论文旨在解决科学与工程领域中机器学习(Machine Learning, ML)和人工智能(Artificial Intelligence, AI)算法在动态系统表征与控制中的评估难题,尤其是在有限数据和噪声测量条件下,如何对不同算法在预测、状态重构、泛化能力和控制等多样化科学目标上的性能进行客观比较。其解决方案的关键在于提出一个通用任务框架(Common Task Framework, CTF),该框架包含一系列具有实际应用背景的挑战性数据集,并为多种科学目标提供标准化的评价指标,从而成为推动ML/AI算法在传统领域(如语音识别、自然语言处理和计算机视觉)快速发展的关键使能技术,同时也为当前跨学科算法的比较与部署提供了客观基准。

链接: https://arxiv.org/abs/2511.04001
作者: J. Nathan Kutz,Peter Battaglia,Michael Brenner,Kevin Carlberg,Aric Hagberg,Shirley Ho,Stephan Hoyer,Henning Lange,Hod Lipson,Michael W. Mahoney,Frank Noe,Max Welling,Laure Zanna,Francis Zhu,Steven L. Brunton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Machine learning (ML) and artificial intelligence (AI) algorithms are transforming and empowering the characterization and control of dynamic systems in the engineering, physical, and biological sciences. These emerging modeling paradigms require comparative metrics to evaluate a diverse set of scientific objectives, including forecasting, state reconstruction, generalization, and control, while also considering limited data scenarios and noisy measurements. We introduce a common task framework (CTF) for science and engineering, which features a growing collection of challenge data sets with a diverse set of practical and common objectives. The CTF is a critically enabling technology that has contributed to the rapid advance of ML/AI algorithms in traditional applications such as speech recognition, language processing, and computer vision. There is a critical need for the objective metrics of a CTF to compare the diverse algorithms being rapidly developed and deployed in practice today across science and engineering.
zh

[AI-53] Hybrid Fuzzing with LLM -Guided Input Mutation and Semantic Feedback

【速读】:该论文旨在解决传统软件模糊测试(fuzzing)中因变异策略缺乏语义感知而导致的冗余测试用例多、深层程序状态探索缓慢的问题。其解决方案的关键在于提出一种融合静态分析与动态分析的混合模糊测试框架,通过大型语言模型(Large Language Model, LLM)引导输入变异并引入语义反馈机制:静态分析提取控制流和数据流信息生成结构化提示(prompt),驱动LLM生成语法正确且语义多样化的输入;执行过程中则以程序状态变化、异常类型和输出语义等信号增强传统覆盖率反馈,使模糊器能够优先选择触发新颖行为的输入,从而提升漏洞发现效率与深度。

链接: https://arxiv.org/abs/2511.03995
作者: Shiyin Lin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Software fuzzing has become a cornerstone in automated vulnerability discovery, yet existing mutation strategies often lack semantic awareness, leading to redundant test cases and slow exploration of deep program states. In this work, I present a hybrid fuzzing framework that integrates static and dynamic analysis with Large Language Model (LLM)-guided input mutation and semantic feedback. Static analysis extracts control-flow and data-flow information, which is transformed into structured prompts for the LLM to generate syntactically valid and semantically diverse inputs. During execution, I augment traditional coverage-based feedback with semantic feedback signals-derived from program state changes, exception types, and output semantics-allowing the fuzzer to prioritize inputs that trigger novel program behaviors beyond mere code coverage. I implement our approach atop AFL++, combining program instrumentation with embedding-based semantic similarity metrics to guide seed selection. Evaluation on real-world open-source targets, including libpng, tcpdump, and sqlite, demonstrates that our method achieves faster time-to-first-bug, higher semantic diversity, and a competitive number of unique bugs compared to state-of-the-art fuzzers. This work highlights the potential of combining LLM reasoning with semantic-aware feedback to accelerate and deepen vulnerability discovery.
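
以摘要中的“语义反馈”为例,其核心是把每次执行的行为摘要(状态变化、异常类型、输出语义)嵌入为向量,再以“与历史行为最大相似度”的反值作为新颖度来排序种子。下面给出一个最小 Python 示意:其中 embed 用字符哈希桩函数代替真实句向量模型,prioritize 等接口命名均为假设,并非论文随 AFL++ 发布的实现。

```python
import numpy as np

def embed(text, dim=64):
    # 桩:用哈希播种的随机向量模拟句向量模型(真实系统应替换为嵌入模型)
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def semantic_novelty(signature, corpus_vecs):
    # 行为越“陌生”(与历史行为的最大余弦相似度越低),新颖度越高
    v = embed(signature)
    if not corpus_vecs:
        return 1.0
    return 1.0 - max(float(v @ c) for c in corpus_vecs)

def prioritize(seeds, signatures, corpus_vecs, top_k=16):
    # 按语义新颖度降序挑选种子,优先变异可能触发新行为的输入
    scores = [semantic_novelty(s, corpus_vecs) for s in signatures]
    order = np.argsort(scores)[::-1]
    return [seeds[i] for i in order[:top_k]]
```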
zh

[AI-54] Multiscale Astrocyte Network Calcium Dynamics for Biologically Plausible Intelligence in Anomaly Detection

【速读】:该论文旨在解决传统离线训练的网络异常检测系统在面对概念漂移(concept drift)及新型威胁(如零日攻击或变种攻击)时性能下降的问题。其解决方案的关键在于提出一种受星形胶质细胞(astrocyte)钙离子(Ca²⁺)信号传导机制启发的Ca²⁺调制学习框架,通过将多细胞星形胶质细胞动力学模拟器与深度神经网络(DNN)耦合,利用IP₃介导的Ca²⁺释放、SERCA泵摄取以及间隙连接介导的电导感知扩散等三种核心机制,实现对动态数据模式的快速、上下文敏感适应,从而提升检测准确率并降低误报率。

链接: https://arxiv.org/abs/2511.03993
作者: Berk Iskar,Michael Taynnan Barros
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Network anomaly detection systems encounter several challenges with traditional detectors trained offline. They become susceptible to concept drift and new threats such as zero-day or polymorphic attacks. To address this limitation, we propose a Ca²⁺-modulated learning framework that draws inspiration from astrocytic Ca²⁺ signaling in the brain, where rapid, context-sensitive adaptation enables robust information processing. Our approach couples a multicellular astrocyte dynamics simulator with a deep neural network (DNN). The simulator models astrocytic Ca²⁺ dynamics through three key mechanisms: IP₃-mediated Ca²⁺ release, SERCA pump uptake, and conductance-aware diffusion through gap junctions between cells. Evaluation of our proposed network on CTU-13 (Neris) network traffic data demonstrates the effectiveness of our biologically plausible approach. The Ca²⁺-gated model outperforms a matched baseline DNN, achieving up to ~98% accuracy with reduced false positives and negatives across multiple train/test splits. Importantly, this improved performance comes with negligible runtime overhead once Ca²⁺ trajectories are precomputed. While demonstrated here for cybersecurity applications, this Ca²⁺-modulated learning framework offers a generic solution for streaming detection tasks that require rapid, biologically grounded adaptation to evolving data patterns.
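
摘要中的三种 Ca²⁺ 机制(IP₃ 介导释放、SERCA 回收、间隙连接扩散)可以用一个玩具级的欧拉积分模型来直观理解。以下 Python 示意中,全部参数取值、周期性边界以及“用 Ca²⁺ 轨迹门控隐层”的具体形式均为演示性假设,并非论文模拟器的实现:

```python
import numpy as np

N, dt = 8, 0.01            # 细胞数与积分步长(示意取值)
k_ip3, k_serca, g_gap = 0.9, 0.5, 0.2

ca = np.zeros(N)           # 各细胞胞浆 Ca2+ 浓度
ip3 = np.random.rand(N)    # 各细胞 IP3 水平(随机初始化)

def step(ca, ip3):
    release = k_ip3 * ip3 / (1.0 + ip3)               # IP3 介导的 Ca2+ 释放(饱和型)
    uptake = k_serca * ca                             # SERCA 泵回收
    lap = np.roll(ca, 1) + np.roll(ca, -1) - 2 * ca   # 间隙连接扩散(周期边界)
    return ca + dt * (release - uptake + g_gap * lap)

for _ in range(1000):
    ca = step(ca, ip3)

gate = 1.0 + np.tanh(ca)           # 预计算的 Ca2+ 轨迹作为门控信号
hidden = np.random.randn(N)        # 假想的 DNN 隐层激活
modulated = gate * hidden          # 逐元素调制(示意)
```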
zh

[AI-55] ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的自动化机器学习(Automated ML, AutoML)代理在评估候选模型架构时过度依赖重复的完整训练过程所带来的计算开销大、搜索空间扩展性差及迭代周期长的问题。其解决方案的关键在于提出一个名为ArchPilot的多智能体系统,该系统通过三个专业化智能体——编排智能体、生成智能体和评估智能体——协同工作:编排智能体采用受蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)启发并带有重启机制的算法进行自适应搜索;生成智能体负责迭代生成、优化与调试候选架构;评估智能体则通过代理训练(proxy training)运行、优化代理函数,并构建具有保真度感知的性能指标。这种多智能体协作机制显著减少了对昂贵全量训练的依赖,从而在有限预算下实现高效且高质量的机器学习工程。

链接: https://arxiv.org/abs/2511.03985
作者: Zhuowen Yuan,Tao Liu,Yang Yang,Yang Wang,Feng Qi,Kaushik Rangadurai,Bo Li,Shuang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent LLM-based agents have demonstrated strong capabilities in automated ML engineering. However, they heavily rely on repeated full training runs to evaluate candidate solutions, resulting in significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi-agent system that integrates architecture generation, proxy-based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a Monte Carlo Tree Search (MCTS)-inspired novel algorithm with a restart mechanism and manages memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and optimizes proxy functions, and aggregates the proxy scores into a fidelity-aware performance metric. This multi-agent collaboration allows ArchPilot to prioritize high-potential candidates with minimal reliance on expensive full training runs, facilitating efficient ML engineering under limited budgets. Experiments on MLE-Bench demonstrate that ArchPilot outperforms SOTA baselines such as AIDE and ML-Master, validating the effectiveness of our multi-agent system.
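
编排智能体的“MCTS 风格搜索 + 重启机制”可以浓缩为如下骨架:UCB1 选择、扩展、代理评分、回传,外加“连续多轮无改进就重置搜索树”的重启。proxy_eval 与 expand 分别代表评估智能体和生成智能体的接口,重启的具体触发条件为假设,均非论文原始实现:

```python
import math

class Node:
    def __init__(self, arch, parent=None):
        self.arch, self.parent = arch, parent     # arch:候选架构描述(假设为字符串)
        self.children, self.visits, self.value = [], 0, 0.0

def uct_select(node, c=1.4):
    # UCB1:在利用(均值)与探索(访问数)之间折中
    return max(node.children, key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def search(root, proxy_eval, expand, budget=100, restart_patience=20):
    best, best_score, stall = None, float("-inf"), 0
    for _ in range(budget):
        node = root
        while node.children:
            node = uct_select(node)
        child = Node(expand(node.arch), parent=node)  # 生成智能体扩展候选
        node.children.append(child)
        score = proxy_eval(child.arch)                # 评估智能体:代理训练打分
        n = child
        while n is not None:                          # 回传
            n.visits += 1; n.value += score; n = n.parent
        if score > best_score:
            best, best_score, stall = child.arch, score, 0
        else:
            stall += 1
        if stall >= restart_patience:                 # 重启:长期停滞则重置树
            root.children, stall = [], 0
    return best, best_score
```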
zh

[AI-56] PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction

【速读】:该论文旨在解决SARS-CoV-2病毒在演化过程中不断出现免疫逃逸变异株所带来的公共卫生与疫苗研发挑战,特别是如何从高噪声的病毒基因组序列中准确预测未来突变。其核心问题在于传统基于原始RNA序列的模型难以有效处理测序噪声并捕捉病毒进化的层级结构。解决方案的关键是提出PETRA(Pretrained Evolutionary TRAnsformer),该方法利用系统发育树推导出的进化轨迹作为输入而非原始核酸序列,从而有效降低测序噪声干扰,并建模病毒进化的层次化特征;同时引入加权训练框架以缓解全球序列数据在地理和时间上的分布不均问题,显著提升了对未来突变的预测性能,在核苷酸和刺突蛋白氨基酸突变预测上分别达到9.45%和17.10%的加权召回率(weighted recall@1),远优于现有最佳基线模型(分别为0.49%和6.64%)。

链接: https://arxiv.org/abs/2511.03976
作者: Xu Zou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注: preprint

点击查看摘要

Abstract:Since its emergence, SARS-CoV-2 has demonstrated a rapid and unpredictable evolutionary trajectory, characterized by the continual emergence of immune-evasive variants. This poses persistent challenges to public health and vaccine development. While large-scale generative pre-trained transformers (GPTs) have revolutionized the modeling of sequential data, their direct applications to noisy viral genomic sequences are limited. In this paper, we introduce PETRA(Pretrained Evolutionary TRAnsformer), a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees rather than raw RNA sequences. This method effectively mitigates sequencing noise and captures the hierarchical structure of viral evolution. With a weighted training framework to address substantial geographical and temporal imbalances in global sequence data, PETRA excels in predicting future SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide mutations and 17.10% for spike amino-acid mutations, compared to 0.49% and 6.64% respectively for the best baseline. PETRA also demonstrates its ability to aid in the real-time mutation prediction of major clades like 24F(XEC) and 25A(LP.8.1). The code is open sourced on this https URL
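
摘要中的核心指标 weighted recall@1(以样本权重纠正地理/时间分布不均后,top-1 预测命中真实突变的加权比例)的一种合理化实现如下;具体加权与命中的定义是否与论文完全一致属于假设:

```python
import numpy as np

def weighted_recall_at_1(pred_top1, gold_sets, weights):
    # pred_top1: 每个样本的 top-1 预测突变;gold_sets: 实际发生的突变集合;
    # weights: 样本权重(用于纠正地理/时间不均衡)
    hits = np.array([p in g for p, g in zip(pred_top1, gold_sets)], dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * hits) / np.sum(w))

# 用法示意(突变名仅为示例)
preds = ["S:N501Y", "S:E484K"]
golds = [{"S:N501Y", "S:D614G"}, {"S:L452R"}]
print(weighted_recall_at_1(preds, golds, [0.7, 0.3]))  # 0.7
```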
zh

[AI-57] Extracting Causal Relations in Deep Knowledge Tracing

【速读】:该论文试图解决的问题是:当前对深度知识追踪(Deep Knowledge Tracing, DKT)模型性能提升机制的理解存在偏差,即普遍认为其优势源于建模课程中不同知识组件(Knowledge Components, KCs)之间的双向关系,而忽略了潜在的因果结构。解决方案的关键在于通过将练习关系图剪枝为有向无环图(Directed Acyclic Graph, DAG),并基于因果子集训练DKT,发现其预测能力与因果结构高度一致;同时提出一种利用DKT学习到的表示来提取练习关系DAG的替代方法,实证验证了DKT的有效性主要来源于其对知识组件间因果依赖关系的近似能力,而非简单的关联映射。

链接: https://arxiv.org/abs/2511.03948
作者: Kevin Hong,Kia Karbasi,Gregory Pottie
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted for publication in the Proceedings of the 18th International Conference on Educational Data Mining, 6 pages, 1 figure

点击查看摘要

Abstract:A longstanding goal in computational educational research is to develop explainable knowledge tracing (KT) models. Deep Knowledge Tracing (DKT), which leverages a Recurrent Neural Network (RNN) to predict student knowledge and performance on exercises, has been proposed as a major advancement over traditional KT methods. Several studies suggest that its performance gains stem from its ability to model bidirectional relationships between different knowledge components (KCs) within a course, enabling the inference of a student’s understanding of one KC from their performance on others. In this paper, we challenge this prevailing explanation and demonstrate that DKT’s strength lies in its implicit ability to model prerequisite relationships as a causal structure, rather than bidirectional relationships. By pruning exercise relation graphs into Directed Acyclic Graphs (DAGs) and training DKT on causal subsets of the Assistments dataset, we show that DKT’s predictive capabilities align strongly with these causal structures. Furthermore, we propose an alternative method for extracting exercise relation DAGs using DKT’s learned representations and provide empirical evidence supporting our claim. Our findings suggest that DKT’s effectiveness is largely driven by its capacity to approximate causal dependencies between KCs rather than simple relational mappings.
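
“将练习关系图剪枝为 DAG”可以有多种策略;下面用 networkx 给出一种直观做法——迭代找环并移除环上权重最小的边,直到无环为止。按权重删边只是示意性选择,论文的具体剪枝规则以原文为准:

```python
import networkx as nx

def prune_to_dag(relation_graph: nx.DiGraph) -> nx.DiGraph:
    # 假设每条边带 'weight' 属性表示关系强度;保留强先修关系,删弱边破环
    g = relation_graph.copy()
    while True:
        try:
            cycle = nx.find_cycle(g)          # 找任意一个有向环
        except nx.NetworkXNoCycle:
            return g                          # 已是 DAG
        u, v = min(cycle, key=lambda e: g[e[0]][e[1]].get("weight", 1.0))
        g.remove_edge(u, v)

# 用法示意
g = nx.DiGraph()
g.add_weighted_edges_from([("加法", "乘法", 0.9), ("乘法", "方程", 0.8),
                           ("方程", "加法", 0.1)])   # 人为制造一个环
dag = prune_to_dag(g)
print(list(dag.edges))   # 弱边 ("方程", "加法") 被移除
```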
zh

[AI-58] PEFA-AI: Advancing Open-source LLMs for RTL generation using Progressive Error Feedback Agentic-AI

【速读】:该论文旨在解决无监督自动生成 Register Transfer Level (RTL) 代码的难题,即如何在无需人工干预的情况下,从自然语言描述中高效、准确地生成可综合、功能正确且编译通过的 RTL 代码。其解决方案的关键在于提出了一种名为渐进式错误反馈机制(Progressive Error Feedback System of Agents, PEFA) 的多智能体协作流程,该机制通过迭代式错误反馈不断优化生成过程,逐步提升任务复杂度,并结合专用大语言模型(LLM)与硬件仿真工具协同工作,从而实现高精度、高效率的 RTL 自动生成。

链接: https://arxiv.org/abs/2511.03934
作者: Athma Narayanan,Mahesh Subedar,Omesh Tickoo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Appeared in the Design Automation Conference (DAC) 2025, Workshop Poster on June 22, 2025

点击查看摘要

Abstract:We present an agentic flow consisting of multiple agents that combine specialized LLMs and hardware simulation tools to collaboratively complete the complex task of Register Transfer Level (RTL) generation without human intervention. A key feature of the proposed flow is the progressive error feedback system of agents (PEFA), a self-correcting mechanism that leverages iterative error feedback to progressively increase the complexity of the approach. The generated RTL includes checks for compilation, functional correctness, and synthesizable constructs. To validate this adaptive approach to code generation, benchmarking is performed using two opensource natural language-to-RTL datasets. We demonstrate the benefits of the proposed approach implemented on an open source agentic framework, using both open- and closed-source LLMs, effectively bridging the performance gap between them. Compared to previously published methods, our approach sets a new benchmark, providing state-of-the-art pass rates while being efficient in token counts.
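
渐进式错误反馈(PEFA)的主循环可以概括为“生成 → 编译检查 → 仿真检查 → 把错误累积注入下一轮提示”。以下为接口层面的最小示意,llm、compile_rtl、run_sim 均为假设的外部接口,提示词写法亦为演示:

```python
def pefa_generate(spec, llm, compile_rtl, run_sim, max_rounds=5):
    prompt = f"Write synthesizable Verilog for: {spec}"
    history = []
    for _ in range(max_rounds):
        rtl = llm(prompt)
        ok, compile_log = compile_rtl(rtl)       # 可综合性/编译检查
        if ok:
            passed, sim_log = run_sim(rtl)       # 功能正确性仿真
            if passed:
                return rtl                       # 全部检查通过
            feedback = sim_log
        else:
            feedback = compile_log
        history.append(feedback)
        # 累积错误反馈注入下一轮提示,逐步提高修复的针对性
        prompt = (f"Write synthesizable Verilog for: {spec}\n"
                  "Previous attempts failed with:\n" + "\n".join(history))
    return None   # 预算内未能修复
```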
zh

[AI-59] Collaborative Agents for Automated Program Repair in Ruby

【速读】:该论文旨在解决当前自动化程序修复(Automated Program Repair, APR)方法在Ruby语言上研究不足且计算成本高的问题,尤其是在面对Ruby开发者常遇到的错误类型(如运行时错误、编译错误和错误答案)时缺乏高效、轻量级的修复方案。其关键解决方案是提出RAMP框架,该框架将程序修复建模为一个反馈驱动的迭代过程,通过一组协作智能体(agents)协同生成针对性测试用例、反思错误并迭代优化候选修复方案,直至找到正确解;RAMP不依赖大规模多语言修复数据库或昂贵的微调,而是基于轻量级提示(prompting)与测试驱动反馈直接作用于Ruby代码,从而实现高效率与高精度的修复效果,在XCodeEval基准测试中达到67%的pass@1指标,并在五次迭代内快速收敛。

链接: https://arxiv.org/abs/2511.03925
作者: Nikta Akbarpour,Mahdieh Sadat Benis,Fatemeh Hendijani Fard,Ali Ouni,Mohamed Aymen Saied
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated Program Repair (APR) has advanced rapidly with Large Language Models (LLMs), but most existing methods remain computationally expensive and focused on a small set of languages. Ruby, despite its widespread use in web development and the persistent challenges faced by its developers, has received little attention in APR research. In this paper, we introduce RAMP, a novel lightweight framework that formulates program repair as a feedback-driven, iterative process for Ruby. RAMP employs a team of collaborative agents that generate targeted tests, reflect on errors, and refine candidate fixes until a correct solution is found. Unlike prior approaches, RAMP is designed to avoid reliance on large multilingual repair databases or costly fine-tuning, instead operating directly on Ruby through lightweight prompting and test-driven feedback. Evaluation on the XCodeEval benchmark shows that RAMP achieves a pass@1 of 67% on Ruby, outperforming prior approaches. RAMP converges quickly within five iterations, and ablation studies confirm that test generation and self-reflection are key drivers of its performance. Further analysis shows that RAMP is particularly effective at repairing wrong answers, compilation errors, and runtime errors. Our approach provides new insights into multi-agent repair strategies, and establishes a foundation for extending LLM-based debugging tools to under-studied languages.
zh

[AI-60] Evolutionary Optimization Trumps Adam Optimization on Embedding Space Exploration

【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像生成过程中难以进行高效可控优化的问题,尤其是在不进行昂贵再训练的前提下实现对特定目标(如视觉美感与提示词一致性)的精准调控。其解决方案的关键在于引入一种无梯度优化方法——可分离协方差矩阵自适应进化策略(sep-CMA-ES),并将其应用于Stable Diffusion XL Turbo的提示嵌入向量(prompt embedding vector)空间中进行探索。通过结合LAION美学预测器V2与CLIPScore构建加权适应度函数,实现了对图像美学和提示契合度之间的灵活权衡,实验表明sep-CMA-ES在多个指标上显著优于传统的Adam优化器,验证了进化算法在嵌入空间优化中的有效性与高效性。

链接: https://arxiv.org/abs/2511.03913
作者: Domício Pereira Neto,João Correia,Penousal Machado
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures, 3 tables, 6 appendix figures, 1 appendix table

点击查看摘要

Abstract:Deep generative models, especially diffusion architectures, have transformed image generation; however, they are challenging to control and optimize for specific goals without expensive retraining. Embedding Space Exploration, especially with Evolutionary Algorithms (EAs), has been shown to be a promising method for optimizing image generation, particularly within Diffusion Models. Therefore, in this work, we study the performance of an evolutionary optimization method, namely Separable Covariance Matrix Adaptation Evolution Strategy (sep-CMA-ES), against the widely adopted Adaptive Moment Estimation (Adam), applied to Stable Diffusion XL Turbo’s prompt embedding vector. The evaluation of images combines the LAION Aesthetic Predictor V2 with CLIPScore into a weighted fitness function, allowing flexible trade-offs between visual appeal and adherence to prompts. Experiments on a subset of the Parti Prompts (P2) dataset showcase that sep-CMA-ES consistently yields superior improvements in aesthetic and alignment metrics in comparison to Adam. Results indicate that the evolutionary method provides efficient, gradient-free optimization for diffusion models, enhancing controllability without the need for fine-tuning. This study emphasizes the potential of evolutionary methods for embedding space exploration of deep generative models and outlines future research directions.
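
可分离 CMA-ES 在 Python 中可直接用开源 cma 包实现(选项 'CMA_diagonal': True 即对角协方差)。下面的示意把图像生成与两个打分器都换成桩函数,只演示“负的加权适应度 + sep-CMA-ES 最小化”这一调用方式;真实系统中应接入 SDXL Turbo、LAION 美学预测器与 CLIPScore:

```python
import numpy as np
import cma

# 桩函数:真实系统中分别对应图像生成与两种打分(此处仅为可运行演示)
def generate_image(vec):  return vec
def aesthetic_v2(img):    return float(-np.mean(img ** 2))
def clip_score(img):      return float(-np.mean(np.abs(img)))

def fitness(vec):
    img = generate_image(vec)
    score = 0.5 * aesthetic_v2(img) + 0.5 * clip_score(img)  # 加权适应度
    return -score                                            # cma 做最小化,故取负

dim = 256   # 示意维度;真实提示嵌入维度远大于此
es = cma.CMAEvolutionStrategy(np.zeros(dim), 0.1,
                              {'CMA_diagonal': True, 'maxiter': 30})
es.optimize(fitness)
best_embedding = es.result.xbest
```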
zh

[AI-61] SnappyMeal: Design and Longitudinal Evaluation of a Multimodal AI Food Logging Application

【速读】:该论文旨在解决当前食物记录方法(如手写日记和基于应用的记录)灵活性差、用户依从性低且营养信息汇总不准确的问题。解决方案的关键在于提出SnappyMeal系统,该系统是一种基于人工智能的膳食追踪工具,通过多模态输入(如图像、文本和购物收据)实现更灵活的食物记录,并引入目标依赖的后续问题以智能获取缺失上下文信息,同时结合用户购物收据与营养数据库的信息检索机制提升准确性。

链接: https://arxiv.org/abs/2511.03907
作者: Liam Bakar,Zachary Englhardt,Vidya Srinivas,Girish Narayanswamy,Dilini Nissanka,Shwetak Patel,Vikram Iyer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 24 pages, 15 figures

点击查看摘要

Abstract:Food logging, both self-directed and prescribed, plays a critical role in uncovering correlations between diet, medical, fitness, and health outcomes. Through conversations with nutritional experts and individuals who practice dietary tracking, we find current logging methods, such as handwritten and app-based journaling, are inflexible and result in low adherence and potentially inaccurate nutritional summaries. These findings, corroborated by prior literature, emphasize the urgent need for improved food logging methods. In response, we propose SnappyMeal, an AI-powered dietary tracking system that leverages multimodal inputs to enable users to more flexibly log their food intake. SnappyMeal introduces goal-dependent follow-up questions to intelligently seek missing context from the user and information retrieval from user grocery receipts and nutritional databases to improve accuracy. We evaluate SnappyMeal through publicly available nutrition benchmarks and a multi-user, 3-week, in-the-wild deployment capturing over 500 logged food instances. Users strongly praised the multiple available input methods and reported a strong perceived accuracy. These insights suggest that multimodal AI systems can be leveraged to significantly improve dietary tracking flexibility and context-awareness, laying the groundwork for a new class of intelligent self-tracking applications.
zh

[AI-62] Secure Code Generation at Scale with Reflexion

【速读】:该论文旨在解决生成式 AI(Generative AI)在代码生成过程中存在的安全性问题,即模型生成的代码虽能正常运行,但可能包含安全漏洞。其核心挑战在于如何提升代码生成的安全性,尤其是在不依赖特定训练数据或提示词污染的情况下评估和改进模型输出。解决方案的关键在于采用 Instruct Prime 方法消除合规性提示和提示词污染,并引入三轮反思式提示(reflexion prompting)策略,在零样本基准(t₀)基础上逐步优化代码安全性;实验结果表明,该方法显著提升了安全修复率(Repair),减少了回归错误(Regression),并实现了净收益(NetGain),尤其在第一轮到第二轮之间效果最为明显,平均准确率从 70.74% 提升至 79.43%。

链接: https://arxiv.org/abs/2511.03898
作者: Arup Datta,Ahmed Aljohani,Hyunsook Do
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Software Engineering (cs.SE)
备注: Accepted for publication at the 2nd IEEE International Conference on AI-powered Software (AIware 2025)

点击查看摘要

Abstract:Large language models (LLMs) are now widely used to draft and refactor code, but code that works is not necessarily secure. We evaluate secure code generation using Instruct Prime, which eliminates compliance-required prompts and cue contamination, and assess five instruction-tuned code LLMs using a zero-shot baseline and a three-round reflexion prompting approach. Security is measured using the Insecure Code Detector (ICD), and results are reported by measuring Repair, Regression, and NetGain metrics, considering the programming language and CWE family. Our findings show that insecurity remains common at the first round: roughly 25-33% of programs are insecure at the zero-shot baseline (t0). Weak cryptography/config-dependent bugs are the hardest to avoid, while templated ones like XSS, code injection, and hard-coded secrets are handled more reliably. Python yields the highest secure rates; C and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion prompting improves security for all models, improving average accuracy from 70.74% at t0 to 79.43% at t3, with the largest gains in the first round followed by diminishing returns. The trends in Repair, Regression, and NetGain metrics show that applying one to two rounds produces most of the benefits. A replication package is available at this https URL.
zh

[AI-63] KnowThyself: An Agentic Assistant for LLM Interpretability AAAI26

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)可解释性工具碎片化、代码依赖性强且用户交互门槛高的问题。解决方案的关键在于提出一个名为KnowThyself的代理式助手,其核心架构由三个模块组成:首先由一个编排器LLM对用户自然语言提问进行语义重构,随后通过代理路由模块将问题分发至专用解释模块,最终输出结果被整合为结构化的解释内容。该设计将整个可解释性分析流程嵌入对话式工作流中,显著降低了技术门槛,并构建了一个可扩展的LLM可解释性平台。

链接: https://arxiv.org/abs/2511.03878
作者: Suraj Prasai,Mengnan Du,Ying Zhang,Fan Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 5 pages, 1 figure, Accepted for publication at the Demonstration Track of the 40th AAAI Conference on Artificial Intelligence (AAAI 26)

点击查看摘要

Abstract:We develop KnowThyself, an agentic assistant that advances large language model (LLM) interpretability. Existing tools provide useful insights but remain fragmented and code-intensive. KnowThyself consolidates these capabilities into a chat-based interface, where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router further directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible platform for LLM inspection. By embedding the whole process into a conversational workflow, KnowThyself offers a robust foundation for accessible LLM interpretability.
zh

[AI-64] OMPILOT: Harnessing Transformer Models for Auto Parallelization to Shared Memory Computing Paradigms

【速读】:该论文旨在解决传统代码翻译方法在跨语言转换中准确率低、灵活性差以及难以有效实现共享内存并行化的问题,特别是在将C++代码自动转化为OpenMP(一种用于共享内存并行编程的API)时面临的挑战。解决方案的关键在于提出OMPILOT,一个针对此特定领域设计的编码器-解码器Transformer架构,其通过定制化的预训练目标融合了并行构造语义,并结合无监督与有监督学习策略提升模型鲁棒性;此外,OMPILOT采用函数级而非仅循环级的翻译粒度,以捕获更广泛的语义上下文,从而显著改善并行代码生成的质量和正确性。为评估效果,作者还引入了OMPBLEU这一专门针对OpenMP并行结构正确性和质量的复合指标,弥补了传统翻译评价指标的不足。

链接: https://arxiv.org/abs/2511.03866
作者: Arijit Bhattacharjee,Ali TehraniJamsaz,Le Chen,Niranjan Hasabnis,Mihai Capota,Nesreen Ahmed,Ali Jannesari
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have significantly accelerated progress in code translation, enabling more accurate and efficient transformation across programming languages. While originally developed for natural language processing, LLMs have shown strong capabilities in modeling programming language syntax and semantics, outperforming traditional rule-based systems in both accuracy and flexibility. These models have streamlined cross-language conversion, reduced development overhead, and accelerated legacy code migration. In this paper, we introduce OMPILOT, a novel domain-specific encoder-decoder transformer tailored for translating C++ code into OpenMP, enabling effective shared-memory parallelization. OMPILOT leverages custom pre-training objectives that incorporate the semantics of parallel constructs and combines both unsupervised and supervised learning strategies to improve code translation robustness. Unlike previous work that focused primarily on loop-level transformations, OMPILOT operates at the function level to capture a wider semantic context. To evaluate our approach, we propose OMPBLEU, a novel composite metric specifically crafted to assess the correctness and quality of OpenMP parallel constructs, addressing limitations in conventional translation metrics.
zh

[AI-65] Levers of Power in the Field of AI

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)治理中决策者如何在不同机构背景下行使权力的问题,尤其关注个人能动性、组织逻辑与制度基础设施之间的交互作用。其解决方案的关键在于构建一个基于新制度主义理论的治理框架,并通过针对北美和欧洲高阶决策者的匿名问卷调查,提炼出十二个虚构但具代表性的角色画像(personas),以此揭示决策者在AI技术变革中的权力运作机制及其对制度稳定与变革的影响。研究进一步提出五项可验证假设,为政策制定者与民间社会参与者提供实操性的权力介入路径。

链接: https://arxiv.org/abs/2511.03859
作者: Tammy Mackenzie,Sukriti Punj,Natalie Perez,Sreyoshi Bhaduri,Branislav Radeljic
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 18 pages, research submission

点击查看摘要

Abstract:This paper examines how decision makers in academia, government, business, and civil society navigate questions of power in implementations of artificial intelligence. The study explores how individuals experience and exercise levers of power, which are presented as social mechanisms that shape institutional responses to technological change. The study reports on responses to personalized questionnaires designed to gather insight into a decision maker's institutional purview, based on an institutional governance framework developed from the work of Neo-institutionalists. Findings present the anonymized, real responses and circumstances of respondents in the form of twelve fictional personas of high-level decision makers from North America and Europe. These personas illustrate how personal agency, organizational logics, and institutional infrastructures may intersect in the governance of AI. The decision makers' responses to the questionnaires then inform a discussion of the field-level personal power of decision makers, methods of fostering institutional stability in times of change, and methods of influencing institutional change in the field of AI. The final section of the discussion presents a table of the dynamics of the levers of power in the field of AI for change makers and five testable hypotheses for institutional and social movement researchers. In summary, this study provides insight on the means for policymakers within institutions and their counterparts in civil society to personally engage with AI governance.
zh

[AI-66] To See or To Read: User Behavior Reasoning in Multimodal LLMs

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在用户行为推理任务中,文本与图像表示形式对模型性能影响的不明确问题。其解决方案的关键在于提出一个系统性的基准测试框架 BehaviorLens,通过将交易数据分别以文本段落、散点图和流程图三种模态进行表示,对六种 MLLMs 在用户行为序列预测任务中的表现进行量化评估。实验结果表明,在不增加额外计算成本的前提下,图像模态表示可使 MLLMs 的下一次购买预测准确率提升 87.5%,显著优于等效的文本表示方式。

链接: https://arxiv.org/abs/2511.03845
作者: Tianning Dong,Luyi Ma,Varun Vasudevan,Jason Cho,Sushant Kumar,Kannan Achan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Efficient Reasoning

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when data is represented as images, MLLMs' next-purchase prediction accuracy improves by 87.5% compared with an equivalent textual representation without any additional computational cost.
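
把交易序列渲染成图像模态其实很轻量。下面用 matplotlib 画一个带品类标注的消费散点图(时间 vs 金额),保存为 PNG 后即可作为图像输入送入 MLLM;数据与字段均为虚构示例:

```python
import matplotlib.pyplot as plt

# 假想购买序列:(距今天数, 消费金额, 品类)
purchases = [(30, 12.5, "grocery"), (21, 80.0, "electronics"),
             (14, 9.9, "grocery"), (7, 45.0, "apparel"), (2, 15.2, "grocery")]

days = [p[0] for p in purchases]
spend = [p[1] for p in purchases]

fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(days, spend)
for d, s, cat in purchases:
    ax.annotate(cat, (d, s), fontsize=8)   # 标注品类,便于模型读图
ax.invert_xaxis()                          # 左远右近,呈现时间顺序
ax.set_xlabel("days ago"); ax.set_ylabel("spend ($)")
fig.savefig("behavior_scatter.png", dpi=150)
```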
zh

[AI-67] Optimizing Reasoning Efficiency through Prompt Difficulty Prediction NEURIPS2025

【速读】:该论文旨在解决推理型语言模型(reasoning language models)在复杂任务中虽表现优异但部署成本高昂的问题,主要瓶颈在于模型规模庞大及推理过程中的长轨迹计算开销。其解决方案的关键在于提出一种路由(routing)机制,通过训练轻量级预测器来评估问题难度或模型正确性,从而将每个问题分配给最可能成功解决且规模最小的模型,实现计算资源的高效利用。实验表明,该方法在多个数学基准测试中显著降低计算消耗,同时保持与大型模型(如s1.1-32B)相当的准确性。

链接: https://arxiv.org/abs/2511.03808
作者: Bo Zhao,Berkcan Kapusuzoglu,Kartik Balasubramaniam,Sambit Sahu,Supriyo Chakraborty,Genta Indra Winata
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Workshop on Efficient Reasoning

点击查看摘要

Abstract:Reasoning language models perform well on complex tasks but are costly to deploy due to their size and long reasoning traces. We propose a routing approach that assigns each problem to the smallest model likely to solve it, reducing compute without sacrificing accuracy. Using intermediate representations from s1.1-32B, we train lightweight predictors of problem difficulty or model correctness to guide routing across a pool of reasoning models. On diverse math benchmarks, routing improves efficiency over random assignment and matches s1.1-32B’s performance while using significantly less compute. Our results demonstrate that difficulty-aware routing is effective for cost-efficient deployment of reasoning models.
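
“难度感知路由”的骨架非常简单:拿中间表示训练一个轻量分类器,预测小模型能否答对,再按阈值决定路由。以下示意用合成特征与逻辑回归代替论文中 s1.1-32B 的真实中间表示与预测器,阈值取值亦为假设:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))   # 合成的“问题特征”(真实系统取自模型中间表示)
small_ok = (feats[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)  # 合成标签

router = LogisticRegression(max_iter=1000).fit(feats, small_ok)

def route(problem_feat, threshold=0.7):
    # 小模型成功概率足够高则路由到小模型,否则回退到大模型
    p = router.predict_proba(problem_feat.reshape(1, -1))[0, 1]
    return "small-model" if p >= threshold else "s1.1-32B"

print(route(feats[0]))
```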
zh

[AI-68] Scaling Agent Learning via Experience Synthesis

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Model, LLM)智能体中实际应用时面临的挑战,包括昂贵的环境交互(rollouts)、任务多样性不足、奖励信号不可靠以及基础设施复杂性等问题,这些问题阻碍了可扩展的经验数据收集。解决方案的关键在于提出DreamGym,一个统一的框架,通过将环境动态蒸馏为基于推理的经验模型(reasoning-based experience model),利用逐步推理生成一致的状态转移和反馈信号,从而实现高效、可扩展的在线RL训练;同时结合初始化于离线真实数据的经验回放缓冲区(experience replay buffer)与自适应任务生成机制,提升训练稳定性、过渡质量及知识获取效率,最终在合成环境和仿真到现实(sim-to-real)迁移场景中均显著优于现有方法。

链接: https://arxiv.org/abs/2511.03773
作者: Zhaorun Chen,Zhuokai Zhao,Kai Zhang,Bo Liu,Qi Qi,Yifan Wu,Tarun Kalluri,Sara Cao,Yuanhao Xiong,Haibo Tong,Huaxiu Yao,Hengduo Li,Jiacheng Zhu,Xian Li,Dawn Song,Bo Li,Jason Weston,Dat Huynh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
zh

[AI-69] OptiMA: A Transaction-Based Framework with Throughput Optimization for Very Complex Multi-Agent Systems

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在向更大规模和更复杂模型演进过程中所面临的两大问题:一是系统对故障的敏感性增加,二是性能瓶颈显现。为应对前者,作者提出基于事务(transaction)的框架来设计超复杂多智能体系统(Very Complex Multi-Agent Systems, VCMAS),以提升系统的容错能力;为缓解后者,进一步将事务调度(transaction scheduling)机制集成到该框架中,从而优化系统执行效率。其解决方案的关键在于构建一个事务驱动的架构(OptiMA框架),并通过理论分析与实证验证表明,该方案不仅能支持百级规模智能体的稳定运行,还能通过事务调度实现超过16%的性能提升。

链接: https://arxiv.org/abs/2511.03761
作者: Umut Çalıkyılmaz,Nitin Nayak,Jinghua Groppe,Sven Groppe
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:In recent years, the research of multi-agent systems has taken a direction to explore larger and more complex models to fulfill sophisticated tasks. We point out two possible pitfalls that might be caused by increasing complexity; susceptibilities to faults, and performance bottlenecks. To prevent the former threat, we propose a transaction-based framework to design very complex multi-agent systems (VCMAS). To address the second threat, we offer to integrate transaction scheduling into the proposed framework. We implemented both of these ideas to develop the OptiMA framework and show that it is able to facilitate the execution of VCMAS with more than a hundred agents. We also demonstrate the effect of transaction scheduling on such a system by showing improvements up to more than 16%. Furthermore, we also performed a theoretical analysis on the transaction scheduling problem and provided practical tools that can be used for future research on it.
zh

[AI-70] Laugh, Relate, Engage: Stylized Comment Generation for Short Videos

【速读】:该论文旨在解决短视频平台中评论生成的两大核心挑战:一是确保生成内容符合平台规范(compliance),二是实现风格多样性与上下文感知能力(stylistic diversity and contextual awareness)。传统方法难以在保证合规性的前提下,灵活控制评论的语用风格(如双关、押韵、讽刺等)并准确匹配视频内容。解决方案的关键在于提出一个模块化多智能体系统(modular multi-agent system, MAS)——LOLGORITHM,其通过视频分割、情感与语境分析、以及基于风格提示的构造机制,结合多模态大语言模型(multimodal large language model, MLLM)直接处理视频输入,并借助显式提示标记和少量示例实现细粒度风格控制。该框架支持六种特定评论风格,在跨语言(中文Douyin与英文YouTube)数据集上验证了其有效性,显著优于基线模型,偏好率分别达90%以上(Douyin)和87.55%(YouTube)。

链接: https://arxiv.org/abs/2511.03757
作者: Xuan Ouyang,Senan Wang,Bouzhou Wang,Siyuan Xiahou,Jinrong Zhou,Yuekang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Short-video platforms have become a central medium in the modern Internet landscape, where efficient information delivery and strong interactivity are reshaping user engagement and cultural dissemination. Among the various forms of user interaction, comments play a vital role in fostering community participation and enabling content re-creation. However, generating comments that are both compliant with platform guidelines and capable of exhibiting stylistic diversity and contextual awareness remains a significant challenge. We introduce LOLGORITHM, a modular multi-agent system (MAS) designed for controllable short-video comment generation. The system integrates video segmentation, contextual and affective analysis, and style-aware prompt construction. It supports six distinct comment styles: puns (homophones), rhyming, meme application, sarcasm (irony), plain humor, and content extraction. Powered by a multimodal large language model (MLLM), LOLGORITHM directly processes video inputs and achieves fine-grained style control through explicit prompt markers and few-shot examples. To support development and evaluation, we construct a bilingual dataset using official APIs from Douyin (Chinese) and YouTube (English), covering five popular video genres: comedy skits, daily life jokes, funny animal clips, humorous commentary, and talk shows. Evaluation combines automated metrics (originality, relevance, and style conformity) with a large-scale human preference study involving 40 videos and 105 participants. Results show that LOLGORITHM significantly outperforms baseline models, achieving preference rates of over 90% on Douyin and 87.55% on YouTube. This work presents a scalable and culturally adaptive framework for stylized comment generation on short-video platforms, offering a promising path to enhance user engagement and creative interaction.
zh

[AI-71] Federated Learning with Gramian Angular Fields for Privacy-Preserving ECG Classification on Heterogeneous IoT Devices

【速读】:该论文旨在解决物联网(IoT)医疗环境中心电图(ECG)分类任务中的隐私保护与计算资源受限问题。其核心挑战在于如何在不传输原始敏感医疗数据的前提下,实现高效、准确的分布式模型训练。解决方案的关键在于提出一种基于联邦学习(Federated Learning, FL)的框架,将一维ECG信号转换为二维Gramian Angular Field(GAF)图像,从而利用卷积神经网络(CNN)进行特征提取,并确保数据始终保留在本地设备上。该方法在异构边缘设备(如树莓派4、笔记本和服务器)上的实验验证了其在分类精度(95.18%)和通信效率方面的优越性,证明了轻量级隐私保护AI在智能健康系统中部署的可行性与有效性。

链接: https://arxiv.org/abs/2511.03753
作者: Youssef Elmir,Yassine Himeur,Abbes Amira
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Networking and Internet Architecture (cs.NI)
备注: 06 pages, 03 figures, accepted for presentation at the 7th IEEE Computing, Communications and IoT Applications Conference (ComComAp 2025)

点击查看摘要

Abstract:This study presents a federated learning (FL) framework for privacy-preserving electrocardiogram (ECG) classification in Internet of Things (IoT) healthcare environments. By transforming 1D ECG signals into 2D Gramian Angular Field (GAF) images, the proposed approach enables efficient feature extraction through Convolutional Neural Networks (CNNs) while ensuring that sensitive medical data remain local to each device. This work is among the first to experimentally validate GAF-based federated ECG classification across heterogeneous IoT devices, quantifying both performance and communication efficiency. To evaluate feasibility in realistic IoT settings, we deployed the framework across a server, a laptop, and a resource-constrained Raspberry Pi 4, reflecting edge-cloud integration in IoT ecosystems. Experimental results demonstrate that the FL-GAF model achieves a high classification accuracy of 95.18% in a multi-client setup, significantly outperforming a single-client baseline in both accuracy and training time. Despite the added computational complexity of GAF transformations, the framework maintains efficient resource utilization and communication overhead. These findings highlight the potential of lightweight, privacy-preserving AI for IoT-based healthcare monitoring, supporting scalable and secure edge deployments in smart health systems.
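
GAF 变换本身只需几行 NumPy:把一维片段归一化到 [-1, 1],取反余弦得到极角 φ,GASF 为 cos(φᵢ+φⱼ),GADF 为 sin(φᵢ−φⱼ)。下面的最小示意用正弦波代替真实 ECG,仅演示变换流程:

```python
import numpy as np

def gramian_angular_field(x, kind="summation"):
    x = np.asarray(x, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1   # 归一化到 [-1, 1]
    phi = np.arccos(np.clip(x, -1, 1))                         # 极坐标角度
    if kind == "summation":
        return np.cos(phi[:, None] + phi[None, :])             # GASF
    return np.sin(phi[:, None] - phi[None, :])                 # GADF

segment = np.sin(np.linspace(0, 4 * np.pi, 128))   # 假想的 128 点 ECG 片段
gaf_img = gramian_angular_field(segment)           # 128x128 图像,可直接送入 CNN
```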
zh

[AI-72] Applying Time Series Deep Learning Models to Forecast the Growth of Perennial Ryegrass in Ireland

【速读】:该论文旨在解决爱尔兰奶业中牧草生长预测依赖不切实际的机制模型所带来的效率与准确性问题,从而影响牧场可持续性和经济效益。解决方案的关键在于引入针对单变量时间序列数据设计的深度学习模型,特别是基于时序卷积网络(Temporal Convolutional Network, TCN)的方法,利用历史牧草高度数据实现高精度预测,其在科克地区多年牧草生长预测中表现出优异性能(RMSE=2.74,MAE=3.46),并通过覆盖34年共1,757周的广泛数据集验证了模型配置的最优性,显著提升了预测可靠性并为可持续奶业发展提供技术支持。

链接: https://arxiv.org/abs/2511.03749
作者: Oluwadurotimi Onibonoje,Vuong M. Ngo,Andrew McCarre,Elodie Ruelle,Bernadette O-Briend,Mark Roantree
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 13 pages (two-columns), 7 figures, 3 tables

点击查看摘要

Abstract:Grasslands, constituting the world’s second-largest terrestrial carbon sink, play a crucial role in biodiversity and the regulation of the carbon cycle. Currently, the Irish dairy sector, a significant economic contributor, grapples with challenges related to profitability and sustainability. Presently, grass growth forecasting relies on impractical mechanistic models. In response, we propose deep learning models tailored for univariate datasets, presenting cost-effective alternatives. Notably, a temporal convolutional network designed for forecasting Perennial Ryegrass growth in Cork exhibits high performance, leveraging historical grass height data with RMSE of 2.74 and MAE of 3.46. Validation across a comprehensive dataset spanning 1,757 weeks over 34 years provides insights into optimal model configurations. This study enhances our understanding of model behavior, thereby improving reliability in grass growth forecasting and contributing to the advancement of sustainable dairy farming practices.
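
TCN 的基本单元是因果空洞卷积(只向左填充,保证不泄漏未来信息)。下面是一个 PyTorch 最小示意:叠加空洞率 1/2/4 的三个块预测下一周草高,通道数、层数与输入长度均为演示取值,并非论文的确切配置:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalBlock(nn.Module):
    def __init__(self, ch_in, ch_out, k=3, dilation=1):
        super().__init__()
        self.pad = (k - 1) * dilation                     # 仅左填充 => 因果
        self.conv = nn.Conv1d(ch_in, ch_out, k, dilation=dilation)

    def forward(self, x):                                 # x: (batch, ch, time)
        return F.relu(self.conv(F.pad(x, (self.pad, 0))))

tcn = nn.Sequential(
    CausalBlock(1, 16, dilation=1),
    CausalBlock(16, 16, dilation=2),
    CausalBlock(16, 16, dilation=4),
    nn.Conv1d(16, 1, kernel_size=1),   # 1x1 卷积映射到标量输出
)
history = torch.randn(8, 1, 52)        # 批大小 8,过去 52 周的草高序列(假想)
pred = tcn(history)[:, :, -1]          # 取最后时间步作为下一周预测
```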
zh

[AI-73] OpenMENA: An Open-Source Memristor Interfacing and Compute Board for Neuromorphic Edge-AI Applications

【速读】:该论文旨在解决边缘人工智能(Edge AI)中计算能效低、硬件与软件协同设计不足的问题,尤其针对基于忆阻器交叉阵列(memristor crossbars)的存内计算架构在实际部署时面临的器件非理想性、权重迁移困难及缺乏可复现软硬件接口等挑战。解决方案的关键在于提出Open-MENA(Open Memristor-in-Memory Accelerator),其核心创新包括:(i) 一套可复现的混合信号读-写-验证循环硬件接口,实现对忆阻器阵列的精确控制;(ii) 集成高阶API的固件-软件栈,支持推理和设备端学习;(iii) 提出电压增量比例积分(VIPI)编程方法,将预训练权重映射为模拟电导,并通过芯片级闭环微调补偿器件非理想性,从而实现从权重迁移到本地适应的完整工作流。

链接: https://arxiv.org/abs/2511.03747
作者: Ali Safa,Farida Mohsen,Zainab Ali,Bo Wang,Amine Bermak
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Memristive crossbars enable in-memory multiply-accumulate and local plasticity learning, offering a path to energy-efficient edge AI. To this end, we present Open-MENA (Open Memristor-in-Memory Accelerator), which, to our knowledge, is the first fully open memristor interfacing system integrating (i) a reproducible hardware interface for memristor crossbars with mixed-signal read-program-verify loops; (ii) a firmware-software stack with high-level APIs for inference and on-device learning; and (iii) a Voltage-Incremental Proportional-Integral (VIPI) method to program pre-trained weights into analog conductances, followed by chip-in-the-loop fine-tuning to mitigate device non-idealities. OpenMENA is validated on digit recognition, demonstrating the flow from weight transfer to on-device adaptation, and on a real-world robot obstacle-avoidance task, where the memristor-based model learns to map localization inputs to motor commands. OpenMENA is released as open source to democratize memristor-enabled edge-AI research.
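
摘要中的 VIPI(电压增量比例积分)本质上是一个 PI 控制回路:读电导、算误差、按比例加积分给出带符号的电压增量脉冲。下面以一阶玩具器件模型演示这一读-写-校验闭环,增益与量纲均为归一化假设,硬件接口为占位:

```python
def vipi_program(read_g, apply_pulse, g_target, kp=0.8, ki=0.2,
                 tol=1e-3, max_iter=200):
    # read_g / apply_pulse 为假设的硬件读写接口
    integ = 0.0
    for _ in range(max_iter):
        err = g_target - read_g()            # 目标电导与读出电导之差
        if abs(err) < tol:
            return True                      # 收敛:达到目标电导
        integ += err
        apply_pulse(kp * err + ki * integ)   # PI 给出带符号的电压增量
    return False

# 一阶玩具器件:电导随脉冲电压线性变化(归一化单位,纯示意)
state = {"g": 0.0}
read_g = lambda: state["g"]
apply_pulse = lambda v: state.update(g=state["g"] + 0.05 * v)
print(vipi_program(read_g, apply_pulse, g_target=1.0))   # True
```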
zh

[AI-74] Conversational Collective Intelligence (CCI) using Hyperchat AI in an Authentic Forecasting Task

【速读】:该论文旨在解决如何通过技术手段提升大规模人类群体在复杂决策任务中的集体智能(Collective Intelligence, CI)表现问题,特别是在预测不确定性事件(如MLB比赛结果)时的准确性不足。解决方案的关键在于引入Hyperchat AI这一新型代理型(agentic)技术,通过实时AI辅助的对话机制,促进网络化人群之间的高效互动与协同推理,从而显著增强群体判断的准确性与一致性。实验表明,使用Hyperchat AI进行讨论的群体在高置信度预测中达到78%准确率,显著优于拉斯维加斯博彩市场基准(57%),且高交互频率的讨论进一步将准确率提升至88%,验证了实时交互式 deliberation 在放大集体智慧中的核心作用。

链接: https://arxiv.org/abs/2511.03732
作者: Hans Schumann,Louis Rosenberg,Ganesh Mani,Gregg Willcox
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hyperchat AI is a novel agentic technology that enables thoughtful conversations among networked human groups of potentially unlimited size. It allows large teams to discuss complex issues, brainstorm ideas, surface risks, assess alternatives and efficiently converge on optimized solutions that amplify the group’s Collective Intelligence (CI). A formal study was conducted to quantify the forecasting accuracy of human groups using Hyperchat AI to conversationally predict the outcome of Major League Baseball (MLB) games. During an 8-week period, networked groups of approximately 24 sports fans were tasked with collaboratively forecasting the winners of 59 baseball games through real-time conversation facilitated by AI agents. The results showed that when debating the games using Hyperchat AI technology, the groups converged on High Confidence predictions that significantly outperformed Vegas betting markets. Specifically, groups were 78% accurate in their High Confidence picks, a statistically strong result vs the Vegas odds of 57% (p=0.020). Had the groups bet against the spread (ATS) on these games, they would have achieved a 46% ROI against Vegas betting markets. In addition, High Confidence forecasts that were generated through above-average conversation rates were 88% accurate, suggesting that real-time interactive deliberation is central to amplified accuracy.
zh

[AI-75] Not All Explanations are Created Equal: Investigating the Pitfalls of Current XAI Evaluation

【速读】:该论文试图解决当前可解释人工智能(Explainable Artificial Intelligence, XAI)领域中评估方法的局限性问题,即现有评价手段多依赖于简单的用户调查,难以客观衡量解释质量,且无法区分高质量与低质量解释的实际差异。论文指出,大多数解释即使不准确或无用,也会因对比“无解释”而提升用户满意度,从而误导研究结论。其解决方案的关键在于强调应优先关注“可操作性解释”(actionable explanations),即能够引导用户采取具体行动的解释,并通过一个用于教授国际象棋概念的代理助手实验验证了这一观点:只有具备可操作性的解释才能真正提升用户理解和决策能力,而非仅带来主观满意度的提升。这为未来XAI研究提出了更严谨的评估标准和方向。

链接: https://arxiv.org/abs/2511.03730
作者: Joe Shymanski,Jacob Brue,Sandip Sen
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: The authors’ accepted manuscript of Chapter 9 in Bi-directionality in Human-AI Collaborative Systems (Springer, 2025). The final published version is available at this https URL . 27 pages, 12 figures, 3 tables

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) aims to create transparency in modern AI models by offering explanations of the models to human users. There are many ways in which researchers have attempted to evaluate the quality of these XAI models, such as user studies or proposed objective metrics like “fidelity”. However, these current XAI evaluation techniques are ad hoc at best and not generalizable. Thus, most studies done within this field conduct simple user surveys to analyze the difference between no explanations and those generated by their proposed solution. We do not find this to provide adequate evidence that the explanations generated are of good quality since we believe any kind of explanation will be “better” in most metrics when compared to none at all. Thus, our study looks to highlight this pitfall: most explanations, regardless of quality or correctness, will increase user satisfaction. We also propose that emphasis should be placed on actionable explanations. We demonstrate the validity of both of our claims using an agent assistant to teach chess concepts to users. The results of this chapter will act as a call to action in the field of XAI for more comprehensive evaluation techniques for future research in order to prove explanation quality beyond user satisfaction. Additionally, we present an analysis of the scenarios in which placebic or actionable explanations would be most useful.
zh

[AI-76] Beyond Chat: A Framework for LLMs as Human-Centered Support Systems

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在从单纯问答工具向人类陪伴者、教练、调解者和内容策展人等角色转变过程中,如何实现以人类为中心的负责任集成问题。其核心解决方案在于提出一个基于角色的框架(role-based framework),通过透明度、个性化、安全边界(guardrails)、带隐私保护的记忆机制以及共情与可靠性之间的平衡等跨领域设计原则,指导LLM支持系统的设计与评估。该框架不仅关注传统准确性指标,还引入信任度、用户参与度及长期效果等多维评价体系,并识别出过度依赖、幻觉、偏见、隐私泄露及获取不平等的风险,进而推动统一评估标准、人机混合模型、记忆架构优化、跨域基准测试与治理机制等未来方向的发展。

链接: https://arxiv.org/abs/2511.03729
作者: Zhiyin Zhou
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are moving beyond transactional question answering to act as companions, coaches, mediators, and curators that scaffold human growth, decision-making, and well-being. This paper proposes a role-based framework for human-centered LLM support systems, compares real deployments across domains, and identifies cross-cutting design principles: transparency, personalization, guardrails, memory with privacy, and a balance of empathy and reliability. It outlines evaluation metrics that extend beyond accuracy to trust, engagement, and longitudinal outcomes. It also analyzes risks including over-reliance, hallucination, bias, privacy exposure, and unequal access, and proposes future directions spanning unified evaluation, hybrid human-AI models, memory architectures, cross-domain benchmarking, and governance. The goal is to support responsible integration of LLMs in sensitive settings where people need accompaniment and guidance, not only answers.
zh

[AI-77] Efficient On-Device Agents via Adaptive Context Management

【速读】:该论文旨在解决在设备端(on-device)部署智能体(AI agent)时,受限于有限内存容量导致可用上下文窗口过小的问题,这一限制使得复杂工具调用与持久状态交互难以实现,从而形成性能与设备可行性之间的权衡。解决方案的关键在于提出一种上下文高效的设备端智能体框架,其核心是三项协同优化:(1) 基于专用LoRA适配器的动态内存系统,将对话历史压缩为结构化的上下文状态对象(Context State Object);(2) 采用极简序列化格式表示工具模式(tool schema),降低每个工具的token开销;(3) 引入按需加载机制(just-in-time schema-passing),仅在工具被选中时才加载完整定义。通过该框架,作者将初始系统提示词上下文压缩超过6倍,交互过程中的上下文增长速率降低10至25倍,同时保持或优于传统基线模型的性能,证明了战略性上下文管理是实现强大且持久的设备端AI的关键。

链接: https://arxiv.org/abs/2511.03728
作者: Sanidhya Vijayvargiya,Rahul Lokesh
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 27 pages, 5 figures

点击查看摘要

Abstract:On-device AI agents offer the potential for personalized, low-latency assistance, but their deployment is fundamentally constrained by limited memory capacity, which restricts usable context. This reduced practical context window creates a trade-off between supporting rich, stateful interactions with complex tool capabilities and maintaining on-device feasibility. We break this trade-off with a framework for context-efficient on-device agents, driven by three synergistic optimizations: (1) a dynamic memory system using specialized LoRA adapters to distill conversational history into a compressed and structured Context State Object; (2) a minimalist serialization format for tool schemas to minimize token overhead per tool; and (3) a just-in-time schema-passing mechanism that loads full tool definitions only upon tool selection. We instantiate this framework by adapting a 3B parameter SLM to context-efficient trajectories and rigorously evaluate it against a conventional baseline on complex user tasks. Our agent matches, or exceeds, the performance of a conventional baseline while dramatically compressing context, achieving more than a 6-fold reduction in initial system prompt context and a 10- to 25-fold reduction in context growth rate based on the interaction verbosity, demonstrating that strategic context management is key to unlocking capable and persistent on-device AI.
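
其中“极简工具模式序列化 + 按需传递”的思路可以用几行代码说明:系统提示里每个工具只留一行名称加描述,完整参数模式等到模型选中该工具后再注入。工具名与字段均为假设的示例,并非论文的实际格式:

```python
import json

TOOLS = {
    "get_weather": {
        "description": "Get current weather for a city",
        "parameters": {"city": {"type": "string"},
                       "unit": {"type": "string", "enum": ["C", "F"]}},
    },
    "send_sms": {
        "description": "Send an SMS message",
        "parameters": {"to": {"type": "string"}, "body": {"type": "string"}},
    },
}

def minimal_catalog(tools):
    # 极简序列化:初始上下文中每个工具仅占一行,压缩每工具 token 开销
    return "\n".join(f"{name}: {t['description']}" for name, t in tools.items())

def jit_schema(name):
    # just-in-time:模型选定工具后才注入完整参数模式
    return json.dumps({name: TOOLS[name]["parameters"]}, ensure_ascii=False)

print(minimal_catalog(TOOLS))     # 初始上下文只含工具名 + 一句描述
print(jit_schema("get_weather"))  # 选中后再注入完整 schema
```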
zh

[AI-78] MazeMate: An LLM -Powered Chatbot to Support Computational Thinking in Gamified Programming Learning

【速读】:该论文试图解决的问题是:如何利用大语言模型(Large Language Models, LLMs)在编程游戏中有效支持计算思维(Computational Thinking, CT)的发展,尤其是在当前LLM应用中缺乏对CT过程的针对性引导。解决方案的关键在于设计并实现一个名为MazeMate的LLM驱动聊天机器人,嵌入到3D迷宫编程游戏中,通过提供自适应、情境敏感的脚手架(scaffolding),精准匹配迷宫求解与迷宫设计中的CT过程(如分解、抽象和算法思维)。实证研究表明,MazeMate在迷宫求解任务中能有效促进CT发展,但在迷宫设计任务中存在建议不匹配和虚构算法等局限,提示未来需优化设计以提升其在真实课堂环境中的可用性与有效性。

链接: https://arxiv.org/abs/2511.03727
作者: Chenyu Hou,Hua Yu,Gaoxia Zhu,John Derek Anas,Jiao Liu,Yew Soon Ong
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computational Thinking (CT) is a foundational problem-solving skill, and gamified programming environments are a widely adopted approach to cultivating it. While large language models (LLMs) provide on-demand programming support, current applications rarely foster CT development. We present MazeMate, an LLM-powered chatbot embedded in a 3D Maze programming game, designed to deliver adaptive, context-sensitive scaffolds aligned with CT processes in maze solving and maze design. We report on the first classroom implementation with 247 undergraduates. Students rated MazeMate as moderately helpful, with higher perceived usefulness for maze solving than for maze design. Thematic analysis confirmed support for CT processes such as decomposition, abstraction, and algorithmic thinking, while also revealing limitations in supporting maze design, including mismatched suggestions and fabricated algorithmic solutions. These findings demonstrate the potential of LLM-based scaffolding to support CT and underscore directions for design refinement to enhance MazeMate usability in authentic classrooms.
zh

[AI-79] Simulation-Based Validation of an Integrated 4D/5D Digital-Twin Framework for Predictive Construction Control

【速读】:该论文旨在解决美国建筑行业中持续存在的成本与进度偏差问题,揭示了传统确定性关键路径法(Critical Path Method, CPM)和静态文档驱动估算方法的局限性。其解决方案的关键在于构建一个集成的4D/5D数字孪生框架,融合建筑信息模型(Building Information Modeling, BIM)与基于自然语言处理(Natural Language Processing, NLP)的成本映射、计算机视觉(Computer Vision, CV)驱动的进度测量、贝叶斯概率CPM更新以及深度强化学习(Deep Reinforcement Learning, DRL)资源均衡技术,从而实现更精准、透明且具备自适应能力的施工管理。

链接: https://arxiv.org/abs/2511.03684
作者: Atena Khoshkonesh,Mohsen Mohammadagha,Navid Ebrahimi
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Persistent cost and schedule deviations remain a major challenge in the U.S. construction industry, revealing the limitations of deterministic CPM and static document-based estimating. This study presents an integrated 4D/5D digital-twin framework that couples Building Information Modeling (BIM) with natural-language processing (NLP)-based cost mapping, computer-vision (CV)-driven progress measurement, Bayesian probabilistic CPM updating, and deep-reinforcement-learning (DRL) resource-leveling. A nine-month case implementation on a Dallas-Fort Worth mid-rise project demonstrated measurable gains in accuracy and efficiency: 43% reduction in estimating labor, 6% reduction in overtime, and 30% project-buffer utilization, while maintaining an on-time finish at 128 days within P50-P80 confidence bounds. The digital-twin sandbox also enabled real-time “what-if” forecasting and traceable cost-schedule alignment through a 5D knowledge graph. Findings confirm that integrating AI-based analytics with probabilistic CPM and DRL enhances forecasting precision, transparency, and control resilience. The validated workflow establishes a practical pathway toward predictive, adaptive, and auditable construction management.
zh

[AI-80] Exploratory Analysis of Cyberattack Patterns on E-Commerce Platforms Using Statistical Methods

【速读】:该论文旨在解决电子商务平台日益复杂的网络攻击问题,以维护消费者信任和业务连续性。其核心挑战在于如何准确检测并预测攻击模式,尤其是在高风险时段(如节假日购物季)中识别出威胁严重性更高的攻击行为。解决方案的关键在于构建一个融合统计建模与集成机器学习的混合分析框架:一方面采用Auto ARIMA进行时间序列预测,并通过Mann-Whitney U检验(U = 2579981.5, p = 0.0121)证实节假日期间攻击强度显著高于非节日;另一方面利用XGBoost、LightGBM和CatBoost等集成模型实现攻击类型的分类预测,其中CatBoost表现最优(准确率85.29%,F1分数0.2254,ROC AUC 0.8247)。该框架首次将季节性预测与可解释的集成学习相结合,实现了对时间维度风险的提前感知与具体攻击类型的有效识别,为网络安全资源的主动配置提供了依据。

链接: https://arxiv.org/abs/2511.03020
作者: Fatimo Adenike Adeniya(York St John University, London Campus, London, United Kingdom)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 9 figures, 6 tables; MSc Research Dissertation, York St John University, London Campus

点击查看摘要

Abstract:Cyberattacks on e-commerce platforms have grown in sophistication, threatening consumer trust and operational continuity. This research presents a hybrid analytical framework that integrates statistical modelling and machine learning for detecting and forecasting cyberattack patterns in the e-commerce domain. Using the Verizon Community Data Breach (VCDB) dataset, the study applies Auto ARIMA for temporal forecasting and significance testing, including a Mann-Whitney U test (U = 2579981.5, p = 0.0121), which confirmed that holiday shopping events experienced significantly more severe cyberattacks than non-holiday periods. ANOVA was also used to examine seasonal variation in threat severity, while ensemble machine learning models (XGBoost, LightGBM, and CatBoost) were employed for predictive classification. Results reveal recurrent attack spikes during high-risk periods such as Black Friday and holiday seasons, with breaches involving Personally Identifiable Information (PII) exhibiting elevated threat indicators. Among the models, CatBoost achieved the highest performance (accuracy = 85.29%, F1 score = 0.2254, ROC AUC = 0.8247). The framework uniquely combines seasonal forecasting with interpretable ensemble learning, enabling temporal risk anticipation and breach-type classification. Ethical considerations, including responsible use of sensitive data and bias assessment, were incorporated. Despite class imbalance and reliance on historical data, the study provides insights for proactive cybersecurity resource allocation and outlines directions for future real-time threat detection research.
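
文中两类核心分析(Auto ARIMA 时间序列预测、节假日 vs 非节假日的 Mann-Whitney U 检验)都可以用现成库复现调用方式。下面以合成数据演示 pmdarima 与 scipy 的用法,数据分布与参数均为假设:

```python
import numpy as np
from scipy.stats import mannwhitneyu
import pmdarima as pm

rng = np.random.default_rng(42)

# 合成的按周攻击计数序列(3 年);m=52 表示年度季节性(拟合较慢)
weekly_counts = rng.poisson(20, size=156).astype(float)
model = pm.auto_arima(weekly_counts, seasonal=True, m=52, suppress_warnings=True)
forecast = model.predict(n_periods=12)        # 预测未来 12 周

# 两组时段的攻击严重度样本(合成),单侧检验“节假日更严重”
holiday = rng.gamma(3.0, 2.0, 200)
non_holiday = rng.gamma(2.5, 2.0, 800)
u, p = mannwhitneyu(holiday, non_holiday, alternative="greater")
print(f"U={u:.1f}, p={p:.4f}")                # p < 0.05 即差异显著
```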
zh

[AI-81] CORE - A Cell-Level Coarse-to-Fine Image Registration Engine for Multi-stain Image Alignment

【速读】:该论文旨在解决多模态全切片图像(Whole Slide Images, WSI)在高分辨率、细胞核级别上的精准配准问题,尤其针对明场和免疫荧光显微成像中因组织形态差异与非刚性形变导致的注册精度不足难题。解决方案的关键在于提出了一种新颖的“粗到精”(coarse-to-fine)框架CORE:首先通过提示驱动的组织掩膜提取(prompt-based tissue mask extraction)去除伪影和非组织区域,结合组织形态学特征与预训练特征提取器加速密集特征匹配完成全局对齐;随后基于检测到的细胞核中心点,采用自定义的形状感知点集配准模型实现精细刚性配准;最终利用相干点漂移(Coherent Point Drift, CPD)估计非线性位移场,在细胞层面实现非刚性对齐。该方法通过自动生成的细胞核结构增强可变形配准精度,确保跨模态下的细胞核级对应关系。

链接: https://arxiv.org/abs/2511.03826
作者: Esha Sadia Nasir,Behnaz Elhaminia,Mark Eastwood,Catherine King,Owen Cain,Lorraine Harper,Paul Moss,Dimitrios Chanouzas,David Snead,Nasir Rajpoot,Adam Shephard,Shan E Ahmed Raza
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate and efficient registration of whole slide images (WSIs) is essential for high-resolution, nuclei-level analysis in multi-stained tissue slides. We propose a novel coarse-to-fine framework CORE for accurate nuclei-level registration across diverse multimodal whole-slide image (WSI) datasets. The coarse registration stage leverages prompt-based tissue mask extraction to effectively filter out artefacts and non-tissue regions, followed by global alignment using tissue morphology and accelerated dense feature matching with a pre-trained feature extractor. From the coarsely aligned slides, nuclei centroids are detected and subjected to fine-grained rigid registration using a custom, shape-aware point-set registration model. Finally, non-rigid alignment at the cellular level is achieved by estimating a non-linear displacement field using Coherent Point Drift (CPD). Our approach benefits from automatically generated nuclei that enhance the accuracy of deformable registration and ensure precise nuclei-level correspondence across modalities. The proposed model is evaluated on three publicly available WSI registration datasets, and two private datasets. We show that CORE outperforms current state-of-the-art methods in terms of generalisability, precision, and robustness in bright-field and immunofluorescence microscopy WSIs.
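
论文的形状感知点集配准模型是定制的;作为参照,CPD 这一步可用开源 pycpd 库直接演示。下面在人工构造的核质心点集上运行可变形配准,形变方式与超参数均为示意:

```python
import numpy as np
from pycpd import DeformableRegistration

rng = np.random.default_rng(0)
src = rng.random((300, 2))                       # 源切片的核质心 (N, 2)
dst = src + 0.05 * np.sin(4 * np.pi * src)       # 人工非刚性形变作为目标点集

reg = DeformableRegistration(X=dst, Y=src, alpha=2.0, beta=2.0)
src_warped, (G, W) = reg.register()              # 返回形变后点集与形变参数

rmse = np.sqrt(np.mean(np.sum((src_warped - dst) ** 2, axis=1)))
print(f"配准后 RMSE: {rmse:.4f}")
```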
zh

[AI-82] Expert Evaluation of LLM World Models: A High-T_c Superconductivity Case Study ICML2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在专业科学领域中提供准确、全面且具有证据支持的答案能力不足的问题,特别是在高温超导体(high-temperature cuprates)这一复杂研究领域。其解决方案的关键在于构建一个由专家精心整理的包含1,726篇文献的数据库和一套67个深度问题集,并采用检索增强生成(Retrieval-Augmented Generation, RAG)技术,使模型不仅能访问文本信息,还能获取图像等多模态数据,从而提升答案的准确性与综合性。实验表明,基于RAG的系统在平衡视角、事实完整性、简洁性和证据支撑等方面显著优于封闭式商用模型,验证了结构化知识检索与生成结合的有效性。

链接: https://arxiv.org/abs/2511.03782
作者: Haoyu Guo,Maria Tikhanovskaya,Paul Raccuglia,Alexey Vlaskin,Chris Co,Daniel J. Liebling,Scott Ellsworth,Matthew Abraham,Elizabeth Dorfman,N. P. Armitage,Chunhan Feng,Antoine Georges,Olivier Gingras,Dominik Kiese,Steven A. Kivelson,Vadim Oganesyan,B. J. Ramshaw,Subir Sachdev,T. Senthil,J. M. Tranquada,Michael P. Brenner,Subhashini Venugopalan,Eun-Ah Kim
机构: 未知
类目: Superconductivity (cond-mat.supr-con); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI)
备注: (v1) 9 pages, 4 figures, with 7-page supporting information. Accepted at the ICML 2025 workshop on Assessing World Models and the Explorations in AI Today workshop at ICML’25

点击查看摘要

Abstract:Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. Using the field of high-temperature cuprates as an exemplar, we evaluate the ability of LLM systems to understand the literature at the level of an expert. We construct an expert-curated database of 1,726 scientific papers that covers the history of the field, and a set of 67 expert-formulated questions that probe deep understanding of the literature. We then evaluate six different LLM-based systems for answering these questions, including both commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. Experts then evaluate the answers of these systems against a rubric that assesses balanced perspectives, factual comprehensiveness, succinctness, and evidentiary support. Among the six systems, two using RAG on curated literature outperformed existing closed models across key metrics, particularly in providing comprehensive and well-supported answers. We discuss promising aspects of LLM performance as well as critical shortcomings of all the models. The set of expert-formulated questions and the rubric will be valuable for assessing expert level performance of LLM based reasoning systems.
zh
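下面给出检索增强生成中"检索"一步的极简示意:用 TF-IDF 向量与余弦相似度从假想的文献片段中选出与问题最相关的上下文。论文系统还支持图像检索并基于专家整理的 1,726 篇文献库,此处仅为缩略演示,语料为虚构示例。

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 假想的文献片段(示意)
corpus = [
    "Pseudogap phase in underdoped cuprates observed by ARPES.",
    "Superconducting dome and doping dependence of Tc in cuprates.",
    "Charge density wave order competing with superconductivity.",
]
query = "How does Tc vary with doping in cuprate superconductors?"

vec = TfidfVectorizer().fit(corpus + [query])
scores = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
top_k = np.argsort(scores)[::-1][:2]   # 取最相关的两段文献作为生成答案的上下文
for i in top_k:
    print(f"score={scores[i]:.3f}  {corpus[i]}")
```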

[AI-83] Climbing the label tree: Hierarchy-preserving contrastive learning for medical imaging

【速读】:该论文旨在解决医学图像标签通常具有层次结构(如器官-组织-亚型)但标准自监督学习(SSL)方法忽略这一结构的问题。解决方案的关键在于提出一种保持层级关系的对比学习框架,引入两个可插拔的目标函数:层次加权对比损失(Hierarchy-Weighted Contrastive, HWC),通过共享祖先对正负样本对的权重进行调整以增强父节点内部的一致性;以及层级感知边界(Level-Aware Margin, LAM),利用原型边距分离不同层级的祖先组。该方法不依赖特定几何空间,适用于欧氏和双曲嵌入,且在多个基准测试中显著提升表示质量并更忠实于标签树结构,从而在性能与可解释性之间取得平衡。

链接: https://arxiv.org/abs/2511.03771
作者: Alif Elham Khan
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical image labels are often organized by taxonomies (e.g., organ - tissue - subtype), yet standard self-supervised learning (SSL) ignores this structure. We present a hierarchy-preserving contrastive framework that makes the label tree a first-class training signal and an evaluation target. Our approach introduces two plug-in objectives: Hierarchy-Weighted Contrastive (HWC), which scales positive/negative pair strengths by shared ancestors to promote within-parent coherence, and Level-Aware Margin (LAM), a prototype margin that separates ancestor groups across levels. The formulation is geometry-agnostic and applies to Euclidean and hyperbolic embeddings without architectural changes. Across several benchmarks, including breast histopathology, the proposed objectives consistently improve representation quality over strong SSL baselines while better respecting the taxonomy. We evaluate with metrics tailored to hierarchy faithfulness: HF1 (hierarchical F1), H-Acc (tree-distance-weighted accuracy), and parent-distance violation rate. We also report top-1 accuracy for completeness. Ablations show that HWC and LAM are effective even without curvature, and combining them yields the most taxonomy-aligned representations. Taken together, these results provide a simple, general recipe for learning medical image representations that respect the label tree and advance both performance and interpretability in hierarchy-rich domains.
zh
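以下是层次加权对比损失(HWC)思想的一个极简 PyTorch 示意:正样本对的权重按标签树中共享祖先的层数缩放,从而促进同一父节点内部的一致性。温度、归一化等实现细节为假设,并非论文官方代码;LAM 的原型边距项此处省略。

```python
import torch
import torch.nn.functional as F

def hwc_loss(z, paths, tau=0.1):
    """层次加权对比损失的示意:z为(N,d)嵌入,paths为每个样本在标签树中
    从根到叶的路径,如 ("breast","ductal","dcis")。"""
    z = F.normalize(z, dim=1)
    n, depth = z.size(0), len(paths[0])
    sim = z @ z.t() / tau
    # w[i,j] = 共享祖先层数 / 树深度(对角线置零)
    w = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            shared = 0
            for a, b in zip(paths[i], paths[j]):
                if a == b:
                    shared += 1
                else:
                    break
            w[i, j] = shared / depth
    w.fill_diagonal_(0)
    diag = torch.eye(n, dtype=torch.bool)
    log_p = sim - torch.logsumexp(sim.masked_fill(diag, -1e9), dim=1, keepdim=True)
    return -(w * log_p).sum() / w.sum()

z = torch.randn(6, 16, requires_grad=True)
paths = [("breast", "ductal", "dcis"), ("breast", "ductal", "idc"),
         ("breast", "lobular", "lcis"), ("lung", "adeno", "lepidic"),
         ("lung", "adeno", "acinar"), ("lung", "squamous", "basaloid")]
loss = hwc_loss(z, paths)
loss.backward()
print(float(loss))
```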

[AI-84] Leveraging LLM-based agents for social science research: insights from citation network simulations

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在社会模拟中的能力边界尚不明确的问题,尤其是其在学术引用网络等社会现象建模中的适用性。解决方案的关键在于提出CiteAgent框架,该框架通过基于LLM的代理(agent)模拟人类行为来生成引文网络,成功复现了现实世界引文网络中的关键特征,如幂律分布、引文扭曲和直径收缩现象;在此基础上,进一步构建了两种基于LLM的社会科学研究范式:LLM-SE(LLM-based Survey Experiment)和LLM-LE(LLM-based Laboratory Experiment),从而实现对引文网络现象的严谨分析,并验证或挑战现有理论,拓展了传统科学学(science of science)研究的边界。

链接: https://arxiv.org/abs/2511.03758
作者: Jiarui Ji,Runlin Lei,Xuchen Pan,Zhewei Wei,Hao Sun,Yankai Lin,Xu Chen,Yongzheng Yang,Yaliang Li,Bolin Ding,Ji-Rong Wen
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注: accepted by HSSCOMMS’25

点击查看摘要

Abstract:The emergence of Large Language Models (LLMs) demonstrates their potential to encapsulate the logic and patterns inherent in human behavior simulation by leveraging extensive web data pre-training. However, the boundaries of LLM capabilities in social simulation remain unclear. To further explore the social attributes of LLMs, we introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation with LLM-based agents. CiteAgent successfully captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter. Building on this realistic simulation, we establish two LLM-based research paradigms in social science: LLM-SE (LLM-based Survey Experiment) and LLM-LE (LLM-based Laboratory Experiment). These paradigms facilitate rigorous analyses of citation network phenomena, allowing us to validate and challenge existing theories. Additionally, we extend the research scope of traditional science of science studies through idealized social experiments, with the simulation experiment results providing valuable insights for real-world academic environments. Our work demonstrates the potential of LLMs for advancing science of science research in social science.
zh

机器学习

[LG-0] Multi-Method Analysis of Mathematics Placement Assessments: Classical Machine Learning and Clustering Approaches

链接: https://arxiv.org/abs/2511.04667
作者: Julian D. Allagan,Dasia A. Singleton,Shanae N. Perry,Gabrielle C. Morgan,Essence A. Morgan
类目: Machine Learning (cs.LG)
*备注: 28 pages, 8 tables, 4 figures, NAM conference

点击查看摘要

Abstract:This study evaluates a 40-item mathematics placement examination administered to 198 students using a multi-method framework combining Classical Test Theory, machine learning, and unsupervised clustering. Classical Test Theory analysis reveals that 55% of items achieve excellent discrimination (D \geq 0.40) while 30% demonstrate poor discrimination (D < 0.20), requiring replacement. Question 6 (Graph Interpretation) emerges as the examination’s most powerful discriminator, achieving perfect discrimination (D = 1.000), highest ANOVA F-statistic (F = 4609.1), and maximum Random Forest feature importance (0.206), accounting for 20.6% of predictive power. Machine learning algorithms demonstrate exceptional performance, with Random Forest and Gradient Boosting achieving 97.5% and 96.0% cross-validation accuracy. K-means clustering identifies a natural binary competency structure with a boundary at 42.5%, diverging from the institutional threshold of 55% and suggesting potential overclassification into remedial categories. The two-cluster solution exhibits exceptional stability (bootstrap ARI = 0.855) with perfect lower-cluster purity. Convergent evidence across methods supports specific refinements: replace poorly discriminating items, implement a two-stage assessment, and integrate Random Forest predictions with transparency mechanisms. These findings demonstrate that multi-method integration provides a robust empirical foundation for evidence-based mathematics placement optimization.
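摘要中的题目区分度 D 可按经典测量理论用"高分组正确率 − 低分组正确率"计算(常取上下各 27% 的考生)。下面是一个基于合成作答数据的示意实现。

```python
import numpy as np

def discrimination_index(scores, item_correct, frac=0.27):
    """经典测量理论区分度 D = P(高分组答对) - P(低分组答对)。示意实现。"""
    n = len(scores)
    k = max(1, int(n * frac))
    order = np.argsort(scores)
    low, high = order[:k], order[-k:]
    return item_correct[high].mean() - item_correct[low].mean()

rng = np.random.default_rng(0)
ability = rng.normal(size=198)                       # 假设的考生能力
scores = ability + rng.normal(scale=0.5, size=198)   # 总分与能力相关
item = (ability + rng.normal(scale=0.8, size=198) > 0).astype(int)  # 某题作答
print(f"D = {discrimination_index(scores, item):.3f}")  # D >= 0.40 视为区分度优秀
```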

[LG-1] Forgetting is Everywhere

链接: https://arxiv.org/abs/2511.04666
作者: Ben Sanati,Thomas L. Lee,Trevor McInroe,Aidan Scannell,Nikolay Malkin,David Abel,Amos Storkey
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Project page: this https URL

点击查看摘要

Abstract:A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner’s predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm’s propensity to forget. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

[LG-2] Nowcast3D: Reliable precipitation nowcasting via gray-box learning

链接: https://arxiv.org/abs/2511.04659
作者: Huaguan Chen,Wei Han,Haofei Sun,Ning Lin,Xingtao Song,Yunfan Yang,Jie Tian,Yang Liu,Ji-Rong Wen,Xiaoye Zhang,Xueshun Shen,Hao Sun
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Extreme precipitation nowcasting demands high spatiotemporal fidelity and extended lead times, yet existing approaches remain limited. Numerical Weather Prediction (NWP) and its deep-learning emulations are too slow and coarse for rapidly evolving convection, while extrapolation and purely data-driven models suffer from error accumulation and excessive smoothing. Hybrid 2D radar-based methods discard crucial vertical information, preventing accurate reconstruction of height-dependent dynamics. We introduce a gray-box, fully three-dimensional nowcasting framework that directly processes volumetric radar reflectivity and couples physically constrained neural operators with data-driven learning. The model learns vertically varying 3D advection fields under a conservative advection operator, parameterizes spatially varying diffusion, and introduces a Brownian-motion–inspired stochastic term to represent unresolved motions. A residual branch captures small-scale convective initiation and microphysical variability, while a diffusion-based stochastic module estimates uncertainty. The framework achieves more accurate forecasts up to three-hour lead time across precipitation regimes and ranked first in 57% of cases in a blind evaluation by 160 meteorologists. By restoring full 3D dynamics with physical consistency, it offers a scalable and robust pathway for skillful and reliable nowcasting of extreme precipitation.

[LG-3] -Prune: Joint Model Pruning and Resource Allocation for Communication-efficient Time-triggered Federated Learning

链接: https://arxiv.org/abs/2511.04653
作者: Xinlu Zhang,Yansha Deng,Toktam Mahmoodi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) offers new opportunities in machine learning, particularly in addressing data privacy concerns. In contrast to conventional event-based federated learning, time-triggered federated learning (TT-Fed), as a general form of both asynchronous and synchronous FL, clusters users into different tiers based on fixed time intervals. However, the FL network consists of a growing number of user devices with limited wireless bandwidth, consequently magnifying issues such as stragglers and communication overhead. In this paper, we introduce adaptive model pruning to wireless TT-Fed systems and study the problem of jointly optimizing the pruning ratio and bandwidth allocation to minimize the training loss while ensuring minimal learning latency. To answer this question, we perform convergence analysis on the gradient l_2 norm of the TT-Fed model based on model pruning. Based on the obtained convergence upper bound, a joint optimization problem of pruning ratio and wireless bandwidth is formulated to minimize the model training loss under a given delay threshold. Then, we derive closed-form solutions for wireless bandwidth and pruning ratio using Karush-Kuhn-Tucker (KKT) conditions. The simulation results show that model pruning could reduce the communication cost by 40% while maintaining the model performance at the same level.

[LG-4] Optimal Inference Schedules for Masked Diffusion Models

链接: https://arxiv.org/abs/2511.04647
作者: Sitan Chen,Kevin Cong,Jerry Li
类目: Machine Learning (cs.LG)
*备注: 33 pages, 1 figure

点击查看摘要

Abstract:A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called diffusion language models, of which the masked diffusion model (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work of Li and Cai obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, exact characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of univariate function approximation. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong a priori knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its total correlation and dual total correlation, which show that in some natural settings, one can sample in O(log n) steps without any visible loss in performance, where n is the total sequence length.

[LG-5] Efficient probabilistic surrogate modeling techniques for partially-observed large-scale dynamical systems

链接: https://arxiv.org/abs/2511.04641
作者: Hans Harder,Abhijeet Vishwasrao,Luca Guastoni,Ricardo Vinuesa,Sebastian Peitz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper is concerned with probabilistic techniques for forecasting dynamical systems described by partial differential equations (such as, for example, the Navier-Stokes equations). In particular, it investigates and compares various extensions to the flow matching paradigm that reduce the number of sampling steps, namely direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs, and rectified flows. Moreover, experiments are conducted on a set of challenging systems. In particular, we also address the challenge of directly predicting 2D slices of large-scale 3D simulations, paving the way for efficient inflow generation for solvers.

[LG-6] evomap: A Toolbox for Dynamic Mapping in Python

链接: https://arxiv.org/abs/2511.04611
作者: Maximilian Matthe
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted for publication by the Journal of Statistical Software

点击查看摘要

Abstract:This paper presents evomap, a Python package for dynamic mapping. Mapping methods are widely used across disciplines to visualize relationships among objects as spatial representations, or maps. However, most existing statistical software supports only static mapping, which captures objects’ relationships at a single point in time and lacks tools to analyze how these relationships evolve. evomap fills this gap by implementing the dynamic mapping framework EvoMap, originally proposed by Matthe, Ringel, and Skiera (2023), which adapts traditional static mapping methods for dynamic analyses. The package supports multiple mapping techniques, including variants of Multidimensional Scaling (MDS), Sammon Mapping, and t-distributed Stochastic Neighbor Embedding (t-SNE). It also includes utilities for data preprocessing, exploration, and result evaluation, offering a comprehensive toolkit for dynamic mapping applications. This paper outlines the foundations of static and dynamic mapping, describes the architecture and functionality of evomap, and illustrates its application through an extensive usage example.
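evomap 包的具体 API 以其官方文档为准;下面仅用"逐期静态 MDS + Procrustes 对齐"演示动态映射要解决的问题——消除各期独立降维带来的旋转/镜像伪运动。EvoMap 框架本身是对各期映射做联合优化,比这种朴素的两步做法更稳健,此处仅为概念示意。

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
n = 10
pos = rng.random((n, 2))
maps = []
for t in range(3):
    pos = pos + rng.normal(scale=0.02, size=pos.shape)   # 对象关系随时间缓慢漂移
    dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    emb = MDS(n_components=2, dissimilarity="precomputed",
              random_state=0).fit_transform(dist)
    if maps:   # 对齐到上一期,去除旋转/镜像造成的伪运动
        _, emb, _ = procrustes(maps[-1], emb)
    maps.append(emb)
print("相邻期平均位移:", np.mean(np.linalg.norm(maps[1] - maps[0], axis=1)))
```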

[LG-7] Environment Agnostic Goal-Conditioning: A Study of Reward-Free Autonomous Learning

链接: https://arxiv.org/abs/2511.04598
作者: Hampus Åström,Elin Anna Topp,Jacek Malec
类目: Machine Learning (cs.LG)
*备注: 8 pages without cover, references and supplementary materials, 11 with. Submitted to RLC 2025’s workshop RLBrew and IMOL 2025

点击查看摘要

Abstract:In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.

[LG-8] Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems NEURIPS2025

链接: https://arxiv.org/abs/2511.04594
作者: Utkarsh U. Chavan,Prashant Trivedi,Nandyala Hemachandra
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: To appear in 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Multi-agent systems (MAS) are central to applications such as swarm robotics and traffic routing, where agents must coordinate in a decentralized manner to achieve a common objective. Stochastic Shortest Path (SSP) problems provide a natural framework for modeling decentralized control in such settings. While the problem of learning in SSP has been extensively studied in single-agent settings, the decentralized multi-agent variant remains largely unexplored. In this work, we take a step towards addressing that gap. We study decentralized multi-agent SSPs (Dec-MASSPs) under linear function approximation, where the transition dynamics and costs are represented using linear models. Applying novel symmetry-based arguments, we identify the structure of optimal policies. Our main contribution is the first regret lower bound for this setting based on the construction of hard-to-learn instances for any number of agents n. Our regret lower bound of \Omega(\sqrt{K}) over K episodes highlights the inherent learning difficulty in Dec-MASSPs. These insights clarify the learning complexity of decentralized control and can further guide the design of efficient learning algorithms in multi-agent systems.

[LG-9] Complexity as Advantage: A Regret-Based Perspective on Emergent Structure ICML2026

链接: https://arxiv.org/abs/2511.04590
作者: Oshri Naparstek
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 15 pages. Under preparation for submission to ICML 2026. Feedback welcome

点击查看摘要

Abstract:We introduce Complexity as Advantage (CAA), a framework that defines the complexity of a system relative to a family of observers. Instead of measuring complexity as an intrinsic property, we evaluate how much predictive regret a system induces for different observers attempting to model it. A system is complex when it is easy for some observers and hard for others, creating an information advantage. We show that this formulation unifies several notions of emergent behavior, including multiscale entropy, predictive information, and observer-dependent structure. The framework suggests that “interesting” systems are those positioned to create differentiated regret across observers, providing a quantitative grounding for why complexity can be functionally valuable. We demonstrate the idea through simple dynamical models and discuss implications for learning, evolution, and artificial agents.

[LG-10] ARETE: an R package for Automated REtrieval from TExt with large language models

链接: https://arxiv.org/abs/2511.04573
作者: Vasco V. Branco,Jandó Benedek,Lidia Pivovarova,Luís Correia,Pedro Cardoso
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Furthermore, researchers have to contend with an accelerated speed at which new information must be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information, but their data are often not machine-readable, requiring extensive human work to be retrieved. 2. We present the ARETE R package, open-source software aiming to automate data extraction of species occurrences powered by large language models, namely using the chatGPT Application Programming Interface. This R package integrates all steps of the data extraction and validation process, from Optical Character Recognition to detection of outliers and output in tabular format. Furthermore, we validate ARETE through systematic comparison between what is modelled and the work of human annotators. 3. We demonstrate the usefulness of the approach by comparing range maps produced using GBIF data with those automatically extracted for 100 species of spiders. Newly extracted data allowed us to expand the known Extent of Occurrence by a mean three orders of magnitude, revealing new areas where the species were found in the past, which may have important implications for spatial conservation planning and extinction risk assessments. 4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer in projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also allows predicting available bibliographic data during project planning.
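ARETE 实际通过大语言模型(chatGPT API)从文献中抽取物种出现记录;下面用一个无需联网的正则表达式占位示意"坐标抽取"这一环节,文本与记录均为虚构示例,并非包内实现。

```python
import re

# 虚构的文献片段:包含两条采集记录的经纬度坐标
text = """Two females of Dysdera crocata were collected near
Sintra, Portugal (38.7992 N, -9.3883 E) in May 1998, and one male
at 38.6979 N, -9.4215 E."""

# 匹配形如 "38.7992 N, -9.3883 E" 的十进制坐标对
pattern = re.compile(r"(-?\d{1,2}\.\d+)\s*N?,\s*(-?\d{1,3}\.\d+)\s*E?")
records = [(float(lat), float(lon)) for lat, lon in pattern.findall(text)]
print(records)   # [(38.7992, -9.3883), (38.6979, -9.4215)]
```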

[LG-11] Confidential Computing for Cloud Security: Exploring Hardware based Encryption Using Trusted Execution Environments

链接: https://arxiv.org/abs/2511.04550
作者: Dhruv Deepak Agarwal,Aswani Kumar Cherukuri
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growth of cloud computing has revolutionized data processing and storage capacities, offering new levels of scalability and flexibility. In the process, however, it has created a major security challenge, especially in terms of safeguarding sensitive data. Classical security practices, including encryption at rest and in transit, fail to protect data in use, exposing it to possible breaches. In response to this problem, Confidential Computing has emerged as a tool for securing data during processing through hardware-based Trusted Execution Environments (TEEs). TEEs, including Intel’s Software Guard Extensions (SGX) and ARM’s TrustZone, offer protected contexts within the processor, where data is kept confidential, intact, and secure, even in the presence of malicious software or a compromised operating system. In this research, we explore the architecture and security features of TEEs like Intel SGX and ARM TrustZone, and their effectiveness in improving cloud data security. Through a thorough literature survey, we analyze the deployment strategies, performance indicators, and practical uses of these TEEs. In addition, we discuss deployment issues, possible weaknesses, scalability limits, and integration challenges. Our results highlight the central position of TEEs in strengthening and advancing cloud security infrastructures, pointing towards their ability to create a secure foundation for Confidential Computing.

[LG-12] Uncertainty Quantification for Reduced-Order Surrogate Models Applied to Cloud Microphysics NEURIPS2025

链接: https://arxiv.org/abs/2511.04534
作者: Jonas E. Katona,Emily K. de Jong,Nipun Gunawardena
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Computational Physics (physics.comp-ph)
*备注: Accepted at the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences (ML4PS). 11 pages, 4 figures, 1 table. LLNL-CONF-2010541

点击查看摘要

Abstract:Reduced-order models (ROMs) can efficiently simulate high-dimensional physical systems, but lack robust uncertainty quantification methods. Existing approaches are frequently architecture- or training-specific, which limits flexibility and generalization. We introduce a post hoc, model-agnostic framework for predictive uncertainty quantification in latent space ROMs that requires no modification to the underlying architecture or training procedure. Using conformal prediction, our approach estimates statistical prediction intervals for multiple components of the ROM pipeline: latent dynamics, reconstruction, and end-to-end predictions. We demonstrate the method on a latent space dynamical model for cloud microphysics, where it accurately predicts the evolution of droplet-size distributions and quantifies uncertainty across the ROM pipeline.
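论文采用共形预测(conformal prediction)为 ROM 各环节给出统计预测区间。下面是分裂共形预测的通用示意:用校准集残差的分位数构造 (1−α) 预测区间,对任意黑盒模型后验适用;数据为合成示例,并非论文的云微物理数据。

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """分位数法分裂共形预测:基于校准集残差给出 (1-alpha) 预测区间。"""
    n = len(residuals_cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n        # 有限样本修正
    q = np.quantile(np.abs(residuals_cal), min(q_level, 1.0))
    return y_pred_test - q, y_pred_test + q

rng = np.random.default_rng(0)
y_cal_true = rng.normal(size=200)
y_cal_pred = y_cal_true + rng.normal(scale=0.3, size=200)   # 假想的ROM预测
lo, hi = split_conformal_interval(y_cal_true - y_cal_pred, np.array([0.5, -1.2]))
print(lo, hi)
```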

[LG-13] End-to-End Reinforcement Learning of Koopman Models for eNMPC of an Air Separation Unit

链接: https://arxiv.org/abs/2511.04522
作者: Daniel Mayfrank,Kayra Dernek,Laura Lang,Alexander Mitsos,Manuel Dahmen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: manuscript (8 pages, 5 figures, 1 table), supplementary materials (5 pages, 1 figure, 1 table)

点击查看摘要

Abstract:With our recently proposed method based on reinforcement learning (Mayfrank et al. (2024), Comput. Chem. Eng. 190), Koopman surrogate models can be trained for optimal performance in specific (economic) nonlinear model predictive control ((e)NMPC) applications. So far, our method has exclusively been demonstrated on a small-scale case study. Herein, we show that our method scales well to a more challenging demand response case study built on a large-scale model of a single-product (nitrogen) air separation unit. Across all numerical experiments, we assume observability of only a few realistically measurable plant variables. Compared to a purely system identification-based Koopman eNMPC, which generates small economic savings but frequently violates constraints, our method delivers similar economic performance while avoiding constraint violations.

[LG-14] Comparing EPGP Surrogates and Finite Elements Under Degree-of-Freedom Parity

链接: https://arxiv.org/abs/2511.04518
作者: Obed Amo,Samit Ghosh,Markus Lange-Hegermann,Bogdan Raiţă,Michael Pokojovy
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 14 pages, 2 figures

点击查看摘要

Abstract:We present a new benchmarking study comparing a boundary-constrained Ehrenpreis–Palamodov Gaussian Process (B-EPGP) surrogate with a classical finite element method combined with Crank–Nicolson time stepping (CN-FEM) for solving the two-dimensional wave equation with homogeneous Dirichlet boundary conditions. The B-EPGP construction leverages exponential-polynomial bases derived from the characteristic variety to enforce the PDE and boundary conditions exactly and employs penalized least squares to estimate the coefficients. To ensure fairness across paradigms, we introduce a degrees-of-freedom (DoF) matching protocol. Under matched DoF, B-EPGP consistently attains lower space-time L^2-error and maximum-in-time L^2-error in space than CN-FEM, improving accuracy by roughly two orders of magnitude.

[LG-15] Linear Mode Connectivity under Data Shifts for Deep Ensembles of Image Classifiers

链接: https://arxiv.org/abs/2511.04514
作者: C. Hepburn,T. Zielke,A.P. Raulf
类目: Machine Learning (cs.LG)
*备注: 16 pages, 22 figures

点击查看摘要

Abstract:The phenomenon of linear mode connectivity (LMC) links several aspects of deep learning, including training stability under noisy stochastic gradients, the smoothness and generalization of local minima (basins), the similarity and functional diversity of sampled models, and architectural effects on data processing. In this work, we experimentally study LMC under data shifts and identify conditions that mitigate their impact. We interpret data shifts as an additional source of stochastic gradient noise, which can be reduced through small learning rates and large batch sizes. These parameters influence whether models converge to the same local minimum or to regions of the loss landscape with varying smoothness and generalization. Although models sampled via LMC tend to make similar errors more frequently than those converging to different basins, the benefit of LMC lies in balancing training efficiency against the gains achieved from larger, more diverse ensembles. Code and supplementary materials will be made publicly available at this https URL in due course.

[LG-16] Online Algorithms for Repeated Optimal Stopping: Achieving Both Competitive Ratio and Regret Bounds

链接: https://arxiv.org/abs/2511.04484
作者: Tsubasa Harada,Yasushi Kawase,Hanna Sumita
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 33 pages

点击查看摘要

Abstract:We study the repeated optimal stopping problem, which generalizes the classical optimal stopping problem with an unknown distribution to a setting where the same problem is solved repeatedly over T rounds. In this framework, we aim to design algorithms that guarantee a competitive ratio in each round while also achieving sublinear regret across all rounds. Our primary contribution is a general algorithmic framework that achieves these objectives simultaneously for a wide array of repeated optimal stopping problems. The core idea is to dynamically select an algorithm for each round, choosing between two candidates: (1) an empirically optimal algorithm derived from the history of observations, and (2) a sample-based algorithm with a proven competitive ratio guarantee. Based on this approach, we design an algorithm that performs no worse than the baseline sample-based algorithm in every round, while ensuring that the total regret is bounded by \tilde{O}(\sqrt{T}). We demonstrate the broad applicability of our framework to canonical problems, including the prophet inequality, the secretary problem, and their variants under adversarial, random, and i.i.d. input models. For example, for the repeated prophet inequality problem, our method achieves a 1/2-competitive ratio from the second round on and an \tilde{O}(\sqrt{T}) regret. Furthermore, we establish a regret lower bound of \Omega(\sqrt{T}) even in the i.i.d. model, confirming that our algorithm’s performance is almost optimal with respect to the number of rounds.

[LG-17] owards Causal Market Simulators

链接: https://arxiv.org/abs/2511.04469
作者: Dennis Thumm,Luis Ontaneda Mijares
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Other Statistics (stat.OT)
*备注: ICAIF 2025 Workshop on Rethinking Financial Time-Series

点击查看摘要

Abstract:Market generators using deep generative models have shown promise for synthetic financial data generation, but existing approaches lack causal reasoning capabilities essential for counterfactual analysis and risk assessment. We propose a Time-series Neural Causal Model VAE (TNCM-VAE) that combines variational autoencoders with structural causal models to generate counterfactual financial time series while preserving both temporal dependencies and causal relationships. Our approach enforces causal constraints through directed acyclic graphs in the decoder architecture and employs the causal Wasserstein distance for training. We validate our method on synthetic autoregressive models inspired by the Ornstein-Uhlenbeck process, demonstrating superior performance in counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth. The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.

[LG-18] Data-driven uncertainty-aware seakeeping prediction of the Delft 372 catamaran using ensemble Hankel dynamic mode decomposition

链接: https://arxiv.org/abs/2511.04461
作者: Giorgio Palma,Andrea Serani,Matteo Diez
类目: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this study, we present and validate an ensemble-based Hankel Dynamic Mode Decomposition with control (HDMDc) for uncertainty-aware seakeeping predictions of a high-speed catamaran, namely the Delft 372 model. Experimental measurements (time histories) of wave elevation at the longitudinal center of gravity, heave, pitch, notional flight-deck velocity, notional bridge acceleration, and total resistance were collected from irregular wave basin tests on a 1:33.3 scale replica of the Delft 372 model under sea state 5 conditions at Fr = 0.425, and organized into training, validation, and test sets. The HDMDc algorithm constructs an equation-free linear reduced-order model of the seakeeping vessel by augmenting states and inputs with their time-lagged copies to capture nonlinear and memory effects. Two ensembling strategies are compared for seakeeping prediction and uncertainty quantification: Bayesian HDMDc (BHDMDc), which treats hyperparameters as stochastic variables with prior distributions and samples them to produce posterior mean forecasts with confidence intervals, and Frequentist HDMDc (FHDMDc), which aggregates multiple models obtained over data subsets. The FHDMDc approach is found to improve the accuracy of the predictions compared to the deterministic counterpart, also providing robust uncertainty estimation; whereas the application of BHDMDc to the present test case is not found beneficial in comparison to the deterministic model. FHDMDc-derived probability density functions for the motions closely match both experimental data and URANS results, demonstrating reliable and computationally efficient seakeeping prediction for design and operational support.

[LG-19] Federated Stochastic Minimax Optimization under Heavy-Tailed Noises

链接: https://arxiv.org/abs/2511.04456
作者: Xinwen Zhang,Hongchang Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heavy-tailed noise has attracted growing attention in nonconvex stochastic optimization, as numerous empirical studies suggest it offers a more realistic assumption than standard bounded variance assumption. In this work, we investigate nonconvex-PL minimax optimization under heavy-tailed gradient noise in federated learning. We propose two novel algorithms: Fed-NSGDA-M, which integrates normalized gradients, and FedMuon-DA, which leverages the Muon optimizer for local updates. Both algorithms are designed to effectively address heavy-tailed noise in federated minimax optimization, under a milder condition. We theoretically establish that both algorithms achieve a convergence rate of O(1/(TNp)^{\frac{s-1}{2s}}). To the best of our knowledge, these are the first federated minimax optimization algorithms with rigorous theoretical guarantees under heavy-tailed noise. Extensive experiments further validate their effectiveness.
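归一化梯度是 Fed-NSGDA-M 对抗重尾噪声的核心部件之一。下面在一个简单鞍点问题上演示其作用:对随机梯度做归一化后,单步更新被限幅,不会被偶发的重尾大梯度支配。该示意为单机玩具实验,不含联邦平均、动量等论文要素。

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grads(x, y):
    """鞍点问题 f(x,y)=0.5x^2+xy-0.5y^2 的梯度 (x+y, x-y),叠加重尾Pareto噪声。"""
    noise = rng.pareto(1.5, 2) - rng.pareto(1.5, 2)
    return np.array([x + y, x - y]) + noise

for normalize in (False, True):
    x, y, eta = 3.0, -2.0, 0.05
    for _ in range(2000):
        gx, gy = noisy_grads(x, y)
        g = np.array([gx, -gy])                  # 下降-上升方向
        if normalize:
            g = g / (np.linalg.norm(g) + 1e-12)  # 归一化:更新步长被限幅
        x, y = x - eta * g[0], y - eta * g[1]
    print(f"normalize={normalize}:  |x|={abs(x):.3f}, |y|={abs(y):.3f}")
```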

[LG-20] Fitting Reinforcement Learning Model to Behavioral Data under Bandits

链接: https://arxiv.org/abs/2511.04454
作者: Hao Zhu,Jasper Hoffmann,Baohe Zhang,Joschka Boedecker
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:We consider the problem of fitting a reinforcement learning (RL) model to some given behavioral data under a multi-armed bandit environment. These models have received much attention in recent years for characterizing human and animal decision making behavior. We provide a generic mathematical optimization problem formulation for the fitting problem of a wide range of RL models that appear frequently in scientific research applications, followed by a detailed theoretical analysis of its convexity properties. Based on the theoretical results, we introduce a novel solution method for the fitting problem of RL models based on convex relaxation and optimization. Our method is then evaluated in several simulated bandit environments to compare with some benchmark methods that appear in the literature. Numerical results indicate that our method achieves comparable performance to the state-of-the-art, while significantly reducing computation time. We also provide an open-source Python package for our proposed method to empower researchers to apply it in the analysis of their datasets directly, without prior knowledge of convex optimization.
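论文所拟合的这类 RL 行为模型,最常见的形式是 Rescorla–Wagner Q 学习加 softmax 选择规则。下面用极大似然拟合合成的老虎机选择数据作示意;论文提出的凸松弛解法此处以通用的 L-BFGS-B 数值优化代替。

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, choices, rewards, n_arms=2):
    """Q学习 + softmax 选择模型的负对数似然(示意)。"""
    alpha, beta = params
    q, nll = np.zeros(n_arms), 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * q - np.max(beta * q))
        p /= p.sum()
        nll -= np.log(p[c] + 1e-12)
        q[c] += alpha * (r - q[c])          # 预测误差更新
    return nll

# 生成假想行为数据(真实参数 alpha=0.3, beta=3.0)并拟合
rng = np.random.default_rng(0)
true_alpha, true_beta, probs = 0.3, 3.0, [0.2, 0.8]
q, choices, rewards = np.zeros(2), [], []
for _ in range(500):
    p = np.exp(true_beta * q); p /= p.sum()
    c = rng.choice(2, p=p); r = float(rng.random() < probs[c])
    choices.append(c); rewards.append(r); q[c] += true_alpha * (r - q[c])

res = minimize(neg_log_lik, x0=[0.5, 1.0], args=(choices, rewards),
               bounds=[(0.01, 1.0), (0.01, 20.0)], method="L-BFGS-B")
print("估计的 (alpha, beta):", res.x)
```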

[LG-21] ForecastGAN: A Decomposition-Based Adversarial Framework for Multi-Horizon Time Series Forecasting

链接: https://arxiv.org/abs/2511.04445
作者: Syeda Sitara Wishal Fatima,Afshin Rahimi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Portions of this work were previously published in the author’s Master’s thesis at University of Windsor (2024)

点击查看摘要

Abstract:Time series forecasting is essential across domains from finance to supply chain management. This paper introduces ForecastGAN, a novel decomposition based adversarial framework addressing limitations in existing approaches for multi-horizon predictions. Although transformer models excel in long-term forecasting, they often underperform in short-term scenarios and typically ignore categorical features. ForecastGAN operates through three integrated modules: a Decomposition Module that extracts seasonality and trend components; a Model Selection Module that identifies optimal neural network configurations based on forecasting horizon; and an Adversarial Training Module that enhances prediction robustness through Conditional Generative Adversarial Network training. Unlike conventional approaches, ForecastGAN effectively integrates both numerical and categorical features. We validate our framework on eleven benchmark multivariate time series datasets that span various forecasting horizons. The results show that ForecastGAN consistently outperforms state-of-the-art transformer models for short-term forecasting while remaining competitive for long-term horizons. This research establishes a more generalizable approach to time series forecasting that adapts to specific contexts while maintaining strong performance across diverse data characteristics without extensive hyperparameter tuning.
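分解模块的作用可以用经典的加法分解直观说明:把序列拆成趋势、季节与残差,再分别建模。下面用 statsmodels 的 seasonal_decompose 在合成月度序列上演示,仅为概念示意,并非 ForecastGAN 的分解实现。

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
t = np.arange(144)
# 合成序列 = 线性趋势 + 周期12的季节项 + 噪声
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.3, size=144)
series = pd.Series(y, index=pd.date_range("2013-01", periods=144, freq="MS"))

parts = seasonal_decompose(series, model="additive", period=12)
print(parts.trend.dropna().head())     # 趋势分量(边缘处因滑动平均为NaN)
print(parts.seasonal.head())           # 季节分量
```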

[LG-22] Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

链接: https://arxiv.org/abs/2511.04355
作者: Amir Molzam Sharifloo,Maedeh Heydari,Parsa Kazerooni,Daniel Maninger,Mira Mezini
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: To be published in Proceedings of 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware), Data Benchmark Track

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve - information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weaknesses in LLMs, as well as common complications within benchmark tasks that most often lead to failure.

[LG-23] DeepPAAC: A New Deep Galerkin Method for Principal-Agent Problems

链接: https://arxiv.org/abs/2511.04309
作者: Michael Ludkovski,Changgen Xie,Zimu Zhu
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider numerical resolution of principal-agent (PA) problems in continuous time. We formulate a generic PA model with continuous and lump payments and a multi-dimensional strategy of the agent. To tackle the resulting Hamilton-Jacobi-Bellman equation with an implicit Hamiltonian we develop a novel deep learning method: the Deep Principal-Agent Actor Critic (DeepPAAC) Actor-Critic algorithm. DeepPAAC is able to handle multi-dimensional states and controls, as well as constraints. We investigate the role of the neural network architecture, training designs, loss functions, etc. on the convergence of the solver, presenting five different case studies.

[LG-24] Guided by Stars: Interpretable Concept Learning Over Time Series via Temporal Logic Semantics

链接: https://arxiv.org/abs/2511.04244
作者: Irene Ferfoglia,Simone Silvetti,Gaia Saveri,Laura Nenzi,Luca Bortolussi
类目: Machine Learning (cs.LG)
*备注: submitted to Journal of Artificial Intelligence Research (JAIR), 2025

点击查看摘要

Abstract:Time series classification is a task of paramount importance, as this kind of data often arises in safety-critical applications. However, it is typically tackled with black-box deep learning methods, making it hard for humans to understand the rationale behind their output. To take on this challenge, we propose a novel approach, STELLE (Signal Temporal logic Embedding for Logically-grounded Learning and Explanation), a neuro-symbolic framework that unifies classification and explanation through direct embedding of trajectories into a space of temporal logic concepts. By introducing a novel STL-inspired kernel that maps raw time series to their alignment with predefined STL formulae, our model jointly optimises accuracy and interpretability, as each prediction is accompanied by the most relevant logical concepts that characterise it. This yields (i) local explanations as human-readable STL conditions justifying individual predictions, and (ii) global explanations as class-characterising formulae. Experiments demonstrate that STELLE achieves competitive accuracy while providing logically faithful explanations, validated on diverse real-world benchmarks.
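STELLE 核函数的基础是 STL 的定量语义:公式对轨迹的"鲁棒度"为正表示满足、为负表示违反。下面手写两个最简单时序算子(G 与 F)的鲁棒度计算,并把它们拼成可解释的概念特征向量;这只是思想示意,并非论文的核实现。

```python
import numpy as np

def rob_always_lt(x, c):
    """G (x < c) 的鲁棒度:全程裕量的最小值。"""
    return np.min(c - x)

def rob_eventually_gt(x, c):
    """F (x > c) 的鲁棒度:某时刻超出量的最大值。"""
    return np.max(x - c)

x = np.sin(np.linspace(0, 6.28, 100)) * 0.8
features = np.array([rob_always_lt(x, 1.0),       # 始终低于1.0 -> 正值
                     rob_eventually_gt(x, 0.5)])  # 某时刻超过0.5 -> 正值
print(features)   # 这类鲁棒度向量即可作为可解释的"概念"特征供分类器使用
```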

[LG-25] ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads

链接: https://arxiv.org/abs/2511.04162
作者: Xiaokai Wang,Shaoyuan Huang,Yuting Li,Xiaofei Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) form the cornerstone of modern AI services, supporting a wide range of applications, including autonomous driving, chatbots, and recommendation systems. As models increase in size and complexity, DNN workloads like training and inference tasks impose unprecedented demands on distributed computing resources, making the accurate prediction of runtime essential for optimizing development and resource allocation. Traditional methods rely on additive computational unit models, limiting their accuracy and generalizability. In contrast, graph-enhanced modeling improves performance but significantly increases data collection costs. Therefore, there is a critical need for a method that strikes a balance between accuracy, generalizability, and the costs of data collection. To address these challenges, we propose ScaleDL, a novel runtime prediction framework that combines nonlinear layer-wise modeling with graph neural network (GNN)-based cross-layer interaction mechanism, enabling accurate DNN runtime prediction and hierarchical generalizability across different network architectures. Additionally, we employ the D-optimal method to reduce data collection costs. Experiments on the workloads of five popular DNN models prove that ScaleDL enhances runtime prediction accuracy and generalizability, achieving 6 \times lower MRE and 5 \times lower RMSE compared to baseline models.

[LG-26] On Joint Regularization and Calibration in Deep Ensembles

链接: https://arxiv.org/abs/2511.04160
作者: Laurits Fredsgaard(1),Mikkel N. Schmidt(1) ((1) Department of Applied Mathematics and Computer Science, Technical University of Denmark)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 39 pages, 8 figures, 11 tables

点击查看摘要

Abstract:Deep ensembles are a powerful tool in machine learning, improving both model performance and uncertainty calibration. While ensembles are typically formed by training and tuning models individually, evidence suggests that jointly tuning the ensemble can lead to better performance. This paper investigates the impact of jointly tuning weight decay, temperature scaling, and early stopping on both predictive performance and uncertainty quantification. Additionally, we propose a partially overlapping holdout strategy as a practical compromise between enabling joint evaluation and maximizing the use of data for training. Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We highlight the trade-offs between individual and joint optimization in deep ensemble training, with the overlapping holdout strategy offering an attractive practical solution. We believe our findings provide valuable insights and guidance for practitioners looking to optimize deep ensemble models. Code is available at: this https URL
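摘要中联合调优的三个旋钮之一是温度缩放(temperature scaling)。下面给出其标准做法的示意:在保留验证集上优化单一温度 T 使负对数似然最小;logits 为模拟"过度自信"网络输出的合成数据,并非论文实验设置。

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    """温度缩放后的负对数似然:softmax(logits / T)。"""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=400)
logits = rng.normal(size=(400, 5))
logits[np.arange(400), labels] += 2.5     # 让正确类得分偏高
logits *= 3.0                             # 人为放大logits,模拟过度自信

res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels),
                      method="bounded")
print(f"最优温度 T = {res.x:.2f}")         # T > 1 说明原输出过度自信,需"降温"校准
```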

[LG-27] Deep Learning Approach for Clinical Risk Identification Using Transformer Modeling of Heterogeneous EHR Data

链接: https://arxiv.org/abs/2511.04158
作者: Anzhuo Xie,Wei-Chen Chang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study proposes a Transformer-based longitudinal modeling method to address challenges in clinical risk classification with heterogeneous Electronic Health Record (EHR) data, including irregular temporal patterns, large modality differences, and complex semantic structures. The method takes multi-source medical features as input and employs a feature embedding layer to achieve a unified representation of structured and unstructured data. A learnable temporal encoding mechanism is introduced to capture dynamic evolution under uneven sampling intervals. The core model adopts a multi-head self-attention structure to perform global dependency modeling on longitudinal sequences, enabling the aggregation of long-term trends and short-term fluctuations across different temporal scales. To enhance semantic representation, a semantic-weighted pooling module is designed to assign adaptive importance to key medical events, improving the discriminative ability of risk-related features. Finally, a linear mapping layer generates individual-level risk scores. Experimental results show that the proposed model outperforms traditional machine learning and temporal deep learning models in accuracy, recall, precision, and F1-Score, achieving stable and precise risk identification in multi-source heterogeneous EHR environments and providing an efficient and reliable framework for clinical intelligent decision-making.

[LG-28] Learning to Land Anywhere: Transferable Generative Models for Aircraft Trajectories

链接: https://arxiv.org/abs/2511.04155
作者: Olav Finne Praesteng Larsen,Massimiliano Ruocco,Michail Spitieris,Abdulmajid Murad,Martina Ragosta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Access to trajectory data is a key requirement for developing and validating Air Traffic Management (ATM) solutions, yet many secondary and regional airports face severe data scarcity. This limits the applicability of machine learning methods and the ability to perform large-scale simulations or “what-if” analyses. In this paper, we investigate whether generative models trained on data-rich airports can be efficiently adapted to data-scarce airports using transfer learning. We adapt state-of-the-art diffusion- and flow-matching-based architectures to the aviation domain and evaluate their transferability between Zurich (source) and Dublin (target) landing trajectory datasets. Models are pretrained on Zurich and fine-tuned on Dublin with varying amounts of local data, ranging from 0% to 100%. Results show that diffusion-based models achieve competitive performance with as little as 5% of the Dublin data and reach baseline-level performance around 20%, consistently outperforming models trained from scratch across metrics and visual inspections. Latent flow matching and latent diffusion models also benefit from pretraining, though with more variable gains, while flow matching models show weaker generalization. Despite challenges in capturing rare trajectory patterns, these findings demonstrate the potential of transfer learning to substantially reduce data requirements for trajectory generation in ATM, enabling realistic synthetic data generation even in environments with limited historical records.

[LG-29] Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning

链接: https://arxiv.org/abs/2511.04147
作者: Jiaming Zhang,Yujie Yang,Haoning Wang,Liping Zhang,Shengbo Eben Li
类目: Machine Learning (cs.LG)
*备注: Submitted to the Journal of Machine Learning Research (JMLR), under review

点击查看摘要

Abstract:Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints with violations exceeding the predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, strategies trained via EPO achieve performance comparable to optimal solutions with global constraint violations strictly remaining within a prescribed bound.
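EPO 的"扩张-删除"交换规则与经典半无限规划的交换法同源。下面在一个玩具半无限约束优化问题上演示该规则:每轮加入违反最严重的约束、删除松弛约束,用有限工作集逼近无限约束;此处不含策略梯度训练,仅示意约束管理逻辑。

```python
import numpy as np
from scipy.optimize import minimize

# 玩具半无限规划: min (x1-1)^2 + (x2-1)^2
#   s.t. g(x,t) = x1*t^2 + x2*t - 1 <= 0, 对所有 t ∈ [0,1]
grid = np.linspace(0.0, 1.0, 201)
def g(x, t):
    return x[0] * t**2 + x[1] * t - 1.0

x, working_set, tol = np.array([1.0, 1.0]), [], 1e-6
for it in range(20):
    viol = np.array([g(x, t) for t in grid])
    if viol.max() <= tol:
        break
    working_set.append(grid[int(np.argmax(viol))])       # 扩张:加入违反最严重的约束
    cons = [{"type": "ineq", "fun": lambda xx, tt=t: -g(xx, tt)}
            for t in working_set]
    x = minimize(lambda xx: (xx[0] - 1)**2 + (xx[1] - 1)**2, x,
                 constraints=cons, method="SLSQP").x
    # 删除:移除当前解处松弛(近似零乘子)的约束,防止工作集无限增长
    working_set = [t for t in working_set if g(x, t) > -1e-4]
print("解:", x, " 活跃约束 t:", working_set)
```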

[LG-30] Exploring the Feasibility of End-to-End Large Language Model as a Compiler IJCNN2025

链接: https://arxiv.org/abs/2511.04132
作者: Hongbin Zhang,Shihao Gao,Yang Liu,Mingjie Xing,Yanjun Wu,Chen Zhao
类目: Machine Learning (cs.LG)
*备注: This work has been accepted by IJCNN 2025 and submitted to the IEEE for publication

点击查看摘要

Abstract:In recent years, end-to-end Large Language Model (LLM) technology has shown substantial advantages across various domains. As critical system software and infrastructure, compilers are responsible for transforming source code into target code. While LLMs have been leveraged to assist in compiler development and maintenance, their potential as an end-to-end compiler remains largely unexplored. This paper explores the feasibility of LLM as a Compiler (LaaC) and its future directions. We designed the CompilerEval dataset and framework specifically to evaluate the capabilities of mainstream LLMs in source code comprehension and assembly code generation. In the evaluation, we analyzed various errors, explored multiple methods to improve LLM-generated code, and evaluated cross-platform compilation capabilities. Experimental results demonstrate that LLMs exhibit basic capabilities as compilers but currently achieve low compilation success rates. By optimizing prompts, scaling up the model, and incorporating reasoning methods, the quality of assembly code generated by LLMs can be significantly enhanced. Based on these findings, we maintain an optimistic outlook for LaaC and propose practical architectural designs and future research directions. We believe that with targeted training, knowledge-rich prompts, and specialized infrastructure, LaaC has the potential to generate high-quality assembly code and drive a paradigm shift in the field of compilation.

[LG-31] Decomposable Neuro Symbolic Regression

链接: https://arxiv.org/abs/2511.04124
作者: Giorgio Morales,John W. Sheppard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Symbolic regression (SR) models complex systems by discovering mathematical expressions that capture underlying relationships in observed data. However, most SR methods prioritize minimizing prediction error over identifying the governing equations, often producing overly complex or inaccurate expressions. To address this, we present a decomposable SR method that generates interpretable multivariate expressions leveraging transformer models, genetic algorithms (GAs), and genetic programming (GP). In particular, our explainable SR method distills a trained "opaque" regression model into mathematical expressions that serve as explanations of its computed function. Our method employs a Multi-Set Transformer to generate multiple univariate symbolic skeletons that characterize how each variable influences the opaque model’s response. We then evaluate the generated skeletons’ performance using a GA-based approach to select a subset of high-quality candidates before incrementally merging them via a GP-based cascade procedure that preserves their original skeleton structure. The final multivariate skeletons undergo coefficient optimization via a GA. We evaluated our method on problems with controlled and varying degrees of noise, demonstrating lower or comparable interpolation and extrapolation errors compared to two GP-based methods, three neural SR methods, and a hybrid approach. Unlike them, our approach consistently learned expressions that matched the original mathematical structure.

[LG-32] KoTaP: A Panel Dataset for Corporate Tax Avoidance Performance and Governance in Korea

链接: https://arxiv.org/abs/2511.04094
作者: Hyungjong Na,Wonho Song,Seungyong Han,Donghyeon Jo,Sejin Myung,Hyungjoon Kim
类目: Machine Learning (cs.LG)
*备注: 18 pages, 3 figures, 8 tables. Submitted to Scientific Data; currently under review. Data and codebook available at Zenodo (DOI: https://doi.org/10.5281/zenodo.17149808 )

点击查看摘要

Abstract:This study introduces the Korean Tax Avoidance Panel (KoTaP), a long-term panel dataset of non-financial firms listed on KOSPI and KOSDAQ between 2011 and 2024. After excluding financial firms, firms with non-December fiscal year ends, capital impairment, and negative pre-tax income, the final dataset consists of 12,653 firm-year observations from 1,754 firms. KoTaP is designed to treat corporate tax avoidance as a predictor variable and link it to multiple domains, including earnings management (accrual- and activity-based), profitability (ROA, ROE, CFO, LOSS), stability (LEV, CUR, SIZE, PPE, AGE, INVREC), growth (GRW, MB, TQ), and governance (BIG4, FORN, OWN). Tax avoidance itself is measured using complementary indicators: cash effective tax rate (CETR), GAAP effective tax rate (GETR), and book-tax difference measures (TSTA, TSDA), with adjustments to ensure interpretability. A key strength of KoTaP is its balanced panel structure with standardized variables and its consistency with international literature on the distribution and correlation of core indicators. At the same time, it reflects distinctive institutional features of Korean firms, such as concentrated ownership, high foreign shareholding, and elevated liquidity ratios, providing both international comparability and contextual uniqueness. KoTaP enables applications in benchmarking econometric and deep learning models, external validity checks, and explainable AI analyses. It further supports policy evaluation, audit planning, and investment analysis, making it a critical open resource for accounting, finance, and interdisciplinary research.

[LG-33] Learning Filter-Aware Distance Metrics for Nearest Neighbor Search with Multiple Filters

链接: https://arxiv.org/abs/2511.04073
作者: Ananya Sutradhar,Suryansh Gupta,Ravishankar Krishnaswamy,Haiyang Xu,Aseem Rastogi,Gopal Srinivasa
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
*备注: 1st Workshop on Vector Databases at International Conference on Machine Learning, 2025

点击查看摘要

Abstract:Filtered Approximate Nearest Neighbor (ANN) search retrieves the closest vectors for a query vector from a dataset. It enforces that a specified set of discrete labels S for the query must be included in the labels of each retrieved vector. Existing graph-based methods typically incorporate filter awareness by assigning fixed penalties or prioritizing nodes based on filter satisfaction. However, since these methods use fixed, data-independent penalties, they often fail to generalize across datasets with diverse label and vector distributions. In this work, we propose a principled alternative that learns the optimal trade-off between vector distance and filter match directly from the data, rather than relying on fixed penalties. We formulate this as a constrained linear optimization problem, deriving weights that better reflect the underlying filter distribution and more effectively address the filtered ANN search problem. These learned weights guide both the search process and index construction, leading to graph structures that more effectively capture the underlying filter distribution and filter semantics. Our experiments demonstrate that adapting the distance function to the data significantly improves accuracy by 5-10% over fixed-penalty methods, providing a more flexible and generalizable framework for the filtered ANN search problem.
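As a rough illustration of the idea, here is a minimal sketch of filter-aware scoring: a learned weight `w` trades off vector distance against unmet filter labels. In the paper `w` comes from a constrained linear optimization; here it is simply a given scalar, and all names are hypothetical.

```python
import numpy as np

def filtered_score(query_vec, query_labels, cand_vec, cand_labels, w):
    """Combined score: vector distance plus a learned penalty per unmet label."""
    dist = np.linalg.norm(query_vec - cand_vec)
    missing = len(set(query_labels) - set(cand_labels))  # filter labels not satisfied
    return dist + w * missing

def search(query_vec, query_labels, candidates, w, k=10):
    """Rank candidates (list of (vector, label_set) pairs) by the learned score."""
    scored = [(filtered_score(query_vec, query_labels, v, s, w), i)
              for i, (v, s) in enumerate(candidates)]
    return sorted(scored)[:k]
```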

[LG-34] Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training

链接: https://arxiv.org/abs/2511.04040
作者: Xiaoling Luo,Peng Chen,Chengliang Liu,Xiaopeng Jin,Jie Wen,Yumeng Liu,Junsong Wang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre-training mechanisms. To acquire complex protein information, we introduce reconstructive pre-training to mine more fine-grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi-label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction. Our proposed DSRPGO model achieves significant improvements on BPO, MFO, and CCO for human datasets, thereby outperforming other benchmark models.

[LG-35] Use of Continuous Glucose Monitoring with Machine Learning to Identify Metabolic Subphenotypes and Inform Precision Lifestyle Changes

链接: https://arxiv.org/abs/2511.03986
作者: Ahmed A. Metwally,Heyjun Park,Yue Wu,Tracey McLaughlin,Michael P. Snyder
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:The classification of diabetes and prediabetes by static glucose thresholds obscures the pathophysiological dysglycemia heterogeneity, primarily driven by insulin resistance (IR), beta-cell dysfunction, and incretin deficiency. This review demonstrates that continuous glucose monitoring and wearable technologies enable a paradigm shift towards non-invasive, dynamic metabolic phenotyping. We show evidence that machine learning models can leverage high-resolution glucose data from at-home, CGM-enabled oral glucose tolerance tests to accurately predict gold-standard measures of muscle IR and beta-cell function. This personalized characterization extends to real-world nutrition, where an individual’s unique postprandial glycemic response (PPGR) to standardized meals, such as the relative glucose spike to potatoes versus grapes, could serve as a biomarker for their metabolic subtype. Moreover, integrating wearable data reveals that habitual diet, sleep, and physical activity patterns, particularly their timing, are uniquely associated with specific metabolic dysfunctions, informing precision lifestyle interventions. The efficacy of dietary mitigators in attenuating PPGR is also shown to be phenotype-dependent. Collectively, this evidence demonstrates that CGM can deconstruct the complexity of early dysglycemia into distinct, actionable subphenotypes. This approach moves beyond simple glycemic control, paving the way for targeted nutritional, behavioral, and pharmacological strategies tailored to an individual’s core metabolic defects, thereby paving the way for a new era of precision diabetes prevention.

[LG-36] TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

链接: https://arxiv.org/abs/2511.03983
作者: Michael Menezes,Barbara Su,Xinze Feng,Yehya Farhat,Hamza Shili,Anastasios Kyrillidis
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks (“golden tickets”) without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
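A schematic of one TwIST-style round, assuming user-supplied `make_mask` and `train_subnet` routines and simple parameter averaging as the aggregation rule; the authors' exact masking and aggregation protocol may differ.

```python
import copy
import torch

def twist_round(model, make_mask, train_subnet, num_workers=4, local_steps=100):
    """One round (sketch): train masked copies of the model, then average parameters.

    `make_mask` samples a sparse subnetwork for a replica; `train_subnet` runs
    `local_steps` of training restricted to that mask. Both are assumptions here.
    """
    replicas = []
    for _ in range(num_workers):
        replica = copy.deepcopy(model)
        mask = make_mask(replica)              # sample a sparse subnetwork
        train_subnet(replica, mask, local_steps)
        replicas.append(replica)
    # Aggregate: simple parameter averaging across replicas.
    with torch.no_grad():
        for name, p in model.named_parameters():
            stacked = torch.stack([dict(r.named_parameters())[name] for r in replicas])
            p.copy_(stacked.mean(0))
    return model
```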

[LG-37] Structural Priors and Modular Adapters in the Composable Fine-Tuning Algorithm of Large-Scale Models

链接: https://arxiv.org/abs/2511.03981
作者: Yuxiao Wang,Di Wu,Feng Liu,Zhimin Qiu,Chenrui Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a composable fine-tuning method that integrates graph structural priors with modular adapters to address the high computational cost and structural instability faced by large-scale pre-trained models in multi-task adaptation. The method introduces a relation matrix to model dependencies among tasks, explicitly encoding correlations between nodes and paths into graph structural priors, which provide unified structural constraints for adapter weight allocation and path selection. Modular adapters are embedded into different layers through low-rank mapping and a pluggable mechanism, enabling efficient cross-task composition and reuse under prior guidance. This mechanism not only improves parameter efficiency and training stability but also alleviates path conflicts and redundant computation in multi-task scenarios. Furthermore, experiments on hyperparameter sensitivity, environmental sensitivity, and data sensitivity are conducted to systematically analyze key factors such as routing temperature, gating thresholds, and relation matrix regularization strength, verifying the consistency and superior performance of the method under structural constraints. The results demonstrate that the proposed framework significantly enhances task prediction accuracy, adapter weight allocation precision, and overall computational efficiency while maintaining a lightweight model design, highlighting the synergistic advantages of graph priors and modular mechanisms in composable fine-tuning.

[LG-38] Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models

链接: https://arxiv.org/abs/2511.03972
作者: Semih Cayci
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.
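For concreteness, here is one damped stochastic Gauss-Newton step on a mini-batch for a squared-loss regression objective, a sketch using the usual Levenberg-Marquardt form (J^T J + lam I)^{-1} J^T r rather than the paper's exact variable-metric scheme; `jacobian_fn` and `residual_fn` are assumed user-supplied.

```python
import numpy as np

def sgn_step(theta, jacobian_fn, residual_fn, batch, lam=1e-2, lr=1.0):
    """One stochastic Gauss-Newton step with Levenberg-Marquardt damping."""
    J = jacobian_fn(theta, batch)   # shape (batch_size, num_params)
    r = residual_fn(theta, batch)   # shape (batch_size,)
    # Damped Gauss-Newton system: (J^T J / B + lam * I) d = J^T r / B
    A = J.T @ J / len(batch) + lam * np.eye(J.shape[1])
    g = J.T @ r / len(batch)
    return theta - lr * np.linalg.solve(A, g)
```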

[LG-39] PrivacyCD: Hierarchical Unlearning for Protecting Student Privacy in Cognitive Diagnosis

链接: https://arxiv.org/abs/2511.03966
作者: Mingliang Hou,Yinuo Wang,Teng Guo,Zitao Liu,Wenzhou Dou,Jiaqi Zheng,Renqiang Luo,Mi Tian,Weiqi Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The need to remove specific student data from cognitive diagnosis (CD) models has become a pressing requirement, driven by users’ growing assertion of their “right to be forgotten”. However, existing CD models are largely designed without privacy considerations and lack effective data unlearning mechanisms. Directly applying general-purpose unlearning algorithms is suboptimal, as they struggle to balance unlearning completeness, model utility, and efficiency when confronted with the unique heterogeneous structure of CD models. To address this, our paper presents the first systematic study of the data unlearning problem for CD models, proposing a novel and efficient algorithm: hierarchical importance-guided forgetting (HIF). Our key insight is that parameter importance in CD models exhibits distinct layer-wise characteristics. HIF leverages this via an innovative smoothing mechanism that combines individual- and layer-level importance, enabling a more precise distinction of parameters associated with the data to be unlearned. Experiments on three real-world datasets show that HIF significantly outperforms baselines on key metrics, offering the first effective solution for CD models to respond to user data removal requests and for deploying high-performance, privacy-preserving AI systems.

[LG-40] Conditional Score Learning for Quickest Change Detection in Markov Transition Kernels

链接: https://arxiv.org/abs/2511.03953
作者: Wuxia Chen,Taposh Banerjee,Vahid Tarokh
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We address the problem of quickest change detection in Markov processes with unknown transition kernels. The key idea is to learn the conditional score \nabla_{\mathbf{y}} \log p(\mathbf{y} \mid \mathbf{x}) directly from sample pairs (\mathbf{x}, \mathbf{y}), where both \mathbf{x} and \mathbf{y} are high-dimensional data generated by the same transition kernel. In this way, we avoid explicit likelihood evaluation and provide a practical way to learn the transition dynamics. Based on this estimation, we develop a score-based CUSUM procedure that uses conditional Hyvärinen score differences to detect changes in the kernel. To ensure bounded increments, we propose a truncated version of the statistic. With Hoeffding’s inequality for uniformly ergodic Markov processes, we prove exponential lower bounds on the mean time to false alarm. We also prove asymptotic upper bounds on detection delay. These results give both theoretical guarantees and practical feasibility for score-based detection in high-dimensional Markov models.
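A minimal sketch of the truncated score-based CUSUM recursion, assuming learned functions `score_pre`/`score_post` that return conditional Hyvärinen scores for a transition (x, y) under the pre- and post-change models; the truncation bound and alarm threshold are placeholders.

```python
def cusum_update(W, x, y, score_pre, score_post, clip=10.0):
    """One truncated CUSUM update from conditional Hyvarinen score differences.

    Lower score means better fit, so the increment is positive (on average)
    after the change. Truncation keeps increments bounded, as in the paper's
    truncated statistic.
    """
    inc = score_pre(x, y) - score_post(x, y)
    inc = max(-clip, min(clip, inc))   # truncation
    return max(0.0, W + inc)           # CUSUM recursion; alarm when W > threshold b
```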

[LG-41] LogHD: Robust Compression of Hyperdimensional Classifiers via Logarithmic Class-Axis Reduction DATE2026

链接: https://arxiv.org/abs/2511.03938
作者: Sanggeon Yun,Hyunwoo Oh,Ryozo Masukawa,Pietro Mercati,Nathaniel D. Bastian,Mohsen Imani
类目: Machine Learning (cs.LG)
*备注: Accepted to DATE 2026

点击查看摘要

Abstract:Hyperdimensional computing (HDC) suits memory-, energy-, and reliability-constrained systems, yet the standard “one prototype per class” design requires O(CD) memory (with C classes and dimensionality D). Prior compaction reduces D (feature axis), improving storage/compute but weakening robustness. We introduce LogHD, a logarithmic class-axis reduction that replaces the C per-class prototypes with n \approx \lceil\log_k C\rceil bundle hypervectors (alphabet size k) and decodes in an n-dimensional activation space, cutting memory to O(D\log_k C) while preserving D. LogHD uses a capacity-aware codebook and profile-based decoding, and composes with feature-axis sparsification. Across datasets and injected bit flips, LogHD attains competitive accuracy with smaller models and higher resilience at matched memory. Under equal memory, it sustains target accuracy at roughly 2.5-3.0\times higher bit-flip rates than feature-axis compression; an ASIC instantiation delivers 498\times energy efficiency and 62.6\times speedup over an AMD Ryzen 9 9950X and 24.3\times/6.58\times over an NVIDIA RTX 4090, and is 4.06\times more energy-efficient and 2.19\times faster than a feature-axis HDC ASIC baseline.
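To make the class-axis reduction concrete, here is a small sketch of how base-k digit codes let n = ceil(log_k C) bundle slots address C classes; this is an illustrative encoding, not LogHD's learned capacity-aware codebook.

```python
import numpy as np

def class_codes(C, k):
    """Map each class c in [0, C) to a base-k digit string of length
    n = ceil(log_k C); digit j selects one of k atomic hypervectors
    in bundle slot j."""
    n = int(np.ceil(np.log(C) / np.log(k)))
    codes = []
    for c in range(C):
        digits = []
        for _ in range(n):
            digits.append(c % k)
            c //= k
        codes.append(digits)
    return codes  # C lists of n digits, each in {0, ..., k-1}

# Only n bundle hypervectors of dimension D are stored instead of C
# prototypes, i.e. O(D * log_k C) memory rather than O(C * D).
```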

[LG-42] SynQuE: Estimating Synthetic Dataset Quality Without Annotations

链接: https://arxiv.org/abs/2511.03928
作者: Arthur Chen,Victor Zhong
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical and open challenge where data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem by introducing and evaluating proxy metrics that choose synthetic data for training to maximize task performance on real data. We introduce the first proxy metrics for SynQuE by adapting distribution and diversity-based distance measures to our context via embedding models. To address the shortcomings of these metrics on complex planning tasks, we propose LENS, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with LENS consistently outperforming others on complex tasks by capturing nuanced characteristics. For instance, on text-to-SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies can raise accuracy from 30.4% to 38.4% (+8.1) on average compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation model-based data characterization and fine-grained data selection.

[LG-43] On Predicting Sociodemographics from Mobility Signals

链接: https://arxiv.org/abs/2511.03924
作者: Ekin Uğurel,Cynthia Chen,Brian H. Y. Lee,Filipe Rodrigues
类目: Machine Learning (cs.LG)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:Inferring sociodemographic attributes from mobility data could help transportation planners better leverage passively collected datasets, but this task remains difficult due to weak and inconsistent relationships between mobility patterns and sociodemographic traits, as well as limited generalization across contexts. We address these challenges from three angles. First, to improve predictive accuracy while retaining interpretability, we introduce a behaviorally grounded set of higher-order mobility descriptors based on directed mobility graphs. These features capture structured patterns in trip sequences, travel modes, and social co-travel, and significantly improve prediction of age, gender, income, and household structure over baseline features. Second, we introduce metrics and visual diagnostic tools that encourage agreement between model confidence and accuracy, enabling planners to quantify uncertainty. Third, to improve generalization and sample efficiency, we develop a multitask learning framework that jointly predicts multiple sociodemographic attributes from a shared representation. This approach outperforms single-task models, particularly when training data are limited or when applying models across different time periods (i.e., when the test set distribution differs from the training set).

[LG-44] DecoHD: Decomposed Hyperdimensional Classification under Extreme Memory Budgets DATE2026

链接: https://arxiv.org/abs/2511.03911
作者: Sanggeon Yun,Hyunwoo Oh,Ryozo Masukawa,Mohsen Imani
类目: Machine Learning (cs.LG)
*备注: Accepted to DATE 2026

点击查看摘要

Abstract:Decomposition is a proven way to shrink deep networks without changing I/O. We bring this idea to hyperdimensional computing (HDC), where footprint cuts usually shrink the feature axis and erode concentration and robustness. Prior HDC decompositions decode via fixed atomic hypervectors, which are ill-suited for compressing learned class prototypes. We introduce DecoHD, which learns directly in a decomposed HDC parameterization: a small, shared set of per-layer channels with multiplicative binding across layers and bundling at the end, yielding a large representational space from compact factors. DecoHD compresses along the class axis via a lightweight bundling head while preserving native bind-bundle-score; training is end-to-end, and inference remains pure HDC, aligning with in/near-memory accelerators. In evaluation, DecoHD attains extreme memory savings with only minor accuracy degradation under tight deployment budgets. On average it stays within about 0.1-0.15% of a strong non-reduced HDC baseline (worst case 5.7%), is more robust to random bit-flip noise, reaches its accuracy plateau with up to ~97% fewer trainable parameters, and – in hardware – delivers roughly 277x/35x energy/speed gains over a CPU (AMD Ryzen 9 9950X), 13.5x/3.7x over a GPU (NVIDIA RTX 4090), and 2.0x/2.4x over a baseline HDC ASIC.

[LG-45] Vectorized Computation of Euler Characteristic Functions and Transforms

链接: https://arxiv.org/abs/2511.03909
作者: Jessi Cisewski-Kehe,Brittany Terese Fasy,Alexander McCleary,Eli Quist,Jack Ruder
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:The weighted Euler characteristic transform (WECT) and Euler characteristic function (ECF) have proven to be useful tools in a variety of applications. However, current methods for computing these functions are neither optimized for speed nor do they scale to higher-dimensional settings. In this work, we present a vectorized framework for computing such topological transforms using tensor operations, which is highly optimized for GPU architectures and works in full generality across geometric simplicial complexes (or cubical complexes) of arbitrary dimension. Experimentally, the framework demonstrates significant speedups (up to 180\times) over existing methods when computing the WECT and ECF across a variety of image datasets. Computation of these transforms is implemented in a publicly available Python package called pyECT.
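A CPU-side sketch of the vectorized idea for 2D images: compute the Euler characteristic function over all thresholds at once using the V-construction cubical complex (pixels as vertices, 4-adjacencies as edges, 2x2 blocks as squares), so chi = V - E + F per threshold. The paper's implementation is tensorized for GPUs and arbitrary dimension; this NumPy version only illustrates the broadcasting pattern.

```python
import numpy as np

def ecf_2d(image, thresholds):
    """Euler characteristic of every sublevel set of a 2D grayscale image."""
    img = image[None]                        # (1, H, W)
    t = np.asarray(thresholds)[:, None, None]
    B = img <= t                             # (T, H, W) sublevel masks
    V = B.sum((1, 2))                                             # vertices
    E = ((B[:, :, 1:] & B[:, :, :-1]).sum((1, 2))                 # horizontal edges
         + (B[:, 1:] & B[:, :-1]).sum((1, 2)))                    # vertical edges
    F = (B[:, 1:, 1:] & B[:, 1:, :-1]
         & B[:, :-1, 1:] & B[:, :-1, :-1]).sum((1, 2))            # 2x2 squares
    return V - E + F
```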

[LG-46] Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

链接: https://arxiv.org/abs/2511.03877
作者: Kimia Kazemian(1),Zhenzhen Liu(1),Yangfanyu Yang(2),Katie Z Luo(1),Shuhan Gu(1),Audrey Du(1),Xinyu Yang(2),Jack Jansons(1),Kilian Q Weinberger(1),John Thickstun(1),Yian Yin(2),Sarah Dean(1) ((1) Department of Computer Science, Cornell University (Ithaca, USA), (2) Department of Information Science, Cornell University (Ithaca, USA))
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Social and collaborative platforms emit multivariate time-series traces in which early interactions, such as views, likes, or downloads, are followed, sometimes months or years later, by higher-impact signals like citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets, arXiv (accesses → citations of 2.3M papers) and GitHub (pushes/stars → forks of 3M repositories), and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views → edits), Spotify (streams → concert attendance), e-commerce (click-throughs → purchases), and LinkedIn profiles (views → messages). Our datasets provide ideal testbeds for lead-lag forecasting, by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We documented all technical details of data curation and cleaning, verified the presence of lead-lag dynamics through statistical and classification tests, and benchmarked parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. Our data portal with downloads and documentation is available at this https URL.
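A minimal parametric baseline of the kind such a benchmark could use: predict the lag channel at a fixed horizon from a recent window of the lead channel. Window length, horizon, and the ridge model are arbitrary choices here, not the benchmark's exact baselines.

```python
import numpy as np
from sklearn.linear_model import Ridge

def leadlag_baseline(lead, lag, horizon, window=8):
    """Fit lag[t + horizon] from lead[t - window : t] with ridge regression."""
    X, y = [], []
    for t in range(window, len(lead) - horizon):
        X.append(lead[t - window:t])
        y.append(lag[t + horizon])
    return Ridge(alpha=1.0).fit(np.array(X), np.array(y))
```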

[LG-47] Which Similarity-Sensitive Entropy?

链接: https://arxiv.org/abs/2511.03849
作者: Phuc Nguyen,Josiah Couch,Rahul Bansal,Alexandra Morgan,Chris Tam,Miao Li,Rima Arnaout,Ramy Arnaout
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注: 21 pages, 8 figures

点击查看摘要

Abstract:A canonical step in quantifying a system is to measure its entropy. Shannon entropy and other traditional entropy measures capture only the information encoded in the frequencies of a system’s elements. Recently, Leinster, Cobbold, and Reeve (LCR) introduced a method that also captures the rich information encoded in the similarities and differences among elements, yielding similarity-sensitive entropy. More recently, the Vendi score (VS) was introduced as an alternative, raising the question of how LCR and VS compare, and which is preferable. Here we address these questions conceptually, analytically, and experimentally, using 53 machine-learning datasets. We show that LCR and VS can differ by orders of magnitude and can capture complementary information about a system, except in limiting cases. We demonstrate that both LCR and VS depend on how similarities are scaled and introduce the concept of “half distance” to parameterize this dependence. We prove that VS provides an upper bound on LCR for several values of the Rényi-Hill order parameter and conjecture that this bound holds for all values. We conclude that VS is preferable only when interpreting elements as linear combinations of a more fundamental set of “ur-elements” or when the system or dataset possesses a quantum-mechanical character. In the broader circumstance where one seeks simply to capture the rich information encoded by similarity, LCR is favored; nevertheless, for certain half-distances the two methods can complement each other.
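For reference, minimal implementations of the two quantities being compared: the order-1 LCR similarity-sensitive entropy and the Vendi score, in their standard forms (the small-eigenvalue guard is ad hoc).

```python
import numpy as np

def lcr_entropy(p, Z):
    """Leinster-Cobbold-Reeve similarity-sensitive entropy of order q = 1:
    H = -sum_i p_i * log((Z p)_i), with Z a similarity matrix (Z_ii = 1)
    and p the element frequencies. exp(H) is the corresponding diversity."""
    Zp = Z @ p
    return -np.sum(p * np.log(Zp))

def vendi_score(K):
    """Vendi score: exp of the Shannon entropy of the eigenvalues of K/n,
    for a PSD similarity matrix K with unit diagonal."""
    lam = np.linalg.eigvalsh(K / K.shape[0])
    lam = lam[lam > 1e-12]                 # drop numerical zeros
    return float(np.exp(-np.sum(lam * np.log(lam))))
```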

[LG-48] Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction

链接: https://arxiv.org/abs/2511.03836
作者: Lipeng Zu,Hansong Zhou,Xiaonan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer. However, the target updates in DQN often rely on next states generated by actions from a past, potentially suboptimal, policy. As a result, these states may not provide informative learning signals, introducing high variance into the update process. This issue is exacerbated when the sampled transitions are poorly aligned with the agent’s current policy. To address this limitation, we propose the Successor-state Aggregation Deep Q-Network (SADQ), which explicitly models environment dynamics using a stochastic transition model. SADQ integrates successor-state distributions into the Q-value estimation process, enabling more stable and policy-aligned value updates. Additionally, it explores a more efficient action selection strategy with the modeled transition structure. We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance. Our extensive empirical results across standard RL benchmarks and real-world vector-based control tasks demonstrate that SADQ consistently outperforms DQN variants in both stability and learning efficiency.

[LG-49] Higher-Order Causal Structure Learning with Additive Models

链接: https://arxiv.org/abs/2511.03831
作者: James Enouen,Yujia Zheng,Ignavier Ng,Yan Liu,Kun Zhang
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Causal structure learning has long been the central task of inferring causal insights from data. Despite the abundance of real-world processes exhibiting higher-order mechanisms, however, an explicit treatment of interactions in causal discovery has received little attention. In this work, we focus on extending the causal additive model (CAM) to additive models with higher-order interactions. This second level of modularity we introduce to the structure learning problem is most easily represented by a directed acyclic hypergraph which extends the DAG. We introduce the necessary definitions and theoretical tools to handle the novel structure we introduce and then provide identifiability results for the hyper DAG, extending the typical Markov equivalence classes. We next provide insights into why learning the more complex hypergraph structure may actually lead to better empirical results. In particular, more restrictive assumptions like CAM correspond to easier-to-learn hyper DAGs and better finite sample complexity. We finally develop an extension of the greedy CAM algorithm which can handle the more complex hyper DAG search space and demonstrate its empirical usefulness in synthetic experiments.

[LG-50] From Static to Dynamic: Enhancing Offline-to-Online Reinforcement Learning via Energy-Guided Diffusion Stratification

链接: https://arxiv.org/abs/2511.03828
作者: Lipeng Zu,Hansong Zhou,Xiaonan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transitioning from offline to online reinforcement learning (RL) poses critical challenges due to distributional shifts between the fixed behavior policy in the offline dataset and the evolving policy during online learning. Although this issue is widely recognized, few methods attempt to explicitly assess or utilize the distributional structure of the offline data itself, leaving a research gap in adapting learning strategies to different types of samples. To address this challenge, we propose an innovative method, Energy-Guided Diffusion Stratification (StratDiff), which facilitates smoother transitions in offline-to-online RL. StratDiff deploys a diffusion model to learn prior knowledge from the offline dataset. It then refines this knowledge through energy-based functions to improve policy imitation and generate offline-like actions during online fine-tuning. The KL divergence between the generated action and the corresponding sampled action is computed for each sample and used to stratify the training batch into offline-like and online-like subsets. Offline-like samples are updated using offline objectives, while online-like samples follow online learning strategies. We demonstrate the effectiveness of StratDiff by integrating it with off-the-shelf methods Cal-QL and IQL. Extensive empirical evaluations on D4RL benchmarks show that StratDiff significantly outperforms existing methods, achieving enhanced adaptability and more stable performance across diverse RL settings.
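A sketch of the stratification step: split each training batch by a per-sample divergence between the diffusion-generated action and the sampled action. The KL estimator itself is the paper's; the split logic and threshold here are illustrative placeholders.

```python
def stratify_batch(batch, divergence_fn, threshold):
    """Split a batch into offline-like and online-like subsets.

    `divergence_fn(sample)` returns the per-sample divergence (e.g., KL)
    between the generated and sampled actions; small values mean the
    sample looks offline-like and gets the offline update objective."""
    offline_like, online_like = [], []
    for sample in batch:
        d = divergence_fn(sample)
        (offline_like if d < threshold else online_like).append(sample)
    return offline_like, online_like
```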

[LG-51] Sketch-Augmented Features Improve Learning Long-Range Dependencies in Graph Neural Networks NEURIPS2025

链接: https://arxiv.org/abs/2511.03824
作者: Ryien Hosseini,Filippo Simini,Venkatram Vishwanath,Rebecca Willett,Henry Hoffmann
类目: Machine Learning (cs.LG)
*备注: To appear at NeurIPS 2025

点击查看摘要

Abstract:Graph Neural Networks learn on graph-structured data by iteratively aggregating local neighborhood information. While this local message passing paradigm imparts a powerful inductive bias and exploits graph sparsity, it also yields three key challenges: (i) oversquashing of long-range information, (ii) oversmoothing of node representations, and (iii) limited expressive power. In this work we inject randomized global embeddings of node features, which we term Sketched Random Features, into standard GNNs, enabling them to efficiently capture long-range dependencies. The embeddings are unique, distance-sensitive, and topology-agnostic – properties which we analytically and empirically show alleviate the aforementioned limitations when injected into GNNs. Experimental results on real-world graph learning tasks confirm that this strategy consistently improves performance over baseline GNNs, offering both a standalone solution and a complementary enhancement to existing techniques such as graph positional encodings. Our source code is available at this https URL.

[LG-52] One Size Does Not Fit All: Architecture-Aware Adaptive Batch Scheduling with DEBA

链接: https://arxiv.org/abs/2511.03809
作者: François Belias,Naser Ezzati-Jivan,Foutse Khomh
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 14 pages

点击查看摘要

Abstract:Adaptive batch size methods aim to accelerate neural network training, but existing approaches apply identical adaptation strategies across all architectures, assuming a one-size-fits-all solution. We introduce DEBA (Dynamic Efficient Batch Adaptation), an adaptive batch scheduler that monitors gradient variance, gradient norm variation and loss variation to guide batch size adaptations. Through systematic evaluation across six architectures (ResNet-18/50, DenseNet-121, EfficientNet-B0, MobileNet-V3, ViT-B16) on CIFAR-10 and CIFAR-100, with five random seeds per configuration, we demonstrate that the architecture fundamentally determines adaptation efficacy. Our findings reveal that: (1) lightweight and medium-depth architectures (MobileNet-V3, DenseNet-121, EfficientNet-B0) achieve a 45-62% training speedup with simultaneous accuracy improvements of 1-7%; (2) shallow residual networks (ResNet-18) show consistent gains of +2.4-4.0% in accuracy and 36-43% in speedup, while deep residual networks (ResNet-50) exhibit high variance and occasional degradation; (3) already-stable architectures (ViT-B16) show minimal speedup (6%) despite maintaining accuracy, indicating that adaptation benefits vary with baseline optimization characteristics. We introduce a baseline characterization framework using gradient stability metrics (stability score, gradient norm variation) that predicts which architectures will benefit from adaptive scheduling. Our ablation studies reveal critical design choices often overlooked in prior work: sliding window statistics (vs. full history) and sufficient cooldown periods (5+ epochs) between adaptations are essential for success. This work challenges the prevailing assumption that adaptive methods generalize across architectures and provides the first systematic evidence that batch size adaptation requires an architecture-aware design.
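A toy DEBA-flavored scheduler showing the two design choices the ablations highlight: sliding-window statistics and a cooldown between adaptations. The monitored signal (coefficient of variation of gradient norms) and thresholds are hypothetical simplifications, not the paper's exact rules.

```python
from collections import deque
import numpy as np

class AdaptiveBatchScheduler:
    """Grow the batch size when gradients look noisy, with a cooldown."""

    def __init__(self, batch_size=128, max_batch=1024, window=50, cooldown=5):
        self.batch_size, self.max_batch = batch_size, max_batch
        self.grad_norms = deque(maxlen=window)   # sliding window, not full history
        self.cooldown, self.last_change = cooldown, -cooldown

    def observe(self, grad_norm):
        self.grad_norms.append(grad_norm)

    def maybe_adapt(self, epoch):
        if (epoch - self.last_change < self.cooldown
                or len(self.grad_norms) < self.grad_norms.maxlen):
            return self.batch_size
        cv = np.std(self.grad_norms) / (np.mean(self.grad_norms) + 1e-8)
        if cv > 0.5 and self.batch_size < self.max_batch:  # noisy: enlarge batch
            self.batch_size = min(2 * self.batch_size, self.max_batch)
            self.last_change = epoch
        return self.batch_size
```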

[LG-53] Fair and Explainable Credit-Scoring under Concept Drift: Adaptive Explanation Frameworks for Evolving Populations

链接: https://arxiv.org/abs/2511.03807
作者: Shivogo John
类目: Machine Learning (cs.LG)
*备注: 18 pages, 14 figures

点击查看摘要

Abstract:Evolving borrower behaviors, shifting economic conditions, and changing regulatory landscapes continuously reshape the data distributions underlying modern credit-scoring systems. Conventional explainability techniques, such as SHAP, assume static data and fixed background distributions, making their explanations unstable and potentially unfair when concept drift occurs. This study addresses that challenge by developing adaptive explanation frameworks that recalibrate interpretability and fairness in dynamically evolving credit models. Using a multi-year credit dataset, we integrate predictive modeling via XGBoost with three adaptive SHAP variants: (A) per-slice explanation reweighting that adjusts for feature distribution shifts, (B) drift-aware SHAP rebaselining with sliding-window background samples, and (C) online surrogate calibration using incremental Ridge regression. Each method is benchmarked against static SHAP explanations using metrics of predictive performance (AUC, F1), directional and rank stability (cosine, Kendall tau), and fairness (demographic parity and recalibration). Results show that adaptive methods, particularly rebaselined and surrogate-based explanations, substantially improve temporal stability and reduce disparate impact across demographic groups without degrading predictive accuracy. Robustness tests, including counterfactual perturbations, background sensitivity analysis, and proxy-variable detection, confirm the resilience of adaptive explanations under real-world drift conditions. These findings establish adaptive explainability as a practical mechanism for sustaining transparency, accountability, and ethical reliability in data-driven credit systems, and more broadly, in any domain where decision models evolve with population change.

[LG-54] FusionDP: Foundation Model-Assisted Differentially Private Learning for Partially Sensitive Features

链接: https://arxiv.org/abs/2511.03806
作者: Linghui Zeng,Ruixuan Liu,Atiquer Rahman Sarkar,Xiaoqian Jiang,Joyce C. Ho,Li Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensuring the privacy of sensitive training data is crucial in privacy-preserving machine learning. However, in practical scenarios, privacy protection may be required for only a subset of features. For instance, in ICU data, demographic attributes like age and gender pose higher privacy risks due to their re-identification potential, whereas raw lab results are generally less sensitive. Traditional DP-SGD enforces privacy protection on all features in one sample, leading to excessive noise injection and significant utility degradation. We propose FusionDP, a two-step framework that enhances model utility under feature-level differential privacy. First, FusionDP leverages large foundation models to impute sensitive features given non-sensitive features, treating them as external priors that provide high-quality estimates of sensitive attributes without accessing the true values during model training. Second, we introduce a modified DP-SGD algorithm that trains models on both original and imputed features while formally preserving the privacy of the original sensitive features. We evaluate FusionDP on two modalities: a sepsis prediction task on tabular data from PhysioNet and a clinical note classification task from MIMIC-III. By comparing against privacy-preserving baselines, our results show that FusionDP significantly improves model performance while maintaining rigorous feature-level privacy, demonstrating the potential of foundation model-driven imputation to enhance the privacy-utility trade-off for various modalities.

[LG-55] Contamination Detection for VLMs using Multi-Modal Semantic Perturbation

链接: https://arxiv.org/abs/2511.03774
作者: Jaden Park,Mu Cai,Feng Yao,Jingbo Shang,Soochahn Lee,Yong Jae Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in Vision-Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior works have proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple realistic contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset will be released publicly.

[LG-56] A Dynamic Recurrent Adjacency Memory Network for Mixed-Generation Power System Stability Forecasting

链接: https://arxiv.org/abs/2511.03746
作者: Guang An Ooi,Otavio Bertozzi,Mohd Asim Aftab,Charalambos Konstantinou,Shehab Ahmed
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Power Systems

点击查看摘要

Abstract:Modern power systems with high penetration of inverter-based resources exhibit complex dynamic behaviors that challenge the scalability and generalizability of traditional stability assessment methods. This paper presents a dynamic recurrent adjacency memory network (DRAMN) that combines physics-informed analysis with deep learning for real-time power system stability forecasting. The framework employs sliding-window dynamic mode decomposition to construct time-varying, multi-layer adjacency matrices from phasor measurement unit and sensor data to capture system dynamics such as modal participation factors, coupling strengths, phase relationships, and spectral energy distributions. As opposed to processing spatial and temporal dependencies separately, DRAMN integrates graph convolution operations directly within recurrent gating mechanisms, enabling simultaneous modeling of evolving dynamics and temporal dependencies. Extensive validations on modified IEEE 9-bus, 39-bus, and a multi-terminal HVDC network demonstrate high performance, achieving 99.85%, 99.90%, and 99.69% average accuracies, respectively, surpassing all tested benchmarks, including classical machine learning algorithms and recent graph-based models. The framework identifies optimal combinations of measurements that reduce feature dimensionality by 82% without performance degradation. Correlation analysis between dominant measurements for small-signal and transient stability events validates generalizability across different stability phenomena. DRAMN achieves state-of-the-art accuracy while providing enhanced interpretability for power system operators, making it suitable for real-time deployment in modern control centers.

[LG-57] Dark Energy Survey Year 3 results: Simulation-based wCDM inference from weak lensing and galaxy clustering maps with deep learning. I. Analysis design

链接: https://arxiv.org/abs/2511.04681
作者: A. Thomsen,J. Bucko,T. Kacprzak,V. Ajani,J. Fluri,A. Refregier,D. Anbajagane,F. J. Castander,A. Ferté,M. Gatti,N. Jeffrey,A. Alarcon,A. Amon,K. Bechtol,M. R. Becker,G. M. Bernstein,A. Campos,A. Carnero Rosell,C. Chang,R. Chen,A. Choi,M. Crocce,C. Davis,J. DeRose,S. Dodelson,C. Doux,K. Eckert,J. Elvin-Poole,S. Everett,P. Fosalba,D. Gruen,I. Harrison,K. Herner,E. M. Huff,M. Jarvis,N. Kuropatkin,P.-F. Leget,N. MacCrann,J. McCullough,J. Myles,A. Navarro-Alsina,S. Pandey,A. Porredon,J. Prat,M. Raveri,M. Rodriguez-Monroy,R. P. Rollins,A. Roodman,E. S. Rykoff,C. Sánchez,L. F. Secco,E. Sheldon,T. Shin,M. A. Troxel,I. Tutusaus,T. N. Varga,N. Weaverdyck,R. H. Wechsler,B. Yanny,B. Yin,Y. Zhang,J. Zuntz,S. Allam,F. Andrade-Oliveira,D. Bacon,J. Blazek,D. Brooks,R. Camilleri,J. Carretero,R. Cawthon,L. N. da Costa,M. E. da Silva Pereira,T. M. Davis,J. De Vicente,S. Desai,P. Doel,J. García-Bellido,G. Gutierrez,S. R. Hinton,D. L. Hollowood,K. Honscheid,D. J. James,K. Kuehn,O. Lahav,S. Lee,J. L. Marshall,J. Mena-Fernández,F. Menanteau,R. Miquel,J. Muir,R. L. C. Ogando,A. A. Plazas Malagón,E. Sanchez,D. Sanchez Cid,I. Sevilla-Noarbe,M. Smith,E. Suchyta,M. E. C. Swanson,D. Thomas,C. To,D. L. Tucker(DES Collaboration)
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注: 38 pages, 14 figures, submitted

点击查看摘要

Abstract:Data-driven approaches using deep learning are emerging as powerful techniques to extract non-Gaussian information from cosmological large-scale structure. This work presents the first simulation-based inference (SBI) pipeline that combines weak lensing and galaxy clustering maps in a realistic Dark Energy Survey Year 3 (DES Y3) configuration and serves as preparation for a forthcoming analysis of the survey data. We develop a scalable forward model based on the CosmoGridV1 suite of N-body simulations to generate over one million self-consistent mock realizations of DES Y3 at the map level. Leveraging this large dataset, we train deep graph convolutional neural networks on the full survey footprint in spherical geometry to learn low-dimensional features that approximately maximize mutual information with target parameters. These learned compressions enable neural density estimation of the implicit likelihood via normalizing flows in a ten-dimensional parameter space spanning cosmological wCDM, intrinsic alignment, and linear galaxy bias parameters, while marginalizing over baryonic, photometric redshift, and shear bias nuisances. To ensure robustness, we extensively validate our inference pipeline using synthetic observations derived from both systematic contaminations in our forward model and independent Buzzard galaxy catalogs. Our forecasts yield significant improvements in cosmological parameter constraints, achieving 2-3\times higher figures of merit in the \Omega_m-S_8 plane relative to our implementation of baseline two-point statistics and effectively breaking parameter degeneracies through probe combination. These results demonstrate the potential of SBI analyses powered by deep learning for upcoming Stage-IV wide-field imaging surveys.

[LG-58] ODE approximation for the Adam algorithm: General and overparametrized setting

链接: https://arxiv.org/abs/2511.04622
作者: Steffen Dereich,Arnulf Jentzen,Sebastian Kassing
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:The Adam optimizer is currently presumably the most popular optimization method in deep learning. In this article we develop an ODE based method to study the Adam optimizer in a fast-slow scaling regime. For fixed momentum parameters and vanishing step-sizes, we show that the Adam algorithm is an asymptotic pseudo-trajectory of the flow of a particular vector field, which is referred to as the Adam vector field. Leveraging properties of asymptotic pseudo-trajectories, we establish convergence results for the Adam algorithm. In particular, in a very general setting we show that if the Adam algorithm converges, then the limit must be a zero of the Adam vector field, rather than a local minimizer or critical point of the objective function. In contrast, in the overparametrized empirical risk minimization setting, the Adam algorithm is able to locally find the set of minima. Specifically, we show that in a neighborhood of the global minima, the objective function serves as a Lyapunov function for the flow induced by the Adam vector field. As a consequence, if the Adam algorithm enters a neighborhood of the global minima infinitely often, it converges to the set of global minima.
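For orientation, here is the standard Adam recursion whose fast-slow, vanishing-step-size limit is studied (bias correction omitted; the paper's precise scaling conventions and the resulting Adam vector field are given there):

```latex
m_{n+1} = \beta_1 m_n + (1 - \beta_1)\, g_n, \qquad
v_{n+1} = \beta_2 v_n + (1 - \beta_2)\, g_n^{\odot 2}, \qquad
\theta_{n+1} = \theta_n - \gamma_n \, \frac{m_{n+1}}{\sqrt{v_{n+1}} + \varepsilon},
```

where g_n is the stochastic gradient at \theta_n and g_n^{\odot 2} denotes the elementwise square; the moment recursions are the "fast" variables and the parameters \theta_n the "slow" ones in the scaling regime described above.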

[LG-59] Dynamic causal discovery in Alzheimers disease through latent pseudotime modelling NEURIPS2025

链接: https://arxiv.org/abs/2511.04619
作者: Natalia Glazman,Jyoti Mangal,Pedro Borges,Sebastien Ourselin,M. Jorge Cardoso
类目: Applications (stat.AP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: Accepted to the NeurIPS 2025 Workshop on CauScien: Uncovering Causality in Science

点击查看摘要

Abstract:The application of causal discovery to diseases like Alzheimer’s (AD) is limited by the static graph assumptions of most methods; such models cannot account for an evolving pathophysiology, modulated by a latent disease pseudotime. We propose to apply an existing latent variable model to real-world AD data, inferring a pseudotime that orders patients along a data-driven disease trajectory independent of chronological age, then learning how causal relationships evolve. Pseudotime outperformed age in predicting diagnosis (AUC 0.82 vs 0.59). Incorporating minimal, disease-agnostic background knowledge substantially improved graph accuracy and orientation. Our framework reveals dynamic interactions between novel (NfL, GFAP) and established AD markers, enabling practical causal discovery despite violated assumptions.

[LG-60] Physics-Informed Neural Networks and Neural Operators for Parametric PDEs: A Human-AI Collaborative Analysis

链接: https://arxiv.org/abs/2511.04576
作者: Zhuo Zhang,Xiong Xiong,Sen Zhang,Yuan Zhao,Xi Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 61 pages, 3 figures. Submitted to The 1st International Conference on AI Scientists (ICAIS 2025)

点击查看摘要

Abstract:PDEs arise ubiquitously in science and engineering, where solutions depend on parameters (physical properties, boundary conditions, geometry). Traditional numerical methods require re-solving the PDE for each parameter, making parameter space exploration prohibitively expensive. Recent machine learning advances, particularly physics-informed neural networks (PINNs) and neural operators, have revolutionized parametric PDE solving by learning solution operators that generalize across parameter spaces. We critically analyze two main paradigms: (1) PINNs, which embed physical laws as soft constraints and excel at inverse problems with sparse data, and (2) neural operators (e.g., DeepONet, Fourier Neural Operator), which learn mappings between infinite-dimensional function spaces and achieve unprecedented generalization. Through comparisons across fluid dynamics, solid mechanics, heat transfer, and electromagnetics, we show neural operators can be 10^3 to 10^5 times faster than traditional solvers for multi-query scenarios, while maintaining comparable accuracy. We provide practical guidance for method selection, discuss theoretical foundations (universal approximation, convergence), and identify critical open challenges: high-dimensional parameters, complex geometries, and out-of-distribution generalization. This work establishes a unified framework for understanding parametric PDE solvers via operator learning, offering a comprehensive, incrementally updated resource for this rapidly evolving field.
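A minimal PINN loss in the soft-constraint style described above, for a toy 1D Poisson-type equation -nu * u''(x) = f(x); the PDE, source term, and weighting are illustrative assumptions, not from the paper.

```python
import torch

def pinn_loss(model, x_interior, x_boundary, u_boundary, nu=0.1):
    """PDE residual on interior points plus a boundary data term."""
    x = x_interior.requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    f = torch.sin(x)                               # assumed source term
    residual = (-nu * d2u - f).pow(2).mean()       # physics as a soft constraint
    data = (model(x_boundary) - u_boundary).pow(2).mean()
    return residual + data
```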

[LG-61] Riesz Regression As Direct Density Ratio Estimation

链接: https://arxiv.org/abs/2511.04568
作者: Masahiro Kato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Riesz regression has garnered attention as a tool in debiased machine learning for causal and structural parameter estimation (Chernozhukov et al., 2021). This study shows that Riesz regression is closely related to direct density-ratio estimation (DRE) in important cases, including average treatment effect (ATE) estimation. Specifically, the idea and objective in Riesz regression coincide with those in least-squares importance fitting (LSIF, Kanamori et al., 2009) in direct density-ratio estimation. While Riesz regression is general in the sense that it can be applied to Riesz representer estimation in a wide class of problems, the equivalence with DRE allows us to directly import existing results in specific cases, including convergence-rate analyses, the selection of loss functions via Bregman-divergence minimization, and regularization techniques for flexible models, such as neural networks. Conversely, insights about the Riesz representer in debiased machine learning broaden the applications of direct density-ratio estimation methods. This paper consolidates our prior results in Kato (2025a) and Kato (2025b).
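To see the LSIF connection concretely, here is the ridge-regularized (uLSIF-style) closed form for a linear-in-features density-ratio model r(x) = phi(x) @ alpha, minimizing (1/2) E_q[r^2] - E_p[r]; the feature map and regularization strength are placeholders.

```python
import numpy as np

def lsif_weights(phi_p, phi_q, lam=1e-3):
    """Closed-form uLSIF estimate of r(x) = p(x)/q(x) in a linear basis.

    phi_p, phi_q: basis features evaluated on samples from p and q,
    shapes (n_p, d) and (n_q, d)."""
    H = phi_q.T @ phi_q / len(phi_q)   # E_q[phi phi^T]
    h = phi_p.mean(axis=0)             # E_p[phi]
    alpha = np.linalg.solve(H + lam * np.eye(H.shape[0]), h)
    return alpha                       # r_hat(x) = phi(x) @ alpha
```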

[LG-62] Machine Learning for Electron-Scale Turbulence Modeling in W7-X

链接: https://arxiv.org/abs/2511.04567
作者: Ionut-Gabriel Farcas,Don Lawrence Carl Agapito Fernando,Alejandro Banon Navarro,Gabriele Merlo,Frank Jenko
类目: Plasma Physics (physics.plasm-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 13 pages, 7 tables, 11 figures

点击查看摘要

Abstract:Constructing reduced models for turbulent transport is essential for accelerating profile predictions and enabling many-query tasks such as uncertainty quantification, parameter scans, and design optimization. This paper presents machine-learning-driven reduced models for Electron Temperature Gradient (ETG) turbulence in the Wendelstein 7-X (W7-X) stellarator. Each model predicts the ETG heat flux as a function of three plasma parameters: the normalized electron temperature radial gradient (\omega_{T_e}), the ratio of normalized electron temperature and density radial gradients (\eta_e), and the electron-to-ion temperature ratio (\tau). We first construct models across seven radial locations using regression and an active machine-learning-based procedure. This process initializes models using low-cardinality sparse-grid training data and then iteratively refines their training sets by selecting the most informative points from a pre-existing simulation database. We evaluate the prediction capabilities of our models using out-of-sample datasets with over 393 points per location, and 95% prediction intervals are estimated via bootstrapping to assess prediction uncertainty. We then investigate the construction of generalized reduced models, including a generic, position-independent model, and assess their heat flux prediction capabilities at three additional locations. Our models demonstrate robust performance and predictive accuracy comparable to the original reference simulations, even when applied beyond the training domain.

[LG-63] Uncertainties in Physics-informed Inverse Problems: The Hidden Risk in Scientific AI

链接: https://arxiv.org/abs/2511.04564
作者: Yoh-ichi Mototake,Makoto Sasaki
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Physics-informed machine learning (PIML) integrates partial differential equations (PDEs) into machine learning models to solve inverse problems, such as estimating coefficient functions (e.g., the Hamiltonian function) that characterize physical systems. This framework enables data-driven understanding and prediction of complex physical phenomena. While coefficient functions in PIML are typically estimated on the basis of predictive performance, physics as a discipline does not rely solely on prediction accuracy to evaluate models. For example, Kepler’s heliocentric model was favored owing to small discrepancies in planetary motion, despite its similar predictive accuracy to the geocentric model. This highlights the inherent uncertainties in data-driven model inference and the scientific importance of selecting physically meaningful solutions. In this paper, we propose a framework to quantify and analyze such uncertainties in the estimation of coefficient functions in PIML. We apply our framework to a reduced model of magnetohydrodynamics; the analysis reveals genuine uncertainties in the estimated coefficient functions and shows that unique identification is possible with geometric constraints. Finally, we confirm that incorporating these constraints yields a unique estimate of the reduced model.

[LG-64] Unified Generative Latent Representation for Functional Brain Graphs NEURIPS2025

链接: https://arxiv.org/abs/2511.04539
作者: Subati Abulikemu,Tiago Azevedo,Michail Mamalakis,John Suckling
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations

点击查看摘要

Abstract:Functional brain graphs are often characterized with separate graph-theoretic or spectral descriptors, overlooking how these properties covary and partially overlap across brains and conditions. We anticipate that dense, weighted functional connectivity graphs occupy a low-dimensional latent geometry along which both topological and spectral structures display graded variations. Here, we estimated this unified graph representation and enabled generation of dense functional brain graphs through a graph transformer autoencoder with latent diffusion, with spectral geometry providing an inductive bias to guide learning. This geometry-aware latent representation, although unsupervised, meaningfully separated working-memory states and decoded visual stimuli, with performance further enhanced by incorporating neural dynamics. From the diffusion modeled distribution, we were able to sample biologically plausible and structurally grounded synthetic dense graphs.

[LG-65] Online Bayesian Experimental Design for Partially Observed Dynamical Systems

链接: https://arxiv.org/abs/2511.04403
作者: Sara Pérez-Vieites,Sahel Iqbal,Simo Särkkä,Dominik Baumann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:Bayesian experimental design (BED) provides a principled framework for optimizing data collection, but existing approaches do not apply to crucial real-world settings such as dynamical systems with partial observability, where only noisy and incomplete observations are available. These systems are naturally modeled as state-space models (SSMs), where latent states mediate the link between parameters and data, making the likelihood – and thus information-theoretic objectives like the expected information gain (EIG) – intractable. In addition, the dynamical nature of the system requires online algorithms that update posterior distributions and select designs sequentially in a computationally efficient manner. We address these challenges by deriving new estimators of the EIG and its gradient that explicitly marginalize latent states, enabling scalable stochastic optimization in nonlinear SSMs. Our approach leverages nested particle filters (NPFs) for efficient online inference with convergence guarantees. Applications to realistic models, such as the susceptible-infected-recovered (SIR) and a moving source location task, show that our framework successfully handles both partial observability and online computation.

[LG-66] Causal Regime Detection in Energy Markets With Augmented Time Series Structural Causal Models

链接: https://arxiv.org/abs/2511.04361
作者: Dennis Thumm
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Other Statistics (stat.OT)
*备注: EurIPS 2025 Workshop Causality for Impact: Practical challenges for real-world applications of causal methods

点击查看摘要

Abstract:Energy markets exhibit complex causal relationships between weather patterns, generation technologies, and price formation, with regime changes occurring continuously rather than at discrete break points. Current approaches model electricity prices without explicit causal interpretation or counterfactual reasoning capabilities. We introduce Augmented Time Series Causal Models (ATSCM) for energy markets, extending counterfactual reasoning frameworks to multivariate temporal data with learned causal structure. Our approach models energy systems through interpretable factors (weather, generation mix, demand patterns), rich grid dynamics, and observable market variables. We integrate neural causal discovery to learn time-varying causal graphs without requiring ground truth DAGs. Applied to real-world electricity price data, ATSCM enables novel counterfactual queries such as “What would prices be under different renewable generation scenarios?”.

[LG-67] Robustness of Minimum-Volume Nonnegative Matrix Factorization under an Expanded Sufficiently Scattered Condition

Link: https://arxiv.org/abs/2511.04291
Authors: Giovanni Barbarino, Nicolas Gillis, Subhayan Saha
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA)
Comments: 38 pages, 4 figures

Abstract:Minimum-volume nonnegative matrix factorization (min-vol NMF) has been used successfully in many applications, such as hyperspectral imaging, chemical kinetics, spectroscopy, topic modeling, and audio source separation. However, its robustness to noise has been a long-standing open problem. In this paper, we prove that min-vol NMF identifies the ground-truth factors in the presence of noise under a condition referred to as the expanded sufficiently scattered condition, which requires the data points to be sufficiently well scattered in the latent simplex generated by the basis vectors.
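For context, one widely used formulation of min-vol NMF augments the fitting term with a log-det volume penalty (a standard form from the literature; the exact variant and constraints analyzed in the paper may differ):

$$
\min_{W \ge 0,\; H \ge 0} \;\; \|X - WH\|_F^2 \;+\; \lambda\,\log\det\!\left(W^\top W + \delta I\right),
$$

where the second term shrinks the volume of the simplex spanned by the columns of W, and the sufficiently scattered condition on the data is what makes the ground-truth factors identifiable.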

[LG-68] Online Conformal Inference with Retrospective Adjustment for Faster Adaptation to Distribution Shift

Link: https://arxiv.org/abs/2511.04275
Authors: Jungbin Jun, Ilsang Ohn
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Conformal prediction has emerged as a powerful framework for constructing distribution-free prediction sets with guaranteed coverage under the exchangeability assumption alone. However, this assumption is often violated in online environments where data distributions evolve over time. Several recent approaches have been proposed to address this limitation, but they typically adapt slowly to distribution shifts because they update predictions only in a forward manner: they generate a prediction for a newly observed data point while previously computed predictions are left unchanged. In this paper, we propose a novel online conformal inference method with retrospective adjustment, which is designed to achieve faster adaptation to distribution shifts. Our method leverages regression approaches with efficient leave-one-out update formulas to retroactively adjust past predictions when new data arrive, thereby aligning the entire set of predictions with the most recent data distribution. Through extensive numerical studies on both synthetic and real-world data sets, we show that the proposed approach achieves faster coverage recalibration and improved statistical efficiency compared to existing online conformal prediction methods.
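As one concrete instance of the leave-one-out machinery such retrospective updates rely on, here is a minimal sketch for ridge regression using closed-form LOO residuals and a Sherman-Morrison rank-one update (an illustration of the general idea, not the paper's exact formulas):

```python
import numpy as np

def ridge_loo_residuals(X, y, lam):
    """Leave-one-out residuals for ridge regression in closed form.

    Uses the hat matrix H = X (X^T X + lam I)^{-1} X^T:
    the LOO residual at point i is (y_i - yhat_i) / (1 - H_ii).
    """
    n, d = X.shape
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    H = X @ A_inv @ X.T
    resid = y - H @ y
    return resid / (1.0 - np.diag(H))

def rank_one_update(A_inv, x_new):
    """Sherman-Morrison update of (X^T X + lam I)^{-1} when a new row x_new
    arrives, so past scores can be adjusted without refitting from scratch."""
    Ax = A_inv @ x_new
    return A_inv - np.outer(Ax, Ax) / (1.0 + x_new @ Ax)

# Toy usage: absolute LOO residuals could serve as conformity scores.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
scores = np.abs(ridge_loo_residuals(X, y, lam=1.0))
print(np.quantile(scores, 0.9))  # illustrative conformal threshold
```

The rank-one identity is what makes retroactive adjustment of all past predictions affordable after each new observation.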

[LG-69] Twirlator: A Pipeline for Analyzing Subgroup Symmetry Effects in Quantum Machine Learning Ansatzes

Link: https://arxiv.org/abs/2511.04243
Authors: Valter Uotila, Väinö Mehtola, Ilmo Salmenperä, Bo Zhao
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 8 pages; 8 figures

Abstract:Leveraging data symmetries has been a key driver of performance gains in geometric deep learning and in geometric and equivariant quantum machine learning. While symmetrization appears to be a promising method, its practical overhead, such as additional gates and reduced expressibility, is not well understood in quantum machine learning. In this work, we develop an automated pipeline to measure various characteristics of quantum machine learning ansatzes with respect to symmetries that can appear in the learning task. We define the degree of symmetry in the learning problem as the size of the subgroup it admits. Subgroups define partial symmetries, which have received little attention in previous research focused on symmetries defined by whole groups. Symmetrizing 19 common ansatzes with respect to these varying-sized subgroup representations, we compute three classes of metrics that describe how the common ansatz structures behave under varying amounts of symmetry. The first metric is based on the norm of the difference between the original and symmetrized generators; the second counts depth, size, and other characteristics of the symmetrized circuits; and the third class comprises expressibility and entangling capability. The results demonstrate varying gate overhead across the studied ansatzes and confirm that increased symmetry reduces the expressibility of the circuits, while in most cases it increases their entangling capability. These results help select sufficiently expressive and computationally efficient ansatz patterns for geometric quantum machine learning applications.
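To make the first metric class concrete: averaging a generator over a subgroup representation is a twirl, and the metric measures how far the original generator is from its twirled version (a minimal numpy sketch under the textbook definition of twirling; the conventions here are assumptions, not the pipeline's exact code):

```python
import numpy as np

def twirl(G, subgroup_unitaries):
    """Symmetrize a Hermitian generator G over a subgroup representation:
    G_sym = (1/|S|) * sum_g U_g G U_g^dagger."""
    return sum(U @ G @ U.conj().T for U in subgroup_unitaries) / len(subgroup_unitaries)

def symmetry_deviation(G, subgroup_unitaries):
    """Operator norm of the difference between the original and
    symmetrized generators (the first metric class)."""
    return np.linalg.norm(G - twirl(G, subgroup_unitaries), ord=2)

# Toy example: Z2 swap symmetry on two qubits.
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
I4 = np.eye(4, dtype=complex)
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
G = (A + A.conj().T) / 2  # random Hermitian generator
print(symmetry_deviation(G, [I4, SWAP]))
```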

[LG-70] Robust inference using density-powered Stein operators

Link: https://arxiv.org/abs/2511.03963
Authors: Shinto Eguchi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We introduce a density-power weighted variant of the Stein operator, called the \gamma -Stein operator. This is a novel class of operators derived from the \gamma -divergence, designed to build robust inference methods for unnormalized probability models. The operator’s construction (weighting by the model density raised to a positive power \gamma ) inherently down-weights the influence of outliers, providing a principled mechanism for robustness. Applying this operator yields a robust generalization of score matching that retains the crucial property of being independent of the model’s normalizing constant. We extend this framework to develop two key applications: the \gamma -kernelized Stein discrepancy for robust goodness-of-fit testing, and \gamma -Stein variational gradient descent for robust Bayesian posterior approximation. Empirical results on contaminated Gaussian and quartic potential models show our methods significantly outperform standard baselines in both robustness and statistical efficiency.
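Reading the construction literally, a plausible form of the operator for a vector-valued test function f is the Langevin-Stein operator reweighted by the model density to the power \gamma (our reconstruction from the description above, not a formula quoted from the paper):

$$
\mathcal{A}_\gamma[f](x) \;=\; p(x)^{\gamma}\Big( \nabla_x \log p(x)^{\top} f(x) \;+\; \nabla_x \cdot f(x) \Big).
$$

At \gamma = 0 this reduces to the standard Stein operator; for an unnormalized model, the score \nabla \log p is computable exactly, and the normalizing constant enters p^{\gamma} only through a global multiplicative factor, which leaves the induced estimators unchanged.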

[LG-71] High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes

Link: https://arxiv.org/abs/2511.03952
Authors: Aukosh Jagannath, Taj Jones-McCormick, Varnan Sarangian
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum and widen the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligned with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.
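For readers who want the two update rules side by side, here is a minimal sketch of heavy-ball momentum and normalized-gradient SGD (textbook parameterizations with placeholder step-sizes, not the paper's high-dimensional scaling):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, beta=0.9):
    """Polyak (heavy-ball) momentum: v <- beta*v + grad; w <- w - lr*v."""
    v = beta * v + grad
    return w - lr * v, v

def sgd_normalized_step(w, grad, lr, eps=1e-12):
    """Online SGD with an adaptive step-size based on normalized gradients."""
    return w - lr * grad / (np.linalg.norm(grad) + eps)

# Toy comparison on the quadratic loss f(w) = 0.5 * ||w||^2 with noisy gradients.
rng = np.random.default_rng(0)
w_m, v = np.ones(10), np.zeros(10)
w_n = np.ones(10)
for _ in range(200):
    g_m = w_m + 0.1 * rng.normal(size=10)  # stochastic gradient at w_m
    g_n = w_n + 0.1 * rng.normal(size=10)  # stochastic gradient at w_n
    w_m, v = sgd_momentum_step(w_m, v, g_m, lr=0.01)
    w_n = sgd_normalized_step(w_n, g_n, lr=0.01)
print(np.linalg.norm(w_m), np.linalg.norm(w_n))
```

Note that with beta = 0.9 the momentum iterates behave like plain SGD with an effective step-size roughly lr/(1-beta), which is the kind of time rescaling the scaling-limit comparison formalizes.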

[LG-72] A general technique for approximating high-dimensional empirical kernel matrices

Link: https://arxiv.org/abs/2511.03892
Authors: Chiraag Kaushik, Justin Romberg, Vidya Muthukumar
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 32 pages

Abstract:We present simple, user-friendly bounds for the expected operator norm of a random kernel matrix under general conditions on the kernel function k(\cdot,\cdot) . Our approach uses decoupling results for U-statistics and the non-commutative Khintchine inequality to obtain upper and lower bounds depending only on scalar statistics of the kernel function and a “correlation kernel” matrix corresponding to k(\cdot,\cdot) . We then apply our method to provide new, tighter approximations for inner-product kernel matrices on general high-dimensional data, where the sample size and data dimension are polynomially related. Our method obtains simplified proofs of existing results that rely on the moment method and combinatorial arguments while also providing novel approximation results for the case of anisotropic Gaussian data. Finally, using similar techniques to our approximation result, we show a tighter lower bound on the bias of kernel regression with anisotropic Gaussian data.

[LG-73] Learning Paths for Dynamic Measure Transport: A Control Perspective NEURIPS2025

Link: https://arxiv.org/abs/2511.03797
Authors: Aimee Maurais, Bamdad Hosseini, Youssef Marzouk
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: To appear at NeurIPS 2025 Workshop on Frontiers of Probabilistic Inference: Sampling Meets Learning

Abstract:We bring a control perspective to the problem of identifying paths of measures for sampling via dynamic measure transport (DMT). We highlight the fact that commonly used paths may be poor choices for DMT and connect existing methods for learning alternate paths to mean-field games. Based on these connections we pose a flexible family of optimization problems for identifying tilted paths of measures for DMT and advocate for the use of objective terms which encourage smoothness of the corresponding velocities. We present a numerical algorithm for solving these problems based on recent Gaussian process methods for solution of partial differential equations and demonstrate the ability of our method to recover more efficient and smooth transport models compared to those which use an untilted reference path.

[LG-74] Deep Learning-Driven Downscaling for Climate Risk Assessment of Projected Temperature Extremes in the Nordic Region

Link: https://arxiv.org/abs/2511.03770
Authors: Parthiban Loganathan, Elias Zea, Ricardo Vinuesa, Evelyn Otero
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)

Abstract:Rapid changes and increasing climatic variability across the widely varied Köppen-Geiger regions of northern Europe generate significant needs for adaptation, and regional planning requires high-resolution projected temperatures. This work presents an integrative downscaling framework that incorporates Vision Transformer (ViT), Convolutional Long Short-Term Memory (ConvLSTM), and Geospatial Spatiotemporal Transformer with Attention and Imbalance-Aware Network (GeoStaNet) models. The framework is evaluated with a multicriteria decision system, Deep Learning-TOPSIS (DL-TOPSIS), at ten strategically chosen meteorological stations encompassing the temperate oceanic (Cfb), subpolar oceanic (Cfc), warm-summer continental (Dfb), and subarctic (Dfc) climate regions. Norwegian Earth System Model (NorESM2-LM) Coupled Model Intercomparison Project Phase 6 (CMIP6) outputs were bias-corrected over the 1951-2014 period and subsequently validated against earlier observations of day-to-day temperature metrics and diurnal range statistics. The ViT showed improved performance (Root Mean Squared Error (RMSE): 1.01 degrees C; R^2: 0.92), allowing for production of credible downscaled projections. Under the SSP5-8.5 scenario, the Dfc and Dfb climate zones are projected to warm by 4.8 degrees C and 3.9 degrees C, respectively, by 2100, with the diurnal temperature range expanding by more than 1.5 degrees C. The Time of Emergence signal first appears in subarctic winter seasons (Dfc: approximately 2032), signifying an urgent need for adaptation measures. The presented framework offers station-based, high-resolution estimates of uncertainties and extremes, with direct uses for adaptation policy over high-latitude regions undergoing rapid environmental change.

[LG-75] Bifidelity Karhunen-Loève Expansion Surrogate with Active Learning for Random Fields

Link: https://arxiv.org/abs/2511.03756
Authors: Aniket Jivani, Cosmin Safta, Beckett Y. Zhou, Xun Huan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn); Applications (stat.AP)

Abstract:We present a bifidelity Karhunen-Loève expansion (KLE) surrogate model for field-valued quantities of interest (QoIs) under uncertain inputs. The approach combines the spectral efficiency of the KLE with polynomial chaos expansions (PCEs) to preserve an explicit mapping between input uncertainties and output fields. By coupling inexpensive low-fidelity (LF) simulations that capture dominant response trends with a limited number of high-fidelity (HF) simulations that correct for systematic bias, the proposed method enables accurate and computationally affordable surrogate construction. To further improve surrogate accuracy, we develop an active learning strategy that adaptively selects new HF evaluations based on the surrogate’s generalization error, estimated via cross-validation and modeled using Gaussian process regression. New HF samples are then acquired by maximizing an expected improvement criterion, targeting regions of high surrogate error. The resulting BF-KLE-AL framework is demonstrated on three examples of increasing complexity: a one-dimensional analytical benchmark, a two-dimensional convection-diffusion system, and a three-dimensional turbulent round jet simulation based on Reynolds-averaged Navier–Stokes (RANS) and enhanced delayed detached-eddy simulations (EDDES). Across these cases, the method achieves consistent improvements in predictive accuracy and sample efficiency relative to single-fidelity and random-sampling approaches.
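The acquisition step follows the standard expected-improvement pattern; a generic sketch is below (the GP interface in the usage comment is a hypothetical placeholder, and the tie to the paper's cross-validated error model is our assumption):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximizing a GP-modeled quantity (here: estimated surrogate error).

    mu, sigma: GP posterior mean/std at candidate inputs.
    best: incumbent maximum of the modeled quantity.
    """
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive std
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the next high-fidelity run where the error model promises the most gain:
# mu, sigma = gp.predict(candidates, return_std=True)   # hypothetical GP interface
# next_idx = int(np.argmax(expected_improvement(mu, sigma, best=observed_errors.max())))
```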

[LG-76] Friction on Demand: A Generative Framework for the Inverse Design of Metainterfaces

Link: https://arxiv.org/abs/2511.03735
Authors: Valentin Mouton, Adrien Mélot
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Computational Physics (physics.comp-ph)
Comments: Preprint

Abstract:Designing frictional interfaces to exhibit prescribed macroscopic behavior is a challenging inverse problem, made difficult by the non-uniqueness of solutions and the computational cost of contact simulations. Traditional approaches rely on heuristic search over low-dimensional parameterizations, which limits their applicability to more complex or nonlinear friction laws. We introduce a generative modeling framework using Variational Autoencoders (VAEs) to infer surface topographies from target friction laws. Trained on a synthetic dataset composed of 200 million samples constructed from a parameterized contact mechanics model, the proposed method enables efficient, simulation-free generation of candidate topographies. We examine the potential and limitations of generative modeling for this inverse design task, focusing on balancing accuracy, throughput, and diversity in the generated solutions. Our results highlight trade-offs and outline practical considerations when balancing these objectives. This approach paves the way for near-real-time control of frictional behavior through tailored surface topographies.

Information Retrieval

[IR-0] Coordination-Free Lane Partitioning for Convergent ANN Search

Link: https://arxiv.org/abs/2511.04221
Authors: Carl Kugblenu, Petri Vuorimaa
Subjects: Information Retrieval (cs.IR); Databases (cs.DB)
Comments: 10 pages, 6 figures; arXiv preprint

Abstract:Production vector search systems often fan out each query across parallel lanes (threads, replicas, or shards) to meet latency service-level objectives (SLOs). In practice, these lanes rediscover the same candidates, so extra compute does not increase coverage. We present a coordination-free lane partitioner that turns duplication into complementary work at the same cost and deadline. For each query we (1) build a deterministic candidate pool sized to the total top-k budget, (2) apply a per-query pseudorandom permutation, and (3) assign each lane a disjoint slice of positions. Lanes then return different results by construction, with no runtime coordination. At equal cost with four lanes (total candidate budget 64), on SIFT1M (1M SIFT feature vectors) with Hierarchical Navigable Small World graphs (HNSW) recall@10 rises from 0.249 to 0.999 while lane overlap falls from nearly 100% to 0%. On MS MARCO (8.8M passages) with HNSW, hit@10 improves from 0.200 to 0.601 and Mean Reciprocal Rank at 10 (MRR@10) from 0.133 to 0.330. For inverted file (IVF) indexes we see smaller but consistent gains (for example, +11% on MS MARCO) by de-duplicating list routing. A microbenchmark shows planner overhead of ~37 microseconds per query (mean at the main setting) with linear growth in the number of merged candidates. These results yield a simple operational guideline: size the per-query pool to the total budget, deterministically partition positions across lanes, and turn redundant fan-out into complementary coverage without changing budget or deadline.
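The three-step construction maps almost line-for-line into code; a minimal sketch follows (the pool contents and the hash choice are illustrative assumptions, not the paper's implementation):

```python
import hashlib
import random

def lane_slice(query_id, pool, lane, num_lanes):
    """Coordination-free lane partitioning:
    1) deterministic candidate pool sized to the total top-k budget,
    2) per-query pseudorandom permutation (same seed on every lane),
    3) disjoint slice of positions per lane."""
    seed = int.from_bytes(hashlib.sha256(query_id.encode()).digest()[:8], "big")
    perm = list(pool)
    random.Random(seed).shuffle(perm)  # identical permutation on all lanes
    return perm[lane::num_lanes]       # disjoint slices by construction

# Each lane runs this independently with its own lane index; no runtime coordination.
pool = [f"cand{i}" for i in range(64)]  # total budget = 64
lanes = [lane_slice("q42", pool, lane, 4) for lane in range(4)]
assert len(set().union(*map(set, lanes))) == 64  # full coverage, zero overlap
```

Because every lane derives the same permutation from the query alone, the slices partition the pool exactly, which is what turns redundant fan-out into complementary coverage.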

[IR-1] E-CARE: An Efficient LLM -based Commonsense-Augmented Framework for E-Commerce

Link: https://arxiv.org/abs/2511.04087
Authors: Ge Zhang, Rohan Deepak Ajwani, Tony Zheng, Hongjian Gu, Yaochen Hu, Wei Guo, Mark Coates, Yingxue Zhang
Subjects: Information Retrieval (cs.IR)

Abstract:Finding relevant products given a user query plays a pivotal role in an e-commerce platform, as it can spark shopping behaviors and result in revenue gains. The challenge lies in accurately predicting the correlation between queries and products. Recently, mining the cross-features between queries and products based on the commonsense reasoning capacity of Large Language Models (LLMs) has shown promising performance. However, such methods suffer from high costs due to intensive real-time LLM inference during serving, as well as human annotations and potential Supervised Fine Tuning (SFT). To boost efficiency while leveraging the commonsense reasoning capacity of LLMs for various e-commerce tasks, we propose the Efficient Commonsense-Augmented Recommendation Enhancer (E-CARE). During inference, models augmented with E-CARE can access commonsense reasoning with only a single LLM forward pass per query by utilizing a commonsense reasoning factor graph that encodes most of the reasoning schema from powerful LLMs. The experiments on 2 downstream tasks show an improvement of up to 12.1% on precision@5.

[IR-2] Publication Trend in DESIDOC Journal of Library and Information Technology during 2013-2017: A Scientometric Approach

Link: https://arxiv.org/abs/2511.04082
Authors: M Sadik Batcha, S Roselin Jahina, Muneer Ahmad
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 7 pages, 3 figures, Research Article

Abstract:DESIDOC Journal of Library & Information Technology (DJLIT), formerly known as DESIDOC Bulletin of Information Technology, is a peer-reviewed, open-access, bimonthly journal. This paper presents a scientometric analysis of the journal, examining the growth of the research output published in it, authorship patterns, author productivity, and the subjects covered by the papers over the period 2013-2017. It is found that 227 papers were published during the period of study. The majority of articles were collaborative in nature, and the dominant subject focus of the journal is scientometrics. Most articles (65%) range between 6 and 10 pages in length. The study applied standard formulas and statistical tools to derive the results.

[IR-3] Caption Injection for Optimization in Generative Search Engine

Link: https://arxiv.org/abs/2511.04080
Authors: Xiaolu Chen, Yong Liao
Subjects: Information Retrieval (cs.IR)

Abstract:Generative Search Engines (GSEs) leverage Retrieval-Augmented Generation (RAG) techniques and Large Language Models (LLMs) to integrate multi-source information and provide users with accurate and comprehensive responses. Unlike traditional search engines that present results in ranked lists, GSEs shift users’ attention from sequential browsing to content-driven subjective perception, driving a paradigm shift in information retrieval. In this context, enhancing the subjective visibility of content through Generative Search Engine Optimization (G-SEO) methods has emerged as a new research focus. With the rapid advancement of Multimodal Retrieval-Augmented Generation (MRAG) techniques, GSEs can now efficiently integrate text, images, audio, and video, producing richer responses that better satisfy complex information needs. Existing G-SEO methods, however, remain limited to text-based optimization and fail to fully exploit multimodal data. To address this gap, we propose Caption Injection, the first multimodal G-SEO approach, which extracts captions from images and injects them into textual content, integrating visual semantics to enhance the subjective visibility of content in generative search scenarios. We systematically evaluate Caption Injection on MRAMG, a benchmark for MRAG, under both unimodal and multimodal settings. Experimental results show that Caption Injection significantly outperforms text-only G-SEO baselines under the G-Eval metric, demonstrating the necessity and effectiveness of multimodal integration in G-SEO to improve user-perceived content visibility.
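In outline, the method reduces to describing each image and splicing the descriptions into the text the generative engine consumes; a schematic sketch is below (the captioning model and the injection format are hypothetical placeholders, not the paper's implementation):

```python
def inject_captions(doc_text, images, caption_model):
    """Schematic caption injection: describe each image and splice the
    descriptions into the textual content seen by the generative engine."""
    captions = [caption_model(img) for img in images]  # hypothetical captioner
    return doc_text + "\n" + "\n".join(f"[Image: {c}]" for c in captions)
```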

[IR-4] Two Decades of Research at the University of Lagos (2004-2023): A Scientometric Analysis of Productivity, Collaboration, and Impact

Link: https://arxiv.org/abs/2511.04075
Authors: Muneer Ahmad, Samuel Ibor Ubi
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 19 pages, 3 figures, Research Article

Abstract:This paper presents a scientometric analysis of research output from the University of Lagos, focusing on the two decades spanning 2004 to 2023. Using bibliometric data retrieved from the Web of Science, we examine trends in publication volume, collaboration patterns, citation impact, and the most prolific authors, departments, and research domains at the university. The study reveals a consistent increase in research productivity, with the highest publication output recorded in 2023. Health Sciences, Engineering, and Social Sciences are identified as dominant fields, reflecting the university’s interdisciplinary research strengths. Collaborative efforts, both locally and internationally, show a positive correlation with higher citation impact, with the United States and the United Kingdom being the leading international collaborators. Notably, open-access publications account for a significant portion of the university’s research output, enhancing visibility and citation rates. The findings offer valuable insights into the university’s research performance over the past two decades, providing a foundation for strategic planning and policy formulation to foster research excellence and global impact.

Attachments

Click to download today's full paper list