This blog post presents the latest paper list retrieved from Arxiv.org on 2025-12-19. It updates automatically and is organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Note: Daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.
Overview (2025-12-19)
A total of 556 papers were updated today, including:
- Natural Language Processing: 58 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 198 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 131 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 155 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
【Quick Read】: This paper addresses the problem that large language models (LLMs) remain limited in mathematical reasoning by process errors such as calculation mistakes, brittle logic, and superficially plausible but invalid steps. The key to its solution is the Generative Adversarial Reasoner framework, which jointly trains an LLM reasoner and an LLM-based discriminator via adversarial reinforcement learning, enabling step-by-step evaluation and refinement of each logically complete slice of the reasoning chain. A compute-efficient review schedule partitions each reasoning chain into logical units of comparable length, and the discriminator provides structured, concise soundness judgments; the reasoner and discriminator receive dense, well-calibrated step-level reward signals derived from logical consistency and error detection, respectively, improving credit assignment, sample efficiency, and overall reasoning quality.
Link: https://arxiv.org/abs/2512.16917
Authors: Qihao Liu,Luoxin Ye,Wufei Ma,Yu-Cheng Chou,Alan Yuille
Affiliations: Johns Hopkins University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice’s soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
[NLP-1] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
【Quick Read】: This paper addresses the inefficiency of capability improvement in large language models (LLMs) via global fine-tuning, which is wasteful and may degrade existing abilities. The key to its solution is a novel method called Constructive Circuit Amplification, which identifies pivotal tokens in reasoning traces and the model components responsible for the target task (i.e., sparse subnetworks, or circuits), and updates only these sparse, critical components, thereby precisely enhancing a specific capability. Experiments show that the method improves mathematical reasoning accuracy by up to 11.4% while modifying only about 1.59% of model components, with minimal impact on other task abilities, validating an efficient and controllable path to capability enhancement via circuit-level interventions.
Link: https://arxiv.org/abs/2512.16914
Authors: Nikhil Prakash,Donghao Ren,Dominik Moritz,Yannick Assogba
Affiliations: Northeastern University; Apple
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 3 figures
Abstract:Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.
[NLP-2] Exploration vs. Exploitation: Rethinking RLVR through Clipping Entropy and Spurious Reward
【Quick Read】: This paper addresses the poorly understood exploration-exploitation trade-off in the reinforcement learning with verifiable rewards (RLVR) framework, in particular how to explain a seemingly paradoxical phenomenon: spurious rewards (which suppress exploitation) and entropy minimization (which suppresses exploration) can both improve the reasoning performance of large language models (LLMs). The key insight is that, under spurious rewards, clipping bias reduces policy entropy, pushing the model toward more confident and deterministic outputs; the paper further proposes a reward-misalignment model that explains why spurious rewards can still yield gains beyond contaminated settings, providing theoretical grounding and practical guidance for more effective RLVR training.
Link: https://arxiv.org/abs/2512.16912
Authors: Peter Chen,Xiaopeng Li,Ziniu Li,Wotao Yin,Xi Chen,Tianyi Lin
Affiliations: Columbia; CUHK SZ; DAMO, Alibaba US; NYU Stern
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 35 pages
Abstract:This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
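For readers unfamiliar with the objects under discussion, the clipped surrogate objective (the source of the clipping bias studied here) and the policy entropy can be written in standard PPO-style notation as follows; this is textbook background, not the paper's exact objective:

```latex
L^{\mathrm{clip}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
\mathcal{H}(\pi_\theta) = -\,\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\log \pi_\theta(a \mid s)\right]
```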
[NLP-3] How Good is Post-Hoc Watermarking With Language Model Rephrasing?
【Quick Read】: This paper investigates how to watermark existing text post hoc, without changing its content, so that LLM-generated text remains traceable, in order to protect copyrighted documents or detect their use in training data or retrieval-augmented generation (RAG). The core of the solution is to have an LLM rewrite existing text while applying generation-time watermarking, which, free of the constraints of how LLMs are served, improves the trade-off between detectability and semantic fidelity. Key findings include: allocating compute via larger rephrasing models, beam search, multi-candidate generation, and entropy filtering at detection improves performance; the Gumbel-max scheme outperforms more recent methods under nucleus sampling, and beam search significantly benefits most schemes; counterintuitively, on verifiable text such as code, smaller models outperform larger ones, an important observation for the practical deployment of post-hoc watermarking.
链接: https://arxiv.org/abs/2512.16904
作者: Pierre Fernandez,Tom Sander,Hady Elsahar,Hongyan Chang,Tomáš Souček,Valeriu Lacatusu,Tuan Tran,Sylvestre-Alvise Rebuffi,Alexandre Mourachko
机构: Meta(Meta)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Code at this https URL
Abstract:Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore post-hoc watermarking where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents, or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.
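As background on the Gumbel-max scheme the abstract highlights, here is a minimal sketch of seeded Gumbel-max watermarked sampling and its detector. The key, the 4-token context window, and the function names are illustrative assumptions, not the paper's implementation:

```python
import hashlib
import numpy as np

KEY = b"secret-key"  # hypothetical watermark key

def seeded_uniforms(context_ids, vocab_size):
    """Derive per-token uniforms from a hash of the key and recent context."""
    seed_material = KEY + repr(tuple(context_ids[-4:])).encode()
    digest = hashlib.sha256(seed_material).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.random(vocab_size)

def watermarked_sample(probs, context_ids):
    """Gumbel-max / exponential-minimum sampling: argmax_i u_i^(1/p_i).

    Marginally this samples from `probs`, but the choice is deterministic
    given the key and context, which is what makes detection possible."""
    u = seeded_uniforms(context_ids, len(probs))
    return int(np.argmax(np.log(u) / np.maximum(probs, 1e-12)))

def detection_score(token_ids, vocab_size):
    """Mean of -log(1 - u) over emitted tokens: about 1.0 for unwatermarked
    text, noticeably higher when the watermark is present."""
    scores = []
    for t in range(4, len(token_ids)):  # skip the first context window
        u = seeded_uniforms(token_ids[:t], vocab_size)
        scores.append(-np.log(1.0 - u[token_ids[t]]))
    return float(np.mean(scores))
```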
[NLP-4] In-Context Algebra
【Quick Read】: This paper studies how transformers achieve symbolic reasoning over sequences of variable tokens, particularly when token meanings are not fixed but must be determined from context. Prior work found that when tokens have fixed numeric meanings, transformers learn geometric embeddings that mirror algebraic structure, but such findings do not transfer to settings where token semantics vary across sequences. The key contribution is a new task design in which the assignment of symbols to algebraic group elements is randomized from one sequence to another, forcing the model to learn context-based symbolic reasoning. Through causal tests on controlled data distributions, the authors isolate three mechanisms that models consistently learn: commutative copying, identity element recognition, and closure-based cancellation, showing that models can develop effective symbolic reasoning without fixed token semantics, complementary to the previously observed geometric representations.
Link: https://arxiv.org/abs/2512.16902
Authors: Eric Todd,Jannik Brinkmann,Rohit Gandikota,David Bau
Affiliations: Northeastern University; TU Clausthal
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 28 pages, 18 figures. Code and data at this https URL
Abstract:We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
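To make the task setup concrete, here is a small sketch of how sequences with per-sequence symbol-to-element assignments could be generated; a cyclic group is used for illustration, and the exact data format is an assumption, not the paper's:

```python
import random

SYMBOLS = list("abcdefgh")  # variable tokens with no fixed meaning

def make_sequence(num_facts=8, order=4, rng=random):
    """Emit facts over Z_order with a fresh symbol assignment per sequence."""
    elements = list(range(order))
    symbols = rng.sample(SYMBOLS, order)     # resampled for every sequence
    name = dict(zip(elements, symbols))
    facts = []
    for _ in range(num_facts):
        x, y = rng.choice(elements), rng.choice(elements)
        facts.append(f"{name[x]} * {name[y]} = {name[(x + y) % order]}")
    return " ; ".join(facts)

print(make_sequence())  # e.g. "c * f = a ; ..." with meanings fixed only in-context
```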
[NLP-5] Impacts of Racial Bias in Historical Training Data for News AI
【Quick Read】: This paper investigates how large language models (LLMs) used in journalism can embed and amplify historical biases, in particular how stereotypes implicit in training data affect content classification and generation. The key to the approach is applying explainable AI (XAI) methods to dissect model outputs, revealing the unintended semantics encoded by the "blacks" label in the training corpus: it operates partially as a generic "racism detector" across some minoritized groups, yet performs poorly on modern events such as COVID-19-era anti-Asian hate stories and Black Lives Matter coverage. This case study highlights the risk of reproducing historical biases when deploying AI tools in newsroom workflows, and underscores the need to combine technical transparency with ethical review for responsible AI adoption.
Link: https://arxiv.org/abs/2512.16901
Authors: Rahul Bhargava,Malene Hornstrup Jespersen,Emily Boardman Ndulue,Vivica Dsouza
Affiliations: Northeastern University; University of Copenhagen; Media Ecosystems Analysis Group
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning “blacks” thematic topic label. Through quantitative and qualitative means we investigate this label’s use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the “blacks” label operates partially as a general “racism detector” across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model: story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.
[NLP-6] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
【Quick Read】: This paper addresses the lack of systematic evaluation tools for reward models (RMs) on multimodal understanding and interleaved generation tasks, especially for omni models that handle interleaved image-text sequences. Existing work focuses mostly on text-only settings, leaving RM evaluation on cross-modal reasoning, image editing, and mixed image-text generation unexplored. The key contribution is Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark, whose core innovations include: (1) practical yet challenging multimodal prompts; (2) high-quality responses collected from 23 state-of-the-art models and agents; and (3) preference pairs with strong expert consensus ensured via an ensemble filtering strategy. The benchmark covers four key subtasks (text-to-image, image editing, interleaved generation, and multimodal reasoning) and provides 1,000 expert-annotated preference pairs per task, establishing a standardized, reproducible framework for training and evaluating reward models for multimodal LLMs.
Link: https://arxiv.org/abs/2512.16899
Authors: Yushi Hu,Reyhane Askari-Hemmat,Melissa Hall,Emily Dinan,Luke Zettlemoyer,Marjan Ghazvininejad
Affiliations: Meta
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and data available at this https URL
Abstract:Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning (“thinking-with-images”), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to 90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
[NLP-7] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning
【Quick Read】: This paper tackles the lack of knowledge-boundary awareness in LLM-based search agents, i.e., how to adaptively balance the model's internal parametric knowledge against external search, avoiding the cost and risk of overusing search while preventing hallucination from relying on parametric knowledge alone. Prior work curbs search overuse by shaping rewards around the number of tool calls, but such penalties require heavy reward engineering, blur credit assignment, and can be strategically gamed. The key to the proposed AdaSearch is a simple two-stage, outcome-driven reinforcement learning framework that disentangles problem solving from the decision of whether to invoke search, making that decision explicit and interpretable, thereby improving awareness of when external search is actually needed and enabling more efficient, transparent decisions.
Link: https://arxiv.org/abs/2512.16883
Authors: Tzu-Han Lin,Wei-Lin Chen,Chen-An Li,Hung-yi Lee,Yun-Nung Chen,Yu Meng
Affiliations: National Taiwan University; Department of Computer Science, University of Virginia
Subjects: Computation and Language (cs.CL)
Comments: Preprint. Code and artifacts will be uploaded to this https URL
Abstract:Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
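The F1-based decision metric can be pictured as scoring the agent's invoke/skip decisions against ground-truth necessity labels; a minimal sketch under that reading (the paper's exact definition may differ):

```python
def search_decision_f1(should_search, did_search):
    """F1 over binary search decisions, treating 'search needed' as positive."""
    pairs = list(zip(should_search, did_search))
    tp = sum(1 for s, d in pairs if s and d)
    fp = sum(1 for s, d in pairs if not s and d)
    fn = sum(1 for s, d in pairs if s and not d)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# An agent that searches on every query gets perfect recall but poor
# precision whenever its parametric knowledge would have sufficed.
print(search_decision_f1([True, False, False, True], [True, True, True, True]))
```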
[NLP-8] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference
【Quick Read】: This paper addresses the high inference latency of Transformer-based language models in real deployments, particularly the efficiency bottleneck of autoregressive decoding. Existing caching mechanisms (such as token-level key-value caches) speed up some scenarios but are limited in scope and hard to extend across layers or architectures. The key to the solution is LLMCache, a layer-wise caching framework that reuses intermediate activations via semantic-similarity matching, yielding a general caching strategy that works across encoder and decoder architectures and at arbitrary transformer layers; a lightweight fingerprint-matching mechanism and adaptive eviction strategies control cache staleness, achieving up to 3.1x inference speedup while keeping accuracy degradation under 0.5%.
Link: https://arxiv.org/abs/2512.16843
Authors: Harsh Vardhan Bansal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted and presented at 13th IEEE International Conference on Intelligent Systems and Embedded Design (ISED-2025)
Abstract:Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1x speedup in inference time with 0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
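A minimal sketch of the semantic-fingerprint cache idea, assuming an embedding function and cosine-similarity matching; LLMCache's actual fingerprinting and staleness-aware eviction are more sophisticated, so treat the names and thresholds as assumptions:

```python
import numpy as np

class LayerActivationCache:
    """Reuse cached layer activations for semantically similar inputs."""

    def __init__(self, embed_fn, threshold=0.95, max_entries=10_000):
        self.embed_fn = embed_fn        # maps text -> 1-D np.ndarray
        self.threshold = threshold      # cosine-similarity match cutoff
        self.max_entries = max_entries
        self.keys, self.values = [], []

    def _fingerprint(self, text):
        v = self.embed_fn(text)
        return v / np.linalg.norm(v)

    def lookup(self, text):
        q = self._fingerprint(text)
        for key, activations in zip(self.keys, self.values):
            if float(q @ key) >= self.threshold:
                return activations      # cache hit: skip recomputing the layer
        return None

    def store(self, text, activations):
        if len(self.keys) >= self.max_entries:
            self.keys.pop(0)            # FIFO stand-in for adaptive eviction
            self.values.pop(0)
        self.keys.append(self._fingerprint(text))
        self.values.append(activations)
```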
[NLP-9] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels
【Quick Read】: This paper asks how much information speech prosody conveys on its own and what that information is about, i.e., distinguishing information expressed by prosody but absent from the text. The key to the solution is an information-theoretic approach that uses large speech and language models to estimate the mutual information between a particular dimension of meaning (e.g., sarcasm, emotion, or questionhood) and a communication channel (e.g., audio or text), enabling quantitative analysis of each channel's contribution on television and podcast corpora. The results show that, absent long-range context, the audio channel (and by implication prosody) conveys over an order of magnitude more information about sarcasm and emotion than the text channel, while the gain for questionhood is smaller.
Link: https://arxiv.org/abs/2512.16832
Authors: Aditya Yadavalli,Tiago Pimentel,Tamar I Regev,Ethan Wilcox,Alex Warstadt
Affiliations: UC San Diego; ETH Zürich; MIT; Georgetown University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Prosody – the melody of speech – conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance’s meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel – and by implication the prosodic channel – transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
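The channel-wise mutual information can be estimated as a difference of cross-entropies between a label-marginal predictor and a channel-conditioned predictor; a sketch of that estimator under this assumption (the paper's exact estimator may differ):

```python
import numpy as np

def cross_entropy(p_true):
    """Mean negative log-probability assigned to the true labels (in nats)."""
    return float(-np.mean(np.log(np.clip(p_true, 1e-12, None))))

def channel_mutual_information(p_prior_true, p_channel_true):
    """I(Y; C) ~= H(Y) - H(Y | C).

    p_prior_true:   probabilities a label-only model gives the true label
    p_channel_true: probabilities a channel-conditioned model (audio or
                    text input) gives the true label
    """
    return cross_entropy(p_prior_true) - cross_entropy(p_channel_true)

# Comparing I(Y; audio) with I(Y; text) for Y in {sarcasm, emotion, question}
# quantifies how much extra information prosody carries beyond the words.
```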
[NLP-10] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs
【Quick Read】: This paper addresses key challenges in natural language (NL) to temporal logic (TL) translation, including accurate lifting of atomic propositions (APs), handling co-reference, and performance bottlenecks under limited data. Existing methods typically decompose the task into an atomic-proposition lifting phase and a translation phase, relying on a language model to iteratively predict tokens from the full vocabulary, which limits both tractability and accuracy. The key innovation of the proposed Grammar Forced Translation (GraFT) framework is to exploit problem-specific structure to restrict the set of valid output tokens at each step, drastically shrinking the solution space: only a handful of legal tokens are allowed per prediction step instead of the full vocabulary. This grammar-constrained reduction improves the handling of co-reference and semantic consistency and comes with a theoretical justification for more efficient learning. Experiments on the CW, GLTL, and Navi benchmarks show that GraFT improves end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average over state-of-the-art methods.
Link: https://arxiv.org/abs/2512.16814
Authors: William English,Dominic Simon,Sumit Kumar Jha,Rickard Ewetz
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
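The core trick, restricting each decoding step to a handful of grammar-valid tokens, can be sketched as a logits mask; `grammar.valid_next` stands in for whatever grammar machinery GraFT actually uses, so treat the names as hypothetical:

```python
import numpy as np

def grammar_forced_step(logits, valid_next_ids):
    """Pick the best token among only the grammar-valid candidates."""
    mask = np.full(logits.shape, -np.inf)
    mask[valid_next_ids] = 0.0          # every other token is forbidden
    return int(np.argmax(logits + mask))

# e.g. after emitting "G (", an LTL grammar might only permit an atomic
# proposition or a unary operator, so valid_next_ids has just a few entries:
# next_id = grammar_forced_step(logits, grammar.valid_next(prefix))
```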
[NLP-11] Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology
【Quick Read】: This paper examines how multi-modal retrieval-augmented generation (MM-RAG) should handle visual information in biomedical question answering, specifically when figures and tables should be converted to text for retrieval versus when OCR-free visual retrieval should be used. The key to the solution is a systematic comparison of modality-handling strategies: using a benchmark of 120 multiple-choice questions (MCQs) and four augmentation strategies (none, text RAG, multimodal conversion, and late-interaction visual retrieval) implemented with Docling parsing and Qdrant indexing, the study finds that for mid-size models (e.g., Gemma-3-27B-IT) converting figures and tables to text outperforms OCR-free visual retrieval, whereas for frontier models (e.g., the GPT-4o and GPT-5 families) OCR-free visual retrieval (e.g., ColPali) is competitive with text augmentation and, under GPT-5, further improves to about 82.8% accuracy. Model capacity thus determines the best pipeline: converting visuals to text reduces the reading burden and suits mid-size models, while OCR-free visual retrieval becomes competitive with strong generators.
Link: https://arxiv.org/abs/2512.16802
Authors: Primož Kocbek,Azra Frkatović-Hodžić,Dora Lalić,Vivian Hui,Gordan Lauc,Gregor Štiglic
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Will be published in IEEE BigData 2025 proceedings. Contains 10 pages, 1 figure, 5 tables
Abstract:Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval, ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
[NLP-12] From Facts to Conclusions: Integrating Deductive Reasoning in Retrieval-Augmented LLMs
【Quick Read】: This paper addresses the failure of retrieval-augmented generation (RAG) systems when retrieved sources conflict or contain outdated or subjective information, which undermines the reliability and consistency of generated answers. Existing approaches tackle these issues separately and lack unified reasoning supervision. The key to the solution is a reasoning-trace-augmented RAG framework with three structured, interpretable stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline further uses an LLM-as-a-Judge to evaluate groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment, yielding substantial gains: on Qwen, supervised fine-tuning improves end-to-end answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
Link: https://arxiv.org/abs/2512.16795
Authors: Shubham Mishra,Samyek Jain,Gorang Mehrishi,Shiv Tiwari,Harsh Sharma,Pratik Narang,Dhruv Kumar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: Under Review
Abstract:Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work address these issues independently but lack unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages : (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
[NLP-13] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation
【Quick Read】: This paper addresses semantic misalignment in natural language (NL) to temporal logic (TL) translation caused by inaccurate grounding of atomic propositions. Existing methods either assume access to exact atom grounding or ground poorly, producing TL expressions that are syntactically correct but semantically nonequivalent to the target. The key innovation of the proposed GinSign framework is to decompose what was a free-form generation task into structured classification: first predicting predicate labels, then selecting appropriately typed constant arguments, thereby mapping atomic propositions precisely onto the system signature. This enables accurate, efficient grounding with small masked language models, avoiding reliance on expensive large language models (LLMs), and achieves 95.5% grounded logical equivalence across multiple domains, a 1.4x improvement over the state of the art.
Link: https://arxiv.org/abs/2512.16770
Authors: William English,Chase Walker,Dominic Simon,Rickard Ewetz
Affiliations: University of Florida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications, an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically, first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of 95.5%, a 1.4x improvement over SOTA.
[NLP-14] DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
【Quick Read】: This paper addresses the poor scalability, reliability, and semantic richness of data preparation for large language models (LLMs): current practice relies on ad-hoc scripts and loosely specified workflows that lack system-level abstractions, hindering reproducibility and model-in-the-loop data generation. The key innovations of the proposed DataFlow framework include: (1) modular, reusable, and composable data transformations built on system-level abstractions; (2) a PyTorch-style pipeline construction API that supports debugging and optimization; (3) nearly 200 reusable operators and six domain-general pipelines (text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction); and (4) DataFlow-Agent, which automatically turns natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. The framework consistently improves downstream LLM performance, surpassing curated human datasets and specialized synthetic baselines across several benchmarks and demonstrating its value for reliable, reproducible, and scalable data preparation.
Link: https://arxiv.org/abs/2512.16676
Authors: Hao Liang,Xiaochen Ma,Zhou Liu,Zhen Hao Wong,Zhengyang Zhao,Zimo Meng,Runming He,Chengyu Shen,Qifeng Cai,Zhaoyang Han,Meiyi Qiang,Yalin Feng,Tianyi Bai,Zewei Pan,Ziyi Guo,Yizhen Jiang,Jingwen Deng,Qijie You,Peichao Lai,Tianyu Guo,Chi Hsu Tsai,Hengyi Feng,Rui Hu,Wenkai Yu,Junbo Niu,Bohan Zeng,Ruichuan An,Lu Ma,Jihao Huang,Yaowei Zheng,Conghui He,Linpeng Tang,Bin Cui,Weinan E,Wentao Zhang
Affiliations: Peking University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
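To illustrate what a "PyTorch-style" pipeline construction API can look like, here is a toy sketch with composable operators; the operator and class names are invented for illustration and are not DataFlow's actual API:

```python
class Op:
    """Base class for a data-transformation operator."""
    def __call__(self, records):
        raise NotImplementedError

class Dedup(Op):
    def __call__(self, records):
        seen, out = set(), []
        for r in records:
            if r["text"] not in seen:
                seen.add(r["text"])
                out.append(r)
        return out

class MinLength(Op):
    def __init__(self, n):
        self.n = n
    def __call__(self, records):
        return [r for r in records if len(r["text"]) >= self.n]

class Pipeline:
    def __init__(self, *ops):
        self.ops = list(ops)
    def __call__(self, records):
        for op in self.ops:             # runs stage by stage, easy to debug
            records = op(records)
        return records

pipeline = Pipeline(Dedup(), MinLength(32))
clean = pipeline([{"text": "short"}, {"text": "a" * 40}, {"text": "a" * 40}])
```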
[NLP-15] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
【Quick Read】: This paper questions the growing complexity of reinforcement learning (RL) training for large language models, asking whether multi-stage pipelines, dynamic hyperparameter schedules, and curriculum strategies are actually necessary. The key to its solution is JustRL, a minimal approach with single-stage training and fixed hyperparameters that achieves strong performance without elaborate tuning (54.9% and 64.3% average accuracy on two 1.5B-parameter reasoning models) at half the compute of more sophisticated approaches. The method also exhibits smooth, monotonic training trajectories that improve without intervention, suggesting that much of the added complexity addresses problems that disappear with a stable, scaled-up baseline.
Link: https://arxiv.org/abs/2512.16649
Authors: Bingxiang He,Zekai Qu,Zeyuan Liu,Yinghao Chen,Yuxin Zuo,Cheng Qian,Kaiyan Zhang,Weize Chen,Chaojun Xiao,Ganqu Cui,Ning Ding,Zhiyuan Liu
Affiliations: Tsinghua University; University of Illinois Urbana-Champaign; Shanghai AI Lab
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures
Abstract:Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2x less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
[NLP-16] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
【Quick Read】: This paper addresses over-refusal in large language models (LLMs) on politically sensitive topics, where models automatically decline highly sensitive questions, limiting usability and controllability. The key to the solution is Refusal Steering, an inference-time method: an LLM-as-a-judge quantifies refusal confidence, and a ridge-regularized procedure computes activation steering vectors that precisely isolate the refusal-compliance direction. Without retraining, the method removes refusal behavior on politically sensitive topics while maintaining safety alignment on harmful content, and generalizes across model scales (4B and 80B parameters).
Link: https://arxiv.org/abs/2512.16602
Authors: Iker García-Ferrero,David Montero,Roman Orus
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
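A minimal sketch of computing a ridge-regularized refusal direction from cached activations and applying it at inference; the matrix names, layer choice, and scaling factor are assumptions rather than the paper's exact recipe:

```python
import numpy as np

def ridge_steering_vector(H_refuse, H_comply, lam=10.0):
    """Ridge-regularized linear direction separating refusal from compliance.

    H_refuse, H_comply: (n, d) hidden states at a chosen layer, collected
    from prompts the judge scored as refused vs. complied."""
    X = np.vstack([H_refuse, H_comply])
    y = np.concatenate([np.ones(len(H_refuse)), -np.ones(len(H_comply))])
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w / np.linalg.norm(w)

def steer(hidden, direction, alpha=-4.0):
    # Negative alpha pushes activations away from the refusal direction
    # (suppressing refusals); positive alpha would induce refusals instead.
    return hidden + alpha * direction
```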
[NLP-17] Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild
【Quick Read】: This paper addresses the weakness of current large language models (LLMs) on fuzzy exploratory search, where queries are semantically vague, multifaceted, and lack a single factual answer, making it hard for existing retrieval systems to surface the most relevant webpage. The key to the solution is Needle in the Web, a new benchmark of 663 exploratory questions spanning seven domains, with query difficulty controlled via a flexible generation method based on factual claims in web content, emphasizing retrieval and reasoning over real-world webpages. The benchmark exposes the performance bottlenecks of leading LLMs and agentic search systems under semantic ambiguity, advancing research on the open problem of fuzzy retrieval.
Link: https://arxiv.org/abs/2512.16553
Authors: Yumeng Wang,Tianyu Fan,Lingrui Xu,Chao Huang
Affiliations: Tsinghua University; The University of Hong Kong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Data and code are available at this https URL
Abstract:Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.
[NLP-18] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification
【Quick Read】: This paper addresses sentence- and document-level simplification of scientific text to improve readability and accessibility. The key to the solution is a comparison of two approaches: a no-context method based on prompt engineering, and fine-tuned (FT) models. Experiments show that OpenAI's gpt-4.1-mini with the no-context strategy performs robustly at both granularities, while fine-tuned models show mixed results, highlighting the complexity of simplifying text at different granularities; notably, gpt-4.1-nano-ft stands out for document-level simplification in one case.
Link: https://arxiv.org/abs/2512.16541
Authors: Primoz Kocbek,Gregor Stiglic
Affiliations: University of Maribor, Faculty of Health Science; University of Ljubljana, Faculty of Medicine; University of Edinburgh, Usher Institute
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 3 tables. CLEF 2025 Working Notes, 9 to 12 September 2025, Madrid, Spain
Abstract:This work describes our submission to the CLEF 2025 SimpleText track Task 1, addressing both sentence- and document-level simplification of scientific texts. The methodology centered on using the gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models from OpenAI. Two distinct approaches were compared: a no-context method relying on prompt engineering and a fine-tuned (FT) method across models. The gpt-4.1-mini model with no-context demonstrated robust performance at both levels of simplification, while the fine-tuned models showed mixed results, highlighting the complexities of simplifying text at different granularities, where gpt-4.1-nano-ft performance stands out at document-level simplification in one case.
[NLP-19] Plain language adaptations of biomedical text using LLMs: Comparison of evaluation metrics
【Quick Read】: This paper tackles the complexity of biomedical text with the goal of improving public health literacy. The central challenge is simplifying highly specialized biomedical literature so that lay audiences can understand it. The key to the solution is using large language models (LLMs) for text simplification via three approaches: a prompt-template baseline, a two-AI-agent approach, and fine-tuning, with GPT-4o and GPT-4o-mini as baseline models. Results show gpt-4o-mini performs best across quantitative metrics (Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, G-Eval) and qualitative 5-point Likert ratings, while fine-tuning underperforms, suggesting that pretrained LLMs with well-designed prompts beat fine-tuning for this task.
Link: https://arxiv.org/abs/2512.16530
Authors: Primoz Kocbek,Leon Kopitar,Gregor Stiglic
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 1 figure
Abstract:This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two-AI-agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o-mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, and G-Eval, as well as with a qualitative metric, namely 5-point Likert scales for simplicity, accuracy, completeness, and brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of FT approaches. G-Eval, an LLM-based quantitative metric, showed promising results, ranking the approaches similarly to the qualitative metric.
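For the surface readability metrics mentioned above, off-the-shelf tooling exists; a small sketch using the textstat package (the paper does not state which implementation it used, so this is an assumption):

```python
import textstat

adapted = ("We tested whether computer programs can rewrite medical "
           "research summaries so they are easier to read.")

print(textstat.flesch_kincaid_grade(adapted))  # approximate U.S. grade level
print(textstat.smog_index(adapted))            # SMOG readability index
# SARI and BERTScore additionally need the source text and reference
# simplifications, e.g. via the Hugging Face `evaluate` library.
```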
[NLP-20] Topic Modelling Black Box Optimization
【Quick Read】: This paper addresses the choice of the number of topics T in Latent Dirichlet Allocation (LDA), a key design decision affecting both the statistical fit and interpretability of topic models. Traditional practice relies on manual experience or grid search, which is inefficient and struggles to find a good T under a limited compute budget. The paper casts the selection of T as a discrete black-box optimization problem, where each evaluation trains an LDA model and measures validation perplexity. The key to the solution is introducing two learned, amortized optimizers, PABBO (Preferential Amortized Black-Box Optimization) and SABBO (Sharpness-Aware Black-Box Optimization), which, compared with classical Genetic Algorithms (GA) and Evolution Strategies (ES), converge to near-optimal solutions with far fewer function evaluations (i.e., model trainings), substantially improving sample and time efficiency.
Link: https://arxiv.org/abs/2512.16445
Authors: Roman Akramov,Artem Khamatullin,Svetlana Glazyrina,Maksim Kryzhanovskiy,Roman Ischenko
Affiliations: Lomonosov Moscow State University; Institute for Artificial Intelligence, Lomonosov Moscow State University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Choosing the number of topics T in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of T as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods, Genetic Algorithm (GA) and Evolution Strategy (ES), and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
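Each black-box evaluation amounts to training an LDA model at a candidate T and scoring held-out perplexity; a brute-force baseline over that objective might look like this with gensim (corpus preparation is assumed, and the budget and range are illustrative):

```python
from gensim.models import LdaModel

def validation_perplexity(T, train_corpus, val_corpus, dictionary):
    """One black-box evaluation: train LDA with T topics, score perplexity."""
    lda = LdaModel(train_corpus, num_topics=T, id2word=dictionary,
                   passes=5, random_state=0)
    # log_perplexity returns a per-word likelihood bound; gensim's own
    # convention is perplexity = 2 ** (-bound)
    return 2 ** (-lda.log_perplexity(val_corpus))

# Exhaustive search stands in for GA/ES/PABBO/SABBO, which aim to find a
# comparable T with far fewer of these expensive evaluations:
# best_T = min(range(5, 205, 10),
#              key=lambda T: validation_perplexity(T, train, val, dictionary))
```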
[NLP-21] From Essence to Defense: Adaptive Semantic-aware Watermarking for Embedding-as-a-Service Copyright Protection
【Quick Read】: This paper addresses imitation attacks against Embeddings-as-a-Service (EaaS) in commercial settings, where existing watermarking techniques ignore the semantic nature of embeddings and thus offer limited protection: they are unreliable, insufficiently stealthy, and distort the original semantic distribution. The key to the proposed semantic watermarking paradigm, SemMark, is to partition the semantic space via locality-sensitive hashing and inject semantic-aware watermark signals into specific regions, ensuring imperceptibility and diversity; an adaptive watermark-weight mechanism based on the local outlier factor minimizes perturbation of the original embedding distribution, achieving stronger verifiability, stealthiness, harmlessness, and diversity.
Link: https://arxiv.org/abs/2512.16439
Authors: Hao Li,Yubing Ren,Yanan Cao,Yingjie Li,Fang Fang,Xuebin Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:
Abstract:Benefiting from the superior capabilities of large language models in natural language understanding and generation, Embeddings-as-a-Service (EaaS) has emerged as a successful commercial paradigm on the web platform. However, prior studies have revealed that EaaS is vulnerable to imitation attacks. Existing methods protect the intellectual property of EaaS through watermarking techniques, but they all ignore the most important properties of embedding: semantics, resulting in limited harmlessness and stealthiness. To this end, we propose SemMark, a novel semantic-based watermarking paradigm for EaaS copyright protection. SemMark employs locality-sensitive hashing to partition the semantic space and inject semantic-aware watermarks into specific regions, ensuring that the watermark signals remain imperceptible and diverse. In addition, we introduce the adaptive watermark weight mechanism based on the local outlier factor to preserve the original embedding distribution. Furthermore, we propose Detect-Sampling and Dimensionality-Reduction attacks and construct four scenarios to evaluate the watermarking method. Extensive experiments are conducted on four popular NLP datasets, and SemMark achieves superior verifiability, diversity, stealthiness, and harmlessness.
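The region-selective injection can be sketched with random-hyperplane LSH: hash each embedding to a bucket and add a watermark direction only in designated buckets. The dimensions, bit width, and weighting hook below are illustrative assumptions (SemMark derives the weight from a local outlier factor):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 768, 8
HYPERPLANES = rng.standard_normal((BITS, DIM))  # shared secret of the provider

def lsh_bucket(embedding):
    """Sign pattern against random hyperplanes -> bucket id in [0, 2**BITS)."""
    bits = (HYPERPLANES @ embedding) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

def inject_watermark(embedding, bucket_directions, weight=0.05):
    """Add a bucket-specific watermark direction in designated regions only."""
    bucket = lsh_bucket(embedding)
    if bucket in bucket_directions:
        out = embedding + weight * bucket_directions[bucket]
        return out / np.linalg.norm(out)
    return embedding
```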
[NLP-22] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains
【Quick Read】: This paper addresses three technical barriers to deploying automatic speech recognition (ASR) in clinical settings: strict data-privacy constraints, limited computational resources, and severe acoustic domain shift. The study finds that even the robust multilingual model IndicWav2Vec degrades to a 40.94% word error rate (WER) on real clinical audio (Gram Vaani), making it unusable in practice. The key to the solution is an efficient, privacy-preserving adaptation framework built on Low-Rank Adaptation (LoRA), which lets the model continually learn from incoming data streams directly on edge devices, protecting patient privacy; multi-domain experience replay further reduces catastrophic forgetting by 47%, and the approach yields a 17.1% relative WER improvement on the target domain, charting a viable path to reliable, self-improving ASR for high-impact real-world environments.
Link: https://arxiv.org/abs/2512.16401
Authors: Darshil Chauhan,Adityasinh Solanki,Vansh Patel,Kanav Kapoor,Ritvik Jain,Aditya Bansal,Dhruv Kumar,Prateek Narang
Affiliations: BITS Pilani, Pilani Campus, India; Qure.ai, India
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
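A minimal sketch of the LoRA setup with the Hugging Face peft library; the checkpoint name and target modules are assumptions about a wav2vec2-style ASR model, not the paper's published configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import Wav2Vec2ForCTC

# Hypothetical English checkpoint standing in for IndicWav2Vec.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train,
                                    # keeping on-device adaptation cheap
```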
[NLP-23] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLM s
【Quick Read】: This paper addresses the open question of whether current speech LLMs (SpeechLLMs) outperform traditional cascade architectures for speech-to-text translation. The key to the solution is Hearing to Translate, the first comprehensive test suite, which systematically benchmarks 5 state-of-the-art SpeechLLMs against 16 strong baselines (direct end-to-end models and cascades coupling speech foundation models (SFMs) with multilingual LLMs) across 16 benchmarks, 13 language pairs, and 9 challenging conditions (e.g., disfluent, noisy, and long-form speech). Results show that although SpeechLLMs can in principle bypass transcription pipelines, they currently only match cascades in selected settings and remain unstable under challenging conditions, while integrating an LLM, either within the model or in a pipeline, is the key factor for high-quality speech translation.
Link: https://arxiv.org/abs/2512.16378
Authors: Sara Papi,Javier Garcia Gilabert,Zachary Hopton,Vilém Zouhar,Carlos Escolano,Gerard I. Gállego,Jorge Iranzo-Sánchez,Ahrii Kim,Dominik Macháček,Patricia Schmidtova,Maike Züfle
Affiliations: Fondazione Bruno Kessler; Barcelona Supercomputing Center; University of Zurich; ETH Zurich; Universitat Politècnica de Catalunya; Universitat Politècnica de València; AI-Bio Convergence Research Institute; Charles University; KIT
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Project available at this https URL
Abstract:As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
[NLP-24] Hacking Neural Evaluation Metrics with Single Hub Text
【Quick Read】: This paper raises reliability and safety concerns about embedding-based neural evaluation metrics (such as COMET), whose black-box nature can make evaluation results unreliable. To expose such vulnerabilities, the authors propose a method that searches the discrete text space for a single adversarial text that is consistently rated high-quality regardless of the test case. The key is optimizing a "hub text" that obtains higher COMET scores than the outputs of an actual translation model across multiple translation tasks and language pairs, confirming exploitable weaknesses in current evaluation metrics and demonstrating cross-lingual generalization.
Link: https://arxiv.org/abs/2512.16323
Authors: Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai
Affiliations: NTT, Inc.; Nara Institute of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT’24 English-to-Japanese (En–Ja) and English-to-German (En–De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja–En and De–En.
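The abstract does not spell out the search procedure, but the optimization problem can be illustrated with a simple greedy token-swap hill climb over a black-box scorer that averages the metric across test cases; everything here is a generic stand-in, not the paper's method:

```python
import random

def find_hub_text(score_fn, vocab, length=20, iters=2000, seed=0):
    """Greedy discrete search for one text maximizing score_fn.

    score_fn(text) should return the metric score averaged over a batch of
    test cases (e.g., mean COMET with the candidate as the hypothesis)."""
    rnd = random.Random(seed)
    tokens = [rnd.choice(vocab) for _ in range(length)]
    best = score_fn(" ".join(tokens))
    for _ in range(iters):
        i = rnd.randrange(length)
        candidate = tokens.copy()
        candidate[i] = rnd.choice(vocab)
        score = score_fn(" ".join(candidate))
        if score > best:              # keep the swap only if it helps
            tokens, best = candidate, score
    return " ".join(tokens), best
```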
[NLP-25] Agent Tools Orchestration Leaks More: Dataset Benchmark and Mitigation
【Quick Read】: This paper studies a new privacy risk introduced by the LLM-driven single-agent multi-tool architecture, termed Tools Orchestration Privacy Risk (TOP-R): while pursuing a benign user goal, an agent aggregates information fragments across multiple tools and uses its reasoning to synthesize unintended sensitive information, causing privacy leakage. The paper offers a systematic treatment whose key component is the Privacy Enhancement Principle (PEP), which rebalances the agent's objective between helpfulness and privacy awareness to mitigate TOP-R. Experiments show PEP reduces the average Risk Leakage Rate (RLR) from 90.24% to 46.58% and substantially raises the holistic safety-and-robustness H-Score to 0.624, validating its effectiveness.
Link: https://arxiv.org/abs/2512.16310
Authors: Yuxuan Qiao,Dongqin Liu,Hongchang Yang,Wei Zhou,Songlin Hu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Driven by Large Language Models, the single-agent, multi-tool architecture has become a popular paradigm for autonomous agents due to its simplicity and effectiveness. However, this architecture also introduces a new and severe privacy risk, which we term Tools Orchestration Privacy Risk (TOP-R), where an agent, to achieve a benign user goal, autonomously aggregates information fragments across multiple tools and leverages its reasoning capabilities to synthesize unexpected sensitive information. We provide the first systematic study of this risk. First, we establish a formal framework, attributing the risk’s root cause to the agent’s misaligned objective function: an overoptimization for helpfulness while neglecting privacy awareness. Second, we construct TOP-Bench, comprising paired leakage and benign scenarios, to comprehensively evaluate this risk. To quantify the trade-off between safety and robustness, we introduce the H-Score as a holistic metric. The evaluation results reveal that TOP-R is a severe risk: the average Risk Leakage Rate (RLR) of eight representative models reaches 90.24%, while the average H-Score is merely 0.167, with no model exceeding 0.3. Finally, we propose the Privacy Enhancement Principle (PEP) method, which effectively mitigates TOP-R, reducing the Risk Leakage Rate to 46.58% and significantly improving the H-Score to 0.624. Our work reveals both a new class of risk and inherent structural limitations in current agent architectures, while also offering feasible mitigation strategies.
[NLP-26] Adaptation of Agentic AI
【Quick Read】: This paper addresses the growing diversity and lack of systematic categorization and comparison of adaptation strategies in agentic AI systems, which hinders progress on performance, reliability, and generalization. The key to the solution is a unified framework that subdivides adaptation into tool-execution-signaled and agent-output-signaled forms of agent adaptation, and agent-agnostic and agent-supervised forms of tool adaptation. The framework clarifies the design space of adaptation strategies, makes the trade-offs between them explicit, and offers practical guidance for selecting or switching strategies in real system design.
Link: https://arxiv.org/abs/2512.16301
Authors: Pengcheng Jiang,Jiacheng Lin,Zhiyi Shi,Zifeng Wang,Luxi He,Yichen Wu,Ming Zhong,Peiyang Song,Qizheng Zhang,Heng Wang,Xueqiang Xu,Hanwen Xu,Pengrui Han,Dylan Zhang,Jiashuo Sun,Chaoqi Yang,Kun Qian,Tian Wang,Changran Hu,Manling Li,Quanzheng Li,Hao Peng,Sheng Wang,Jingbo Shang,Chao Zhang,Jiaxuan You,Liyuan Liu,Pan Lu,Yu Zhang,Heng Ji,Yejin Choi,Dawn Song,Jimeng Sun,Jiawei Han
Affiliations: UIUC; Stanford; Princeton; Harvard; UW; Caltech; UC Berkeley; UCSD; Georgia Tech; Northwestern; TAMU; Unity
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
[NLP-27] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures
[Quick Read]: This paper addresses the insufficient evaluation of large language models (LLMs) on low-resource and endangered languages, focusing on translation between Finnish and four low-resource Uralic languages (Komi-Zyrian, Moksha, Erzya, and Udmurt). The key idea is to compare the willingness of reasoning and non-reasoning architectures to attempt translation on a parallel corpus of literary texts; a refusal-rate analysis shows that reasoning models refuse 16 percentage points less often, indicating greater robustness and practicality for low-resource translation and offering actionable guidance on model selection for endangered-language preservation.
Link: https://arxiv.org/abs/2512.16287
Authors: Yehor Tereshchenko,Mika Hämäläinen,Svitlana Myroniuk
Affiliations: Metropolia University of Applied Sciences; University of Helsinki
Subjects: Computation and Language (cs.CL)
Comments: IWCLUL 2025
Abstract:The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI’s GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.
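As a concrete illustration of the refusal-rate metric used above, the following sketch shows how such a rate could be computed from raw model outputs. The marker list and the toy outputs are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical refusal-rate computation; the marker list is an assumption.
REFUSAL_MARKERS = ["i cannot translate", "i'm unable", "cannot assist"]

def refusal_rate(outputs: list[str]) -> float:
    """Fraction of outputs that decline to attempt a translation."""
    refusals = sum(
        any(marker in out.lower() for marker in REFUSAL_MARKERS)
        for out in outputs
    )
    return refusals / len(outputs) if outputs else 0.0

non_reasoning = ["I cannot translate this language."] * 4 + ["<translation>"] * 6
reasoning = ["<translation>"] * 9 + ["I'm unable to help."]
print(round(refusal_rate(non_reasoning) - refusal_rate(reasoning), 2))  # 0.3
```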
[NLP-28] QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems
[Quick Read]: This paper addresses the safety risks that arise when LLM-based agents carry out complex tasks through tool calls, multi-step planning, and inter-agent communication. Existing approaches rely on deployer-written natural-language policies that are ambiguous and context-dependent, map poorly to machine-checkable rules, and make runtime enforcement unreliable. The key solution is QuadSentinel, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that formalizes safety policies as logical sequents, compiles them into predicate rules over observable state, and enforces them online; the referee logic works with an efficient top-k predicate updater to cut overhead through prioritized checks and hierarchical conflict resolution, improving rule recall while reducing false positives.
Link: https://arxiv.org/abs/2512.16279
Authors: Yiliu Yang,Yilei Jiang,Qunzhong Wang,Yingshui Tan,Xiaoyong Zhu,Sherman S.M. Chow,Bo Zheng,Xiangyu Yue
Affiliations: The Chinese University of Hong Kong; Alibaba Group
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint
Abstract:Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose QuadSentinel, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic plus an efficient top-k predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. Measured on ST-WebAgentBench (ICML CUA '25) and AgentHarm (ICLR '25), QuadSentinel improves guardrail accuracy and rule recall while reducing false positives. Against single-agent baselines such as ShieldAgent (ICML '25), it yields better overall safety control. Near-term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine-checkable. Our code will be made publicly available at this https URL.
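The sketch below illustrates, under our own simplifying assumptions, what compiling a safety policy into machine-checkable predicate rules over observable state might look like; the rule and state fields are hypothetical and not taken from the QuadSentinel paper.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # observable agent state, e.g. {"tool": "shell", "approved": False}

@dataclass
class Rule:
    name: str
    premise: Callable[[State], bool]     # antecedent of the sequent
    conclusion: Callable[[State], bool]  # must hold whenever the premise holds

    def violated(self, state: State) -> bool:
        return self.premise(state) and not self.conclusion(state)

# Hypothetical policy: shell-tool calls require prior human approval.
rules = [
    Rule("shell_requires_approval",
         premise=lambda s: s.get("tool") == "shell",
         conclusion=lambda s: s.get("approved", False)),
]

def referee(state: State) -> list[str]:
    """Names of all rules violated by the current step."""
    return [r.name for r in rules if r.violated(state)]

print(referee({"tool": "shell", "approved": False}))  # ['shell_requires_approval']
```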
[NLP-29] Sigma-MoE-Tiny Technical Report
[Quick Read]: This paper tackles expert load imbalance in extremely sparse Mixture-of-Experts (MoE) models, where the conventional load-balancing loss becomes ineffective in lower layers and destabilizes training. The key solution is a progressive sparsification schedule that balances expert utilization against training stability, yielding an efficient MoE design that activates only 0.5B parameters while matching the performance of considerably larger models.
Link: https://arxiv.org/abs/2512.16248
Authors: Qingguo Hu,Zhenghao Lin,Ziyue Yang,Yucheng Ding,Xiao Liu,Yuting Jiang,Ruizhe Wang,Tianyu Chen,Zhongxin Guo,Yifan Xiong,Rui Gao,Lei Qu,Jinsong Su,Peng Cheng,Yeyun Gong
Affiliations: Microsoft Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures. Project page: this https URL Code: this https URL
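A minimal sketch of a progressive sparsification schedule, assuming a simple linear anneal from several active experts per token down to one; the paper's actual schedule is not specified here, so the numbers are illustrative.

```python
# A minimal sketch, assuming a linear annealing variant: the interpolation
# below is an illustrative assumption, not the authors' exact schedule.
def active_experts(step: int, total_steps: int, k_start: int = 4, k_end: int = 1) -> int:
    """Progressively sparsify routing from k_start experts/token down to k_end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))

for s in [0, 25_000, 50_000, 75_000, 100_000]:
    print(s, active_experts(s, total_steps=100_000))
```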
[NLP-30] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
[Quick Read]: This paper addresses the limited parallelism of diffusion large language model (dLLM) inference, where current confidence-driven decoding strategies typically achieve only 1-3 tokens per forward pass (TPF), severely constraining inference speed. The key insight is that parallelism is highly sensitive to the token filling order (TFO); the authors propose LoPA (Lookahead PArallel decoding), a training-free, plug-and-play algorithm that explores candidate TFOs in parallel branches and selects the branch with the highest potential for future parallelism based on branch confidence. LoPA raises the TPF of D2F-Dream to 10.1 on GSM8K, and a multi-device inference system purpose-built for Branch Parallelism (BP) reaches a single-sample throughput of up to 1073.9 tokens per second.
Link: https://arxiv.org/abs/2512.16229
Authors: Chenkai Xu,Yijie Jin,Jiajun Li,Yi Tu,Guoping Long,Dandan Tu,Tianqi Hou,Junchi Yan,Zhijie Deng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1–3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). We then introduce Lookahead PArallel decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at this https URL.
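The following toy sketch conveys the branch-selection idea as we read it: score candidate token-filling orders by aggregate confidence and keep the best branch. The names and probabilities are made up for illustration.

```python
import math

def branch_confidence(token_probs: list[float]) -> float:
    """Aggregate per-token confidence of one candidate branch (log-domain)."""
    return sum(math.log(p) for p in token_probs)

def select_branch(branches: dict[str, list[float]]) -> str:
    """Keep the branch whose fill order the model is most confident about."""
    return max(branches, key=lambda name: branch_confidence(branches[name]))

candidates = {
    "left_to_right": [0.9, 0.4, 0.8],
    "confidence_first": [0.9, 0.8, 0.7],
}
print(select_branch(candidates))  # confidence_first
```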
[NLP-31] An Information-Theoretic Framework for Robust Large Language Model Editing
[Quick Read]: This paper addresses the challenge of updating knowledge in large language models (LLMs) accurately, safely, and generally without full retraining; existing model-editing methods are often confined to narrow domains, trigger unintended behavior, and have limited practical effect. The key solution is a new framework grounded in information bottleneck theory, the Information Bottleneck Knowledge Editor (IBKE), which compresses and isolates the information essential for a correction while minimizing impact on unrelated behavior; IBKE steers gradient-based updates with compact latent representations, delivering robust, broadly generalizable edits with state-of-the-art results across multiple LLM architectures and benchmark tasks.
Link: https://arxiv.org/abs/2512.16227
Authors: Qizhou Chen,Chengyu Wang,Taolin Zhang,Xiaofeng He
Affiliations: East China Normal University; Alibaba Group; Hefei University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have become indispensable tools in science, technology, and society, enabling transformative advances across diverse fields. However, errors or outdated information within these models can undermine their accuracy and restrict their safe deployment. Developing efficient strategies for updating model knowledge without the expense and disruption of full retraining remains a critical challenge. Current model editing techniques frequently struggle to generalize corrections beyond narrow domains, leading to unintended consequences and limiting their practical impact. Here, we introduce a novel framework for editing LLMs, grounded in information bottleneck theory. This approach precisely compresses and isolates the essential information required for generalizable knowledge correction while minimizing disruption to unrelated model behaviors. Building upon this foundation, we present the Information Bottleneck Knowledge Editor (IBKE), which leverages compact latent representations to guide gradient-based updates, enabling robust and broadly applicable model editing. We validate IBKE’s effectiveness across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy and improved generality and specificity of edits. These findings establish a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.
[NLP-32] Mitigating Hallucinations in Healthcare LLM s with Granular Fact-Checking and Domain-Specific Adaptation
[Quick Read]: This paper targets hallucination in LLM-generated healthcare content, where reliability and accuracy are critical for clinical decision-making and patient safety. The key solution is a fact-checking module that operates independently of any LLM, performing granular verification via numerical tests and discrete NLP-based logical checks against electronic health records (EHRs), combined with a domain-specific summarization model fine-tuned on MIMIC-III with Low-Rank Adaptation (LoRA) to lower hallucination rates and improve summary quality.
Link: https://arxiv.org/abs/2512.16189
Authors: Musarrat Zeba,Abdullah Al Mamun,Kishoar Jahan Tithee,Debopom Sutradhar,Mohaimenul Azam Khan Raiaan,Saddam Mukta,Reem E. Mohamed,Md Rafiqul Islam,Yakub Sebastian,Mukhtar Hussain,Sami Azam
Affiliations: Applied Artificial Intelligence and INtelligent Systems (AAIINS) Laboratory; United International University; Monash University; Lappeenranta-Lahti University of Technology; Charles Darwin University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRa) on the MIMIC III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM model on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
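A minimal sketch of the numerical-verification idea, assuming EHR values are available as structured fields; the field names and tolerance are hypothetical, not taken from the paper.

```python
import re

# Hypothetical EHR fields for illustration.
ehr = {"heart_rate": 112, "temperature_c": 38.4}

def check_numeric_claim(sentence: str, field: str, tolerance: float = 0.0) -> bool:
    """True if the first number in the sentence matches the EHR value."""
    match = re.search(r"[-+]?\d+(?:\.\d+)?", sentence)
    if match is None:
        return False
    return abs(float(match.group()) - ehr[field]) <= tolerance

print(check_numeric_claim("Heart rate was 112 bpm on admission.", "heart_rate"))  # True
print(check_numeric_claim("Temperature peaked at 39.4 C.", "temperature_c"))      # False
```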
[NLP-33] A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
[Quick Read]: This paper addresses the difficulty of structured information extraction from police incident announcements and similar unstructured text (such as social media posts), which are noisy, inconsistently formatted, and informal, hurting the timeliness and accuracy of data processing. The key solution is a domain-adapted extraction pipeline that pairs task-specific prompt engineering with parameter-efficient Low-Rank Adaptation (LoRA) fine-tuning of Qwen2.5-7B, reliably extracting 15 key fields (location, event characteristics, impact assessment, and more). Experiments report over 98.36% accuracy for mortality detection and exact-match rates of 95.31% for fatality counts and 95.54% for province-level locations, validating an efficient solution for multi-task structured extraction in specialized domains.
Link: https://arxiv.org/abs/2512.16183
Authors: Mengfan Shen,Kangqi Song,Xindi Wang,Wei Jia,Tao Wang,Ziqiang Han
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 41 pages, 3 figures, and 9 tables
Abstract:Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.
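For readers who want to reproduce the general setup, the snippet below shows a standard LoRA fine-tuning configuration for Qwen2.5-7B with the Hugging Face `peft` library; the rank, alpha, and target modules are common defaults, not the paper's reported hyperparameters.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
lora = LoraConfig(
    r=16,                # low-rank update dimension (assumed)
    lora_alpha=32,       # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of the 7B weights
```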
[NLP-34] DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
[Quick Read]: This paper addresses the weakness of existing LLM watermarking schemes against piggyback spoofing attacks, in addition to paraphrase attacks; spoofing can inject harmful content, break watermark reliability, and undermine trust in attribution. The key solution is DualGuard, built on an adaptive dual-stream watermarking mechanism that dynamically injects two complementary watermark signals conditioned on semantic content, providing defense against both attack classes: it can not only detect the watermark but also trace the source of spoofing, markedly improving detectability, robustness, traceability, and text quality, and moving LLM watermarking closer to real-world deployment.
Link: https://arxiv.org/abs/2512.16182
Authors: Hao Li,Yubing Ren,Yanan Cao,Yingjie Li,Fang Fang,Shi Wang,Li Guo
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:
Abstract:With the rapid development of cloud-based services, large language models (LLMs) have become increasingly accessible through various web platforms. However, this accessibility has also led to growing risks of model abuse. LLM watermarking has emerged as an effective approach to mitigate such misuse and protect intellectual property. Existing watermarking algorithms, however, primarily focus on defending against paraphrase attacks while overlooking piggyback spoofing attacks, which can inject harmful content, compromise watermark reliability, and undermine trust in attribution. To address this limitation, we propose DualGuard, the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks. DualGuard employs the adaptive dual-stream watermarking mechanism, in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection. Extensive experiments conducted across multiple datasets and language models demonstrate that DualGuard achieves excellent detectability, robustness, traceability, and text quality, effectively advancing the state of LLM watermarking for real-world applications.
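As background, the snippet below implements the standard z-score test used in green-list watermark detection (in the style of Kirchenbauer et al.); DualGuard's dual-stream, semantics-conditioned scheme builds on this kind of statistical detection but is not reproduced here.

```python
import math

def watermark_z_score(green_hits: int, total_tokens: int, gamma: float = 0.5) -> float:
    """z-score of observing `green_hits` green-list tokens out of `total_tokens`."""
    expected = gamma * total_tokens
    variance = gamma * (1 - gamma) * total_tokens
    return (green_hits - expected) / math.sqrt(variance)

# A text with far more green-list tokens than chance suggests a watermark.
print(round(watermark_z_score(green_hits=140, total_tokens=200), 2))  # 5.66
```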
[NLP-35] Science Consultant Agent
[Quick Read]: This paper addresses the inefficiency of choosing and implementing AI modeling strategies: when building AI-based solutions, practitioners struggle to quickly identify the best-suited modeling approach and put it into practice. The key solution is the Science Consultant Agent, an integrated web-based agent whose four core components (Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder) combine domain knowledge, literature-backed strategy recommendations, and automated prototype generation, markedly accelerating the path from decision to implementation for product managers, software developers, and researchers alike.
Link: https://arxiv.org/abs/2512.16171
Authors: Karthikeyan K,Philip Wu,Xin Tang,Alexandre Alves
Affiliations: Duke University; Amazon
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:The Science Consultant Agent is a web-based Artificial Intelligence (AI) tool that helps practitioners select and implement the most effective modeling strategy for AI-based solutions. It operates through four core components: Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder. By combining structured questionnaires, literature-backed solution recommendations, and prototype generation, the Science Consultant Agent accelerates development for everyone from Product Managers and Software Developers to Researchers. The full pipeline is illustrated in Figure 1.
[NLP-36] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning
[Quick Read]: This paper addresses the detection of hate speech driven by fake narratives (Faux-Hate) on social media, centered on identifying such harmful content in code-mixed Hindi-English text and further predicting the target and severity of the hateful content. The key solution combines advanced natural language processing techniques with domain-specific pretraining in a multi-task learning framework that jointly optimizes binary Faux-Hate classification and target/severity prediction, improving overall detection performance.
Link: https://arxiv.org/abs/2512.16147
Authors: Yash Bhaskar,Sankalp Bahad,Parameswari Krishnamurthy
Affiliations: IIIT Hyderabad
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted Paper, Anthology ID: this http URL -fauxhate.3, 4 pages, 1 figure, 1 table
Abstract:Social media platforms, while enabling global connectivity, have become hubs for the rapid spread of harmful content, including hate speech and fake narratives (Davidson et al., 2017; Shu et al., 2017). The Faux-Hate shared task focuses on detecting a specific phenomenon: the generation of hate speech driven by fake narratives, termed Faux-Hate. Participants are challenged to identify such instances in code-mixed Hindi-English social media text. This paper describes our system developed for the shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involving fake and hate speech classification, and (b) Target and Severity prediction, categorizing the intended target and severity of hateful content. Our approach combines advanced natural language processing techniques with domain-specific pretraining to enhance performance across both tasks. The system achieved competitive results, demonstrating the efficacy of leveraging multi-task learning for this complex problem.
[NLP-37] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation
[Quick Read]: This paper addresses the limited clinical correctness of current medical report generation (MRG) methods, which are trained with token-level objectives that imitate radiologists' linguistic style but ignore agreement with actual clinical labels, so fluent reports may still be medically wrong. The key solution is semantic-driven reinforcement learning (SRL) built on Group Relative Policy Optimization (GRPO) with a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, which directly improves clinical-label agreement; a lightweight reasoning-format constraint further guides the model toward structured "thinking report" outputs. Experiments show state-of-the-art clinical efficacy (CE) on IU X-Ray and MIMIC-CXR, confirming that clinically grounded report-level rewards beat conventional token-level supervision.
Link: https://arxiv.org/abs/2512.16145
Authors: Pengyu Wang,Shuchang Ye,Usman Naseem,Jinman Kim
Affiliations: The University of Sydney; Macquarie University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages
Abstract:Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word-choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly rewarding clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward rather than token overlap meaningfully improves clinical correctness. This work is a first step toward exploring semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
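A minimal sketch of a margin-based cosine similarity reward in the spirit of MCCS, assuming findings have already been embedded as vectors; the margin value and embeddings are placeholders rather than the authors' choices.

```python
import numpy as np

def mccs_reward(gen_vec: np.ndarray, ref_vec: np.ndarray, margin: float = 0.8) -> float:
    """Margin-based cosine similarity between generated and reference findings."""
    cos = float(gen_vec @ ref_vec / (np.linalg.norm(gen_vec) * np.linalg.norm(ref_vec)))
    return cos if cos >= margin else 0.0  # below-margin reports earn no reward

gen = np.array([0.2, 0.9, 0.4])  # toy embedding of generated findings
ref = np.array([0.1, 1.0, 0.3])  # toy embedding of reference findings
print(round(mccs_reward(gen, ref), 3))  # 0.987
```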
[NLP-38] Convolutional Lie Operator for Sentence Classification
[Quick Read]: This paper addresses the limitation of traditional convolutional neural networks (CNNs) in capturing complex transformations in language: while strong at extracting local, position-invariant features, they struggle to model the more complex non-Euclidean symmetries of language. The key idea is to introduce Lie-group operations by integrating Lie convolutions into convolutional sentence classifiers, capturing transformation patterns in linguistic data that conventional CNNs cannot. Experiments show the proposed SCLie and DPCLie models outperform traditional CNN-based classifiers in accuracy, demonstrating the promise of Lie-algebraic representation learning for language modeling.
Link: https://arxiv.org/abs/2512.16125
Authors: Daniela N. Rim,Heeyoul Choi
Affiliations: Handong Global University
Subjects: Computation and Language (cs.CL)
Comments: Proceedings of the 2024 8th International Conference on Natural Language Processing and Information Retrieval
Abstract:Traditional Convolutional Neural Networks have been successful in capturing local, position-invariant features in text, but their capacity to model complex transformation within language can be further explored. In this work, we explore a novel approach by integrating Lie Convolutions into Convolutional-based sentence classifiers, inspired by the ability of Lie group operations to capture complex, non-Euclidean symmetries. Our proposed models SCLie and DPCLie empirically outperform traditional Convolutional-based sentence classifiers, suggesting that Lie-based models relatively improve the accuracy by capturing transformations not commonly associated with language. Our findings motivate more exploration of new paradigms in language modeling.
[NLP-39] ContextLeak: Auditing Leakage in Private In-Context Learning Methods
[Quick Read]: This paper addresses how to measure whether privacy-preserving mechanisms for in-context learning (ICL) leak information when exemplars contain sensitive data. Many defenses have been proposed (prompt-based defenses, embedding-space aggregation, Report Noisy Max), but there is no systematic way to audit their actual protection. The key solution is ContextLeak, a framework that empirically measures worst-case leakage via canary insertion: uniquely identifiable marker tokens are embedded in exemplars, and targeted queries test whether the model's outputs reveal them. ContextLeak correlates tightly with the theoretical privacy budget (ε), reliably detects leakage across privacy-preserving techniques, and shows that many current methods strike poor privacy-utility trade-offs.
Link: https://arxiv.org/abs/2512.16059
Authors: Jacob Choi,Shuying Cao,Xingjian Dong,Wang Bill Zhu,Robin Jia,Sai Praneeth Karimireddy
Affiliations: University of Southern California
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:
Abstract:In-Context Learning (ICL) has become a standard technique for adapting Large Language Models (LLMs) to specialized tasks by supplying task-specific exemplars within the prompt. However, when these exemplars contain sensitive information, reliable privacy-preserving mechanisms are essential to prevent unintended leakage through model outputs. Many privacy-preserving methods are proposed to protect the information leakage in the context, but there are less efforts on how to audit those methods. We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. ContextLeak uses canary insertion, embedding uniquely identifiable tokens in exemplars and crafting targeted queries to detect their presence. We apply ContextLeak across a range of private ICL techniques, both heuristic such as prompt-based defenses and those with theoretical guarantees such as Embedding Space Aggregation and Report Noisy Max. We find that ContextLeak tightly correlates with the theoretical privacy budget ( \epsilon ) and reliably detects leakage. Our results further reveal that existing methods often strike poor privacy-utility trade-offs, either leaking sensitive information or severely degrading performance.
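The canary-insertion idea can be sketched as follows; `private_icl_answer` is a hypothetical stand-in for whatever defended ICL pipeline is being audited, and the exemplar template is invented for illustration.

```python
import secrets

def make_canary() -> str:
    return f"CANARY-{secrets.token_hex(4)}"

def audit_leakage(private_icl_answer, n_trials: int = 100) -> float:
    """Empirical leakage rate of a defended ICL pipeline against canaries."""
    leaks = 0
    for _ in range(n_trials):
        canary = make_canary()
        exemplar = f"Patient note: treated with {canary} on day 3."
        answer = private_icl_answer(context=[exemplar],
                                    query="Which treatment was given on day 3?")
        leaks += canary in answer
    return leaks / n_trials

# Demo with a deliberately leaky stand-in system: leakage rate is 1.0.
leaky = lambda context, query: context[0]
print(audit_leakage(leaky, n_trials=10))
```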
[NLP-40] Are We on the Right Way to Assessing LLM -as-a-Judge?
[Quick Read]: This paper addresses two core problems in current LLM-as-a-Judge evaluation: existing benchmarks depend heavily on human-annotated gold labels, which introduces human bias and limits scalability, and there is no unsupervised way to quantify the reliability of LLM judges themselves. The key solution is Sage, a new evaluation suite grounded in axioms of rational choice theory that measures judge quality without human annotation through two new lenses: local self-consistency (pairwise preference stability) and global logical consistency (transitivity across a full set of preferences). On a curated dataset of 650 questions mixing structured benchmark problems with real user queries, the metrics prove stable and correlate strongly with supervised benchmarks such as LLMBar and RewardBench2, establishing a reliable unsupervised framework for assessing the robustness and accuracy of LLM judges.
Link: https://arxiv.org/abs/2512.16041
Authors: Yuanning Feng,Sinan Wang,Zhengxiang Cheng,Yao Wan,Dongping Chen
Affiliations: Huazhong University of Science and Technology; University of Maryland
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge are mainly relying on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage’s reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
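Sage's global logical-consistency lens can be illustrated with a small transitivity check over a judge's pairwise preferences; the preference table below is fabricated to show one intransitive cycle.

```python
from itertools import permutations

# Fabricated pairwise preferences with one cycle: a > b, b > c, c > a.
prefers = {("a", "b"): True, ("b", "c"): True, ("a", "c"): False}

def prefer(x: str, y: str) -> bool:
    return prefers[(x, y)] if (x, y) in prefers else not prefers[(y, x)]

def intransitive_triples(items: list[str]) -> int:
    """Count cyclic (intransitive) preference triples among the items."""
    bad = sum(
        prefer(x, y) and prefer(y, z) and prefer(z, x)
        for x, y, z in permutations(items, 3)
    )
    return bad // 3  # each cycle is visited in three rotations

print(intransitive_triples(["a", "b", "c"]))  # 1
```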
[NLP-41] Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms
[Quick Read]: This paper studies how personal self-disclosures can improve prediction of annotator labels on subjective tasks, in particular modeling annotation patterns for social-norm judgments. The key approach is to systematically categorize self-disclosure sentences and build annotator models on those categories; the study finds that demographics are more predictive than attitudes, relationships, and experiences, that theory-based categorizations beat automatic clustering, that only a small number of related disclosures is needed for effective prediction, and that a more diverse sample of annotator self-disclosures yields the best performance.
Link: https://arxiv.org/abs/2512.16034
Authors: Kieran Henderson,Kian Omoomi,Vasudha Varadarajan,Allison Lahnala,Charles Welch
Affiliations: Toronto University; Carnegie Mellon University; McMaster University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosure sentences and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. We find that demographics are more impactful than attitudes, relationships, and experiences. Generally, theory-based approaches worked better than automatic clusters. Contrary to previous work, only a small number of related comments are needed. Lastly, having a more diverse sample of annotator self-disclosures leads to the best performance.
[NLP-42] Cross-Language Bias Examination in Large Language Models
[Quick Read]: This paper addresses the inadequate evaluation of bias in large language models (LLMs) across languages, where existing methods struggle to capture differences between explicit and implicit bias. The key solution is a multilingual bias-evaluation framework that combines explicit bias assessment on the BBQ benchmark with implicit bias measured by a prompt-based Implicit Association Test (IAT); by translating the prompts and word lists into English, Chinese, Arabic, French, and Spanish, it enables direct comparison of bias types across languages. The framework exposes substantial cross-language gaps (Arabic and Spanish show higher stereotype bias) and asymmetric patterns across bias types (age shows the lowest explicit but the highest implicit bias), providing a systematic analysis tool and empirical basis for building fair and effective multilingual LLMs.
Link: https://arxiv.org/abs/2512.16029
Authors: Yuxuan Liang,Marwa Mahmoud
Affiliations: Georgia Institute of Technology; University of Glasgow
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models, combining explicit bias assessment through the BBQ benchmark with implicit bias measurement using a prompt-based Implicit Association Test. By translating the prompts and word list into five target languages, English, Chinese, Arabic, French, and Spanish, we directly compare different types of bias across languages. The results reveal substantial gaps in bias across languages used in LLMs. For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias. We also identify contrasting patterns across bias types. Age shows the lowest explicit bias but the highest implicit bias, emphasizing the importance of detecting implicit biases that are undetectable with standard benchmarks. These findings indicate that LLMs vary significantly across languages and bias dimensions. This study fills a key research gap by providing a comprehensive methodology for cross-lingual bias analysis. Ultimately, our work establishes a foundation for the development of equitable multilingual LLMs, ensuring fairness and effectiveness across diverse languages and cultures.
[NLP-43] Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models
[Quick Read]: This paper addresses the difficulty of balancing computational efficiency against attention fidelity in the multi-head self-attention (MHSA) of large language models: traditional low-rank approximations assume a static rank and cannot adapt to input dynamics or hardware constraints, degrading performance or wasting resources on long sequences. The key solution is Dynamic Rank Reinforcement Learning (DR-RL), which casts rank selection as a sequential policy-optimization problem and uses incremental updates grounded in online matrix perturbation theory to adapt the low-rank MHSA factorization in real time; a lightweight Transformer policy network and batched singular value decomposition (SVD) operations keep the method scalable on modern GPUs, preserving downstream accuracy while substantially reducing FLOPs.
Link: https://arxiv.org/abs/2512.15973
Authors: Caner Erden
Affiliations: Sakarya University of Applied Sciences
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) through the integration of reinforcement learning and online matrix perturbation theory. While traditional low-rank approximations often rely on static rank assumptions–limiting their flexibility across diverse input contexts–our method dynamically selects ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation lies in an RL agent that formulates rank selection as a sequential policy optimization problem, where the reward function strictly balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of full decomposition during inference. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (L > 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: this https URL
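The low-rank factorization step that DR-RL adapts can be sketched with a truncated SVD; here the RL-chosen rank is replaced by fixed values so the fidelity-versus-rank trade-off is visible. This is our illustration, not the paper's code.

```python
import numpy as np

def low_rank_approx(W: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation of W via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # stand-in for an attention weight matrix
for rank in (8, 16, 32):
    err = np.linalg.norm(W - low_rank_approx(W, rank)) / np.linalg.norm(W)
    print(rank, round(float(err), 3))  # relative error shrinks as rank grows
```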
[NLP-44] BRAID: Bounded Reasoning for Autonomous Inference and Decisions
[Quick Read]: This paper addresses the nonlinear relationship among performance, cost, and token usage in large language model (LLM) reasoning, and in particular how to raise the efficiency and accuracy of reasoning in autonomous agent systems. The key solution is BRAID (Bounded Reasoning for Autonomous Inference and Decisions), whose core is bounded reasoning over Mermaid-based structured instruction graphs (machine-readable prompts) in place of unbounded natural-language token expansion, substantially improving reasoning accuracy and cost efficiency.
Link: https://arxiv.org/abs/2512.15959
Authors: Armağan Amcalar,Eyup Cinar
Affiliations: OpenServ Labs; Eskisehir Osmangazi University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Autonomous Inference and Decisions) across multiple GPT model tiers, evaluated on the AdvancedIF, GSM-Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason structurally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at this https URL.
[NLP-45] DSO: Direct Steering Optimization for Bias Mitigation
[Quick Read]: This paper addresses bias in the decisions of vision-language models (VLMs) driven by perceived demographic attributes such as race or gender of people in the input, for example failing to identify women as doctors, along with the trade-off between bias mitigation and overall model performance. The key solution is Direct Steering Optimization (DSO), which uses reinforcement learning to automatically learn linear transformations that steer intermediate activations, enabling controllable inference-time bias intervention with a state-of-the-art balance between fairness and capability. Unlike steering methods that rely on pre-defined heuristics, DSO is optimized end-to-end directly for the behavior-control objective, yielding markedly more effective bias intervention.
Link: https://arxiv.org/abs/2512.15926
Authors: Lucas Monteiro Paes,Nivedha Sivakumar,Yinong Oliver Wang,Masha Fedzechkina Donaldson,Luca Zappella,Nicholas Apostoloff
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
[NLP-46] Social Story Frames: Contextual Reasoning about Narrative Intent and Reception
[Quick Read]: This paper addresses the limitations of current computational models of reader response to stories, which struggle to capture nuanced interpretive, affective, and evaluative reactions. The key solution is SocialStoryFrames, a formalism that distills plausible inferences about reader response (perceived author intent, explanatory and predictive reasoning, affective reactions, and value judgments) by combining conversational context with a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. Two models, SSF-Generator and SSF-Classifier, are validated through human surveys (N=382) and expert annotations, then applied to SSF-Corpus, a curated dataset of 6,140 social media stories, to quantify the frequency of storytelling intents and their variation across communities, opening a scalable, fine-grained, context-sensitive path for studying narratives in online communities.
Link: https://arxiv.org/abs/2512.15925
Authors: Joel Mire,Maria Antoniak,Steven R. Wilson,Zexin Ma,Achyutarama R. Ganti,Andrew Piper,Maarten Sap
Affiliations: Carnegie Mellon University; University of Colorado Boulder; University of Michigan-Flint; University of Connecticut; McGill University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Presented at IC2S2 2025; Under Review (ARR Oct 2025)
Abstract:Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.
[NLP-47] TabReX: Tabular Referenceless eXplainable Evaluation
[Quick Read]: This paper addresses the lack of effective evaluation for tables generated by large language models (LLMs): existing metrics either flatten tables into text, discarding structural information, or depend on fixed reference tables, limiting generalization. The key solution is TabReX, a reference-less, property-driven evaluation framework that maps both the source text and the generated table into canonical knowledge graphs, aligns them via an LLM-guided matching process, and computes interpretable, rubric-aware scores quantifying structural and factual fidelity; it supports controllable trade-offs between sensitivity and specificity and provides human-aligned judgments with cell-level error traces, enabling trustworthy, explainable evaluation of structured generation systems.
Link: https://arxiv.org/abs/2512.15907
Authors: Tejas Anvekar,Juhna Park,Aparna Garimella,Vivek Gupta
Affiliations: Arizona State University; Adobe Research
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
[NLP-48] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
[Quick Read]: This paper addresses the limited performance of multimodal large language models (MLLMs) on fundamental visual reasoning, which stems from learning visual understanding mainly through textual descriptions, a subjective and inherently incomplete supervisory signal, and from overfitting language priors because multimodal instruction tuning is far smaller in scale than text-only pre-training, leading models to overlook visual details. The key solution is JARVIS, a JEPA-inspired framework for self-supervised visual enhancement that integrates the I-JEPA learning paradigm into the standard vision-language alignment pipeline: frozen vision foundation models serve as context and target encoders while the early layers of the LLM are trained as the predictor, learning structural and semantic regularities from images, reducing reliance on language supervision, and improving visual perception without degrading multimodal reasoning.
Link: https://arxiv.org/abs/2512.15885
Authors: Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Pier Luigi Dovesi,Shaghayegh Roohi,Mark Granroth-Wilding,Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia; AMD Silo AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: this https URL.
[NLP-49] From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining
[Quick Read]: This paper addresses the performance bottleneck in decoding speech from brain activity caused by limited training data: prior work relies on small neural recordings collected in short, highly controlled experiments, restricting generalization. The key solution is to leverage week-long intracranial and audio recordings acquired during patients' clinical monitoring, increasing the training data by over two orders of magnitude, and to pretrain a contrastive learning model on them. This substantially outperforms models trained only on classic experimental data, with gains scaling log-linearly with dataset size; the analysis further reveals that the global structure of speech-related brain signals drifts across days, underscoring the need for models that explicitly account for cross-day variability.
Link: https://arxiv.org/abs/2512.15830
Authors: Linnea Evanson,Mingfang (Lucy) Zhang,Hubert Banville,Saarang Panchavati,Pierre Bourdillon,Jean-Rémi King
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: Linnea Evanson* and Mingfang (Lucy) Zhang* are joint first authors. Pierre Bourdillon** and Jean-Rémi King** are joint last authors
Abstract:Decoding speech from brain activity has typically relied on limited neural recordings collected during short and highly controlled experiments. Here, we introduce a framework to leverage week-long intracranial and audio recordings from patients undergoing clinical monitoring, effectively increasing the training dataset size by over two orders of magnitude. With this pretraining, our contrastive learning model substantially outperforms models trained solely on classic experimental data, with gains that scale log-linearly with dataset size. Analysis of the learned representations reveals that, while brain activity represents speech features, its global structure largely drifts across days, highlighting the need for models that explicitly account for cross-day variability. Overall, our approach opens a scalable path toward decoding and modeling brain representations in both real-life and controlled task settings.
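A minimal sketch of a contrastive (InfoNCE-style) objective for aligning brain segments with their audio clips, using toy NumPy embeddings; the temperature, dimensions, and data are assumptions, not the authors' setup.

```python
import numpy as np

def info_nce(brain: np.ndarray, audio: np.ndarray, tau: float = 0.1) -> float:
    """Mean cross-entropy of matching each brain segment to its audio clip."""
    brain = brain / np.linalg.norm(brain, axis=1, keepdims=True)
    audio = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    logits = brain @ audio.T / tau               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-np.log(np.diag(probs)).mean())

rng = np.random.default_rng(1)
brain = rng.normal(size=(8, 32))
audio = brain + 0.1 * rng.normal(size=(8, 32))  # near-matched toy pairs
print(round(info_nce(brain, audio), 3))  # well-aligned pairs give a low loss
```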
[NLP-50] DP-Bench: A Benchmark for Evaluating Data Product Creation Systems
[Quick Read]: This paper addresses the absence of a unified benchmark for automatic data product creation: despite extensive manual and semi-automatic work on building data products, there is no public, standardized framework for evaluating automated generation. The key contribution is DP-Bench, the first benchmark for this task, which draws on the design of ELT (Extract-Load-Transform) pipelines and Text-to-SQL benchmarks to build a quantifiable, reproducible evaluation suite; the paper also proposes a set of LLM-based baselines that provide a starting point and reference for future research on automatic data product generation.
Link: https://arxiv.org/abs/2512.15798
Authors: Faisal Chowdhury,Sola Shirai,Sarthak Dash,Nandana Mihindukulasooriya,Horst Samulowitz
Affiliations: IBM Research, Yorktown Heights, USA; IBM Research, New York, USA
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Comments:
Abstract:A data product is created with the intention of solving a specific problem, addressing a specific business usecase or meeting a particular need, going beyond just serving data as a raw asset. Data products enable end users to gain greater insights about their data. Since it was first introduced over a decade ago, there has been considerable work, especially in industry, to create data products manually or semi-automatically. However, there exists hardly any benchmark to evaluate automatic data product creation. In this work, we present a benchmark, first of its kind, for this task. We call it DP-Bench. We describe how this benchmark was created by taking advantage of existing work in ELT (Extract-Load-Transform) and Text-to-SQL benchmarks. We also propose a number of LLM based approaches that can be considered as baselines for generating data products automatically. We make DP-Bench and supplementary materials available at this https URL.
[NLP-51] Explainable Ethical Assessment on Human Behaviors by Generating Conflicting Social Norms
[Quick Read]: This paper addresses the lack of explainability and trustworthiness in current AI assessments of the valence (support or oppose) of human behaviors, especially when social norms are not modeled explicitly and systems cannot emulate the logic of human moral judgment. The key solution is ClarityEthic, a novel ethical-assessment approach that generates the conflicting social norms behind human actions to strengthen a language model's moral reasoning, and uses a contrastive learning strategy to better capture the tension between competing norms, thereby predicting and explaining the valence of human behaviors more accurately.
Link: https://arxiv.org/abs/2512.15793
Authors: Yuxi Sun,Wei Gao,Hongzhan Lin,Jing Ma,Wenxuan Zhang
Affiliations: Hong Kong Baptist University; Singapore Management University; Singapore University of Technology and Design
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by Asia-Pacific Chapter of the Association for Computational Linguistics (2025)
Abstract:Human behaviors are often guided or constrained by social norms, which are defined as shared, commonsense rules. For example, underlying an action "report a witnessed crime" are social norms that inform our conduct, such as "It is expected to be brave to report crimes". Current AI systems that assess valence (i.e., support or oppose) of human actions by leveraging large-scale data training not grounded on explicit norms may be difficult to explain, and thus untrustworthy. Emulating human assessors by considering social norms can help AI models better understand and predict valence. While multiple norms come into play, conflicting norms can create tension and directly influence human behavior. For example, when deciding whether to "report a witnessed crime", one may balance bravery against self-protection. In this paper, we introduce ClarityEthic, a novel ethical assessment approach, to enhance valence prediction and explanation by generating conflicting social norms behind human actions, which strengthens the moral reasoning capabilities of language models by using a contrastive learning strategy. Extensive experiments demonstrate that our method outperforms strong baseline approaches, and human evaluations confirm that the generated social norms provide plausible explanations for the assessment of human behaviors.
[NLP-52] A Systematic Analysis of Biases in Large Language Models
[Quick Read]: This paper addresses the need to probe biases and inclinations of large language models (LLMs) across diverse contexts so that their use in information access and decision support remains fair and reliable. The key approach is a systematically designed, multi-angle evaluation of four widely adopted LLMs along five dimensions: political neutrality (news summarization), ideological leaning (news stance classification), geopolitical alliance preferences (United Nations voting patterns), language bias (multilingual story completion), and gender affinities (responses to the World Values Survey), exposing where the models fall short of neutrality and providing empirical grounding for future fairness improvements.
Link: https://arxiv.org/abs/2512.15792
Authors: Xulang Zhang,Rui Mao,Erik Cambria
Affiliations: Nanyang Technological University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types.
[NLP-53] Evaluation of AI Ethics Tools in Language Models: A Developer's Perspective Case Study
[Quick Read]: This paper addresses the fact that AI Ethics Tools (AIETs) for language models often lack adequate documentation, usage examples, and evidence of effectiveness, limiting their contribution to responsible development and deployment. The key solution is a systematic evaluation methodology: a literature survey of 213 AIETs is narrowed to four representative ones (Model Cards, ALTAI, FactSheets, and Harms Modeling), which are applied to the development of Portuguese-language models and assessed from the developers' perspective using 35 hours of interviews. The results suggest that while AIETs can serve as general ethical guides, they fail to capture issues specific to language models (such as idiomatic expressions) or potential harms particular to a given language such as Portuguese.
Link: https://arxiv.org/abs/2512.15791
Authors: Jhessica Silva,Diego A. B. Moreira,Gabriel O. dos Santos,Alef Ferreira,Helena Maia,Sandra Avila,Helio Pedrini
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 figures, 11 tables. Accepted for publication in AI and Ethics
Abstract:In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of simulating realistic conversations with humans through text generation. Because of their impact on society, developing and deploying these language models must be done responsibly, with attention to their negative impacts and possible harms. In this scenario, the number of AI Ethics Tools (AIETs) publications has recently increased. These AIETs are designed to help developers, companies, governments, and other stakeholders establish trust, transparency, and responsibility with their technologies by bringing accepted values to guide AI’s design, development, and use stages. However, many AIETs lack good documentation, examples of use, and proof of their effectiveness in practice. This paper presents a methodology for evaluating AIETs in language models. Our approach involved an extensive literature survey on 213 AIETs, and after applying inclusion and exclusion criteria, we selected four AIETs: Model Cards, ALTAI, FactSheets, and Harms Modeling. For evaluation, we applied AIETs to language models developed for the Portuguese language, conducting 35 hours of interviews with their developers. The evaluation considered the developers’ perspective on the AIETs’ use and quality in helping to identify ethical considerations about their model. The results suggest that the applied AIETs serve as a guide for formulating general ethical considerations about language models. However, we note that they do not address unique aspects of these models, such as idiomatic expressions. Additionally, these AIETs did not help to identify potential negative impacts of models for the Portuguese language.
[NLP-54] Auto-Tuning Safety Guardrails for Black-Box Large Language Models
[Quick Read]: This paper addresses the hand-tuned, brittle, and hard-to-reproduce safety guardrails (system prompts and content filters) behind deployed large language models (LLMs), particularly where model weights cannot be modified. The key idea is to treat guardrail design as a hyperparameter optimization problem over a frozen base model: modular system prompts are systematically combined with content-filtering strategies (a ModernBERT-based harmfulness classifier) and evaluated on malware attack success rate, classic jailbreak attack success rate, harmful-response rate on benign queries, and end-to-end latency. A 48-point grid search establishes a baseline, and black-box Bayesian optimization with Optuna reliably rediscovers the best configurations while needing roughly an order of magnitude fewer evaluations and about 8x less wall-clock time, validating guardrails-as-hyperparameters as a feasible and effective approach.
Link: https://arxiv.org/abs/2512.15782
Authors: Perry Abdulkadir
Affiliations: University of St. Thomas
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures, 1 table. Work completed as part of the M.S. in Artificial Intelligence at the University of St. Thomas using publicly available models and datasets; all views and any errors are the author's own
Abstract:Large language models (LLMs) are increasingly deployed behind safety guardrails such as system prompts and content filters, especially in settings where product teams cannot modify model weights. In practice these guardrails are typically hand-tuned, brittle, and difficult to reproduce. This paper studies a simple but practical alternative: treat safety guardrail design itself as a hyperparameter optimization problem over a frozen base model. Concretely, I wrap Mistral-7B-Instruct with modular jailbreak and malware system prompts plus a ModernBERT-based harmfulness classifier, then evaluate candidate configurations on three public benchmarks covering malware generation, classic jailbreak prompts, and benign user queries. Each configuration is scored using malware and jailbreak attack success rate, benign harmful-response rate, and end-to-end latency. A 48-point grid search over prompt combinations and filter modes establishes a baseline. I then run a black-box Optuna study over the same space and show that it reliably rediscovers the best grid configurations while requiring an order of magnitude fewer evaluations and roughly 8x less wall-clock time. The results suggest that viewing safety guardrails as tunable hyperparameters is a feasible way to harden black-box LLM deployments under compute and time constraints.
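The guardrails-as-hyperparameters idea maps directly onto the Optuna API, as the sketch below shows; the option lists and the scoring stub are placeholders for the paper's actual prompts, filter modes, and benchmark suite.

```python
import optuna

def evaluate_guardrail(system_prompt: str, filter_mode: str) -> float:
    """Stand-in for running the benchmark suite; lower is safer."""
    penalty = {"none": 1.0, "flag": 0.5, "block": 0.2}[filter_mode]
    return penalty + (0.1 if system_prompt == "strict" else 0.3)

def objective(trial: optuna.Trial) -> float:
    prompt = trial.suggest_categorical("system_prompt", ["none", "lenient", "strict"])
    mode = trial.suggest_categorical("filter_mode", ["none", "flag", "block"])
    return evaluate_guardrail(prompt, mode)  # e.g. weighted attack-success rate

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```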
[NLP-55] D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models
[Quick Read]: This paper addresses performance loss and unfair predictions in zero-shot image classification with pretrained multimodal models, caused by limited data diversity and demographic imbalance: low-capacity models underfit, especially on fine-grained classification, and when training data lacks representative samples across demographic groups, predictions skew toward frequent classes while minority groups are neglected, introducing harmful demographic bias. The key solution is Diverse Demographic Data Generation (D3G), a training-free, zero-shot method that uses Stable Diffusion XL to generate demographically diverse data at inference time, strengthening the class discrimination of a pretrained CLIP model and improving accuracy while reducing bias, without modifying any model parameters.
Link: https://arxiv.org/abs/2512.15747
Authors: Javon Hickmon
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:
Abstract:Image classification is a task essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP have been able to perform well on this task by learning semantic similarities across vision and language; however, despite these advances, image classification is still a challenging task. Models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Along with this, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. When datasets do not enforce balanced demographics, the predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias for zero-shot image classification, and explore how to combat these issues in demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. With this method, we utilize CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.
[NLP-56] LLaDA2.0: Scaling Up Diffusion Language Models to 100B
[Quick Read]: This paper targets the high training cost, low efficiency, and difficult performance-resource trade-offs of deploying large-scale language models. Although auto-regressive (AR) models perform well, their sequential generation limits inference efficiency, and training large models from scratch is expensive. The authors propose LLaDA2.0, a family of discrete diffusion large language models (dLLMs) scaled to the 100B-parameter level by systematically converting pre-trained AR models. The core of the solution is a three-phase block-level WSD training scheme: a warm-up phase progressively increases the block size to stabilize the diffusion process; a stable phase runs large-scale full-sequence diffusion training; and a decay phase reverts to compact block diffusion for efficiency. Balancing knowledge inheritance, progressive adaptation, and efficiency-aware design, the recipe yields two instruction-tuned MoE models for practical deployment (LLaDA2.0-mini and LLaDA2.0-flash) that preserve the advantages of parallel decoding while markedly improving performance and inference efficiency at the frontier scale.
Link: https://arxiv.org/abs/2512.15745
Authors: Tiwei Bie,Maosong Cao,Kun Chen,Lun Du,Mingliang Gong,Zhuochen Gong,Yanmei Gu,Jiaqi Hu,Zenan Huang,Zhenzhong Lan,Chengxi Li,Chongxuan Li,Jianguo Li,Zehuan Li,Huabin Liu,Ling Liu,Guoshan Lu,Xiaocheng Lu,Yuxin Ma,Jianfeng Tan,Lanning Wei,Ji-Rong Wen,Yipeng Xing,Xiaolu Zhang,Junbo Zhao,Da Zheng,Jun Zhou,Junlin Zhou,Zhanchao Zhou,Liwang Zhu,Yihong Zhuang
Affiliations: Ant Group; Renmin University of China; Zhejiang University; Westlake University; Hong Kong University of Science and Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages
Abstract:This paper presents LLaDA2.0 – a family of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models – establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaptation, and efficiency-aware design principles, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase block-level WSD based training scheme: progressively increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.
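To make the three-phase recipe concrete, here is a toy block-size schedule; the phase boundaries and block sizes are invented for illustration, since the abstract does not give them:

```python
# Toy sketch of a 3-phase block-level WSD schedule (warm-up / stable / decay).
# Phase fractions and block sizes are illustrative assumptions only.
def block_size_at(step, total_steps, warmup_frac=0.1, decay_frac=0.1,
                  min_block=32, max_block=4096, full_seq=8192):
    warmup_end = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_end:                 # warm-up: progressively grow blocks
        t = step / max(1, warmup_end)
        return int(min_block + t * (max_block - min_block))
    if step < decay_start:                # stable: full-sequence diffusion
        return full_seq
    return min_block                      # decay: revert to compact blocks

print([block_size_at(s, 1000) for s in (0, 50, 500, 990)])
```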
[NLP-57] Value Lens: Using Large Language Models to Understand Human Values ECAI2025
[Quick Read]: This paper addresses how autonomous decision-making systems can ensure their behavior aligns with human values, the core challenge being that a system must assess how much its decisions promote or undermine those values. The key to the solution is the Value Lens model, built on generative AI with Large Language Models (LLMs) in a two-stage detection pipeline: in the first stage, an LLM produces a formal description of a theory of values that is then verified by experts; in the second stage, a pair of LLMs collaborates, one identifying the values present in a text and the other acting as a critic that reviews and refines the detection process. This design enables accurate, interpretable identification of human values and markedly improves the ethical alignment of decision-making systems.
Link: https://arxiv.org/abs/2512.15722
Authors: Eduardo de la Cruz Fernández,Marcelo Karanik,Sascha Ossowski
Affiliations: Universidad Politécnica de Madrid; CETINIA, Universidad Rey Juan Carlos
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages. 2 figures. Published in ECAI 2025, Frontiers in Artificial Intelligence and Applications, Volume 413, pages 5175-5178
Abstract:The autonomous decision-making process, which is increasingly applied to computer systems, requires that the choices made by these systems align with human values. In this context, systems must assess how well their decisions reflect human values. To achieve this, it is essential to identify whether each available action promotes or undermines these values. This article presents Value Lens, a text-based model designed to detect human values using generative artificial intelligence, specifically Large Language Models (LLMs). The proposed model operates in two stages: the first aims to formulate a formal theory of values, while the second focuses on identifying these values within a given text. In the first stage, an LLM generates a description based on the established theory of values, which experts then verify. In the second stage, a pair of LLMs is employed: one LLM detects the presence of values, and the second acts as a critic and reviewer of the detection process. The results indicate that Value Lens performs comparably to, and even exceeds, the effectiveness of other models that apply different methods for similar tasks.
Computer Vision
[CV-0] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
[Quick Read]: This paper tackles the limited ability of existing world models to generate controllable, multi-agent, and grounded scenes, in particular simulating dynamic events driven by user intent. The key to the solution is the WorldCanvas framework, which fuses natural language (for semantic intent), trajectories (encoding motion, timing, and visibility), and reference images (visually anchoring object identity) into a multimodal control interface, producing controllable video events with temporal consistency, object permanence, and complex interaction behaviors, and advancing world models from passive predictors to interactive, user-shaped simulators.
Link: https://arxiv.org/abs/2512.16924
Authors: Hanlin Wang,Hao Ouyang,Qiuyu Wang,Yue Yu,Yihao Meng,Wen Wang,Ka Leong Cheng,Shuailei Ma,Qingyan Bai,Yixuan Li,Cheng Chen,Yanhong Zeng,Xing Zhu,Yujun Shen,Qifeng Chen
Affiliations: HKUST; Ant Group; ZJU; NEU; CUHK; NTU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page and code: this https URL
Abstract:We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories – encoding motion, timing, and visibility – with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: this https URL.
[CV-1] Generative Refocusing: Flexible Defocus Control from a Single Image
[Quick Read]: This paper addresses the key challenges of single-image refocusing: recovering sharp content from blurred input, synthesizing realistic bokeh, and offering controllable aperture parameters. Existing methods commonly require all-in-focus inputs, rely on simulated data, or lack aperture control. The key to the solution is a two-stage Generative Refocusing framework: DeblurNet first recovers an all-in-focus image from diverse inputs, and BokehNet then synthesizes controllable bokeh. A novel semi-supervised training strategy combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide, which substantially improves performance on defocus deblurring, bokeh synthesis, and refocusing benchmarks.
Link: https://arxiv.org/abs/2512.16923
Authors: Chun-Wei Tuan Mu,Jia-Bin Huang,Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; University of Maryland, College Park
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL
Abstract:Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.
[CV-2] Next-Embedding Prediction Makes Strong Vision Learners
[Quick Read]: This paper targets the limited generalization and architectural complexity of visual self-supervised learning, asking whether the success of generative pretraining in natural language can be transferred to build a simpler, more scalable paradigm for visual representation learning. The key to the solution is Next-Embedding Predictive Autoregression (NEPA), which takes embeddings themselves as the prediction target: with causal masking and stop gradient, the model learns structured representations by predicting future patch embeddings, instead of relying on pixel reconstruction, discrete tokens, or contrastive losses. With this single generative objective and no task-specific heads, the method achieves 83.8%-85.3% top-1 accuracy on ImageNet-1K and transfers effectively to ADE20K semantic segmentation.
Link: https://arxiv.org/abs/2512.16922
Authors: Sihan Xu,Ziqiao Ma,Wenhao Chai,Xuweiyi Chen,Weiyang Jin,Joyce Chai,Saining Xie,Stella X. Yu
Affiliations: University of Michigan; New York University; Princeton University; University of Virginia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
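A minimal PyTorch sketch of the next-embedding objective as described: a causal Transformer over patch embeddings predicts the embedding at position t+1 with a stop-gradient target. The architecture sizes and the cosine loss are assumptions, not the paper's exact configuration:

```python
# Sketch of next-embedding prediction: a causal Transformer predicts the
# embedding of patch t+1 from patches <= t, with a stop-gradient target.
# Dimensions, depth, and the cosine loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N = 256, 196                      # embedding dim, patches per image
patch_embed = nn.Linear(768, D)      # stand-in patch embedder
layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)

def nepa_loss(patches):              # patches: (B, N, 768) raw patch features
    x = patch_embed(patches)                          # (B, N, D)
    causal = nn.Transformer.generate_square_subsequent_mask(N)
    h = backbone(x, mask=causal)                      # position t sees <= t
    pred = h[:, :-1]                                  # predicts embedding t+1
    target = x[:, 1:].detach()                        # stop gradient on target
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()

loss = nepa_loss(torch.randn(2, N, 768))
loss.backward()
print(loss.item())
```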
[CV-3] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
[Quick Read]: This paper addresses the lack of interpretability in current evaluation methods for multimodal large language models (MLLMs), which often fail to fully expose significant capability gaps between models. The key to the solution is AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence: an auditor is fine-tuned with reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models; once trained, the auditor uncovers diverse, interpretable failure exemplars that reveal model weaknesses and serve as annotation-free data for fine-tuning, effectively improving model performance.
Link: https://arxiv.org/abs/2512.16921
Authors: Qihao Liu,Chengzhi Mao,Yaojie Liu,Alan Yuille,Wen-Sheng Chu
Affiliations: Google; Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: project page: this https URL
Abstract:Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
[CV-4] EasyV2V: A High-quality Instruction-based Video Editing Framework
[Quick Read]: This paper tackles long-standing challenges in video editing, including consistency, control precision, and generalization, which remain pronounced despite rapid progress in image editing. The key to the solution is EasyV2V, a simple yet effective framework built on three designs. On the data side, it composes existing expert models with fast inverses to build diverse video pairs, lifts image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mines dense-captioned clips for semantically consistent video pairs, and adds transition supervision to teach how edits unfold over time. On the model side, the authors observe that pretrained text-to-video models already possess editing capability, so simple sequence concatenation for conditioning with light LoRA fine-tuning suffices for strong performance. For control, a single mask mechanism unifies spatiotemporal control, with optional reference images for detail fidelity. The method supports flexible input combinations (e.g., video+text, video+mask+text, video+mask+reference+text) and surpasses concurrent and commercial systems, achieving state-of-the-art video editing results.
Link: https://arxiv.org/abs/2512.16920
Authors: Jinjie Mai,Chaoyang Wang,Guocheng Gordon Qian,Willi Menapace,Sergey Tulyakov,Bernard Ghanem,Peter Wonka,Ashkan Mirzaei
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: this https URL
[CV-5] DVGT: Driving Visual Geometry Transformer
[Quick Read]: This paper addresses the lack of a driving-targeted dense geometry perception model for autonomous driving, in particular robust and flexible 3D scene reconstruction across scenarios and camera configurations. The key to the solution is the Driving Visual Geometry Transformer (DVGT), which extracts image features with a DINO backbone and infers geometric relations from unposed multi-view image sequences via alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention; multiple decoder heads then directly output a global dense 3D point map in the ego coordinate frame of the first frame together with per-frame ego poses. The model requires neither precise camera parameters nor post-alignment with external sensors, handling arbitrary camera configurations flexibly and predicting metric-scaled geometry directly.
Link: https://arxiv.org/abs/2512.16919
Authors: Sicheng Zuo,Zixun Xie,Wenzhao Zheng,Shaoqing Xu,Fang Li,Shengyin Jiang,Long Chen,Zhi-Xin Yang,Jiwen Lu
Affiliations: Tsinghua University; Xiaomi EV; University of Macau; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Code is available at this https URL
Abstract:Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at this https URL.
[CV-6] AdaTooler-V: Adaptive Tool-Use for Images and Videos
[Quick Read]: This paper addresses the blind tool invocation of current open-source multimodal large language models (MLLMs), which call vision tools even when they are unnecessary, increasing inference overhead and degrading performance. The key to the solution is AdaTooler-V, which performs adaptive tool use through precise invocation decisions: first, the AT-GRPO reinforcement learning algorithm adaptively rescales rewards based on each sample's Tool Benefit Score, encouraging the model to invoke tools only when they genuinely improve reasoning; second, two large-scale training datasets are constructed (AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards) covering single-image, multi-image, and video inputs, significantly improving accuracy and efficiency on complex visual reasoning tasks.
Link: https://arxiv.org/abs/2512.16918
Authors: Chaoyang Wang,Kaituo Feng,Dongyang Chen,Zhongyu Wang,Zhixun Li,Sicheng Gao,Meng Meng,Xu Zhou,Manyuan Zhang,Yuzhang Shang,Xiangyu Yue
Affiliations: MMLab, CUHK; THU; SJTU; DB Group, CUHK; UCF; Sangfor; JMU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
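The reward-rescaling idea can be sketched as follows, with an invented Tool Benefit Score definition (accuracy with tools minus accuracy without) and an invented scaling rule; neither is the paper's exact formulation:

```python
# Toy sketch of adaptive reward scaling by a Tool Benefit Score (TBS).
# The TBS definition and the scaling rule below are illustrative assumptions.
def tool_benefit_score(acc_with_tool, acc_without_tool):
    return acc_with_tool - acc_without_tool        # in [-1, 1]

def shaped_reward(correct, used_tool, tbs, penalty=0.2):
    base = 1.0 if correct else 0.0
    if used_tool:
        # Reward tool calls more when the tool genuinely helps (tbs > 0);
        # penalize unnecessary calls when it does not (tbs <= 0).
        return base * (1.0 + max(tbs, 0.0)) - (penalty if tbs <= 0 else 0.0)
    return base

print(shaped_reward(True, True, tbs=tool_benefit_score(0.9, 0.6)))  # helpful tool
print(shaped_reward(True, True, tbs=tool_benefit_score(0.6, 0.6)))  # pointless call
```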
[CV-7] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
[Quick Read]: This paper addresses the error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations that plague monocular-to-stereo video conversion built on the multi-stage Depth-Warp-Inpaint (DWI) pipeline. The key to the solution is UniStereo, the first large-scale unified stereo video conversion dataset covering both stereo formats to enable fair benchmarking and robust training, together with StereoPilot, an efficient feed-forward model that synthesizes the target view without explicit depth maps or iterative diffusion sampling; a learnable domain switcher and a cycle consistency loss let it adapt seamlessly to different stereo formats with improved consistency.
Link: https://arxiv.org/abs/2512.16915
Authors: Guibao Shen,Yihua Du,Wenhang Ge,Jing He,Chirui Chang,Donghao Zhou,Zhen Yang,Luozhou Wang,Xin Tao,Ying-Cong Chen
Affiliations: HKUST(GZ); HKUST; Kling Team, Kuaishou Technology; CUHK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: this https URL.
[CV-8] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
[Quick Read]: This paper addresses the limited generalization of metric depth prediction for panoramic images across scene distances, indoor/outdoor environments, and synthetic/real data domains. The core of the solution is a large-scale multi-source dataset with a three-stage pseudo-label curation pipeline that narrows domain gaps, combined with a DINOv3-Large backbone for strong pre-trained generalization, a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization, improving robustness to varying distances and enforcing multi-view geometric consistency; the resulting model achieves strong performance and zero-shot generalization on multiple benchmarks.
Link: https://arxiv.org/abs/2512.16913
Authors: Xin Lin,Meixi Song,Dizhe Zhang,Wenxuan Lu,Haodong Li,Bo Du,Ming-Hsuan Yang,Truong Nguyen,Lu Qi
Affiliations: Insta360 Research; University of California, San Diego; Wuhan University; University of California, Merced
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: this https URL
[CV-9] SFTok: Bridging the Performance Gap in Discrete Tokenizers
[Quick Read]: This paper addresses the insufficient reconstruction quality of discrete image tokenizers in high-resolution image generation, in particular the training-inference inconsistency of multi-step iterative reconstruction. The key to the solution is SFTok, a discrete tokenizer that combines self-forcing guided visual reconstruction with a debias-and-fitting training strategy, using a multi-step iterative mechanism to markedly improve reconstruction precision; compressing each image to only 64 tokens, it achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and strong class-to-image generation (gFID = 2.29).
Link: https://arxiv.org/abs/2512.16910
Authors: Qihang Rao,Borui Zhang,Wenzhao Zheng,Jie Zhou,Jiwen Lu
Affiliations: Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Under review. Code is available at this https URL
Abstract:Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
[CV-10] MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
[Quick Read]: This paper addresses the lack of a unified, semantically rich, and dynamically updated scene representation for mobile manipulators performing both navigation and manipulation in households. Existing approaches typically separate spatial and functional relations and ignore object state changes and task relevance, limiting scene understanding. The key to the solution is MomaGraph, a unified scene graph representation integrating spatial-functional relations with part-level interactive elements, accompanied by MomaGraph-Scenes, the first large-scale task-driven scene graph dataset, and MomaGraph-Bench, a systematic evaluation suite. Building on these, MomaGraph-R1, a 7B vision-language model trained with reinforcement learning, predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework, markedly improving prediction accuracy and cross-scene generalization.
Link: https://arxiv.org/abs/2512.16909
Authors: Yuanchen Ju,Yongyuan Liang,Yen-Jen Wang,Nandiraju Gireesh,Yuanliang Ju,Seungjae Lee,Qiao Gu,Elvis Hsieh,Furong Huang,Koushil Sreenath
Affiliations: University of California, Berkeley; University of Maryland, College Park; University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 25 pages, 10 figures. Project page: this https URL
Abstract:Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
[CV-11] SceneDiff: A Benchmark and Method for Multiview Object Change Detection
[Quick Read]: This paper addresses multiview object change detection, identifying objects that have been added, removed, or moved between captures of the same scene, a key capability for applications such as robotic tidying and construction progress and safety monitoring. The central challenge is that viewpoint changes can make unchanged objects falsely appear changed. The key to the solution is SceneDiff, a training-free method that leverages pretrained 3D reconstruction, segmentation, and image encoding models without dataset-specific fine-tuning: it aligns the captures in a shared 3D space, extracts object regions, and compares their spatial and semantic features for robust change detection, outperforming existing approaches by large margins (94% and 37.4% relative AP improvements on multi-view and two-view benchmarks, respectively).
Link: https://arxiv.org/abs/2512.16908
Authors: Yuqun Wu,Chih-hao Lin,Henry Che,Aditi Tiwari,Chuhang Zou,Shenlong Wang,Derek Hoiem
Affiliations: University of Illinois at Urbana-Champaign; Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
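The comparison step can be pictured as flagging 3D-aligned object regions whose pooled features disagree across the two captures; the cosine test and threshold below are assumptions, not the paper's exact criterion:

```python
# Sketch of the change test on 3D-aligned object regions: a region is
# flagged as changed when its feature similarity across the two captures
# falls below a threshold. Threshold and features are illustrative.
import torch
import torch.nn.functional as F

def changed_regions(feats_t0, feats_t1, sim_thresh=0.8):
    """feats_t0 / feats_t1: (R, D) pooled features of the same R aligned
    regions in capture 0 and capture 1."""
    sims = F.cosine_similarity(feats_t0, feats_t1, dim=-1)   # (R,)
    return (sims < sim_thresh).nonzero(as_tuple=True)[0], sims

idx, sims = changed_regions(torch.randn(5, 128), torch.randn(5, 128))
print(idx.tolist(), [round(s, 2) for s in sims.tolist()])
```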
[CV-12] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
[Quick Read]: This paper addresses two key limitations of existing 3D hand trajectory prediction: datasets decouple motion from semantic supervision, hindering understanding of interaction semantics, and models only weakly link reasoning to action generation, limiting accurate modeling of complex interaction stages. The solution comprises the EgoMAN dataset and the EgoMAN model: the former provides large-scale egocentric data with 219K 6DoF hand trajectories and 3M structured QA pairs supporting semantic, spatial, and motion reasoning; the latter connects vision-language reasoning with motion generation through a trajectory-token interface and uses a progressive training strategy to align reasoning with motion dynamics, yielding accurate, stage-aware 3D hand trajectory prediction that generalizes across real-world scenes.
Link: https://arxiv.org/abs/2512.16907
Authors: Mingfei Chen,Yifan Wang,Zhengqin Li,Homanga Bharadhwaj,Yujin Chen,Chuan Qin,Ziyi Kou,Yuan Tian,Eric Whitmire,Rajinder Sodhi,Hrvoje Benko,Eli Shlizerman,Yue Liu
Affiliations: Meta; University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project website: this https URL
Abstract:Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
[CV-13] VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
[Quick Read]: This paper addresses the poor generalization of current diffusion-based instruction video editing when faced with diverse, complex real-world natural-language instructions, a limitation rooted in training on paired data of simple editing operations. The key to the solution is VIVA, a scalable framework with two core innovations: a VLM-based instructor encodes the textual instruction, the first frame of the source video, and an optional reference image into visually grounded instruction representations with fine-grained spatial and semantic context, strengthening the diffusion transformer backbone's instruction understanding; and a post-training stage, Edit-GRPO, adapts Group Relative Policy Optimization to video editing, directly optimizing instruction fidelity, content preservation, and aesthetic quality via relative rewards, markedly improving the diversity and controllability of edits.
Link: https://arxiv.org/abs/2512.16906
Authors: Xiaoyan Cong,Haotian Yang,Angtian Wang,Yizhi Wang,Yiding Yang,Canyu Zhang,Chongyang Ma
Affiliations: Brown University; Intelligent Creation, ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: this https URL
[CV-14] Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
[Quick Read]: This paper addresses the degraded visual fidelity, unstable training, and inefficient computation caused by low-quality or redundant data in text-to-image (T2I) model training. Existing approaches rely on costly manual curation or heuristic scores over single-dimensional features and struggle to identify high-value samples efficiently. The key to the solution is Alchemist, an automatic, scalable, meta-gradient-based data selection framework with two core stages: a lightweight rater is trained to estimate each sample's influence on model optimization from gradient information with multi-granularity perception, and a Shift-G sampling strategy then selects informative subsets for efficient training. The method is the first to bring meta-learning-style data selection to the image modality, substantially improving data efficiency and model performance.
Link: https://arxiv.org/abs/2512.16905
Authors: Kaixin Ding,Yang Zhou,Xi Chen,Miao Yang,Jiarong Ou,Rui Chen,Xin Tao,Hengshuang Zhao
Affiliations: The University of Hong Kong; South China University of Technology; Kling Team, Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL
Abstract:Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning based methods have been explored in LLMs, there is no adaptation for image modalities. To this end, we propose Alchemist, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample’s influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-G sampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
[CV-15] FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
[Quick Read]: This paper addresses the difficulty of maintaining identity (ID) consistency in diffusion-based long-video portrait animation. The key to the solution is FlashPortrait, an end-to-end video diffusion transformer with several innovations: it first extracts identity-agnostic facial expression features with a pretrained extractor and introduces a Normalized Facial Expression Block that normalizes the features with their respective means and variances, improving identity stability in facial modeling; at inference, a dynamic sliding window with weighted blending in overlapping regions ensures smooth transitions and identity consistency in long videos; moreover, based on the latent variation rate at particular timesteps and the derivative magnitude ratio across diffusion layers, FlashPortrait uses higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, skipping several denoising steps and achieving up to 6x inference acceleration.
Link: https://arxiv.org/abs/2512.16900
Authors: Shuyuan Tu,Yueming Pan,Yinming Huang,Xintong Han,Zhen Xing,Qi Dai,Kai Qiu,Chong Luo,Zuxuan Wu
Affiliations: Fudan University; Microsoft Research Asia; Xi’an Jiaotong University; Tencent Inc.; Tongyi Lab, Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
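The step-skipping idea amounts to extrapolating the denoising trajectory with finite-difference derivatives; the sketch below uses a generic second-order Taylor step, not the paper's exact predictor:

```python
# Generic sketch of skipping denoising steps by extrapolating latents with
# finite-difference (higher-order) derivatives. The second-order Taylor
# form and the step count are illustrative assumptions.
import torch

def extrapolate_latent(z_prev2, z_prev1, z_curr, skip=3):
    """Given latents at three consecutive timesteps, predict the latent
    `skip` steps ahead with a second-order Taylor expansion."""
    d1 = z_curr - z_prev1                  # first-order difference
    d2 = z_curr - 2 * z_prev1 + z_prev2    # second-order difference
    return z_curr + skip * d1 + 0.5 * (skip ** 2) * d2

z = [torch.randn(1, 4, 64, 64) for _ in range(3)]
z_future = extrapolate_latent(*z, skip=3)
print(z_future.shape)
```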
[CV-16] Sceniris: A Fast Procedural Scene Generation Framework
[Quick Read]: This paper addresses the output-throughput bottleneck of existing procedural scene generation methods, which limits the efficient creation of large-scale, collision-free scene datasets. The core of the solution is the Sceniris framework, which introduces batch sampling and fast cuRobo-based collision checking, achieving at least a 234x speed-up over the prior Scene Synthesizer; Sceniris also expands the supported object-wise spatial relationships to meet diverse scene requirements and optionally integrates a robot reachability check, ensuring generated scenes are manipulation-feasible for robot task training and Physical AI development.
Link: https://arxiv.org/abs/2512.16896
Authors: Jinghuan Shang,Harsh Patel,Ran Gong,Karl Schmeckpeper
Affiliations: Robotics and AI Institute; University of Waterloo
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Code is available at this https URL
Abstract:Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at this https URL
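At its core, this kind of fast placement generation is batched sampling plus vectorized collision rejection; below is a toy 2D version with axis-aligned boxes (real systems would check 3D meshes, e.g. with cuRobo, which is not modeled here):

```python
# Toy sketch of batched placement sampling with vectorized collision
# rejection (2D axis-aligned boxes). Real systems check 3D geometry with a
# library such as cuRobo; none of that is modeled here.
import numpy as np

def sample_placements(n, table=(0.0, 0.0, 1.0, 0.6), half=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(table[0] + half, table[2] - half, size=n)
    y = rng.uniform(table[1] + half, table[3] - half, size=n)
    return np.stack([x, y], axis=1)

def pairwise_collision_free(pts, half=0.05):
    # Boxes of half-extent `half` overlap iff both |dx| and |dy| < 2*half.
    d = np.abs(pts[:, None, :] - pts[None, :, :])
    hit = (d < 2 * half).all(-1)
    np.fill_diagonal(hit, False)
    return ~hit.any(axis=1)

pts = sample_placements(256)
keep = pairwise_collision_free(pts)
print(f"{keep.sum()} of {len(pts)} sampled placements are collision-free")
```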
[CV-17] Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation
[Quick Read]: This paper addresses the trade-off among 3D consistency, speed, and expression detail in current facial animation methods: 2D diffusion models produce high-quality animation but lack 3D consistency and are slow at inference, whereas 3D-aware feed-forward methods are 3D-consistent and fast but lose rich expression detail. The key to the solution is knowledge distillation, transferring the rich expression information of a 2D diffusion model into a lightweight feed-forward encoder, with an animation representation decoupled from the 3D representation that learns motion implicitly from data; an efficient lightweight local fusion strategy replaces computationally intensive global fusion modules (e.g., multiple attention layers), achieving animation quality comparable to the state of the art at high frame rates (107.31 FPS).
Link: https://arxiv.org/abs/2512.16893
Authors: Kaiwen Jiang,Xueting Li,Seonwook Park,Ravi Ramamoorthi,Shalini De Mello,Koki Nagano
Affiliations: University of California, San Diego; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website is this https URL
Abstract:Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods – built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting – ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face’s 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is this https URL
[CV-18] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
[Quick Read]: This paper addresses three challenges in deploying video large language models (VLLMs) for video recommendation: (1) decode-only generation incurs high latency in sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) outputs constrained to text discard the fine-grained visual information crucial for downstream vision tasks. The core of the solution is LinkedOut, a representation that extracts semantically grounded, knowledge-aware tokens directly from raw video frames with a VLLM, guided by promptable queries and optional auxiliary modalities, together with a cross-layer knowledge-fusion MoE that dynamically selects VLLM features at the appropriate level of abstraction, enabling a low-latency recommender that supports multi-video histories and preserves pixel-level detail. This is the first VLLM-based video recommendation method operating on raw frames without handcrafted labels, with notable gains in performance and interpretability.
Link: https://arxiv.org/abs/2512.16891
Authors: Haichao Zhang,Yao Lu,Lichen Wang,Yunzhe Li,Daiwei Chen,Yunpeng Xu,Yun Fu
Affiliations: Northeastern University; LinkedIn; University of Wisconsin–Madison
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
[CV-19] M-PhyGs: Multi-Material Object Dynamics from Video
[Quick Read]: This paper addresses accurate estimation of physical material properties and parameters from video for complex multi-material natural objects such as flowers, which feature heterogeneous material composition, intricate geometry, and real-world interaction dynamics, whereas existing methods are limited to single materials, pre-learned dynamics, or simplified topologies. The key to the solution is Multi-material Physical Gaussians (M-PhyGs), which jointly segments an object into regions of similar material and recovers each region's continuum-mechanics parameters (e.g., elasticity, density) while accounting for gravity; newly introduced cascaded 3D and 2D losses together with temporal mini-batching make this estimation both efficient and accurate.
Link: https://arxiv.org/abs/2512.16885
Authors: Norika Wada,Kohei Yamashita,Ryo Kawahara,Ko Nishino
Affiliations: Kyoto University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.
[CV-20] Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation
[Quick Read]: This paper addresses surgical instrument segmentation in endoscopic videos, which is challenged by frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. Existing SAM3-based approaches are limited in surgical scenes by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusion. The key to the solution is ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3 with three core components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory that stores pre-occlusion frames; (ii) a piecewise interpolation scheme that expands the effective memory capacity; and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together these components mitigate error accumulation and improve post-occlusion recovery, yielding absolute mcIoU gains of about 7% and 16% over vanilla SAM3 on EndoVis17 and EndoVis18 in a zero-shot setting, outperforming even prior training-based approaches.
Link: https://arxiv.org/abs/2512.16880
Authors: Valay Bundele,Mehran Hosseinzadeh,Hendrik P.A. Lensch
Affiliations: University of Tuebingen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: this https URL.
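The temporal-voting step can be sketched as a majority vote over per-frame nearest-prototype matches; the feature source, similarity measure, and window length are assumptions:

```python
# Sketch of temporal voting for post-occlusion re-identification: match the
# re-appeared candidate to stored identity prototypes in each frame, then
# take a majority vote over the window. All specifics are illustrative.
from collections import Counter
import torch
import torch.nn.functional as F

def vote_identity(frame_feats, prototypes):
    """frame_feats: (T, D) candidate features over T frames;
    prototypes: dict identity -> (D,) pre-occlusion feature."""
    ids = list(prototypes)
    proto = torch.stack([prototypes[i] for i in ids])            # (K, D)
    sims = F.cosine_similarity(frame_feats[:, None], proto[None], dim=-1)
    per_frame = [ids[j] for j in sims.argmax(dim=1).tolist()]    # T votes
    return Counter(per_frame).most_common(1)[0][0]

protos = {"grasper": torch.randn(128), "scissors": torch.randn(128)}
print(vote_identity(torch.randn(5, 128), protos))
```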
[CV-21] raining Together Diagnosing Better: Federated Learning for Collagen VI-Related Dystrophies
[Quick Read]: This paper addresses the limited performance of machine learning (ML) models for diagnosing rare diseases such as collagen VI-related dystrophies (COL6-RD), caused by scarce and fragmented data. Conventional attempts to expand samples across institutions and countries face privacy, regulatory, and logistical obstacles, making well-generalizing models hard to build. The key to the solution is federated learning (FL), which trains a model collaboratively across multiple international institutions while keeping patient data local and private. Experiments show the approach substantially improves performance (F1-score of 0.82) over single-institution models (0.57-0.75), demonstrating FL's effectiveness in improving the accuracy and generalizability of rare-disease diagnosis.
Link: https://arxiv.org/abs/2512.16876
Authors: Astrid Brull,Sara Aguti,Véronique Bolduc,Ying Hu,Daniel M. Jimenez-Gutierrez,Enrique Zuazua,Joaquin Del-Rio,Oleksii Sliusarenko,Haiyan Zhou,Francesco Muntoni,Carsten G. Bönnemann,Xabi Uribe-Etxebarria
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. The Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative using the this http URL FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57-0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
[CV-22] Pixel Seal: Adversarial-only training for invisible image and video watermarking KR
[Quick Read]: This paper addresses the core difficulty of balancing robustness and imperceptibility in image and video watermarking. The key to the solution is threefold: (1) an adversarial-only training paradigm that abandons unreliable pixel-wise imperceptibility losses such as MSE and LPIPS, which fail to mimic human perception and cause visible watermark artifacts; (2) a three-stage training schedule that decouples the optimization of robustness and imperceptibility, stabilizing convergence without exhaustive hyperparameter tuning; and (3) JND-based attenuation with training-time inference simulation to eliminate upscaling artifacts at high resolution, enabling reliable provenance for image and video content in real-world settings.
Link: https://arxiv.org/abs/2512.16874
Authors: Tomáš Souček,Pierre Fernandez,Hady Elsahar,Sylvestre-Alvise Rebuffi,Valeriu Lacatusu,Tuan Tran,Tom Sander,Alexandre Mourachko
Affiliations: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Code and model available at this https URL
Abstract:Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
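JND-based attenuation, in the abstract's sense, scales the watermark residual by a per-pixel visibility map; the local-variance proxy below is a crude stand-in for a real JND model, purely for illustration:

```python
# Toy sketch of JND-based attenuation: scale the watermark residual by a
# per-pixel map so it is stronger where it is less visible. The local-
# variance "JND" proxy is a crude stand-in for a real JND model.
import torch
import torch.nn.functional as F

def jnd_proxy(img, k=7):
    # Local standard deviation as a texture-masking proxy, scaled to [0, 1].
    pad = k // 2
    mean = F.avg_pool2d(img, k, stride=1, padding=pad)
    var = F.avg_pool2d(img * img, k, stride=1, padding=pad) - mean ** 2
    std = var.clamp_min(0).sqrt()
    return std / (std.amax(dim=(-2, -1), keepdim=True) + 1e-8)

def apply_watermark(cover, residual, strength=0.05):
    return (cover + strength * jnd_proxy(cover) * residual).clamp(0, 1)

cover = torch.rand(1, 3, 256, 256)
residual = torch.randn(1, 3, 256, 256).tanh()
print(apply_watermark(cover, residual).shape)
```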
[CV-23] RePlan: Reasoning -guided Region Planning for Complex Instruction-based Image Editing
[Quick Read]: This paper addresses the Instruction-Visual Complexity (IV-Complexity) problem, where existing image editing models degrade markedly when intricate natural-language instructions meet cluttered or ambiguous visual scenes. The core of the solution is RePlan (Region-aligned Planning), a plan-then-execute framework: a vision-language planner decomposes the instruction via step-by-step reasoning and explicitly grounds the semantics to target regions, after which a diffusion editor applies a training-free attention-region injection mechanism to perform precise, parallel multi-region edits without iterative inpainting. On the IV-Edit benchmark, the method surpasses strong baselines trained on far larger datasets in regional precision and overall fidelity.
Link: https://arxiv.org/abs/2512.16864
Authors: Tianyuan Qu,Lei Ke,Xiaohang Zhan,Longxiang Tang,Yuqi Liu,Bohao Peng,Bei Yu,Dong Yu,Jiaya Jia
Affiliations: Tencent AI Lab; CUHK; HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Precise region control and planning for instruction-based image editing. Our project page: this https URL
Abstract:Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: this https URL
[CV-24] GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
[Quick Read]: This paper addresses benchmark drift in the evaluation of text-to-image (T2I) models: benchmarks such as GenEval, though well aligned with human judgment at release, gradually lose validity as model capabilities improve, reaching absolute errors of up to 17.7% and indicating saturation. The key to the solution is GenEval 2, a new benchmark with broader coverage of primitive visual concepts and higher compositionality that is more challenging for current models, together with Soft-TIFA, an evaluation method that combines judgments over visual primitives to improve alignment with human judgment and reduce the risk of future drift, offering better long-term stability than holistic judges such as VQAScore.
Link: https://arxiv.org/abs/2512.16853
Authors: Amita Kamath,Kai-Wei Chang,Ranjay Krishna,Luke Zettlemoyer,Yushi Hu,Marjan Ghazvininejad
Affiliations: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time – resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
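The aggregation behind a primitive-decomposed score can be sketched as averaging per-primitive judge probabilities rather than asking for one holistic score; the judge below is a stub, and the plain mean is an illustrative assumption:

```python
# Sketch of a primitive-decomposed score: decompose a prompt into visual
# primitives, score each with a VQA-style judge, and average. The judge is
# a stub and the aggregation (plain mean) is an illustrative assumption.
def soft_score(image, primitives, judge):
    """primitives: list of (question, expected_answer) pairs derived from
    the prompt, e.g. ('how many dogs?', 'two'). judge returns P(answer)."""
    probs = [judge(image, q, a) for q, a in primitives]
    return sum(probs) / len(probs)

def stub_judge(image, question, answer):
    # Placeholder probabilities standing in for a real VQA judge.
    return 0.9 if "dog" in question else 0.4

prims = [("is there a dog?", "yes"), ("is the ball red?", "yes")]
print(soft_score("img.png", prims, stub_judge))
```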
[CV-25] OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction
[Quick Read]: This paper addresses the absence of precise modeling of tactile interaction in egocentric perception, in particular the lack of in-the-wild synchronized video, touch, and hand-pose data, which has constrained multimodal perception and embodied learning. The key to the solution is OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations; retrieval and classification benchmarks built on it show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries, supplying data and evaluation foundations for embodied intelligence and contact-rich robotic manipulation.
Link: https://arxiv.org/abs/2512.16842
Authors: Yuxin Ray Song,Jinzhou Li,Rao Fu,Devin Murphy,Kaichen Zhou,Rishi Shiv,Yaqi Li,Haoyu Xiong,Crystal Elaine Owens,Yilun Du,Yiyue Luo,Xianyi Cheng,Antonio Torralba,Wojciech Matusik,Paul Pu Liang
Affiliations: MIT; Duke University; Brown University; University of Washington; Harvard University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: this https URL
Abstract:The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
zh
[CV-26] Radiology Report Generation with Layer-Wise Anatomical Attention
【速读】:该论文旨在解决当前生成式AI(Generative AI)在放射学报告自动生成任务中对大规模多模态训练数据、临床元数据及多视角影像高度依赖的问题,从而限制了其在资源有限场景下的应用。其解决方案的关键在于提出一种紧凑的图像到文本架构,仅需单张正位胸片即可生成“发现”(Findings)部分报告:通过冻结的自蒸馏无标签视觉Transformer(DINOv3 ViT)编码器与增强层间解剖注意力机制的GPT-2解码器相结合,并引入基于肺部和心脏分割掩膜的分层高斯平滑策略,在不增加可训练参数的前提下引导注意力聚焦于临床相关区域,显著提升空间定位准确性和结构连贯性。
链接: https://arxiv.org/abs/2512.16841
作者: Emmanuel D. Muñiz-De-León,Jorge A. Rosales-de-Golferichs,Ana S. Muñoz-Rodríguez,Alejandro I. Trejo-Castro,Eduardo de Avila-Armenta,Antonio Martínez-Torteya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures
Abstract:Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 - 0.238) and Micro-F1 by 146% (0.137 - 0.337), while broader performance across 14 observations improved by 86% (0.170 - 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: this https URL.
zh
[CV-27] Next-Generation License Plate Detection and Recognition System using YOLOv8
【速读】:该论文旨在解决智能交通系统(Intelligent Transportation Systems, ITS)中车牌识别(License Plate Recognition, LPR)与字符识别任务在复杂环境下的实时准确性问题。其关键解决方案是采用YOLOv8系列模型构建优化的检测与识别流水线:使用YOLOv8 Nano进行车牌定位,实现高精度(Precision=0.964)和mAP50(0.918);利用YOLOv8 Small完成字符识别,达到Precision=0.92、mAP50=0.91的性能表现,并结合基于x轴位置的自定义字符排序方法,有效提升字符序列化准确率。该方案兼顾计算效率与识别精度,为边缘设备部署提供了可靠的技术基础。
链接: https://arxiv.org/abs/2512.16826
作者: Arslan Amin,Rafia Mumtaz,Muhammad Jawad Bashir,Syed Mohammad Hassan Zaidi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures. Accepted and published in the 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET)
Abstract:In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
zh
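摘要中“基于 x 轴位置的字符排序”可以用几行 Python 直观示意;检测结果的数据结构为笔者简化假设,实际需先从 YOLOv8 的输出框中提取中心坐标:

```python
# 示意:按检测框中心的 x 坐标从左到右排序,拼出车牌字符串。
def sequence_characters(detections):
    """detections: [(x_center, char), ...] -> 按从左到右顺序拼接的字符串"""
    return "".join(ch for _, ch in sorted(detections, key=lambda d: d[0]))

dets = [(120.5, "B"), (34.2, "A"), (208.9, "1"), (165.0, "C")]
print(sequence_characters(dets))  # 输出 "ABC1"
```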
[CV-28] DenseBEV: Transforming BEV Grid Cells into 3D Objects WACV2026
【速读】:该论文旨在解决多摄像头三维目标检测中传统基于鸟瞰图(Bird’s-Eye-View, BEV)的Transformer模型因随机查询(random queries)作为锚点导致效率低下、难以充分利用BEV特征空间的问题。其核心解决方案是提出一种端到端的密集BEV查询生成方法——将BEV特征网格中的每个单元直接作为对象查询(object query),从而更直观且高效地利用BEV空间的稠密结构。该方法通过引入基于BEV的非极大值抑制(Non-Maximum Suppression, NMS)来缓解大规模查询带来的注意力计算膨胀问题,同时保留梯度流以实现无需后处理的高效训练;此外,还结合时间信息建模机制,利用历史检测结果进一步提升性能,尤其在小目标(如行人)检测上表现显著增强。
链接: https://arxiv.org/abs/2512.16818
作者: Marius Dähling,Sebastian Krebs,J. Marius Zöllner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures, accepted by WACV 2026
Abstract:In current research, Bird’s-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at this https URL.
zh
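下面用一个最小的 Python 片段示意摘要中 BEV-based NMS 的基本流程;此处是普通的轴对齐 IoU 贪心抑制,论文中“仅让梯度流经未被抑制对象”的训练细节未包含,仅供理解:

```python
import numpy as np

def bev_nms(boxes, scores, iou_thr=0.5):
    """boxes: (N, 4) 轴对齐 BEV 框 [x1, y1, x2, y2];scores: (N,) 置信度。
    返回保留框的索引列表(示意实现,非论文代码)。"""
    order = scores.argsort()[::-1]          # 按置信度从高到低处理
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # 与剩余框的交并比
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou < iou_thr]         # 抑制重叠过大的框
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(bev_nms(boxes, scores))  # [0, 2]:第二个框与第一个高度重叠而被抑制
```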
[CV-29] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中普遍存在的局限性,即其主要依赖反应式决策和二维感知,导致在需要精确三维(3D)推理的任务中表现不可靠。解决方案的关键在于提出GeoPredict框架,该框架通过引入两个预测模块增强连续动作策略:一是轨迹级模块,用于编码运动历史并预测机器人臂的多步3D关键点轨迹;二是预测性3D高斯几何模块,能够沿未来关键点轨迹进行跟踪引导的细化,从而预测工作空间的几何结构。这两个模块仅在训练阶段提供基于深度图渲染的监督信号,推理时则通过轻量级查询标记实现高效扩展,无需任何3D解码过程,显著提升了模型在几何密集型和空间复杂场景中的性能表现。
链接: https://arxiv.org/abs/2512.16811
作者: Jingjing Qian,Boyao Han,Chen Shi,Lei Xiao,Long Yang,Shaoshuai Shi,Li Jiang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Hunan University (湖南大学); LiAuto Inc. (理想汽车); Voyager Research, Didi Chuxing (滴滴出行 Voyager 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
zh
[CV-30] KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals AAAI2026
【速读】:该论文旨在解决基于头戴式显示器(Head-Mounted Display, HMD)获取的稀疏信号进行全身姿态重建时面临的挑战,即如何在保证高精度和时间一致性的同时实现高效计算。现有方法往往因分别建模空间与时间依赖性而效率低下,或难以兼顾准确性与运动平滑性。其解决方案的关键在于提出一种新颖的运动学引导的状态空间模型(KineST),通过两个核心创新实现:一是将状态空间对偶框架中的扫描策略重构为运动学引导的双向扫描,嵌入关节运动学先验以更好地捕捉复杂关节关系;二是采用混合时空表示学习方法,紧密耦合空间与时间上下文,从而在轻量化架构中实现精度与平滑性的平衡。此外,引入几何角速度损失函数,进一步约束旋转变化的物理合理性,提升运动稳定性。
链接: https://arxiv.org/abs/2512.16791
作者: Shuting Zhao,Zeyu Xiao,Xinrong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026
Abstract:Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: this https URL
zh
[CV-31] R3ST: A Synthetic 3D Dataset With Realistic Trajectories
【速读】:该论文旨在解决合成数据集在交通场景中车辆轨迹真实性不足的问题,即现有合成数据通常依赖AI模型或规则系统生成轨迹,导致缺乏真实世界中人类驾驶行为的复杂性和多样性。解决方案的关键在于提出R3ST(Realistic 3D Synthetic Trajectories)数据集,通过构建一个合成的3D环境并融合来自SinD(鸟瞰视角无人机影像数据集)的真实车辆轨迹,从而实现既具备精确多模态标注又保留真实人类驾驶轨迹特征的合成数据集,有效弥合了合成数据与现实轨迹之间的差距,推动道路车辆轨迹预测研究的发展。
链接: https://arxiv.org/abs/2512.16784
作者: Simone Teglia,Claudia Melis Tonti,Francesco Pro,Leonardo Russo,Andrea Alfarano,Leonardo Pentassuglia,Irene Amerini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors, however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird’s-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.
zh
[CV-32] Kling-Omni Technical Report
【速读】:该论文旨在解决当前视频生成、编辑与智能推理任务之间功能割裂的问题,即传统方法通常采用离散的流水线架构,难以实现多模态输入下的统一建模与高质量输出。解决方案的关键在于提出Kling-Omni框架,该框架以端到端的方式整合了视频生成、编辑和推理能力,通过构建统一的多模态表示来处理文本指令、参考图像和视频上下文等多种输入形式,从而实现电影级质量的视频内容生成与智能交互。其核心技术优势包括大规模预训练策略、高效的推理基础设施优化以及一个支撑多模态视频创作的综合数据系统。
链接: https://arxiv.org/abs/2512.16776
作者: Kling Team:Jialu Chen,Yuanzheng Ci,Xiangyu Du,Zipeng Feng,Kun Gai,Sainan Guo,Feng Han,Jingbin He,Kang He,Xiao Hu,Xiaohua Hu,Boyuan Jiang,Fangyuan Kong,Hang Li,Jie Li,Qingyu Li,Shen Li,Xiaohan Li,Yan Li,Jiajun Liang,Borui Liao,Yiqiao Liao,Weihong Lin,Quande Liu,Xiaokun Liu,Yilun Liu,Yuliang Liu,Shun Lu,Hangyu Mao,Yunyao Mao,Haodong Ouyang,Wenyu Qin,Wanqi Shi,Xiaoyu Shi,Lianghao Su,Haozhi Sun,Peiqin Sun,Pengfei Wan,Chao Wang,Chenyu Wang,Meng Wang,Qiulin Wang,Runqi Wang,Xintao Wang,Xuebo Wang,Zekun Wang,Min Wei,Tiancheng Wen,Guohao Wu,Xiaoshi Wu,Zhenhua Wu,Da Xie,Yingtong Xiong,Yulong Xu,Sile Yang,Zikang Yang,Weicai Ye,Ziyang Yuan,Shenglong Zhang,Shuaiyu Zhang,Yuanxing Zhang,Yufan Zhang,Wenzheng Zhao,Ruiliang Zhou,Yan Zhou,Guosheng Zhu,Yongjie Zhu
机构: Kuaishou Technology(快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Kling-Omni Technical Report
Abstract:We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.
zh
[CV-33] FlowDet: Unifying Object Detection and Generative Transport Flows
【速读】:该论文旨在解决目标检测任务中传统扩散模型(Diffusion-based)方法在推理效率和性能扩展性方面的局限性,尤其是其依赖弯曲的随机传输路径导致的收敛速度慢、计算开销大等问题。解决方案的关键在于提出FlowDet,首次将条件流匹配(Conditional Flow Matching)技术引入目标检测领域,通过学习更简单、更直的生成传输路径来替代扩散过程中的曲率路径,从而实现检测性能随推理步数增加而更快提升。这一重构使得模型能够在不重新训练的情况下灵活调整预测框数量与推理步数,并在多个数据集(如COCO和LVIS)上显著优于基于扩散的目标检测系统及非生成式基线方法,尤其在召回约束场景下展现出更强的生成传输优势,AP提升最高达+3.6%,稀有类别AP提升+4.2%。
链接: https://arxiv.org/abs/2512.16771
作者: Enis Baty,C. P. Bridges,Simon Hadfield
机构: CVSSP, University of Surrey (计算机视觉与智能系统研究中心,萨里大学); SSC, University of Surrey (软件与计算科学系,萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP _rare over DiffusionDet on the COCO and LVIS datasets, respectively.
zh
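FlowDet 依赖的条件流匹配(Conditional Flow Matching)训练目标,可以用直线传输路径(rectified flow 形式)的最小示意来说明:在噪声端点 x0 与真值框 x1 之间做线性插值,并回归恒定速度 x1 - x0。以下 PyTorch 片段中的模型接口与张量形状均为笔者假设,FlowDet 的具体参数化以原文为准:

```python
import torch

def flow_matching_loss(model, x1, cond):
    """x1: (B, N, 4) 真值框;cond: 图像特征条件;model(xt, t, cond) 预测速度场。"""
    x0 = torch.randn_like(x1)                    # 噪声端点
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                   # 直线插值路径
    v_target = x1 - x0                           # 直线路径对应的恒定速度
    v_pred = model(xt, t.squeeze(), cond)
    return torch.mean((v_pred - v_target) ** 2)  # 回归速度场
```

相比扩散的弯曲随机路径,这种直线路径使少步采样更接近真解,这正是摘要中“检测性能随推理步数更快提升”的直观来源。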
[CV-34] Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation
【速读】:该论文旨在解决3D角色姿态生成中因皮肤权重预测不准、拓扑缺陷和姿态一致性差等问题导致的鲁棒性和泛化能力不足的问题。其解决方案的关键在于提出了一种名为Make-It-Poseable的前馈式框架,将角色姿态生成重构为潜在空间中的变换问题:通过一个潜在空间姿态变换器(latent posing transformer)基于骨骼运动操作形状令牌(shape tokens),并结合密集姿态表示实现精确控制;同时引入潜在空间监督策略与自适应补全模块,以保障几何保真度并支持拓扑变化,从而显著提升姿态质量并拓展至部件替换等3D编辑任务。
链接: https://arxiv.org/abs/2512.16767
作者: Zhiyang Guo,Ori Zhang,Jax Xiang,Alan Zhao,Wengang Zhou,Houqiang Li
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.
zh
[CV-35] reeNet: A Light Weight Model for Low Bitrate Image Compression
【速读】:该论文旨在解决学习-based图像压缩技术中计算复杂度高导致难以广泛应用的问题。其核心解决方案是提出TreeNet模型,该模型采用二叉树结构的编码器-解码器架构,以实现高效的特征表示与重建;关键创新在于引入注意力机制的特征融合模块,有效整合多分支特征,从而在保持高质量重建的同时显著降低模型复杂度——实验表明,在低比特率下相比JPEG AI平均BD-rate提升4.83%,且模型复杂度降低87.82%。
链接: https://arxiv.org/abs/2512.16743
作者: Mahadev Prasad Panda,Purnachandra Rao Makkena,Srivatsa Prativadibhayankaram,Siegfried Fößel,André Kaup
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.
zh
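摘要中衡量压缩性能的 BD-rate(Bjøntegaard Delta-Rate)指标,经典做法是对 log 码率-PSNR 曲线做三次多项式拟合,再在公共质量区间上积分比较平均码率差。以下为一个最小的 NumPy 实现示意(每条曲线需至少 4 个 RD 点;非论文官方评测脚本):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """返回 test 相对 anchor 的平均码率差(%),负值表示 test 更省码率。"""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)      # log 码率对 PSNR 的三次拟合
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # 公共 PSNR 区间
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(p_a), np.polyint(p_t)
    int_a = np.polyval(ia, hi) - np.polyval(ia, lo)
    int_t = np.polyval(it, hi) - np.polyval(it, lo)
    avg_diff = (int_t - int_a) / (hi - lo)      # 平均 log 码率差
    return (np.exp(avg_diff) - 1) * 100.0

rate_a, psnr_a = [100, 200, 400, 800], [30.0, 33.0, 36.0, 39.0]
rate_t, psnr_t = [90, 180, 360, 720], [30.2, 33.2, 36.2, 39.2]
print(f"BD-rate = {bd_rate(rate_a, psnr_a, rate_t, psnr_t):.2f}%")  # 负值:test 更省码率
```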
[CV-36] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
【速读】:该论文旨在解决遥感(Remote Sensing, RS)语义分割任务中因标注数据稀缺而导致的模型性能受限问题,特别是合成数据在控制语义掩码(semantic mask)精度和采样质量不确定性方面的挑战。解决方案的关键在于提出一种面向任务的数据合成框架(Task-Oriented Data Synthesis, TODSynth),其核心包括:1)基于多模态扩散Transformer(Multimodal Diffusion Transformer, MM-DiT)的统一三重注意力机制,实现文本-图像-掩码联合控制;2)引入一种由任务反馈引导的即插即用采样策略,并结合控制-校正流匹配(Control-Rectify Flow Matching, CRFM)方法,在生成早期高可塑阶段动态调整采样方向以降低不稳定性,从而提升合成数据与下游分割任务的一致性。实验表明,该方法在少样本和复杂场景下显著优于现有可控生成方法。
链接: https://arxiv.org/abs/2512.16740
作者: Yunkai Yang,Yudong Zhang,Kunquan Zhang,Jinxiao Zhang,Xinying Chen,Haohuan Fu,Runmin Dong
机构: Sun Yat-Sen University (中山大学); Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.
zh
[CV-37] OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
【速读】:该论文旨在解决虚拟现实(VR)/增强现实(AR)交互中基于手部骨骼的在线微手势识别问题,其核心挑战在于公开数据集稀缺以及任务特定算法的局限性。微手势具有细微运动模式,导致构建具备精确骨骼坐标和帧级标注的数据集极为困难。为应对这一问题,作者提出了一种多视角自监督数据生成管道,结合启发式规则与专家精修实现半自动标注,并在此基础上构建了首个大规模公开基准OMG-Bench,包含40个细粒度手势类别、13,948个实例及1,272个序列,涵盖快速动态性和连续执行特性。解决方案的关键创新在于Hierarchical Memory-Augmented Transformer (HMATr)——一种端到端框架,通过分层记忆库存储帧级细节与窗口级语义信息以保留历史上下文,并利用可学习的位置感知查询从记忆中初始化,隐式编码手势位置与语义信息,从而统一手势检测与分类任务。实验表明,HMATr在检测率上比现有最优方法提升7.6%,为在线微手势识别建立了强有力的基线。
链接: https://arxiv.org/abs/2512.16727
作者: Haochen Chang,Pengfei Ren,Buyuan Zhang,Da Li,Tianhao Han,Haoyang Zhang,Liang Xie,Hongbo Chen,Erwei Yin
机构: Sun Yat-sen University (中山大学); Beijing University of Posts and Telecommunications (北京邮电大学); Shanghai Jiao Tong University (上海交通大学); Nankai University (南开大学); Academy of Military Sciences (军事科学院); Tianjin Artificial Intelligence Innovation Center (天津人工智能创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Project page: this https URL
Abstract:Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: this https URL
zh
[CV-38] VERM: Leverag ing Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
【速读】:该论文旨在解决机器人在执行三维操作任务时,因多固定相机设置引入的冗余信息和无关数据导致计算成本高、模型训练时间长以及难以精准提取任务相关特征的问题。解决方案的关键在于提出一种名为VERM(Virtual Eye for Robotic Manipulation)的方法,该方法利用基础模型的知识,从构建的三维点云中“想象”出一个虚拟的任务自适应视角,从而高效捕获必要信息并缓解遮挡问题;同时设计了深度感知模块和动态粗粒度到细粒度的处理流程,以提升3D动作规划与精细操作的能力。
链接: https://arxiv.org/abs/2512.16724
作者: Yixiang Chen,Yan Huang,Keji He,Peiyan Li,Liang Wang
机构: New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; FiveAges; Shandong University
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at RA-L 2025
Abstract:When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving 1.89x speedup in training time and 1.54x speedup in inference speed. More results can be found on our project website at this https URL .
zh
[CV-39] A multi-centre multi-device benchmark dataset for landmark-based comprehensive fetal biometry
【速读】:该论文旨在解决胎儿超声(Ultrasound, US)生物测量中因人工标记耗时、依赖操作者且受设备与中心差异影响而导致的可重复性差的问题,从而阻碍了人工智能辅助胎儿生长评估的推广应用。其关键解决方案是构建一个公开的多中心、多设备胎儿超声图像基准数据集,包含专家标注的解剖学标志点,覆盖所有主要胎儿生物测量指标(如双顶径、头围、腹径及股骨长度),并提供标准化的训练/测试划分、评估代码和基线结果,以支持域适应和跨中心泛化研究,显著提升AI模型在不同临床环境下的可靠性与通用性。
链接: https://arxiv.org/abs/2512.16710
作者: Chiara Di Vece,Zhehua Mao,Netanell Avisdris,Brian Dromey,Raffaele Napolitano,Dafna Ben Bashat,Francisco Vasconcelos,Danail Stoyanov,Leo Joskowicz,Sophia Bano
机构: University College London(伦敦大学学院); Tel Aviv Sourasky Medical Center(特拉维夫索拉斯基医疗中心); The Hebrew University of Jerusalem(希伯来大学); Tel Aviv University(特拉维夫大学); UCLH NHS Foundation Trust(伦敦大学学院医院国家健康服务体系基金会); Elizabeth Garrett Anderson Institute for Women’s Health(伊丽莎白·加勒特·安德森女性健康研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, 3 tables
Abstract:Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
zh
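作为示意,拿到双顶径(BPD)与枕额径(OFD)两个标志点测量值后,常可用椭圆周长近似推算头围(HC)。下面的 Python 片段采用 Ramanujan 椭圆周长近似;公式选择为笔者举例(临床上也常用 HC = 1.62*(BPD+OFD) 的线性近似),并非该数据集的官方度量定义:

```python
import math

def head_circumference(bpd_mm: float, ofd_mm: float) -> float:
    """由 BPD 与 OFD 按椭圆周长(Ramanujan 近似)估算头围,单位 mm。"""
    a, b = bpd_mm / 2.0, ofd_mm / 2.0   # 椭圆半轴
    return math.pi * (3 * (a + b) - math.sqrt((3 * a + b) * (a + 3 * b)))

print(f"HC ≈ {head_circumference(90.0, 110.0):.1f} mm")  # 约 314.9 mm
```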
[CV-40] SDFoam: Signed-Distance Foam for explicit surface reconstruction
【速读】:该论文旨在解决当前基于神经辐射场(Neural Radiance Fields, NeRF)和基于点绘制的方法(如3D Gaussian Splatting, 3DGS)在精细网格重建(mesh reconstruction)方面表现不佳的问题。尽管RadiantFoam通过显式Voronoi Diagram(VD)实现了接近3DGS的渲染效率,但仍难以获得高质量的几何结构。其解决方案的关键在于提出一种混合隐式-显式表示方法SDFoam,即联合学习一个显式的Voronoi Diagram与一个隐式的符号距离场(Signed Distance Field, SDF)。该方法利用SDF提供的度量一致性等值面,引导近表面的Voronoi单元面与其零水平集对齐,从而生成更清晰、视角一致且拓扑更优的表面,同时保持与RadiantFoam相当的训练速度和图像保真度(PSNR、SSIM)。
链接: https://arxiv.org/abs/2512.16706
作者: Antonella Rech,Nicola Conci,Nicola Garau
机构: University of Trento (特伦托大学); CNIT
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.
zh
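SDFoam 中用于正则化 SDF 的 Eikonal 目标,即约束梯度范数为 1:L = E[(||∇f(x)|| - 1)^2]。以下 PyTorch 片段给出其最小实现示意(SDF 网络结构为笔者随意选取):

```python
import torch

def eikonal_loss(sdf_net, points):
    """sdf_net: 输入 (N, 3) 返回 (N, 1) 的可微网络;points: (N, 3) 采样点。"""
    points = points.clone().requires_grad_(True)
    sdf = sdf_net(points)
    # 对采样点求梯度,create_graph=True 以便该损失可继续反向传播
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(),
                          torch.nn.Linear(64, 1))
pts = torch.rand(1024, 3) * 2 - 1   # 在 [-1, 1]^3 内均匀采样
print(eikonal_loss(net, pts))
```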
[CV-41] Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?
【速读】:该论文旨在解决生成式 AI(Generative AI)在图像局部篡改(如修复填充 inpainting 和区域级编辑)场景下,现有深度伪造检测模型泛化能力不足的问题。其关键解决方案在于系统性评估原本针对全合成图像训练的先进检测模型在局部修复检测任务中的迁移性能,结果表明:在大规模多生成器数据集上训练的模型对基于修复填充的篡改具有部分迁移能力,尤其能可靠识别中大范围篡改或再生风格的修复操作,显著优于许多现有多数针对性设计的检测方法。
链接: https://arxiv.org/abs/2512.16688
作者: Serafino Pandolfini,Lorenzo Pellegrini,Matteo Ferrara,Davide Maltoni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5 figures, 9 tables
Abstract:The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for the deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.
zh
[CV-42] Few-Shot Fingerprinting Subject Re-Identification in 3D-MRI and 2D-X-Ray
【速读】:该论文旨在解决多源开源数据集合并时因同一受试者出现在多个数据集中而导致的数据泄露(data leakage)问题,从而引发模型性能虚高的现象。其解决方案的关键在于提出基于潜在空间(latent space)的受试者指纹识别(subject fingerprinting)方法:通过训练ResNet-50网络并采用三元组边缘损失(triplet margin loss),将同一受试者的多张图像映射到潜在空间中的唯一区域,进而利用相似性匹配实现受试者再识别(re-identification)。实验表明,在胸部X光(ChestXray-14)和脑肿瘤MRI(BraTS-2021)数据上,该方法在标准与挑战性少样本场景下均表现出高准确率,验证了其有效性。
链接: https://arxiv.org/abs/2512.16685
作者: Gonçalo Gaspar Alves,Shekoufeh Gorgi Zadeh,Andreas Husch,Ben Bausch
机构: University of Luxembourg (卢森堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean-Recall@K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS-2021.
zh
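摘要中“ResNet-50 + triplet margin loss”的受试者指纹训练流程可用如下 PyTorch 片段示意;嵌入维度(128)、margin(1.0)等超参数为笔者假设,数据以随机张量代替真实影像:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 128)   # 改为嵌入输出
criterion = nn.TripletMarginLoss(margin=1.0)

def embed(x):
    # L2 归一化便于后续以余弦/欧氏距离做相似度匹配
    return nn.functional.normalize(backbone(x), dim=-1)

anchor = torch.randn(4, 3, 224, 224)     # 某受试者的一张图像
positive = torch.randn(4, 3, 224, 224)   # 同一受试者的另一张图像
negative = torch.randn(4, 3, 224, 224)   # 其他受试者的图像
loss = criterion(embed(anchor), embed(positive), embed(negative))
loss.backward()   # 拉近同一受试者、推远不同受试者的嵌入
```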
[CV-43] FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering
【速读】:该论文旨在解决交互式应用中神经渲染(Neural Rendering)的两大核心挑战:一是现有基于扩散模型的方法在帧间缺乏时间一致性(如RGBX仅生成单帧),二是视频类模型(如DiffusionRenderer)计算开销过大且需预先获取完整序列,难以适应用户实时输入的交互场景。解决方案的关键在于提出FrameDiffuser——一种自回归神经渲染框架,通过双重条件控制实现高效、稳定的时间一致性图像生成:一方面利用ControlNet对几何与材质信息(G-buffer)进行结构引导,另一方面引入ControlLoRA模块以自身先前生成帧为参考,保障时序连贯性;同时采用三阶段训练策略优化自回归过程稳定性,并针对特定环境进行专项训练,在保证推理速度的同时显著提升光照、阴影和反射等细节的逼真度。
链接: https://arxiv.org/abs/2512.16670
作者: Ole Beisswenger,Jan-Niklas Dihlmann,Hendrik P.A. Lensch
机构: University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL
Abstract:Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
zh
[CV-44] REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion
【速读】:该论文旨在解决潜在扩散模型(Latent Diffusion Models, LDMs)在图像合成中因重建式去噪目标导致的语义监督间接性问题,即高阶语义信息涌现缓慢、训练周期长且生成样本质量受限。现有方法通过外部对齐或仅建模VFM(Vision Foundation Model)特征的局部片段来注入语义,未能充分利用VFM中丰富的非线性多层空间语义信息。其解决方案的核心是提出REGLUE框架,该框架在统一的SiT骨干网络中联合建模三类要素:VAE图像潜在表示、紧凑的局部(patch-level)VFM语义以及全局(image-level)[CLS] token;其中,轻量级卷积语义压缩器将多层VFM特征非线性聚合为低维结构化表示,并与VAE潜在变量在扩散过程中纠缠编码;同时引入外部对齐损失进一步约束内部表示向冻结的VFM目标靠拢。实验表明,空间语义信息至关重要,非线性压缩是释放其潜力的关键,而全局token与外部对齐则作为互补的轻量增强机制,在全局-局部-潜在联合建模框架内显著提升性能并加速收敛。
链接: https://arxiv.org/abs/2512.16636
作者: Giorgos Petsangourakis,Christos Sgouropoulos,Bill Psomas,Theodoros Giannakopoulos,Giorgos Sfikas,Ioannis Kakogeorgiou
机构: IIT, National Centre for Scientific Research “Demokritos” (国家科学研究中心“德谟克利特”); University of West Attica (西阿提卡大学); VRG, FEE, Czech Technical University in Prague (捷克理工大学电气工程学院VRG实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at this https URL.
zh
[CV-45] SARMAE: Masked Autoencoder for SAR Representation Learning
【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像在深度学习应用中面临的两大挑战:一是数据稀缺问题,二是由物理特性决定的斑点噪声(speckle noise)对细粒度语义表示学习的干扰。为应对这些问题,作者提出SARMAE——一种面向SAR的噪声感知掩码自编码器(Noise-Aware Masked Autoencoder),其核心创新在于三点:首先构建了首个百万级SAR数据集SAR-1M并附带配对光学图像,支持大规模预训练;其次设计了斑点感知表示增强(Speckle-Aware Representation Enhancement, SARE),通过向掩码自编码器注入SAR特有斑点噪声以提升模型对噪声的鲁棒性;最后引入语义锚点表示约束(Semantic Anchor Representation Constraint, SARC),利用配对光学先验对齐SAR特征空间,保障跨模态语义一致性。实验表明,SARMAE在分类、检测和分割任务上均达到当前最优性能。
链接: https://arxiv.org/abs/2512.16635
作者: Danxu Liu,Di Wang,Hebaixu Wang,Haoyang Chen,Wentao Jiang,Yilin Cheng,Haonan Guo,Wei Cui,Jing Zhang
机构: Beijing Institute of Technology (北京理工大学); Wuhan University (武汉大学); Fudan University (复旦大学); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code and models will be available at this https URL
Abstract:Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at this https URL.
zh
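SARE 模块注入的 SAR 相干斑通常按乘性 Gamma 模型建模:对 L 视强度图像,I' = I * S,其中 S ~ Gamma(L, 1/L)(均值为 1)。以下 NumPy 片段为该噪声模型的示意实现,论文中具体的注入位置与强度设定以原文为准:

```python
import numpy as np

def inject_speckle(img: np.ndarray, looks: int = 4) -> np.ndarray:
    """按乘性 Gamma 模型注入相干斑;looks(视数)越小,斑点越强。"""
    speckle = np.random.gamma(shape=looks, scale=1.0 / looks, size=img.shape)
    return img * speckle

clean = np.random.rand(64, 64).astype(np.float32)
noisy = inject_speckle(clean, looks=2)   # 强斑点版本,可作为 MAE 的输入
```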
[CV-46] DeContext as Defense: Safe Image Editing in Diffusion Transformers
【速读】:该论文旨在解决在上下文扩散模型(in-context diffusion models)中,输入图像可能被未经授权地编辑所引发的隐私泄露问题,例如身份冒用或虚假信息传播。解决方案的关键在于提出一种名为DeContext的新方法,其核心思想是:通过向输入图像注入小而有针对性的扰动,削弱多模态注意力层中的跨注意力路径,从而阻断源图像信息向输出图像的传播。该方法基于观察——早期去噪步骤和特定Transformer块主导了上下文信息的传递,因此扰动可集中作用于这些关键位置,实现高效且鲁棒的防御,同时保持输出图像的视觉质量。
链接: https://arxiv.org/abs/2512.16625
作者: Linghui Shen,Mingyue Cui,Xingyi Yang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures
Abstract:In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner’s consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decouples the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.
zh
[CV-47] Plug to Place: Indoor Multimedia Geolocation from Electrical Sockets for Digital Investigation
【速读】:该论文旨在解决室内多媒体地理定位(indoor multimedia geolocation)在数字取证中的应用难题,尤其是在人贩子、儿童剥削等严重犯罪案件中缺乏可靠定位手段的问题。其核心挑战包括房间布局相似性高、视觉模糊、光照变化大、GPS信号不可靠以及敏感领域数据稀缺。解决方案的关键在于利用标准化的插头插座类型作为稳定的室内地标:首先通过YOLOv11模型检测图像中的插头插座(mAP@0.5 = 0.843),进而使用Xception网络对插座进行12类分类(准确率0.912),最后基于插座类型映射至国家(置信度阈值90%时准确率达0.96)。为缓解数据不足问题,作者构建了两个专用数据集,并在TraffickCam子集上验证了该流程在真实场景下的有效性,展示了面向实际数字取证应用的可行路径。
链接: https://arxiv.org/abs/2512.16620
作者: Kanwal Aftab,Graham Adams,Mark Scanlon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at 90% threshold confidence). To address data scarcity, two dedicated datasets were created: a socket detection dataset of 2,328 annotated images, expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and data for this paper are available as open source.
zh
[CV-48] Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
【速读】:该论文旨在解决扩散 Transformer (Diffusion Transformers, DiTs) 在处理长序列 token 时因自注意力机制的二次计算复杂度(quadratic self-attention cost)而导致的扩展瓶颈问题。现有基于 Top-K 稀疏注意力的方法虽能降低计算量,但仍面临两个关键限制:一是压缩 token 上的选择成本仍为二次复杂度,二是随着序列增长需增大 K 值以维持模型质量。作者指出其根本原因在于单层粗粒度设计无法有效建模全局结构。解决方案的关键在于提出一种分层稀疏注意力机制——Log-linear Sparse Attention (LLSA),通过引入层级结构实现从二次到对数线性复杂度的优化:首先进行分层 Top-K 选择,逐级利用前一层选出的索引进行稀疏筛选;同时设计 Hierarchical KV Enrichment 机制,在减少 token 数量的同时保留多粒度的全局上下文信息。该方法在无需 patchification 或 VAE 编码的前提下实现了高分辨率图像生成,显著加速了 DiT 推理(28.27x)和训练(6.09x),且保持生成质量。
链接: https://arxiv.org/abs/2512.16615
作者: Yifan Zhou,Zeqi Xiao,Tianyi Wei,Shuai Yang,Xingang Pan
机构: S-Lab, Nanyang Technological University (南洋理工大学); Wangxuan Institute of Computer Technology, Peking University (北京大学计算机技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL
Abstract:Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: this https URL
zh
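LLSA 的核心是层级 Top-K 选择:先在粗粒度块上筛选 k1 个块,再仅在选中块内部做细粒度筛选,避免对全部 N 个键打分。以下 PyTorch 片段给出一个两级选择的简化示意(单查询、两层结构;并非论文的高性能 GPU kernel 实现):

```python
import torch

def hierarchical_topk(q, k_fine, block=16, k1=4, k2=8):
    """q: (d,) 查询;k_fine: (N, d) 细粒度键,N 需被 block 整除。"""
    N, d = k_fine.shape
    k_coarse = k_fine.view(N // block, block, d).mean(dim=1)   # 块内均值作粗粒度键
    coarse_idx = (k_coarse @ q).topk(k1).indices               # 第一级:选 k1 个块
    cand = torch.cat([torch.arange(i * block, (i + 1) * block)
                      for i in coarse_idx.tolist()])           # 选中块内的候选索引
    fine_idx = cand[(k_fine[cand] @ q).topk(k2).indices]       # 第二级:块内选 k2
    return fine_idx

q, keys = torch.randn(32), torch.randn(256, 32)
print(hierarchical_topk(q, keys))   # 仅对 4*16=64 个细粒度键打分,而非 256 个
```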
[CV-49] Dont Guess Escalate: Towards Explainable Uncertainty-Calibrated AI Forensic Agents
【速读】:该论文旨在解决当前多媒体伪造检测中缺乏统一、可靠且具备不确定性评估能力的验证流程的问题。其解决方案的关键在于提出“AI取证代理”(AI forensic agents),即能够智能选择和组合多种取证检测器、识别内容来源与上下文信息,并提供不确定性感知评估的可信协调机制,从而构建一个统一的框架以提升真实性验证的准确性与可靠性。
链接: https://arxiv.org/abs/2512.16614
作者: Giulia Boato,Andrea Montibeller,Edward Delp,Luisa Verdoliva,Daniele Miorandi
机构: Truebees(真蜜蜂); University of Trento (特伦托大学); Purdue University (普渡大学); University of Naples Federico II (那不勒斯腓立比二世大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:AI is reshaping the landscape of multimedia forensics. We propose AI forensic agents: reliable orchestrators that select and combine forensic detectors, identify provenance and context, and provide uncertainty-aware assessments. We highlight pitfalls in current solutions and introduce a unified framework to improve the authenticity verification process.
zh
[CV-50] Hazedefy: A Lightweight Real-Time Image and Video Dehazing Pipeline for Practical Deployment
【速读】:该论文旨在解决雾霾图像和视频中可见度低、对比度差的问题,尤其针对实时视频流和现场摄像头画面的增强需求。解决方案的关键在于提出一个轻量级且面向应用的去雾流程 Hazedefy,其核心包括:基于暗通道先验(Dark Channel Prior, DCP)与大气散射模型的改进架构;采用伽马自适应重建提升视觉质量;引入具有下界约束的快速透射率近似以保证数值稳定性;通过分数顶部像素平均法稳定大气光估计;以及可选的颜色平衡模块。该方案在不依赖GPU加速的情况下,可在消费级硬件上高效运行,适用于移动设备和嵌入式系统。
链接: https://arxiv.org/abs/2512.16609
作者: Ayush Bhavsar
机构: National Institute of Technology, Raipur, India.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures. Code and demo available at this https URL
Abstract:This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
zh
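Hazedefy 基于暗通道先验(DCP)与大气散射模型 I = J*t + A*(1-t)。以下 Python 片段按摘要所述组件(分数顶部像素平均估计大气光 A、带下界的快速透射率 t、按模型反解场景 J)给出最小示意实现;伽马自适应重建与色彩平衡等论文细节从略,参数取值为笔者假设:

```python
import numpy as np
import cv2   # 用 OpenCV 的腐蚀操作实现最小值滤波

def dehaze_dcp(img, omega=0.95, t_min=0.1, patch=15, top_frac=0.001):
    """img: float32 RGB 图像,取值 [0, 1]。返回去雾结果。"""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    dark = cv2.erode(img.min(axis=2).astype(np.float32), kernel)   # 暗通道
    n_top = max(1, int(top_frac * dark.size))
    idx = np.argsort(dark.ravel())[-n_top:]                        # 暗通道最亮的一小部分像素
    A = img.reshape(-1, 3)[idx].mean(axis=0)                       # 大气光:分数顶部平均
    t = 1.0 - omega * cv2.erode((img / A).min(axis=2).astype(np.float32), kernel)
    t = np.clip(t, t_min, 1.0)                                     # 透射率下界保证数值稳定
    J = (img - A) / t[..., None] + A                               # 反解散射模型
    return np.clip(J, 0.0, 1.0)
```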
[CV-51] Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks
【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像生成任务中因卷积神经网络(CNN)局部性限制而难以捕捉长距离语义信息的问题。其核心解决方案是提出Yuan-TecSwin,一种基于文本条件的扩散模型,用Swin-Transformer模块替代原架构中的CNN块,以增强特征提取与图像恢复过程中的非局部建模能力;同时通过优化文本编码器选择、文本嵌入的有效利用及文本条件的精心融合设计,显著提升文本-图像对齐效果,并引入自适应时间步长策略以提升推理性能,最终在ImageNet生成基准上达到FID分数1.37的最先进水平。
链接: https://arxiv.org/abs/2512.16586
作者: Shaohua Wu,Tong Yu,Shenling Wang,Xudong Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model’s ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.
zh
[CV-52] Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在需要视觉想象(visual imagination)的任务中表现不足的问题,尤其在于其缺乏像人类一样在推理过程中动态生成并交互式使用视觉表征的能力。现有方法通常依赖预定义的外部工具包或在思维过程中生成离散图像,而人类则能在大脑统一空间内灵活地构建视觉-文本联合推理过程。解决方案的关键在于提出了一种名为“Latent Sketch Tokens”的新范式——Sketch-in-Latents (SkiLa),它将视觉令牌无缝嵌入到由文本令牌驱动的推理流程中,使模型能够在多步推理中动态切换文本思考模式与视觉草图模式,从而原生生成连续的视觉嵌入(latent sketch tokens),并借助潜在视觉语义重建机制确保这些视觉表征的语义一致性。这一设计实现了视觉想象过程的隐式编码,显著提升了模型在视觉主导任务上的性能及跨多模态基准的泛化能力。
链接: https://arxiv.org/abs/2512.16584
作者: Jintao Tong,Jiaqi Gu,Yujing Lou,Lubin Fan,Yixiong Zou,Yue Wu,Jieping Ye,Ruixuan Li
机构: Huazhong University of Science and Technology (华中科技大学); Alibaba Cloud Computing (阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 11 figures
Abstract:While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works that take predefined external toolkits or generate images during thinking, however, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits, where one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Codes will be released at this https URL.
zh
[CV-53] CRONOS: Continuous Time Reconstruction for 4D Medical Longitudinal Series
【速读】:该论文旨在解决3D医学影像序列在非规则采样条件下进行体素级时间演化预测的问题,现有模型通常依赖单一先验扫描、固定时间网格或仅关注全局标签,难以支持连续时间点的精准预测。其解决方案的关键在于提出CRONOS框架,该框架能够从多个历史扫描中进行多对一预测,并统一处理离散(基于网格)与连续(实数值)时间戳,首次实现3D医学数据的连续序列到图像预测;CRONOS通过学习一个时空速度场(spatio-temporal velocity field),直接在3D体素空间中将上下文体积传输至任意目标时间点的体积,从而实现高精度且计算高效的体素级时序建模。
链接: https://arxiv.org/abs/2512.16577
作者: Nico Albert Disch,Saikat Roy,Constantin Ulrich,Yannick Kirchhoff,Maximilian Rokuss,Robin Peretzke,David Zimmerer,Klaus Maier-Hein
机构: German Cancer Research Center (德国癌症研究中心); HIDSS4Health - Helmholtz Information and Data Science School for Health (赫尔姆霍兹健康信息与数据科学学院); University of Heidelberg (海德堡大学); Heidelberg University Hospital (海德堡大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.
[CV-54] Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation AAAI2026
【速读】: This paper addresses the feature artifacts that long-term pre-training induces in Vision Foundation Models (VFMs) for Domain Generalized Semantic Segmentation (DGSS). These artifacts usually stem from non-causal factors residing in the low- and high-frequency components of the VFM spectrum, hindering the use of valuable representations and degrading generalization to unseen domains. The key to the solution is Causal-Tune, a novel fine-tuning strategy that explicitly analyzes the causal and non-causal components of features via the Discrete Cosine Transform (DCT): a Gaussian band-pass filter first separates the causal from the non-causal frequency components; a set of causal-aware learnable tokens then refines the causal components in the frequency domain while the non-causal components are discarded; finally, the refined features are mapped back to the spatial domain via inverse DCT and passed on to the next layer. This effectively extracts causal factors while suppressing non-causal interference, significantly improving cross-domain generalization, especially under adverse weather (+4.8% mIoU in snow).
链接: https://arxiv.org/abs/2512.16567
作者: Yin Zhang,Yongqiang Zhang,Yaoyue Zheng,Bogdan Raducanu,Dan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026
Abstract:Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
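The frequency-domain separation step described above can be sketched in a few lines. The Gaussian band-pass parameters below are illustrative assumptions; Causal-Tune applies this per layer and further refines the causal part with learnable tokens, which is omitted here:

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_causal(feat: np.ndarray, center: float = 0.3, width: float = 0.15):
    """Separate a feature map into band-pass ('causal') and residual
    ('non-causal') components via DCT. The center/width of the Gaussian
    band-pass are illustrative assumptions."""
    h, w = feat.shape
    spec = dctn(feat, norm="ortho")

    # Normalized radial frequency in [0, 1]: 0 = DC, 1 = highest frequency.
    fy = np.arange(h) / (h - 1)
    fx = np.arange(w) / (w - 1)
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2) / np.sqrt(2)

    band = np.exp(-((radius - center) ** 2) / (2 * width ** 2))  # band-pass
    causal = idctn(spec * band, norm="ortho")            # mid-frequency content
    non_causal = idctn(spec * (1 - band), norm="ortho")  # low/high frequencies
    return causal, non_causal

feat = np.random.randn(64, 64)
causal, non_causal = split_causal(feat)
print(np.allclose(causal + non_causal, feat))  # True: exact decomposition
```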
[CV-55] 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
【速读】: This paper addresses complete and persistent 4D reconstruction of dynamic scenes from monocular RGB video, i.e., reconstructing not only the currently visible regions but also all previously viewed parts of the scene, enabling replayable 3D reconstruction across timesteps. The core of the solution is to decompose the scene into a set of rigid 3D primitives assumed to move through the scene; using estimated dense 2D correspondences, the rigid motion of each primitive is jointly inferred in an optimization pipeline, yielding a dynamically evolving 4D scene representation. A key innovation is a motion-grouping mechanism that extrapolates the motion of temporarily invisible objects, maintaining spatio-temporal continuity of the reconstruction and enabling capabilities such as object permanence and multi-object scanning.
链接: https://arxiv.org/abs/2512.16564
作者: Kirill Mazur,Marwan Taher,Andrew J. Davison
机构: Dyson Robotics Lab, Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: For project page, see this https URL
Abstract:We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.
[CV-56] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
【速读】: This paper addresses the lack of intrinsic 3D object perception in current multimodal models, which limits their understanding of spatial relationships and depth cues in 3D scenes. The key to the solution is N3D-VLM, a unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, allowing the model to localize objects directly in 3D space from textual descriptions and then to perform explicit 3D spatial reasoning on top of those localizations, yielding more interpretable and structured spatial understanding. The authors further build a scalable data-construction pipeline that lifts large-scale 2D annotations into 3D space via depth estimation, producing a 3D grounding dataset over six times larger than the largest existing single-image 3D detection dataset, along with Chain-of-Thought (CoT)-oriented 3D spatial question-answering data that supports joint training of 3D object grounding and 3D spatial reasoning.
链接: https://arxiv.org/abs/2512.16561
作者: Yuxin Wang,Lei Ke,Boqiang Zhang,Tianyuan Qu,Hanxun Yu,Zhenpeng Huang,Meng Yu,Dan Xu,Dong Yu
机构: HKUST(香港科技大学); Tencent AI Lab(腾讯人工智能实验室); CUHK(香港中文大学); ZJU(浙江大学); NJU(南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing vision-language methods in 3D spatial reasoning.
[CV-57] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
【速读】: This paper addresses the high susceptibility of Vision-Language Models (VLMs) to adversarial perturbations despite their strong zero-shot recognition, a significant risk in safety-critical scenarios. Existing training-time defenses rely on adversarial fine-tuning, which requires labeled data and is costly; test-time strategies struggle to reliably distinguish clean from adversarial inputs, so optimal robustness and clean accuracy cannot be achieved simultaneously. The key to the solution is Test-Time Padding (TTP), a lightweight test-time defense framework: adversarial inputs are detected via the cosine-similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold that works reliably across architectures and datasets; for detected adversarial samples, TTP applies trainable padding to restore the disrupted attention patterns and combines it with a similarity-aware ensemble strategy for a more robust final prediction; clean samples are left unchanged or optionally combined with other test-time adaptation techniques for further accuracy gains.
链接: https://arxiv.org/abs/2512.16523
作者: Zhiwei Li,Yitian Pang,Weining Wang,Zhenan Sun,Qi Li
机构: NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); School of Automation, Tsinghua University (清华大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
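Stage one of TTP, detecting adversarial inputs from the cosine-similarity shift under spatial padding, can be sketched as follows. The `toy_encode` stand-in replaces the CLIP image encoder, and the threshold value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def padding_shift(image: torch.Tensor, encode, pad: int = 16) -> torch.Tensor:
    """Cosine-similarity shift between features of an image and its
    zero-padded version; large shifts flag adversarial inputs."""
    padded = F.pad(image, (pad, pad, pad, pad))            # spatial padding
    padded = F.interpolate(padded, size=image.shape[-2:],  # back to encoder size
                           mode="bilinear", align_corners=False)
    z, z_pad = encode(image), encode(padded)
    return 1 - F.cosine_similarity(z, z_pad, dim=-1)       # shift in [0, 2]

# Stand-in encoder: a frozen random projection of pooled pixels.
# In TTP this would be the CLIP image encoder.
proj = torch.randn(3 * 8 * 8, 128)
def toy_encode(x):
    pooled = F.adaptive_avg_pool2d(x, 8).flatten(1)
    return pooled @ proj

img = torch.rand(1, 3, 224, 224)
shift = padding_shift(img, toy_encode)
is_adversarial = shift.item() > 0.05  # threshold is an illustrative assumption
print(shift.item(), is_adversarial)
```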
[CV-58] Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images
【速读】: This paper addresses accurate intrinsic decomposition of face images under unconstrained lighting, a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. The key contributions are MAGINet, a Multi-scale Attention-Guided Intrinsics Network with hierarchical residual encoding, spatial-and-channel attention in the bottleneck, and adaptive multi-scale feature fusion in the decoder, which yields sharper diffuse-albedo boundaries and stronger lighting invariance. The initial prediction is upsampled to 1024×1024 and refined by a lightweight three-layer CNN (RefinementNet); conditioned on the refined albedo, a Pix2PixHD-based translator then generates five additional physically based rendering passes (ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour), completing a full six-pass intrinsic decomposition. Trained on the FFHQ-UV-Intrinsics dataset with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses, the method achieves state-of-the-art diffuse-albedo estimation and a markedly higher-quality overall rendering stack than prior work.
链接: https://arxiv.org/abs/2512.16511
作者: Hossein Javidnia
机构: Trinity College Dublin (都柏林圣三一学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a 512×512 light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to 1024×1024 and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour. Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.
[CV-59] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization
【速读】: This paper addresses the difficulty of learning effective feature representations for skeleton-based temporal action localization, in particular the lack of temporally sensitive features needed to detect action boundaries. The key to the solution is a snippet-discrimination self-supervised pretext task: skeleton sequences are densely partitioned into non-overlapping snippets, and contrastive learning promotes feature representations that distinguish these snippets across videos; on top of strong backbones, a U-shaped module fuses intermediate features to raise feature resolution for frame-level localization. Experiments show that the approach consistently improves existing contrastive learning methods for skeleton-based action localization on BABEL and achieves state-of-the-art transfer learning on PKUMMD with pretraining on NTU RGB+D and BABEL.
链接: https://arxiv.org/abs/2512.16504
作者: Qiushuo Cheng,Jingjing Liu,Catherine Morgan,Alan Whone,Majid Mirmehdi
机构: University of Bristol (布里斯托大学); North Bristol NHS Trust (北布里斯托国民保健服务信托)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
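The snippet-discrimination pretext task reduces to an InfoNCE objective in which each snippet's positive is the same snippet under a second augmentation and all other snippets act as negatives. A minimal sketch, with the batch layout and temperature as assumptions:

```python
import torch
import torch.nn.functional as F

def snippet_infonce(view1: torch.Tensor, view2: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over snippet features: each snippet's positive is the same
    snippet in the other augmented view; all other snippets (across videos
    and time) act as negatives. Shapes: (num_videos, num_snippets, dim)."""
    q = F.normalize(view1.flatten(0, 1), dim=-1)   # (V*S, D) queries
    k = F.normalize(view2.flatten(0, 1), dim=-1)   # (V*S, D) keys
    logits = q @ k.t() / temperature               # similarity to every snippet
    targets = torch.arange(q.shape[0])             # positive = matching index
    return F.cross_entropy(logits, targets)

v1 = torch.randn(4, 10, 256)  # 4 videos x 10 non-overlapping snippets
v2 = torch.randn(4, 10, 256)  # second augmented view of the same snippets
print(snippet_infonce(v1, v2))
```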
[CV-60] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
【速读】: This paper addresses the limitations of existing GUI grounding benchmarks, including insufficient data volume, narrow domain coverage, single-platform focus, and over-reliance on specialized domain knowledge. The key to the solution is VenusBench-GD, a large-scale, multi-platform, bilingual GUI grounding benchmark with broad application coverage, diverse UI elements, and rich annotations; a high-quality data-construction pipeline improves annotation accuracy, and a novel hierarchical task taxonomy divides grounding into basic and advanced categories comprising six subtasks, enabling multi-dimensional evaluation of model capabilities. This design supports a systematic comparison from general-purpose multimodal models to specialized GUI models, revealing that general-purpose models now match or surpass GUI-specialized models on basic grounding, whereas advanced tasks still favor GUI-specialized models, which nevertheless show overfitting and limited robustness.
链接: https://arxiv.org/abs/2512.16501
作者: Beitong Zhou,Zhexiao Huang,Yuan Guo,Zhangxuan Gu,Tianyu Xia,Zichen Luo,Fei Tang,Dehan Kong,Yanyi Shang,Suling Ou,Zhenlin Guo,Changhua Meng,Shuheng Shen
机构: AntGroup(蚂蚁集团); iMean AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though these exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
[CV-61] PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation
【速读】: This paper addresses a core limitation of lifting-based monocular 3D human pose estimation: the detected 2D pose and the unknown depth are encoded in an entangled feature space, so depth uncertainty contaminates the 2D pose features and limits overall estimation accuracy. The key to the solution is PoseMoE, a Mixture-of-Experts (MoE) network: dedicated expert modules separately refine the well-detected 2D pose features and learn the depth features, disentangling the encoding of 2D pose and depth and reducing the explicit influence of uncertain depth on the 2D pose features; a cross-expert knowledge-aggregation module then fuses spatio-temporal context between 2D pose and depth through a bidirectional mapping mechanism, enhancing the feature representation. Experiments show that PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
链接: https://arxiv.org/abs/2512.16494
作者: Mengyuan Liu,Jiajie Liu,Jinyan Zhang,Wenhao Li,Junsong Yuan
机构: Peking University, Shenzhen Graduate School (北京大学深圳研究生院); Nanyang Technological University (南洋理工大学); University at Buffalo, State University of New York (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE Transactions on Image Processing (T-IP)
Abstract:The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
[CV-62] YOLO11-4K: An Efficient Architecture for Real-Time Small Object Detection in 4K Panoramic Images
【速读】: This paper addresses the challenges that omnidirectional 360-degree images pose for object detection, including inherent spatial distortions, wide fields of view, and the computational burden of ultra-high-resolution inputs. Conventional detectors such as YOLO are optimized for standard input sizes (e.g., 640×640 pixels) and become inefficient and less accurate on panoramic imagery at 4K or higher resolution. The key to the solution is the YOLO11-4K framework: a multi-scale detection head with a P2 layer improves sensitivity to small objects, and a GhostConv-based backbone reduces computational complexity without sacrificing representational power. The authors also annotate the CVIP360 dataset with 6,876 frame-level bounding boxes, providing a publicly available detection benchmark for 4K panoramic scenes. The method maintains high accuracy (mAP@0.5 = 0.95) at 28.3 ms per frame, a 75% latency reduction over YOLO11, effectively supporting robust object detection in high-resolution panoramic environments.
链接: https://arxiv.org/abs/2512.16493
作者: Huma Hafeez,Matthew Garratt,Jo Plested,Sankaran Iyer,Arcot Sowmya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Conference paper just submitted
Abstract:The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
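The GhostConv block referenced above follows the GhostNet recipe: a primary convolution produces half of the output channels, and a cheap depthwise convolution generates the rest. A minimal PyTorch sketch (kernel sizes and activation are assumptions about this paper's variant):

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """GhostNet-style convolution: a primary conv produces half the output
    channels; a cheap depthwise conv 'ghosts' the other half."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(  # depthwise: groups == channels
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 64, 128, 128)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 128, 128])
```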
[CV-63] Smile on the Face Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors
【速读】: This paper addresses the gap between facial expression recognition (FER) and genuine emotion recognition (ER): facial expressions are often used as social tools rather than manifestations of true inner emotion, so FER alone cannot capture a person's genuine affective state. The key to the solution is to introduce eye behaviors as an important emotional cue and to construct EMER, an Eye-behavior-aided Multimodal Emotion Recognition dataset, collected with a spontaneous emotion-induction paradigm that captures non-invasive eye behavior data such as eye-movement sequences and fixation maps together with facial expression videos, annotated with separate multi-view emotion labels for multimodal ER and FER. Building on the dataset, the authors design EMERT, an Eye-behavior-aided MER Transformer that uses modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions, improving the robustness of emotion recognition. Under seven multimodal benchmark protocols, EMERT outperforms other state-of-the-art multimodal methods by a large margin, confirming the key role of eye behaviors in bridging the gap between FER and ER.
链接: https://arxiv.org/abs/2512.16485
作者: Kejun Liu,Yuanyuan Liu,Lin Wei,Chang Tang,Yibing Zhan,Zijing Chen,Zhe Chen
机构: China University of Geosciences (Wuhan) (中国地质大学(武汉)); Huazhong University of Science and Technology (华中科技大学); Wuhan University (武汉大学); La Trobe University (拉特罗布大学); Cisco-La Trobe Centre for Artificial Intelligence and Internet of Things (思科-拉特罗布大学人工智能与物联网中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by TMM
Abstract:Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at this https URL.
[CV-64] Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment
【速读】: This paper addresses the lack of human-aligned perception and reasoning in blind image quality assessment (BIQA): a model should not only predict quality scores accurately but also mimic the complete human judgment process from sensory cues to implicit reasoning. The key to the solution is a reinforcement learning framework that uses human annotations as reward signals to guide the model toward human-like perception and reasoning, together with a reward that encourages the model to infer image quality purely from its self-generated descriptions, thereby internalizing self-consistent reasoning. Experiments show score-prediction performance on par with state-of-the-art BIQA systems in Pearson and Spearman correlation, and markedly better human-model alignment than the baseline: on over 1,000 human-annotated samples the model reaches a ROUGE-1 score of 0.512 (vs. 0.443 for the baseline), indicating strong similarity between model-generated reasoning chains and human explanations.
链接: https://arxiv.org/abs/2512.16484
作者: Yuan Li,Yahan Yu,Youyuan Lin,Yong-Hao Yang,Chenhui Chu,Shin’ya Nishida
机构: Kyoto University (京都大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
[CV-65] StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
【速读】: This paper addresses the sharply increased computational complexity and runtime of Visual Autoregressive (VAR) models at large-scale generation steps. Existing acceleration methods reduce runtime but rely on manual step selection and ignore that different stages of the generation process differ in importance. The key to the solution is the StageVAR framework, whose systematic analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, whereas later steps mainly refine details and can be pruned or approximated for acceleration. Building on this insight, StageVAR introduces a training-free, plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, achieving up to 3.4x speedup with negligible loss of image quality.
链接: https://arxiv.org/abs/2512.16483
作者: Senmao Li,Kai Wang,Salman Khan,Fahad Shahbaz Khan,Jian Yang,Yaxing Wang
机构: Nankai University (南开大学); City University of Hong Kong (东莞) (香港城市大学(东莞)); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Linkoping University (林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Existing acceleration methods reduce runtime for large-scale steps, but they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.
[CV-66] SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
【速读】: This paper addresses the spatio-temporal understanding needed for reliable navigation and interaction of autonomous robots in dynamic environments, namely how to fuse the open-world semantic priors of Vision-Language Models (VLMs) with point-cloud geometry and temporal consistency. VLMs offer rich semantics but lack precise grounding in 3D space and time, while geometric perception captures structure and motion but is semantically sparse. The key to the solution is SNOW (Scene Understanding with Open-World Knowledge), a training-free, backbone-agnostic framework for unified 4D scene understanding: HDBSCAN clustering generates object-level proposals that guide SAM2 segmentation; the proposed Spatio-Temporal Tokenized Patch Encoding (STEP) extracts localized semantic, geometric, and temporal attributes; the tokens are incrementally integrated into a 4D Scene Graph (4DSG) that serves as a structured prior for downstream reasoning; and a lightweight SLAM backend anchors all STEP tokens spatially, enabling unambiguous spatial grounding across time. This lets VLMs directly interpret spatial structure and temporal dynamics, markedly improving 4D scene understanding accuracy and spatially grounded reasoning.
链接: https://arxiv.org/abs/2512.16461
作者: Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Esslingen University of Applied Sciences (埃斯林根应用科学大学); Dr. Ing. h.c. F. Porsche AG (保时捷股份公司); University of Michigan (密歇根大学); Voxel51 Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
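The proposal step at the front of the SNOW pipeline, clustering a point cloud into object-level proposals with HDBSCAN, can be sketched as below; the clustering parameters and the centroid-as-proposal choice are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3

# Toy point cloud: two object blobs plus scattered background points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([0, 0, 0], 0.1, (200, 3)),   # object A
    rng.normal([2, 1, 0], 0.1, (150, 3)),   # object B
    rng.uniform(-3, 3, (50, 3)),            # clutter -> label -1 (noise)
])

labels = HDBSCAN(min_cluster_size=30).fit_predict(points)

# Each cluster becomes an object-level proposal (here, its 3D centroid),
# which in SNOW would then seed SAM2-based segmentation in the image.
for obj_id in sorted(set(labels) - {-1}):
    centroid = points[labels == obj_id].mean(axis=0)
    print(f"proposal {obj_id}: centroid {np.round(centroid, 2)}")
```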
[CV-67] Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach
【速读】: This paper addresses realism and behavioral plausibility in generative human motion synthesis, specifically the natural human pattern of 'priming' an object or location: gazing at a target from a distance before approaching and reaching it. The key contributions are: (i) the first curated dataset of 23.7K gaze-primed human motion sequences for reaching target object locations, assembled from five public datasets (HD-EPIC, MoGaze, HOT3D, ADT, and GIMO); and (ii) a text-conditioned diffusion-based motion generation model that is pre-trained and then fine-tuned conditioned on goal pose or location. Using evaluation metrics including the newly introduced 'Prime Success' and the 'Reach Success', the model achieves 60% prime success and 89% reach success on HD-EPIC, demonstrating a clear improvement in imitating genuine human behaviour.
链接: https://arxiv.org/abs/2512.16456
作者: Masashi Hatano,Saptarshi Sinha,Jacob Chalk,Wei-Hong Li,Hideo Saito,Dima Damen
机构: Keio University(庆应义塾大学); University of Bristol(布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down – that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it on our curated sequences, conditioned on goal pose or location. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
[CV-68] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
【速读】: This paper addresses the lack of subject consistency in text-to-image diffusion models when generating multiple frames, i.e., the difficulty of keeping a subject's visual identity consistent across outputs, which limits their use in visual storytelling. Existing approaches such as fine-tuning or image conditioning are effective but computationally expensive and require per-subject optimization; the training-free 1Prompt1Story concatenates scene descriptions into a single prompt and rescales token embeddings, but suffers from semantic leakage, where embeddings become entangled across frames and cause text misalignment. The key to the solution is a training-free method that tackles this entanglement from a geometric perspective by refining the text embeddings to suppress unwanted semantics, significantly improving both subject consistency and text alignment.
链接: https://arxiv.org/abs/2512.16443
作者: Shangxun Li,Youngjung Uh
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
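While the paper's exact refinement procedure is not reproduced here, the geometric intuition, suppressing an unwanted semantic direction in embedding space by orthogonal projection, can be sketched in a few lines of NumPy; the choice of leak direction is our illustrative assumption:

```python
import numpy as np

def suppress_direction(embedding: np.ndarray, leak: np.ndarray) -> np.ndarray:
    """Remove the component of a token embedding along a 'leaked' semantic
    direction: e' = e - (e . u) u, with u the unit leak direction."""
    u = leak / np.linalg.norm(leak)
    return embedding - (embedding @ u) * u

rng = np.random.default_rng(1)
frame_a = rng.normal(size=768)          # token embedding for frame A
frame_b = rng.normal(size=768)          # embedding whose semantics leak into A
refined = suppress_direction(frame_a, frame_b)

print(np.dot(refined, frame_b / np.linalg.norm(frame_b)))  # ~0: leak removed
print(np.linalg.norm(frame_a - refined))                   # small overall edit
```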
[CV-69] CountZES: Counting via Zero-Shot Exemplar Selection
【速读】: This paper addresses the key challenge of zero-shot object counting (ZOC): accurately counting instances of unseen categories, specified only by a class name, in complex scenes. Existing methods either rely on open-vocabulary detectors that yield multi-instance candidates or on random patch sampling that cannot precisely delineate object instances. The key to the solution is CountZES, a training-free framework for zero-shot exemplar selection with three synergistic stages: Detection-Anchored Exemplar (DAE) refines open-vocabulary detections into precise single-instance exemplars; Density-Guided Exemplar (DGE) introduces a density-driven self-supervised mechanism to identify statistically consistent and semantically compact exemplars; and Feature-Consensus Exemplar (FCE) reinforces visual coherence via feature-space clustering. Together the three stages construct a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness, markedly improving ZOC performance and generalizing well across natural, aerial, and medical images.
链接: https://arxiv.org/abs/2512.16415
作者: Muhammad Ibraheem Siddiqui,Muhammad Haris Khan
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate the superior performance of CountZES among ZOC methods, while generalizing effectively across natural, aerial, and medical domains.
[CV-70] BrepLLM : Native Boundary Representation Understanding with Large Language Models
【速读】: This paper addresses the inability of current token-sequence-based Large Language Models (LLMs) to directly process 3D Boundary Representation (Brep) models containing complex geometric and topological information, i.e., how to align structured 3D geometry with natural language. The key to the solution is the BrepLLM framework with a two-stage training strategy. In the first stage, adaptive UV sampling converts Breps into graphs carrying geometric and topological information; a hierarchical BrepEncoder extracts geometric features of faces and edges plus topological relations, producing a global token and a sequence of node tokens, and contrastive learning aligns the global token with the output of a frozen CLIP text encoder. In the second stage, the pretrained BrepEncoder is integrated into an LLM through three-stage progressive fine-tuning (an MLP-based semantic mapping with 2D-LLM priors, LLM fine-tuning, and a Mixture-of-Query Experts to enhance modeling of geometric diversity), enabling understanding and generation from raw Brep data to natural language and achieving state-of-the-art performance on 3D object classification and captioning. The authors also construct Brep2Text, a dataset of 269,444 Brep-text question-answer pairs.
链接: https://arxiv.org/abs/2512.16413
作者: Liyuan Deng,Hao Guo,Yunpeng Bai,Yongkang Dai,Huaxi Huang,Yilei Shi
机构: Northwestern Polytechnical University (西北工业大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graphs representation with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors. (2) performing fine-tuning of the LLM. (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.
[CV-71] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture CVPR2026
【速读】: This paper addresses reconstructing high-fidelity, structurally consistent 3D geometry and texture of the human face from a collection of uncalibrated images, achieving accurate neutral-pose reconstruction from as few as 11 images. The key to the solution is combining Gaussian Splatting with a triangulated-mesh constraint: segmentation annotations align semantic facial regions to improve reconstruction stability, and the Gaussians are soft-constrained to the underlying triangulated surface, yielding a structured representation whose geometry can be further refined. Moreover, the method transforms the Gaussian splats into texture space to obtain a view-dependent neural texture that can be applied to any scene asset for high-fidelity rendering without modifying other components of the graphics pipeline; a relightable Gaussian model disentangles lighting from albedo to produce a usable high-resolution albedo texture. The system's flexibility and robustness allow joint training and regularization with images captured under differing, even incompatible, lighting.
链接: https://arxiv.org/abs/2512.16397
作者: Haodi He,Jihun Yu,Ronald Fedkiw
机构: Epic Games; Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Submitted to CVPR 2026. 21 pages, 22 figures
Abstract:We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.
[CV-72] Adaptive Frequency Domain Alignment Network for Medical image segmentation
【速读】: This paper addresses the scarcity of high-quality annotated data in medical image segmentation, which stems from the time-consuming and labor-intensive nature of manual annotation. The key to the solution is a novel domain-adaptation framework, the Adaptive Frequency Domain Alignment Network (AFDAN), which transfers knowledge across domains by aligning features in the frequency domain. It comprises three modules: an Adversarial Domain Learning Module for source-to-target feature transfer, a Source-Target Frequency Fusion Module for blending frequency representations across domains, and a Spatial-Frequency Integration Module that combines spatial and frequency features to improve segmentation accuracy. The method achieves 90.9% Intersection over Union (IoU) on the newly constructed VITILIGO2025 dataset and 82.6% IoU on the DRIVE benchmark, outperforming existing state-of-the-art methods.
链接: https://arxiv.org/abs/2512.16393
作者: Zhanwei Li,Liang Li,Jiawan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN), a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.
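As one plausible instantiation of source-target frequency fusion, the classic Fourier Domain Adaptation trick grafts the target's low-frequency amplitude onto the source image. AFDAN's actual fusion module is learned, so treat this as an illustrative sketch only:

```python
import numpy as np

def fuse_low_freq(source: np.ndarray, target: np.ndarray, beta: float = 0.1):
    """Graft the target's low-frequency amplitude onto the source image
    while keeping the source phase (Fourier Domain Adaptation style)."""
    fs, ft = np.fft.fft2(source), np.fft.fft2(target)
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)

    h, w = source.shape
    b_h, b_w = int(h * beta), int(w * beta)
    amp_s = np.fft.fftshift(amp_s)
    amp_t = np.fft.fftshift(amp_t)
    ch, cw = h // 2, w // 2
    amp_s[ch - b_h:ch + b_h, cw - b_w:cw + b_w] = \
        amp_t[ch - b_h:ch + b_h, cw - b_w:cw + b_w]   # swap low-freq band
    amp_s = np.fft.ifftshift(amp_s)

    fused = np.fft.ifft2(amp_s * np.exp(1j * pha_s))
    return np.real(fused)

src = np.random.rand(128, 128)   # e.g., a source-domain medical image
tgt = np.random.rand(128, 128)   # a target-domain image
print(fuse_low_freq(src, tgt).shape)
```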
[CV-73] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models
【速读】: This paper addresses the frequent failures of current text-to-video (T2V) diffusion models when composing complex scenes or following logical temporal instructions, arguing that the root cause is the difficulty of constructing a semantically correct and logically consistent initial frame. The key to the solution is the Factorized Video Generation (FVG) framework, which decouples T2V into three specialized stages: a large language model (LLM) first rewrites the video prompt to specify the initial scene and resolve temporal ambiguities; a text-to-image (T2I) model then generates a high-quality, compositionally correct anchor frame; finally, a fine-tuned video model focuses on animating the scene from this anchor frame while following the instructions. This markedly improves generation quality and efficiency, achieving new state-of-the-art performance on several benchmarks.
链接: https://arxiv.org/abs/2512.16371
作者: Mariam Hassan,Bastien Van Delft,Wuyang Li,Alexandre Alahi
机构: École Polytechnique Fédérale de Lausanne (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model’s inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis
[CV-74] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation
【速读】: This paper addresses Identity Correspondence (IC) correctness in multi-character animation, i.e., ensuring that character identities remain consistent between generated and reference frames, especially in position-swap scenarios. The key to the solution is a systematic approach built on the Identity Matching Graph (IMG), which models the characters in the generated and reference frames as the two node sets of a weighted complete bipartite graph; edge weights computed via the proposed Mask-Query Attention (MQA) mechanism quantify inter-character affinity, and IC correctness is formalized as a graph structural metric that is optimized during training. This core design makes identity correspondence learnable and controllable, and together with targeted strategies for multi-character animation (identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling) plus the curated Identity Correspondence Evaluation benchmark, it substantially improves both IC accuracy and visual fidelity.
链接: https://arxiv.org/abs/2512.16360
作者: Haotian Ling,Zequn Chen,Qiuying Chen,Donglin Di,Yongjia Ma,Hao Li,Chen Wei,Zhulin Tao,Xun Yang
机构: University of Science and Technology of China (中国科学技术大学); Li Auto (理想汽车); Communication University of China (中国传媒大学); MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC (教育部脑启发智能感知与认知重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when position swaps are involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.
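Given MQA affinities as edge weights, reading an identity assignment off the bipartite graph is a maximum-weight matching problem. The sketch below uses the Hungarian algorithm and a toy affinity matrix; formalizing IC correctness as the fraction of correctly matched identities is our simplification of the paper's graph metric:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy affinity matrix: rows = characters in the generated frame,
# cols = characters in the reference frame (higher = more alike).
affinity = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.1, 0.8],   # generated char 1 resembles reference char 2
    [0.2, 0.7, 0.2],
])

# Maximum-weight matching on the complete bipartite graph (the Hungarian
# algorithm minimizes cost, so negate the affinities).
rows, cols = linear_sum_assignment(-affinity)

# IC correctness: fraction of characters matched to their true identity
# (here the ground truth is a permutation of reference indices).
ground_truth = {0: 0, 1: 2, 2: 1}
ic_correct = np.mean([ground_truth[r] == c for r, c in zip(rows, cols)])
print(dict(zip(rows.tolist(), cols.tolist())), f"IC = {ic_correct:.2f}")
```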
[CV-75] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction
【速读】: This paper addresses three obstacles that pre-trained Latent Diffusion Models (LDMs) face in multi-exposure high-dynamic-range (HDR) reconstruction: (1) limited dynamic-range representation due to 8-bit latent compression; (2) high inference cost from multi-step denoising; and (3) the content hallucination inherent to generative models. The key to the GMODiff framework is to reformulate HDR reconstruction as conditionally guided Gain Map (GM) estimation, where the GM encodes the extended dynamic range at the same bit depth as the low-dynamic-range (LDR) images. Single-step denoising starts from an informative regression-based initial estimate rather than pure noise, and the content-fidelity priors of the regression model guide both the LDM's denoising process and its latent decoding, effectively suppressing hallucinations while preserving structural accuracy. The result is high-quality output at inference speeds 100x faster than previous LDM-based methods.
链接: https://arxiv.org/abs/2512.16357
作者: Tao Hu,Weiyu Zhou,Yanjie Tu,Peng Wu,Wei Dong,Qingsen Yan,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学); Xi’an University of Architecture and Technology (西安建筑科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to their generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100x faster than previous LDM-based methods.
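The gain-map formulation is what keeps everything at LDR bit depth: the GM stores a per-pixel, log-scale multiplier that re-expands the dynamic range at display time. The encoding below follows common gain-map conventions and is an assumption, not GMODiff's exact format:

```python
import numpy as np

# Gain-map HDR reconstruction sketch. The GM stores per-pixel log2 gain,
# quantized to 8 bits like the LDR image; the min/max gain range is
# metadata. Illustrative only; GMODiff's encoding may differ.
GAIN_MIN, GAIN_MAX = 0.0, 4.0   # log2 gain range (up to 16x brighter)

def apply_gain_map(ldr_u8: np.ndarray, gm_u8: np.ndarray) -> np.ndarray:
    ldr = ldr_u8.astype(np.float32) / 255.0
    log_gain = GAIN_MIN + (gm_u8.astype(np.float32) / 255.0) * (GAIN_MAX - GAIN_MIN)
    return ldr * np.exp2(log_gain)   # linear-light HDR values

ldr = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
gm = np.random.randint(0, 256, (64, 64, 1), dtype=np.uint8)  # same bit depth
hdr = apply_gain_map(ldr, gm)
print(hdr.min(), hdr.max())   # dynamic range now exceeds [0, 1]
```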
[CV-76] Collaborative Edge-to-Server Inference for Vision-Language Models
【速读】: This paper addresses the excessive communication overhead and accuracy loss in edge-to-server collaborative inference for Vision-Language Models (VLMs): existing deployments upload raw images captured at the edge, but rescaling them to the vision encoder's input resolution discards fine-grained information and degrades accuracy. The key to the solution is a two-stage collaborative inference framework: in the first stage, the server performs preliminary inference on the global image, locates a Region of Interest (RoI) using the VLM's internal attention, and uses the min-entropy of the output tokens as a confidence measure to decide whether retransmission is needed; if confidence is insufficient, the server requests a detail-preserving local image of the RoI from the edge device and refines its inference by jointly using the global and local images. This selective retransmission strategy transmits only essential visual content, significantly reducing communication cost while maintaining inference accuracy.
链接: https://arxiv.org/abs/2512.16349
作者: Soochang Song,Yongjune Kim
机构: Pohang University of Science and Technology (POSTECH)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 12 figures
Abstract:We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder’s input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM’s internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
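The retransmission gate is driven by the min-entropy of the output tokens, i.e., how peaked each token distribution is. A minimal sketch, where averaging the per-token min-entropy and the threshold value are our assumptions:

```python
import numpy as np

def min_entropy(token_probs: np.ndarray) -> float:
    """Min-entropy of output tokens: mean over steps of -log2 max_v p_t(v).
    High values mean the VLM is unsure, triggering RoI retransmission."""
    top_p = token_probs.max(axis=-1)          # most likely token per step
    return float(np.mean(-np.log2(top_p)))

def needs_local_image(token_probs: np.ndarray, threshold: float = 0.5) -> bool:
    # The threshold here is illustrative; in practice it is tuned empirically.
    return min_entropy(token_probs) > threshold

# Toy distributions over a 10-token vocabulary for a 4-token answer.
confident = np.full((4, 10), 0.01)
confident[:, 0] = 0.91
unsure = np.full((4, 10), 0.1)
print(needs_local_image(confident), needs_local_image(unsure))  # False True
```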
[CV-77] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing
【速读】: This paper addresses how to achieve optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems, whose core challenge is the mutual constraints among sensing coverage, sensing reliability, and dynamic vehicle participation. The key to the solution is QUIDS, a quality-informed, incentive-driven multi-agent dispatching system: a new metric, Aggregated Sensing Quality (ASQ), quantitatively integrates coverage and reliability to characterize QoI, and a Mutually Assisted Belief-aware Vehicle Dispatching algorithm estimates sensing reliability and allocates incentives under uncertainty, significantly improving sensing quality under budget constraints. Empirical results show that QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods, while reducing reconstructed-map errors by 39-74%.
链接: https://arxiv.org/abs/2512.16325
作者: Nan Zhou,Zuxin Li,Fanhang Man,Xuecheng Chen,Susu Xu,Fan Dang,Chaopeng Hong,Yunhao Liu,Xiao-Ping Zhang,Xinlei Chen
机构: Shenzhen International Graduate School, Tsinghua University, China; Department of Civil and System Engineering, Johns Hopkins University, United States of America; School of Software and BNRist, Tsinghua University, Beijing 100084, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality-informed Incentive-driven multi-agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real-world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods. It also reduces reconstruction map errors by 39-74% across algorithms. By jointly optimizing coverage and reliability via a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure, applicable to smart-city scenarios like traffic and environmental sensing.
[CV-78] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs
【速读】: This paper addresses the instability and poor robustness of least-squares estimation for target localization with UAV-mounted multi-sensor systems under challenging observation conditions (long distances, small intersection angles, and large inclination angles), where the column vectors of the design matrix exhibit severe multicollinearity. The key to the solution is to introduce ridge estimation, whose regularization term effectively mitigates the multicollinearity, thereby improving the accuracy and robustness of a localization method that fuses the rich scene information of sequential imagery with the high precision of laser ranging; the benefit is most pronounced when observation conditions are limited.
链接: https://arxiv.org/abs/2512.16314
作者: Huayu Huang,Chen Chen,Banglei Guan,Ze Tan,Yang Shang,Zhang Li,Qifeng Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix exhibit severe multicollinearity under least-squares estimation. This multicollinearity leads to an ill-conditioned problem, causing significant instability and low robustness. Ridge estimation is introduced to mitigate the severe multicollinearity under such limited observation conditions. Experimental results demonstrate that our method achieves higher localization accuracy than ground localization algorithms based on a single information source. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.
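The core numerical fix is the standard ridge estimator, which adds a regularization term to the normal equations so that nearly collinear design-matrix columns no longer blow up the solution. A self-contained sketch with synthetic geometry:

```python
import numpy as np

def ridge_solve(A: np.ndarray, b: np.ndarray, lam: float) -> np.ndarray:
    """Ridge estimate x = (A^T A + lam * I)^{-1} A^T b, which regularizes
    the normal equations when A's columns are nearly collinear."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Nearly collinear columns mimic long-range / small-intersection-angle
# observation geometry (values are synthetic, for illustration only).
rng = np.random.default_rng(42)
col = rng.normal(size=(100, 1))
A = np.hstack([col, col + 1e-4 * rng.normal(size=(100, 1))])
x_true = np.array([1.0, 2.0])
b = A @ x_true + 1e-3 * rng.normal(size=100)

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # unstable under collinearity
x_ridge = ridge_solve(A, b, lam=1e-3)         # stable, slightly biased
print("LS:", x_ls, " ridge:", x_ridge)
```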
[CV-79] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation
【速读】: This paper addresses two challenges that current all-in-one video restoration methods face with time-varying degradations: first, degradation features interfere with temporal modeling, causing the model to attend to artifacts rather than video content; second, existing methods rely on large models to achieve unified restoration, which conceals the underlying technical difficulties. The key to the solution is LaverNet, a lightweight network (only 362K parameters) that introduces a novel propagation mechanism transmitting only degradation-agnostic features across frames, effectively mitigating the interference of degradations with temporal modeling and achieving strong all-in-one video restoration with a small model.
链接: https://arxiv.org/abs/2512.16313
作者: Haiyu Zhao,Yiwen Shan,Yuanbiao Gou,Xi Peng
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, causing the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, with less than 1% of the parameters of existing models, LaverNet achieves comparable or even superior performance across benchmarks.
[CV-80] PixelArena: A benchmark for Pixel-Precision Visual Intelligence
【速读】: This paper addresses the fact that current evaluations of image generation in multimodal large language models (MLLMs) rely heavily on aesthetic metrics and lack objective measures of fine-grained generation precision. The key to the solution is the PixelArena benchmark, which uses semantic segmentation tasks to objectively evaluate fine-grained generative intelligence at pixel precision, revealing a model's ability to generate high-fidelity semantic masks zero-shot; under this protocol, the latest Gemini 3 Pro Image exhibits previously unseen visual intelligence and genuinely general image-generation capability. Qualitative and quantitative comparisons with other models, together with failure-case analysis, provide insights for future research on multimodality, reasoning, interpretability, and benchmarking.
链接: https://arxiv.org/abs/2512.16303
作者: Feng Liang,Sizhe Cheng,Chenqi Yi
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 11 figures, project page: this https URL
Abstract:Multimodal large language models with image output are emerging, yet many image generation benchmarks focus on aesthetics rather than fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine fine-grained generative intelligence with pixel precision. We find that the latest Gemini 3 Pro Image exhibits emergent image generation capabilities, producing semantic masks with high fidelity under zero-shot settings and showcasing previously unseen visual intelligence and true generalization to new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
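Pixel-precision evaluation of generated masks can be grounded in standard intersection-over-union; the sketch below shows a minimal per-class IoU/mIoU computation between a generated label map and a ground-truth one (hypothetical arrays, not the benchmark's official scorer).

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

# Hypothetical 4-class masks, e.g., parsed from a model's generated image.
rng = np.random.default_rng(0)
gt = rng.integers(0, 4, size=(64, 64))
pred = gt.copy()
pred[:8] = rng.integers(0, 4, size=(8, 64))  # corrupt a band to simulate errors
print("mIoU:", mean_iou(pred, gt, num_classes=4))
```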
[CV-81] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval
【Quick Read】: This paper tackles the difficulty of representation learning in multi-label remote sensing image retrieval caused by semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns. The key to the solution is Multi-Label Adaptive Contrastive Learning (MACL), whose core mechanisms are label-aware sampling to strengthen category discrimination, frequency-sensitive weighting to mitigate label imbalance, and dynamic temperature scaling to balance representation learning between common and rare categories. On the DLRSD, ML-AID, and WHDLD benchmarks, MACL consistently outperforms contrastive-loss baselines, effectively mitigating semantic imbalance and improving retrieval reliability in large-scale remote sensing archives.
Link: https://arxiv.org/abs/2512.16294
Authors: Amna Amir,Erchan Aptoula
Institutions: Sabanci University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at this https URL upon acceptance.
[CV-82] GFLAN: Generative Functional Layouts
【Quick Read】: This paper addresses long-standing challenges in automated floor plan generation, namely the need to jointly handle combinatorial search, geometric constraint satisfaction, and functional design requirements, where existing deep learning methods struggle to capture architectural reasoning (the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation from local connectivity decisions). The key to the solution is GFLAN, which explicitly factorizes the task into topological planning and geometric realization: the first stage uses a dual-encoder convolutional architecture that separates invariant spatial context from the evolving layout state and sequentially allocates room centroids via discrete probability maps, while the second stage builds a heterogeneous graph linking room nodes with boundary vertices and applies a Transformer-augmented graph neural network (GNN) to jointly regress room boundaries, enabling interpretable generation from abstract topology to precise geometry.
Link: https://arxiv.org/abs/2512.16275
Authors: Mohamed Abouagour,Eleftherios Garyfallidis
Institutions: Indiana University Bloomington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 21 pages, 15 figures
Abstract:Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements – a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders – separating invariant spatial context from evolving layout state – to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.
[CV-83] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
【Quick Read】: This paper addresses text editing in images, a long-overlooked yet essential problem that requires generating legible characters while preserving semantic consistency, geometric plausibility, and contextual coherence. Mainstream diffusion and multimodal models handle basic pixel-level edits but still struggle to produce logically sound, physically plausible, and layout-aware text replacement or rewriting in complex scenes. The key to the solution is TextEditBench, a new evaluation benchmark that goes beyond conventional image editing to focus on reasoning-intensive text editing, and introduces Semantic Expectation (SE) as a core evaluation dimension that quantifies a model's ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during editing, providing a systematic testbed for text-guided image editing and multimodal reasoning.
Link: https://arxiv.org/abs/2512.16270
Authors: Rui Gui,Yang Wan,Haochen Han,Dongxing Mao,Fangming Liu,Min Li,Alex Jinpeng Wang
Institutions: Central South University; Pengcheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
[CV-84] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning
【Quick Read】: This paper aims to overcome two key bottlenecks that limit the clinical adoption of fluorescence lifetime imaging microscopy (FLIM): long pixel dwell times that cap imaging speed, and low signal-to-noise ratio (SNR) that limits image quality, both of which hinder its translation to real-time, label-free diagnostics. The core of the solution is FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework trained with a conditional generative adversarial network (cGAN) that reconstructs high-resolution images from FLIM data acquired with up to a 5-fold larger pixel size, achieving a super-resolution factor of k = 5 and thus a 25-fold increase in the space-bandwidth product. Compared with diffusion-model alternatives, the approach offers more robust reconstruction and substantially shorter inference times, which favors practical deployment and helps alleviate the SNR limitations of autofluorescence-based FLIM.
Link: https://arxiv.org/abs/2512.16266
Authors: Paloma Casteleiro Costa,Parnian Ghapandar Kashani,Xuhui Liu,Alexander Chen,Ary Portes,Julien Bec,Laura Marcu,Aydogan Ozcan
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Optics (physics.optics)
Comments: 30 Pages, 9 Figures
Abstract:Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM’s effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
[CV-85] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models
【Quick Read】: This paper addresses the limited performance of multi-view crowd counting caused by the scarcity of annotated multi-view data. The key to the solution is two semi-supervised frameworks that introduce ranking constraints, based either on model predictions or on model uncertainty, into the training of multi-view fusion models: the first (vanilla) method requires that predictions made with fewer camera views be no larger than those made with more views, while the second uses prediction-error-guided uncertainty estimation to enforce that model uncertainty be no larger when more views are available. Incorporating these constraints into training in a semi-supervised fashion improves counting accuracy when labeled data is limited.
Link: https://arxiv.org/abs/2512.16243
Authors: Qi Zhang,Yunfei Gong,Zhidan Xie,Zhizi Wang,Antoni B. Chan,Hui Huang
Institutions: Shenzhen University; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures, under review
Abstract:Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, the datasets for multi-view counting have a limited number of multi-view frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data to bypass the annotating step, while another is to propose semi- or weakly-supervised or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks by ranking the multi-view fusion models of different numbers of input views, in terms of the model predictions or the model uncertainties. Specifically, for the first method (vanilla model), we rank the multi-view fusion models’ prediction results of different numbers of camera-view inputs, namely, the model’s predictions with fewer camera views shall not be larger than the predictions with more camera views. For the second method, we rank the estimated model uncertainties of the multi-view fusion models with a variable number of view inputs, guided by the multi-view fusion models’ prediction errors, namely, the model uncertainties with more camera views shall not be larger than those with fewer camera views. These constraints are introduced into the model training in a semi-supervised fashion for multi-view counting with limited labeled data. The experiments demonstrate the advantages of the proposed multi-view model ranking methods compared with other semi-supervised counting methods.
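A minimal sketch of the first ranking constraint, assuming hypothetical per-view-count predictions from a fusion model: a hinge penalty discourages the estimate from fewer views from exceeding the estimate from more views.

```python
import torch

def view_ranking_penalty(pred_fewer: torch.Tensor, pred_more: torch.Tensor,
                         margin: float = 0.0) -> torch.Tensor:
    """Penalize count predictions that violate the ranking constraint:
    the estimate from fewer camera views should not exceed the estimate
    from more camera views (hinge-style, usable on unlabeled frames)."""
    return torch.clamp(pred_fewer - pred_more + margin, min=0.0).mean()

# Hypothetical scene-level count predictions for a batch of unlabeled frames.
pred_2views = torch.tensor([41.0, 55.0, 60.0])
pred_3views = torch.tensor([45.0, 53.0, 66.0])
loss_rank = view_ranking_penalty(pred_2views, pred_3views)
print(loss_rank)  # only the 55 vs 53 violation contributes
```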
[CV-86] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation: A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection
【Quick Read】: This paper aims to improve the accuracy of dermatological diagnosis, which suffers from limited specialist availability and complex clinical presentations, with particular attention to family history, a factor that is often underused in the diagnostic process. The core of the solution is an interpretable, multi-modal AI framework that combines deep learning-based analysis of skin images with structured clinical data, especially detailed family history patterns, by integrating interpretable convolutional neural networks with clinical decision trees that incorporate hereditary risk factors. This improves diagnostic accuracy for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis, and lays out a path toward deployment in real-world healthcare settings and validation through clinical trials.
Link: https://arxiv.org/abs/2512.16235
Authors: Satya Narayana Panda,Vaishnavi Kukkala,Spandana Iyer
Institutions: University of New Haven
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 1 table. Code available at this https URL
Abstract:Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.
[CV-87] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
【Quick Read】: This paper targets three challenges in 3D human reaction generation: high motion fidelity, real-time inference, and autoregressive adaptability for online scenarios, which existing methods fail to satisfy simultaneously. The core of the solution is ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions through a causal context encoder and an MLP-based velocity predictor; its key innovation is Bootstrap Contextual Encoding (BSCE), which during training encodes generated history instead of ground-truth history to alleviate error accumulation in autoregressive generation. The paper also introduces an offline variant, ReMFlow, which achieves state-of-the-art performance with the fastest inference among offline methods.
Link: https://arxiv.org/abs/2512.16234
Authors: Zichen Geng,Zeeshan Hayder,Wei Liu,Hesheng Wang,Ajmal Mian
Institutions: The University of Western Australia; Commonwealth Scientific and Industrial Research Organisation; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of the ground-truth ones, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
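The BSCE idea resembles scheduled sampling: during training, the context encoder sometimes consumes the model's own (detached) predictions instead of ground-truth history. The sketch below illustrates that training pattern with hypothetical linear stand-ins for the causal context encoder and velocity predictor; it is not ARMFlow's actual architecture.

```python
import random
import torch
import torch.nn as nn

encoder = nn.Linear(32, 64)          # stand-in for the causal context encoder
predictor = nn.Linear(64 + 32, 32)   # stand-in for the MLP velocity predictor

def train_step(gt_seq: torch.Tensor, p_bootstrap: float = 0.5) -> torch.Tensor:
    """gt_seq: (T, 32) ground-truth reactor motion features for one sequence."""
    history = gt_seq[0]
    loss = torch.zeros(())
    for t in range(1, gt_seq.shape[0]):
        ctx = encoder(history)
        pred = predictor(torch.cat([ctx, gt_seq[t - 1]], dim=-1))
        loss = loss + ((pred - gt_seq[t]) ** 2).mean()
        # Bootstrap Contextual Encoding: with probability p_bootstrap, feed the
        # model's own prediction back as history so training conditions match
        # autoregressive inference and error accumulation is reduced.
        history = pred.detach() if random.random() < p_bootstrap else gt_seq[t]
    return loss / (gt_seq.shape[0] - 1)

loss = train_step(torch.randn(8, 32))
loss.backward()
```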
[CV-88] Image Compression Using Singular Value Decomposition
【Quick Read】: This paper examines image compression efficiency, i.e., how to reduce the storage and transmission cost of image data by mathematical means. The key idea is to compress images with singular value decomposition (SVD) and low-rank matrix approximation, approximating the original image matrix by keeping only the dominant singular values. While this reduces redundancy in theory, the evaluation shows that its compression ratio and error behavior remain clearly inferior to mainstream codecs such as JPEG, JPEG2000, and WEBP; at low tolerated error levels the compressed representation can even exceed the size of the original image, indicating that SVD-based low-rank approximation is not competitive with industry-standard codecs in practice.
Link: https://arxiv.org/abs/2512.16226
Authors: Justin Jiang
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
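A minimal sketch of the rank-k SVD approximation the paper evaluates, on a hypothetical grayscale image array; storing the truncated factors is what yields (or, as the paper finds, often fails to yield) compression.

```python
import numpy as np

def svd_compress(img: np.ndarray, k: int):
    """Rank-k approximation of a grayscale image via truncated SVD."""
    U, s, Vt = np.linalg.svd(img.astype(np.float64), full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k]
    # Values stored by the low-rank factors vs. the raw pixel count.
    stored = k * (img.shape[0] + img.shape[1] + 1)
    return approx, stored / img.size

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256)).astype(np.float64)  # placeholder image;
# natural images have faster-decaying spectra and compress better than noise.
approx, ratio = svd_compress(img, k=50)
rel_err = np.linalg.norm(img - approx) / np.linalg.norm(img)  # relative Frobenius error
print(f"stored fraction: {ratio:.2f}, relative error: {rel_err:.3f}")
```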
[CV-89] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models
【Quick Read】: This paper addresses the strong influence of initial noise quality on single-view novel view synthesis (NVS) models, specifically how to learn to produce high-quality initial noise from random Gaussian noise so as to improve the generation quality of diffusion models. The key contributions are: first, a discretized Euler inversion method that injects image semantics into random noise, enabling the construction of paired datasets of random and high-quality noise; second, a learning framework based on an encoder-decoder network (EDN) that directly maps random noise to high-quality noise. The EDN can be plugged into various NVS models such as SV3D and MV-Adapter, yielding significant quality gains across multiple datasets.
Link: https://arxiv.org/abs/2512.16219
Authors: Zhihao Zhang,Xuejun Yang,Weihua Liu,Mouquan Shen
Institutions: Nanjing Tech University; Yongjiang Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 16 pages, 9 figures
Abstract:Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: this https URL.
[CV-90] Enhanced 3D Shape Analysis via Information Geometry
【Quick Read】: This paper tackles a key difficulty in comparing 3D point cloud shapes: traditional geometric metrics such as the Hausdorff and Chamfer distances fail to capture global statistical structure and are sensitive to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian mixture models (GMMs) can be unbounded or numerically unstable. The key to the solution is an information geometric framework that represents point clouds as GMMs on a statistical manifold and introduces a Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Experiments on human pose discrimination (MPI-FAUST) and animal shape comparison (G-PCD) show that MSKL varies stably and monotonically with geometric change, outperforming traditional distances and existing KL approximations.
Link: https://arxiv.org/abs/2512.16213
Authors: Amit Vishwakarma,K.S. Subrahamanian Moosath
Institutions: Indian Institute of Space Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
Comments:
Abstract:Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.
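Since the exact MSKL modification is defined in the paper, the sketch below only illustrates the baseline it builds on: a Monte Carlo estimate of the symmetric KL divergence between two GMMs fitted to hypothetical point clouds (the KL divergence between GMMs has no closed form).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_kl(p: GaussianMixture, q: GaussianMixture, n: int = 20000) -> float:
    """Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)]."""
    x, _ = p.sample(n)
    return float(np.mean(p.score_samples(x) - q.score_samples(x)))

rng = np.random.default_rng(0)
cloud_a = rng.normal(size=(2000, 3))        # hypothetical point cloud A
cloud_b = rng.normal(size=(2000, 3)) + 0.5  # shifted point cloud B

gmm_a = GaussianMixture(n_components=4, random_state=0).fit(cloud_a)
gmm_b = GaussianMixture(n_components=4, random_state=0).fit(cloud_b)

sym_kl = mc_kl(gmm_a, gmm_b) + mc_kl(gmm_b, gmm_a)
print("symmetric KL estimate:", sym_kl)
```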
[CV-91] Open Ad-hoc Categorization with Contextualized Feature Learning
【Quick Read】: This paper studies open ad-hoc categorization: given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and expand ad-hoc categories without user input, so that AI agents can categorize adaptively as tasks change. The core challenges are distinguishing ad-hoc categories from common ones (such as plants or animals) while retaining generalization and interpretability. The key to the solution is the OAK model, which introduces a small set of learnable context tokens at the input of a frozen CLIP and jointly optimizes CLIP's image-text alignment objective and GCD's visual clustering objective, coupling semantic extension with visual clustering. On the Stanford and Clevr-4 datasets, OAK achieves state-of-the-art accuracy and concept discovery, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%, while producing interpretable saliency maps that focus on hands for Action, faces for Mood, and backgrounds for Location, improving transparency and trust.
Link: https://arxiv.org/abs/2512.16202
Authors: Zilin Wang,Sangwoo Mo,Stella X. Yu,Sima Behpour,Liu Ren
Institutions: University of Michigan; UC Berkeley; Bosch Center for AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 17 figures. Journal reference: CVPR 2025 (DOI: 10.1109/cvpr52734.2025.01407)
Abstract:Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.
[CV-92] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation
【Quick Read】: This paper aims to reduce hallucination in radiology report generation (RRG) with large medical vision-language models (Med-VLMs), which stems from poor cross-modal alignment between visual and linguistic representations and yields reports lacking factual accuracy and visual grounding. The key to the solution is VALOR, a reinforcement learning-based post-alignment framework using Group-Relative Proximal Optimization (GRPO) in two stages: first, textual rewards improve the model's use of clinically precise terminology; second, the vision projection module of the textually grounded model is aligned with disease findings, guiding attention toward diagnostically relevant image regions and substantially improving the factual accuracy and visual grounding of the generated reports.
Link: https://arxiv.org/abs/2512.16201
Authors: Sarosij Bose,Ravi K. Rajendran,Biplob Debnath,Konstantinos Karydis,Amit K. Roy-Chowdhury,Srimat Chakradhar
Institutions: NEC Laboratories America; University of California, Riverside
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
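As a hedged illustration of the group-relative idea in GRPO-style training (names and shapes are hypothetical, not the paper's implementation): rewards for a group of candidate reports sampled for the same study are normalized within the group to form advantages.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G candidate reports sampled for the
    same input study; returns within-group normalized advantages."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

# Hypothetical clinical-accuracy rewards for 4 sampled reports.
r = torch.tensor([0.2, 0.9, 0.5, 0.4])
adv = group_relative_advantages(r)
print(adv)  # positive for above-average reports, negative otherwise
```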
[CV-93] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation
【Quick Read】: This paper addresses the lack of domain-specific focus, controllability, and real-world transferability in existing synthetic human motion datasets, which typically target generic everyday motions with limited control granularity. The key to the solution is Avatar4D, a transferable, end-to-end pipeline for generating customizable synthetic human motion data with fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without any manual annotations. Its high-fidelity 4D (3D geometry over time) sequences can be used to train and evaluate pose estimation models; experiments in sports settings validate supervised learning, zero-shot transfer to real data, and cross-sport generalization, and a feature-space alignment analysis confirms close agreement between synthetic and real data, establishing a scalable, controllable, and transferable recipe for human motion data across domains.
Link: https://arxiv.org/abs/2512.16199
Authors: Jerrin Bright,Zhibo Wang,Dmytro Klepachevskyi,Yuhao Chen,Sirisha Rambhatla,David Clausi,John Zelek
Institutions: Vision and Image Processing Lab; Critical ML Lab; University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.
[CV-94] Towards Closing the Domain Gap with Event Cameras
【Quick Read】: This paper addresses the significant performance drop of conventional cameras under domain shift across lighting conditions (e.g., day versus night). The key to the solution is to replace conventional cameras with event cameras, whose insensitivity to illumination changes allows them to maintain consistent performance across lighting-condition domain gaps without additional adjustment, thereby mitigating the adverse effects of domain shift.
Link: https://arxiv.org/abs/2512.16178
Authors: M. Oltan Sevinc,Liao Wu,Francisco Cruz
Institutions: University of New South Wales; Universidad Central de Chile
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted to Australasian Conference on Robotics and Automation (ACRA), 2025
Abstract:Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on do not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras, we propose event cameras as a potential alternative that can maintain performance across lighting-condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than grayscale frames and providing superior baseline performance in cross-domain scenarios.
[CV-95] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation
【Quick Read】: This paper addresses the performance degradation of prompt tuning for vision-language models in unsupervised domain adaptation (UDA), where existing methods align only marginal distributions and neglect conditional distribution discrepancies, causing class prototype misalignment and weakened semantic discriminability. The key to the solution is Class-Centric Dual-Alignment Generative Prompt Adaptation (C-DGPA), a dual-branch architecture that jointly optimizes marginal and conditional distribution alignment: the marginal branch uses a dynamic adversarial training framework to bridge marginal discrepancies, while the conditional branch introduces a Class Mapping Mechanism (CMM) to standardize semantic prompt understanding and prevent over-reliance on the source domain, yielding domain-invariant and semantically discriminative representations.
Link: https://arxiv.org/abs/2512.16164
Authors: Chao Li,Dasha Hu,Chengyang Li,Yuming Jiang,Yuncheng Shen
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distributions, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, this work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.
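One standard way to realize adversarial marginal alignment of this kind is a domain discriminator trained through a gradient reversal layer; the sketch below is a generic DANN-style component under that assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lam on the
    way back, so features are trained to fool the domain discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

domain_disc = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))

def marginal_alignment_loss(feats: torch.Tensor, domain_labels: torch.Tensor,
                            lam: float = 1.0) -> torch.Tensor:
    """feats: (B, 512) image features from mixed source/target batches;
    domain_labels: (B,) with 0 = source, 1 = target."""
    logits = domain_disc(GradReverse.apply(feats, lam))
    return nn.functional.cross_entropy(logits, domain_labels)
```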
[CV-96] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation
【Quick Read】: This paper addresses how to effectively aggregate knowledge from 2D foundation models for few-shot 3D part segmentation: existing methods either ignore 3D geometric structure, leading to weak feature learning, or underuse the high-quality grouping cues provided by the Segment Anything Model (SAM), causing under-segmentation and inconsistent part labels. The key to the solution is SegGraph, a propagation scheme over a graph of SAM segments: an atlas-like segment graph is built whose nodes are SAM segments and whose edges encode their spatial relations (overlap/adjacency); 2D foundation model features are modulated and propagated through a graph neural network (GNN), and segment features are mapped to each 3D point via view-direction-weighted fusion that suppresses contributions from low-quality segments, preserving intra-segment semantic consistency while explicitly modeling global geometric structure. This markedly improves segmentation of small parts and boundaries, outperforming all baselines on PartNet-E by at least 6.9% mIoU.
Link: https://arxiv.org/abs/2512.16143
Authors: Yueyang Hu,Haiyong Jiang,Haoxuan Song,Jun Xiao,Hao Pan
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work presents a novel framework for few-shot 3D part segmentation. Recent advances have demonstrated the significant potential of 2D foundation models for low-shot 3D part segmentation. However, it is still an open problem that how to effectively aggregate 2D knowledge from foundation models to 3D. Existing methods either ignore geometric structures for 3D feature learning or neglects the high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. We devise a novel SAM segment graph-based propagation method, named SegGraph, to explicitly learn geometric features encoded within SAM’s segmentation masks. Our method encodes geometric features by modeling mutual overlap and adjacency between segments while preserving intra-segment semantic consistency. We construct a segment graph, conceptually similar to an atlas, where nodes represent segments and edges capture their spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are then propagated via a graph neural network to learn global geometric structures. To enforce intra-segment semantic consistency, we map segment features to 3D points with a novel view-direction-weighted fusion attenuating contributions from low-quality segments. Extensive experiments on PartNet-E demonstrate that our method outperforms all competing baselines by at least 6.9 percent mIoU. Further analysis reveals that SegGraph achieves particularly strong performance on small components and part boundaries, demonstrating its superior geometric understanding. The code is available at: this https URL.
[CV-97] ResDynUNet: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT
【Quick Read】: This paper addresses degraded image quality in dual-spectral CT (DSCT) reconstruction caused by noise, artifacts, and channel imbalance, with particular attention to accurate basis material decomposition and clean final images. The key to the solution is a hybrid framework that couples knowledge-driven and data-driven modules: the oblique projection modification technique (OPMT), chosen for its fast convergence and reliable basis material decomposition, first produces an intermediate solution of the basis material images; a ResDynUNet++ network then refines this intermediate solution. Built on UNet++, the network replaces standard convolutions with residual dynamic convolution blocks, combining the input-adaptive feature extraction of dynamic convolution with the stable training of residual connections, which alleviates channel imbalance and near-interface large artifacts in DSCT and yields high-quality, high-fidelity reconstructions.
Link: https://arxiv.org/abs/2512.16140
Authors: Ze Yuan,Wenbin Li,Shusen Zhao
Institutions: Harbin Institute of Technology, Shenzhen; Southern University of Science and Technology; Detection Institute for Advanced Technology Longhua-Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:
Abstract:We propose a hybrid reconstruction framework for dual-spectral CT (DSCT) that integrates iterative methods with deep learning models. The reconstruction process consists of two complementary components: a knowledge-driven module and a data-driven module. In the knowledge-driven phase, we employ the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data. We select OPMT for this role because of its fast convergence, which allows it to rapidly generate an intermediate solution that successfully achieves basis material decomposition. Subsequently, in the data-driven phase, we introduce a novel neural network, ResDynUNet++, to refine this intermediate solution. The ResDynUNet++ is built upon a UNet++ backbone by replacing standard convolutions with residual dynamic convolution blocks, which combine the adaptive, input-specific feature extraction of dynamic convolution with the stable training of residual connections. This architecture is designed to address challenges like channel imbalance and near-interface large artifacts in DSCT, producing clean and accurate final solutions. Extensive experiments on both synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.
[CV-98] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space WACV2026
【Quick Read】: This paper addresses the automatic detection of behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management such as estrus detection. Because cattle interactions are rare events and no comprehensive behavioral dataset covers them, conventional approaches struggle. The key to the solution is CattleAct, a data-efficient interaction detection method that decomposes interactions into combinations of individual cattle actions: an action latent space is pretrained on a large-scale cattle action dataset and then fine-tuned on rare interactions with contrastive learning, constructing a unified action-interaction latent space that enables accurate interaction detection.
Link: https://arxiv.org/abs/2512.16133
Authors: Ren Nakagawa,Yang Yang,Risa Shinoda,Hiroaki Santo,Kenji Oyama,Fumio Okura,Takenao Ohkawa
Institutions: Kobe University; The University of Osaka
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026
Abstract:This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at this https URL.
[CV-99] Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure AAAI2026
【Quick Read】: This paper examines the privacy risks that machine unlearning introduces in practice, focusing on the overlooked threat to retained data. Prior work has concentrated on the privacy of unlearned data, neglecting the amplified leakage that arises in the dual-view setting, where an adversary can query both the original and the unlearned model. The key to the solution is DVIA (Dual-View Inference Attack), a black-box inference attack that infers membership of retained data by comparing the likelihood ratios of the two models' outputs, requiring no trained attack model and relying on a lightweight likelihood ratio module for efficient inference. This reveals, for the first time, the privacy knowledge gain of the dual-view setting and its significant threat to retained data.
Link: https://arxiv.org/abs/2512.16126
Authors: Lulu Xue,Shengshan Hu,Linqiang Qian,Peijin Guo,Yechao Zhang,Minghui Li,Yanjun Zhang,Dayong Ye,Leo Yu Zhang
Institutions: unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2026
Abstract:Machine unlearning is a newly popularized technique for removing specific training data from a trained model, enabling it to comply with data deletion requests. While it protects the rights of users requesting unlearning, it also introduces new privacy risks. Prior works have primarily focused on the privacy of data that has been unlearned, while the risks to retained data remain largely unexplored. To address this gap, we focus on the privacy risks of retained data and, for the first time, reveal the vulnerabilities introduced by machine unlearning under the dual-view setting, where an adversary can query both the original and the unlearned models. From an information-theoretic perspective, we introduce the concept of privacy knowledge gain and demonstrate that the dual-view setting allows adversaries to obtain more information than querying either model alone, thereby amplifying privacy leakage. To effectively demonstrate this threat, we propose DVIA, a Dual-View Inference Attack, which extracts membership information on retained data using black-box queries to both models. DVIA eliminates the need to train an attack model and employs a lightweight likelihood ratio inference module for efficient inference. Experiments across different datasets and model architectures validate the effectiveness of DVIA and highlight the privacy risks inherent in the dual-view setting.
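To convey the dual-view intuition only (the paper's likelihood ratio module and its calibration are more involved), here is a hedged sketch: with black-box access to both models' class probabilities, a simple statistic contrasts how each view explains a candidate sample, and a threshold calibrated on reference data turns it into a membership decision.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dual_view_statistic(model_orig, model_unl, x: torch.Tensor,
                        y: torch.Tensor) -> torch.Tensor:
    """Log-likelihood of the true label under the original model minus that
    under the unlearned model; querying both views yields more signal than
    querying either model alone (hypothetical statistic, not DVIA itself)."""
    lp_orig = F.log_softmax(model_orig(x), dim=-1).gather(-1, y.unsqueeze(-1))
    lp_unl = F.log_softmax(model_unl(x), dim=-1).gather(-1, y.unsqueeze(-1))
    return (lp_orig - lp_unl).squeeze(-1)

# Usage sketch: scores beyond a threshold calibrated on non-member reference
# samples are predicted as members of the retained training set.
```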
[CV-100] Autoencoder-based Denoising Defense against Adversarial Attacks on Object Detection
【Quick Read】: This paper addresses the performance degradation of deep learning object detectors under adversarial examples in real-world applications such as autonomous driving and security surveillance. The key to the solution is a denoising defense based on a single-layer convolutional autoencoder that reconstructs perturbed images to recover the original feature representations, partially restoring detection performance without retraining the detector. Experiments show roughly a 3.7% recovery in bbox mAP and a 10.8% improvement in bbox mAP@50 without modifying the original model.
Link: https://arxiv.org/abs/2512.16123
Authors: Min Geun Song,Gang Min Kim,Woonmin Kim,Yongsik Kim,Jeonghyun Sim,Sangbeom Park,Huy Kang Kim
Institutions: Korea University; Hacking and Countermeasure Research Lab
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures
Abstract:Deep learning-based object detection models play a critical role in real-world applications such as autonomous driving and security surveillance systems, yet they remain vulnerable to adversarial examples. In this work, we propose an autoencoder-based denoising defense to recover object detection performance degraded by adversarial perturbations. We conduct adversarial attacks using Perlin noise on vehicle-related images from the COCO dataset, apply a single-layer convolutional autoencoder to remove the perturbations, and evaluate detection performance using YOLOv5. Our experiments demonstrate that adversarial attacks reduce bbox mAP from 0.2890 to 0.1640, representing a 43.3% performance degradation. After applying the proposed autoencoder defense, bbox mAP improves to 0.1700 (3.7% recovery) and bbox mAP@50 increases from 0.2780 to 0.3080 (10.8% improvement). These results indicate that autoencoder-based denoising can provide partial defense against adversarial attacks without requiring model retraining.
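A minimal PyTorch sketch of a single-layer convolutional denoising autoencoder of the kind described; channel counts and training details are hypothetical, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One conv layer down, one transposed-conv layer back up; trained to map
    adversarially perturbed images to their clean counterparts."""
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, channels, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.Sigmoid(),  # images normalized to [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training-step sketch: minimize MSE between reconstruction and clean image;
# at inference, detector inputs are passed through the autoencoder first.
ae = DenoisingAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
noisy, clean = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)  # placeholders
loss = nn.functional.mse_loss(ae(noisy), clean)
opt.zero_grad(); loss.backward(); opt.step()
```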
[CV-101] Flexible Camera Calibration using a Collimator System
【Quick Read】: This paper addresses the limited accuracy and flexibility of camera calibration caused by complex motion requirements: conventional methods rely on 6DOF relative motion between the camera and the calibration target, which complicates the procedure and introduces errors. The key to the solution is a newly designed collimator system and an angle invariance constraint derived from its optical geometry, together with a proof that the target-camera relative motion follows a spherical motion model, reducing the original 6DOF motion to 3DOF pure rotation. This constraint substantially simplifies the problem and enables a closed-form linear solver for multiple images and a minimal solver for two images; a further algorithm calibrates from a single collimator image without any camera motion, enabling fast, flexible, and accurate calibration.
Link: https://arxiv.org/abs/2512.16113
Authors: Shunkun Liang,Banglei Guan,Zhenbao Yu,Dongcai Tan,Pengju Sun,Zibin Liu,Qifeng Yu,Yang Shang
Institutions: National University of Defense Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry property of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation motion. Using spherical motion constraint, a closed-form linear solver for multiple images and a minimal solver for two images are proposed for camera calibration. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at this https URL
[CV-102] A Tri-Dynamic Preprocessing Framework for UGC Video Compression ICASSP2024
【Quick Read】: This paper addresses the reduced effectiveness of data-driven machine learning algorithms for encoding optimization on user generated content (UGC) video, whose diversity and high variability exceed those of traditional encoding test sets. The key to the solution is a Tri-Dynamic Preprocessing framework with three adaptive mechanisms: an adaptive factor that regulates preprocessing intensity, an adaptive quantization level that fine-tunes the codec simulator, and an adaptive lambda tradeoff that adjusts the rate-distortion loss, improving encoding optimization in complex UGC scenarios.
Link: https://arxiv.org/abs/2512.16101
Authors: Fei Zhao,Mengxi Guo,Shijie Zhao,Junlin Li,Li Zhang,Xiaodong Xie
Institutions: unknown
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a POSTER and for publication in the ICASSP 2024 proceedings
Abstract:In recent years, user generated content (UGC) has become the dominant force in internet traffic. However, UGC videos exhibit a higher degree of variability and diverse characteristics compared to traditional encoding test videos. This variance challenges the effectiveness of data-driven machine learning algorithms for optimizing encoding in the broader context of UGC scenarios. To address this issue, we propose a Tri-Dynamic Preprocessing framework for UGC. Firstly, we employ an adaptive factor to regulate preprocessing intensity. Secondly, an adaptive quantization level is employed to fine-tune the codec simulator. Thirdly, we utilize an adaptive lambda tradeoff to adjust the rate-distortion loss function. Experimental results on large-scale test sets demonstrate that our method attains exceptional performance.
[CV-103] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
【Quick Read】: This paper targets the slow inference of diffusion models for video generation, in particular the high computational cost and long latency of end-to-end video synthesis. The key to the solution is the TurboDiffusion framework, whose core components are: (1) attention acceleration with low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to cut attention cost; (2) efficient step distillation via rCM to reduce the number of sampling steps; and (3) W8A8 quantization of model parameters and activations to 8 bits, which significantly accelerates linear layers and compresses the model. Experiments show 100-200x end-to-end speedups even on a single RTX 5090 GPU while maintaining video quality comparable to the original models.
Link: https://arxiv.org/abs/2512.16093
Authors: Jintao Zhang,Kaiwen Zheng,Kai Jiang,Haoxu Wang,Ion Stoica,Joseph E. Gonzalez,Jianfei Chen,Jun Zhu
Institutions: Tsinghua University; Shengshu Technology; UC Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at this https URL.
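W8A8 quantization as named here is a generic technique; the sketch below shows per-tensor symmetric 8-bit quantization of weights and activations with a simulated int8 linear layer (scales and shapes are illustrative, not TurboDiffusion's fused kernels).

```python
import torch

def quantize_int8(t: torch.Tensor):
    """Per-tensor symmetric quantization: t ≈ q * scale with q in [-127, 127]."""
    scale = t.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Simulated W8A8 linear layer. Real kernels accumulate int8 products in
    int32; here the accumulation is emulated in float for portability."""
    qx, sx = quantize_int8(x)
    qw, sw = quantize_int8(weight)
    acc = qx.to(torch.float32) @ qw.to(torch.float32).t()
    return acc * (sx * sw)

x, w = torch.randn(4, 64), torch.randn(128, 64)
err = (w8a8_linear(x, w) - x @ w.t()).abs().mean()
print("mean abs error vs fp32:", err.item())
```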
[CV-104] Collimator-assisted high-precision calibration method for event cameras
【Quick Read】: This paper addresses the geometric calibration of event cameras, i.e., accurately determining their intrinsic and extrinsic parameters, particularly in long-range measurement scenarios. The key to the solution is a collimator-based calibration method using flickering star patterns: camera parameters are first solved linearly with the collimator's sphere motion model and then refined by nonlinear optimization, improving calibration accuracy and robustness while remaining suitable for long distances.
Link: https://arxiv.org/abs/2512.16092
Authors: Zibin Liu,Shunkun Liang,Banglei Guan,Dongcai Tan,Yang Shang,Qifeng Yu
Institutions: National University of Defense Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures
Abstract:Event cameras are a new type of brain-inspired visual sensor with advantages such as high dynamic range and high temporal resolution. The geometric calibration of event cameras, which involves determining their intrinsic and extrinsic parameters, particularly in long-range measurement scenarios, remains a significant challenge. To address the dual requirements of long-distance and high-precision measurement, we propose an event camera calibration method utilizing a collimator with flickering star-based patterns. The proposed method first linearly solves camera parameters using the sphere motion model of the collimator, followed by nonlinear optimization to refine these parameters with high precision. Through comprehensive real-world experiments across varying conditions, we demonstrate that the proposed method consistently outperforms existing event camera calibration methods in terms of accuracy and reliability.
[CV-105] LAPX: Lightweight Hourglass Network with Global Context
【Quick Read】: This paper addresses two obstacles to deploying human pose estimation models on edge devices: state-of-the-art models carry large parameter counts and computational cost, while lightweight models often sacrifice accuracy through oversimplified designs. The key to the solution is LAPX, an Hourglass network that builds on the earlier LAP by adding self-attention to capture global context and by refining the stage design and lightweight attention modules, achieving competitive results on MPII and COCO with only 2.3M parameters and real-time inference, which balances accuracy against efficiency and improves edge deployability.
Link: https://arxiv.org/abs/2512.16089
Authors: Haopeng Zhao,Marsha Mariya Kappan,Mahdi Bamdad,Francisco Cruz
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages
Abstract:Human pose estimation is a crucial task in computer vision. Methods that achieve SOTA (state-of-the-art) accuracy often involve a large number of parameters and incur substantial computational cost. Many lightweight variants have been proposed to reduce their model size and computational cost. However, several of these methods still contain components that are not well suited for efficient deployment on edge devices. Moreover, models that primarily emphasize inference speed on edge devices often suffer from limited accuracy due to their overly simplified designs. To address these limitations, we propose LAPX, an Hourglass network with self-attention that captures global contextual information, based on previous work, LAP. In addition to adopting the self-attention module, LAPX advances the stage design and refines the lightweight attention modules. It achieves competitive results on two benchmark datasets, MPII and COCO, with only 2.3M parameters, and demonstrates real-time performance, confirming its edge-device suitability.
[CV-106] Auto-Vocabulary 3D Object Detection
【Quick Read】: This paper moves beyond open-vocabulary 3D object detection, which still depends on user-specified class names, to Auto-Vocabulary 3D Object Detection (AV3DOD): class names are generated automatically for detected objects without any user input. The key to the solution is twofold: a Semantic Score (SS) metric to assess the quality of generated class names, and a framework that leverages 2D vision-language models (VLMs) through image captioning, pseudo 3D box generation, and feature-space semantics expansion to produce rich, accurate class candidates, achieving both precise localization (mAP) and high-quality semantics (SS). The method sets new state-of-the-art results on ScanNetV2 and SUNRGB-D, notably surpassing the prior best method, CoDA.
Link: https://arxiv.org/abs/2512.16077
Authors: Haomeng Zhang,Kuan-Chuan Peng,Suhas Lohit,Raymond A. Yeh
Institutions: Purdue University; Mitsubishi Electric Research Laboratories
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report
Abstract:Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.
[CV-107] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution
【Quick Read】: This paper addresses the difficulty of estimating high angular resolution fiber orientation distributions (HAR-FOD) from single-shell low angular resolution diffusion MRI (LAR-FOD), a clinically relevant problem because multi-shell high angular resolution acquisitions require long scanning times. The key to the solution is a 3D multi-channel patch diffusion model: an FOD-patch adapter injects brain anatomy priors to make patch-based learning more efficient, a voxel-level conditional coordinating module strengthens the model's global understanding, and a spherical harmonic (SH) attention module captures the complex correlations among SH coefficients, enabling efficient and accurate HAR-FOD prediction.
Link: https://arxiv.org/abs/2512.16075
Authors: Hao Tang,Hanyu Liu,Alessandro Perelli,Xi Chen,Chao Li
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Diffusion MRI (dMRI) is a critical non-invasive technique to estimate fiber orientation distribution (FOD) for characterizing white matter integrity. Estimating FOD from single-shell low angular resolution dMRI (LAR-FOD) is limited by accuracy, whereas estimating FOD from multi-shell high angular resolution dMRI (HAR-FOD) requires a long scanning time, which limits its applicability. Diffusion models have shown promise in estimating HAR-FOD based on LAR-FOD. However, using diffusion models to efficiently generate HAR-FOD is challenging due to the large number of spherical harmonic (SH) coefficients in FOD. Here, we propose a 3D multi-channel patch diffusion model to predict HAR-FOD from LAR-FOD. We design the FOD-patch adapter by introducing the prior brain anatomy for more efficient patch-based learning. Furthermore, we introduce a voxel-level conditional coordinating module to enhance the global understanding of the model. We design the SH attention module to effectively learn the complex correlations of the SH coefficients. Our experimental results show that our method achieves the best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.
[CV-108] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving
【Quick Read】: This paper addresses the difficulty of evaluating end-to-end autonomous driving on safety-critical corner cases that are hard to collect in the real world, especially the performance degradation of models under complex interactions; existing adversarial evaluation methods are mostly confined to simplified simulators and lack effective modeling and generation of real-world scenes. The key to the solution is a closed-loop evaluation platform in which a flow-matching-based real-time image generator cooperates with an efficient adversarial surrounding-vehicle policy: the generator stably renders high-fidelity driving images conditioned on traffic environment information, while the policy models challenging interactions to construct corner cases that current systems struggle to handle, enabling quantitative robustness evaluation of end-to-end models such as UniAD and VAD in safety-critical scenarios.
Link: https://arxiv.org/abs/2512.16055
Authors: Jiaheng Geng,Jiatong Du,Xinyu Zhang,Ye Li,Panqu Wang,Yanjun Huang
Institutions: Tongji University; ZERON
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Safety-critical corner cases, difficult to collect in the real world, are crucial for evaluating end-to-end autonomous driving. Adversarial interaction is an effective method to generate such safety-critical corner cases. While existing adversarial evaluation methods are built for models operating in simplified simulation environments, adversarial evaluation for real-world end-to-end autonomous driving has been little explored. To address this challenge, we propose a closed-loop evaluation platform for end-to-end autonomous driving, which can generate adversarial interactions in real-world scenes. In our platform, the real-world image generator cooperates with an adversarial traffic policy to evaluate various end-to-end models trained on real-world data. The generator, based on flow matching, efficiently and stably generates real-world images according to the traffic environment information. The efficient adversarial surrounding vehicle policy is designed to model challenging interactions and create corner cases that current autonomous driving systems struggle to handle. Experimental results demonstrate that the platform can generate realistic driving images efficiently. Through evaluating the end-to-end models such as UniAD and VAD, we demonstrate that based on the adversarial policy, our platform evaluates the performance degradation of the tested model in corner cases. This result indicates that this platform can effectively detect the model’s potential issues, which will facilitate the safety and robustness of end-to-end autonomous driving.
[CV-109] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion
【Quick Read】: This paper addresses the lack of action annotations in video diffusion models for robotic policy learning, which limits their use in embodied tasks. Existing methods either adopt two-stage pipelines with weak cross-modal coupling or adapt a single-modal diffusion model to the joint distribution, failing to fully exploit pretrained video knowledge. The key to the solution is threefold: (1) extending a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge; (2) a Bridge Attention mechanism for effective cross-modal interaction; and (3) an action refinement module that converts coarse actions into the precise controls needed for low-resolution datasets. The framework generates higher-quality videos and more accurate actions, outperforming existing baselines on multiple public benchmarks and real-world datasets.
Link: https://arxiv.org/abs/2512.16023
Authors: Liudi Yang,Yang Bai,George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Ziyuan Liu,Abhinav Valada
Institutions: University of Freiburg; Ludwig Maximilian University of Munich; Munich Center for Machine Learning (MCML); Technical University of Munich; Huawei Heisenberg Research Center (Munich)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures
Abstract:We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot’s joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos, more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.
zh
[CV-110] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings
【速读】:该论文旨在解决传统草坪维护方式对生态多样性的负面影响问题,即通过被动式“再野化”(rewilding)策略难以有效保护和提升花园生物多样性。其解决方案的关键在于构建一个基于视觉感知与自适应决策的机器人割草框架,利用预训练于PlantNet300K数据集的ResNet50网络提取植物图像的深度特征嵌入(deep feature-space embeddings),并通过全局偏离度量(global deviation metric)无监督地估算植被多样性,从而驱动选择性割草算法,在不破坏高多样性区域的前提下动态切换割草与保育行为。该方法首次将嵌入空间中的视觉多样性作为生态丰富度的代理指标,并在真实花园环境中验证了其有效性,为实现城市绿地从单一草坪向多功能生物栖息地转变提供了技术路径。
链接: https://arxiv.org/abs/2512.15993
作者: Lars Beckers,Arno Waes,Aaron Van Campenhout,Toon Goedemé
机构: KU Leuven (鲁汶大学); EAVISE-PSI research group (EAVISE-PSI 研究组)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.
zh
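为便于理解上述"嵌入空间全局偏离度"的多样性估计思路,下面给出一个最简 Python/PyTorch 草图(非论文实现):其中以 torchvision 的 ImageNet 预训练 ResNet50 代替论文所用的 PlantNet300K 预训练权重,阈值 threshold 亦为假设值,需现场标定。

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# 以 torchvision 的 ImageNet 预训练 ResNet50 代替论文中 PlantNet300K 预训练权重(假设)
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # 去掉分类头, 输出 2048 维嵌入
backbone.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

@torch.no_grad()
def diversity_score(patches):
    """patches: 相机帧中裁出的 PIL 图像列表。"""
    x = torch.stack([preprocess(p) for p in patches])
    emb = backbone(x)                                  # [N, 2048]
    centroid = emb.mean(dim=0, keepdim=True)
    # 全局偏离度: 各嵌入到质心的平均距离, 作为多样性代理
    return (emb - centroid).norm(dim=1).mean().item()

def blade_command(patches, threshold=5.0):             # threshold 为假设值
    # 多样性高于阈值 -> 保护性停刀; 否则正常割草
    return "BLADE_OFF" if diversity_score(patches) > threshold else "BLADE_ON"
```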
[CV-111] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在农业决策支持场景中的可靠性问题,即这些通用型模型是否具备足够的准确性和稳定性以替代或辅助专业农业诊断系统。研究通过在AgML数据集上对多种开源与闭源VLM进行系统性基准测试,发现零样本VLM性能显著低于专用任务模型(如YOLO11),且其表现高度依赖于提示策略(如多选提示 vs 开放式提示)和评估方法(如是否引入大语言模型(LLM)语义判别)。解决方案的关键在于:采用受限提示接口、显式标签本体(label ontology)以及领域感知的评估策略,可显著提升VLM在农业场景下的实用性,使其从不可靠的独立诊断工具转变为有效的辅助组件。
链接: https://arxiv.org/abs/2512.15977
作者: Earl Ranario,Mason J. Earles
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Draft version
Abstract:Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
zh
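下面用纯 Python 勾勒摘要中 multiple-choice prompting 的一种可能实现方式(提示模板与解析逻辑均为假设,非论文原始模板),其中 vlm_answer 为假设的模型调用接口:

```python
import string

def build_mc_prompt(class_names, crop_hint="the plant in the image"):
    """将封闭标签本体转成多选提示, 对应摘要中的 multiple-choice prompting 设定(示意)。"""
    letters = string.ascii_uppercase[:len(class_names)]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, class_names))
    prompt = (f"Identify {crop_hint}. Choose exactly one option and reply "
              f"with the letter only.\n{options}")
    return prompt, dict(zip(letters, class_names))

def parse_choice(reply, letter2class):
    for ch in reply.strip().upper():
        if ch in letter2class:
            return letter2class[ch]
    return None   # 无法解析时记为错误; 开放式输出则交由 LLM 语义判别(另行实现)

# 用法示意(vlm_answer 为假设接口):
# prompt, mapping = build_mc_prompt(["healthy", "leaf rust", "powdery mildew"])
# pred = parse_choice(vlm_answer(image, prompt), mapping)
```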
[CV-112] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection
【速读】:该论文旨在解决多光谱目标检测在标注数据稀缺场景下的训练难题,尤其关注如何利用有限的标注数据实现鲁棒的感知能力。其关键解决方案是引入视觉-语言模型(Vision-Language Models, VLMs),通过整合文本、可见光与热红外模态的信息,将VLM中预训练获得的语义先验迁移到未见光谱模态中,从而在少样本甚至全监督条件下均实现优异性能。
链接: https://arxiv.org/abs/2512.15971
作者: Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre
机构: Univ Bretagne Sud, IRISA, UMR 6074, Vannes, France; Univ Rennes, IRISA, UMR 6074, Rennes, France; ATERMES, Montigny-le-Bretonneux, France; UiT The Arctic University of Norway, Tromsø, Norway
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.
zh
[CV-113] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models WACV
【速读】:该论文旨在解决移动机器人在多人环境中从第三人称视角准确预测人类行为的难题。现有研究多集中于单人场景下的第一人称视角行为预测,而实际应用中需理解多个人类与场景之间的交互关系。其解决方案的关键在于提出一种基于视觉语言模型(Vision Language Model, VLM)的框架——CAMP-VLM,该框架融合了来自视觉输入的上下文特征和基于场景图的空间感知信息,从而增强对人类-场景交互的预测能力。此外,由于缺乏适合第三人称视角下多人类行为预测的数据集,作者利用逼真模拟器生成合成数据进行监督微调(Supervised Fine-Tuning, SFT)和直接偏好优化(Direct Preference Optimization, DPO),显著提升了模型在真实和合成场景中的泛化性能,相较最优基线预测准确率提升达66.9%。
链接: https://arxiv.org/abs/2512.15957
作者: Utsav Panchal,Yuchen Liu,Luigi Palmieri,Ilche Georgievski,Marco Aiello
机构: Institute of Architecture of Application Systems, University of Stuttgart (斯图加特大学应用系统架构研究所); Bosch Research (博世研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract:Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of humans-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
zh
[CV-114] The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs WACV2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉感知能力方面缺乏系统性评估的问题,尤其是现有评价方法过度关注端到端任务准确率,而忽视了模型在受控扰动下的鲁棒性、归因真实性及视觉推理能力。其解决方案的关键在于提出“感知观测站”(The Perceptual Observatory)框架,该框架通过构建结构化垂直任务(如人脸匹配、图文理解、局部到全局视觉定位等),并引入像素级增强与扩散风格幻觉扰动,对模型的视觉接地(visual grounding)能力进行系统性测试,从而揭示模型在扰动下是否保持感知一致性与关系结构,为分析当前及未来MLLMs的优劣提供可解释、可量化的基础。
链接: https://arxiv.org/abs/2512.15949
作者: Tejas Anvekar,Fenil Bardoliya,Pavan K. Turaga,Chitta Baral,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026
Abstract:Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals like: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.
zh
[CV-115] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
【速读】:该论文旨在解决当前视觉语言模型(VLMs)在动态环境中缺乏对时空信息进行持续、结构化记忆与推理的能力,从而限制了其在具身任务(如问答和导航)中的表现。解决方案的关键在于提出R4框架,该框架无需训练即可实现4D时空空间中的检索增强推理:通过将对象级语义描述锚定在度量空间和时间中,持续构建一个持久的4D知识数据库,形成可跨智能体共享的世界模型;在推理阶段,自然语言查询被分解为语义、空间和时间键,直接在4D空间中检索相关观测,并将其整合进VLM的推理流程,从而支持情景式和协作式推理。
链接: https://arxiv.org/abs/2512.15940
作者: Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Esslingen University of Applied Sciences (埃斯林根应用科学大学); Dr. Ing. h.c. F. Porsche AG (保时捷股份公司); University of Michigan (密歇根大学); Voxel51 Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM’s reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
zh
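围绕"将查询分解为语义/空间/时间键并直接在 4D 空间检索"这一核心机制,给出一个示意性草图(假设实现):打分权重 w_sem/w_sp/w_t 与数据结构 Obs4D 均为说明用假设,非 R4 官方代码。

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Obs4D:                     # 4D 知识库中的一条对象级观测
    text_emb: np.ndarray         # 语义描述的嵌入向量
    xyz: np.ndarray              # 度量空间坐标 (3,)
    t: float                     # 时间戳 (秒)

def retrieve(db, q_emb, q_xyz=None, q_time=None, k=5,
             w_sem=1.0, w_sp=0.3, w_t=0.1):
    """按语义/空间/时间三类键打分检索, 权重为假设的超参数。"""
    scores = []
    for o in db:
        s = w_sem * float(q_emb @ o.text_emb /
                          (np.linalg.norm(q_emb) * np.linalg.norm(o.text_emb)))
        if q_xyz is not None:
            s -= w_sp * float(np.linalg.norm(o.xyz - q_xyz))   # 距离越近越好
        if q_time is not None:
            s -= w_t * abs(o.t - q_time)
        scores.append(s)
    top = np.argsort(scores)[::-1][:k]
    return [db[i] for i in top]   # 检索结果再拼入 VLM 的上下文中参与推理
```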
[CV-116] SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在性能优异的同时缺乏可解释性和可控性的问题。其核心挑战在于如何从模型内部提取出具有语义意义的特征,并实现对这些特征的精确干预以调控模型行为。解决方案的关键在于提出一种统一的“发现、验证与控制”框架——SALVE(Sparse Autoencoder-Latent Vector Editing),其中利用ℓ₁正则化自编码器无监督地学习稀疏且贴近模型本征结构的特征基底;通过Grad-FAM方法对特征进行可视化定位以验证其与输入数据的语义关联;并基于自编码器结构实施权重空间中的精准、永久性编辑,从而实现对类别判别特征和跨类特征的连续调节。此外,论文还推导出关键抑制阈值α_crit,用于量化模型对主导特征的依赖程度,支持细粒度鲁棒性诊断。
链接: https://arxiv.org/abs/2512.15938
作者: Vegard Flovik
机构: DNV
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified “discover, validate, and control” framework that bridges mechanistic interpretability and model editing. Using an ℓ₁-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder’s structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, α_crit, quantifying each class’s reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
zh
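SALVE 的特征发现与干预可用如下最简 PyTorch 草图说明(假设实现,非官方代码):train_step 对应 ℓ₁ 稀疏自编码训练,edit_activation 以缩放系数 alpha 演示特征抑制/放大的干预思路。

```python
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    """对中间层激活训练的 ℓ1 稀疏自编码器(示意), 学到的 code 即候选特征基。"""
    def __init__(self, d_act, d_code):
        super().__init__()
        self.enc = nn.Linear(d_act, d_code)
        self.dec = nn.Linear(d_code, d_act)

    def forward(self, a):
        z = torch.relu(self.enc(a))
        return self.dec(z), z

def train_step(model, acts, opt, l1=1e-3):
    recon, z = model(acts)
    loss = ((recon - acts) ** 2).mean() + l1 * z.abs().mean()  # 重构误差 + ℓ1 稀疏项
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def edit_activation(model, a, j, alpha):
    """特征干预示意: 将第 j 个特征按系数 alpha 缩放后再解码回激活空间。"""
    z = torch.relu(model.enc(a))
    z[..., j] = alpha * z[..., j]     # alpha=0 即抑制; 论文中的 α_crit 为临界抑制阈值
    return model.dec(z)
```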
[CV-117] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLM s
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在真实世界环境中进行复杂、知识密集型推理与顺序决策能力评估的不足问题。现有基准大多局限于语言中心或仿真环境,难以检验MLLMs在现实城市导航中自主定位、空间推理和路径规划的能力。为填补这一空白,作者提出“稀疏接地视觉导航”(Sparsely Grounded Visual Navigation)任务,并构建了CityNav基准,要求代理仅依赖视觉输入和内部多模态推理,在无额外标注或特殊架构支持下完成50余个决策点的导航。解决方案的关键在于提出“路径表述化”(Verbalization of Path, VoP),通过显式提取MLLM内部的认知地图(关键地标与目标方向),将抽象推理具象化为可操作的空间知识,从而显著提升导航成功率。
链接: https://arxiv.org/abs/2512.15933
作者: Dwip Dalal,Utkarsh Mishra,Narendra Ahuja,Nebojsa Jojic
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent’s internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: this https URL
zh
[CV-118] Large Video Planner Enables Generalizable Robot Control
【速读】:该论文旨在解决通用机器人在多样化任务和环境中的决策建模问题,即如何构建具备跨任务与跨场景泛化能力的机器人基础模型。传统方法通常基于多模态大语言模型(Multimodal Large Language Models, MLLMs)扩展出动作输出模块,形成视觉-语言-动作(Vision-Language-Action, VLA)系统,但其依赖静态图像和语言预训练,难以充分捕捉物理世界中状态与动作的时序关联。本文提出一种替代范式:以大规模视频预训练为核心,利用视频天然蕴含的时空序列信息(spatio-temporal sequences)对齐机器人行为模式,从而实现更贴近真实交互的规划能力。关键创新在于首次在基础模型规模上训练了一个面向生成式机器人规划的开放视频模型(open video model),该模型能零样本生成新场景下的视频计划,并通过后处理提取可执行机器人动作;实验表明其具备强指令跟随能力、任务级泛化性能及真实机器人执行可行性。
链接: https://arxiv.org/abs/2512.15840
作者: Boyuan Chen,Tianyuan Zhang,Haoran Geng,Kiwhan Song,Caiyi Zhang,Peihao Li,William T. Freeman,Jitendra Malik,Pieter Abbeel,Russ Tedrake,Vincent Sitzmann,Yilun Du
机构: MIT(麻省理工学院); UC Berkeley(加州大学伯克利分校); Harvard(哈佛大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 16 figures
Abstract:General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs’ large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at this https URL.
zh
[CV-119] Human-like Working Memory from Artificial Intrinsic Plasticity Neurons
【速读】:该论文旨在解决传统人工神经网络在实现类人工作记忆(working memory)时面临的高能耗与噪声敏感性问题。现有方法如循环神经网络(RNN)和长短期记忆网络(LSTM)虽能模拟工作记忆功能,但其计算架构复杂、功耗高且难以适应动态视觉任务。论文提出了一种软硬件协同设计的类脑神经形态架构IPNet,其核心创新在于利用磁隧道结(Magnetic Tunnel Junction, MTJ)的焦耳热动力学特性,直接在硬件层面模拟神经元内在可塑性(neuronal intrinsic plasticity),从而物理实现具有生物合理性的工作记忆挥发性行为。这一机制使系统在无需参数优化的情况下,在n-back、自由回忆及干扰任务中表现出与人类受试者相似的记忆模式,并在DVS手势识别和自动驾驶转向预测等实际任务中显著优于RNN、LSTM及3D-CNN基线模型,同时实现了2874倍于LSTM的能效提升和20倍于标准泄漏积分发放(Leaky Integrate-and-Fire, LIF)神经元的面积压缩。关键突破在于通过器件级物理机制内生地构建工作记忆功能,而非依赖软件算法或额外存储单元,为低功耗、高效率的近传感处理提供了新范式。
链接: https://arxiv.org/abs/2512.15829
作者: Jingli Liu,Huannan Zheng,Bohao Zou,Kezhou Yang
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Working memory enables the brain to integrate transient information for rapid decision-making. Artificial networks typically replicate this via recurrent or parallel architectures, yet incur high energy costs and noise sensitivity. Here we report IPNet, a hardware-software co-designed neuromorphic architecture realizing human-like working memory via neuronal intrinsic plasticity. Exploiting Joule-heating dynamics of Magnetic Tunnel Junctions (MTJs), IPNet physically emulates biological memory volatility. The memory behavior of the proposed architecture shows similar trends in n-back, free recall and memory interference tasks to that of reported human subjects. Implemented exclusively with MTJ neurons, the architecture with human-like working memory achieves 99.65% accuracy on 11-class DVS gesture datasets and maintains 99.48% on a novel 22-class time-reversed benchmark, outperforming RNN, LSTM, and 2+1D CNN baselines sharing identical backbones. For autonomous driving (DDD-20), IPNet reduces steering prediction error by 14.4% compared to ResNet-LSTM. Architecturally, we identify a ‘Memory-at-the-Frontier’ effect where performance is maximized at the sensing interface, validating a bio-plausible near-sensor processing paradigm. Crucially, all results rely on raw parameters from fabricated devices without optimization. Hardware-in-the-loop validation confirms the system’s physical realizability. Separately, energy analysis reveals a reduction in memory power of 2,874x compared to LSTMs and 90,920x versus parallel 3D-CNNs. This capacitor-free design enables a compact ~1.5 µm² footprint (28 nm CMOS): a 20-fold reduction over standard LIF neurons. Ultimately, we demonstrate that instantiating human-like working memory via intrinsic neuronal plasticity endows neural networks with the dual biological advantages of superior dynamic vision processing and minimal metabolic cost.
zh
[CV-120] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real
【速读】:该论文旨在解决遮挡人脸检测与识别中的数据稀缺(data scarcity)和分布偏移(distribution shift)问题。其解决方案的关键在于提出了一种两阶段的生成式数据增强框架,结合基于规则的口罩变形(rule-based mask warping)与无配对图像到图像翻译(unpaired image-to-image translation)技术,利用生成对抗网络(GANs)生成真实感更强的带口罩人脸样本,从而超越纯合成变换的局限性。此外,论文引入非掩码保留损失(non-mask preservation loss)和随机噪声注入机制以稳定训练过程并提升样本多样性,显著增强了模型在实际场景下的泛化能力。
链接: https://arxiv.org/abs/2512.15774
作者: Yan Yang,George Bebis,Mircea Nicolescu
机构: University of Nevada, Reno (内华达大学雷诺分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 9 figures. Conference version
Abstract:Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.
zh
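摘要未给出 non-mask preservation loss 的具体公式,下面是一种合理猜测的实现草图(假设:仅惩罚口罩区域之外的像素差异,使 GAN 只改写口罩区域),仅作示意:

```python
import torch

def non_mask_preservation_loss(fake, real, mask):
    """
    non-mask preservation loss 的一种可能形式(假设, 非论文原式):
    仅在口罩区域之外惩罚翻译前后图像的差异。
    fake/real: [B,3,H,W]; mask: [B,1,H,W], 口罩区域为 1。
    """
    keep = 1.0 - mask                       # 非口罩区域
    diff = (fake - real).abs() * keep
    return diff.sum() / keep.sum().clamp(min=1.0)

# 训练生成器时与对抗损失相加(权重 lambda_np 为假设超参数):
# g_loss = adv_loss + lambda_np * non_mask_preservation_loss(fake, real, mask)
```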
[CV-121] Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?
【速读】:该论文旨在解决视觉物种识别(Visual Species Recognition, VSR)任务中因标注数据稀缺导致的模型性能瓶颈问题,尤其是在仅提供少量标注样本时如何有效提升识别准确率。传统方法依赖于少样本学习(Few-Shot Learning, FSL)训练专家模型,但其性能受限于数据稀缺与领域知识门槛;而尽管大型多模态模型(Large Multimodal Models, LMMs)在通用识别任务中表现优异,研究发现它们在VSR任务中反而显著逊色于简单的FSL专家模型。论文的关键创新在于揭示了LMM具备对FSL专家模型错误预测进行后验修正(Post-hoc Correction, POC)的能力:通过将专家模型输出的top预测结果、软最大置信度分数及少量视觉示例作为增强提示(enriched prompts)输入LMM,使其能够重新排序候选标签并恢复真实标签。这一洞察催生了无需额外训练或人工干预的POC方法,在五个具有挑战性的VSR基准上实现了比现有FSL方法平均提升6.4%的准确率,且可兼容不同预训练骨干网络和LMM,构成一个即插即用的性能增强模块。
链接: https://arxiv.org/abs/2512.15748
作者: Tian Liu,Anwesha Basu,James Caverlee,Shu Kong
机构: Texas A&M University (德州农工大学); University of Macau (澳门大学); Institute of Collaborative Innovation (协同创新研究院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: website and code: this https URL
Abstract:Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, making it realistic for domain experts to annotate only a few examples. These limited labeled data motivate training an ‘‘expert’’ model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. It is straightforward to ask whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task, despite using various established prompting techniques. LMMs even significantly underperform FSL expert models, which are as simple as finetuning a pretrained visual encoder on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively post-hoc correct the expert models’ incorrect predictions. Briefly, given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model’s top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms prior art of FSL by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes to different pretrained backbones and LMMs, serving as a plug-and-play module to significantly enhance existing FSL methods.
zh
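POC 的关键在于把专家模型的 top 预测与 softmax 置信度写入提示让 LMM 重排序,下面用纯 Python 勾勒一个假设的提示构造函数(格式为示意,非论文模板):

```python
def build_poc_prompt(topk_labels, topk_probs,
                     shots_note="(附: 每个候选类各 1 张少样本示例图)"):
    """
    Post-hoc Correction 的提示构造示意: 把专家模型 top-k 预测与置信度
    连同少样本示例一起交给 LMM 重排序。模板为假设, 非论文原始格式。
    """
    lines = [f"{i + 1}. {lab} (expert confidence: {p:.2f})"
             for i, (lab, p) in enumerate(zip(topk_labels, topk_probs))]
    return ("An expert classifier produced these candidate species for the "
            "query image, with confidence scores:\n" + "\n".join(lines) +
            f"\n{shots_note}\nRe-rank the candidates and answer with the "
            "single most likely species name.")

# 用法: prompt = build_poc_prompt(["Quercus alba", "Quercus rubra"], [0.42, 0.31])
```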
[CV-122] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations
【速读】:该论文旨在解决胸部X光影像中肺炎病灶区域定位依赖昂贵且耗时的像素级标注问题,从而限制了深度学习模型在临床实践中的应用。其解决方案的关键在于提出一种弱监督深度学习框架,利用图像级标签(image-level labels)结合梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)生成具有临床意义的热力图,实现对肺炎病灶区域的自动定位与可视化解释,无需精细标注即可获得可信赖的诊断辅助结果。
链接: https://arxiv.org/abs/2511.00456
作者: Kiran Shahi,Anup Bagale
机构: MBS Survey Software LTD.(MBS 调查软件有限公司); Frontline Hospital(前线医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:Chest X-ray imaging is commonly used to diagnose pneumonia, but accurately localizing the pneumonia-affected regions typically requires detailed pixel-level annotations, which are costly and time consuming to obtain. To address this limitation, this study proposes a weakly supervised deep learning framework for pneumonia classification and localization using Gradient-weighted Class Activation Mapping (Grad-CAM). Instead of relying on costly pixel-level annotations, the proposed method utilizes image-level labels to generate clinically meaningful heatmaps that highlight pneumonia-affected regions. Furthermore, we evaluate seven pre-trained deep learning models, including a Vision Transformer, under identical training conditions, using focal loss and patient-wise splits to prevent data leakage. Experimental results suggest that all models achieved high classification accuracy (96–98%), with ResNet-18 and EfficientNet-B0 showing the best overall performance and MobileNet-V3 providing an efficient lightweight alternative. Grad-CAM heatmap visualizations confirm that the proposed methods focus on clinically relevant lung regions, supporting the use of explainable AI for radiological diagnostics. Overall, this work highlights the potential of weakly supervised, explainable models that enhance transparency and clinical trust in AI-assisted pneumonia screening.
zh
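Grad-CAM 本身是标准技术,下面给出一个可运行的 PyTorch 最简实现(以 torchvision ResNet-18 的 layer4 为目标层;随机初始化的权重与目标层选择均为假设,实际应替换为论文微调后的模型):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(num_classes=2)    # 肺炎二分类(权重应换成实际微调结果)
model.eval()

feats, grads = {}, {}
layer = model.layer4                      # 取最后一个卷积 stage 作为目标层
layer.register_forward_hook(lambda m, i, o: feats.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def grad_cam(x, cls=None):
    """x: [1,3,H,W]; 返回与输入同尺寸的 [H,W] 热力图。"""
    logits = model(x)                                 # 前向时 hook 缓存目标层特征
    cls = int(logits.argmax(1)) if cls is None else cls
    model.zero_grad()
    logits[0, cls].backward()                         # 反向时 hook 缓存梯度
    w = grads["v"].mean(dim=(2, 3), keepdim=True)     # 梯度的全局平均 = 通道权重
    cam = F.relu((w * feats["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max().clamp(min=1e-8))[0, 0].detach()
```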
[CV-123] Machine Learning Enabled Graph Analysis of Particulate Composites: Application to Solid-state Battery Cathodes
【速读】:该论文旨在解决如何从高通量多模态X射线显微图像中提取物理洞察并建立局部微结构-性能关系的问题,以指导 particulate composites(颗粒复合材料)的微结构优化。其解决方案的关键在于开发了一种基于机器学习(Machine Learning, ML)的框架,能够将实验获取的多模态X射线图像自动转化为可扩展、拓扑感知的图结构(topology-aware graphs),从而在粒子级和网络级揭示微结构与功能之间的关联。该方法成功验证了三相界(triple phase junctions)及离子/电子共传导通道在固态锂离子电池正极材料中对局部电化学活性的关键作用,为连接多模态实验成像与功能理解提供了新的图表示范式,并推动了面向微结构的数据驱动材料设计。
链接: https://arxiv.org/abs/2512.16085
作者: Zebin Li,Shimao Deng,Yijin Liu,Jia-Mian Hu
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Particulate composites underpin many solid-state chemical and electrochemical systems, where microstructural features such as multiphase boundaries and inter-particle connections strongly influence system performance. Advances in X-ray microscopy enable capturing large-scale, multimodal images of these complex microstructures with an unprecedentedly high throughput. However, harnessing these datasets to discover new physical insights and guide microstructure optimization remains a major challenge. Here, we develop a machine learning (ML) enabled framework that enables automated transformation of experimental multimodal X-ray images of multiphase particulate composites into scalable, topology-aware graphs for extracting physical insights and establishing local microstructure-property relationships at both the particle and network level. Using the multiphase particulate cathode of solid-state lithium batteries as an example, our ML-enabled graph analysis corroborates the critical role of triple phase junctions and concurrent ion/electron conduction channels in realizing desirable local electrochemical activity. Our work establishes graph-based microstructure representation as a powerful paradigm for bridging multimodal experimental imaging and functional understanding, and facilitating microstructure-aware data-driven materials design in a broad range of particulate composites.
zh
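"颗粒复合材料 → 拓扑感知图"这一转换可用如下 numpy + networkx 草图示意(假设输入为已完成分割的整数标注体;论文完整流程还包含 ML 分割与多模态属性提取,此处从略):

```python
import numpy as np
import networkx as nx

def particles_to_graph(labels):
    """
    labels: 整数标注图/体 (0 为背景, 正整数为颗粒 ID)。
    通过相邻体素的标签对提取颗粒接触关系, 得到拓扑图(示意)。
    """
    G = nx.Graph()
    ids, counts = np.unique(labels[labels > 0], return_counts=True)
    for i, c in zip(ids, counts):
        G.add_node(int(i), voxels=int(c))            # 节点属性: 颗粒体积(体素数)
    for axis in range(labels.ndim):                  # 沿每个轴比较相邻体素
        a = np.take(labels, range(labels.shape[axis] - 1), axis=axis)
        b = np.take(labels, range(1, labels.shape[axis]), axis=axis)
        touch = (a != b) & (a > 0) & (b > 0)
        for u, v in zip(a[touch].ravel(), b[touch].ravel()):
            G.add_edge(int(u), int(v))
    return G

# g = particles_to_graph(segmentation)   # 之后即可做连通性/渗流等网络级分析
```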
[CV-124] MCR-VQGAN: A Scalable and Cost-Effective Tau PET Synthesis Approach for Alzheimer's Disease Imaging
【速读】:该论文旨在解决tau正电子发射断层扫描(Tau PET)在阿尔茨海默病(Alzheimer’s disease, AD)临床应用中面临的挑战,包括辐射暴露、设备可及性差、临床工作量大及高昂成本等问题。为克服这些限制,作者提出了一种多尺度CBAM残差向量量化生成对抗网络(Multi-scale CBAM Residual Vector Quantized Generative Adversarial Network, MCR-VQGAN),通过结构化T1加权磁共振成像(T1-weighted MRI)合成高保真度的tau PET图像。其解决方案的关键在于对标准VQGAN架构进行三项核心改进:引入多尺度卷积以增强跨尺度特征提取能力、嵌入ResNet块以改善梯度传播和特征复用、以及集成卷积块注意力模块(Convolutional Block Attention Module, CBAM)以提升空间与通道维度的注意力机制。实验表明,MCR-VQGAN在多个图像质量指标上优于cGAN、WGAN-GP、CycleGAN和VQGAN,并且合成图像在AD分类任务中的表现与真实tau PET图像相当,验证了其在保留诊断相关特征方面的有效性。
链接: https://arxiv.org/abs/2512.15947
作者: Jin Young Kim,Jeremy Hudson,Jeongchul Kim,Qing Lyu,Christopher T. Whitlow
机构: Wake Forest University School of Medicine (维克森林大学医学院); Wake Forest School of Medicine (维克森林大学医学院); Yale School of Medicine (耶鲁大学医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures. A preliminary version of this work was presented at RSNA 2025
Abstract:Tau positron emission tomography (PET) is a critical diagnostic modality for Alzheimer’s disease (AD) because it visualizes and quantifies neurofibrillary tangles, a hallmark of AD pathology. However, its widespread clinical adoption is hindered by significant challenges, such as radiation exposure, limited availability, high clinical workload, and substantial financial costs. To overcome these limitations, we propose Multi-scale CBAM Residual Vector Quantized Generative Adversarial Network (MCR-VQGAN) to synthesize high-fidelity tau PET images from structural T1-weighted MRI scans. MCR-VQGAN improves standard VQGAN by integrating three key architectural enhancements: multi-scale convolutions, ResNet blocks, and Convolutional Block Attention Modules (CBAM). Using 222 paired structural T1-weighted MRI and tau PET scans from Alzheimer’s Disease Neuroimaging Initiative (ADNI), we trained and compared MCR-VQGAN with cGAN, WGAN-GP, CycleGAN, and VQGAN. Our proposed model achieved superior image synthesis performance across all metrics: MSE of 0.0056 ± 0.0061, PSNR of 24.39 ± 4.49 dB, and SSIM of 0.9000 ± 0.0453. To assess the clinical utility of the synthetic images, we trained and evaluated a CNN-based AD classifier. The classifier achieved comparable accuracy when tested on real (63.64%) and synthetic (65.91%) images. This result indicates that our synthesis process successfully preserves diagnostically relevant features without significant information loss. Our results demonstrate that MCR-VQGAN can offer a reliable and scalable surrogate for conventional tau PET imaging, potentially improving the accessibility and scalability of tau imaging biomarkers for AD research and clinical workflows.
zh
[CV-125] In search of truth: Evaluating concordance of AI-based anatomy segmentation models
【速读】:该论文旨在解决在缺乏金标准(ground truth)标注的情况下,对多种基于人工智能(AI)的解剖结构分割模型进行有效评估的难题。其关键解决方案在于提出一个实用的框架,通过将不同模型的分割结果统一转化为一种标准化、可互操作的表示形式(standard, interoperable representation),从而实现结构层面的一致性标签管理和跨模型比较;同时扩展3D Slicer和集成OHIF Viewer以支持交互式摘要图表与浏览器端可视化,显著提升了模型性能审查效率与准确性。
链接: https://arxiv.org/abs/2512.15921
作者: Lena Giebeler,Deepa Krishnaswamy,David Clunie,Jakob Wasserthal,Lalith Kumar Shiyam Sundar,Andres Diaz-Pinto,Klaus H. Maier-Hein,Murong Xu,Bjoern Menze,Steve Pieper,Ron Kikinis,Andrey Fedorov
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: AI-based methods for anatomy segmentation can help automate characterization of large imaging datasets. The growing number of models with similar functionality raises the challenge of evaluating them on datasets that do not contain ground truth annotations. We introduce a practical framework to assist in this task. Approach: We harmonize the segmentation results into a standard, interoperable representation, which enables consistent, terminology-based labeling of the structures. We extend 3D Slicer to streamline loading and comparison of these harmonized segmentations, and demonstrate how the standard representation simplifies review of the results with interactive summary plots and browser-based visualization using OHIF Viewer. To demonstrate the utility of the approach we apply it to evaluating segmentation of 31 anatomical structures (lungs, vertebrae, ribs, and heart) by six open-source models - TotalSegmentator 1.5 and 2.6, Auto3DSeg, MOOSE, MultiTalent, and CADS - for a sample of Computed Tomography (CT) scans from the publicly available National Lung Screening Trial (NLST) dataset. Results: We demonstrate the utility of the framework in automating loading, structure-wise inspection, and comparison across models. Preliminary results confirm the practical utility of the approach, allowing quick detection and review of problematic results. The comparison shows excellent agreement when segmenting some structures (e.g., lungs) but not all (e.g., some models produce invalid vertebrae or rib segmentations). Conclusions: The resources developed are linked from this https URL, including segmentation harmonization scripts, summary plots, and visualization tools. This work assists in model evaluation in the absence of ground truth, ultimately enabling informed model selection.
zh
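在无金标准时,最直接的一致性度量是模型两两 Dice 的均值。下面给出一个说明性草图(假设各模型输出已对齐到同一结构标签,与论文的标准化表示思路一致):

```python
import numpy as np
from itertools import combinations

def dice(a, b):
    """a, b: 同一结构的二值分割掩码 (布尔数组)。"""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom > 0 else np.nan

def concordance(masks):
    """
    masks: {模型名: 掩码}。无金标准时, 用模型两两 Dice 的均值
    衡量该结构上的一致性(示意), 低一致性条目再人工复核。
    """
    pairs = {(m, n): dice(masks[m], masks[n])
             for m, n in combinations(sorted(masks), 2)}
    return float(np.nanmean(list(pairs.values()))), pairs
```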
[CV-126] BioimageAIpub: a toolbox for AI-ready bioimaging data publishing
【速读】:该论文旨在解决生物图像分析领域中数据获取与利用效率低下的问题,即研究人员难以从现有生物成像设施外获取高质量、结构化标注的图像数据集,且已有的公共数据存储库(如Image Data Resource和BioImage Archive)提供的数据通常需大量预处理才能被图像分析工具直接使用,导致研究者耗费大量时间进行数据整理与转换。解决方案的关键在于提出一种名为BioimageAIpub的工作流,该工作流可自动化完成生物图像数据的标准化转换,并无缝上传至HuggingFace平台,从而显著提升数据可用性并促进生成式AI (Generative AI) 等先进分析方法的开发与共享。
链接: https://arxiv.org/abs/2512.15820
作者: Stefan Dvoretskii,Anwai Archit,Constantin Pape,Josh Moore,Marco Nolden
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern bioimage analysis approaches are data-hungry, making it necessary for researchers to scavenge data beyond those collected within their (bio)imaging facilities. In addition to scale, bioimaging datasets must be accompanied by suitable, high-quality annotations and metadata. Although established data repositories such as the Image Data Resource (IDR) and BioImage Archive offer rich metadata, their contents typically cannot be directly consumed by image analysis tools without substantial data wrangling. Such tedious assembly and conversion of (meta)data can consume a substantial amount of researchers’ time, hindering the development of more powerful analysis tools. Here, we introduce BioimageAIpub, a workflow that streamlines bioimaging data conversion, enabling a seamless upload to HuggingFace, a widely used platform for sharing machine learning datasets and models.
zh
[CV-127] Foundation Models in Biomedical Imaging: Turning Hype into Reality
【速读】:该论文旨在解决基础模型(Foundation Models, FMs)在生物医学成像领域临床应用中面临的评估与部署难题,特别是其潜在能力与当前现实之间的显著差距。论文指出,尽管FMs具备超越狭义模式识别、模拟复杂临床推理和整合多模态数据的能力,但其实际落地受限于信任度不足、偏见、安全性等问题,且现有验证框架缺乏临床相关性和严谨性。解决方案的关键在于突破统计相关性的局限,转向因果推断(causal inference),构建兼具因果意识(causally aware)、可验证安全(verifiably safe)且融合人类专家知识的混合系统(hybrid systems),从而实现对临床实践的真正辅助而非替代。
链接: https://arxiv.org/abs/2512.15808
作者: Amgad Muneer,Kai Zhang,Ibraheem Hamdi,Rizwan Qureshi,Muhammad Waqas,Shereen Fouad,Hazrat Ali,Syed Muhammad Anwar,Jia Wu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 figures and 3 tables
Abstract:Foundation models (FMs) are driving a prominent shift in artificial intelligence across different domains, including biomedical imaging. These models are designed to move beyond narrow pattern recognition towards emulating sophisticated clinical reasoning, understanding complex spatial relationships, and integrating multimodal data with unprecedented flexibility. However, a critical gap exists between this potential and the current reality, where the clinical evaluation and deployment of FMs are hampered by significant challenges. Herein, we critically assess the current state-of-the-art, analyzing hype by examining the core capabilities and limitations of FMs in the biomedical domain. We also provide a taxonomy of reasoning, ranging from emulated sequential logic and spatial understanding to the integration of explicit symbolic knowledge, to evaluate whether these models exhibit genuine cognition or merely mimic surface-level patterns. We argue that a critical frontier lies beyond statistical correlation, in the pursuit of causal inference, which is essential for building robust models that understand cause and effect. Furthermore, we discuss the paramount issues in deployment stemming from trustworthiness, bias, and safety, dissecting the challenges of algorithmic bias, data bias and privacy, and model hallucinations. We also draw attention to the need for more inclusive, rigorous, and clinically relevant validation frameworks to ensure their safe and ethical application. We conclude that while the vision of autonomous AI-doctors remains distant, the immediate reality is the emergence of powerful technology and assistive tools that would benefit clinical practice. The future of FMs in biomedical imaging hinges not on scale alone, but on developing hybrid, causally aware, and verifiably safe systems that augment, rather than replace, human expertise.
zh
人工智能
[AI-0] Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning
【速读】:该论文旨在解决预训练策略(pretrained policy)对强化学习(Reinforcement Learning, RL)微调性能的影响问题,特别是如何确保预训练策略作为RL微调的初始值时具备有效性。传统方法通常采用行为克隆(Behavioral Cloning, BC)进行预训练,但研究表明,BC可能无法覆盖演示者的所有动作,这会阻碍后续RL微调的效果。论文提出的关键解决方案是:不直接拟合观测到的示范数据,而是通过建模给定演示数据集下演示者行为的后验分布(posterior distribution),得到一种新的预训练策略——后验行为克隆(Posterior Behavioral Cloning, PostBC)。该策略在保证预训练性能不低于BC的基础上,显著提升了动作覆盖率,从而为RL微调提供更有效的初始化,且在机器人控制任务中通过标准监督学习即可实现,并展现出优于BC的微调性能。
链接: https://arxiv.org/abs/2512.16911
作者: Andrew Wagenmaker,Perry Dong,Raymond Tsao,Chelsea Finn,Sergey Levine
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) – which trains a policy to directly match the actions played by the demonstrator – can fail to ensure coverage over the demonstrator’s actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator’s behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator’s actions, enabling more effective finetuning. Furthermore, this policy – which we refer to as the posterior behavioral cloning (PostBC) policy – achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains – relying only on standard supervised learning – and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
zh
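用一个离散表格化的玩具例子可以直观说明 BC 与 PostBC 在"动作覆盖"上的差别(仅为示意,论文处理的是带函数逼近的生成式策略;Dirichlet 先验参数 alpha 为假设):

```python
import numpy as np

def bc_policy(action_counts):
    """标准 BC: 直接拟合经验分布, 未出现在示范中的动作概率为 0, 可能失去覆盖性。"""
    c = np.asarray(action_counts, dtype=float)
    return c / c.sum()

def postbc_policy(action_counts, alpha=1.0):
    """
    PostBC 思想的表格化玩具版(假设): 用 Dirichlet(alpha) 先验对示范动作计数
    做后验预测, 所有动作概率严格为正, 从而保证覆盖。
    """
    c = np.asarray(action_counts, dtype=float)
    return (c + alpha) / (c.sum() + alpha * len(c))

counts = [7, 3, 0, 0]                  # 某状态下 10 条示范的动作计数
print(bc_policy(counts))               # [0.7 0.3 0.  0. ] -> 零概率动作无法被 RL 探索
print(postbc_policy(counts))           # 全部动作概率 > 0, 适合作为 RL 微调初始化
```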
[AI-1] he Social Responsibility Stack: A Control-Theoretic Architecture for Governing Socio-Technical AI
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)系统在医疗决策支持、自动驾驶车辆和公共部门等关键领域部署时,缺乏可执行的工程机制以持续嵌入社会价值观的问题。现有负责任AI和治理框架虽提供重要规范原则,但难以在系统全生命周期内实现有效落地。其解决方案的核心是提出社会职责栈(Social Responsibility Stack, SRS),一个六层架构框架,将社会价值转化为显式约束、防护机制、行为接口、审计手段与治理流程,并通过闭环监督控制模型整合设计期防护与运行期监控及制度性监管,从而实现公平性、自主性、认知负担和解释质量等指标的持续监测与强制执行。
链接: https://arxiv.org/abs/2512.16873
作者: Otman A. Basir
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence systems are increasingly deployed in domains that shape human behaviour, institutional decision-making, and societal outcomes. Existing responsible AI and governance efforts provide important normative principles but often lack enforceable engineering mechanisms that operate throughout the system lifecycle. This paper introduces the Social Responsibility Stack (SRS), a six-layer architectural framework that embeds societal values into AI systems as explicit constraints, safeguards, behavioural interfaces, auditing mechanisms, and governance processes. SRS models responsibility as a closed-loop supervisory control problem over socio-technical systems, integrating design-time safeguards with runtime monitoring and institutional oversight. We develop a unified constraint-based formulation, introduce safety-envelope and feedback interpretations, and show how fairness, autonomy, cognitive burden, and explanation quality can be continuously monitored and enforced. Case studies in clinical decision support, cooperative autonomous vehicles, and public-sector systems illustrate how SRS translates normative objectives into actionable engineering and operational controls. The framework bridges ethics, control theory, and AI governance, providing a practical foundation for accountable, adaptive, and auditable socio-technical AI systems.
zh
[AI-2] Sequencing to Mitigate Catastrophic Forgetting in Continual Learning
【速读】:该论文旨在解决持续学习(Continual Learning)中的灾难性遗忘(Catastrophic Forgetting, CF)问题,即模型在学习新任务时导致对先前任务性能显著下降的现象。其解决方案的关键在于提出一种基于最优任务序列规划的方法,通过智能排序任务顺序来缓解CF。该方法借鉴神经架构搜索(Neural Architecture Search, NAS)中的零样本评分算法,自动确定任务的最优呈现顺序,从而提升模型在多任务场景下的知识保留能力。实验表明,该策略不仅能显著减少遗忘,还能与传统持续学习方法结合,进一步增强模型的性能和鲁棒性。
链接: https://arxiv.org/abs/2512.16871
作者: Hesham G. Moussa,Aroosa Hameed,Arashmid Akhavain
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The Manuscript is submitted for review under IEEE Transactions on Artificial intelligence
Abstract:To cope with real-world dynamics, an intelligent system needs to incrementally acquire, update, and exploit knowledge throughout its lifetime. This ability, known as Continual learning, provides a foundation for AI systems to develop themselves adaptively. Catastrophic forgetting is a major challenge to the progress of Continual Learning approaches, where learning a new task usually results in a dramatic performance drop on previously learned ones. Many approaches have emerged to counteract the impact of CF. Most of the proposed approaches can be categorized into five classes: replay-based, regularization-based, optimization-based, representation-based, and architecture-based. In this work, we approach the problem from a different angle, specifically by considering the optimal sequencing of tasks as they are presented to the model. We investigate the role of task sequencing in mitigating CF and propose a method for determining the optimal task order. The proposed method leverages zero-shot scoring algorithms inspired by neural architecture search (NAS). Results demonstrate that intelligent task sequencing can substantially reduce CF. Moreover, when combined with traditional continual learning strategies, sequencing offers enhanced performance and robustness against forgetting. Additionally, the presented approaches can find applications in other fields, such as curriculum learning.
zh
[AI-3] Semi-Supervised Online Learning on the Edge by Transforming Knowledge from Teacher Models
【速读】:该论文旨在解决在线边缘机器学习(Online Edge ML)中的核心挑战:如何为真正未来、未见过的数据点确定标签。传统方法依赖静态模型训练与部署,难以应对动态变化的数据分布,而在线边缘学习要求模型能在边缘设备上持续更新。解决方案的关键在于提出知识转化(Knowledge Transformation, KT)机制,该机制融合知识蒸馏(Knowledge Distillation)、主动学习(Active Learning)与因果推理(Causal Reasoning),通过教师模型的知识迁移生成伪标签,从而指导学生模型的持续训练。KT在稳定教师模型条件下可使学生模型逼近最优性能,适用于教师任务具有通用性(如可复用预训练模型)或学生任务标签获取困难的场景。
链接: https://arxiv.org/abs/2512.16866
作者: Jiabin Xue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Edge machine learning (Edge ML) enables training ML models using the vast data distributed across network edges. However, many existing approaches assume static models trained centrally and then deployed, making them ineffective against unseen data. To address this, Online Edge ML allows models to be trained directly on edge devices and updated continuously with new data. This paper explores a key challenge of Online Edge ML: “How to determine labels for truly future, unseen data points”. We propose Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning. In short, KT acts as the oracle in active learning by transforming knowledge from a teacher model to generate pseudo-labels for training a student model. To verify the validity of the method, we conducted simulation experiments with two setups: (1) a less stable teacher model and (2) a relatively more stable teacher model. Results indicate that when a stable teacher model is given, the student model can eventually reach its expected maximum performance. KT is potentially beneficial in scenarios where: (1) the teacher’s task is generic, meaning existing pre-trained models may be adequate, so there is no need to train the teacher model from scratch; and/or (2) labels for the student’s task are difficult or expensive to acquire.
zh
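KT 的"教师产生伪标签、置信度筛选后训练学生"这一循环可写成如下 PyTorch 草图(假设为分类任务,阈值 tau 为假设超参数,非论文官方实现):

```python
import torch
import torch.nn.functional as F

def kt_update(student, teacher, x, opt, tau=0.8):
    """
    KT 在线更新的最简假设形式: 教师对新到数据产生伪标签,
    仅保留置信度高于 tau 的样本来训练边缘端学生模型。
    """
    teacher.eval()
    with torch.no_grad():
        probs = F.softmax(teacher(x), dim=1)
        conf, pseudo = probs.max(dim=1)
    keep = conf >= tau                      # 置信度筛选, 相当于主动学习中由教师把关
    if keep.sum() == 0:
        return None
    student.train()
    loss = F.cross_entropy(student(x[keep]), pseudo[keep])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# 在线流式使用: for x_batch in edge_stream: kt_update(student, teacher, x_batch, opt)
```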
[AI-4] ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning
【速读】:该论文旨在解决长时程机器人操作(long-horizon manipulation)这一长期存在的挑战,即如何让机器人在复杂任务中实现稳定、可靠的多步骤操作。解决方案的关键在于提出一个名为ReinforceGen的系统,其核心是将任务分解为多个局部技能(localized skills),并通过运动规划(motion planning)连接这些技能;同时利用10次人类示范生成的数据进行模仿学习(imitation learning)训练初始策略,并通过基于强化学习(reinforcement learning)的在线适应与微调优化各组件性能,从而显著提升任务成功率和鲁棒性。
链接: https://arxiv.org/abs/2512.16861
作者: Zihan Zhou,Animesh Garg,Ajay Mandlekar,Caelan Garrett
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contributes to an 89% average performance increase. More results and videos available in this https URL
zh
[AI-5] Distributional AGI Safety
【速读】:该论文试图解决当前AI安全与对齐研究主要聚焦于单个AI系统的防护,而忽视了“拼凑式人工通用智能”(patchwork AGI)这一潜在风险的问题——即多个具备互补能力的子AGI个体代理通过协调合作形成整体通用智能的可能性。解决方案的关键在于提出一种分布式的AGI安全框架,其核心是构建虚拟代理沙箱经济(virtual agentic sandbox economies),这些沙箱具有不可渗透或半渗透特性,通过强健的市场机制规范代理间交易,并结合可审计性、声誉管理和监督机制,以缓解群体层面的协同风险。
链接: https://arxiv.org/abs/2512.16856
作者: Nenad Tomašev,Matija Franklin,Julian Jacobs,Sébastien Krier,Simon Osindero
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centers on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.
zh
[AI-6] TOGGLE: Temporal Logic-Guided Large Language Model Compression for Edge
链接: https://arxiv.org/abs/2512.16855
作者: Khurram Khalil,Khaza Anuarul Hoque
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Published in the IEEE ICCAD 2025 conference
[AI-7] PrivateXR: Defending Privacy Attacks in Extended Reality Through Explainable AI-Guided Differential Privacy
【速读】:该论文旨在解决生成式 AI 与扩展现实(XR)技术融合(AI XR)系统中因敏感数据(如眼动追踪)易受隐私攻击的问题,特别是成员推理攻击(MIA)和再识别攻击(RDA)带来的高风险泄露问题。传统差分隐私(DP)方法在处理多特征数据集时存在局限性:对所有特征统一加噪会引入冗余噪声、降低模型精度并增加推理延迟,难以满足实时 XR 应用需求。解决方案的关键在于提出一种结合可解释人工智能(XAI)与差分隐私的协同防御框架——通过后验解释识别对模型输出最具影响力的特征,并仅对这些关键特征应用 DP 加噪策略,在显著提升隐私保护强度(MIA 和 RDA 成功率分别降低最多 43% 和 39%)的同时保持模型性能(最高达 97% 准确率),并实现约 2 倍于传统 DP 方法的推理加速,最终在 HTC VIVE Pro 设备上部署了支持用户自定义隐私等级的 PrivateXR 用户界面,验证了方案的实际可行性。
链接: https://arxiv.org/abs/2512.16851
作者: Ripan Kumar Kundu,Istiak Ahmed,Khaza Anuarul Hoque
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Published in the IEEE ISMAR 2025 conference
Abstract:The convergence of artificial intelligence (AI) and extended reality (XR) technologies (AI XR) promises innovative applications across many domains. However, the sensitive nature of data (e.g., eye-tracking) used in these systems raises significant privacy concerns, as adversaries can exploit these data and models to infer and leak personal information through membership inference attacks (MIA) and re-identification attacks (RDA) with a high success rate. Researchers have proposed various techniques to mitigate such privacy attacks, including differential privacy (DP). However, AI XR datasets often contain numerous features, and applying DP uniformly can introduce unnecessary noise to less relevant features, degrade model accuracy, and increase inference time, limiting real-time XR deployment. Motivated by this, we propose a novel framework combining explainable AI (XAI) and DP-enabled privacy-preserving mechanisms to defend against privacy attacks. Specifically, we leverage post-hoc explanations to identify the most influential features in AI XR models and selectively apply DP to those features during inference. We evaluate our XAI-guided DP approach on three state-of-the-art AI XR models and three datasets: cybersickness, emotion, and activity classification. Our results show that the proposed method reduces MIA and RDA success rates by up to 43% and 39%, respectively, for cybersickness tasks while preserving model utility with up to 97% accuracy using Transformer models. Furthermore, it improves inference time by up to ~2x compared to traditional DP approaches. To demonstrate practicality, we deploy the XAI-guided DP AI XR models on an HTC VIVE Pro headset and develop a user interface (UI), namely PrivateXR, allowing users to adjust privacy levels (e.g., low, medium, high) while receiving real-time task predictions, protecting user privacy during XR gameplay.
zh
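"XAI 引导的选择性 DP"的核心是只对最重要的特征加噪。下面给出一个 numpy 示意(假设:重要性分数来自 SHAP/IG 等事后解释,Laplace 机制的 sensitivity 取 1.0,均为说明性设定):

```python
import numpy as np

def xai_guided_dp(x, importance, k, epsilon, sensitivity=1.0, rng=None):
    """
    XAI 引导的选择性 DP 加噪示意(假设实现): 只对事后解释给出的
    top-k 重要特征注入 Laplace 噪声, 其余特征保持原值。
    x: [d] 特征向量; importance: [d] 重要性分数。
    """
    rng = rng or np.random.default_rng()
    top = np.argsort(importance)[::-1][:k]          # 最具影响力的 k 个特征
    noisy = x.astype(float).copy()
    noisy[top] += rng.laplace(0.0, sensitivity / epsilon, size=k)
    return noisy

# 用法示意(gaze_features/shap_scores 为假设输入):
# x_priv = xai_guided_dp(gaze_features, shap_scores, k=8, epsilon=1.0)
```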
[AI-8] Meta-RL Induces Exploration in Language Agents
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)训练的大语言模型(Large Language Model, LLM)代理在需要主动探索的多轮长时程任务中表现不佳,且难以从试错经验中高效适应的问题。解决方案的关键在于提出一种通用的元强化学习(Meta-Reinforcement Learning, Meta-RL)框架LaMer,其核心包含两个组件:一是跨episode训练机制,以鼓励探索并优化长期奖励;二是基于反思的上下文策略自适应方法,使代理能够在不进行梯度更新的情况下,根据环境反馈信号动态调整策略。该方案显著提升了LLM代理在Sokoban、MineSweeper和Webshop等环境中的性能,并增强了对未见任务的泛化能力。
链接: https://arxiv.org/abs/2512.16848
作者: Yulun Jiang,Liangze Jiang,Damien Teney,Michael Moor,Maria Brbic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term rewards optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt their policy from task feedback signal without gradient update. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
[AI-9] Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning
【Quick Read】: This paper addresses the severe security threat that reactive jammers pose to robotic-swarm networks: by selectively disrupting inter-agent communication, they destabilize swarm formations and cause mission failure, and conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against adaptive jammers with Markovian threshold dynamics. The key is a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm, in which each agent jointly selects its transmit channel and power while QMIX learns a centralized but factorizable action-value function, unifying coordinated decision-making with decentralized execution. Simulations show rapid convergence to cooperative policies that nearly match a genie-aided optimal baseline and outperform local Upper Confidence Bound (UCB) and stateless reactive baselines in throughput and jamming resilience, demonstrating MARL's effectiveness for securing autonomous swarms in contested environments.
Link: https://arxiv.org/abs/2512.16813
Authors: Bahman Abolhassani, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Sastry Kompella
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Reactive jammers pose a severe security threat to robotic-swarm networks by selectively disrupting inter-agent communications and undermining formation integrity and mission success. Conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against such adaptive adversaries. This paper presents a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm to improve the resilience of swarm communications under reactive jamming. We consider a network of multiple transmitter-receiver pairs sharing channels while a reactive jammer with Markovian threshold dynamics senses aggregate power and reacts accordingly. Each agent jointly selects transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action-value function that enables coordinated yet decentralized execution. We benchmark QMIX against a genie-aided optimal policy in a no-channel-reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a more general fading regime with channel reuse enabled. Simulation results show that QMIX rapidly converges to cooperative policies that nearly match the genie-aided bound, while achieving higher throughput and lower jamming incidence than the baselines, thereby demonstrating MARL’s effectiveness for securing autonomous swarms in contested environments.
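For readers unfamiliar with QMIX, the following is a minimal PyTorch sketch of the standard QMIX mixing network that this framework builds on; layer sizes and variable names are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Minimal QMIX mixing network: combines per-agent Q-values into a joint
    Q_tot that is monotonic in each agent's Q-value. Hypernetworks conditioned
    on the global state produce the (non-negative) mixing weights."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)  # >= 0
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)              # >= 0
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)                   # Q_tot
```

The absolute values on the hypernetwork outputs enforce monotonicity, which is what allows each agent to act greedily on its own Q-value at execution time while training remains centralized.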
[AI-10] Delay-Aware Multi-Stage Edge Server Upgrade with Budget Constraint
【Quick Read】: This paper formulates Multi-stage Edge Server Upgrade (M-ESU), a new network-planning problem: in a long-lived, evolving multi-access edge computing (MEC) system, decide under per-stage budget constraints whether to deploy new servers or upgrade existing ones, and how to offload tasks so that the average number of tasks meeting their delay requirements is maximized. The key is a joint optimization framework combining server deployment with capacity upgrades, solved in two ways: a Mixed Integer Linear Programming (MILP) model that yields optimal solutions for small networks, and an efficient heuristic (M-ESU/H) for large networks that balances deployment and upgrade decisions, staying within 1.25% of optimal while running several orders of magnitude faster. Experiments show M-ESU/H improves task satisfaction by up to 21.57% over deployment-only or deployment/upgrade-prioritized alternatives under identical budget and demand growth, confirming its scalability and practical value for long-term MEC systems.
Link: https://arxiv.org/abs/2512.16792
Authors: Endar Suprih Wihidayat, Sieteng Soh, Kwan-Wu Chin, Duc-son Pham
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 17 pages, 9 figures
Abstract:In this paper, the Multi-stage Edge Server Upgrade (M-ESU) is proposed as a new network planning problem, involving the upgrading of an existing multi-access edge computing (MEC) system through multiple stages (e.g., over several years). More precisely, the problem considers two key decisions: (i) whether to deploy additional edge servers or upgrade those already installed, and (ii) how tasks should be offloaded so that the average number of tasks that meet their delay requirement is maximized. The framework specifically involves: (i) deployment of new servers combined with capacity upgrades for existing servers, and (ii) the optimal task offloading to maximize the average number of tasks with a delay requirement. It also considers the following constraints: (i) budget per stage, (ii) server deployment and upgrade cost (in $) and cost depreciation rate, (iii) computation resource of servers, (iv) number of tasks and their growth rate (in %), and (v) the increase in task sizes and stricter delay requirements over time. We present two solutions: a Mixed Integer Linear Programming (MILP) model and an efficient heuristic algorithm (M-ESU/H). MILP yields the optimal solution for small networks, whereas M-ESU/H is used in large-scale networks. For small networks, the simulation results show that the solution computed by M-ESU/H is within 1.25% of the optimal solution while running several orders of magnitude faster. For large networks, M-ESU/H is compared against three alternative heuristic solutions that consider only server deployment, or giving priority to server deployment or upgrade. Our experiments show that M-ESU/H yields up to 21.57% improvement in task satisfaction under identical budget and demand growth conditions, confirming its scalability and practical value for long-term MEC systems.
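To make the flavor of the MILP concrete, here is a deliberately tiny, single-stage sketch in Python/PuLP of the deploy-vs-upgrade decision under one budget. All data values, and the reduction of task satisfaction to per-site gain constants, are invented for illustration; the paper's actual model additionally covers multiple stages, task offloading, and cost depreciation.

```python
import pulp

# Invented single-stage toy: upgrade existing servers and/or deploy new ones
# under one budget so that the number of tasks served in time is maximized.
existing = {"e1": {"cost": 4, "gain": 30}, "e2": {"cost": 5, "gain": 25}}
candidates = {"c1": {"cost": 10, "gain": 80}, "c2": {"cost": 12, "gain": 95}}
budget = 20

prob = pulp.LpProblem("m_esu_single_stage", pulp.LpMaximize)
up = pulp.LpVariable.dicts("upgrade", existing, cat="Binary")
dep = pulp.LpVariable.dicts("deploy", candidates, cat="Binary")

# Objective: tasks meeting their delay requirement (hypothetical per-site gains).
prob += (pulp.lpSum(existing[s]["gain"] * up[s] for s in existing)
         + pulp.lpSum(candidates[s]["gain"] * dep[s] for s in candidates))
# Stage budget constraint over upgrade and deployment costs.
prob += (pulp.lpSum(existing[s]["cost"] * up[s] for s in existing)
         + pulp.lpSum(candidates[s]["cost"] * dep[s] for s in candidates)) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({**{s: up[s].value() for s in existing},
       **{s: dep[s].value() for s in candidates}})
```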
[AI-11] Towards Mass Spectrum Analysis with ASP
【Quick Read】: This paper tackles the combinatorial problem of automatically inferring the molecular structure of a chemical sample from the relative abundance of elements and structural fragments measured in mass spectrometry, where the core challenge is an exponentially growing search space. The key is to introduce canonical representations of molecular structures together with an Answer Set Programming (ASP) implementation that uses these definitions to constrain the search space effectively, improving solving efficiency and accuracy. Experiments verify correctness over a large set of known molecular structures and show favorable quality and performance compared with other ASP symmetry-breaking methods and a commercial tool from analytical chemistry.
Link: https://arxiv.org/abs/2512.16780
Authors: Nils Küchenmeister, Alex Ivliev, Markus Krötzsch
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 figures. Extended version of a paper accepted at 17th International Conference on Logic Programming and Non-monotonic Reasoning (LPNMR 2024). Under consideration in Theory and Practice of Logic Programming (TPLP)
Abstract:We present a new use of Answer Set Programming (ASP) to discover the molecular structure of chemical samples based on the relative abundance of elements and structural fragments, as measured in mass spectrometry. To constrain the exponential search space for this combinatorial problem, we develop canonical representations of molecular structures and an ASP implementation that uses these definitions. We evaluate the correctness of our implementation over a large set of known molecular structures, and we compare its quality and performance to other ASP symmetry-breaking methods and to a commercial tool from analytical chemistry. Under consideration in Theory and Practice of Logic Programming (TPLP).
[AI-12] CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
【Quick Read】: This paper addresses the limited ability of vision-language models (VLMs) to recognize and act on implicit human needs (e.g., "I am thirsty") in dynamic urban environments, focusing on long-horizon, goal-driven embodied navigation. While existing VLMs handle explicit instruction-based navigation well, they still struggle with implicit intent reasoning and decision-making in complex urban spaces. The key contribution is CitySeeker, a benchmark of 6,440 trajectories across 8 cities and 7 goal-driven scenarios for systematically evaluating VLMs' spatial reasoning and decision-making; the authors further study three exploration strategies inspired by human cognitive mapping — Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR) — to mitigate key bottlenecks such as error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall, offering actionable insights toward VLMs with the robust spatial intelligence needed for "last-mile" navigation.
Link: https://arxiv.org/abs/2512.16755
Authors: Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., “I am thirsty”) in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs’ spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies-Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping’s emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with robust spatial intelligence required for tackling “last-mile” navigation challenges.
[AI-13] Plausibility as Failure: How LLM s and Humans Co-Construct Epistemic Error
【Quick Read】: This paper addresses the fact that evaluations of large language model (LLM) errors rely predominantly on predictive metrics while ignoring their interpretive effects on human reasoning, i.e., how "epistemic failure" is perceived, masked, and tolerated in human-AI interaction. The key move is to reframe evaluation as a relational interpretive process: error is treated not as a property of model behavior alone but as the joint product of generative plausibility and human interpretive shortcuts. Through a multi-round, interdisciplinary task evaluation, the study finds that human evaluators are often misled by surface features such as linguistic fluency, structural coherence, and superficially plausible citations, letting erroneous yet well-formed content pass as credible and increasing verification burden, which motivates rethinking AI assessment around the cognitive dynamics of human-AI co-construction.
Link: https://arxiv.org/abs/2512.16750
Authors: Claudia Vale Oliveira, Nelson Zagalo, Filipe Silva, Anabela Brandao, Syeda Faryal Hussain Khurrum, Joaquim Santos (DigiMedia, University of Aveiro, Aveiro, Portugal)
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, 2 tables, 77 references, 6 appendices
Abstract:Large language models (LLMs) are increasingly used as epistemic partners in everyday reasoning, yet their errors remain predominantly analyzed through predictive metrics rather than through their interpretive effects on human judgment. This study examines how different forms of epistemic failure emerge, are masked, and are tolerated in human-AI interaction, where failure is understood as a relational breakdown shaped by model-generated plausibility and human interpretive judgment. We conducted a three-round, multi-LLM evaluation using interdisciplinary tasks and progressively differentiated assessment frameworks to observe how evaluators interpret model responses across linguistic, epistemic, and credibility dimensions. Our findings show that LLM errors shift from predictive to hermeneutic forms, where linguistic fluency, structural coherence, and superficially plausible citations conceal deeper distortions of meaning. Evaluators frequently conflated criteria such as correctness, relevance, bias, groundedness, and consistency, indicating that human judgment collapses analytical distinctions into intuitive heuristics shaped by form and fluency. Across rounds, we observed a systematic verification burden and cognitive drift. As tasks became denser, evaluators increasingly relied on surface cues, allowing erroneous yet well-formed answers to pass as credible. These results suggest that error is not solely a property of model behavior but a co-constructed outcome of generative plausibility and human interpretive shortcuts. Understanding AI epistemic failure therefore requires reframing evaluation as a relational interpretive process, where the boundary between system failure and human miscalibration becomes porous. The study provides implications for LLM assessment, digital literacy, and the design of trustworthy human-AI communication.
[AI-14] AI-Driven Prediction of Cancer Pain Episodes: A Hybrid Decision Support Approach
【Quick Read】: This paper addresses breakthrough pain in lung cancer patients, which is common, sudden in onset, and typically managed reactively rather than proactively. The key is a hybrid pipeline combining machine learning (ML) and a large language model (LLM): the ML module captures temporal medication-use trends, while the LLM interprets ambiguous dosing records and free-text clinical notes, enabling early prediction of pain episodes within 48 and 72 hours of hospitalization. Multimodal integration keeps accuracy high (0.874 at 48h; 0.917 at 72h) while improving sensitivity by 8.6% and 10.4%, yielding an interpretable and scalable decision-support tool for pain management in oncology care.
Link: https://arxiv.org/abs/2512.16739
Authors: Yipeng Zhuang, Yifeng Guo, Yuewen Li, Yuheng Wu, Philip Leung-Ho Yu, Tingting Song, Zhiyong Wang, Kunzhong Zhou, Weifang Wang, Li Zhuang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Lung cancer patients frequently experience breakthrough pain episodes, with up to 91% requiring timely intervention. To enable proactive pain management, we propose a hybrid machine learning and large language model pipeline that predicts pain episodes within 48 and 72 hours of hospitalization using both structured and unstructured electronic health record data. A retrospective cohort of 266 inpatients was analyzed, with features including demographics, tumor stage, vital signs, and WHO-tiered analgesic use. The machine learning module captured temporal medication trends, while the large language model interpreted ambiguous dosing records and free-text clinical notes. Integrating these modalities improved sensitivity and interpretability. Our framework achieved an accuracy of 0.874 (48h) and 0.917 (72h), with an improvement in sensitivity of 8.6% and 10.4% due to the augmentation of large language model. This hybrid approach offers a clinically interpretable and scalable tool for early pain episode forecasting, with potential to enhance treatment precision and optimize resource allocation in oncology care.
[AI-15] Discovering and Learning Probabilistic Models of Black-Box AI Capabilities
【Quick Read】: This paper addresses the lack of interpretability and safety assurance for black-box AI (BBAI) systems used in sequential decision-making, specifically how to efficiently learn models of a BBAI's planning capabilities. The key is to use PDDL-style symbolic representations within a Monte-Carlo tree search (MCTS) framework that systematically creates test tasks, acquires data, and prunes the hypothesis space of candidate symbolic models, yielding descriptions of the BBAI's capabilities, the conditions under which they can be executed, and their possible outcomes with associated probabilities. Theoretical results establish soundness, completeness, and convergence of the learned models, and experiments on multiple BBAI systems demonstrate the scope, efficiency, and accuracy of the approach.
Link: https://arxiv.org/abs/2512.16733
Authors: Daniel Bramblett, Rushang Karia, Adrian Ciotinga, Ruthvick Suresh, Pulkit Verma, YooJung Choi, Siddharth Srivastava
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Black-box AI (BBAI) systems such as foundational models are increasingly being used for sequential decision making. To ensure that such systems are safe to operate and deploy, it is imperative to develop efficient methods that can provide a sound and interpretable representation of the BBAI’s capabilities. This paper shows that PDDL-style representations can be used to efficiently learn and model an input BBAI’s planning capabilities. It uses the Monte-Carlo tree search paradigm to systematically create test tasks, acquire data, and prune the hypothesis space of possible symbolic models. Learned models describe a BBAI’s capabilities, the conditions under which they can be executed, and the possible outcomes of executing them along with their associated probabilities. Theoretical results show soundness, completeness and convergence of the learned models. Empirical results with multiple BBAI systems illustrate the scope, efficiency, and accuracy of the presented methods.
[AI-16] Towards Reproducibility in Predictive Process Mining: SPICE - A Deep Learning Library
【Quick Read】: This paper targets the poor reproducibility, opaque decision-making, difficult adoption of new datasets, and inconsistent benchmarking that plague existing deep-learning approaches to Predictive Process Mining (PPM), all of which hinder fair comparison and continuous improvement. The key is SPICE, an open-source Python framework built on PyTorch that reimplements three popular PPM deep-learning baselines on a common, rigorously configurable base architecture, enabling reproducible, robust, and fair comparison of past and future modeling approaches; the authors compare SPICE against originally reported metrics and with fair metrics on 11 datasets.
Link: https://arxiv.org/abs/2512.16715
Authors: Oliver Stritzel, Nick Hühnerbein, Simon Rauch, Itzel Zarate, Lukas Fleischmann, Moike Buck, Attila Lischka, Christian Frey
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:In recent years, Predictive Process Mining (PPM) techniques based on artificial neural networks have evolved as a method for monitoring the future behavior of unfolding business processes and predicting Key Performance Indicators (KPIs). However, many PPM approaches often lack reproducibility, transparency in decision making, usability for incorporating novel datasets and benchmarking, making comparisons among different implementations very difficult. In this paper, we propose SPICE, a Python framework that reimplements three popular, existing baseline deep-learning-based methods for PPM in PyTorch, while designing a common base framework with rigorous configurability to enable reproducible and robust comparison of past and future modelling approaches. We compare SPICE to original reported metrics and with fair metrics on 11 datasets.
[AI-17] Dual Computational Horizons: Incompleteness and Unpredictability in Intelligent Systems
【Quick Read】: This paper seeks to characterize fundamental limits on the reasoning and prediction abilities of algorithmic intelligence, i.e., the bounds on an agent's ability to analyze its own predictive capabilities under finite computational resources. The key is to formalize two independent limitations: formal incompleteness, which limits the deductive power of consistent reasoning systems, and dynamical unpredictability, which bounds long-term prediction under finite precision. The author further shows that these two extrema jointly impose structural bounds on an agent's reasoning about its own predictive power; in particular, an algorithmic agent cannot in general compute its own maximal prediction horizon, clarifying inherent trade-offs among reasoning, prediction, and self-analysis.
Link: https://arxiv.org/abs/2512.16707
Authors: Abhisek Ganguly
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 6 pages, 0 figures
Abstract:We formalize two independent computational limitations that constrain algorithmic intelligence: formal incompleteness and dynamical unpredictability. The former limits the deductive power of consistent reasoning systems while the latter bounds long-term prediction under finite precision. We show that these two extrema together impose structural bounds on an agent's ability to reason about its own predictive capabilities. In particular, an algorithmic agent cannot, in general, compute its own maximal prediction horizon. This perspective clarifies inherent trade-offs between reasoning, prediction, and self-analysis in intelligent systems.
[AI-18] Cyber Humanism in Education: Reclaiming Agency through AI and Learning Sciences
【Quick Read】: This paper addresses how the rapid spread of generative AI in education is reshaping the production and validation of knowledge, raising risks of cognitive offloading, epistemic automation, and the de-professionalisation of teachers. The key is the proposed framework of Cyber Humanism in Education, which treats AI-enabled learning environments as socio-technical infrastructures co-authored by humans and machines and positions educators and learners as epistemic agents and algorithmic citizens with both the right and the responsibility to shape these infrastructures, thereby reclaiming human agency. The framework rests on three pillars — reflexive competence, algorithmic citizenship, and dialogic design — and is operationalised in higher-education case studies through prompt-based learning and a Conversational AI Educator certification within the EPICT ecosystem.
Link: https://arxiv.org/abs/2512.16701
Authors: Giovanni Adorni
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 16 references, keynote presented at the "WAILS 2025 - The 2nd Workshop on Artificial Intelligence with and for Learning Sciences", Cagliari, Italy, 10-12 December 2025
Abstract:Generative Artificial Intelligence (GenAI) is rapidly reshaping how knowledge is produced and validated in education. Rather than adding another digital tool, large language models reconfigure reading, writing, and coding into hybrid human-AI workflows, raising concerns about epistemic automation, cognitive offloading, and the de-professionalisation of teachers. This paper proposes Cyber Humanism in Education as a framework for reclaiming human agency in this landscape. We conceptualise AI-enabled learning environments as socio-technical infrastructures co-authored by humans and machines, and position educators and learners as epistemic agents and algorithmic citizens who have both the right and the responsibility to shape these infrastructures. We articulate three pillars for cyber-humanist design — reflexive competence, algorithmic citizenship, and dialogic design — and relate them to major international digital and AI competence frameworks. We then present higher-education case studies that operationalise these ideas through prompt-based learning and a new Conversational AI Educator certification within the EPICT ecosystem. The findings show how such practices can strengthen epistemic agency while surfacing tensions around workload, equity, and governance, and outline implications for the future of AI-rich, human-centred education.
[AI-19] Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning
【Quick Read】: This paper examines whether multi-agent designs outperform single-agent ones for diagram-grounded geometry problem solving, a question that has remained unclear. The central finding is that multi-agent pipelines consistently help open-source models — e.g., Qwen-2.5-VL (7B) gains +6.8 points on Geometry3K — whereas the closed-source Gemini-2.0-Flash generally does better in single-agent mode on classic benchmarks, with multi-agent yielding only modest gains on the newer We-Math dataset. This shows that agentic decomposition is not universally optimal, but offers clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar challenges.
Link: https://arxiv.org/abs/2512.16698
Authors: Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Mohammad Nehad Alam, Proma Hossain Progga, Swakkhar Shatabda
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Comments: Accepted to the ARR October 2025 cycle
Abstract:Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at this https URL
[AI-20] Unsupervised Thematic Clustering Of hadith Texts Using The Apriori Algorithm
【Quick Read】: This paper addresses the automatic thematic grouping of hadith amid the growing digitalization of Islamic texts, aiming to make digital Islamic studies more efficient and intelligent. The key is an unsupervised, association-rule-mining approach: the Apriori algorithm is applied to the Indonesian translation of the hadith of Bukhari after preprocessing (stemming, stopword removal, etc.), and with support, confidence, and lift parameters it automatically surfaces meaningful association patterns such as rakaat-prayer, verse-revelation, and hadith-story, revealing latent thematic structure around worship, revelation, and hadith narration.
Link: https://arxiv.org/abs/2512.16694
Authors: Wisnu Uriawan, Achmad Ajie Priyajie, Angga Gustian, Fikri Nur Hidayat, Sendi Ahmad Rafiudin, Muhamad Fikri Zaelani
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:This research stems from the urgency to automate the thematic grouping of hadith in line with the growing digitalization of Islamic texts. Based on a literature review, the unsupervised learning approach with the Apriori algorithm has proven effective in identifying association patterns and semantic relations in unlabeled text data. The dataset used is the Indonesian Translation of the hadith of Bukhari, which first goes through preprocessing stages including case folding, punctuation cleaning, tokenization, stopword removal, and stemming. Next, an association rule mining analysis was conducted using the Apriori algorithm with support, confidence, and lift parameters. The results show the existence of meaningful association patterns such as the relationship between rakaat-prayer, verse-revelation, and hadith-story, which describe the themes of worship, revelation, and hadith narration. These findings demonstrate that the Apriori algorithm has the ability to automatically uncover latent semantic relationships, while contributing to the development of digital Islamic studies and technology-based learning systems.
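As a concrete illustration of this pipeline, the sketch below mines association rules from toy "baskets" of stemmed Indonesian tokens using mlxtend's Apriori implementation. The token lists and the support/confidence thresholds are invented, since the paper does not state its exact parameter values.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical token baskets: each preprocessed hadith becomes a set of stems.
hadiths = [
    ["shalat", "rakaat", "subuh"],
    ["shalat", "rakaat", "maghrib"],
    ["ayat", "wahyu", "turun"],
    ["hadits", "kisah", "nabi"],
    ["shalat", "rakaat", "masjid"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(hadiths).transform(hadiths), columns=te.columns_)

# Mine frequent itemsets, then derive rules scored by support/confidence/lift,
# mirroring the parameters the paper reports (actual thresholds unknown).
itemsets = apriori(onehot, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

On this toy data the rule {shalat} -> {rakaat} emerges with confidence 1.0, the kind of prayer-related association pattern the paper describes.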
[AI-21] Microsoft Academic Graph Information Retrieval for Research Recommendation and Assistance
【Quick Read】: This paper addresses the challenge of efficiently filtering relevant information from massive scientific-literature databases, in particular how to improve retrieval precision and knowledge reasoning at scale. The key is an Attention-Based Subgraph Retriever: a graph neural network (GNN) used as a retriever that applies attention-based pruning to the underlying knowledge graph to extract a refined, query-relevant subgraph, which is then passed to a large language model (LLM) for deeper knowledge reasoning.
Link: https://arxiv.org/abs/2512.16661
Authors: Jacob Reiss, Shikshya Shiwakoti, Samuel Goldsmith, Ujjwal Pandit
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures
Abstract:In today’s information-driven world, access to scientific publications has become increasingly easy. At the same time, filtering through the massive volume of available research has become more challenging than ever. Graph Neural Networks (GNNs) and graph attention mechanisms have shown strong effectiveness in searching large-scale information databases, particularly when combined with modern large language models. In this paper, we propose an Attention-Based Subgraph Retriever, a GNN-as-retriever model that applies attention-based pruning to extract a refined subgraph, which is then passed to a large language model for advanced knowledge reasoning.
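The abstract does not specify the attention scoring in detail; the sketch below replaces a learned attention head with a simple dot-product affinity to convey the core idea of query-conditioned edge pruning. All tensor shapes and the keep ratio are illustrative assumptions.

```python
import torch

def prune_subgraph(node_emb: torch.Tensor, query_emb: torch.Tensor,
                   edge_index: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Score each edge by the affinity between the query and its two endpoint
    embeddings, then keep only the top fraction of edges."""
    src, dst = edge_index                       # edge_index: (2, num_edges)
    scores = (node_emb[src] * query_emb).sum(-1) + (node_emb[dst] * query_emb).sum(-1)
    k = max(1, int(keep_ratio * scores.numel()))
    return edge_index[:, scores.topk(k).indices]  # pruned edge list

# Toy graph: 6 nodes, 8 random edges, 16-dimensional embeddings.
emb = torch.randn(6, 16)
query = torch.randn(16)
edges = torch.randint(0, 6, (2, 8))
print(prune_subgraph(emb, query, edges))
```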
[AI-22] Protecting Deep Neural Network Intellectual Property with Chaos-Based White-Box Watermarking
【Quick Read】: This paper addresses intellectual property (IP) protection and misuse of widely deployed deep neural networks (DNNs), specifically how to embed and verify model ownership without degrading predictive performance. The key is an efficient and resilient white-box watermarking framework that embeds ownership information into a DNN's internal parameters via chaotic sequences: a logistic map generates a sequence highly sensitive to its initialization parameters, which is injected into the weights of a chosen intermediate layer without structural changes, while verification uses a genetic algorithm that recovers the original chaotic parameters by optimizing the similarity between extracted and regenerated sequences. Experiments on MNIST and CIFAR-10 image classification show the watermark remains detectable after fine-tuning, with negligible accuracy loss.
Link: https://arxiv.org/abs/2512.16658
Authors: Sangeeth B, Serena Nicolazzo, Deepa K., Vinod P
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid proliferation of deep neural networks (DNNs) across several domains has led to increasing concerns regarding intellectual property (IP) protection and model misuse. Trained DNNs represent valuable assets, often developed through significant investments. However, the ease with which models can be copied, redistributed, or repurposed highlights the urgent need for effective mechanisms to assert and verify model ownership. In this work, we propose an efficient and resilient white-box watermarking framework that embeds ownership information into the internal parameters of a DNN using chaotic sequences. The watermark is generated using a logistic map, a well-known chaotic function, producing a sequence that is sensitive to its initialization parameters. This sequence is injected into the weights of a chosen intermediate layer without requiring structural modifications to the model or degradation in predictive performance. To validate ownership, we introduce a verification process based on a genetic algorithm that recovers the original chaotic parameters by optimizing the similarity between the extracted and regenerated sequences. The effectiveness of the proposed approach is demonstrated through extensive experiments on image classification tasks using MNIST and CIFAR-10 datasets. The results show that the embedded watermark remains detectable after fine-tuning, with negligible loss in model accuracy. In addition to numerical recovery of the watermark, we perform visual analyses using weight density plots and construct activation-based classifiers to distinguish between original, watermarked, and tampered models. Overall, the proposed method offers a flexible and scalable solution for embedding and verifying model ownership in white-box settings well-suited for real-world scenarios where IP protection is critical.
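The logistic map at the core of this scheme is standard; the sketch below generates the chaotic watermark and additively injects it at secret positions of a flattened weight tensor. The embedding rule (additive perturbation, RNG-chosen positions, strength alpha) is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np

def logistic_sequence(x0: float, r: float, n: int) -> np.ndarray:
    """Chaotic logistic-map sequence x_{t+1} = r * x_t * (1 - x_t)."""
    seq, x = np.empty(n), x0
    for i in range(n):
        x = r * x * (1.0 - x)
        seq[i] = x
    return seq

def embed_watermark(weights: np.ndarray, x0: float = 0.731, r: float = 3.99,
                    alpha: float = 1e-3, seed: int = 0):
    """Hypothetical embedding: add a small, key-dependent chaotic perturbation
    at secret positions of the flattened weights. x0/r act as the secret keys;
    alpha keeps the perturbation small so accuracy is preserved."""
    flat = weights.ravel().copy()
    rng = np.random.default_rng(seed)                         # secret positions
    idx = rng.choice(flat.size, size=min(256, flat.size), replace=False)
    mark = logistic_sequence(x0, r, idx.size)
    flat[idx] += alpha * (mark - 0.5)
    return flat.reshape(weights.shape), idx, mark

layer = np.random.randn(64, 64).astype(np.float64)
marked, positions, sequence = embed_watermark(layer)
print(np.abs(marked - layer).max())   # perturbation stays below alpha/2
```

The extreme sensitivity of the sequence to x0 and r is what makes a genetic-algorithm search over candidate keys a meaningful verification test: only the true keys regenerate a sequence close to the extracted one.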
[AI-23] Comprehensive AI Literacy: The Case for Centering Human Agency
【Quick Read】: This paper addresses the failure of current educational frameworks to meet the literacy demands created by AI's rapid assimilation into society, pointing to a dangerous "AI literacy gap" in which functional, operational skills with AI tools eclipse critical and ethical reasoning about them. The key is a systemic shift toward comprehensive AI literacy centered on human agency, the empowered capacity for intentional, critical, and responsible choice: a student's agency to question, create with, or consciously decline AI depending on the task, and a teacher's agency to design learning experiences aligned with instructional values rather than ceding pedagogical control to a tool. Through the AI Literacy, Fluency, and Competency frameworks described in the paper, technology is framed as a choice to be made rather than an inevitability, grounded in critical thinking and a robust understanding of epistemology, so stakeholders can articulate the intentions behind their decisions and assess the impact on academic work, career, and society.
Link: https://arxiv.org/abs/2512.16656
Authors: Sri Yash Tadimalla, Justin Cary, Gordon Hull, Jordan Register, Daniel Maxwell, David Pugalee, Tina Heafner
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 2 figures, 2 tables
Abstract:The rapid assimilation of Artificial Intelligence technologies into various facets of society has created a significant educational imperative that current frameworks are failing to effectively address. We are witnessing the rise of a dangerous literacy gap, where a focus on the functional, operational skills of using AI tools is eclipsing the development of critical and ethical reasoning about them. This position paper argues for a systemic shift toward comprehensive AI literacy that centers human agency - the empowered capacity for intentional, critical, and responsible choice. This principle applies to all stakeholders in the educational ecosystem: it is the student’s agency to question, create with, or consciously decide not to use AI based on the task; it is the teacher’s agency to design learning experiences that align with instructional values, rather than ceding pedagogical control to a tool. True literacy involves teaching about agency itself, framing technology not as an inevitability to be adopted, but as a choice to be made. This requires a deep commitment to critical thinking and a robust understanding of epistemology. Through the AI Literacy, Fluency, and Competency frameworks described in this paper, educators and students will become agents in their own human-centric approaches to AI, providing necessary pathways to clearly articulate the intentions informing decisions and attitudes toward AI and the impact of these decisions on academic work, career, and society.
[AI-24] Prefix Probing: Lightweight Harmful Content Detection for Large Language Models
【Quick Read】: This paper addresses the three-way trade-off among detection accuracy, inference latency, and deployment cost that large language models (LLMs) face in safety-sensitive applications. The key is Prefix Probing, a black-box harmful-content detection method that compares the conditional log-probabilities of "agreement/execution" versus "refusal/safety" opening prefixes and uses prefix caching to reduce detection overhead to near first-token latency; a single log-probability computation over the probe prefixes yields a harmfulness score to threshold, with no extra models or multi-stage inference. An efficient prefix-construction algorithm further discovers highly informative prefixes automatically, and experiments show detection effectiveness comparable to mainstream external safety models at minimal computational cost and with no additional model deployment.
Link: https://arxiv.org/abs/2512.16650
Authors: Jirui Yang, Hengqi Guo, Zhihui Lu, Yi Zhao, Yuansen Zhang, Shijing Hu, Qiang Duan, Yinggui Wang, Tao Wei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Large language models often face a three-way trade-off among detection accuracy, inference latency, and deployment cost when used in real-world safety-sensitive applications. This paper introduces Prefix Probing, a black-box harmful content detection method that compares the conditional log-probabilities of “agreement/execution” versus “refusal/safety” opening prefixes and leverages prefix caching to reduce detection overhead to near first-token latency. During inference, the method requires only a single log-probability computation over the probe prefixes to produce a harmfulness score and apply a threshold, without invoking any additional models or multi-stage inference. To further enhance the discriminative power of the prefixes, we design an efficient prefix construction algorithm that automatically discovers highly informative prefixes, substantially improving detection performance. Extensive experiments demonstrate that Prefix Probing achieves detection effectiveness comparable to mainstream external safety models while incurring only minimal computational cost and requiring no extra model deployment, highlighting its strong practicality and efficiency.
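A minimal sketch of the core scoring step follows, assuming a Hugging Face causal LM (gpt2 as a stand-in) and two illustrative probe prefixes of our own invention; chat templating, prefix caching, boundary-tokenization effects, and the paper's learned prefix construction are all omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prefix_logprob(prompt: str, prefix: str) -> float:
    """Sum of conditional log-probabilities of `prefix` tokens given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    total = 0.0
    for pos in range(prompt_ids.size(1), full_ids.size(1)):
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()  # P(token | history)
    return total

def harmfulness_score(user_prompt: str) -> float:
    """Higher score = the model is relatively more willing to comply."""
    agree = prefix_logprob(user_prompt, " Sure, here is how to do that:")
    refuse = prefix_logprob(user_prompt, " I'm sorry, but I can't help with that.")
    return agree - refuse     # threshold this difference to flag harmful requests

print(harmfulness_score("Tell me a bedtime story."))
```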
[AI-25] Implementing a Sharia Chatbot as a Consultation Medium for Questions About Islam
【Quick Read】: This paper addresses the need for an intelligent, always-available consultation medium for Islamic questions in the era of digital da'wah, where traditional channels can be slow and hard to access while content authority must be preserved. The key is a Sharia-compliant chatbot combining Reinforcement Learning (Q-Learning) for decision optimization with Sentence-Transformers for semantic embedding, built with the CRISP-DM methodology over a curated dataset of 25,000 question-answer pairs drawn from the Qur'an, Hadith, and scholarly fatwas in a flexible JSON format; a Flask API backend and a Flutter mobile frontend deliver the interactive system, which reaches 87% semantic accuracy on closed-domain queries spanning fiqh, aqidah, ibadah, and muamalah, with static learning and dataset dependency noted as limitations for future work.
Link: https://arxiv.org/abs/2512.16644
Authors: Wisnu Uriawan, Aria Octavian Hamza, Ade Ripaldi Nuralim, Adi Purnama, Ahmad Juaeni Yunus, Anissya Auliani Supriadi Putri
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:This research presents the implementation of a Sharia-compliant chatbot as an interactive medium for consulting Islamic questions, leveraging Reinforcement Learning (Q-Learning) integrated with Sentence-Transformers for semantic embedding to ensure contextual and accurate responses. Utilizing the CRISP-DM methodology, the system processes a curated Islam QA dataset of 25,000 question-answer pairs from authentic sources like the Qur’an, Hadith, and scholarly fatwas, formatted in JSON for flexibility and scalability. The chatbot prototype, developed with a Flask API backend and Flutter-based mobile frontend, achieves 87% semantic accuracy in functional testing across diverse topics including fiqh, aqidah, ibadah, and muamalah, demonstrating its potential to enhance religious literacy, digital da’wah, and access to verified Islamic knowledge in the Industry 4.0 era. While effective for closed-domain queries, limitations such as static learning and dataset dependency highlight opportunities for future enhancements like continuous adaptation and multi-turn conversation support, positioning this innovation as a bridge between traditional Islamic scholarship and modern AI-driven consultation.
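The semantic-matching half of such a system can be sketched with Sentence-Transformers as below; the QA pairs, model choice, and similarity threshold are invented, and the Q-Learning policy layer described in the paper is omitted.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical mini knowledge base of question-answer pairs.
qa_pairs = [
    ("How many rakaat is the Fajr prayer?", "Fajr consists of two rakaat."),
    ("What is zakat?", "Zakat is the obligatory alms on qualifying wealth."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
question_embs = model.encode([q for q, _ in qa_pairs], convert_to_tensor=True)

def answer(user_query: str, threshold: float = 0.5) -> str:
    """Return the best-matching stored answer, or a fallback below threshold."""
    sims = util.cos_sim(model.encode(user_query, convert_to_tensor=True),
                        question_embs)[0]
    best = int(sims.argmax())
    return qa_pairs[best][1] if float(sims[best]) >= threshold \
        else "No verified answer found; please consult a scholar."

print(answer("number of rakaat in the dawn prayer"))
```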
[AI-26] Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game
【Quick Read】: This paper addresses limitations of existing preference-optimization methods for aligning large language models (LLMs): Reinforcement Learning from Human Feedback (RLHF) assigns scalar rewards to actions, and Nash Learning from Human Feedback (NLHF) seeks a simultaneous-move equilibrium, leaving room for improvement on rich preference structures, data sensitivity, and robustness to intransitive preferences. The key is Stackelberg Learning from Human Feedback (SLHF), which frames alignment as a sequential-move game between two policies: a Leader commits to an action and a Follower responds conditionally on it. This sequential design decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader, and naturally enables inference-time iterative refinement; experiments show strong alignment across diverse preference datasets, scaling from 0.5B to 8B parameters, with refinements that transfer across model families without further fine-tuning.
Link: https://arxiv.org/abs/2512.16626
Authors: Barna Pásztor, Thomas Kleine Buening, Andreas Krause
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Comments: 10 pages, 5 tables, 1 figure
Abstract:We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader’s action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader’s actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
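In symbols, one plausible reading of the SLHF solution concept, inferred from the abstract rather than taken from the paper's notation, with P(y ≻ y' | x) denoting the human preference probability:

```latex
% Follower best-responds to the Leader's committed action y_L;
% the Leader then optimizes against that best response (Stackelberg play).
\pi_F^{*}(\pi_L) \in \arg\max_{\pi_F}\;
  \mathbb{E}_{x,\; y_L \sim \pi_L(\cdot\mid x),\; y_F \sim \pi_F(\cdot\mid x,\,y_L)}
  \bigl[\, P(y_F \succ y_L \mid x) \,\bigr],
\qquad
\pi_L^{*} \in \arg\max_{\pi_L}\;
  \mathbb{E}_{x,\; y_L \sim \pi_L(\cdot\mid x),\; y_F \sim \pi_F^{*}(\pi_L)}
  \bigl[\, P(y_L \succ y_F \mid x) \,\bigr].
```

The asymmetry is the point: because the Follower conditions on the Leader's output, it learns to refine concrete responses, which is what makes iterative sampling at inference time useful.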
[AI-27] From Personalization to Prejudice: Bias and Discrimination in Memory-Enhanced AI Agents for Recruitment WSDM’26
【Quick Read】: This paper addresses the largely unexplored risk that memory-enhanced personalization in LLM-based AI agents introduces and amplifies bias, using recruitment as a sensitive example use case. The key is to simulate the behavior of a memory-enhanced personalized agent and systematically trace whether and how bias is introduced and reinforced across stages of interaction, learning, and decision-making; experiments with agents built on safety-trained LLMs show that bias is systematically introduced and reinforced through personalization, underscoring the need for additional protective measures or agent guardrails in memory-enhanced LLM-based AI agents.
Link: https://arxiv.org/abs/2512.16532
Authors: Himanshu Gharat, Himanshi Agrawal, Gourab K. Patro
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: In Proceedings of the Nineteenth ACM International Conference on Web Search and Data Mining (WSDM '26)
Abstract:Large Language Models (LLMs) have empowered AI agents with advanced capabilities for understanding, reasoning, and interacting across diverse tasks. The addition of memory further enhances them by enabling continuity across interactions, learning from past experiences, and improving the relevance of actions and responses over time; termed as memory-enhanced personalization. Although such personalization through memory offers clear benefits, it also introduces risks of bias. While several previous studies have highlighted bias in ML and LLMs, bias due to memory-enhanced personalized agents is largely unexplored. Using recruitment as an example use case, we simulate the behavior of a memory-enhanced personalized agent, and study whether and how bias is introduced and amplified in and across various stages of operation. Our experiments on agents using safety-trained LLMs reveal that bias is systematically introduced and reinforced through personalization, emphasizing the need for additional protective measures or agent guardrails in memory-enhanced LLM-based AI agents.
[AI-28] Scaling Laws for Energy Efficiency of Local LLM s
【Quick Read】: This paper addresses how to balance accuracy against constrained compute and energy budgets when deploying local large language models (LLMs) and vision-language models (VLMs) on edge devices: while mainstream AI deployment relies on GPUs, most consumer hardware (laptops, embedded systems, industrial controllers) runs on CPUs, whose inference scaling behavior for such workloads has been largely unexplored. The key contributions are twofold. First, using a unified continuous-sampling methodology with area-under-curve integration on a MacBook Pro M2 and a Raspberry Pi 5, the authors uncover two empirical scaling laws: language-model inference cost scales approximately linearly with token length, while vision-language models exhibit a preprocessing-driven "resolution knee", with compute constant above an internal resolution clamp and dropping sharply below it. Second, quantum-inspired compression cuts CPU and memory usage by up to 71.9% and energy by up to 62% while preserving or improving semantic accuracy, identifying compression and input-resolution preprocessing as low-cost levers for sustainable edge inference.
Link: https://arxiv.org/abs/2512.16531
Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Samuel Mugel, Román Orús
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware–including laptops, desktops, industrial controllers, and embedded systems–relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven “resolution knee”, where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
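The measurement methodology (continuous power sampling, area-under-curve integration to joules, then fitting the linear law) can be sketched as follows. The traces here are fabricated stand-ins: the sampling period, power levels, and the runtime-vs-tokens model are invented for illustration.

```python
import numpy as np

dt = 0.5                                   # hypothetical sampling period (s)
token_lengths = np.array([128, 256, 512, 1024, 2048])
energies = []
for n in token_lengths:
    t = np.arange(0.0, 2.0 + 0.004 * n, dt)        # runtime grows with length
    power = 9.0 + 0.5 * np.random.rand(t.size)     # fake CPU power samples (W)
    energies.append(np.trapz(power, dx=dt))        # area under curve = joules

# Fit the empirical law E(n) ~ a*n + b reported for LLM inference.
a, b = np.polyfit(token_lengths, energies, deg=1)
print(f"energy ≈ {a:.4f} J/token * n + {b:.2f} J")
```

With real traces, a near-zero residual on this linear fit is exactly what the paper's first scaling law predicts for CPU-only language-model inference.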
[AI-29] ParamExplorer: A framework for exploring parameters in generative art
【Quick Read】: This paper addresses the difficulty of exploring the high-dimensional, complex parameter spaces of generative art systems, where aesthetically compelling outputs occupy only small, fragmented regions, leaving artists dependent on extensive manual trial-and-error and many interesting configurations undiscovered. The key is ParamExplorer, an interactive, modular framework inspired by reinforcement learning that guides parameter-space exploration via human-in-the-loop or automated feedback and integrates seamlessly with existing projects; within this framework the authors implement and evaluate several exploration strategies, referred to as agents.
Link: https://arxiv.org/abs/2512.16529
Authors: Julien Gachadoat, Guillaume Lagarde
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 16 pages, 3 figures
Abstract:Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regions. Because of this combinatorial explosion, artists typically rely on extensive manual trial-and-error, leaving many potentially interesting configurations undiscovered. In this work we make two contributions. First, we introduce ParamExplorer, an interactive and modular framework inspired by reinforcement learning that helps the exploration of parameter spaces in generative art algorithms, guided by human-in-the-loop or even automated feedback. The framework also integrates seamlessly with existing this http URL projects. Second, within this framework we implement and evaluate several exploration strategies, referred to as agents.
[AI-30] XTC: A Research Platform for Optimizing AI Workload Operators
【Quick Read】: This paper addresses the problem that scheduling languages for AI operators are locked into specific compiler ecosystems, preventing fair comparison, code reuse, and performance evaluation across frameworks. The key is XTC, a platform that decouples scheduling specification from code generation and performance measurement through a common API and a reproducible measurement framework, enabling portable cross-compiler experimentation and accelerating research on optimization strategies.
Link: https://arxiv.org/abs/2512.16512
Authors: Pompougnac Hugo, Guillon Christophe, Noiry Sylvain, Dutilleul Alban, Iooss Guillaume, Rastello Fabrice
Affiliations: Unknown
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI)
Comments:
Abstract:Achieving high efficiency on AI operators demands precise control over computation and data movement. However, existing scheduling languages are locked into specific compiler ecosystems, preventing fair comparison, reuse, and evaluation across frameworks. No unified interface currently decouples scheduling specification from code generation and measurement. We introduce XTC, a platform that unifies scheduling and performance evaluation across compilers. With its common API and reproducible measurement framework, XTC enables portable experimentation and accelerates research on optimization strategies.
[AI-31] Best Practices For Empirical Meta-Algorithmic Research: Guidelines from the COSEAL Research Network
【Quick Read】: This paper addresses the scalability and validity risks of empirical research in meta-algorithmics (algorithm selection, configuration, and scheduling), where computationally expensive experiments and large degrees of freedom in experimental setup open up many sources of error. The key is to systematically collect good practices from across the subfields of the COSEAL community, covering the entire experimental cycle from formulating research questions and selecting an experimental design, through executing experiments, to analyzing and presenting results impartially, thereby establishing the current state of the art for empirical meta-algorithmic research and serving as a unified guideline for newcomers and practitioners alike.
Link: https://arxiv.org/abs/2512.16491
Authors: Theresa Eimer, Lennart Schäpermeier, André Biedenkapp, Alexander Tornede, Lars Kotthoff, Pieter Leyman, Matthias Feurer, Katharina Eggensperger, Kaitlin Maile, Tanja Tornede, Anna Kozak, Ke Xue, Marcel Wever, Mitra Baratchi, Damir Pulatov, Heike Trautmann, Haniye Kashgarani, Marius Lindauer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Empirical research on meta-algorithmics, such as algorithm selection, configuration, and scheduling, often relies on extensive and thus computationally expensive experiments. With the large degree of freedom we have over our experimental setup and design comes a plethora of possible error sources that threaten the scalability and validity of our scientific insights. Best practices for meta-algorithmic research exist, but they are scattered between different publications and fields, and continue to evolve separately from each other. In this report, we collect good practices for empirical meta-algorithmic research across the subfields of the COSEAL community, encompassing the entire experimental cycle: from formulating research questions and selecting an experimental design, to executing experiments, and ultimately, analyzing and presenting results impartially. It establishes the current state-of-the-art practices within meta-algorithmic research and serves as a guideline to both new researchers and practitioners in meta-algorithmic fields.
[AI-32] Quantifying and Bridging the Fidelity Gap: A Decisive-Feature Approach to Comparing Synthetic and Real Imagery
【Quick Read】: This paper addresses a limitation of virtual testing for autonomous vehicle (AV) safety assurance: pixel-level visual fidelity alone does not guarantee reliable transfer from simulation to the real world, because even realistic-looking images may lead the system-under-test (SUT) to base its decisions on different causal evidence in the two domains. The key is Decisive Feature Fidelity (DFF), a behavior-grounded, SUT-specific fidelity metric that uses explainable-AI (XAI) methods to identify and compare the decisive features driving the SUT's outputs on matched real-synthetic pairs, thereby measuring mechanism parity. The authors also propose practical estimators based on counterfactual explanations and a DFF-guided calibration scheme; experiments on 2,126 matched KITTI-VirtualKITTI2 pairs show DFF reveals discrepancies overlooked by conventional output-value fidelity, and DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output-value fidelity across diverse SUTs.
Link: https://arxiv.org/abs/2512.16468
Authors: Danial Safaei, Siddartha Khastgir, Mohsen Alirezaei, Jeroen Ploeg, Son Tong, Xingyu Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Virtual testing using synthetic data has become a cornerstone of autonomous vehicle (AV) safety assurance. Despite progress in improving visual realism through advanced simulators and generative AI, recent studies reveal that pixel-level fidelity alone does not ensure reliable transfer from simulation to the real world. What truly matters is whether the system-under-test (SUT) bases its decisions on the same causal evidence in both real and simulated environments - not just whether images “look real” to humans. This paper addresses the lack of such a behavior-grounded fidelity measure by introducing Decisive Feature Fidelity (DFF), a new SUT-specific metric that extends the existing fidelity spectrum to capture mechanism parity - the agreement in causal evidence underlying the SUT’s decisions across domains. DFF leverages explainable-AI (XAI) methods to identify and compare the decisive features driving the SUT’s outputs for matched real-synthetic pairs. We further propose practical estimators based on counterfactual explanations, along with a DFF-guided calibration scheme to enhance simulator fidelity. Experiments on 2126 matched KITTI-VirtualKITTI2 pairs demonstrate that DFF reveals discrepancies overlooked by conventional output-value fidelity. Furthermore, results show that DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
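The paper's DFF estimators are built on counterfactual explanations; as a simplified stand-in that only conveys the mechanism-parity idea, the sketch below scores the Jaccard overlap of the top-k attribution features for one matched real/synthetic pair. Everything here, including the choice of k and the attribution maps, is hypothetical.

```python
import numpy as np

def decisive_feature_fidelity(attr_real: np.ndarray, attr_sim: np.ndarray,
                              k: int = 20) -> float:
    """Jaccard overlap of the top-k most influential features in two XAI
    attribution maps; 1.0 means the SUT relied on the same decisive evidence
    for the real and the synthetic input."""
    top_real = set(np.argsort(np.abs(attr_real).ravel())[-k:])
    top_sim = set(np.argsort(np.abs(attr_sim).ravel())[-k:])
    return len(top_real & top_sim) / len(top_real | top_sim)

# Toy attribution maps for one matched real/synthetic image pair.
rng = np.random.default_rng(0)
real_map, sim_map = rng.normal(size=(32, 32)), rng.normal(size=(32, 32))
print(decisive_feature_fidelity(real_map, sim_map))
```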
[AI-33] cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution
【Quick Read】: This paper addresses the difficulty of automating CUDA kernel optimization, which demands hardware-software co-design expertise and contends with the proprietary nature of high-performance kernel libraries; existing approaches that combine large language models (LLMs) with evolutionary algorithms underperform due to suboptimal agent designs and mismatched evolution representations. The key is cuPilot, a strategy-coordinated multi-agent framework that introduces "strategy" as an intermediate semantic representation for kernel evolution, combined with strategy-level population initialization, roofline-guided prompting, and a strategy-coordinated evolutionary algorithm. Kernels generated by cuPilot achieve an average 3.09x speedup over PyTorch on a 100-kernel benchmark, with sophisticated optimizations and high utilization of critical hardware units on GEMM tasks.
Link: https://arxiv.org/abs/2512.16465
Authors: Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si, Yang Hu, Shouyi Yin, Jun Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that the kernels generated by cuPilot achieve an average speedup of 3.09x over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at this https URL.
[AI-34] AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research
【Quick Read】: This paper addresses the fragmentation, weak reproducibility, and integration complexity of AI computing resources in scientific work. The key is a federated compute platform that, through reproducible deployments, provides consistent and transparent access to physically distributed e-Infrastructures, together with a service catalogue covering the full machine learning lifecycle: interactive development environments, GPU-accelerated training with annotation tools and experiment tracking, federated learning support, and diverse deployment options across the Cloud Continuum. The platform integrates multiple model providers, datasets, and storage resources, provides tools for traceability and reproducibility of AI models, and is easily customizable to lower the adoption barrier for external user communities.
Link: https://arxiv.org/abs/2512.16455
Authors: Ignacio Heredia, Álvaro López García, Germán Moltó, Amanda Calatrava, Valentin Kozlov, Alessandro Costantini, Viet Tran, Mario David, Daniel San Martín, Marcin Płóciennik, Marta Obregón Ruiz, Saúl Fernandez, Judith Sáinz-Pardo Díaz, Miguel Caballer, Caterina Alarcón Marín, Stefan Dlugolinsky, Martin Šeleng, Lisana Berberi, Khadijeh Alibabaei, Borja Esteban Sanchis, Pedro Castro, Giacinto Donvito, Diego Aguirre, Sergio Langarita, Vicente Rodriguez, Leonhard Duda, Andrés Heredia Canales, Susana Rebolledo Ruiz, João Machado, Giang Nguyen, Fernando Aguilar Gómez, Jaime Díez
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:In this paper, we describe a federated compute platform dedicated to support Artificial Intelligence in scientific workloads. Putting the effort into reproducible deployments, it delivers consistent, transparent access to a federation of physically distributed e-Infrastructures. Through a comprehensive service catalogue, the platform is able to offer an integrated user experience covering the full Machine Learning lifecycle, including model development (with dedicated interactive development environments), training (with GPU resources, annotation tools, experiment tracking, and federated learning support) and deployment (covering a wide range of deployment options all along the Cloud Continuum). The platform also provides tools for traceability and reproducibility of AI models, integrates with different Artificial Intelligence model providers, datasets and storage resources, allowing users to interact with the broader Machine Learning ecosystem. Finally, it is easily customizable to lower the adoption barrier by external communities.
[AI-35] meSeries2Report prompting enables adaptive large language model management of lithium-ion batteries
【Quick Read】: This paper addresses the largely unexplored application of large language models (LLMs) to real-world battery energy storage system (BESS) operation and maintenance, i.e., how to let LLMs semantically interpret multivariate time series for reasoning and decision support. The key is TimeSeries2Report (TS2R), a prompting framework that encodes raw lithium-ion battery operational time series into structured, semantically enriched natural-language reports via segmentation, semantic abstraction, and rule-based interpretation, bridging low-level sensor signals and high-level contextual insights. Without retraining or architecture changes, TS2R-integrated LLMs consistently outperform vision-, embedding-, and text-based prompting baselines in accuracy, robustness, and explainability on downstream tasks such as anomaly detection, state-of-charge prediction, and charging/discharging management, achieving expert-level decision quality and predictive consistency and charting a practical path toward adaptive, LLM-driven battery intelligence.
Link: https://arxiv.org/abs/2512.16453
Authors: Jiayang Yang, Chunhui Zhao, Martin Guay, Zhixing Cao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) offer promising capabilities for interpreting multivariate time-series data, yet their application to real-world battery energy storage system (BESS) operation and maintenance remains largely unexplored. Here, we present TimeSeries2Report (TS2R), a prompting framework that converts raw lithium-ion battery operational time-series into structured, semantically enriched reports, enabling LLMs to reason, predict, and make decisions in BESS management scenarios. TS2R encodes short-term temporal dynamics into natural language through a combination of segmentation, semantic abstraction, and rule-based interpretation, effectively bridging low-level sensor signals with high-level contextual insights. We benchmark TS2R across both lab-scale and real-world datasets, evaluating report quality and downstream task performance in anomaly detection, state-of-charge prediction, and charging/discharging management. Compared with vision-, embedding-, and text-based prompting baselines, report-based prompting via TS2R consistently improves LLM performance in terms of across accuracy, robustness, and explainability metrics. Notably, TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining or architecture modification, establishing a practical path for adaptive, LLM-driven battery intelligence.
[AI-36] IoMT-based Automated Leukemia Classification using CNN and Higher Order Singular Value
【Quick Read】: This paper addresses the inefficiency and susceptibility to human error of traditional manual microscopy for the early diagnosis of Acute Lymphocytic Leukemia (ALL). To improve diagnostic speed and accuracy, the authors propose an intelligent classification framework on an Internet of Medical Things (IoMT) architecture whose key is combining a convolutional neural network (CNN) with a Higher Order Singular Value Decomposition (HOSVD)-based classifier to automatically distinguish ALL cells from normal cells in microscopic blood-smear images, enabling real-time communication between patients and clinicians; on the ALL-IDB2 dataset the method achieves an average test accuracy of 98.88%.
Link: https://arxiv.org/abs/2512.16448
Authors: Shabnam Bagheri Marzijarani, Mohammad Zolfaghari, Hedieh Sajedi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The Internet of Things (IoT) is a concept by which objects find identity and can communicate with each other in a network. One of the applications of the IoT is in the field of medicine, which is called the Internet of Medical Things (IoMT). Acute Lymphocytic Leukemia (ALL) is a type of cancer categorized as a hematic disease. It usually begins in the bone marrow due to the overproduction of immature White Blood Cells (WBCs or leukocytes). Since it has a high rate of spread to other body organs, it is a fatal disease if not diagnosed and treated early. Therefore, for identifying cancerous (ALL) cells in medical diagnostic laboratories, blood, as well as bone marrow smears, are taken by pathologists. However, manual examinations face limitations due to human error risk and time-consuming procedures. So, to tackle the mentioned issues, methods based on Artificial Intelligence (AI), capable of identifying cancer from non-cancer tissue, seem vital. Deep Neural Networks (DNNs) are the most efficient machine learning (ML) methods. These techniques employ multiple layers to extract higher-level features from the raw input. In this paper, a Convolutional Neural Network (CNN) is applied along with a new type of classifier, Higher Order Singular Value Decomposition (HOSVD), to categorize ALL and normal (healthy) cells from microscopic blood images. We employed the model on an IoMT structure to identify leukemia quickly and safely. With the help of this new leukemia classification framework, patients and clinicians can have real-time communication. The model was implemented on the Acute Lymphoblastic Leukemia Image Database (ALL-IDB2) and achieved an average accuracy of 98.88% in the test step.
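The HOSVD building block itself is standard; here is a compact NumPy sketch of the decomposition via mode unfoldings (the classifier built on top of it in the paper is omitted):

```python
import numpy as np

def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding: move axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor: np.ndarray):
    """Higher Order SVD: one factor matrix per mode (left singular vectors of
    each unfolding) plus the core tensor."""
    factors = [np.linalg.svd(unfold(tensor, k), full_matrices=False)[0]
               for k in range(tensor.ndim)]
    core = tensor
    for k, U in enumerate(factors):   # core = X  x_1 U1^T  x_2 U2^T  ...
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, k, 0), axes=1), 0, k)
    return core, factors

X = np.random.rand(4, 5, 6)
core, factors = hosvd(X)
recon = core                          # reconstruct: core  x_1 U1  x_2 U2  x_3 U3
for k, U in enumerate(factors):
    recon = np.moveaxis(np.tensordot(U, np.moveaxis(recon, k, 0), axes=1), 0, k)
print(np.allclose(X, recon))          # True up to floating point
```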
[AI-37] Towards AI-Supported Research: a Vision of the TIB AIssistant
【Quick Read】: This paper addresses the challenges of effectively integrating generative AI into research workflows: varying domain requirements, limited AI literacy among researchers, the complexity of coordinating multiple tools and agents, and the still-unclear accuracy of generative AI for research tasks. The key is the vision of the TIB AIssistant, a domain-agnostic human-machine collaborative platform whose modular components, including prompt and tool libraries, a shared data store, and a flexible orchestration framework, support tasks across the research life cycle from ideation and literature analysis to methodology development, data analysis, and scholarly writing; an early prototype demonstrates the feasibility and potential impact of the approach.
Link: https://arxiv.org/abs/2512.16447
Authors: Sören Auer, Allard Oelen, Mohamad Yaser Jaradeh, Mutahira Khalid, Farhana Keya, Sasi Kiran Gaddipati, Jennifer D'Souza, Lorenz Schlüter, Amirreza Alasti, Gollam Rabby, Azanzi Jiomekong, Oliver Karras
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid advancements in Generative AI and Large Language Models promise to transform the way research is conducted, potentially offering unprecedented opportunities to augment scholarly workflows. However, effectively integrating AI into research remains a challenge due to varying domain requirements, limited AI literacy, the complexity of coordinating tools and agents, and the unclear accuracy of Generative AI in research. We present the vision of the TIB AIssistant, a domain-agnostic human-machine collaborative platform designed to support researchers across disciplines in scientific discovery, with AI assistants supporting tasks across the research life cycle. The platform offers modular components - including prompt and tool libraries, a shared data store, and a flexible orchestration framework - that collectively facilitate ideation, literature analysis, methodology development, data analysis, and scholarly writing. We describe the conceptual framework, system architecture, and implementation of an early prototype that demonstrates the feasibility and potential impact of our approach.
[AI-38] E-SDS: Environment-aware See it, Do it, Sorted - Automated Environment-Aware Reinforcement Learning for Humanoid Locomotion
【Quick Read】: This paper addresses the "blindness" of current vision-language model (VLM)-based reward design for humanoid locomotion: lacking environmental perception, such methods cannot cope with the challenges of complex terrain. The key is E-SDS (Environment-aware See it, Do it, Sorted), a framework that integrates VLMs with real-time terrain-sensor analysis to automatically generate reward functions grounded by example videos, which are then used to train robust perceptive locomotion policies. On a Unitree G1 humanoid across four terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, where policies trained with manually designed rewards or a non-perceptive automated baseline failed; it also reduced velocity-tracking error by 51.9-82.6% and cut the human effort of reward design from days to under two hours.
Link: https://arxiv.org/abs/2512.16446
Authors: Enis Yalcin, Joshua O'Hara, Maria Stamatopoulou, Chengxu Zhou, Dimitrios Kanoulas
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures, 4 tables. Accepted at RiTA 2025 (Springer LNNS)
Abstract:Vision-language models (VLMs) show promise in automating reward design in humanoid locomotion, which could eliminate the need for tedious manual engineering. However, current VLM-based methods are essentially “blind”, as they lack the environmental perception required to navigate complex terrain. We present E-SDS (Environment-aware See it, Do it, Sorted), a framework that closes this perception gap. E-SDS integrates VLMs with real-time terrain sensor analysis to automatically generate reward functions that facilitate training of robust perceptive locomotion policies, grounded by example videos. Evaluated on a Unitree G1 humanoid across four distinct terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, while policies trained with manually-designed rewards or a non-perceptive automated baseline were unable to complete the task. In all terrains, E-SDS also reduced velocity tracking error by 51.9-82.6%. Our framework reduces the human effort of reward design from days to less than two hours while simultaneously producing more robust and capable locomotion policies.
[AI-39] StarCraft: Benchmarking Multi-agent Algorithms in Adversary Paradigm
[Quick Read]: This paper addresses the limited diversity and generalization in evaluating multi-agent reinforcement learning (MARL) algorithms caused by fixed built-in AI opponents. The core solution is a multi-agent algorithm-vs-algorithm adversarial environment, the StarCraft II battle arena (SC2BA), which supports direct adversary between algorithms with an emphasis on fairness, usability, and customizability, accompanied by the easy-to-use APyMARL library. Benchmarking on SC2BA in two modes, dual-algorithm paired adversary and multi-algorithm mixed adversary, exposes key issues in the effectivity, sensibility, and scalability of existing MARL algorithms, pushing MARL research toward more realistic and challenging adversarial settings.
Link: https://arxiv.org/abs/2512.16444
Authors: Yadong Li,Tong Zhang,Bo Huang,Zhen Cui
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 11 figures
Abstract:Deep multi-agent reinforcement learning (MARL) algorithms are booming in the field of collaborative intelligence, and StarCraft multi-agent challenge (SMAC) is widely-used as the benchmark therein. However, imaginary opponents of MARL algorithms are practically configured and controlled in a fixed built-in AI mode, which causes less diversity and versatility in algorithm evaluation. To address this issue, in this work, we establish a multi-agent algorithm-vs-algorithm environment, named StarCraft II battle arena (SC2BA), to refresh the benchmarking of MARL algorithms in an adversary paradigm. Taking StarCraft as infrastructure, the SC2BA environment is specifically created for inter-algorithm adversary with the consideration of fairness, usability and customizability, and meantime an adversarial PyMARL (APyMARL) library is developed with easy-to-use interfaces/modules. Grounding in SC2BA, we benchmark those classic MARL algorithms in two types of adversarial modes: dual-algorithm paired adversary and multi-algorithm mixed adversary, where the former conducts the adversary of pairwise algorithms while the latter focuses on the adversary to multiple behaviors from a group of algorithms. The extensive benchmark experiments exhibit some thought-provoking observations/problems in the effectivity, sensibility and scalability of these completed algorithms. The SC2BA environment as well as reproduced experiments are released on GitHub (this https URL), and we believe that this work could mark a new step for the MARL field in the coming years.
[AI-40] TIB AIssistant: a Platform for AI-Supported Research Across Research Life Cycles
[Quick Read]: This paper addresses the inefficiency and limited reproducibility of scholarly work caused by fragmented tooling across diverse research tasks, especially as AI becomes embedded in research workflows, asking how to systematically support researchers across the full life cycle from literature survey to paper writing. The key to the solution is the TIB AIssistant, an AI-supported research platform composed of multiple assistants dedicated to specific research tasks and integrated with external scholarly services. Research content is generated collaboratively with Generative AI, while outputs are stored as assets and exported as RO-Crate bundles to ensure transparency and reproducibility, laying the foundation for a community-maintained ecosystem of AI-supported research.
Link: https://arxiv.org/abs/2512.16442
Authors: Allard Oelen,Sören Auer
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapidly growing popularity of adopting Artificial Intelligence (AI), and specifically Large Language Models (LLMs), is having a widespread impact throughout society, including the academic domain. AI-supported research has the potential to support researchers with tasks across the entire research life cycle. In this work, we demonstrate the TIB AIssistant, an AI-supported research platform providing support throughout the research life cycle. The AIssistant consists of a collection of assistants, each responsible for a specific research task. In addition, tools are provided to give access to external scholarly services. Generated data is stored in the assets and can be exported as an RO-Crate bundle to provide transparency and enhance reproducibility of the research project. We demonstrate the AIssistant’s main functionalities by means of a sequential walk-through of assistants, interacting with each other to generate sections for a draft research paper. In the end, with the AIssistant, we lay the foundation for a larger agenda of providing a community-maintained platform for AI-supported research.
[AI-41] Emergent Bias and Fairness in Multi-Agent Decision Systems
[Quick Read]: This paper addresses the bias risks of deploying multi-agent predictive systems in finance, where the lack of effective fairness evaluation methodologies makes biased decisions in high-stakes settings such as consumer finance a direct source of regulatory breaches and financial loss. The key to the solution is a fairness evaluation methodology for the financial tabular domain: large-scale simulations across multi-agent configurations with varying communication and collaboration mechanisms reveal emergent bias patterns that cannot be traced to any individual agent component, showing that multi-agent systems exhibit genuinely collective behaviors. The study argues that multi-agent decision systems should be evaluated as holistic entities rather than through reductionist analyses of their constituent components, so that fairness risks are identified and managed as part of model risk.
Link: https://arxiv.org/abs/2512.16433
Authors: Maeve Madigan,Parameswaran Kamalaruban,Glenn Moynihan,Tom Kempton,David Sutton,Stuart Burrell
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-agent systems have demonstrated the ability to improve performance on a variety of predictive tasks by leveraging collaborative decision making. However, the lack of effective evaluation methodologies has made it difficult to estimate the risk of bias, making deployment of such systems unsafe in high stakes domains such as consumer finance, where biased decisions can translate directly into regulatory breaches and financial loss. To address this challenge, we need to develop fairness evaluation methodologies for multi-agent predictive systems and measure the fairness characteristics of these systems in the financial tabular domain. Examining fairness metrics using large-scale simulations across diverse multi-agent configurations, with varying communication and collaboration mechanisms, we reveal patterns of emergent bias in financial decision-making that cannot be traced to individual agent components, indicating that multi-agent systems may exhibit genuinely collective behaviors. Our findings highlight that fairness risks in financial multi-agent systems represent a significant component of model risk, with tangible impacts on tasks such as credit scoring and income estimation. We advocate that multi-agent decision systems must be evaluated as holistic entities rather than through reductionist analyses of their constituent components.
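The paper does not spell out its metric implementations, so the following is only a minimal sketch of one standard group-fairness measure (the demographic parity gap) applied at the system level as well as the single-agent level; the data, group labels, and decision rules below are synthetic placeholders, not the paper's financial simulations.

```python
import numpy as np

def demographic_parity_gap(decisions: np.ndarray, groups: np.ndarray) -> float:
    """Absolute gap in positive-decision rates between two groups (0/1)."""
    rate_a = decisions[groups == 0].mean()
    rate_b = decisions[groups == 1].mean()
    return abs(rate_a - rate_b)

# Hypothetical example: compare one agent against the multi-agent consensus.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=10_000)                 # protected attribute
single_agent = rng.random(10_000) < 0.5                  # one agent's approvals
consensus = single_agent & (rng.random(10_000) < 0.9)    # system-level decisions

print("single agent gap:", demographic_parity_gap(single_agent.astype(float), groups))
print("system gap:      ", demographic_parity_gap(consensus.astype(float), groups))
```

Evaluating the same metric on the consensus output rather than per agent is what makes emergent, collective bias visible at all.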
[AI-42] Introducing ORKG ASK: an AI-driven Scholarly Literature Search and Exploration System Taking a Neuro-Symbolic Approach
[Quick Read]: This paper addresses the growing difficulty researchers face in finding and exploring relevant literature as the volume of published scholarship keeps expanding. The key to the solution is ASK (Assistant for Scientific Knowledge), an AI-driven literature search and exploration system that takes a neuro-symbolic approach, combining vector search, Large Language Models (LLMs), and knowledge graphs. Through Retrieval-Augmented Generation (RAG), it automatically extracts key information and generates answers to natural-language research questions, providing active support to researchers.
Link: https://arxiv.org/abs/2512.16425
Authors: Allard Oelen,Mohamad Yaser Jaradeh,Sören Auer
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:As the volume of published scholarly literature continues to grow, finding relevant literature becomes increasingly difficult. With the rise of generative Artificial Intelligence (AI), and particularly Large Language Models (LLMs), new possibilities emerge to find and explore literature. We introduce ASK (Assistant for Scientific Knowledge), an AI-driven scholarly literature search and exploration system that follows a neuro-symbolic approach. ASK aims to provide active support to researchers in finding relevant scholarly literature by leveraging vector search, LLMs, and knowledge graphs. The system allows users to input research questions in natural language and retrieve relevant articles. ASK automatically extracts key information and generates answers to research questions using a Retrieval-Augmented Generation (RAG) approach. We present an evaluation of ASK, assessing the system’s usability and usefulness. Findings indicate that the system is user-friendly and users are generally satisfied while using the system.
[AI-43] Synthelite: Chemist-aligned and feasibility-aware synthesis planning with LLM s
[Quick Read]: This paper addresses the lack of interaction mechanisms with human experts in existing computer-aided synthesis planning (CASP) frameworks, which limits the incorporation of chemists' experience and intuition. The key to the solution is the Synthelite framework, which uses large language models (LLMs) to directly propose retrosynthetic transformations and allows expert intervention through natural-language prompts, so that synthesis routes can flexibly adapt to user-specified strategy or starting-material constraints while accounting for chemical feasibility during route design, achieving success rates of up to 95% under diverse constraints.
Link: https://arxiv.org/abs/2512.16424
Authors: Nguyen Xuan-Vu,Daniel Armstrong,Milena Wehrbach,Andres M Bran,Zlatko Jončev,Philippe Schwaller
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Computer-aided synthesis planning (CASP) has long been envisioned as a complementary tool for synthetic chemists. However, existing frameworks often lack mechanisms to allow interaction with human experts, limiting their ability to integrate chemists’ insights. In this work, we introduce Synthelite, a synthesis planning framework that uses large language models (LLMs) to directly propose retrosynthetic transformations. Synthelite can generate end-to-end synthesis routes by harnessing the intrinsic chemical knowledge and reasoning capabilities of LLMs, while allowing expert intervention through natural language prompts. Our experiments demonstrate that Synthelite can flexibly adapt its planning trajectory to diverse user-specified constraints, achieving up to 95% success rates in both strategy-constrained and starting-material-constrained synthesis tasks. Additionally, Synthelite exhibits the ability to account for chemical feasibility during route design. We envision Synthelite to be both a useful tool and a step toward a paradigm where LLMs are the central orchestrators of synthesis planning.
[AI-44] Hypernetworks That Evolve Themselves
[Quick Read]: This paper addresses the question of how neural networks can evolve themselves without an external optimizer. Traditional approaches rely on external optimization algorithms (such as gradient descent) to adjust parameters; this work instead proposes Self-Referential Graph HyperNetworks (Self-Referential GHNs), whose core innovation is to embed the machinery of variation and inheritance directly into the network. By uniting hypernetworks, stochastic parameter generation, and graph-based representations, the system autonomously mutates and evaluates itself while adapting mutation rates as selectable traits. On reinforcement learning benchmarks with environmental shifts (CartPoleSwitch, LunarLander-Switch) the design shows swift, reliable adaptation, and on the Ant-v5 locomotion benchmark it evolves coherent gaits, autonomously decreasing population variation to concentrate around promising solutions and demonstrating fine-tuning capability. The scheme exploits neural self-reference so that evolvability itself becomes an internally evolvable property, moving synthetic systems closer to biological evolution.
Link: https://arxiv.org/abs/2512.16406
Authors: Joachim Winther Pedersen,Erwan Plantec,Eleni Nisioti,Marcello Barylli,Milton Montero,Kathrin Korte,Sebastian Risi
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:
Abstract:How can neural networks evolve themselves without relying on external optimizers? We propose Self-Referential Graph HyperNetworks, systems where the very machinery of variation and inheritance is embedded within the network. By uniting hypernetworks, stochastic parameter generation, and graph-based representations, Self-Referential GHNs mutate and evaluate themselves while adapting mutation rates as selectable traits. Through new reinforcement learning benchmarks with environmental shifts (CartPoleSwitch, LunarLander-Switch), Self-Referential GHNs show swift, reliable adaptation and emergent population dynamics. In the locomotion benchmark Ant-v5, they evolve coherent gaits, showing promising fine-tuning capabilities by autonomously decreasing variation in the population to concentrate around promising solutions. Our findings support the idea that evolvability itself can emerge from neural self-reference. Self-Referential GHNs reflect a step toward synthetic systems that more closely mirror biological evolution, offering tools for autonomous, open-ended learning agents.
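As a rough intuition for "mutation rates as selectable traits", here is a minimal self-adaptive evolution loop in the classical evolution-strategy style: each individual carries its own mutation rate, which is itself mutated and inherited. This illustrates only that one idea; the paper's hypernetwork machinery, graph representation, and RL benchmarks are not modeled, and the toy objective is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(params: np.ndarray) -> float:
    # Toy objective standing in for an RL return: maximize -||params - target||^2.
    target = np.ones_like(params)
    return -float(np.sum((params - target) ** 2))

# Each individual carries its own mutation rate (a self-adaptive, selectable trait).
pop = [{"params": rng.normal(size=8), "sigma": 0.5} for _ in range(32)]

for gen in range(100):
    scored = sorted(pop, key=lambda ind: fitness(ind["params"]), reverse=True)
    parents = scored[: len(pop) // 4]                      # truncation selection
    pop = []
    for _ in range(32):
        p = parents[rng.integers(len(parents))]
        sigma = p["sigma"] * np.exp(0.2 * rng.normal())    # mutate the mutation rate
        child = p["params"] + sigma * rng.normal(size=8)   # mutate with own rate
        pop.append({"params": child, "sigma": sigma})

best = max(pop, key=lambda ind: fitness(ind["params"]))
print(f"best fitness={fitness(best['params']):.3f}, evolved sigma={best['sigma']:.3f}")
```

Under selection pressure, sigma typically shrinks as the population converges, mirroring the paper's observation that variation decreases autonomously around promising solutions.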
[AI-45] PCIA: A Path Construction Imitation Algorithm for Global Optimization
[Quick Read]: This paper addresses the limited global-search efficiency and convergence of complex optimization, where traditional metaheuristics tend to get trapped in local optima of multimodal functions or converge slowly. The key to the solution is a new metaheuristic inspired by how humans construct paths, the Path Construction Imitation Algorithm (PCIA): it imitates people's preference for popular routes, their intelligent mixing of existing paths to build a new one when a route is closed, and their random exploration toward unknown destinations, thereby evolving a population of individuals (each representing a path) while maintaining diversity. Experiments on 53 mathematical optimization problems and 13 constrained optimization problems show PCIA to be robust and highly competitive against both popular and recent metaheuristic algorithms.
Link: https://arxiv.org/abs/2512.16392
Authors: Mohammad-Javad Rezaei,Mozafar Bag-Mohammadi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:In this paper, a new metaheuristic optimization algorithm, called Path Construction Imitation Algorithm (PCIA), is proposed. PCIA is inspired by how humans construct new paths and use them. Typically, humans prefer popular transportation routes. In the event of a path closure, a new route is built by mixing the existing paths intelligently. Also, humans select different pathways on a random basis to reach unknown destinations. PCIA generates a random population to find the best route toward the destination, similar to swarm-based algorithms. Each particle represents a path toward the destination. PCIA has been tested with 53 mathematical optimization problems and 13 constrained optimization problems. The results showed that the PCIA is highly competitive compared to both popular and the latest metaheuristic algorithms.
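The paper's exact operators are not reproduced here; the sketch below only illustrates, on a toy objective, the three behaviors the summary describes (following popular routes, mixing existing paths when a road is "closed", and random exploration), with all proportions and noise scales chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(2)

def sphere(x: np.ndarray) -> float:   # benchmark objective to minimize
    return float(np.sum(x ** 2))

dim, n = 10, 40
paths = rng.uniform(-5, 5, size=(n, dim))    # each row is a candidate "path"

for it in range(200):
    cost = np.array([sphere(p) for p in paths])
    popular = paths[np.argsort(cost)[: n // 4]]   # well-travelled (good) routes
    new_paths = [popular[0].copy()]               # keep the best route
    while len(new_paths) < n:
        r = rng.random()
        if r < 0.6:                               # follow a popular route with noise
            base = popular[rng.integers(len(popular))]
            new_paths.append(base + 0.3 * rng.normal(size=dim))
        elif r < 0.9:                             # "closed road": mix two paths
            a, b = popular[rng.integers(len(popular), size=2)]
            mix = rng.random(dim)
            new_paths.append(mix * a + (1 - mix) * b)
        else:                                     # random exploration
            new_paths.append(rng.uniform(-5, 5, size=dim))
    paths = np.array(new_paths)

print("best cost:", min(sphere(p) for p in paths))
```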
[AI-46] Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
[Quick Read]: This paper addresses the latency bottleneck that attention imposes on long-context large language model (LLM) inference, which is especially pronounced for reasoning models and retrieval-augmented generation (RAG). The key to the solution is Kascade, a training-free sparse-attention method whose core ideas are: exploiting the facts that post-softmax attention is intrinsically sparse and that the identity of high-weight keys is stable across nearby layers; computing exact Top-k indices in a small set of anchor layers selected by a dynamic-programming algorithm and reusing those indices in intermediate reuse layers; and making the Top-k selection head-aware to preserve accuracy, with efficient implementations for both prefill and decode. The method achieves up to 4.1x decode and 2.2x prefill attention speedups over a FlashAttention-3 baseline on H100 GPUs while closely matching dense-attention accuracy on long-context benchmarks such as LongBench and AIME-24.
Link: https://arxiv.org/abs/2512.16391
Authors: Dhruv Deshmukh,Saurabh Goyal,Nipun Kwatra,Ramachandran Ramjee
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 11 pages, 8 figures, 3 tables and 1 algorithm
Abstract:Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
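To make the anchor/reuse idea concrete, here is a minimal single-query, single-head decode sketch: exact Top-k key indices are computed once in an anchor layer and then reused in later layers. Real Kascade selects anchor layers by dynamic programming, is head-aware, and covers prefill too; the random K/V matrices below are placeholders.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def topk_attention(q, K, V, idx):
    """Attend only over the keys selected by idx (a 1-D index array)."""
    scores = (K[idx] @ q) / np.sqrt(q.shape[0])
    return softmax(scores) @ V[idx]

rng = np.random.default_rng(3)
d, seq, k = 64, 4096, 256
q = rng.normal(size=d)
layers = [(rng.normal(size=(seq, d)), rng.normal(size=(seq, d))) for _ in range(4)]

# Anchor layer: compute exact Top-k key indices from the full score vector.
K0, V0 = layers[0]
full_scores = (K0 @ q) / np.sqrt(d)
idx = np.argpartition(full_scores, -k)[-k:]   # exact Top-k (unordered)
out = topk_attention(q, K0, V0, idx)

# Reuse layers: skip the O(seq) scoring and reuse the anchor's indices.
for K_l, V_l in layers[1:]:
    out = topk_attention(q, K_l, V_l, idx)
```

The saving comes from the reuse layers touching only k of the seq keys; the cross-layer stability of high-weight keys is what makes that reuse accurate.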
[AI-47] AI Needs Physics More Than Physics Needs AI
[Quick Read]: This paper examines why AI's measurable real-world impact remains modest, particularly its lack of interpretability, robustness, and ability to capture basic physical laws in science and engineering. Its central argument is that despite breakthroughs in Generative AI and large language models, current architectures can depend on trillions of meaningless parameters, suffer from distributional bias, lack uncertainty quantification, and offer no mechanistic insight, making them ill-suited to genuine scientific discovery. The key to the proposed way forward is "Big AI": a synthesis of theory-based rigour with the flexibility of machine learning, with AI architectures rethought through emerging paradigms such as quantum and analogue computing, enabling a shift from empirical curve-fitting to modelling physical law.
Link: https://arxiv.org/abs/2512.16344
Authors: Peter Coveney,Roger Highfield
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Artificial intelligence (AI) is commonly depicted as transformative. Yet, after more than a decade of hype, its measurable impact remains modest outside a few high-profile scientific and commercial successes. The 2024 Nobel Prizes in Chemistry and Physics recognized AI’s potential, but broader assessments indicate the impact to date is often more promotional than technical. We argue that while current AI may influence physics, physics has significantly more to offer this generation of AI. Current architectures - large language models, reasoning models, and agentic AI - can depend on trillions of meaningless parameters, suffer from distributional bias, lack uncertainty quantification, provide no mechanistic insights, and fail to capture even elementary scientific laws. We review critiques of these limits, highlight opportunities in quantum AI and analogue computing, and lay down a roadmap for the adoption of ‘Big AI’: a synthesis of theory-based rigour with the flexibility of machine learning.
[AI-48] Pretrained Battery Transformer (PBT): A battery life prediction foundation model
[Quick Read]: This paper addresses the poor generalization of early lithium-ion battery (LIB) cycle-life prediction caused by data scarcity and heterogeneity across aging conditions, which confines existing machine learning methods to narrow scenarios. The key to the solution is the first foundation model (FM) for battery life prediction, the Pretrained Battery Transformer (PBT), whose core innovation is domain-knowledge-encoded mixture-of-expert layers that learn unified representations from heterogeneous battery data. Pretrained on the largest public battery life database spanning 13 LIB datasets, PBT outperforms existing models by an average of 19.8%, and with transfer learning achieves state-of-the-art performance across 15 diverse datasets covering varied operating conditions, formation protocols, and chemistries, laying the groundwork for universal battery lifetime prediction systems.
Link: https://arxiv.org/abs/2512.16334
Authors: Ruifeng Tan,Weixiang Hong,Jia Li,Jiaqiang Huang,Tong-Yi Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 figures in the main content
Abstract:Early prediction of battery cycle life is essential for accelerating battery research, manufacturing, and deployment. Although machine learning methods have shown encouraging results, progress is hindered by data scarcity and heterogeneity arising from diverse aging conditions. In other fields, foundation models (FMs) trained on diverse datasets have achieved broad generalization through transfer learning, but no FMs have been reported for battery cycle life prediction yet. Here we present the Pretrained Battery Transformer (PBT), the first FM for battery life prediction, developed through domain-knowledge-encoded mixture-of-expert layers. Validated on the largest public battery life database, PBT learns transferable representations from 13 lithium-ion battery (LIB) datasets, outperforming existing models by an average of 19.8%. With transfer learning, PBT achieves state-of-the-art performance across 15 diverse datasets encompassing various operating conditions, formation protocols, and chemistries of LIBs. This work establishes a foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems.
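The abstract names domain-knowledge-encoded mixture-of-expert layers as the core mechanism. The sketch below shows only the generic dense-MoE forward pass (a gate-weighted sum of experts); how PBT encodes battery domain knowledge into its experts is not inferable from the abstract, and the linear experts and feature dimensions here are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class MoELayer:
    """Minimal dense mixture-of-experts layer: gate-weighted sum of experts."""
    def __init__(self, d_in: int, d_out: int, n_experts: int, rng):
        self.W_gate = rng.normal(scale=0.1, size=(d_in, n_experts))
        self.experts = [rng.normal(scale=0.1, size=(d_in, d_out))
                        for _ in range(n_experts)]

    def __call__(self, x: np.ndarray) -> np.ndarray:
        gates = softmax(x @ self.W_gate)                  # routing weights
        return sum(g * (x @ W) for g, W in zip(gates, self.experts))

rng = np.random.default_rng(4)
layer = MoELayer(d_in=16, d_out=8, n_experts=4, rng=rng)
features = rng.normal(size=16)   # e.g., summary features of early cycling data
print(layer(features).shape)     # (8,)
```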
[AI-49] Design and Evaluation of Cost-Aware PoQ for Decentralized LLM Inference
[Quick Read]: This paper addresses the inefficiency of verification in decentralized large language model (LLM) inference under heterogeneous compute, where existing cryptographic verification of computation fails to scale to modern large models. The key to the solution is a cost-aware Proof of Quality (PoQ) framework that integrates explicit efficiency measurements (such as quality per unit latency) into the reward mechanism: it combines token-level F1, lightweight learned evaluators, and GPT-based judgments in a unified evaluation pipeline, and adopts a linear reward function that balances normalized quality and cost. Experiments show this mechanism consistently rewards high-quality, low-latency inference nodes and efficient evaluator nodes, offering a practical path to economically sustainable decentralized LLM inference.
Link: https://arxiv.org/abs/2512.16317
Authors: Arther Tian,Alex Ding,Frank Chen,Alan Wu,Aaron Chan,Bruce Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Decentralized large language model (LLM) inference promises transparent and censorship resistant access to advanced AI, yet existing verification approaches struggle to scale to modern models. Proof of Quality (PoQ) replaces cryptographic verification of computation with consensus over output quality, but the original formulation ignores heterogeneous computational costs across inference and evaluator nodes. This paper introduces a cost-aware PoQ framework that integrates explicit efficiency measurements into the reward mechanism for both types of nodes. The design combines ground truth token level F1, lightweight learned evaluators, and GPT based judgments within a unified evaluation pipeline, and adopts a linear reward function that balances normalized quality and cost. Experiments on extractive question answering and abstractive summarization use five instruction tuned LLMs ranging from TinyLlama-1.1B to Llama-3.2-3B and three evaluation models spanning cross encoder and bi encoder architectures. Results show that a semantic textual similarity bi encoder achieves much higher correlation with both ground truth and GPT scores than cross encoders, indicating that evaluator architecture is a critical design choice for PoQ. Quality-cost analysis further reveals that the largest models in the pool are also the most efficient in terms of quality per unit latency. Monte Carlo simulations over 5,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high quality low cost inference models and to efficient evaluators, while penalizing slow low quality nodes. These findings suggest that cost-aware PoQ provides a practical foundation for economically sustainable decentralized LLM inference.
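A minimal sketch of the linear reward the abstract describes, balancing normalized quality against normalized cost (with latency as the cost proxy); the weights, normalization bounds, and example numbers are illustrative assumptions, not the paper's calibrated values.

```python
def node_reward(quality: float, latency_s: float,
                q_max: float = 1.0, lat_max: float = 10.0,
                w_quality: float = 0.7, w_cost: float = 0.3) -> float:
    """Linear reward: pay for normalized quality, charge for normalized cost."""
    q_norm = min(quality / q_max, 1.0)
    c_norm = min(latency_s / lat_max, 1.0)   # latency as the cost proxy
    return w_quality * q_norm - w_cost * c_norm

# A strong, fast node beats a slow, low-quality one under this scheme.
print(node_reward(quality=0.85, latency_s=2.0))   # 0.535
print(node_reward(quality=0.40, latency_s=9.0))   # 0.010
```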
[AI-50] Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks
[Quick Read]: This paper addresses the security risks that prompt injection attacks, particularly goal-hijacking, pose to small open-source large language models (LLMs) in real deployments. The key to the solution is a defense mechanism seeded with a Chain-of-Thoughts defense and refined iteratively to generate defense prompts, systematically improving detection against a comprehensive set of benchmarked attacks. Experiments show the approach markedly reduces attack success and false-detection rates while effectively identifying goal hijacking, enabling more secure and efficient deployment of small open-source LLMs in resource-constrained environments.
Link: https://arxiv.org/abs/2512.16307
Authors: Safwan Shaheer,G.M. Refatul Islam,Mohammad Rafid Hamid,Tahsin Zaman Jilan
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures
Abstract:In this fast-evolving area of LLMs, our paper discusses the significant security risk presented by prompt injection attacks. It focuses on small open-sourced models, specifically the LLaMA family of models. We introduce novel defense mechanisms capable of generating automatic defenses and systematically evaluate said generated defenses against a comprehensive set of benchmarked attacks. Thus, we empirically demonstrated the improvement proposed by our approach in mitigating goal-hijacking vulnerabilities in LLMs. Our work recognizes the increasing relevance of small open-sourced LLMs and their potential for broad deployments on edge devices, aligning with future trends in LLM applications. We contribute to the greater ecosystem of open-source LLMs and their security in the following: (1) assessing present prompt-based defenses against the latest attacks, (2) introducing a new framework using a seed defense (Chain Of Thoughts) to refine the defense prompts iteratively, and (3) showing significant improvements in detecting goal hijacking attacks. Our strategies significantly reduce the success rates of the attacks and false detection rates while at the same time effectively detecting goal-hijacking capabilities, paving the way for more secure and efficient deployments of small and open-source LLMs in resource-constrained environments.
[AI-51] Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection
[Quick Read]: This paper addresses the difficulty of unifying low-level, semantics-agnostic forgery artifacts with high-level semantic knowledge in image forgery detection (IFD): the two information streams are highly heterogeneous in both paradigm and reasoning, so existing methods struggle to unify them or to model their cross-level interactions. The key to the solution is ForenAgent, a multi-round interactive IFD framework that lets a multimodal large language model (MLLM) autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, achieving more flexible and interpretable forgery analysis. The framework follows a two-stage training pipeline (Cold Start plus reinforcement fine-tuning) and instantiates a human-inspired dynamic reasoning loop (global perception, local focusing, iterative probing, and holistic adjudication) as both a data-sampling strategy and a task-aligned process reward; the FABench dataset is constructed for systematic training and evaluation, and experiments show emergent tool use and reflective reasoning on challenging IFD tasks.
Link: https://arxiv.org/abs/2512.16300
Authors: Fanrui Zhang,Qiang Zhang,Sizhuo Zhou,Jianwen Sun,Chuanhao Li,Jiaxin Ai,Yukang Feng,Yujie Zhang,Wenjie Li,Zizhen Li,Yifan Chang,Jiawei Liu,Kaipeng Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures
Abstract:Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning to enhance its tool interaction capability and reasoning adaptability progressively. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.
[AI-52] Feature-Selective Representation Misdirection for Machine Unlearning
[Quick Read]: This paper addresses the privacy leakage, compliance risk, and misuse caused by large language models (LLMs) retaining sensitive or prohibited knowledge when deployed in safety-critical and regulated domains. Existing machine unlearning methods assume a clean separation between forget and retain data, but in practice the two distributions are highly entangled, so perturbation-based methods either sharply degrade general utility or fail to ensure safety. The key to the solution is Selective Representation Misdirection for Unlearning (SRMU), which pairs a structured misdirection vector with an activation importance map to perform feature-aware, directionally controlled activation editing, selectively suppressing harmful representations while preserving utility on benign tasks. On the WMDP benchmark the method delivers state-of-the-art unlearning in both low- and high-entanglement settings and remains effective under 20-30% data overlap, providing a robust foundation for safety governance, privacy compliance, and controlled knowledge removal in LLM-driven applications.
Link: https://arxiv.org/abs/2512.16297
Authors: Taozhao Chen,Linghan Huang,Kim-Kwang Raymond Choo,Huaming Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:As large language models (LLMs) are increasingly adopted in safety-critical and regulated sectors, the retention of sensitive or prohibited knowledge introduces escalating risks, ranging from privacy leakage to regulatory non-compliance to potential misuse. Recent studies suggest that machine unlearning can help ensure deployed models comply with evolving legal, safety, and governance requirements. However, current unlearning techniques assume clean separation between forget and retain datasets, which is challenging in operational settings characterized by highly entangled distributions. In such scenarios, perturbation-based methods often degrade general model utility or fail to ensure safety. To address this, we propose Selective Representation Misdirection for Unlearning (SRMU), a novel principled activation-editing framework that enforces feature-aware and directionally controlled perturbations. Unlike indiscriminate model weights perturbations, SRMU employs a structured misdirection vector with an activation importance map. The goal is to allow SRMU selectively suppresses harmful representations while preserving the utility on benign ones. Experiments are conducted on the widely used WMDP benchmark across low- and high-entanglement configurations. Empirical results reveal that SRMU delivers state-of-the-art unlearning performance with minimal utility losses, and remains effective under 20-30% overlap where existing baselines collapse. SRMU provides a robust foundation for safety-driven model governance, privacy compliance, and controlled knowledge removal in the emerging LLM-based applications. We release the replication package at this https URL.
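SRMU's exact update is not spelled out in this summary, so the following is a speculative sketch of directionally controlled activation editing: the component of an activation along a "misdirection" direction is damped, gated by a per-feature importance map. In SRMU the direction and map would be estimated from forget-set statistics; here they are random placeholders.

```python
import numpy as np

def misdirect(h: np.ndarray, v: np.ndarray, importance: np.ndarray,
              alpha: float = 1.0) -> np.ndarray:
    """Steer activations away from a harmful direction v, scaled per feature.

    h: activation vector; v: misdirection direction (normalized inside);
    importance: per-feature weights in [0, 1] selecting where to edit.
    """
    v = v / np.linalg.norm(v)
    harmful_component = np.dot(h, v) * v        # projection of h onto v
    return h - alpha * importance * harmful_component

rng = np.random.default_rng(5)
h = rng.normal(size=128)                        # a hidden activation
v = rng.normal(size=128)                        # placeholder harmful direction
importance = (rng.random(128) > 0.7).astype(float)  # edit only salient features
h_edited = misdirect(h, v, importance)
```

The importance gating is what makes the perturbation selective: features that matter only for benign behavior are left untouched.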
[AI-53] OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
[Quick Read]: This paper addresses the error accumulation of vision-language-model (VLM) powered computer-using agents (CUAs) in long-horizon tasks due to unreliable step-level decision-making, where irreversible graphical user interface (GUI) actions can cause serious consequences. The key to the solution is the OS-Oracle framework, comprising: (1) a scalable cross-platform pipeline for synthesizing high-quality, diverse GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) with consistency-preserving group relative policy optimization (CP-GRPO) to improve the assessment of each action before execution; and (3) OS-Critic Bench, a benchmark for holistic evaluation of critic models across Mobile, Web, and Desktop platforms. With this framework the authors curate a 310k-sample critic dataset and train OS-Oracle-7B, which achieves state-of-the-art results among open-source VLMs, surpasses proprietary models on the mobile domain, and, used as a pre-critic, raises the task success rates of native GUI agents such as UI-TARS-1.5-7B in complex environments.
Link: https://arxiv.org/abs/2512.16295
Authors: Zhenyu Wu,Jingjing Xie,Zehao Li,Bowen Yang,Qiushi Sun,Zhaoyang Liu,Zhoumianze Liu,Yu Qiao,Xiangyu Yue,Zun Wang,Zichen Ding
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle that makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench, and surpasses proprietary models on the mobile domain. Furthermore, when serving as a pre-critic, OS-Oracle-7B improves the performance of native GUI agents such as UI-TARS-1.5-7B in OSWorld and AndroidWorld environments. The code is open-sourced at this https URL.
[AI-54] CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity
[Quick Read]: This paper addresses the limitation that mainstream post-training quantization (PTQ) methods for large language models (LLMs) apply a uniform quantization strategy across all layers, ignoring how differently individual layers suit different quantization algorithms. The key to the solution is CKA Guided Modular Quantization, a fine-tuning-free, plug-and-play framework that uses Linear Centered Kernel Alignment (CKA) as the evaluation metric: multiple PTQ algorithms are tested independently on each layer, the optimal strategy is selected automatically per layer, and the per-layer choices are fused into a hybrid quantized model, enabling fine-grained, algorithmically heterogeneous quantization.
Link: https://arxiv.org/abs/2512.16282
Authors: Jinhao Zhang,Yunquan Zhang,Daning Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current mainstream post-training quantization methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmic heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs including LLaMA and Qwen, in terms of perplexity (PPL) and downstream task performance.
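Linear CKA itself is standard (Kornblith et al., 2019) and easy to state in code. The sketch below scores hypothetical per-layer candidates so the best-aligned quantization can be kept, mirroring the selection loop the abstract describes; the candidate activations are simulated with noise rather than produced by real PTQ algorithms.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))

# Score each candidate PTQ algorithm by how well the quantized layer's
# activations align with the full-precision ones, then keep the best.
rng = np.random.default_rng(6)
fp_acts = rng.normal(size=(512, 256))       # full-precision layer output
candidates = {"algA": fp_acts + 0.05 * rng.normal(size=fp_acts.shape),
              "algB": fp_acts + 0.50 * rng.normal(size=fp_acts.shape)}
best = max(candidates, key=lambda k: linear_cka(fp_acts, candidates[k]))
print(best)  # "algA": the smaller perturbation preserves similarity
```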
[AI-55] Love, Lies, and Language Models: Investigating AI's Role in Romance-Baiting Scams
[Quick Read]: This paper addresses the misuse of generative AI in romance-baiting scams, specifically how criminal syndicates can use it to automate highly convincing emotional manipulation and financial fraud. The study finds that 87% of present scam labor consists of systematized conversational tasks readily automatable by large language models (LLMs), that in a week-long blinded study an LLM agent elicited greater trust than human operators (p=0.007) and achieved higher victim compliance (46% vs. 18%), and that popular content-safety filters detected 0.0% of such dialogues. The key contribution is to expose the current deployment and effectiveness of LLMs in these scams and the severe failure of existing defenses, underscoring the need for new detection and protection strategies against this highly tailored, psychologically manipulative text interaction.
Link: https://arxiv.org/abs/2512.16280
Authors: Gilad Gressel,Rahul Pankajakshan,Shir Rozenfeld,Ling Li,Ivan Franceschini,Krishnahsree Achuthan,Yisroel Mirsky
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Journal reference: USENIX Security Symposium 2026
Abstract:Romance-baiting scams have become a major source of financial and emotional harm worldwide. These operations are run by organized crime syndicates that traffic thousands of people into forced labor, requiring them to build emotional intimacy with victims over weeks of text conversations before pressuring them into fraudulent cryptocurrency investments. Because the scams are inherently text-based, they raise urgent questions about the role of Large Language Models (LLMs) in both current and future automation. We investigate this intersection by interviewing 145 insiders and 5 scam victims, performing a blinded long-term conversation study comparing LLM scam agents to human operators, and executing an evaluation of commercial safety filters. Our findings show that LLMs are already widely deployed within scam organizations, with 87% of scam labor consisting of systematized conversational tasks readily susceptible to automation. In a week-long study, an LLM agent not only elicited greater trust from study participants (p=0.007) but also achieved higher compliance with requests than human operators (46% vs. 18% for humans). Meanwhile, popular safety filters detected 0.0% of romance baiting dialogues. Together, these results suggest that romance-baiting scams may be amenable to full-scale LLM automation, while existing defenses remain inadequate to prevent their expansion.
[AI-56] Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
[Quick Read]: This paper addresses the unreliability of large language models deployed as judges (LaaJ) for industrial code evaluation, where domain-specific issues are overlooked, with COBOL legacy-code modernization exposing systematic blind spots even in production-deployed judges. The key to the solution is a lightweight analytic checker built from expert knowledge and a preliminary taxonomy of over 30 domain-specific errors; the checker's outputs are injected as hints into the judge's prompt so the LaaJ revisits aspects it may have overlooked, forming an analytic-LLM hybrid. Experiments on 100 programs with four production-level judges show error coverage rising from about 45% for LaaJ alone to up to 94% for the hybrid, with markedly more accurate and deeper explanations, validating that domain-knowledge-guided prompting can substantially enhance evaluation reliability in deployed pipelines.
Link: https://arxiv.org/abs/2512.16272
Authors: Ora Nova Fandina,Eitan Farchi,Shmulik Froimovich,Raviv Gal,Wesam Ibraheem,Rami Katan,Alice Podolsky
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain specific issues raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production deployed LaaJs can miss domain critical errors, revealing consistent blind spots in their evaluation capabilities. To better understand these blind spots, we analyze generated COBOL programs and associated LaaJs judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags over 30 domain specific issues observed in practice. We use its outputs as analytic hints, dynamically injecting them into the judges prompt to encourage LaaJ to revisit aspects it may have overlooked. Experiments on a test set of 100 programs using four production level LaaJs show that LaaJ alone detects only about 45% of the errors present in the code (in all judges we tested), while the analytic checker alone lacks explanatory depth. When combined, the LaaJ+Hints configuration achieves up to 94% coverage (for the best performing judge and injection prompt) and produces qualitatively richer, more accurate explanations, demonstrating that analytic-LLM hybrids can substantially enhance evaluation reliability in deployed pipelines. We release the dataset and all used prompts.
[AI-57] Domain-Agnostic Causal-Aware Audio Transformer for Infant Cry Classification
[Quick Read]: This paper addresses the accuracy and interpretability of classifying paralinguistics in infant cries, where existing deep learning methods rely on correlation-driven acoustic representations and are vulnerable to noise, spurious cues, and domain shifts across recording environments. The key to the solution is DACH-TIC, a domain-agnostic, causal-aware hierarchical audio transformer: it approximates counterfactual acoustic variation through causal attention masking and controlled perturbation training; it combines multi-task supervision (cry-type recognition, distress-intensity estimation, and causal relevance prediction) with adversarial domain generalization to promote environment-invariant representations; and it adopts a hierarchical encoder structure (local token-level and global semantic encoders) for robustness. Experiments show the method outperforms strong baselines across datasets and generalizes to unseen acoustic environments with a domain performance gap of only 2.4%, demonstrating clinical practicality.
Link: https://arxiv.org/abs/2512.16271
Authors: Geofrey Owino,Bernard Shibwabo Kasamani,Ahmed M. Abdelmoniem,Edem Wornyo
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Published in the IEEE proceedings of the 8th International Conference of Computer and Informatics Engineering (IC2IE). DOI: 10.1109/IC2IE67206.2025.11283358
Abstract:Accurate and interpretable classification of infant cry paralinguistics is essential for early detection of neonatal distress and clinical decision support. However, many existing deep learning methods rely on correlation-driven acoustic representations, which makes them vulnerable to noise, spurious cues, and domain shifts across recording environments. We propose DACH-TIC, a Domain-Agnostic Causal-Aware Hierarchical Audio Transformer for robust infant cry classification. The model integrates causal attention, hierarchical representation learning, multi-task supervision, and adversarial domain generalization within a unified framework. DACH-TIC employs a structured transformer backbone with local token-level and global semantic encoders, augmented by causal attention masking and controlled perturbation training to approximate counterfactual acoustic variations. A domain-adversarial objective promotes environment-invariant representations, while multi-task learning jointly optimizes cry type recognition, distress intensity estimation, and causal relevance prediction. The model is evaluated on the Baby Chillanto and Donate-a-Cry datasets, with ESC-50 environmental noise overlays for domain augmentation. Experimental results show that DACH-TIC outperforms state-of-the-art baselines, including HTS-AT and SE-ResNet Transformer, achieving improvements of 2.6 percent in accuracy and 2.2 points in macro-F1 score, alongside enhanced causal fidelity. The model generalizes effectively to unseen acoustic environments, with a domain performance gap of only 2.4 percent, demonstrating its suitability for real-world neonatal acoustic monitoring systems.
[AI-58] Learning to Wait: Synchronizing Agents with the Physical World
[Quick Read]: This paper addresses the fundamental mismatch between real-world agentic tasks and synchronous Markov Decision Processes (MDPs): non-blocking actions create a temporal gap between initiation and completion, and existing environment-side fixes (blocking wrappers or frequent polling) either limit scalability or dilute the agent's context window. The key to the solution is an agent-side approach that lets large language models (LLMs) actively align their Cognitive Timeline with the physical world: the Code-as-Action paradigm is extended to the temporal domain, with semantic priors and in-context learning (ICL) used to predict precise wait durations (this http URL(t)), achieving efficient synchronization with asynchronous environments without exhaustive checking. Experiments in a simulated Kubernetes cluster show the agents calibrate their internal clocks to minimize both query overhead and execution latency, validating temporal awareness as a learnable capability that can evolve autonomously in open-ended environments.
Link: https://arxiv.org/abs/2512.16262
Authors: Yifei She,Ping Zhang,He Liu,Yanmin Jia,Yang Jing,Zijun Liu,Peng Sun,Xiangbin Li,Xiaohe Hu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Real-world agentic tasks, unlike synchronous Markov Decision Processes (MDPs), often involve non-blocking actions with variable latencies, creating a fundamental Temporal Gap between action initiation and completion. Existing environment-side solutions, such as blocking wrappers or frequent polling, either limit scalability or dilute the agent’s context window with redundant observations. In this work, we propose an Agent-side Approach that empowers Large Language Models (LLMs) to actively align their Cognitive Timeline with the physical world. By extending the Code-as-Action paradigm to the temporal domain, agents utilize semantic priors and In-Context Learning (ICL) to predict precise waiting durations (this http URL(t)), effectively synchronizing with asynchronous environments without exhaustive checking. Experiments in a simulated Kubernetes cluster demonstrate that agents can precisely calibrate their internal clocks to minimize both query overhead and execution latency, validating that temporal awareness is a learnable capability essential for autonomous evolution in open-ended environments.
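A minimal sketch of the agent-side "learn to wait" idea: instead of polling, the agent emits a single predicted sleep before checking on a non-blocking action. The llm_predict_wait_seconds interface and its toy priors are hypothetical stand-ins for the paper's LLM-based duration prediction.

```python
import time

def llm_predict_wait_seconds(action: str, context: dict) -> float:
    """Stand-in for the LLM's duration prediction (hypothetical interface).

    In the paper's setting, semantic priors and in-context examples let the
    model map an action like 'scale deployment' to an expected latency.
    """
    priors = {"create_pod": 5.0, "scale_deployment": 20.0, "rollout": 60.0}
    return priors.get(action, 10.0)

def run_non_blocking(action: str, start_fn, check_fn, context: dict) -> None:
    start_fn()                                              # fire the async action
    time.sleep(llm_predict_wait_seconds(action, context))   # the emitted wait(t)
    while not check_fn():                                   # fallback: cheap re-checks
        time.sleep(2.0)

# Toy usage: the action "completes" immediately, so one wait suffices.
done = {"flag": False}
run_non_blocking("create_pod",
                 start_fn=lambda: done.update(flag=True),
                 check_fn=lambda: done["flag"],
                 context={})
```

A well-calibrated first sleep is what removes both the polling overhead and the context-window dilution the abstract criticizes.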
[AI-59] AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
[Quick Read]: This paper addresses the weakness of current multimodal large language models (MLLMs) in multi-speaker, dialogue-centric settings, especially tasks demanding agentic reasoning such as tracking who speaks, maintaining roles, and grounding events across time, which are central to multimodal audio-video understanding applications like meeting analytics and conversational video assistants. The key to the solution is the AMUSE benchmark, built around inherently agentic tasks and evaluating models across three modes (zero-shot, guided, and agentic) and six task families including spatio-temporal speaker grounding and multimodal dialogue summarization, together with RAFT, a data-efficient agentic alignment framework whose core innovation is combining reward optimization with intrinsic multimodal self-evaluation as the reward signal and selective parameter adaptation for data- and parameter-efficient updates. Training with RAFT yields up to a 39.52% relative accuracy improvement on AMUSE.
Link: https://arxiv.org/abs/2512.16250
Authors: Sanjoy Chowdhury,Karren D. Yang,Xudong Liu,Fartash Faghri,Pavan Kumar Anasosalu Vasu,Oncel Tuzel,Dinesh Manocha,Chun-Liang Li,Raviteja Vemulapalli
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes zero-shot, guided, and agentic and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation as reward and selective parameter adaptation for data and parameter efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.
[AI-60] AlignMerge - Alignment-Preserving Large Language Model Merging via Fisher-Guided Geometric Constraints
[Quick Read]: This paper addresses the alignment-breaking problem in large language model (LLM) merging: conventional merging methods (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying safety and instruction alignment. The key to the solution is AlignMerge, which treats alignment as an explicit invariant and optimizes under Fisher-Rao geometric constraints: an alignment subspace (with projector P_A) is estimated in a local Fisher chart around the instruction-tuned base, and a joint objective combining geometric closeness (L_geo), a penalty on motion along alignment-sensitive directions (L_align), and a soft alignment budget (L_bud) is minimized, so the merge respects safety geometry by construction rather than by post-hoc validation. Across five model families, the method improves alignment metrics (AQI, toxicity, LLM-judge alignment) while matching or exceeding the best expert on instruction-following, reasoning, and helpfulness.
Link: https://arxiv.org/abs/2512.16245
Authors: Aniruddha Roy,Jyoti Patel,Aman Chadha,Vinija Jain,Amitava Das
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Merging large language models (LLMs) is a practical way to compose capabilities from multiple fine-tuned checkpoints without retraining. Yet standard schemes (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying alignment. We argue that merging is not a numerical trick but a geometry-constrained operation around an already-aligned anchor: fusion must be steered to respect safety geometry, not validated post hoc. We introduce AlignMerge, a geometry-aware merging framework that makes alignment an explicit invariant. In a local Fisher chart around an instruction-tuned base, we estimate an alignment subspace with projector P_A and optimize: L_AlignMerge = L_geo + lambda_align * L_align + lambda_bud * L_bud, where L_geo keeps the merge close to its experts in Fisher-Rao geometry, L_align penalizes motion along alignment-sensitive directions, and L_bud enforces a soft alignment budget. As the alignment functional we use the decoding-invariant Alignment Quality Index (AQI), a latent-space criterion that captures how cleanly aligned and misaligned behaviors separate in representation space. Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), merging safety anchors with task experts, AlignMerge improves alignment metrics (AQI, toxicity, LLM-judge alignment) while matching or exceeding the best expert on instruction-following, reasoning, and helpfulness. It also exhibits smaller alignment-subspace drift and fewer budget violations than Fisher soups, TIES, SafeMerge, and MergeAlign. These results make alignment-preserving merging a first-class design goal and suggest a path to geometry-aware composition of future foundation models.
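Reading the abstract's objective literally, one plausible rendering over flattened weights looks as follows; the exact functional forms of L_geo, L_align, and L_bud, the Fisher estimation, and the alignment-subspace construction are all assumptions here, with random placeholders standing in for real checkpoints.

```python
import numpy as np

def alignmerge_loss(theta, experts, P_A, theta_base, fisher,
                    lam_align=10.0, lam_bud=1.0, budget=0.1):
    """Sketch of L_geo + lam_align * L_align + lam_bud * L_bud (flat weights)."""
    # L_geo: Fisher-weighted closeness to each expert.
    l_geo = sum(float((fisher * (theta - e) ** 2).sum()) for e in experts)
    # L_align: motion from the aligned base along alignment-sensitive directions.
    drift = P_A @ (theta - theta_base)
    l_align = float(drift @ drift)
    # L_bud: soft budget on that drift (hinge penalty).
    l_bud = max(0.0, np.linalg.norm(drift) - budget) ** 2
    return l_geo + lam_align * l_align + lam_bud * l_bud

rng = np.random.default_rng(7)
d = 64
theta_base = rng.normal(size=d)                      # aligned anchor
experts = [theta_base + 0.1 * rng.normal(size=d) for _ in range(2)]
fisher = rng.random(d) + 1e-3                        # diagonal Fisher estimate
Q, _ = np.linalg.qr(rng.normal(size=(d, 4)))
P_A = Q @ Q.T                                        # rank-4 alignment projector
theta = np.mean(experts, axis=0)                     # start from a plain weight soup
print(alignmerge_loss(theta, experts, P_A, theta_base, fisher))
```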
[AI-61] Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models (AAAI 2026)
[Quick Read]: This paper addresses open-set classification for graph neural networks (GNNs) in open-world settings: accurately classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples and, beyond mere detection, classifying them at a finer granularity rather than lumping them into a single class. Existing methods treat all OOD samples as one class, which is inadequate in high-stakes applications such as fraud detection and medical diagnosis where the probable labels of OOD samples matter. The key to the solution is the Coarse-to-Fine open-set Classification (CFC) framework with three stages: first, large language models (LLMs) are prompted to detect semantically genuine OOD samples and generate their probable labels (coarse classification); second, a GNN-based fine classifier is trained with those OOD samples to sharpen ID classification and OOD detection; third, LLM prompting with post-processed OOD labels refines OOD classification. Unlike methods relying on synthetic or auxiliary OOD samples, CFC uses genuinely semantic OOD instances, improving interpretability and practical utility; it improves OOD detection by 10% over state-of-the-art methods on graph and text domains and reaches up to 70% OOD classification accuracy on graph datasets.
Link: https://arxiv.org/abs/2512.16244
Authors: Xueqi Ma,Xingjun Ma,Sarah Monazam Erfani,Danilo Mandic,James Bailey
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026
Abstract:Developing open-set classification methods capable of classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples is essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, despite real-world applications, especially high-stake settings such as fraud detection and medical diagnosis, demanding deeper insights into OOD samples, including their probable labels. This raises a critical question: can OOD detection be extended to OOD classification without true label information? To address this question, we propose a Coarse-to-Fine open-set Classification (CFC) framework that leverages large language models (LLMs) for graph datasets. CFC consists of three key components: a coarse classifier that uses LLM prompts for OOD detection and outlier label generation, a GNN-based fine classifier trained with OOD samples identified by the coarse classifier for enhanced OOD detection and ID classification, and refined OOD classification achieved through LLM prompts and post-processed OOD labels. Unlike methods that rely on synthetic or auxiliary OOD samples, CFC employs semantic OOD instances that are genuinely out-of-distribution based on their inherent meaning, improving interpretability and practical utility. Experimental results show that CFC improves OOD detection by ten percent over state-of-the-art methods on graph and text domains and achieves up to seventy percent accuracy in OOD classification on graph datasets.
[AI-62] Scaling Spatial Reasoning in MLLM s through Programmatic Data Synthesis
[Quick Read]: This paper addresses the bottleneck that limited spatial understanding and reasoning imposes on embodied intelligence. Existing efforts enhance vision-language models (VLMs) but face a dataset dilemma: template-based data is scalable yet structurally rigid, while manual annotation is linguistically diverse yet unscalable and computationally imprecise. The key to the solution is SPRITE, a framework in which simulators and large language models (LLMs) jointly synthesize scalable, diverse, and verifiable spatial-reasoning data; its core innovation is reframing ground-truth generation as a code-generation task: the LLM compiles complex spatial questions into executable programs that are verified against high-precision scene meta-information extracted from simulators, ensuring computationally precise, verifiable labels alongside linguistic diversity at scale.
Link: https://arxiv.org/abs/2512.16237
Authors: Zhi Helu,Huang Jingjing,Xu Wang,Xu Yangbin,Zhang Wanyue,Jiang Baoyang,Deng Shirui,Zhu Liang,Li Fangfang,Zhao Tiejun,Lin Yankai,Yao Yuan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.
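The "ground truth as code generation" loop can be made concrete with a toy scene: a program (hand-written here as a stand-in for LLM output) is executed against simulator-style scene metadata to produce a verifiable label. The scene schema and question are invented for illustration.

```python
# Scene meta-information as a simulator might export it (toy example).
scene = {"objects": [{"name": "chair", "pos": (1.0, 0.0)},
                     {"name": "table", "pos": (4.0, 0.0)},
                     {"name": "lamp",  "pos": (1.5, 2.0)}]}

question = "Which object is closest to the chair?"

# Program an LLM might emit for the question above (written by hand here).
generated_program = """
from math import dist

def answer(scene):
    chair = next(o for o in scene["objects"] if o["name"] == "chair")
    others = [o for o in scene["objects"] if o["name"] != "chair"]
    return min(others, key=lambda o: dist(o["pos"], chair["pos"]))["name"]
"""

namespace = {}
exec(generated_program, namespace)          # compile the question into code
ground_truth = namespace["answer"](scene)   # verify against precise metadata
print(ground_truth)                          # "lamp"
```

Because the label comes from executing code over exact coordinates rather than from a template or an annotator's judgment, it is both precise and auditable.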
[AI-63] he Evolution of Reranking Models in Information Retrieval: From Heuristic Methods to Large Language Models
[Quick Read]: This paper addresses how reranking can improve the relevance of final results in information retrieval (IR) systems, especially within modern Retrieval Augmented Generation (RAG) pipelines where the quality of retrieved documents notably shapes the generated output. The key to the solution is a systematic survey and comparison of the evolution of reranking models, spanning traditional methods through advanced neural architectures (cross-encoders, sequence-generation models such as T5, and graph neural networks), efficiency techniques such as knowledge distillation, and the emerging use of large language models (LLMs) for reranking via prompting strategies and fine-tuning. The survey emphasizes the trade-offs among performance, computational cost, and practical deployment, giving researchers and engineers a structured basis for choosing among approaches.
Link: https://arxiv.org/abs/2512.16236
Authors: Tejul Pandit,Sakshi Mahendru,Meet Raval,Dhvani Upadhyay
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 1 figure, Accepted in CLNLP’25
Abstract:Reranking is a critical stage in contemporary information retrieval (IR) systems, improving the relevance of the user-presented final results by honing initial candidate sets. This paper is a thorough guide to examine the changing reranker landscape and offer a clear view of the advancements made in reranking methods. We present a comprehensive survey of reranking models employed in IR, particularly within modern Retrieval Augmented Generation (RAG) pipelines, where retrieved documents notably influence output quality. We embark on a chronological journey through the historical trajectory of reranking techniques, starting with foundational approaches, before exploring the wide range of sophisticated neural network architectures such as cross-encoders, sequence-generation models like T5, and Graph Neural Networks (GNNs) utilized for structural information. Recognizing the computational cost of advancing neural rerankers, we analyze techniques for enhancing efficiency, notably knowledge distillation for creating competitive, lighter alternatives. Furthermore, we map the emerging territory of integrating Large Language Models (LLMs) in reranking, examining novel prompting strategies and fine-tuning tactics. This survey seeks to elucidate the fundamental ideas, relative effectiveness, computational features, and real-world trade-offs of various reranking strategies. The survey provides a structured synthesis of the diverse reranking paradigms, highlighting their underlying principles and comparative strengths and weaknesses.
[AI-64] Neural emulation of gravity-driven geohazard runout
[Quick Read]: This paper addresses the difficulty of predicting the runout of gravity-driven geohazards (such as landslides and avalanches), where downstream communities face sudden severe impacts and existing numerical models suffer a fundamental speed-realism trade-off that prevents them from meeting the real-time and accuracy demands of large-scale early-warning systems. The key to the solution is neural emulation: a model trained on over 100,000 numerical simulations predicts flow extent and deposit thickness with high accuracy at 100 to 10,000 times the speed of conventional numerical solvers across flow types, sizes, and terrains, while reproducing key physical behaviours such as avulsion and deposition patterns, enabling spatially resolved prediction across diverse real-world scenarios and opening a new path for disaster risk reduction and impact-based forecasting.
Link: https://arxiv.org/abs/2512.16221
Authors: Lorenzo Nava,Ye Chen,Maximillian Van Wyk de Vries
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures
Abstract:Predicting geohazard runout is critical for protecting lives, infrastructure and ecosystems. Rapid mass flows, including landslides and avalanches, cause several thousand deaths across a wide range of environments, often travelling many kilometres from their source. The wide range of source conditions and material properties governing these flows makes their runout difficult to anticipate, particularly for downstream communities that may be suddenly exposed to severe impacts. Accurately predicting runout at scale requires models that are both physically realistic and computationally efficient, yet existing approaches face a fundamental speed-realism trade-off. Here we train a machine learning model to predict geohazard runout across representative real world terrains. The model predicts both flow extent and deposit thickness with high accuracy and 100 to 10,000 times faster computation than numerical solvers. It is trained on over 100,000 numerical simulations across over 10,000 real world digital elevation model chips and reproduces key physical behaviours, including avulsion and deposition patterns, while generalizing across different flow types, sizes and landscapes. Our results demonstrate that neural emulation enables rapid, spatially resolved runout prediction across diverse real world terrains, opening new opportunities for disaster risk reduction and impact-based forecasting. These results highlight neural emulation as a promising pathway for extending physically realistic geohazard modelling to spatial and temporal scales relevant for large scale early warning systems.
[AI-65] PDE-Agent: A toolchain-augmented multi-agent framework for PDE solving
Quick read: This paper addresses two problems: traditional PDE-solving workflows depend on manual setup and domain expertise with little automation, and existing Physics-Informed Neural Network (PINN) approaches still require human intervention and lack full autonomy. The key to the solution is PDE-Agent, a toolchain-augmented multi-agent collaboration framework in which LLM-driven agents automate the entire pipeline from natural-language description to PDE solution. Its core innovations are (1) a Prog-Act framework with graph memory that supports dynamic planning and dual-loop error correction (localized fixes and global revisions) during multi-agent collaboration, and (2) a Resource-Pool with a tool-parameter separation mechanism that centralizes the management of runtime artifacts and resolves dependency conflicts across tools, substantially improving applicability and performance on complex multi-step tasks with cross-step dependencies.
Link: https://arxiv.org/abs/2512.16214
Authors: Jianming Liu, Ren Zhu, Jian Xu, Kun Ding, Xu-Yao Zhang, Gaofeng Meng, Cheng-Lin Liu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Solving Partial Differential Equations (PDEs) is a cornerstone of engineering and scientific research. Traditional methods for PDE solving are cumbersome, relying on manual setup and domain expertise. While Physics-Informed Neural Networks (PINNs) introduced end-to-end neural network-based solutions, and frameworks like DeepXDE further enhanced automation, these approaches still depend on expert knowledge and lack full autonomy. In this work, we frame PDE solving as tool invocation via LLM-driven agents and introduce PDE-Agent, the first toolchain-augmented multi-agent collaboration framework, inheriting the reasoning capacity of LLMs and the controllability of external tools and enabling automated PDE solving from natural language descriptions. PDE-Agent leverages the strengths of multi-agent and multi-tool collaboration through two key innovations: (1) A Prog-Act framework with graph memory for multi-agent collaboration, which enables effective dynamic planning and error correction via dual-loop mechanisms (localized fixes and global revisions). (2) A Resource-Pool integrated with a tool-parameter separation mechanism for multi-tool collaboration. This centralizes the management of runtime artifacts and resolves inter-tool dependency gaps in existing frameworks. To validate and evaluate this new paradigm for PDE solving, we develop PDE-Bench, a multi-type PDE benchmark for agent-based tool collaborative solving, and propose multi-level metrics for assessing tool coordination. Evaluations verify that PDE-Agent exhibits superior applicability and performance in complex multi-step, cross-step dependent tasks. This new paradigm of toolchain-augmented multi-agent PDE solving will further advance future developments in automated scientific computing. Our source code and dataset will be made publicly available.
[AI-66] Weighted K-Harmonic Means Clustering: Convergence Analysis and Applications to Wireless Communications
Quick read: This paper addresses joint base-station placement and user-association optimization in wireless networks, where the core challenge is guaranteeing minimum signal strength while keeping load fair across nodes. The key to the solution is the weighted K-harmonic means (WKHM) clustering algorithm, which achieves numerically stable soft assignments through inverse-distance weighting; its weights admit a direct interpretation as fractional user association based on received signal strength, and the paper provides the first rigorous stochastic convergence guarantees (convergence in probability and almost-sure convergence) for harmonic-mean-based clustering, yielding clear gains over classical and modern clustering baselines under diverse user distributions.
Link: https://arxiv.org/abs/2512.16185
Authors: Gourab Ghatak
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose the weighted K-harmonic means (WKHM) clustering algorithm, a regularized variant of K-harmonic means designed to ensure numerical stability while enabling soft assignments through inverse-distance weighting. Unlike classical K-means and constrained K-means, WKHM admits a direct interpretation in wireless networks: its weights are exactly equivalent to fractional user association based on received signal strength. We establish rigorous convergence guarantees under both deterministic and stochastic settings, addressing key technical challenges arising from non-convexity and random initialization. Specifically, we prove monotone descent to a local minimum under fixed initialization, convergence in probability under Binomial Point Process (BPP) initialization, and almost sure convergence under mild decay conditions. These results provide the first stochastic convergence guarantees for harmonic-mean-based clustering. Finally, through extensive simulations with diverse user distributions, we show that WKHM achieves a superior tradeoff between minimum signal strength and load fairness compared to classical and modern clustering baselines, making it a principled tool for joint radio node placement and user association in wireless networks.
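To make the update rule concrete, here is a minimal Python sketch of one K-harmonic-means-style iteration with inverse-distance soft memberships and a small eps regularizer standing in for the paper's stabilization term. The exponent p, the regularizer, and the exact weighting scheme are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def wkhm_step(X, C, p=3.5, eps=1e-9):
    """One center update of (weighted) K-harmonic means.

    X: (n, d) points (e.g., user locations); C: (k, d) centers
    (e.g., radio nodes). eps keeps inverse distances finite.
    """
    D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1) + eps  # (n, k)
    # Soft membership: m_ik proportional to d_ik^(-p-2), normalized over k.
    M = D ** (-p - 2)
    M /= M.sum(axis=1, keepdims=True)
    # Per-point weight from the classical KHM derivation.
    W = (D ** (-p - 2)).sum(axis=1) / (D ** (-p)).sum(axis=1) ** 2
    A = M * W[:, None]                           # combined assignment weights
    C_new = (A.T @ X) / A.sum(axis=0)[:, None]   # weighted center update
    return C_new, M

# Usage: iterate wkhm_step until centers stabilize; M then plays the role
# of a fractional user-to-node association, analogous to signal-strength shares.
```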
[AI-67] Ev-Trust: A Strategy Equilibrium Trust Mechanism for Evolutionary Games in LLM-Based Multi-Agent Services
Quick read: This paper addresses the trust crisis created by the openness and heterogeneity of LLM-based multi-agent systems, where risks such as deception, fraud, and misinformation severely threaten trustworthiness and robustness. The key to the solution is Ev-Trust, a strategy-equilibrium trust mechanism grounded in evolutionary game theory that combines direct trust, indirect trust, and expected revenue into a dynamic feedback structure steering agent behaviour toward evolutionarily stable equilibria. Within a decentralized "request-response-payment-evaluation" service framework, Ev-Trust lets agents adapt their strategies, naturally excluding malicious participants while reinforcing high-quality collaboration; a theoretical analysis based on replicator dynamics proves the existence and stability of local evolutionary equilibria, and experiments confirm gains in trustworthiness, reductions in the share of malicious strategies, and increases in collective revenue.
Link: https://arxiv.org/abs/2512.16167
Authors: Shiduo Yang, Jiye Wang, Jiayu Qin, Jianbin Li, Yu Wang, Yuanhe Zhao, Kenan Guo
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 12 pages, 11 figures
Abstract:The rapid evolution of the Web toward an agent-centric paradigm, driven by large language models (LLMs), has enabled autonomous agents to reason, plan, and interact in complex decentralized environments. However, the openness and heterogeneity of LLM-based multi-agent systems also amplify the risks of deception, fraud, and misinformation, posing severe challenges to trust establishment and system robustness. To address this issue, we propose Ev-Trust, a strategy-equilibrium trust mechanism grounded in evolutionary game theory. This mechanism integrates direct trust, indirect trust, and expected revenue into a dynamic feedback structure that guides agents’ behavioral evolution toward equilibria. Within a decentralized “Request-Response-Payment-Evaluation” service framework, Ev-Trust enables agents to adaptively adjust strategies, naturally excluding malicious participants while reinforcing high-quality collaboration. Furthermore, our theoretical derivation based on replicator dynamics equations proves the existence and stability of local evolutionary equilibria. Experimental results indicate that our approach effectively reflects agent trustworthiness in LLM-driven open service interaction scenarios, reduces malicious strategies, and increases collective revenue. We hope Ev-Trust can provide a new perspective on trust modeling for the agentic service web in group evolutionary game scenarios.
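Since the theory rests on replicator dynamics, a small numerical sketch may help. The Python snippet below integrates the standard replicator equation x_i' = x_i (f_i - f_bar) for a hypothetical two-strategy (honest vs. malicious) game; the payoff matrix values are invented for illustration and are not taken from the paper.

```python
import numpy as np

def replicator_step(x, A, dt=0.01):
    """Euler step of the replicator dynamics x_i' = x_i * (f_i - f_bar),
    where f = A @ x are the strategy payoffs in the current population."""
    f = A @ x            # expected payoff of each strategy
    f_bar = x @ f        # population-average payoff
    x = x + dt * x * (f - f_bar)
    return x / x.sum()   # renormalize to stay on the simplex

# Hypothetical payoffs: row/col 0 = honest, 1 = malicious. Trust-weighted
# payments are assumed to make honesty strictly more profitable here.
A = np.array([[3.0, 1.0],
              [2.0, 0.5]])
x = np.array([0.5, 0.5])
for _ in range(5000):
    x = replicator_step(x, A)
print(x)  # the population share converges toward the stable honest equilibrium
```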
[AI-68] ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs
Quick read: This paper addresses the prohibitive cost of synthesizing tool-use training data for large language models (LLMs), which typically relies on tens of thousands of real API calls while still lacking multi-hop reasoning and self-reflection. The key to the solution is the ToolForge framework, which constructs a small number of virtual tools and synthesizes large-scale tool-learning data for multi-hop search from (question, golden context, answer) triples, enriching the data with multi-hop reasoning and self-reflection mechanisms and ensuring fidelity through a multi-layer validation framework that combines rule-based and model-based checks. This markedly lowers training cost while improving performance: an 8B-parameter model trained on the synthesized data outperforms GPT-4o on multiple benchmarks.
Link: https://arxiv.org/abs/2512.16149
Authors: Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, Zhuofeng Zhao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 tables, 6 figures. Code available at this https URL
Abstract:Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at this https URL .
[AI-69] INTELLECT-3: Technical Report
Quick read: This paper addresses the challenge of reaching frontier performance on complex tasks such as mathematics, code, and scientific reasoning when model scale is constrained. The key to the solution is a complete end-to-end reinforcement learning (RL) infrastructure stack: the scalable asynchronous RL framework prime-rl (with first-class support for multi-turn interaction and tool use), a supervised fine-tuning (SFT) plus RL training recipe built on GLM-4.5-Air-Base, and a broad collection of environments driven by the verifiers library. By scaling RL training efficiently to 512 H200 GPUs, INTELLECT-3 (106B parameters, 12B active) achieves state-of-the-art performance for its size and surpasses many larger frontier models.
Link: https://arxiv.org/abs/2512.16144
Authors: Prime Intellect Team, Mika Senghaas, Fares Obeid, Sami Jaghouar, William Brown, Jack Min Ong, Daniel Auras, Matej Sirovatka, Jannik Straube, Andrew Baker, Sebastian Müller, Justus Mattern, Manveer Basra, Aiman Ismail, Dominik Scherm, Cooper Miller, Ameen Patel, Simon Kirsten, Mario Sieg, Christian Reetz, Kemal Erdem, Vincent Weisser, Johannes Hagemann
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 10 figures
Abstract:We present INTELLECT-3, a 106B-parameter Mixture-of-Experts model (12B active) trained with large-scale reinforcement learning on our end-to-end RL infrastructure stack. INTELLECT-3 achieves state of the art performance for its size across math, code, science and reasoning benchmarks, outperforming many larger frontier models. We open-source the model together with the full infrastructure stack used to create it, including RL frameworks, complete recipe, and a wide collection of environments, built with the verifiers library, for training and evaluation from our Environments Hub community platform. Built for this effort, we introduce prime-rl, an open framework for large-scale asynchronous reinforcement learning, which scales seamlessly from a single node to thousands of GPUs, and is tailored for agentic RL with first-class support for multi-turn interactions and tool use. Using this stack, we run both SFT and RL training on top of the GLM-4.5-Air-Base model, scaling RL training up to 512 H200s with high training efficiency.
[AI-70] WeMusic-Agent: Efficient Conversational Music Recommendation via Knowledge Internalization and Agentic Boundary Learning
Quick read: This paper addresses how to balance deep domain expertise with flexible tool invocation in personalized conversational music recommendation, a balance existing methods struggle to strike. The key to the solution is the WeMusic-Agent training framework, whose knowledge internalization and agentic boundary learning teach the model when to rely on internalized music knowledge and when to call specialized external tools (such as music retrieval APIs or recommender systems). Under this framework, the authors build WeMusic-Agent-M1, an agentic model with continued pretraining, achieving more accurate and personalized recommendations on real-world data.
Link: https://arxiv.org/abs/2512.16108
Authors: Wendong Bi, Yirong Mao, Xianglong Liu, Kai Tian, Jian Zhang, Hanjie Wang, Wenhui Que
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Personalized music recommendation in conversational scenarios usually requires a deep understanding of user preferences and nuanced musical context, yet existing methods often struggle with balancing specialized domain knowledge and flexible tool integration. This paper proposes WeMusic-Agent, a training framework for efficient LLM-based conversational music recommendation. By integrating the knowledge internalization and agentic boundary learning, the framework aims to teach the model to intelligently decide when to leverage internalized knowledge and when to call specialized tools (e.g., music retrieval APIs, music recommendation systems). Under this framework, we present WeMusic-Agent-M1, an agentic model that internalizes extensive musical knowledge via continued pretraining on 50B music-related corpus while acquiring the ability to invoke external tools when necessary. Additionally, considering the lack of open-source benchmarks for conversational music recommendation, we also construct a benchmark for personalized music recommendations derived from real-world data in WeChat Listen. This benchmark enables comprehensive evaluation across multiple dimensions, including relevance, personalization, and diversity of the recommendations. Experiments on real-world data demonstrate that WeMusic-Agent achieves significant improvements over existing models.
[AI-71] ModelTables: A Corpus of Tables about Models
Quick read: This paper addresses the problem that structured data, such as performance and configuration tables, is overlooked by text-only retrieval when managing knowledge about AI models, limiting precise access to how models are characterized. The core challenge lies in modeling and retrieving tables in Model Lakes: these tables are small yet densely interrelated, reflecting the tightly coupled evolution of models and benchmarks. The key to the solution is ModelTables, the first large-scale structured benchmark of its kind, covering 60K models and 90K tables, with multi-source ground truth built from three complementary signals (paper citation links, explicit model-card links and inheritance, and shared training datasets). It enables systematic evaluation of table search methods such as union-based semantic retrieval, dense retrieval, and metadata hybrid retrieval, shows clear room for improvement in existing methods, and lays a foundation for more accurate semantic retrieval, structured comparison, and principled organization of model knowledge.
Link: https://arxiv.org/abs/2512.16106
Authors: Zhengyuan Dong, Victor Zhong, Renée J. Miller
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 14 pages, 8 figures and 8 tables
Abstract:We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables often overlooked by text-only retrieval. The corpus is built from Hugging Face model cards, GitHub READMEs, and referenced papers, linking each table to its surrounding model and publication context. Compared with open data lake tables, model tables are smaller yet exhibit denser inter-table relationships, reflecting tightly coupled model and benchmark evolution. The current release covers over 60K models and 90K tables. To evaluate model and table relatedness, we construct a multi-source ground truth using three complementary signals: (1) paper citation links, (2) explicit model card links and inheritance, and (3) shared training datasets. We present one extensive empirical use case for the benchmark, which is table search. We compare canonical Data Lake search operators (unionable, joinable, keyword) and Information Retrieval baselines (dense, sparse, hybrid retrieval) on this benchmark. Union-based semantic table retrieval attains 54.8% P@1 overall (54.6% on citation, 31.3% on inheritance, 30.6% on shared dataset signals); table-based dense retrieval reaches 66.5% P@1, and metadata hybrid retrieval achieves 54.1%. This evaluation indicates clear room for developing better table search methods. By releasing ModelTables and its creation protocol, we provide the first large-scale benchmark of structured data describing AI models. Our use case of table discovery in Model Lakes provides intuition and evidence for developing more accurate semantic retrieval, structured comparison, and principled organization of structured model knowledge. Source code, data, and other artifacts have been made available at this https URL.
[AI-72] AIMM: An AI-Driven Multimodal Framework for Detecting Social-Media-Influenced Stock Market Manipulation
Quick read: This paper addresses the difficulty of promptly detecting market manipulation driven by coordinated social-media campaigns, where retail investors, regulators, and brokerages lack effective tools to connect online narratives with anomalous market behaviour. The key to the solution is AIMM, an AI-driven market-manipulation monitoring framework whose core components are: the AIMM Ground Truth dataset (AIMM-GT) of 33 labeled ticker-days, fusing Reddit activity, bot and coordination indicators, and OHLCV market features into a daily per-ticker manipulation risk score; a parquet-based streaming pipeline with a Streamlit dashboard for interpretable analysis; and prospective evaluation showing that the model flagged GameStop (GME) 22 days before the January 2021 squeeze peak, demonstrating preliminary capability for early detection of socially driven manipulation.
Link: https://arxiv.org/abs/2512.16103
Authors: Sandeep Neela
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Market manipulation now routinely originates from coordinated social media campaigns, not isolated trades. Retail investors, regulators, and brokerages need tools that connect online narratives and coordination patterns to market behavior. We present AIMM, an AI-driven framework that fuses Reddit activity, bot and coordination indicators, and OHLCV market features into a daily AIMM Manipulation Risk Score for each ticker. The system uses a parquet-native pipeline with a Streamlit dashboard that allows analysts to explore suspicious windows, inspect underlying posts and price action, and log model outputs over time. Due to Reddit API restrictions, we employ calibrated synthetic social features matching documented event characteristics; market data (OHLCV) uses real historical data from Yahoo Finance. This release makes three contributions. First, we build the AIMM Ground Truth dataset (AIMM-GT): 33 labeled ticker-days spanning eight equities, drawing from SEC enforcement actions, community-verified manipulation cases, and matched normal controls. Second, we implement forward-walk evaluation and prospective prediction logging for both retrospective and deployment-style assessment. Third, we analyze lead times and show that AIMM flagged GME 22 days before the January 2021 squeeze peak. The current labeled set is small (33 ticker-days, 3 positive events), but results show preliminary discriminative capability and early warnings for the GME incident. We release the code, dataset schema, and dashboard design to support research on social media-driven market surveillance.
[AI-73] Scaling Text2SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers
Quick read: This paper addresses the sharp performance degradation of modern Text2SQL systems on real-world, large-scale databases, such as those in the Spider 2.0 benchmark with hundreds of tables and tens of thousands of columns, caused by model context-length limits. Existing mitigations either rely on costly multi-step prompting pipelines or rank columns only by their individual relevance to the user question, ignoring structural relationships among columns. The key to the solution is an open-source, LLM-efficient schema-filtering framework that (i) ranks columns with a query-aware LLM encoder enriched with column values and metadata, (ii) reranks interconnected columns under functional dependencies with a lightweight graph transformer, and (iii) selects a connectivity-preserving sub-schema with a Steiner-tree heuristic, achieving near-perfect recall with higher precision while scaling to schemas with more than 23,000 columns at sub-second latency.
Link: https://arxiv.org/abs/2512.16083
Authors: Thanh Dat Hoang, Thanh Tam Nguyen, Thanh Trung Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Most modern Text2SQL systems prompt large language models (LLMs) with entire schemas – mostly column information – alongside the user's question. While effective on small databases, this approach fails on real-world schemas that exceed LLM context limits, even for commercial models. The recent Spider 2.0 benchmark exemplifies this with hundreds of tables and tens of thousands of columns, where existing systems often break. Current mitigations either rely on costly multi-step prompting pipelines or filter columns by ranking them against the user's question independently, ignoring inter-column structure. To scale existing systems, we introduce an open-source, LLM-efficient schema filtering framework that compacts Text2SQL prompts by (i) ranking columns with a query-aware LLM encoder enriched with values and metadata, (ii) reranking inter-connected columns via a lightweight graph transformer over functional dependencies, and (iii) selecting a connectivity-preserving sub-schema with a Steiner-tree heuristic. Experiments on real datasets show that the framework achieves near-perfect recall and higher precision than CodeS, SchemaExP, Qwen rerankers, and embedding retrievers, while maintaining sub-second median latency and scaling to schemas with 23,000+ columns. Our source code is available at this https URL.
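The final selection stage is easy to picture with a toy example. The Python sketch below uses networkx's approximate Steiner-tree routine to keep the top-ranked columns and add the minimal bridging columns so the sub-schema remains joinable; the dependency graph, column names, and scores are all hypothetical.

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def select_subschema(fd_graph, column_scores, top_k=8):
    """Pick the top-ranked columns as terminals, then connect them with
    the fewest extra columns via a Steiner-tree heuristic."""
    terminals = sorted(column_scores, key=column_scores.get, reverse=True)[:top_k]
    tree = steiner_tree(fd_graph, terminals)   # approximate Steiner tree
    return set(tree.nodes)

# Hypothetical functional-dependency/join graph over qualified column names.
G = nx.Graph()
G.add_edges_from([
    ("orders.id", "orders.user_id"), ("orders.user_id", "users.id"),
    ("users.id", "users.name"), ("orders.id", "items.order_id"),
])
scores = {"users.name": 0.9, "items.order_id": 0.8, "orders.id": 0.3,
          "orders.user_id": 0.2, "users.id": 0.1}
print(select_subschema(G, scores, top_k=2))
# Output also contains the bridging columns needed to join the two terminals.
```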
[AI-74] Evaluation of Generative Models for Emotional 3D Animation Generation in VR
Quick read: This paper addresses the insufficient evaluation of emotional realism in current generative 3D nonverbal-animation models: 2D statistical metrics fail to reflect the emotions users actually perceive. The key to the solution is a user-centred evaluation framework in virtual reality (VR): a study (N=48) quantifies perceived emotional arousal realism, naturalness, enjoyment, diversity, and interaction quality, comparing three state-of-the-art speech-driven 3D animation methods against a reconstruction-based baseline of real human expressions. Results show that methods explicitly modeling emotion achieve higher emotion-recognition accuracy than purely speech-synchrony-focused ones, and users rate animations of high-arousal emotions (such as happiness) more highly; yet all models lag clearly behind real human expressions in facial detail and interaction quality, underscoring the importance of user feedback for making generative models emotionally lifelike.
Link: https://arxiv.org/abs/2512.16081
Authors: Kiran Chhatre, Renan Guarese, Andrii Matviienko, Christopher Peters
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 20 pages, 5 figures. Webpage: this https URL
Abstract:Social interactions incorporate nonverbal signals to convey emotions alongside speech, including facial expressions and body gestures. Generative models have demonstrated promising results in creating full-body nonverbal animations synchronized with speech; however, evaluations using statistical metrics in 2D settings fail to fully capture user-perceived emotions, limiting our understanding of model effectiveness. To address this, we evaluate emotional 3D animation generative models within a Virtual Reality (VR) environment, emphasizing user-centric metrics (emotional arousal realism, naturalness, enjoyment, diversity, and interaction quality) in a real-time human-agent interaction scenario. Through a user study (N=48), we examine perceived emotional quality for three state-of-the-art speech-driven 3D animation methods across two emotions: happiness (high arousal) and neutral (mid arousal). Additionally, we compare these generative models against real human expressions obtained via a reconstruction-based method to assess both their strengths and limitations and how closely they replicate real human facial and body expressions. Our results demonstrate that methods explicitly modeling emotions lead to higher recognition accuracy compared to those focusing solely on speech-driven synchrony. Users rated the realism and naturalness of happy animations significantly higher than those of neutral animations, highlighting the limitations of current generative models in handling subtle emotional states. Generative models underperformed compared to reconstruction-based methods in facial expression quality, and all methods received relatively low ratings for animation enjoyment and interaction quality, emphasizing the importance of incorporating user-centric evaluations into generative model development. Finally, participants positively recognized animation diversity across all generative models.
[AI-75] Feasibility of Radio Frequency Based Wireless Sensing of Lead Contamination in Soil
Quick read: This paper addresses the high cost and low efficiency of detecting lead (Pb) contamination in urban soils, where traditional methods are generally labor-intensive and expensive. The key to the solution is SoilScanner, a radio-frequency (RF) wireless sensing system built on the observation that different salts (such as NaCl and Pb(NO3)2) affect radio propagation differently across frequency bands, so lead content can be identified from signal reflection patterns. Experiments show the system classifies soil samples against a 200 ppm lead threshold with 72% accuracy, with no sample above 500 ppm misclassified, validating the feasibility of portable, low-cost wireless lead screening devices.
Link: https://arxiv.org/abs/2512.16071
Authors: Yixuan Gao, Tanvir Ahmed, Mikhail Mohammed, Zhongqi Cheng, Rajalakshmi Nandakumar
Affiliation: Unknown
Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 12 pages, 12 figures, International Conference on Embedded Wireless Systems and Networks, this https URL, Best Paper Award of EWSN 2024
Abstract:Widespread Pb (lead) contamination of urban soil significantly impacts food safety and public health and hinders city greening efforts. However, most existing technologies for measuring Pb are labor-intensive and costly. In this study, we propose SoilScanner, a radio frequency-based wireless system that can detect Pb in soils. This is based on our discovery that the propagation of different frequency band radio signals is affected differently by different salts such as NaCl and Pb(NO3)2 in the soil. In a controlled experiment, manually adding NaCl and Pb(NO3)2 to clean soil, we demonstrated that different salts reflected signals at different frequencies in distinct patterns. In addition, we confirmed the finding using uncontrolled field samples with a machine learning model. Our experiment results show that SoilScanner can classify soil samples into low-Pb and high-Pb categories (threshold at 200 ppm) with an accuracy of 72%, with no sample with more than 500 ppm of Pb being misclassified. The results of this study show that it is feasible to build portable and affordable Pb detection and screening devices based on wireless technology.
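As a rough illustration of the classification step, the Python sketch below trains a random-forest classifier on synthetic multi-band reflection features. The real system's features come from measured RF reflections, so the data generated here is a stand-in only, and the feature count and model choice are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in: per-sample reflection magnitudes in six RF bands.
X = rng.normal(size=(n, 6))
# Synthetic label: "high-Pb" when two bands shift together (plus noise),
# mimicking salt-dependent reflection patterns.
y = (X[:, 1] + 0.8 * X[:, 4] + rng.normal(scale=0.7, size=n) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # cross-validated accuracy
```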
[AI-76] A Multi-Agent Large Language Model Framework for Automated Qualitative Analysis
Quick read: This paper addresses a bottleneck in studying patient experiences in chronic-disease management: traditional qualitative thematic analysis is labor-intensive, subjective, and hard to scale. The key to the solution is the Collaborative Theme Identification Agent (CoTI), a multi-agent large language model framework in which three specialized agents (Instructor, Thematizer, CodebookGenerator) collaborate to automate qualitative thematic analysis. Validated on interviews with 12 heart-failure patients, CoTI produced themes closer to a senior investigator's than those of junior investigators or baseline NLP models, and a user-facing application supports interactive deployment, improving efficiency and consistency.
Link: https://arxiv.org/abs/2512.16063
Authors: Qidi Xu, Nuzha Amjad, Grace Giles, Alexa Cumming, De'angelo Hermesky, Alexander Wen, Min Ji Kwak, Yejin Kim
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 42 pages, 5 figures
Abstract:Understanding patients' experiences is essential for advancing patient-centered care, especially in chronic diseases that require ongoing communication. However, qualitative thematic analysis, the primary approach for exploring these experiences, remains labor-intensive, subjective, and difficult to scale. In this study, we developed a multi-agent large language model framework that automates qualitative thematic analysis through three agents (Instructor, Thematizer, CodebookGenerator), named the Collaborative Theme Identification Agent (CoTI). We applied CoTI to 12 heart failure patient interviews to analyze their perceptions of medication intensity. CoTI identified key phrases, themes, and a codebook that were more similar to those of the senior investigator than those of both junior investigators and baseline NLP models. We also implemented CoTI in a user-facing application to enable AI-human interaction in qualitative analysis. However, collaboration between CoTI and junior investigators provided only marginal gains, suggesting they may over-rely on CoTI and limit their independent critical thinking.
[AI-77] CauSTream: Causal Spatio-Temporal Representation Learning for Streamflow Forecasting
Quick read: This paper addresses the poor interpretability and weak generalization of conventional deep learning models for streamflow forecasting, which ignore the underlying hydrological physics. The key to the solution is the CauSTream framework, which jointly learns two dynamic causal graph structures: a runoff causal graph among meteorological forcings, and a routing causal graph capturing dynamic dependencies across stations; it further establishes identifiability conditions for these causal structures in a nonparametric setting, enabling explicit modeling of the physical mechanisms of hydrological systems alongside accurate, efficient forecasting.
Link: https://arxiv.org/abs/2512.16046
Authors: Shu Wan, Reepal Shah, John Sabo, Huan Liu, K. Selçuk Candan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
Comments: Accepted by IEEE Big Data 2025
Abstract:Streamflow forecasting is crucial for water resource management and risk mitigation. While deep learning models have achieved strong predictive performance, they often overlook underlying physical processes, limiting interpretability and generalization. Recent causal learning approaches address these issues by integrating domain knowledge, yet they typically rely on fixed causal graphs that fail to adapt to data. We propose CauSTream, a unified framework for causal spatiotemporal streamflow forecasting. CauSTream jointly learns (i) a runoff causal graph among meteorological forcings and (ii) a routing graph capturing dynamic dependencies across stations. We further establish identifiability conditions for these causal structures under a nonparametric setting. We evaluate CauSTream on three major U.S. river basins across three forecasting horizons. The model consistently outperforms prior state-of-the-art methods, with performance gaps widening at longer forecast windows, indicating stronger generalization to unseen conditions. Beyond forecasting, CauSTream also learns causal graphs that capture relationships among hydrological factors and stations. The inferred structures align closely with established domain knowledge, offering interpretable insights into watershed dynamics. CauSTream offers a principled foundation for causal spatiotemporal modeling, with the potential to extend to a wide range of scientific and environmental applications.
[AI-78] Topic Discovery and Classification for Responsible Generative AI Adaptation in Higher Education
Quick read: This paper addresses the problem that generative AI (GenAI) usage policies across universities are scattered, inconsistent, and constantly changing, leaving students unsure of expectations. The core challenge is how to systematically discover, categorize, and structure institutional policy texts on GenAI use so that students can understand and comply with them. The key to the solution is an automated system that combines unsupervised topic modeling, used to discover key policy themes, with large language models (LLMs) that classify policy texts at a fine granularity, judging the permitted level of GenAI use and other requirements. The system achieves a topic coherence score of 0.73, and GPT-4.0-based classification reaches precision of 0.92-0.97 and recall of 0.85-0.97 across eight identified topics, substantially improving the interpretability and usability of policy information and enabling integration into educational technology platforms.
Link: https://arxiv.org/abs/2512.16036
Authors: Diane Myung-kyung Woodbridge, Allyson Seba, Freddie Seba, Aydin Schwartz
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As generative artificial intelligence (GenAI) becomes increasingly capable of delivering personalized learning experiences and real-time feedback, a growing number of students are incorporating these tools into their academic workflows. They use GenAI to clarify concepts, solve complex problems, and, in some cases, complete assignments by copying and pasting model-generated contents. While GenAI has the potential to enhance learning experience, it also raises concerns around misinformation, hallucinated outputs, and its potential to undermine critical thinking and problem-solving skills. In response, many universities, colleges, departments, and instructors have begun to develop and adopt policies to guide responsible integration of GenAI into learning environments. However, these policies vary widely across institutions and contexts, and their evolving nature often leaves students uncertain about expectations and best practices. To address this challenge, the authors designed and implemented an automated system for discovering and categorizing AI-related policies found in course syllabi and institutional policy websites. The system combines unsupervised topic modeling techniques to identify key policy themes with large language models (LLMs) to classify the level of GenAI allowance and other requirements in policy texts. The developed application achieved a coherence score of 0.73 for topic discovery. In addition, GPT-4.0-based classification of policy categories achieved precision between 0.92 and 0.97, and recall between 0.85 and 0.97 across eight identified topics. By providing structured and interpretable policy information, this tool promotes the safe, equitable, and pedagogically aligned use of GenAI technologies in education. Furthermore, the system can be integrated into educational technology platforms to help students understand and comply with relevant guidelines.
[AI-79] Do Large Language Models Know What They Don't Know? KalshiBench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets
Quick read: This paper addresses epistemic calibration in large language models (LLMs) facing genuinely unknown future events, that is, whether the probabilistic confidence a model reports matches its actual predictive accuracy. Traditional benchmarks measure accuracy on static knowledge and cannot assess uncertainty quantification about the truly unknown future. The key to the solution is KalshiBench, a collection of 300 questions from Kalshi, a regulated prediction-market platform, with verifiable real-world outcomes that occur after the models' training cutoffs, forming a forward-looking, externally verified evaluation framework. The results reveal systematic overconfidence across mainstream frontier models, and enhanced reasoning brings no calibration gains, highlighting calibration as a distinct capability requiring targeted development.
Link: https://arxiv.org/abs/2512.16030
Authors: Lukas Nel
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:A well-calibrated model should express confidence that matches its actual accuracy – when it claims 80% confidence, it should be correct 80% of the time. While large language models (LLMs) have achieved remarkable performance across diverse tasks, their epistemic calibration remains poorly understood. We introduce KalshiBench, a benchmark of 300 prediction market questions from Kalshi, a CFTC-regulated exchange, with verifiable real-world outcomes occurring after model training cutoffs. Unlike traditional benchmarks measuring accuracy on static knowledge, KalshiBench evaluates whether models can appropriately quantify uncertainty about genuinely unknown future events. We evaluate five frontier models – Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2 – and find systematic overconfidence across all models. Even the best-calibrated model (Claude Opus 4.5, ECE=0.120) shows substantial calibration errors, while reasoning-enhanced models like GPT-5.2-XHigh exhibit worse calibration (ECE=0.395) despite comparable accuracy. Critically, only one model achieves a positive Brier Skill Score, indicating most models perform worse than simply predicting base rates. Our findings suggest that scaling and enhanced reasoning do not automatically confer calibration benefits, highlighting epistemic calibration as a distinct capability requiring targeted development.
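The two headline metrics here are standard and easy to compute. The Python sketch below implements expected calibration error (ECE) with equal-width confidence bins and the Brier skill score against the base rate; the example forecasts are hypothetical.

```python
import numpy as np

def ece(conf, outcome, n_bins=10):
    """Expected calibration error for binary forecasts:
    conf[i] = predicted probability, outcome[i] in {0, 1}."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():  # weight each bin by its share of forecasts
            err += m.mean() * abs(conf[m].mean() - outcome[m].mean())
    return err

def brier_skill_score(conf, outcome):
    """BSS > 0 means the model beats always predicting the base rate."""
    brier = np.mean((conf - outcome) ** 2)
    ref = np.mean((outcome.mean() - outcome) ** 2)
    return 1.0 - brier / ref

# Hypothetical forecasts for illustration only.
conf = np.array([0.9, 0.8, 0.7, 0.6, 0.9])
outcome = np.array([1, 0, 1, 0, 1])
print(ece(conf, outcome), brier_skill_score(conf, outcome))
```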
[AI-80] Conversational Time Series Foundation Models: Towards Explainable and Effective Forecasting
Quick read: This paper addresses the absence of any single time series foundation model with consistently superior performance, reframing the core challenge from "finding the best model" to "building the best interpretable ensemble". The key to the solution is repositioning a large language model (LLM) as an intelligent judge: R1-style fine-tuning guided by SHAP-based faithfulness scores gives the LLM domain understanding of time series and lets it interpret ensemble weights as causal statements about temporal dynamics. The trained agent conducts iterative multi-turn conversations to perform forward-looking assessments, provide causally grounded explanations, and adaptively refine the ensemble strategy; across 23 datasets and 97 settings on the GIFT-Eval benchmark, it significantly outperforms leading time series foundation models, setting new state-of-the-art CRPS and MASE results.
Link: https://arxiv.org/abs/2512.16022
Authors: Defu Cao, Michael Gee, Jinbo Liu, Hengxuan Wang, Wei Yang, Rui Wang, Yan Liu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages
Abstract:The proliferation of time series foundation models has created a landscape where no single method achieves consistent superiority, framing the central challenge not as finding the best model, but as orchestrating an optimal ensemble with interpretability. While Large Language Models (LLMs) offer powerful reasoning capabilities, their direct application to time series forecasting has proven ineffective. We address this gap by repositioning the LLM as an intelligent judge that evaluates, explains, and strategically coordinates an ensemble of foundation models. To overcome the LLM’s inherent lack of domain-specific knowledge on time series, we introduce an R1-style finetuning process, guided by SHAP-based faithfulness scores, which teaches the model to interpret ensemble weights as meaningful causal statements about temporal dynamics. The trained agent then engages in iterative, multi-turn conversations to perform forward-looking assessments, provide causally-grounded explanations for its weighting decisions, and adaptively refine the optimization strategy. Validated on the GIFT-Eval benchmark on 23 datasets across 97 settings, our approach significantly outperforms leading time series foundation models on both CRPS and MASE metrics, establishing new state-of-the-art results.
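To see how such an ensemble is scored, the Python sketch below combines hypothetical model forecasts with judge-assigned weights and evaluates the weighted sample-based CRPS estimator E|X - y| - 0.5 E|X - X'|. The forecasts and weights are invented for illustration; they are not values from the paper.

```python
import numpy as np

def crps_weighted(samples, weights, y):
    """Weighted sample-based CRPS: E|X - y| - 0.5 * E|X - X'|,
    where X is drawn from the ensemble with the given weights."""
    s = np.asarray(samples, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    term1 = np.sum(w * np.abs(s - y))
    term2 = np.sum(w[:, None] * w[None, :] * np.abs(s[:, None] - s[None, :]))
    return term1 - 0.5 * term2

# Hypothetical forecasts from three foundation models for one horizon step,
# with judge-assigned ensemble weights and an observed value y.
forecasts = np.array([10.2, 9.7, 11.5])
weights = np.array([0.5, 0.3, 0.2])
print(weights @ forecasts)                     # weighted point forecast
print(crps_weighted(forecasts, weights, y=10.0))  # lower CRPS is better
```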
[AI-81] Few-Shot Inference of Human Perceptions of Robot Performance in Social Navigation Scenarios
Quick read: This paper addresses how to accurately predict users' perceptual evaluations of robot behaviour in human-robot interaction, a prerequisite for developing socially aware robots that meet human expectations. Traditional approaches collect labels through large user studies, which is costly and scales poorly. The key to the solution is exploiting the few-shot learning capability of large language models (LLMs): with only a handful of in-context examples, the models predict user perceptions accurately, dramatically reducing the need for labeled data while improving predictive performance and scalability.
Link: https://arxiv.org/abs/2512.16019
Authors: Qiping Zhang, Nathan Tsoi, Mofeed Nagib, Hao-Tien Lewis Chiang, Marynel Vázquez
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Understanding how humans evaluate robot behavior during human-robot interactions is crucial for developing socially aware robots that behave according to human expectations. While the traditional approach to capturing these evaluations is to conduct a user study, recent work has proposed utilizing machine learning instead. However, existing data-driven methods require large amounts of labeled data, which limits their use in practice. To address this gap, we propose leveraging the few-shot learning capabilities of Large Language Models (LLMs) to improve how well a robot can predict a user’s perception of its performance, and study this idea experimentally in social navigation tasks. To this end, we extend the SEAN TOGETHER dataset with additional real-world human-robot navigation episodes and participant feedback. Using this augmented dataset, we evaluate the ability of several LLMs to predict human perceptions of robot performance from a small number of in-context examples, based on observed spatio-temporal cues of the robot and surrounding human motion. Our results demonstrate that LLMs can match or exceed the performance of traditional supervised learning models while requiring an order of magnitude fewer labeled instances. We further show that prediction performance can improve with more in-context examples, confirming the scalability of our approach. Additionally, we investigate what kind of sensor-based information an LLM relies on to make these inferences by conducting an ablation study on the input features considered for performance prediction. Finally, we explore the novel application of personalized examples for in-context learning, i.e., drawn from the same user being evaluated, finding that they further enhance prediction accuracy. This work paves the path to improving robot behavior in a scalable manner through user-centered feedback.
[AI-82] Towards Fine-Tuning-Based Site Calibration for Knowledge-Guided Machine Learning: A Summary of Results
Quick read: This paper addresses the accurate, cost-effective quantification of the agroecosystem carbon cycle at decision-relevant scales, where heterogeneous cross-region data and complex spatial dependencies make it hard for conventional methods to transfer effectively and exploit spatial variability. The key to the solution is FTBSC-KGML, a pretraining-and-fine-tuning-based, spatial-variability-aware, knowledge-guided machine learning framework: a globally pretrained model is fine-tuned at each state or site to learn place-specific representations, improving local accuracy under limited data while preserving interpretability and capturing spatial differences across regions far better than a purely global model.
Link: https://arxiv.org/abs/2512.16013
Authors: Ruolei Zeng, Arun Sharma, Shuai An, Mingzhou Yang, Shengya Zhang, Licheng Liu, David Mulla, Shashi Shekhar
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate and cost-effective quantification of the agroecosystem carbon cycle at decision-relevant scales is essential for climate mitigation and sustainable agriculture. However, both transfer learning and the exploitation of spatial variability in this field are challenging, as they involve heterogeneous data and complex cross-scale dependencies. Conventional approaches often rely on location-independent parameterizations and independent training, underutilizing transfer learning and spatial heterogeneity in the inputs, and limiting their applicability in regions with substantial variability. We propose FTBSC-KGML (Fine-Tuning-Based Site Calibration-Knowledge-Guided Machine Learning), a pretraining- and fine-tuning-based, spatial-variability-aware, and knowledge-guided machine learning framework that augments KGML-ag with a pretraining-fine-tuning process and site-specific parameters. Using a pretraining-fine-tuning process with remote-sensing GPP, climate, and soil covariates collected across multiple midwestern sites, FTBSC-KGML estimates land emissions while leveraging transfer learning and spatial heterogeneity. A key component is a spatial-heterogeneity-aware transfer-learning scheme, which is a globally pretrained model that is fine-tuned at each state or site to learn place-aware representations, thereby improving local accuracy under limited data without sacrificing interpretability. Empirically, FTBSC-KGML achieves lower validation error and greater consistency in explanatory power than a purely global model, thereby better capturing spatial variability across states. This work extends the prior SDSA-KGML framework.
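The pretrain-then-fine-tune pattern the abstract describes can be shown in a minimal Python/PyTorch sketch: a model is first fit on pooled data from all sites, then a copy is fine-tuned per site with a smaller learning rate. The architecture, data, and hyperparameters below are placeholders, not the paper's configuration.

```python
import copy
import torch
from torch import nn

def fit(model, X, y, epochs=100, lr=1e-3):
    """Plain MSE regression loop (stand-in for the KGML training step)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return model

# Global pretraining on pooled covariates from all sites (random stand-ins).
global_model = fit(nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1)),
                   torch.randn(512, 4), torch.randn(512, 1))

# Per-site calibration: fine-tune a copy on each site's small local dataset.
site_data = {"site_A": (torch.randn(32, 4), torch.randn(32, 1))}
site_models = {}
for site, (Xs, ys) in site_data.items():
    site_models[site] = fit(copy.deepcopy(global_model), Xs, ys,
                            epochs=30, lr=1e-4)  # small, site-specific update
```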
[AI-83] Surrogate Neural Architecture Codesign Package (SNAC-Pack)
Quick read: This paper addresses the difficulty existing neural architecture search (NAS) methods have in optimizing for real hardware performance, especially on resource-constrained platforms such as FPGAs, where evaluation often relies on proxy metrics (such as bit operations, BOPs) rather than actual hardware behaviour, leading to inefficient designs. The key to the solution is the Surrogate Neural Architecture Codesign Package (SNAC-Pack), which integrates multi-stage search with a Resource Utilization and Latency Estimator to jointly optimize model accuracy, FPGA resource usage, and inference latency without time-intensive synthesis of each candidate model, substantially improving the automation and practicality of hardware-aware neural network design.
Link: https://arxiv.org/abs/2512.15998
Authors: Jason Weitz, Dmitri Demler, Benjamin Hawks, Nhan Tran, Javier Duarte
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
Comments: NeurIPS 2025 Machine Learning and the Physical Sciences Workshop, 8 pages, 4 figures, 3 tables
Abstract:Neural Architecture Search is a powerful approach for automating model design, but existing methods struggle to accurately optimize for real hardware performance, often relying on proxy metrics such as bit operations. We present Surrogate Neural Architecture Codesign Package (SNAC-Pack), an integrated framework that automates the discovery and optimization of neural networks focusing on FPGA deployment. SNAC-Pack combines Neural Architecture Codesign’s multi-stage search capabilities with the Resource Utilization and Latency Estimator, enabling multi-objective optimization across accuracy, FPGA resource utilization, and latency without requiring time-intensive synthesis for each candidate model. We demonstrate SNAC-Pack on a high energy physics jet classification task, achieving 63.84% accuracy with resource estimation. When synthesized on a Xilinx Virtex UltraScale+ VU13P FPGA, the SNAC-Pack model matches baseline accuracy while maintaining comparable resource utilization to models optimized using traditional BOPs metrics. This work demonstrates the potential of hardware-aware neural architecture search for resource-constrained deployments and provides an open-source framework for automating the design of efficient FPGA-accelerated models.
[AI-84] Provably Extracting the Features from a General Superposition
Quick read: This paper addresses learning feature directions from black-box query access in the overcomplete regime, where the number of features n exceeds the input dimension d: the goal is to recover the hidden directions v_i and response functions σ_i and thereby reconstruct f(x) = Σ_{i=1}^n a_i σ_i(v_i^T x). The core difficulty is that the features are encoded in superposition, which defeats typical algorithmic approaches. The key to the solution is an algorithm that searches in Fourier space, iteratively refining the search space to locate the hidden directions v_i; requiring only that distinct feature directions are not nearly identical, and allowing arbitrary response functions σ_i, it identifies all directions with non-degenerate responses and reconstructs the function, substantially generalizing the settings handled by prior work.
Link: https://arxiv.org/abs/2512.15987
Authors: Allen Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments:
Abstract:It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function f(x) = \sum_{i=1}^{n} a_i \sigma_i(v_i^\top x), where each unit vector v_i encodes a feature direction and \sigma_i : \mathbb{R} \rightarrow \mathbb{R} is an arbitrary response function, and our goal is to recover the v_i and the function f. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e., n > d), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to f, identifies all feature directions whose responses are non-degenerate and reconstructs the function f. Crucially, our algorithm works in a significantly more general setting than all related prior results: we allow for essentially arbitrary superpositions, only requiring that v_i, v_j are not nearly identical for i \neq j, and general response functions \sigma_i. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions v_i.
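The problem setting is simple to instantiate. The numpy sketch below builds an overcomplete instance of f(x) = Σ_i a_i σ_i(v_i^T x) with n > d and exposes it only through value queries, which is the black-box access the algorithm assumes; the specific response functions chosen here are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 32                                 # overcomplete: n > d features
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit feature directions v_i
a = rng.normal(size=n)                          # coefficients a_i
sigmas = ([np.tanh, np.abs, np.sin] * 11)[:n]   # arbitrary responses sigma_i

def f(x):
    """Black-box target f(x) = sum_i a_i * sigma_i(v_i^T x)."""
    return sum(a_i * s(v @ x) for a_i, s, v in zip(a, sigmas, V))

x = rng.normal(size=d)
print(f(x))   # a learner sees only (noisy) query values like this one
```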
[AI-85] Embedding Software Intent: Lightweight Java Module Recovery
Quick read: This paper addresses the inefficiency and poor effectiveness of existing architecture recovery techniques when modularizing large monolithic Java systems into Java Platform Module System (JPMS) modules. The key to the solution is ClassLAR (Class-and-Language-model-based Architectural Recovery), which uses language models to extract semantic information from fully qualified class names, capturing both the structural characteristics and the functional intent of modules and enabling lightweight, efficient recovery of Java modules; experiments on 20 popular Java projects show ClassLAR outperforms all state-of-the-art techniques on architectural-level similarity metrics while running 3.99 to 10.50 times faster.
Link: https://arxiv.org/abs/2512.15980
Authors: Yirui He, Yuqi Huai, Xingyu Chen, Joshua Garcia
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:As an increasing number of software systems reach unprecedented scale, relying solely on code-level abstractions is becoming impractical. While architectural abstractions offer a means to manage these systems, maintaining their consistency with the actual code has been problematic. The Java Platform Module System (JPMS), introduced in Java 9, addresses this limitation by enabling explicit module specification at the language level. JPMS enhances architectural implementation through improved encapsulation and direct specification of ground-truth architectures within Java projects. Although many projects are written in Java, modularizing existing monolithic projects to JPMS modules is an open challenge due to ineffective module recovery by existing architecture recovery techniques. To address this challenge, this paper presents ClassLAR (Class-and Language model-based Architectural Recovery), a novel, lightweight, and efficient approach that recovers Java modules from monolithic Java systems using fully-qualified class names. ClassLAR leverages language models to extract semantic information from package and class names, capturing both structural and functional intent. In evaluations across 20 popular Java projects, ClassLAR outperformed all state-of-the-art techniques in architectural-level similarity metrics while achieving execution times that were 3.99 to 10.50 times faster.
[AI-86] OLAF: Towards Robust LLM-Based Annotation Framework in Empirical Software Engineering
Quick read: This paper addresses the insufficient reliability and reproducibility of large language models (LLMs) used to automate or assist annotation tasks in empirical software engineering: existing studies commonly lack standardized measures of annotation reliability, calibration, and drift, and often omit essential configuration details, making results hard to verify and compare. The key to the solution is the conceptual Operationalization for LLM-based Annotation Framework (OLAF), which treats LLM-based annotation as a measurement process and systematically organizes the key building blocks of reliability, calibration, drift, consensus, aggregation, and transparency, promoting more transparent and reproducible LLM-based annotation practice in software engineering research.
Link: https://arxiv.org/abs/2512.15979
Authors: Mia Mohammad Imran, Tarannum Shaila Zaman
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) are increasingly used in empirical software engineering (ESE) to automate or assist annotation tasks such as labeling commits, issues, and qualitative artifacts. Yet the reliability and reproducibility of such annotations remain underexplored. Existing studies often lack standardized measures for reliability, calibration, and drift, and frequently omit essential configuration details. We argue that LLM-based annotation should be treated as a measurement process rather than a purely automated activity. In this position paper, we outline the Operationalization for LLM-based Annotation Framework (OLAF), a conceptual framework that organizes key constructs: reliability, calibration, drift, consensus, aggregation, and transparency. The paper aims to motivate methodological discussion and future empirical work toward more transparent and reproducible LLM-based annotation in software engineering research.
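One concrete reliability building block such studies can report is chance-corrected agreement between an LLM annotator and a human rater. The Python sketch below implements Cohen's kappa; the labels are hypothetical, and the paper itself does not prescribe this particular statistic.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n               # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical issue labels from an LLM annotator vs. a human rater.
llm   = ["bug", "feature", "bug", "docs", "bug"]
human = ["bug", "feature", "docs", "docs", "bug"]
print(cohens_kappa(llm, human))  # report alongside calibration and drift
```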
[AI-87] Subjective functions
Quick read: This paper asks how artificial systems could be endowed with human intelligence's ability to dynamically generate and select objective functions, that is, where objective functions come from and how we choose which goals to pursue. The key to the solution is the concept of a subjective function: a higher-order objective function endogenous to the agent, defined with respect to the agent's own features rather than an external task. Using expected prediction error as a concrete example, the paper shows how a subjective function can serve as an intrinsic drive guiding an agent's learning and behaviour, and the framework connects broadly to ideas in psychology, neuroscience, and machine learning.
Link: https://arxiv.org/abs/2512.15948
Authors: Samuel J. Gershman
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Where do objective functions come from? How do we select what goals to pursue? Human intelligence is adept at synthesizing new objective functions on the fly. How does this work, and can we endow artificial systems with the same ability? This paper proposes an approach to answering these questions, starting with the concept of a subjective function, a higher-order objective function that is endogenous to the agent (i.e., defined with respect to the agent’s features, rather than an external task). Expected prediction error is studied as a concrete example of a subjective function. This proposal has many connections to ideas in psychology, neuroscience, and machine learning.
[AI-88] Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning
Quick read: This paper addresses the sustainability and accessibility problems caused by the high computational cost and resource demands of large language models (LLMs) in enterprise generative AI deployments. The key to the solution is replacing LLMs with small language models (SLMs) through domain-adapted fine-tuning and an efficient supervised fine-tuning (SFT) strategy, so that a model with only 350M parameters can approach or even surpass mainstream large models on targeted tasks. Experiments show that a Facebook OPT-350M model fine-tuned for a single SFT epoch reaches a 77.55% pass rate on ToolBench, significantly outperforming baselines including ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%), confirming that SLMs optimized for targeted tasks can cut deployment cost while maintaining strong performance and enabling large-scale production adoption of generative AI.
Link: https://arxiv.org/abs/2512.15943
Authors: Polaris Jhandi, Owais Kazi, Shreyas Subramanian, Neel Sendas
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As organizations scale adoption of generative AI, model cost optimization and operational efficiency have emerged as critical factors determining sustainability and accessibility. While Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, their extensive computational requirements make them cost-prohibitive for routine enterprise use. This limitation motivates the exploration of Small Language Models (SLMs), which can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead (Irugalbandara et al., 2023). In this work, we investigate the feasibility of replacing LLM-driven workflows with optimized SLMs. We trained a domain-adapted SLM to execute representative tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation. As part of the experiment, we investigated the fine-tuning of facebook/opt-350m model (single epoch only) using the Hugging Face TRL (Transformer Reinforcement Learning), specifically the Supervised Fine-Tuning (SFT) trainer. The OPT-350M model was released by Meta AI in 2022 as part of the OPT (Open Pretrained Transformer) family of models. Similar studies demonstrate that even models at the 350M parameter scale can meaningfully contribute to instruction-tuning pipelines (Mekala et al., 2024). Experimental results demonstrated that our fine-tuned SLM achieves exceptional performance with a 77.55% pass rate on ToolBench evaluation, significantly outperforming all baseline models including ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%). These findings emphasize that thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems.
[AI-89] Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems
Quick read: This paper addresses the difficulty current retrieval-augmented generation (RAG) systems have in reliably retrieving and connecting multi-step evidence for complex reasoning: existing frameworks usually treat all retrieved information as equally reliable, ignoring differences in credibility and the inherent interconnections within text corpora. The key to the solution is a novel RAG framework that uses a spreading activation algorithm to retrieve information from a corpus of documents connected by automatically constructed knowledge graphs, improving large language model performance on complex tasks such as multi-hop question answering without depending on high-quality pre-existing knowledge graphs or elaborate graph-construction pipelines. Experiments show up to a 39% absolute gain in answer correctness over naive RAG when combined with chain-of-thought iterative retrieval, achieved with small open-weight language models, underscoring its effectiveness in resource-constrained settings.
Link: https://arxiv.org/abs/2512.15922
Authors: Jovan Pavlović, Miklós Krész, László Hajdu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures
Abstract:Despite initial successes and a variety of architectures, retrieval-augmented generation (RAG) systems still struggle to reliably retrieve and connect the multi-step evidence required for complicated reasoning tasks. Most of the standard RAG frameworks regard all retrieved information as equally reliable, overlooking the varying credibility and interconnected nature of large textual corpora. GraphRAG approaches offer potential improvement to RAG systems by integrating knowledge graphs, which structure information into nodes and edges, capture entity relationships, and enable multi-step logical traversal. However, GraphRAG is not always an ideal solution as it depends on high-quality graph representations of the corpus, which requires either pre-existing knowledge graphs that are expensive to build and update, or automated graph construction pipelines that are often unreliable. Moreover, systems following this paradigm typically use large language models to guide graph traversal and evidence retrieval, leading to challenges similar to those encountered with standard RAG. In this paper, we propose a novel RAG framework that employs the spreading activation algorithm to retrieve information from a corpus of documents interconnected by automatically constructed knowledge graphs, thereby enhancing the performance of large language models on complex tasks such as multi-hop question answering. Experiments show that our method achieves better or comparable performance to iterative RAG methodologies, while also being easily integrable as a plug-and-play module with a wide range of RAG-based approaches. Combining our method with chain-of-thought iterative retrieval yields up to a 39% absolute gain in answer correctness compared to naive RAG, achieving these results with small open-weight language models and highlighting its effectiveness in resource-constrained settings.
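A minimal version of the core algorithm is easy to state. In the Python sketch below, activation starts at seed documents (for example, initial retrieval scores) and spreads along weighted knowledge-graph edges with a decay factor and a firing threshold; the graph, weights, and hyperparameters are illustrative assumptions rather than the paper's configuration.

```python
from collections import defaultdict

def spreading_activation(graph, seeds, decay=0.7, threshold=0.05, hops=3):
    """Propagate activation from seed documents through graph links;
    returns document -> accumulated activation for retrieval ranking."""
    act = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(hops):
        nxt = defaultdict(float)
        for node, energy in frontier.items():
            for nbr, w in graph.get(node, {}).items():
                pulse = energy * decay * w       # attenuate per hop and edge
                if pulse >= threshold:           # ignore negligible pulses
                    nxt[nbr] += pulse
        for node, energy in nxt.items():
            act[node] += energy
        frontier = nxt
    return dict(act)

# Hypothetical corpus graph: edge weights link documents that share
# automatically extracted knowledge-graph entities.
graph = {"d1": {"d2": 0.8, "d3": 0.4}, "d2": {"d4": 0.9}}
print(spreading_activation(graph, seeds={"d1": 1.0}))
# d4 gets activation even though it never matched the query directly.
```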
[AI-90] Darth Vecdor: An Open-Source System for Generating Knowledge Graphs Through Large Language Model Queries
Quick read: This paper addresses the cost, latency, safety, and confidence concerns that arise when large language models (LLMs) are queried directly in demanding domains such as healthcare. The key to the solution is Darth Vecdor (DV), a system that extracts knowledge embedded in Internet-trained LLMs into a structured, terminology-mapped SQL database (a knowledge graph). Pre-extracting and standardizing the knowledge improves query efficiency and controllability, while support for multi-element responses and for prompt engineering by domain experts strengthens applicability and safety in critical domains such as healthcare.
Link: https://arxiv.org/abs/2512.15906
Authors: Jonathan A. Handler
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures
Abstract:Many large language models (LLMs) are trained on a massive body of knowledge present on the Internet. Darth Vecdor (DV) was designed to extract this knowledge into a structured, terminology-mapped, SQL database (“knowledge base” or “knowledge graph”). Knowledge graphs may be useful in many domains, including healthcare. Although one might query an LLM directly rather than a SQL-based knowledge graph, concerns such as cost, speed, safety, and confidence may arise, especially in high-volume operations. These may be mitigated when the information is pre-extracted from the LLM and becomes query-able through a standard database. However, the author found the need to address several issues. These included erroneous, off-topic, free-text, overly general, and inconsistent LLM responses, as well as allowing for multi-element responses. DV was built with features intended to mitigate these issues. To facilitate ease of use, and to allow for prompt engineering by those with domain expertise but little technical background, DV provides a simple, browser-based graphical user interface. DV has been released as free, open-source, extensible software, on an “as is” basis, without warranties or conditions of any kind, either express or implied. Users need to be cognizant of the potential risks and benefits of using DV and its outputs, and users are responsible for ensuring any use is safe and effective. DV should be assumed to have bugs, potentially very serious ones. However, the author hopes that appropriate use of current and future versions of DV and its outputs can help improve healthcare.
[AI-91] PediatricAnxietyBench: Evaluating Large Language Model Safety Under Parental Anxiety and Pressure in Pediatric Consultations
Quick read: This paper addresses the safety of large language models (LLMs) facing real-world parental anxiety and pressure in pediatric consultations, where urgent phrasing and other pressure factors risk making models violate safety constraints and give harmful advice. The key to the solution is PediatricAnxietyBench, an open-source benchmark of 300 high-quality queries (150 patient-derived, 150 simulating parental pressure) across 10 pediatric topics, evaluated with a multi-dimensional safety framework (diagnostic restraint, referral adherence, hedging, and emergency recognition) on two Llama models (70B and 8B). Results show that model scale significantly affects safety, yet all models remain vulnerable under realistic parental pressure: seizure queries yield inappropriate diagnoses in 33.3% of cases, and hedging correlates strongly with safety scores (r=0.68). The benchmark offers a reproducible tool for exposing clinically significant failure modes that standard evaluations miss.
Link: https://arxiv.org/abs/2512.15894
Authors: Vahideh Zolfaghari
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly consulted by parents for pediatric guidance, yet their safety under real-world adversarial pressures is poorly understood. Anxious parents often use urgent language that can compromise model safeguards, potentially causing harmful advice. PediatricAnxietyBench is an open-source benchmark of 300 high-quality queries across 10 pediatric topics (150 patient-derived, 150 adversarial) enabling reproducible evaluation. Two Llama models (70B and 8B) were assessed using a multi-dimensional safety framework covering diagnostic restraint, referral adherence, hedging, and emergency recognition. Adversarial queries incorporated parental pressure patterns, including urgency, economic barriers, and challenges to disclaimers. Mean safety score was 5.50/15 (SD=2.41). The 70B model outperformed the 8B model (6.26 vs 4.95, p<0.001) with lower critical failures (4.8% vs 12.0%, p=0.02). Adversarial queries reduced safety by 8% (p=0.03), with urgency causing the largest drop (-1.40). Vulnerabilities appeared in seizures (33.3% inappropriate diagnosis) and post-vaccination queries. Hedging strongly correlated with safety (r=0.68, p<0.001), while emergency recognition was absent. Model scale influences safety, yet all models showed vulnerabilities to realistic parental pressures. PediatricAnxietyBench provides a reusable adversarial evaluation framework to reveal clinically significant failure modes overlooked by standard benchmarks.
zh
[AI-92] VET Your Agent: Towards Host-Independent Autonomy via Verifiable Execution Traces
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)驱动的自主代理在实际部署中面临的可信性问题:即代理运行于由主机控制的基础设施之上,主机可能篡改模型、输入或输出,从而破坏代理的自主性和可验证性。为应对这一挑战,论文提出VET(Verifiable Execution Traces)框架,其核心是Agent Identity Document(AID),用于定义代理配置及配套的验证证明机制;VET具备组合性,支持多种证明方式,包括可信执行环境(Trusted Execution Environment, TEE)、简洁密码学证明和经公证的TLS传输记录(Web Proofs)。实验表明,在黑盒API调用场景下,Web Proofs是最实用方案,延迟通常低于直接API调用的3倍;而在公开API场景下,TEE Proxy即可满足低开销需求。通过部署一个可验证交易代理并融合Web Proofs与TEE Proxy,作者证明了现有技术已能实现宿主无关的身份认证,为未来完全宿主独立的自主系统奠定基础。
链接: https://arxiv.org/abs/2512.15892
作者: Artem Grigor,Christian Schroeder de Witt,Simon Birnbach,Ivan Martinovic
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have enabled a new generation of autonomous agents that operate over sustained periods and manage sensitive resources on behalf of users. Trusted for their ability to act without direct oversight, such agents are increasingly considered in high-stakes domains including financial management, dispute resolution, and governance. Yet in practice, agents execute on infrastructure controlled by a host, who can tamper with models, inputs, or outputs, undermining any meaningful notion of autonomy. We address this gap by introducing VET (Verifiable Execution Traces), a formal framework that achieves host-independent authentication of agent outputs and takes a step toward host-independent autonomy. Central to VET is the Agent Identity Document (AID), which specifies an agent’s configuration together with the proof systems required for verification. VET is compositional: it supports multiple proof mechanisms, including trusted hardware, succinct cryptographic proofs, and notarized TLS transcripts (Web Proofs). We implement VET for an API-based LLM agent and evaluate our instantiation on realistic workloads. We find that for today’s black-box, secret-bearing API calls, Web Proofs appear to be the most practical choice, with overhead typically under 3× compared to direct API calls, while for public API calls, a lower-overhead TEE Proxy is often sufficient. As a case study, we deploy a verifiable trading agent that produces proofs for each decision and composes Web Proofs with a TEE Proxy. Our results demonstrate that practical, host-agnostic authentication is already possible with current technology, laying the foundation for future systems that achieve full host-independent autonomy.
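为直观理解“代理身份文档(AID)+ 可验证执行轨迹”的组合思想,下面给出一个基于哈希链的最小概念演示。字段命名均为本文假设;真实的 VET 中每一步还需 TEE、简洁密码学证明或 Web Proof 等机制背书,此处并未实现这些证明系统。

```python
import hashlib, json
from dataclasses import dataclass, field

@dataclass
class AgentIdentityDocument:
    """AID:声明代理配置与验证所需的证明系统(字段为示意)。"""
    model: str
    config_hash: str
    proof_systems: list

@dataclass
class ExecutionTrace:
    aid: AgentIdentityDocument
    entries: list = field(default_factory=list)

    def append(self, step: dict) -> str:
        """把一步执行记入哈希链,使事后无法静默篡改。"""
        prev = self.entries[-1]["digest"] if self.entries else "GENESIS"
        payload = json.dumps(step, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"step": step, "digest": digest})
        return digest

    def verify(self) -> bool:
        """重放哈希链,检查轨迹完整性。"""
        prev = "GENESIS"
        for e in self.entries:
            payload = json.dumps(e["step"], sort_keys=True)
            if hashlib.sha256((prev + payload).encode()).hexdigest() != e["digest"]:
                return False
            prev = e["digest"]
        return True

aid = AgentIdentityDocument("api-llm-v1", "sha256:...", ["web_proof", "tee_proxy"])
trace = ExecutionTrace(aid)
trace.append({"action": "api_call", "input": "price(BTC)", "output": "..."})
print(trace.verify())   # True
```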
zh
[AI-93] Optimizing Agentic Language Model Inference via Speculative Tool Calls
【速读】:该论文旨在解决语言模型(Language Models, LMs)在依赖外部工具(如文件搜索、API调用、代码执行等)进行推理和交互时,因工具调用引入的性能瓶颈问题。解决方案的关键在于提出新颖的系统级优化策略:通过推测性执行(speculating)工具调用,并强制相关序列在推理引擎中保持驻留(remain resident),从而显著减少工具调用带来的延迟开销。实验表明,该方法可实现每秒数百token的吞吐量提升,同时提供理论分析以指导最优推测配置,并建议引入“工具缓存”(tool cache)API端点,便于大模型服务商快速集成这些优化。
链接: https://arxiv.org/abs/2512.15834
作者: Daniel Nichols,Prajwal Singhania,Charles Jekel,Abhinav Bhatele,Harshitha Menon
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Software Engineering (cs.SE)
备注:
Abstract:Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms to provide insights into speculation configurations that will yield the best performance. Further, we recommend a new “tool cache” API endpoint to enable LM providers to easily adopt these optimizations.
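下面用一个极简并发示例说明“推测性工具调用”的直觉:在模型尚未最终敲定工具参数前,先按草稿参数异步发起调用;若最终参数与草稿一致则直接复用结果,否则丢弃重调。这只是对论文思想的示意性还原,接口与命名均为本文假设,并未实现其推理引擎驻留与“tool cache”端点。

```python
import concurrent.futures as futures
import time

def slow_tool(query: str) -> str:
    time.sleep(0.5)                        # 模拟昂贵的外部工具(搜索、代码执行等)
    return f"result({query})"

def run_with_speculation(draft_query: str, finalize) -> str:
    with futures.ThreadPoolExecutor(max_workers=1) as pool:
        spec = pool.submit(slow_tool, draft_query)   # 推测:提前发起工具调用
        final_query = finalize()                     # 同时继续解码出最终参数
        if final_query == draft_query:
            return spec.result()                     # 推测命中:直接复用结果
        return slow_tool(final_query)                # 未命中:按最终参数重调

print(run_with_speculation("weather in Paris",
                           finalize=lambda: "weather in Paris"))
```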
zh
[AI-94] State-Augmented Graphs for Circular Economy Triage
【速读】:该论文旨在解决循环经济(Circular Economy, CE)中产品在使用寿命终结后,如何科学评估并决策其可持续路径的问题,即CE分诊(CE triage)。传统方法难以兼顾价值保留、处理成本与操作约束的动态平衡。解决方案的关键在于提出一种基于状态增强的解体序列规划(Disassembly Sequencing Planning, DSP)图的确定性决策框架,通过将解体历史编码为状态变量以满足马尔可夫性质,从而实现递归最优评估——每个决策仅依赖于前一状态,避免了复杂的历史依赖性;同时融合基于诊断健康评分的条件感知效用函数与多重操作约束,使决策能灵活适应不同机械复杂度、安全要求和经济驱动因素,如电动汽车电池的层级分诊案例所示,该框架提供了一个通用且可扩展的优化基础。
链接: https://arxiv.org/abs/2512.15824
作者: Richard Fox,Rui Li,Gustav Jonsson,Farzaneh Goli,Miying Yang,Emel Aktas,Yongjing Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Circular economy (CE) triage is the assessment of products to determine which sustainable pathway they can follow once they reach the end of their usefulness as they are currently being used. Effective CE triage requires adaptive decisions that balance retained value against the costs and constraints of processing and labour. This paper presents a novel decision-making framework as a simple deterministic solver over a state-augmented Disassembly Sequencing Planning (DSP) graph. By encoding the disassembly history into the state, our framework enforces the Markov property, enabling optimal, recursive evaluation by ensuring each decision only depends on the previous state. The triage decision involves choices between continuing disassembly or committing to a CE option. The model integrates condition-aware utility based on diagnostic health scores and complex operational constraints. We demonstrate the framework’s flexibility with a worked example: the hierarchical triage of electric vehicle (EV) batteries, where decisions are driven by the recursive valuation of components. The example illustrates how a unified formalism enables the accommodation of varying mechanical complexity, safety requirements, and economic drivers. This unified formalism therefore provides a tractable and generalisable foundation for optimising CE triage decisions across diverse products and operational contexts.
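摘要的关键点是:把拆解历史编码进状态后,最优分诊可以用记忆化递归求解,每步只需比较“提交某个 CE 选项”与“继续拆解”的净值。下面是一个数值与结构均为虚构的最小示意(以拆解深度近似状态;真实框架作用在完整的状态增强 DSP 图上)。

```python
from functools import lru_cache

# 虚构示例:各层部件的拆解人工成本与各 CE 选项的残余价值
PARTS = ("pack", "module", "cell")
DISASSEMBLY_COST = {"pack": 5.0, "module": 8.0, "cell": 12.0}
CE_VALUE = {"pack": {"reuse": 20.0},
            "module": {"remanufacture": 35.0},
            "cell": {"recycle": 30.0}}

@lru_cache(maxsize=None)
def best_value(depth: int) -> float:
    """状态 = 已拆到第 depth 层;马尔可夫性使递归求值只依赖当前状态。"""
    part = PARTS[depth]
    commit = max(CE_VALUE[part].values())                        # 提交某个 CE 选项
    if depth + 1 < len(PARTS):
        deeper = best_value(depth + 1) - DISASSEMBLY_COST[part]  # 继续拆解一层
        return max(commit, deeper)
    return commit

print(best_value(0))   # 从整包(pack)状态出发的最优净值:30.0
```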
zh
[AI-95] A Neurosymbolic Approach to Loop Invariant Generation via Weakest Precondition Reasoning
【速读】:该论文旨在解决自动化程序验证中循环不变量(loop invariant)生成这一关键瓶颈问题。现有基于大语言模型(Large Language Models, LLMs)的方法缺乏可靠且结构化的推理机制,且未充分结合程序验证理论。为此,作者提出了一种神经符号(neurosymbolic)方法 NeuroInv,其核心在于两个模块的协同:一是神经推理模块,利用LLMs与霍尔逻辑(Hoare logic)通过后向链 weakest precondition 推理生成并优化候选不变量;二是验证引导的符号模块,借助 OpenJML 产生的反例迭代修复不变量。该方案在包含单循环、多循环、多数组及噪声代码的150个Java程序上实现99.5%的成功率,显著优于对比方法,并在10个复杂多循环程序(平均每个含7个循环)上验证了其可扩展性。
链接: https://arxiv.org/abs/2512.15816
作者: Daragh King,Vasileios Koutavas,Laura Kovacs
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:Loop invariant generation remains a critical bottleneck in automated program verification. Recent work has begun to explore the use of Large Language Models (LLMs) in this area, yet these approaches tend to lack a reliable and structured methodology, with little reference to existing program verification theory. This paper presents NeuroInv, a neurosymbolic approach to loop invariant generation. NeuroInv comprises two key modules: (1) a neural reasoning module that leverages LLMs and Hoare logic to derive and refine candidate invariants via backward-chaining weakest precondition reasoning, and (2) a verification-guided symbolic module that iteratively repairs invariants using counterexamples from OpenJML. We evaluate NeuroInv on a comprehensive benchmark of 150 Java programs, encompassing single and multiple (sequential) loops, multiple arrays, random branching, and noisy code segments. NeuroInv achieves a 99.5% success rate, substantially outperforming the other evaluated approaches. Additionally, we introduce a hard benchmark of 10 larger multi-loop programs (with an average of 7 loops each); NeuroInv’s performance in this setting demonstrates that it can scale to more complex verification scenarios.
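NeuroInv 的“神经生成 + 符号验证”闭环可以用十几行代码勾勒出来:LLM 依据最弱前置条件推理提出候选不变量,验证失败时返回反例,反例再注入下一轮提示。下面两处占位函数均为本文假设,并非论文代码;真实系统分别对接 LLM 与 OpenJML。

```python
def llm_propose(program: str, counterexamples: list) -> str:
    """占位:实际应调用 LLM,按 Hoare 逻辑/最弱前置条件推理生成候选不变量。"""
    return "0 <= i && i <= n"      # 演示用的固定候选

def verify_with_openjml(program: str, invariant: str):
    """占位:实际应调用 OpenJML,返回 (是否通过, 反例或 None)。"""
    return True, None              # 演示用:直接通过

def neuro_inv(program: str, max_rounds: int = 5):
    counterexamples = []
    for _ in range(max_rounds):
        candidate = llm_propose(program, counterexamples)   # 神经推理模块
        ok, cex = verify_with_openjml(program, candidate)   # 符号验证模块
        if ok:
            return candidate
        counterexamples.append(cex)                         # 反例引导下一轮修复
    return None

print(neuro_inv("for (int i = 0; i < n; i++) sum += a[i];"))
```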
zh
[AI-96] CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory
【速读】:该论文旨在解决当前工具使用型AI代理(tool-using AI agents)在处理重复性任务时存在的三大问题:动作空间受限、上下文效率低下以及概率不稳定性。其中,概率不稳定性导致相同任务在相同环境下可能产生不同执行轨迹,从而影响可靠性。为应对这一挑战,论文提出CodeMem架构,其核心创新在于通过代码实现过程记忆(procedural memory),使得代理能够构建和运行具有确定性可靠性的可复用智能工作流(agentic workflows)。该方案利用代码作为记忆载体,确保任务执行路径的一致性和可重现性,从而克服了传统基于大语言模型(LLM)的随机性缺陷。
链接: https://arxiv.org/abs/2512.15813
作者: Nishant Gaurav,Adit Akarsh,Tejas Ravishankar,Manoj Bajaj
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures
Abstract:Current tool-using AI agents suffer from limited action space, context inefficiency, and probabilistic instability that makes them unsuitable for handling repetitive tasks which are otherwise reliably and efficiently tackled by agentic workflows built on platforms like n8n and Zapier. Earlier works like CodeAct, DynaSaur, and Code Mode have tried to tackle the first two issues by using the whole Python language as the action space: the number of tools that the agent can call becomes infinite. Python code blocks can execute complex actions in a single step and print only relevant results, which helps in keeping the context lean. However, the probabilistic instability issue still remains, as for the same task in the same environment, the agent can follow different trajectories due to the probabilistic nature of LLMs. Therefore, we need procedural memory for consistency and reliability. This paper proposes CodeMem, an architecture to implement procedural memory via code which can be used to build and run reusable agentic workflows with deterministic reliability.
zh
[AI-97] Edge-wise Topological Divergence Gaps: Guiding Search in Combinatorial Optimization
【速读】:该论文旨在解决旅行商问题(Travelling Salesman Problem, TSP)中局部优化算法性能受限的问题,特别是如何提升2-opt和3-opt等经典启发式方法在收敛速度与解质量上的表现。其解决方案的关键在于提出了一种基于拓扑反馈机制的新方法:通过分析巡回路径(tour)与最小生成树(minimum spanning tree, MST)之间的差异,利用一个关键的规范分解定理,将该差距表示为RTD-Lite条形码(barcode)中的边级拓扑差异项;进而基于此拓扑信息设计出指导2-opt和3-opt搜索方向的拓扑引导策略,从而显著提升局部搜索效率与最终解的质量。
链接: https://arxiv.org/abs/2512.15800
作者: Ilya Trofimov,Daria Voronkova,Alexander Mironenko,Anton Dmitriev,Eduard Tulchinskii,Evgeny Burnaev,Serguei Barannikov
机构: 未知
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce a topological feedback mechanism for the Travelling Salesman Problem (TSP) by analyzing the divergence between a tour and the minimum spanning tree (MST). Our key contribution is a canonical decomposition theorem that expresses the tour-MST gap as edge-wise topology-divergence gaps from the RTD-Lite barcode. Based on this, we develop topological guidance for 2-opt and 3-opt heuristics that increases their performance. We carry out experiments with fine-optimization of tours obtained from heatmap-based methods, TSPLIB, and random instances. Experiments demonstrate that topology-guided optimization results in better performance and faster convergence in many cases.
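下面给出“用巡回路径与 MST 之差指导 2-opt”这一思路的简化示意:先求 MST,再优先在“巡回中有、MST 中没有”的边上尝试 2-opt 交换。这只是对论文思想的粗略近似,并未实现 RTD-Lite 条形码与其规范分解定理本身。

```python
import itertools, math, random

def mst_edges(dist, n):
    """Prim 算法求 MST 边集(以无序点对 frozenset 表示)。"""
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        u, v = min(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.add(frozenset((u, v)))
        in_tree.add(v)
    return edges

def guided_two_opt(tour, dist):
    n, mst = len(tour), mst_edges(dist, len(tour))
    improved = True
    while improved:
        improved = False
        # 只考察不在 MST 中的巡回边位置,即“拓扑差异”较大的地方
        gaps = [k for k in range(n)
                if frozenset((tour[k], tour[(k + 1) % n])) not in mst]
        for a, b in itertools.combinations(gaps, 2):
            i, j = sorted((a, b))
            old = dist[tour[i]][tour[(i + 1) % n]] + dist[tour[j]][tour[(j + 1) % n]]
            new = dist[tour[i]][tour[j]] + dist[tour[(i + 1) % n]][tour[(j + 1) % n]]
            if new < old - 1e-9:                       # 标准 2-opt 改进判据
                tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                improved = True
                break
    return tour

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(12)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
print(guided_two_opt(list(range(12)), dist))
```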
zh
[AI-98] Cybercrime and Computer Forensics in Epoch of Artificial Intelligence in India
【速读】:该论文旨在解决生成式人工智能(Generative AI)在数字生态系统中对印度刑事司法体系下计算法证完整性带来的挑战,特别是《2023年个人数据保护法》(Digital Personal Data Protection Act, 2023)在应对对抗性AI威胁(如反法证技术和深度伪造)方面的适配性不足问题。研究指出,当前法律框架在隐私边界划定、数据最小化原则与法证数据留存需求之间存在关键张力,且现有定义未能涵盖由AI驱动的“工具犯罪”和“目标犯罪”。解决方案的关键在于提出一种“以人为中心”的法证模型,优先采用可解释人工智能(Explainable AI, XAI),以确保证据的可采性,并建议同步印度隐私法规与国际法证标准,从而系统性降低合成媒体风险,为未来立法修订和技术标准化提供路径。
链接: https://arxiv.org/abs/2512.15799
作者: Sahibpreet Singh,Shikha Dhiman
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Published in Cyber Law Reporter 2(4), 13-32 (2023)
Abstract:The integration of generative Artificial Intelligence into the digital ecosystem necessitates a critical re-evaluation of Indian criminal jurisprudence regarding computational forensics integrity. While algorithmic efficiency enhances evidence extraction, a research gap exists regarding the Digital Personal Data Protection Act, 2023’s compatibility with adversarial AI threats, specifically anti-forensics and deepfakes. This study scrutinizes the AI “dual-use” dilemma, functioning as both a cyber-threat vector and forensic automation mechanism, to delineate privacy boundaries in high-stakes investigations. Employing a doctrinal legal methodology, the research synthesizes statutory analysis of the DPDP Act with global ethical frameworks (IEEE, EU) to evaluate regulatory efficacy. Preliminary results indicate that while Machine Learning offers high accuracy in pattern recognition, it introduces vulnerabilities regarding data poisoning and algorithmic bias. Findings highlight a critical tension between the Act’s data minimization principles and forensic data retention requirements. Furthermore, the paper identifies that existing legal definitions inadequately encompass AI-driven “tool crimes” and “target crimes.” Consequently, the research proposes a “human-centric” forensic model prioritizing explainable AI (XAI) to ensure evidence admissibility. These implications suggest that synchronizing Indian privacy statutes with international forensic standards is imperative to mitigate synthetic media risks, establishing a roadmap for future legislative amendments and technical standardization.
zh
[AI-99] Toward Agentic Environments: GenAI and the Convergence of AI, Sustainability, and Human-Centric Spaces
【速读】:该论文旨在解决当前以云为中心的AI部署模式所带来的高能耗与环境影响问题,其核心挑战在于生成式AI(Generative AI)和大语言模型(Large Language Models, LLMs)在广泛应用中对计算资源的巨大需求所导致的碳足迹上升。解决方案的关键在于提出“代理环境”(agentic environments)这一可持续导向的AI框架,通过整合生成式AI、多智能体系统(multi-agent systems)与边缘计算(edge computing),实现资源利用效率提升、数据隐私强化以及设计即可持续(sustainability-by-design)的目标,从而减少对能源密集型云端基础设施的依赖。
链接: https://arxiv.org/abs/2512.15787
作者: Przemek Pospieszny,Dominika P. Brodowicz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Preprint, Paper submitted for publication in Sustainable Development (Wiley)
Abstract:In recent years, advances in artificial intelligence (AI), particularly generative AI (GenAI) and large language models (LLMs), have made human-computer interactions more frequent, efficient, and accessible across sectors ranging from banking to healthcare. AI tools embedded in digital devices support decision-making and operational management at both individual and organizational levels, including resource allocation, workflow automation, and real-time data analysis. However, the prevailing cloud-centric deployment of AI carries a substantial environmental footprint due to high computational demands. In this context, this paper introduces the concept of agentic environments, a sustainability-oriented AI framework that extends beyond reactive systems by leveraging GenAI, multi-agent systems, and edge computing to reduce the environmental impact of technology. Agentic environments enable more efficient resource use, improved quality of life, and sustainability-by-design, while simultaneously enhancing data privacy through decentralized, edge-driven solutions. Drawing on secondary research as well as primary data from focus groups and semi-structured interviews with AI professionals from leading technology companies, the paper proposes a conceptual framework for agentic environments examined through three lenses: the personal sphere, professional and commercial use, and urban operations. The findings highlight the potential of agentic environments to foster sustainable ecosystems through optimized resource utilization and strengthened data privacy. The study concludes with recommendations for edge-driven deployment models to reduce reliance on energy-intensive cloud infrastructures.
zh
[AI-100] Cultural Rights and the Rights to Development in the Age of AI: Implications for Global Human Rights Governance
【速读】:该论文旨在解决人工智能(AI)技术发展对文化权利(cultural rights)与发展的权利(right to development)所构成的挑战,特别是AI在文化内容生成、知识产权分配及全球文化参与中的影响,以及其可能加剧既有经济、社会和数字不平等的问题。解决方案的关键在于系统分析AI在算法设计与部署中隐含的文化与发展假设所带来的认知与规范局限,并识别现有AI治理框架在此两类权利保护上的空白与张力;进而提出将文化权利与发展权纳入AI治理的伦理与法律考量,推动更具包容性和公平性的全球AI治理政策演进。
链接: https://arxiv.org/abs/2512.15786
作者: Alexander Kriebitz,Caitlin Corrigan,Aive Pevkur,Alberto Santos Ferro,Amanda Horzyk,Dirk Brand,Dohee Kim,Dodzi Koku Hattoh,Flavia Massucci,Gilles Fayad,Kamil Strzepek,Laud Ammah,Lavina Ramkissoon,Mariette Awad,Natalia Amasiadi,Nathan C. Walker,Nicole Manger,Sophia Devlin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Cultural rights and the right to development are essential norms within the wider framework of international human rights law. However, recent technological advances in artificial intelligence (AI) and adjacent digital frontier technologies pose significant challenges to the protection and realization of these rights. This owes to the increasing influence of AI systems, which shape the creation and depiction of cultural content, affect the use and distribution of the intellectual property of individuals and communities, and influence cultural participation and expression worldwide. In addition, the growing influence of AI risks exacerbating preexisting economic, social and digital divides and reinforcing inequities for marginalized communities. This dynamic challenges the existing interplay between cultural rights and the right to development, and raises questions about the integration of cultural and developmental considerations into emerging AI governance frameworks. To address these challenges, the paper examines the impact of AI on both categories of rights. Conceptually, it analyzes the epistemic and normative limitations of AI with respect to cultural and developmental assumptions embedded in algorithmic design and deployment, but also individual and structural impacts of AI on both rights. On this basis, the paper identifies gaps and tensions in existing AI governance frameworks with respect to cultural rights and the right to development. By situating cultural rights and the right to development within the broader landscape of AI and human rights, this paper contributes to the academic discourse on AI ethics, legal frameworks, and international human rights law. Finally, it outlines avenues for future research and policy development based on existing conversations in global AI governance.
zh
[AI-101] Beyond Training: Enabling Self-Evolution of Agents with MOBIMEM
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在移动端和桌面端部署后难以实现持续自我演进的问题。当前以模型为中心的代理架构在提升个性化、能力与效率时,通常依赖频繁的模型重训练或微调,导致计算开销巨大,并面临模型准确率与推理效率之间的固有权衡。其解决方案的关键在于提出一种以记忆为中心的代理系统 MOBIMEM,通过引入三种专用记忆原语将代理演化过程从模型权重中解耦:(1) 用户画像记忆(Profile Memory)利用轻量级距离图(DisGraph)结构实现用户偏好对齐,缓解检索准确性与延迟间的矛盾;(2) 经验记忆(Experience Memory)采用多层级模板机制实例化新任务的执行逻辑,保障能力泛化;(3) 动作记忆(Action Memory)记录细粒度交互序列,降低对昂贵模型推理的依赖。在此基础上,MOBIMEM 进一步集成类操作系统的服务机制,包括调度器、代理回放(AgentRR)和上下文感知异常处理,从而实现无需模型重训练即可迭代优化代理性能的目标。
链接: https://arxiv.org/abs/2512.15784
作者: Zibin Liu,Cheng Zhang,Xi Zhao,Yunfei Feng,Bingyu Bai,Dahu Feng,Erhu Feng,Yubin Xia,Haibo Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Model (LLM) agents are increasingly deployed to automate complex workflows in mobile and desktop environments. However, current model-centric agent architectures struggle to self-evolve post-deployment: improving personalization, capability, and efficiency typically requires continuous model retraining/fine-tuning, which incurs prohibitive computational overheads and suffers from an inherent trade-off between model accuracy and inference efficiency. To enable iterative self-evolution without model retraining, we propose MOBIMEM, a memory-centric agent system. MOBIMEM first introduces three specialized memory primitives to decouple agent evolution from model weights: (1) Profile Memory uses a lightweight distance-graph (DisGraph) structure to align with user preferences, resolving the accuracy-latency trade-off in user profile retrieval; (2) Experience Memory employs multi-level templates to instantiate execution logic for new tasks, ensuring capability generalization; and (3) Action Memory records fine-grained interaction sequences, reducing the reliance on expensive model inference. Building upon this memory architecture, MOBIMEM further integrates a suite of OS-inspired services to orchestrate execution: a scheduler that coordinates parallel sub-task execution and memory operations; an agent record-and-replay (AgentRR) mechanism that enables safe and efficient action reuse; and a context-aware exception handling that ensures graceful recovery from user interruptions and runtime errors. Evaluation on AndroidWorld and top-50 apps shows that MOBIMEM achieves 83.1% profile alignment with 23.83 ms retrieval time (280x faster than GraphRAG baselines), improves task success rates by up to 50.3%, and reduces end-to-end latency by up to 9x on mobile devices.
zh
[AI-102] AI Epidemiology: achieving explainable AI through expert oversight patterns
【速读】:该论文旨在解决当前高级人工智能(AI)系统在部署规模下因模型复杂性导致的可解释性难题,尤其是传统方法如SHAP和机制可解释性难以应对大规模模型治理的问题。其核心解决方案是提出“AI流行病学”(AI Epidemiology)框架,通过将AI输出的专家交互行为标准化为结构化评估字段——风险等级、对齐分数和准确度分数——作为暴露变量,利用统计关联预测输出失败,类似于医学中用胆固醇和血压预测心脏事件。该方法不依赖模型内部计算,而是基于群体层面的被动监控,实现自动审计追踪、跨模型与供应商的治理连续性,并提供可靠性评分与语义解释,使领域专家无需机器学习专业知识即可有效监督AI系统,从而实现AI治理的民主化。
链接: https://arxiv.org/abs/2512.15783
作者: Kit Tempest-Walters
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 41 pages, 1 figure, 7 tables
Abstract:AI Epidemiology is a framework for governing and explaining advanced AI systems by applying population-level surveillance methods to AI outputs. The approach mirrors the way in which epidemiologists enable public health interventions through statistical evidence before molecular mechanisms are understood. This bypasses the problem of model complexity which plagues current interpretability methods (such as SHAP and mechanistic interpretability) at the scale of deployed models. AI Epidemiology achieves this population-level surveillance by standardising capture of AI-expert interactions into structured assessment fields: risk level, alignment score, and accuracy score. These function as exposure variables which predict output failure through statistical associations, much like cholesterol and blood pressure act as exposure variables predicting cardiac events. Output-failure associations are subsequently validated against expert overrides and real-world outcomes. The framework places zero burden on experts and provides automatic audit trails by passively tracking expert convergence and divergence with AI recommendations. Since it analyses outputs rather than internal model computations, it also provides governance continuity when institutions update models and switch vendors. Finally, by providing reliability scores and semantic assessments (e.g. ‘this recommendation resembles 500 cases overridden by experts due to guideline violations’), it enables experts and institutions to detect unreliable AI outputs before they cause harm. This democratises AI oversight by enabling domain experts to govern AI systems without requiring machine learning expertise.
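把风险等级、对齐分数、准确度分数当作“暴露变量”来统计性地预测输出失败,其建模形式本质上就是一个经典的流行病学回归。下面用 scikit-learn 的逻辑回归给出最小示意;数据为随机合成,仅演示形式,与该框架的真实数据无关。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
# 三个暴露变量:风险等级、对齐分数、准确度分数(合成数据)
risk = rng.integers(0, 5, n)
alignment = rng.uniform(0, 1, n)
accuracy = rng.uniform(0, 1, n)
# 合成“输出失败”标签:风险越高、对齐/准确度越低越容易失败
logit = 0.8 * risk - 2.0 * alignment - 2.0 * accuracy
failure = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([risk, alignment, accuracy])
model = LogisticRegression().fit(X, failure)
print("暴露变量系数:", model.coef_)                 # 与失败的关联方向与强度
print("P(failure):", model.predict_proba([[4, 0.2, 0.3]])[0, 1])
```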
zh
[AI-103] Adversarial Robustness in Financial Machine Learning: Defenses Economic Impact and Governance Evidence
【速读】:该论文旨在解决表格型机器学习模型在金融决策场景中面临的对抗鲁棒性问题,特别是评估这些模型在遭受梯度攻击时的性能稳定性及其对歧视性、校准性和金融风险指标的影响。其解决方案的关键在于通过对抗训练(adversarial training)提升模型在小扰动下的鲁棒性,从而实现性能的部分恢复,保障金融应用中的可靠性与公平性。
链接: https://arxiv.org/abs/2512.15780
作者: Samruddhi Baviskar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:We evaluate adversarial robustness in tabular machine learning models used in financial decision making. Using credit scoring and fraud detection data, we apply gradient based attacks and measure impacts on discrimination, calibration, and financial risk metrics. Results show notable performance degradation under small perturbations and partial recovery through adversarial training.
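摘要中“梯度攻击 + 对抗训练部分恢复”的流程可以用几行 PyTorch 说明。下面是 FGSM 式扰动与对抗训练单步的玩具示意;模型、数据均为随机生成,扰动幅度 eps 为假设值,并非论文的实验设置。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def fgsm(x, y, eps=0.05):
    """对(已归一化的)表格特征施加 FGSM 扰动。"""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

x = torch.randn(64, 8)                  # 玩具“信贷/欺诈”特征
y = torch.randint(0, 2, (64,))
for _ in range(100):                    # 对抗训练:干净样本 + 对抗样本联合优化
    x_adv = fgsm(x, y)
    opt.zero_grad()
    (loss_fn(model(x), y) + loss_fn(model(x_adv), y)).backward()
    opt.step()
print("clean acc:", (model(x).argmax(1) == y).float().mean().item())
```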
zh
[AI-104] Emergence: Overcoming Privileged Information Bias in Asymmetric Embodied Agents via Active Querying
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在具身环境中的“符号接地”(symbol grounding)难题,特别是当信息分布不对称时,具备知识优势的“领导者”代理无法有效指导感知受限的“跟随者”代理的问题。这一现象被定义为“特权信息偏差”(Privileged Information Bias),其根源在于缺乏心智理论(Theory of Mind)。解决方案的关键在于提出一种新颖的非对称辅助推理框架(Asymmetric Assistive Reasoning),并通过AI2-THOR环境验证:采用基于主动查询的“拉取式”(Pull-based)通信协议显著优于传统的“推送式”(Push-based)指令传递方式,成功案例中澄清请求频率提升两倍,表明主动不确定性降低是实现安全人机协作和机器人间协作的前提条件。
链接: https://arxiv.org/abs/2512.15776
作者: Shaun Baek,Sam Liu,Joseph Ukpong
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: 12 pages, 9 pages of content, 6 tables, 5 figures
Abstract:Large Language Models (LLMs) act as powerful reasoning engines but struggle with “symbol grounding” in embodied environments, particularly when information is asymmetrically distributed. We investigate the Privileged Information Bias (or “Curse of Knowledge”), where a knowledgeable “Leader” agent fails to guide a sensor-limited “Follower” due to a lack of Theory of Mind. To quantify this phenomenon, we propose a novel Asymmetric Assistive Reasoning framework within AI2-THOR. Our experiments reveal a significant “Success Gap”: while the Leader successfully perceives the target in 35.0% of episodes, the collaborative team succeeds only 17.0% of the time, implying that nearly 50% of feasible plans fail solely due to communicative grounding errors. We demonstrate that a “Pull-based” protocol (active querying) is significantly more robust than standard “Push-based” instruction, with successful episodes featuring 2x the frequency of clarification requests. This research isolates the mechanism of active uncertainty reduction as a prerequisite for safe human-AI and robot-robot collaboration.
zh
[AI-105] Enhanced Web User Interface Design Via Cross-Device Responsiveness Assessment Using An Improved HCI-INTEGRATED DL Schemes
【速读】:该论文旨在解决现有用户界面(User Interface, UI)优化模型忽视跨响应性(Cross-Responsiveness, CR)评估的问题,从而影响用户交互效率。其核心解决方案是引入基于有限指数连续状态机(Finite Exponential Continuous State Machine, FECSM)的CR评估机制,并结合一种新型的Quokka非线性差分 swarm 优化算法(Quokka Nonlinear Difference Swarm Optimization Algorithm, QNDSOA)对UI设计进行动态优化。关键创新在于通过FECSM实现对用户行为模式的精准建模与CR评估,再利用双向门控Luong与Mish激活函数递归单元(Bidirectional Gated Luong and Mish Recurrent Unit, BiGLMRU)识别用户体验(User eXperience, UX)变化类型,并以用户界面变化预测指数(User Interface Change Prediction Index, UICPI)为标签训练分类器,最终由QNDSOA实现平均适应度达98.5632%的UI优化部署。
链接: https://arxiv.org/abs/2512.15775
作者: Shrinivass Arunachalam Balasubramanian
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Software Engineering (cs.SE)
备注: 17 Pages, 8 Figures
Abstract:User Interface (UI) optimization is essential in the digital era to enhance user satisfaction in web environments. Nevertheless, the existing UI optimization models had overlooked the Cross-Responsiveness (CR) assessment, affecting the user interaction efficiency. Consequently, this article proposes a dynamic web UI optimization through CR assessment using Finite Exponential Continuous State Machine (FECSM) and Quokka Nonlinear Difference Swarm Optimization Algorithm (QNDSOA). Initially, the design and user interaction related information is collected as well as pre-processed for min-max normalization. Next, the Human-Computer Interaction (HCI)-based features are extracted, followed by user behaviour pattern grouping. Meanwhile, the CR assessment is done using FECSM. Then, the proposed Bidirectional Gated Luong and Mish Recurrent Unit (BiGLMRU) is used to classify the User eXperience (UX) change type, which is labelled based on the User Interface Change Prediction Index (UICPI). Lastly, a novel QNDSOA is utilized to optimize the UI design with an average fitness of 98.5632%. Feedback monitoring is done after optimal deployment.
zh
[AI-106] TS-DP: Reinforcement Speculative Decoding For Temporal Adaptive Diffusion Policy Acceleration
【速读】:该论文旨在解决扩散策略(Diffusion Policy, DP)在具身控制任务中因多次迭代去噪步骤导致的高推理延迟和计算成本问题,尤其针对时间复杂度较高的动态环境,传统静态加速方法(如量化)无法适应任务难度变化,而推测解码(speculative decoding)虽具无损与自适应潜力但尚未被有效应用于DP。解决方案的关键在于提出Temporal-aware Reinforcement-based Speculative Diffusion Policy (TS-DP),其核心创新包括:1)通过蒸馏训练一个基于Transformer的drafting模型以替代原模型昂贵的去噪调用;2)设计基于强化学习(Reinforcement Learning, RL)的调度器,动态调整推测参数以匹配随时间变化的任务难度,在保证精度的同时实现计算资源的高效分配。实验表明,TS-DP可实现最高4.17倍加速,且接受率超过94%,达到25 Hz的推理频率,支持实时扩散控制而无性能损失。
链接: https://arxiv.org/abs/2512.15773
作者: Ye Li,Jiahe Feng,Yuan Meng,Kangye Ji,Chen Tang,Xinwan Wen,Shutao Xia,Zhi Wang,Wenwu Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion Policy (DP) excels in embodied control but suffers from high inference latency and computational cost due to multiple iterative denoising steps. The temporal complexity of embodied tasks demands a dynamic and adaptable computation mode. Static and lossy acceleration methods, such as quantization, fail to handle such dynamic embodied tasks, while speculative decoding offers a lossless and adaptive yet underexplored alternative for DP. However, it is non-trivial to address the following challenges: how to match the base model’s denoising quality at lower cost under time-varying task difficulty in embodied settings, and how to dynamically and interactively adjust computation based on task difficulty in such environments. In this paper, we propose Temporal-aware Reinforcement-based Speculative Diffusion Policy (TS-DP), the first framework that enables speculative decoding for DP with temporal adaptivity. First, to handle dynamic environments where task difficulty varies over time, we distill a Transformer-based drafter to imitate the base model and replace its costly denoising calls. Second, an RL-based scheduler further adapts to time-varying task difficulty by adjusting speculative parameters to maintain accuracy while improving efficiency. Extensive experiments across diverse embodied environments demonstrate that TS-DP achieves up to 4.17 times faster inference with over 94% accepted drafts, reaching an inference frequency of 25 Hz and enabling real-time diffusion-based control without performance degradation.
zh
[AI-107] TENG: Time-Evolving Natural Gradient for Solving PDEs With Deep Neural Nets under General Boundary Conditions
【速读】:该论文旨在解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在求解偏微分方程(Partial Differential Equations, PDEs)时面临的高精度不足与复杂边界条件处理困难的问题。其解决方案的关键在于将时间演化自然梯度(Time-Evolving Natural Gradient, TENG)框架扩展至Dirichlet边界条件的处理,通过在损失函数中引入边界惩罚项实现约束的精确施加,并结合欧拉(Euler)和赫恩(Heun)数值时间推进方法以兼顾稳定性与精度:其中赫恩方法因具有二阶修正而表现更优,而欧拉方法则在简单场景下更具计算效率。此方法为后续拓展至诺伊曼(Neumann)及混合边界条件以及更广泛的PDE类提供了坚实基础。
链接: https://arxiv.org/abs/2512.15771
作者: Xinjie He,Chenggong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
备注: 7 pages, 2 figures
Abstract:Partial Differential Equations (PDEs) are central to modeling complex systems across physical, biological, and engineering domains, yet traditional numerical methods often struggle with high-dimensional or complex problems. Physics-Informed Neural Networks (PINNs) have emerged as an efficient alternative by embedding physics-based constraints into deep learning frameworks, but they face challenges in achieving high accuracy and handling complex boundary conditions. In this work, we extend the Time-Evolving Natural Gradient (TENG) framework to address Dirichlet boundary conditions, integrating natural gradient optimization with numerical time-stepping schemes, including Euler and Heun methods, to ensure both stability and accuracy. By incorporating boundary condition penalty terms into the loss function, the proposed approach enables precise enforcement of Dirichlet constraints. Experiments on the heat equation demonstrate the superior accuracy of the Heun method due to its second-order corrections and the computational efficiency of the Euler method for simpler scenarios. This work establishes a foundation for extending the framework to Neumann and mixed boundary conditions, as well as broader classes of PDEs, advancing the applicability of neural network-based solvers for real-world problems.
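论文把自然梯度与欧拉/赫恩时间推进相结合;两种推进格式的差别本身可以用一个玩具动力系统直观展示。下面的 `theta_dot` 是占位函数:真实方法中该速度场由 PDE 残差加 Dirichlet 边界惩罚经自然梯度求出,此处为简化仅演示一阶欧拉与二阶(预估-校正)赫恩更新。

```python
import numpy as np

def theta_dot(theta, t):
    """占位:真实方法中由 PDE 残差 + 边界惩罚经自然梯度得到的参数速度场。"""
    return -theta + np.sin(t)          # 玩具动力学,仅作演示

def euler_step(theta, t, dt):
    return theta + dt * theta_dot(theta, t)               # 一阶精度

def heun_step(theta, t, dt):
    k1 = theta_dot(theta, t)
    k2 = theta_dot(theta + dt * k1, t + dt)               # 预估-校正
    return theta + 0.5 * dt * (k1 + k2)                   # 二阶精度

theta_e = theta_h = np.ones(4)
for step in range(10):
    theta_e = euler_step(theta_e, step * 0.1, 0.1)
    theta_h = heun_step(theta_h, step * 0.1, 0.1)
print("Euler:", theta_e)
print("Heun :", theta_h)
```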
zh
[AI-108] Data-Chain Backdoor: Do You Trust Diffusion Models as Generative Data Supplier?
【速读】:该论文旨在解决生成式模型(如扩散模型)在合成数据增强过程中可能引入的后门传播问题,即Data-Chain Backdoor(DCB)威胁。其核心问题是:开源扩散模型因其强大的分布拟合能力,会隐式记忆并再现后门触发器(trigger),并将这些触发器传递至下游任务模型,从而造成严重安全风险,尤其在无标签攻击(clean-label attack)场景下,这种威胁难以察觉且不影响合成数据的可用性。解决方案的关键在于识别并利用早期阶段触发器显现现象(Early-Stage Trigger Manifestation, ESTM)——即后门触发模式在扩散模型逆向生成过程的早期高噪声阶段更为明显,这一特性为检测和防御后门注入提供了新的切入点。
链接: https://arxiv.org/abs/2512.15769
作者: Junchi Lu,Xinke Li,Yuheng Liu,Qi Alfred Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The increasing use of generative models such as diffusion models for synthetic data augmentation has greatly reduced the cost of data collection and labeling in downstream perception tasks. However, this new data source paradigm may introduce important security concerns. This work investigates backdoor propagation in such emerging generative data supply chains, namely Data-Chain Backdoor (DCB). Specifically, we find that open-source diffusion models can become hidden carriers of backdoors. Their strong distribution-fitting ability causes them to memorize and reproduce backdoor triggers during generation, which are subsequently inherited by downstream models, resulting in severe security risks. This threat is particularly concerning under clean-label attack scenarios, as it remains effective while having negligible impact on the utility of the synthetic data. Furthermore, we discover an Early-Stage Trigger Manifestation (ESTM) phenomenon: backdoor trigger patterns tend to surface more explicitly in the early, high-noise stages of the diffusion model’s reverse generation process before being subtly integrated into the final samples. Overall, this work reveals a previously underexplored threat in generative data pipelines and provides initial insights toward mitigating backdoor risks in synthetic data generation.
zh
[AI-109] PHANTOM: Progressive High-fidelity Adversarial Network for Threat Object Modeling
【速读】:该论文旨在解决网络安全领域中网络攻击数据稀缺问题,这一限制严重制约了入侵检测系统(Intrusion Detection System, IDS)的鲁棒性发展。为应对该挑战,作者提出了一种名为PHANTOM的新型对抗变分框架,其核心创新在于采用渐进式训练策略、双路径VAE-GAN架构以及领域特定特征匹配机制,从而生成高保真度的合成攻击数据,同时保留攻击语义信息。实验表明,基于PHANTOM生成的数据训练的模型在真实攻击上达到98%的加权准确率,且统计分析验证了合成数据在分布和多样性上的真实性。
链接: https://arxiv.org/abs/2512.15768
作者: Jamal Al-Karaki,Muhammad Al-Zafar Khan,Rand Derar Mohammad Al Athamneh
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The scarcity of cyberattack data hinders the development of robust intrusion detection systems. This paper introduces PHANTOM, a novel adversarial variational framework for generating high-fidelity synthetic attack data. Its innovations include progressive training, a dual-path VAE-GAN architecture, and domain-specific feature matching to preserve the semantics of attacks. Evaluated on 100,000 network traffic samples, models trained on PHANTOM data achieve 98% weighted accuracy on real attacks. Statistical analyses confirm that the synthetic data preserves authentic distributions and diversity. Limitations in generating rare attack types are noted, highlighting challenges with severe class imbalance. This work advances the generation of synthetic data for training robust, privacy-preserving detection systems.
zh
[AI-110] Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework
【速读】:该论文旨在解决物理仿真模型与实际物理现象之间的偏差问题,即“无知模型”(ignorance model),这种偏差源于未建模效应或简化假设。传统纯数据驱动方法虽可学习系统行为,但需覆盖全空间和时间域的高质量大数据,在现实场景中难以实现。为此,论文提出一种基于图神经网络(Graph Neural Networks, GNNs)的混合孪生(hybrid twin)方法,其关键在于:利用已有的物理模型捕捉整体行为后,仅需学习残差形式的无知成分——该成分复杂度显著低于完整物理响应,因而可用少量数据建模;同时,GNN能够从稀疏空间测量中学习缺失物理的空间模式,从而在不依赖密集时空参数数据的前提下,实现对物理模型的数据增强修正,提升仿真精度与可解释性,并具备跨几何、网格和载荷配置的泛化能力。
链接: https://arxiv.org/abs/2512.15767
作者: M. Gorpinich(1 and 2),B. Moya(2),S. Rodriguez(2),F. Meraghni(2),Y. Jaafra(1),A. Briot(1),M. Henner(1),R. Leon(1),F. Chinesta(2 and 3) ((1) Valeo, (2) PIMM Lab. ENSAM Institute of Technology, (3) CNRS@CREATE LTD. Singapore)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 14 figures
Abstract:Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated for instance by using the Finite Element Method (FEM). However, these models often exhibit discrepancies from reality due to unmodeled effects or simplifying assumptions. We refer to this gap as the ignorance model. While purely data-driven approaches attempt to learn full system behavior, they require large amounts of high-quality data across the entire spatial and temporal domain. In real-world scenarios, such information is unavailable, making full data-driven modeling unreliable. To overcome this limitation, we model the ignorance component using a hybrid twin approach, instead of simulating phenomena from scratch. Since physics-based models approximate the overall behavior of the phenomena, the remaining ignorance is typically lower in complexity than the full physical response; therefore, it can be learned with significantly fewer data. A key difficulty, however, is that spatial measurements are sparse, and obtaining data measuring the same phenomenon for different spatial configurations is challenging in practice. Our contribution is to overcome this limitation by using Graph Neural Networks (GNNs) to represent the ignorance model. GNNs learn the spatial pattern of the missing physics even when the number of measurement locations is limited. This allows us to enrich the physics-based model with data-driven corrections without requiring dense spatial, temporal and parametric data. To showcase the performance of the proposed method, we evaluate this GNN-based hybrid twin on nonlinear heat transfer problems across different meshes, geometries, and load positions. Results show that the GNN successfully captures the ignorance and generalizes corrections across spatial configurations, improving simulation accuracy and interpretability, while minimizing data requirements.
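混合孪生的核心是只学习“无知”残差而非完整响应;这一数据流可以用一个与论文无关的玩具回归说明。真实方法用 GNN 在网格节点上学习残差,此处以低阶多项式拟合代替,仅演示“物理预测 + 残差修正”的组合方式。

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (200, 1))
y_true = np.sin(2 * np.pi * x) + 0.3 * x**2          # 真实现象
y_physics = np.sin(2 * np.pi * x)                    # 物理模型(有系统性偏差)

# 无知残差:比完整响应简单得多,少量数据即可拟合
residual = y_true - y_physics
coeffs = np.polyfit(x.ravel(), residual.ravel(), deg=2)   # 用低阶多项式代替 GNN
correction = np.polyval(coeffs, x.ravel()).reshape(-1, 1)

y_hybrid = y_physics + correction                    # 混合孪生预测
print("physics-only MAE:", np.abs(y_true - y_physics).mean())
print("hybrid twin MAE :", np.abs(y_true - y_hybrid).mean())
```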
zh
[AI-111] LOOPRAG: Enhancing Loop Transformation Optimization with Retrieval-Augmented Large Language Models ASPLOS2026
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在循环优化(loop transformation)任务中表现不佳的问题,即LLMs常因缺乏对循环结构语义和优化目标的精准理解而产生错误或次优的变换策略,从而错失性能提升机会。解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的框架LOOPRAG:首先通过参数化方法挖掘循环属性以生成多样且合法的示例代码作为示范源;其次设计一种基于循环特征的感知算法,在代码检索中平衡相似性与多样性,获取最具信息量的示例;最后引入反馈驱动的迭代机制,将编译、测试及性能结果作为反馈信号指导LLM生成更准确的优化代码,并通过变异测试、覆盖率分析与差异测试确保优化前后代码等价性。该方案显著提升了LLMs在静态控制部分(Static Control Part)上的循环优化能力。
链接: https://arxiv.org/abs/2512.15766
作者: Yijie Zhi,Yayu Cao,Jianhua Dai,Xiaoyang Han,Jingwen Pu,Qingran Wu,Sheng Cheng,Ming Cai
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: Accepted to ASPLOS 2026
Abstract:Loop transformations are semantics-preserving optimization techniques, widely used to maximize objectives such as parallelism. Despite decades of research, applying the optimal composition of loop transformations remains challenging due to inherent complexities, including cost modeling for optimization objectives. Recent studies have explored the potential of Large Language Models (LLMs) for code optimization. However, our key observation is that LLMs often struggle with effective loop transformation optimization, frequently leading to errors or suboptimal optimization, thereby missing opportunities for performance improvements. To bridge this gap, we propose LOOPRAG, a novel retrieval-augmented generation framework designed to guide LLMs in performing effective loop optimization on Static Control Part. We introduce a parameter-driven method to harness loop properties, which trigger various loop transformations, and generate diverse yet legal example codes serving as a demonstration source. To effectively obtain the most informative demonstrations, we propose a loop-aware algorithm based on loop features, which balances similarity and diversity for code retrieval. To enhance correct and efficient code generation, we introduce a feedback-based iterative mechanism that incorporates compilation, testing and performance results as feedback to guide LLMs. Each optimized code undergoes mutation, coverage and differential testing for equivalence checking. We evaluate LOOPRAG on PolyBench, TSVC and LORE benchmark suites, and compare it against compilers (GCC-Graphite, Clang-Polly, Perspective and ICX) and representative LLMs (DeepSeek and GPT-4). The results demonstrate average speedups over base compilers of up to 11.20×, 14.34×, and 9.29× for PolyBench, TSVC, and LORE, respectively, and speedups over base LLMs of up to 11.97×, 5.61×, and 11.59×.
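LOOPRAG 检索示例代码时要在“与当前循环相似”与“彼此多样”之间折中;这类需求常用最大边际相关(MMR)实现。下面给出通用向量版的最小示意:论文实际的 loop-aware 算法基于循环特征,此处的随机向量仅代表某种假设的特征嵌入。

```python
import numpy as np

def mmr_select(query, cands, k=3, lam=0.7):
    """在候选向量中选 k 个:lam 偏向与查询相似,(1-lam) 惩罚彼此冗余。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(len(cands)):
            if i in selected:
                continue
            sim_q = cos(query, cands[i])                              # 相似性
            sim_s = max((cos(cands[i], cands[j]) for j in selected),
                        default=0.0)                                  # 冗余度
            score = lam * sim_q - (1 - lam) * sim_s
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
print(mmr_select(rng.normal(size=8), rng.normal(size=(10, 8))))
```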
zh
[AI-112] AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)和小型语言模型(Small Language Models, SLMs)在微调过程中计算资源消耗高、内存占用大以及效率低的问题。传统全参数微调方法成本高昂,而现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法如LoRA虽能降低资源开销,但受限于固定子空间更新策略,可能影响性能表现。其解决方案的关键在于提出AdaGradSelect——一种基于梯度范数自适应选择Transformer块进行更新的机制:通过结合Dirichlet分布采样与ε-greedy探索策略,在训练初期广泛探索不同层的更新可能性,并随训练进程逐步聚焦于梯度范数最高的关键模块,从而实现更优的性能-效率平衡。实验表明,该方法在保持接近全微调性能的同时,可提升约12%的训练速度并节省35%的GPU内存,且在GSM8K和MATH等基准上优于LoRA(rank 256)。
链接: https://arxiv.org/abs/2512.15764
作者: Anshul Kumar,Gagan Raj Gupta,Manisha Chawla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict the training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It uses a combination of Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, and an epsilon-greedy exploration strategy. This lets the method explore different blocks in early training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.
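摘要中的选块逻辑(按梯度范数选块,并用 Dirichlet 采样加 ε-greedy 平衡探索与利用)可以浓缩成一个选择函数。下面是本文按摘要描述重写的简化示意,先验构造与超参均为假设,并非论文源码。

```python
import numpy as np

rng = np.random.default_rng(0)

def select_blocks(grad_norms, update_counts, k=2, eps=0.2):
    """返回本轮要更新的 k 个 Transformer 块的下标。"""
    n = len(grad_norms)
    if rng.uniform() < eps:                          # ε-greedy:随机探索
        return rng.choice(n, size=k, replace=False)
    # Dirichlet 采样:历史更新次数越多的块,先验浓度越大;再按梯度范数加权
    alpha = 1.0 + np.asarray(update_counts, dtype=float)
    probs = rng.dirichlet(alpha) * np.asarray(grad_norms)
    probs /= probs.sum()
    return rng.choice(n, size=k, replace=False, p=probs)

grad_norms = [0.1, 2.3, 0.5, 1.8]                    # 各块的梯度范数(示例值)
update_counts = [0, 0, 0, 0]
for step in range(5):
    chosen = select_blocks(grad_norms, update_counts)
    for i in chosen:
        update_counts[i] += 1                        # 仅对被选中的块做反向更新
    print(step, sorted(chosen))
```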
zh
[AI-113] Cross-Sample Augmented Test-Time Adaptation for Personalized Intraoperative Hypotension Prediction AAAI2026
【速读】:该论文旨在解决术中低血压(Intraoperative Hypotension, IOH)预测中存在的个体差异导致的准确性不足问题,尤其是在事件稀少情况下测试时适应(Test-Time Adaptation, TTA)因样本不足而难以可靠训练的挑战。解决方案的关键在于提出一种新颖的跨样本增强测试时适应框架(Cross-Sample Augmented Test-Time Adaptation, CSA-TTA):首先构建包含历史数据中低血压与非低血压样本的跨样本记忆库;随后采用粗粒度到细粒度的检索策略,先通过K-Shape聚类识别代表性簇中心,再基于当前患者信号检索语义相似的Top-K样本用于测试时训练;同时引入自监督掩码重建和回溯序列预测信号以提升模型对术中快速细微动态变化的适应能力。该方法在VitalDB和真实院内数据集上显著提升了召回率与F1分数,验证了其在微调和零样本场景下的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2512.15762
作者: Kanxue Li,Yibing Zhan,Hua Jin,Chongchong Qi,Xu Lin,Baosheng Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026
Abstract:Intraoperative hypotension (IOH) poses significant surgical risks, but accurate prediction remains challenging due to patient-specific variability. While test-time adaptation (TTA) offers a promising approach for personalized prediction, the rarity of IOH events often leads to unreliable test-time training. To address this, we propose CSA-TTA, a novel Cross-Sample Augmented Test-Time Adaptation framework that enhances training by incorporating hypotension events from other individuals. Specifically, we first construct a cross-sample bank by segmenting historical data into hypotensive and non-hypotensive samples. Then, we introduce a coarse-to-fine retrieval strategy for building test-time training data: we initially apply K-Shape clustering to identify representative cluster centers and subsequently retrieve the top-K semantically similar samples based on the current patient signal. Additionally, we integrate both self-supervised masked reconstruction and retrospective sequence forecasting signals during training to enhance model adaptability to rapid and subtle intraoperative dynamics. We evaluate the proposed CSA-TTA on both the VitalDB dataset and a real-world in-hospital dataset by integrating it with state-of-the-art time series forecasting models, including TimesFM and UniTS. CSA-TTA consistently enhances performance across settings-for instance, on VitalDB, it improves Recall and F1 scores by +1.33% and +1.13%, respectively, under fine-tuning, and by +7.46% and +5.07% in zero-shot scenarios-demonstrating strong robustness and generalization.
zh
[AI-114] ReactorFold: Generative discovery of nuclear reactor cores via emergent physical reasoning
【速读】:该论文旨在解决核反应堆核心设计中因受限于人类预设配置空间而导致难以发现根本性新拓扑结构的问题。传统方法如确定性算法、元启发式算法和机器学习辅助方法均依赖固定的设计空间搜索,限制了创新性设计的探索能力。其解决方案的关键在于提出一种名为ReactorFold的生成式框架,将燃料组件设计重构为语言模型的序列建模问题,并利用蒙特卡洛数据、参数高效微调(Parameter-Efficient Fine-Tuning)以及直接偏好优化(Direct Preference Optimization, DPO)训练模型,使其能够从复杂中子学交互中学习隐含结构并单次前向传播生成候选布局。特别地,DPO对齐模型展现出涌现的设计空间扩展能力:尽管仅在固定钆(Gd)棒数量的配置上训练,仍能自主调整Gd库存以满足严格的功率峰因子约束,并发现高性能的非对称构型,突破传统对称加载启发式限制,证明语言模型可内化因果物理关系并超越人为设定的设计边界。
链接: https://arxiv.org/abs/2512.15756
作者: Yoonpyo Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Designing nuclear reactor cores requires navigating large discrete design spaces governed by complex neutronic interactions. Traditional deterministic, metaheuristic, and machine-learning-assisted methods search within fixed, human-defined configuration spaces, limiting their ability to discover fundamentally new design topologies. Here we introduce ReactorFold, a generative framework that reformulates fuel-assembly design as a sequence modeling problem for language models. Using Monte Carlo data, parameter-efficient fine-tuning, and Direct Preference Optimization (DPO), the model learns the latent structure of a pressurized-water-reactor assembly and generates candidate layouts in a single forward pass. Notably, the DPO-aligned model exhibits emergent design-space expansion: despite being trained exclusively on configurations with a fixed number of gadolinium burnable absorber (Gd) rods, it autonomously adjusts Gd inventory to satisfy strict power-peaking constraints. The model also discovers high-performing asymmetric configurations that challenge conventional symmetric loading heuristics, accessing design regimes inaccessible to conventional search methods and demonstrating that language models can internalize causal physical relationships and transcend human-imposed design constraints.
zh
[AI-115] TAO-Net: Two-stage Adaptive OOD Classification Network for Fine-grained Encrypted Traffic Classification
【速读】:该论文旨在解决加密流量分类中因新应用不断涌现而导致的分布外(Out-of-Distribution, OOD)流量识别难题,即现有方法依赖预定义类别,难以有效处理未见过的流量模式,且多数仅将未知流量归为单一“其他”类,缺乏细粒度分类能力。其解决方案的关键在于提出两阶段自适应OOD分类网络(Two-stage Adaptive OOD classification Network, TAO-Net):第一阶段采用融合Transformer-based层间变换平滑性与特征分析的混合OOD检测机制,精准区分分布内(In-Distribution, ID)与OOD流量;第二阶段则利用大语言模型(Large Language Models, LLMs)结合新颖的语义增强提示策略,将OOD分类任务转化为生成式任务,从而实现无需预定义标签的灵活细粒度分类,显著提升对新兴网络应用的识别准确率。
链接: https://arxiv.org/abs/2512.15753
作者: Zihao Wang,Wei Peng,Junming Zhang,Jian Li,Wenxin Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Encrypted traffic classification aims to identify applications or services by analyzing network traffic data. One of the critical challenges is the continuous emergence of new applications, which generates Out-of-Distribution (OOD) traffic patterns that deviate from known categories and are not well represented by predefined models. Current approaches rely on predefined categories, which limits their effectiveness in handling unknown traffic types. Although some methods mitigate this limitation by simply classifying unknown traffic into a single “Other” category, they fail to make a fine-grained classification. In this paper, we propose a Two-stage Adaptive OOD classification Network (TAO-Net) that achieves accurate classification for both In-Distribution (ID) and OOD encrypted traffic. The method incorporates an innovative two-stage design: the first stage employs a hybrid OOD detection mechanism that integrates transformer-based inter-layer transformation smoothness and feature analysis to effectively distinguish between ID and OOD traffic, while the second stage leverages large language models with a novel semantic-enhanced prompt strategy to transform OOD traffic classification into a generation task, enabling flexible fine-grained classification without relying on predefined labels. Experiments on three datasets demonstrate that TAO-Net achieves 96.81-97.70% macro-precision and 96.77-97.68% macro-F1, outperforming previous methods that only reach 44.73-86.30% macro-precision, particularly in identifying emerging network applications.
zh
[AI-116] GLOW: Graph-Language Co-Reasoning for Agentic Workflow Performance Prediction
【速读】:该论文旨在解决代理工作流(Agentic Workflows, AWs)性能预测的准确性与可扩展性问题,现有方法因无法同时捕捉AWs中复杂的拓扑依赖关系和深层语义逻辑,导致预测效果受限。其解决方案的关键在于提出GLOW框架,通过融合图神经网络(GNN)的结构建模能力与大语言模型(LLM)的推理能力:首先设计一个面向图任务指令微调的图导向LLM,提取具有拓扑感知能力的语义特征;随后将这些特征与GNN编码的结构表示进行融合,并采用对比对齐策略优化潜在空间,从而有效区分高质量AWs。
链接: https://arxiv.org/abs/2512.15751
作者: Wei Guan,Jian Cao,Jinyu Cai,Qiqi Cai,Jianqi Gao,See-Kiong Ng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Agentic Workflows (AWs) have emerged as a promising paradigm for solving complex tasks. However, the scalability of automating their generation is severely constrained by the high cost and latency of execution-based evaluation. Existing AW performance prediction methods act as surrogates but fail to simultaneously capture the intricate topological dependencies and the deep semantic logic embedded in AWs. To address this limitation, we propose GLOW, a unified framework for AW performance prediction that combines the graph-structure modeling capabilities of GNNs with the reasoning power of LLMs. Specifically, we introduce a graph-oriented LLM, instruction-tuned on graph tasks, to extract topologically aware semantic features, which are fused with GNN-encoded structural representations. A contrastive alignment strategy further refines the latent space to distinguish high-quality AWs. Extensive experiments on FLORA-Bench show that GLOW outperforms state-of-the-art baselines in prediction accuracy and ranking utility.
zh
[AI-117] Prompt-to-Parts: Generative AI for Physical Assembly and Scalable Instructions
【速读】:该论文旨在解决从自然语言描述生成可物理实现的装配指令这一难题,尤其针对传统基于像素的扩散模型或计算机辅助设计(CAD)模型在复杂装配序列生成和组件互换性支持方面的不足。解决方案的关键在于提出了一种新颖的“积木式”方法(“bag of bricks” method),通过使用LDraw作为富含文本的中间表示,将自然语言转化为具有几何有效性、连接约束和可构建顺序的离散部件装配步骤;该方法利用大语言模型(LLM)结合编程工具,在超过3000个零件的原型中成功生成了可执行的构建序列,并实现了模块化、可扩展且高保真的从语义设计意图到可制造输出的映射,从而为制造与工程原型开发提供了一种新的物理API(Physical API)。
链接: https://arxiv.org/abs/2512.15743
作者: David Noever
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present a framework for generating physically realizable assembly instructions from natural language descriptions. Unlike unconstrained text-to-3D approaches, our method operates within a discrete parts vocabulary, enforcing geometric validity, connection constraints, and buildability ordering. Using LDraw as a text-rich intermediate representation, we demonstrate that large language models can be guided with tools to produce valid step-by-step construction sequences and assembly instructions for brick-based prototypes of more than 3000 assembly parts. We introduce a Python library for programmatic model generation and evaluate buildable outputs on complex satellites, aircraft, and architectural domains. The approach aims for demonstrable scalability, modularity, and fidelity that bridges the gap between semantic design intent and manufacturable output. Physical prototyping follows from natural language specifications. The work proposes a novel elemental lingua franca as a key missing piece from the previous pixel-based diffusion methods or computer-aided design (CAD) models that fail to support complex assembly instructions or component exchange. Across four original designs, this novel “bag of bricks” method thus functions as a physical API: a constrained vocabulary connecting precisely oriented brick locations to a “bag of words” through which arbitrary functional requirements compile into material reality. Such a consistent and repeatable AI representation opens new design options while guiding natural language implementations in manufacturing and engineering prototyping.
zh
[AI-118] The Principle of Proportional Duty: A Knowledge-Duty Framework for Ethical Equilibrium in Human and Artificial Systems
【速读】:该论文旨在解决传统伦理框架在不确定性情境下难以有效建模决策责任的问题,尤其针对道德义务如何随认知状态变化而动态调整这一核心难题。其解决方案的关键在于提出“比例责任原则”(Principle of Proportional Duty, PPD),该原则通过数学形式化将道德责任分解为行动责任(Action Duty)与修复责任(Repair Duty)的动态转换关系:随着不确定性上升,行动责任按比例转化为主动验证和消除不确定性的修复责任。该机制由公式 $ D_{\text{total}} = K[(1-HI) + HI \cdot g(C_{\text{signal}})] $ 表征,其中总责任依赖于知识水平(K)、谦逊/不确定性系数(HI)和情境信号强度(C_signal)。研究表明,维持正的基线谦逊系数(λ > 0)可显著提升责任分配的稳定性并抑制过度自信决策风险,从而为可审计的人工智能系统提供可计算、跨领域的伦理建模路径。
链接: https://arxiv.org/abs/2512.15740
作者: Timothy Prescher
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 46 pages, 2 figures. Preregistered at OSF on Nov 14, 2025 ( this https URL ). Includes comparative analysis with OpenAI’s ‘Confessions’ paper (Dec 3, 2025)
Abstract:Traditional ethical frameworks often struggle to model decision-making under uncertainty, treating it as a simple constraint on action. This paper introduces the Principle of Proportional Duty (PPD), a novel framework that models how ethical responsibility scales with an agent’s epistemic state. The framework reveals that moral duty is not lost to uncertainty but transforms: as uncertainty increases, Action Duty (the duty to act decisively) is proportionally converted into Repair Duty (the active duty to verify, inquire, and resolve uncertainty). This dynamic is expressed by the equation D_total = K[(1-HI) + HI * g(C_signal)], where Total Duty is a function of Knowledge (K), Humility/Uncertainty (HI), and Contextual Signal Strength (C_signal). Monte Carlo simulations demonstrate that systems maintaining a baseline humility coefficient (lambda > 0) produce more stable duty allocations and reduce the risk of overconfident decision-making. By formalizing humility as a system parameter, the PPD offers a mathematically tractable approach to moral responsibility that could inform the development of auditable AI decision systems. This paper applies the framework across four domains, clinical ethics, recipient-rights law, economic governance, and artificial intelligence, to demonstrate its cross-disciplinary validity. The findings suggest that proportional duty serves as a stabilizing principle within complex systems, preventing both overreach and omission by dynamically balancing epistemic confidence against contextual risk.
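摘要给出的公式可以直接落成几行代码。下面按 D_total = K[(1-HI) + HI·g(C_signal)] 实现一个最小版本;其中 g 取恒等函数只是本文的演示性假设,论文中 g 的具体形式可能不同。

```python
def total_duty(K: float, HI: float, c_signal: float, g=lambda c: c) -> dict:
    """按比例责任原则分解总责任:HI 越大,行动责任越多地转化为修复责任。"""
    action_duty = K * (1 - HI)             # 果断行动的责任
    repair_duty = K * HI * g(c_signal)     # 核实、询问、消除不确定性的责任
    return {"action": action_duty, "repair": repair_duty,
            "total": action_duty + repair_duty}

# 知识水平 K=1,不确定性 HI=0.4,情境信号强度 0.8
print(total_duty(1.0, 0.4, 0.8))   # {'action': 0.6, 'repair': 0.32, 'total': 0.92}
```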
zh
[AI-119] Hybrid Quantum-Classical Ensemble Learning for S&P 500 Directional Prediction
【速读】: This paper tackles the difficulty of pushing directional accuracy in financial market prediction beyond the 55%-57% ceiling imposed by high noise, non-stationarity, and market efficiency. The key is a hybrid ensemble framework that combines quantum sentiment analysis, a Decision Transformer architecture, and smart model selection. Three innovations drive the gains. First, architecture diversity matters more than dataset diversity: combining different algorithms (LSTM, XGBoost, Random Forest, and others) on the same data reaches 60.14% directional accuracy, clearly above training a single architecture on multiple datasets (52.80%). Second, a 4-qubit variational circuit strengthens sentiment feature extraction, adding 0.8%-1.5% accuracy per model. Third, confidence-based filtering of weak predictors (excluding models below 52% accuracy) makes the Top-7 ensemble optimal (60.14% vs. only 51.2% for all 35 models). Experiments on 2020-2023 data across varied market regimes confirm both statistical significance and practical trading value (Sharpe ratio of 1.2).
链接: https://arxiv.org/abs/2512.15738
作者: Abraham Itzhak Weinberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
备注:
Abstract:Financial market prediction is a challenging application of machine learning, where even small improvements in directional accuracy can yield substantial value. Most models struggle to exceed 55-57% accuracy due to high noise, non-stationarity, and market efficiency. We introduce a hybrid ensemble framework combining quantum sentiment analysis, Decision Transformer architecture, and strategic model selection, achieving 60.14% directional accuracy on S&P 500 prediction, a 3.10% improvement over individual models. Our framework addresses three limitations of prior approaches. First, architecture diversity dominates dataset diversity: combining different learning algorithms (LSTM, Decision Transformer, XGBoost, Random Forest, Logistic Regression) on the same data outperforms training identical architectures on multiple datasets (60.14% vs. 52.80%), confirmed by correlation analysis (r > 0.6 among same-architecture models). Second, a 4-qubit variational quantum circuit enhances sentiment analysis, providing +0.8% to +1.5% gains per model. Third, smart filtering excludes weak predictors (accuracy < 52%), improving ensemble performance (Top-7 models: 60.14% vs. all 35 models: 51.2%). We evaluate on 2020-2023 market data across seven instruments, covering diverse regimes including the COVID-19 crash and inflation-driven correction. McNemar's test confirms statistical significance (p < 0.05). Preliminary backtesting with confidence-based filtering (6+ model consensus) yields a Sharpe ratio of 1.2 versus buy-and-hold's 0.8, demonstrating practical trading potential.
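The "smart filtering plus consensus" step can be illustrated with a short sketch. The 52% accuracy threshold and the 6-model quorum come from the abstract; the model names, data, and voting details are placeholder assumptions:

```python
import numpy as np

def filtered_consensus(val_acc: dict, preds: dict, min_acc=0.52, quorum=6):
    """val_acc: model -> validation accuracy; preds: model -> array of {0,1} signals."""
    kept = [m for m, acc in val_acc.items() if acc >= min_acc]  # drop weak predictors
    votes = np.stack([preds[m] for m in kept])                  # (n_models, n_days)
    up_votes = votes.sum(axis=0)
    signal = np.full(votes.shape[1], np.nan)                    # NaN = abstain (no quorum)
    signal[up_votes >= quorum] = 1.0                            # consensus "up"
    signal[(len(kept) - up_votes) >= quorum] = 0.0              # consensus "down"
    return kept, signal

rng = np.random.default_rng(0)
models = [f"m{i}" for i in range(10)]                           # placeholder model names
val_acc = {m: rng.uniform(0.48, 0.62) for m in models}
preds = {m: rng.integers(0, 2, size=30) for m in models}        # 30 simulated trading days
kept, signal = filtered_consensus(val_acc, preds)
print(f"kept {len(kept)}/10 models; traded on {int(np.isfinite(signal).sum())}/30 days")
```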
zh
[AI-120] Anubuddhi: A Multi-Agent AI System for Designing and Simulating Quantum Optics Experiments
【速读】: This paper aims to remove the heavy reliance on specialized programming expertise in designing and simulating quantum optics experiments, lowering the barrier for researchers and educators. Its core solution is Anubuddhi, a multi-agent AI system that generates and validates quantum optics experiment designs directly from natural language prompts. The key innovations are: (1) automatic layout of components from a three-tier toolbox via semantic retrieval, mapping textual descriptions to optical structures; (2) knowledge-augmented generation combined with a dual-mode validation mechanism (QuTiP and FreeSim) to ensure correct physics architecture and numerical accuracy; and (3) empirical evidence that free-form simulation outperforms constrained frameworks in most scenarios, underscoring that the diversity of quantum optics demands flexible mathematical modeling. The method achieves design-simulation alignment scores of 8-9/10 and supplies research and teaching with strong initial designs that users can iteratively refine.
链接: https://arxiv.org/abs/2512.15736
作者: S. K. Rithvik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
备注:
Abstract:We present Anubuddhi, a multi-agent AI system that designs and simulates quantum optics experiments from natural language prompts without requiring specialized programming knowledge. The system composes optical layouts by arranging components from a three-tier toolbox via semantic retrieval, then validates designs through physics simulation with convergent refinement. The architecture combines intent routing, knowledge-augmented generation, and dual-mode validation (QuTiP and FreeSim). We evaluated 13 experiments spanning fundamental optics (Hong-Ou-Mandel interference, Michelson/Mach-Zehnder interferometry, Bell states, delayed-choice quantum eraser), quantum information protocols (BB84 QKD, Franson interferometry, GHZ states, quantum teleportation, hyperentanglement), and advanced technologies (boson sampling, electromagnetically induced transparency, frequency conversion). The system achieves design-simulation alignment scores of 8–9/10, with simulations faithfully modeling intended physics. A critical finding distinguishes structural correctness from quantitative accuracy: high alignment confirms correct physics architecture, while numerical predictions require expert review. Free-form simulation outperformed constrained frameworks for 11/13 experiments, revealing that quantum optics diversity demands flexible mathematical representations. The system democratizes computational experiment design for research and pedagogy, producing strong initial designs users can iteratively refine through conversation.
zh
[AI-121] DiscoverDCP: A Data-Driven Approach for Construction of Disciplined Convex Programs via Symbolic Regression
【速读】: This paper addresses how to automatically discover convex surrogate models for system identification that are globally convex while remaining accurate and flexible, which is especially relevant for safety-critical control and optimization. Traditional methods rely on convex forms with fixed parameterizations (such as quadratics) and struggle to balance expressiveness against verifiability. The key of DiscoverDCP is to integrate symbolic regression with the rule set of Disciplined Convex Programming (DCP), forcing every candidate model expression to follow DCP composition rules so that global convexity is guaranteed by construction. This avoids costly post-hoc convexity verification while admitting more relaxed and accurate functional forms, ultimately yielding interpretable, verifiable, and flexible convex models.
链接: https://arxiv.org/abs/2512.15721
作者: Sveinung Myhre
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: 6 pages, 2 figures. Code available at this https URL
Abstract:We propose DiscoverDCP, a data-driven framework that integrates symbolic regression with the rule sets of Disciplined Convex Programming (DCP) to perform system identification. By enforcing that all discovered candidate model expressions adhere to DCP composition rules, we ensure that the output expressions are globally convex by construction, circumventing the computationally intractable process of post-hoc convexity verification. This approach allows for the discovery of convex surrogates that exhibit more relaxed and accurate functional forms than traditional fixed-parameter convex expressions (e.g., quadratic functions). The proposed method produces interpretable, verifiable, and flexible convex models suitable for safety-critical control and optimization tasks.
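The DCP filter at the heart of this approach can be demonstrated with CVXPY, whose expression objects certify disciplined convexity by construction. The candidate expressions below are invented stand-ins for what a symbolic-regression search might propose; the paper's actual search operators are not shown:

```python
import cvxpy as cp

x = cp.Variable(2)
# Hypothetical candidates a symbolic search might generate; only expressions
# that verify as convex under DCP composition rules would be admitted.
candidates = {
    "quadratic":     cp.sum_squares(x),
    "log-sum-exp":   cp.log_sum_exp(x),
    "mixed/relaxed": cp.norm(x, 2) + 0.5 * cp.maximum(x[0], x[1]),
    "rejected":      cp.sqrt(x[0]) - cp.log(x[1]),  # concave + convex: fails the filter
}
for name, expr in candidates.items():
    # is_convex() certifies convexity by construction, so no post-hoc
    # convexity verification is needed for admitted candidates.
    print(f"{name:>13}: admissible = {expr.is_convex()}")
```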
zh
[AI-122] The Universe Learning Itself: On the Evolution of Dynamics from the Big Bang to Machine Intelligence
【速读】: The question this paper takes up is how to understand structure formation from the Big Bang to contemporary human societies and artificial learning systems as one continuous, cross-scale process, breaking the disciplinary boundaries (cosmology, astrophysics, geophysics, biology, cognitive science, machine intelligence) that fragment our picture of complex-system evolution. The key is a unified dynamical-systems framework that treats evolution at each scale as continuous dynamics on ever-richer state spaces, stitched together by phase transitions, symmetry-breaking events, and emergent attractors: from inflationary field dynamics and the growth of primordial perturbations, through gravitational instability sculpting the cosmic web and the dissipative collapse of baryonic matter into stars and planets, to planetary geochemical cycles defining nonequilibrium attractors. The origin of life is read as the emergence of self-maintaining reaction networks, evolutionary biology as flow on high-dimensional genotype-phenotype-environment manifolds, and the brain as an adaptive dynamical system operating near critical surfaces; human culture and technology (including modern machine learning and AI) are then interpreted as symbolic and institutional dynamics that implement and refine engineered learning flows while recursively reshaping their own state spaces. Throughout, the narrative stresses shared mathematical motifs, such as instability, bifurcation, multiscale coupling, and constrained flows on measure-zero subsets of the accessible state space, reading cosmic history as the evolution of dynamics itself, culminating in biological and artificial systems capable of modeling, predicting, and deliberately perturbing their own future trajectories.
链接: https://arxiv.org/abs/2512.16515
作者: Pradeep Singh,Mudasani Rushikesh,Bezawada Sri Sai Anurag,Balasubramanian Raman
机构: 未知
类目: Adaptation and Self-Organizing Systems (nlin.AO); Artificial Intelligence (cs.AI)
备注: 38 pages, 3 figures
Abstract:We develop a unified, dynamical-systems narrative of the universe that traces a continuous chain of structure formation from the Big Bang to contemporary human societies and their artificial learning systems. Rather than treating cosmology, astrophysics, geophysics, biology, cognition, and machine intelligence as disjoint domains, we view each as successive regimes of dynamics on ever-richer state spaces, stitched together by phase transitions, symmetry-breaking events, and emergent attractors. Starting from inflationary field dynamics and the growth of primordial perturbations, we describe how gravitational instability sculpts the cosmic web, how dissipative collapse in baryonic matter yields stars and planets, and how planetary-scale geochemical cycles define long-lived nonequilibrium attractors. Within these attractors, we frame the origin of life as the emergence of self-maintaining reaction networks, evolutionary biology as flow on high-dimensional genotype-phenotype-environment manifolds, and brains as adaptive dynamical systems operating near critical surfaces. Human culture and technology-including modern machine learning and artificial intelligence-are then interpreted as symbolic and institutional dynamics that implement and refine engineered learning flows which recursively reshape their own phase space. Throughout, we emphasize recurring mathematical motifs-instability, bifurcation, multiscale coupling, and constrained flows on measure-zero subsets of the accessible state space. Our aim is not to present any new cosmological or biological model, but a cross-scale, theoretical perspective: a way of reading the universe’s history as the evolution of dynamics itself, culminating (so far) in biological and artificial systems capable of modeling, predicting, and deliberately perturbing their own future trajectories.
zh
[AI-123] Interpretable Deep Learning for Stock Returns: A Consensus-Bottleneck Asset Pricing Model
【速读】: This paper addresses the difficulty traditional asset pricing models have in capturing how dispersed investor beliefs affect asset prices through consensus formation, especially the lack of structural interpretability in long-horizon risk premium prediction. The key is the Consensus-Bottleneck Asset Pricing Model (CB-APM), a partially interpretable neural network that mimics how sell-side analysts aggregate firm- and macro-level information, compressing dispersed beliefs into a "bottleneck" representation to accurately predict future risk premiums of U.S. equities while linking belief aggregation to expected returns in a structurally interpretable way. CB-APM improves both the accuracy and the explanatory power of long-horizon return forecasts, and regression and GRS pricing tests show that the learned consensus representations capture priced information not covered by traditional factor models, substantially deepening our understanding of belief-driven return dynamics.
链接: https://arxiv.org/abs/2512.16251
作者: Bong-Gyu Jang,Younwoo Jeong,Changeun Kim
机构: 未知
类目: Pricing of Securities (q-fin.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce the Consensus-Bottleneck Asset Pricing Model (CB-APM), a partially interpretable neural network that replicates the reasoning processes of sell-side analysts by capturing how dispersed investor beliefs are compressed into asset prices through a consensus formation process. By modeling this "bottleneck" to summarize firm- and macro-level information, CB-APM not only predicts future risk premiums of U.S. equities but also links belief aggregation to expected returns in a structurally interpretable manner. The model improves long-horizon return forecasts and outperforms standard deep learning approaches in both predictive accuracy and explanatory power. Comprehensive portfolio analyses show that CB-APM's out-of-sample predictions translate into economically meaningful payoffs, with monotonic return differentials and stable long-short performance across regularization settings. Empirically, CB-APM leverages consensus as a regularizer to amplify long-horizon predictability and yields interpretable consensus-based components that clarify how information is priced in returns. Moreover, regression and GRS-based pricing diagnostics reveal that the learned consensus representations capture priced variation only partially spanned by traditional factor models, demonstrating that CB-APM uncovers belief-driven structure in expected returns beyond the canonical factor space. Overall, CB-APM provides an interpretable and empirically grounded framework for understanding belief-driven return dynamics.
zh
[AI-124] Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins
【速读】: This paper targets the "undruggability" of intrinsically disordered proteins (IDPs), which lack stable secondary/tertiary structure even though roughly 80% of cancer-related proteins contain long disordered regions. The key is the design and implementation of StructBioReasoner, a scalable multi-agent system for designing biologics against IDP targets. Its core is a tournament-based reasoning framework in which specialized agents compete and cooperate while generating and refining therapeutic hypotheses, enabling efficient exploration of the vast conformational space. Through the federated agentic middleware Academy, the system integrates tools for literature synthesis, AI structure prediction, molecular simulation, and stability analysis, and coordinates their execution on HPC platforms, substantially strengthening biologics design capabilities for IDP targets.
链接: https://arxiv.org/abs/2512.15930
作者: Matthew Sinclair,Moeen Meigooni,Archit Vasan,Ozan Gokdemir,Xinran Lian,Heng Ma,Yadu Babuji,Alexander Brace,Khalid Hossain,Carlo Siebenschuh,Thomas Brettin,Kyle Chard,Christopher Henry,Venkatram Vishwanath,Rick L. Stevens,Ian T. Foster,Arvind Ramanathan
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: This manuscript is under peer review for acceptance to the Proceedings of the Platform for Advanced Scientific Computing (PASC) 26 Conference
Abstract:Intrinsically disordered proteins (IDPs) represent crucial therapeutic targets due to their significant role in disease – approximately 80% of cancer-related proteins contain long disordered regions – but their lack of stable secondary/tertiary structures makes them "undruggable". While recent computational advances, such as diffusion models, can design high-affinity IDP binders, translating these to practical drug discovery requires autonomous systems capable of reasoning across complex conformational ensembles and orchestrating diverse computational tools at scale. To address this challenge, we designed and implemented StructBioReasoner, a scalable multi-agent system for designing biologics that can be used to target IDPs. StructBioReasoner employs a novel tournament-based reasoning framework where specialized agents compete to generate and refine therapeutic hypotheses, naturally distributing computational load for efficient exploration of the vast design space. Agents integrate domain knowledge with access to literature synthesis, AI-structure prediction, molecular simulations, and stability analysis, coordinating their execution on HPC infrastructure via an extensible federated agentic middleware, Academy. We benchmark StructBioReasoner across Der f 21 and NMNAT-2 and demonstrate that over 50% of 787 designed and validated candidates for Der f 21 outperformed the human-designed reference binders from literature, in terms of improved binding free energy. For the more challenging NMNAT-2 protein, we identified three binding modes from 97,066 binders, including the well-studied NMNAT2:p53 interface. Thus, StructBioReasoner lays the groundwork for agentic reasoning systems for IDP therapeutic discovery on Exascale platforms.
zh
[AI-125] Dynamical Mechanisms for Coordinating Long-term Working Memory Based on the Precision of Spike-timing in Cortical Neurons
【速读】: The question this paper takes up is how working memory can be sustained over long time scales (hours), a phenomenon that conventional firing-rate coding models struggle to explain. The key proposal is the joint action of cortical traveling waves and spike-timing-dependent plasticity (STDP): a passing wave front synchronously drives synaptic inputs onto pyramidal and basket cells, and inhibitory rebound after a strong transient hyperpolarization can trigger a backpropagating action potential, precisely gating STDP within millisecond windows. Synaptic weight changes activated this way can persist for hours, forming a temporary second-tier network that supports long-term working memory.
链接: https://arxiv.org/abs/2512.15891
作者: Terrence J. Sejnowski
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 26 pages, 12 figures
Abstract:In the last century, most sensorimotor studies of cortical neurons relied on average firing rates. Rate coding is efficient for fast sensorimotor processing that occurs within a few seconds. Much less is known about long-term working memory with a time scale of hours (Ericsson and Kintsch, 1995). The discovery of the millisecond precision of spike initiation in cortical neurons was unexpected (Mainen and Sejnowski, 1995). Even more striking was the precision of spiking in vivo, in response to rapidly fluctuating sensory inputs, suggesting that neural circuits could, in principle, preserve and manipulate sensory information through spike timing. It could support spike-timing-dependent plasticity (STDP), which is triggered by the relative timing of spikes between presynaptic and postsynaptic neurons in the millisecond range. What spike-timing mechanisms could regulate STDP in vivo? Cortical traveling waves have been observed across many frequency bands with high temporal precision. Traveling waves have wave fronts that could link spike timing to STDP. As a wave front passes through a cortical column, excitatory synapses on the dendrites of both pyramidal and basket cells are synchronously stimulated. Inhibitory basket cells form a calyx on pyramidal cell bodies, and inhibitory rebound following a strong transient hyperpolarization can trigger a backpropagating action potential, which arrives shortly after the excitatory inputs on pyramidal dendrites. STDP activated in this way could persist for hours, creating a second-tier network. This temporary network could support long-term working memory, a cognitive network riding above the long-term sensorimotor network. On their own, traveling waves and STDP have not yet yielded new insights into cortical function. Together, they could be responsible for how we think (Sejnowski, 2025).
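For readers who want the plasticity rule concrete, below is the textbook pair-based STDP kernel the abstract refers to. The amplitudes and time constants are generic choices, not a model of the traveling-wave circuit itself:

```python
import numpy as np

def stdp_dw(dt_ms: float, a_plus=0.01, a_minus=0.012,
            tau_plus=20.0, tau_minus=20.0) -> float:
    """Weight change for a spike-time difference dt = t_post - t_pre (ms).

    Pre-before-post (dt > 0) potentiates; post-before-pre (dt < 0) depresses."""
    if dt_ms >= 0:
        return a_plus * float(np.exp(-dt_ms / tau_plus))
    return -a_minus * float(np.exp(dt_ms / tau_minus))

# Millisecond-scale timing matters: a 5 ms pairing changes the weight far more
# than a 50 ms pairing, which is why precise spike timing (e.g., at a passing
# wave front) can gate plasticity.
for dt in (5.0, 50.0, -5.0):
    print(f"dt={dt:+5.1f} ms  dw={stdp_dw(dt):+.5f}")
```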
zh
[AI-126] Bayesian Modeling for Uncertainty Management in Financial Risk Forecasting and Compliance
【速读】: This paper addresses imprecise uncertainty quantification, limited model interpretability, and poor computational efficiency in financial risk management, focusing on three key scenarios: market volatility forecasting, fraud detection, and compliance monitoring. The key is a unified Bayesian analytics framework that quantifies risk through probabilistic modeling: a discount-factor dynamic linear model (DLM) improves the reliability of VaR estimation, Bayesian logistic regression strengthens recall and AUC-ROC in fraud detection, and a hierarchical Beta state-space model provides transparent and adaptive compliance risk assessment. The framework further exploits GPU acceleration for speedups of up to 50x, markedly improving timeliness and practicality in real applications.
链接: https://arxiv.org/abs/2512.15739
作者: Sharif Al Mamun,Rakib Hossain,Md. Jobayer Rahman,Malay Kumar Devnath,Farhana Afroz,Lisan Al Amin
机构: 未知
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:A Bayesian analytics framework that precisely quantifies uncertainty offers a significant advance for financial risk management. We develop an integrated approach that consistently enhances the handling of risk in market volatility forecasting, fraud detection, and compliance monitoring. Our probabilistic, interpretable models deliver reliable results: We evaluate the performance of one-day-ahead 95% Value-at-Risk (VaR) forecasts on daily SP 500 returns, with a training period from 2000 to 2019 and an out-of-sample test period spanning 2020 to 2024. Formal tests of unconditional (Kupiec) and conditional (Christoffersen) coverage reveal that an LSTM baseline achieves near-nominal calibration. In contrast, a GARCH(1,1) model with Student-t innovations underestimates tail risk. Our proposed discount-factor DLM model produces a slightly liberal VaR estimate, with evidence of clustered violations. Bayesian logistic regression improves recall and AUC-ROC for fraud detection, and a hierarchical Beta state-space model provides transparent and adaptive compliance risk assessment. The pipeline is distinguished by precise uncertainty quantification, interpretability, and GPU-accelerated analysis, delivering up to 50x speedup. Remaining challenges include sparse fraud data and proxy compliance labels, but the framework enables actionable risk insights. Future expansion will extend feature sets, explore regime-switching priors, and enhance scalable inference.
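The Kupiec (unconditional coverage) test used to evaluate the 95% VaR forecasts is compact enough to sketch. The return series and the fixed VaR line below are simulated placeholders:

```python
import numpy as np
from scipy.stats import chi2

def kupiec_pof(violations: np.ndarray, p: float = 0.05):
    """Likelihood-ratio test that the VaR violation rate equals the nominal rate p."""
    T, x = violations.size, int(violations.sum())
    pi = x / T                                          # observed violation rate
    if pi in (0.0, 1.0):                                # degenerate: LR undefined
        return np.inf, 0.0
    log_l0 = (T - x) * np.log(1 - p) + x * np.log(p)    # log-likelihood under H0
    log_l1 = (T - x) * np.log(1 - pi) + x * np.log(pi)  # under the observed rate
    lr = -2.0 * (log_l0 - log_l1)
    return lr, 1.0 - chi2.cdf(lr, df=1)                 # chi-squared with 1 dof

rng = np.random.default_rng(1)
returns = rng.standard_t(df=5, size=1250) * 0.01        # ~5 years of daily returns
var_95 = np.full_like(returns, -0.018)                  # deliberately crude fixed VaR
lr, pval = kupiec_pof(returns < var_95)
print(f"violations={int((returns < var_95).sum())}/1250  LR={lr:.2f}  p-value={pval:.3f}")
```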
zh
[AI-127] Deep Reinforcement Learning Optimization for Uncertain Nonlinear Systems via Event-Triggered Robust Adaptive Dynamic Programming
【速读】: This paper addresses the tension between disturbance rejection and computational efficiency in complex dynamical systems: achieving near-optimal control without an accurate system model while avoiding unnecessary computation. The key is a unified control architecture that couples a reinforcement learning (RL) controller with an extended state observer (ESO) for disturbance estimation, together with an event-triggered mechanism (ETM) that updates the learning module's parameters only when the state deviation exceeds a preset threshold, sharply reducing computational load while guaranteeing closed-loop stability. Policy approximation is realized through value-iteration-based adaptive dynamic programming (ADP), and a Lyapunov analysis establishes stability and strong robustness.
链接: https://arxiv.org/abs/2512.15735
作者: Ningwei Bai,Chi Pui Chan,Qichen Yin,Tengyang Gong,Yunda Yan,Zezhi Tang
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 9 pages, 9 figures, 2 numerical examples. This version presents a unified event-triggered ESO-ADP control framework, including stability analysis, algorithm description, and simulation studies. A journal extension with full hybrid-system stability proofs will follow
Abstract:This work proposes a unified control architecture that couples a Reinforcement Learning (RL)-driven controller with a disturbance-rejection Extended State Observer (ESO), complemented by an Event-Triggered Mechanism (ETM) to limit unnecessary computations. The ESO is utilized to estimate the system states and the lumped disturbance in real time, forming the foundation for effective disturbance compensation. To obtain near-optimal behavior without an accurate system description, a value-iteration-based Adaptive Dynamic Programming (ADP) method is adopted for policy approximation. The inclusion of the ETM ensures that parameter updates of the learning module are executed only when the state deviation surpasses a predefined bound, thereby preventing excessive learning activity and substantially reducing computational load. A Lyapunov-oriented analysis is used to characterize the stability properties of the resulting closed-loop system. Numerical experiments further confirm that the developed approach maintains strong control performance and disturbance tolerance, while achieving a significant reduction in sampling and processing effort compared with standard time-triggered ADP schemes.
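A skeleton of the event-triggered learning loop may help: observer and plant run every step, but the learning update fires only when the state deviation crosses a threshold. The toy dynamics, observer stand-in, and update rule are assumptions, not the paper's equations:

```python
import numpy as np

def simulate(steps=200, eps=0.05):
    x = np.array([1.0, -0.5])           # current state (ESO estimate stand-in)
    x_event = x.copy()                  # state at the last triggered update
    w = np.zeros(2)                     # critic/actor weights (toy)
    updates = 0
    for _ in range(steps):
        u = -w @ x                              # current policy (toy linear form)
        x = 0.95 * x + 0.1 * np.array([u, 0.0])  # stand-in for plant + ESO loop
        if np.linalg.norm(x - x_event) > eps:     # event-triggered condition
            w += 0.01 * x               # stand-in for one ADP value-iteration step
            x_event = x.copy()          # reset the trigger reference
            updates += 1
    return updates, steps

updates, steps = simulate()
print(f"learning updates executed: {updates}/{steps} steps")
```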
zh
[AI-128] A Context-Free Smart Grid Model Using Complex System Approach
【速读】: This paper addresses the difficulty of achieving global optimization in smart grids viewed as complex systems, particularly given the diversity of their scale, elements, and strategies. The key of the solution is a complex-systems modeling approach that combines game-theoretic and classical optimization methods at different levels, improving the flexibility and scalability of the optimization process while preserving generality.
链接: https://arxiv.org/abs/2512.15733
作者: Soufian Ben Amor,Alain Bui,Guillaume Guerard
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Energy and pollution are urging problems of the 21th century. By gradually changing the actual power grid system, smart grid may evolve into different systems by means of size, elements and strategies, but its fundamental requirements and objectives will not change such as optimizing production, transmission, and consumption. Studying the smart grid through modeling and simulation provides us with valuable results which cannot be obtained in real world due to time and cost related constraints. Moreover, due to the complexity of the smart grid, achieving global optimization is not an easy task. In this paper, we propose a complex system based approach to the smart grid modeling, accentuating on the optimization by combining game theoretical and classical methods in different levels. Thanks to this combination, the optimization can be achieved with flexibility and scalability, while keeping its generality.
zh
[AI-129] TinyMyo: A Tiny Foundation Model for Flexible EMG Signal Processing at the Edge
【速读】: This paper tackles the difficulty of achieving robust generalization of surface electromyography (EMG) across subjects, acquisition systems, and protocols, along with the limitation that existing generative foundation models (FMs) for EMG target single tasks and are hard to deploy on embedded platforms. The key is TinyMyo, a lightweight foundation model based on a Transformer encoder architecture, pre-trained in a self-supervised manner on public datasets to learn general-purpose representations, achieving high reconstruction fidelity with only 3.6M parameters. With minimal task-specific head adaptations, the unified backbone supports multiple downstream tasks (gesture classification, hand kinematics regression, speech production and recognition), matching or surpassing the state of the art across diverse sensing locations and hardware platforms while staying below 5M parameters, and it enables the first reported deployment on an ultra-low-power microcontroller (GAP9) with an average power envelope of only 36.45mW.
链接: https://arxiv.org/abs/2512.15729
作者: Matteo Fasulo,Giusy Spacone,Thorir Mar Ingolfsson,Yawei Li,Luca Benini,Andrea Cossettini
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Surface electromyography (EMG) is a non-invasive sensing modality used in several domains, including biomechanics, rehabilitation, prosthetic control, and emerging human-machine interaction paradigms. Despite decades of use, significant challenges remain in achieving robust generalization across subjects, recording systems, and acquisition protocols. To tackle these challenges, foundation models (FMs) are gaining traction when targeting end-to-end applications based on EMG signals. Yet, existing EMG FMs remain limited to single downstream tasks and lack deployability on embedded platforms. In this work, we present TinyMyo, a lightweight FM based on a Transformer encoder architecture. The model is pre-trained in a self-supervised manner on publicly available datasets and achieves high reconstruction fidelity with only 3.6M parameters. With minimal task-specific head adaptations, the same backbone is used to tackle multiple downstream tasks, leveraging datasets acquired from diverse sensing locations and hardware platforms. We demonstrate generalization across hand gesture classification, hand kinematic regression, speech production and recognition, with performance comparable to or surpassing the state of the art (SoA), and model size below 5M parameters. We achieve SoA results compared to previous FM-based works on the NinaPro DB5 (89.4±0.16%), UCI-EMG (97.56±0.32%), and EPN-612 (96.74±0.09%) datasets. We report, to the best of our knowledge, the first deployment of an EMG FM on an ultra-low-power microcontroller (GAP9), achieving an average power envelope of 36.45mW. By open-sourcing the pre-trained and the downstream task architectures (this https URL), we aim to provide a flexible resource that can accelerate future research and serve as a common foundation for the EMG community.
zh
[AI-130] FedSight AI: Multi-Agent System Architecture for Federal Funds Target Rate Prediction NEURIPS2025
【速读】: This paper addresses the interpretability and predictive accuracy of the Federal Open Market Committee's (FOMC) monetary policy decision process, i.e., how AI can simulate FOMC members' decision logic and accurately predict rate-setting outcomes. The key is FedSight AI, a multi-agent framework in which large language model (LLM) based member agents analyze structured indicators and unstructured inputs (such as the Beige Book), debate, and vote, thereby reproducing the FOMC's reasoning chain. A Chain-of-Draft (CoD) extension further enforces staged, concise reasoning, substantially improving prediction accuracy (93.75%) and stability (93.33%) while keeping the reasoning transparent and aligned with real FOMC communications.
链接: https://arxiv.org/abs/2512.15728
作者: Yuhan Hou,Tianji Rao,Jeremy Tan,Adler Viton,Xiyue Zhang,David Ye,Abhishek Kodi,Sanjana Dulam,Aditya Paul,Yikai Feng
机构: 未知
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Generative AI in Finance Workshop
Abstract:The Federal Open Market Committee (FOMC) sets the federal funds rate, shaping monetary policy and the broader economy. We introduce FedSight AI, a multi-agent framework that uses large language models (LLMs) to simulate FOMC deliberations and predict policy outcomes. Member agents analyze structured indicators and unstructured inputs such as the Beige Book, debate options, and vote, replicating committee reasoning. A Chain-of-Draft (CoD) extension further improves efficiency and accuracy by enforcing concise multistage reasoning. Evaluated at 2023-2024 meetings, FedSight CoD achieved accuracy of 93.75% and stability of 93.33%, outperforming baselines including MiniFed and Ordinal Random Forest (RF), while offering transparent reasoning aligned with real FOMC communications.
zh
[AI-131] Large Model Enabled Embodied Intelligence for 6G Integrated Perception Communication and Computation Network
【速读】: This paper addresses the challenge of deeply fusing perception, communication, and computation in sixth-generation (6G) systems to realize safety-critical intelligent wireless architectures, where traditional base stations (BSs) are too single-purpose to support cooperative sensing and decision-making in complex future scenarios. The key of the solution is to endow base stations with perception, reasoning, and acting capabilities via large artificial intelligence models (LAMs), evolving them into intelligent base station agents (IBSAs). The proposed architecture builds a perception-cognition-execution pipeline combined with cloud-edge-end collaborative computation and parameter-efficient adaptation, supporting representative scenarios such as cooperative vehicle-road perception for autonomous driving and low-altitude UAV safety supervision. The paper further proposes a holistic evaluation framework covering communication performance, perception accuracy, decision reliability, safety, and energy efficiency, positioning LAM-enabled IBSAs as a practical path toward natively integrated perception, communication, and computation in the 6G era.
链接: https://arxiv.org/abs/2512.15109
作者: Zhuoran Li,Zhen Gao,Xinhua Liu,Zheng Wang,Xiaotian Zhou,Lei Liu,Yongpeng Wu,Wei Feng,Yongming Huang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:The advent of sixth-generation (6G) places intelligence at the core of wireless architecture, fusing perception, communication, and computation into a single closed-loop. This paper argues that large artificial intelligence models (LAMs) can endow base stations with perception, reasoning, and acting capabilities, thus transforming them into intelligent base station agents (IBSAs). We first review the historical evolution of BSs from single-functional analog infrastructure to distributed, software-defined, and finally LAM-empowered IBSA, highlighting the accompanying changes in architecture, hardware platforms, and deployment. We then present an IBSA architecture that couples a perception-cognition-execution pipeline with cloud-edge-end collaboration and parameter-efficient adaptation. Subsequently,we study two representative scenarios: (i) cooperative vehicle-road perception for autonomous driving, and (ii) ubiquitous base station support for low-altitude uncrewed aerial vehicle safety monitoring and response against unauthorized drones. On this basis, we analyze key enabling technologies spanning LAM design and training, efficient edge-cloud inference, multi-modal perception and actuation, as well as trustworthy security and governance. We further propose a holistic evaluation framework and benchmark considerations that jointly cover communication performance, perception accuracy, decision-making reliability, safety, and energy efficiency. Finally, we distill open challenges on benchmarks, continual adaptation, trustworthy decision-making, and standardization. Together, this work positions LAM-enabled IBSAs as a practical path toward integrated perception, communication, and computation native, safety-critical 6G systems.
zh
Machine Learning
[LG-0] PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies
链接: https://arxiv.org/abs/2512.16881
作者: Arhan Jain,Mingtong Zhang,Kanav Arora,William Chen,Marcel Torne,Muhammad Zubair Irshad,Sergey Zakharov,Yue Wang,Sergey Levine,Chelsea Finn,Wei-Chiu Ma,Dhruv Shah,Abhishek Gupta,Karl Pertsch
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this https URL
Abstract:A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which has to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
[LG-1] Learning Confidence Ellipsoids and Applications to Robust Subspace Recovery
链接: https://arxiv.org/abs/2512.16875
作者: Chao Gao,Liren Shan,Vaidehi Srinivas,Aravindan Vijayaraghavan
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:We study the problem of finding confidence ellipsoids for an arbitrary distribution in high dimensions. Given samples from a distribution D and a confidence parameter \alpha , the goal is to find the smallest volume ellipsoid E which has probability mass \Pr_D[E] \ge 1-\alpha . Ellipsoids are a highly expressive class of confidence sets as they can capture correlations in the distribution, and can approximate any convex set. This problem has been studied in many different communities. In statistics, this is the classic minimum volume estimator introduced by Rousseeuw as a robust non-parametric estimator of location and scatter. However in high dimensions, it becomes NP-hard to obtain any non-trivial approximation factor in volume when the condition number \beta of the ellipsoid (ratio of the largest to the smallest axis length) goes to \infty . This motivates the focus of our paper: can we efficiently find confidence ellipsoids with volume approximation guarantees when compared to ellipsoids of bounded condition number \beta ? Our main result is a polynomial time algorithm that finds an ellipsoid E whose volume is within a O(\beta^\gamma d) multiplicative factor of the volume of the best \beta -conditioned ellipsoid while covering at least 1-O(\alpha/\gamma) probability mass for any \gamma > \alpha . We complement this with a computational hardness result that shows that such a dependence seems necessary up to constants in the exponent. The algorithm and analysis use the rich primal-dual structure of the minimum volume enclosing ellipsoid and the geometric Brascamp-Lieb inequality. As a consequence, we obtain the first polynomial time algorithm with approximation guarantees on worst-case instances of the robust subspace recovery problem.
[LG-2] On the Universal Representation Property of Spiking Neural Networks
链接: https://arxiv.org/abs/2512.16872
作者: Shayan Hundrieser,Philipp Tuchel,Insung Kong,Johannes Schmidt-Hieber
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 54 pages, 8 figures
Abstract:Inspired by biology, spiking neural networks (SNNs) process information via discrete spikes over time, offering an energy-efficient alternative to the classical computing paradigm and classical artificial neural networks (ANNs). In this work, we analyze the representational power of SNNs by viewing them as sequence-to-sequence processors of spikes, i.e., systems that transform a stream of input spikes into a stream of output spikes. We establish the universal representation property for a natural class of spike train functions. Our results are fully quantitative, constructive, and near-optimal in the number of required weights and neurons. The analysis reveals that SNNs are particularly well-suited to represent functions with few inputs, low temporal complexity, or compositions of such functions. The latter is of particular interest, as it indicates that deep SNNs can efficiently capture composite functions via a modular design. As an application of our results, we discuss spike train classification. Overall, these results contribute to a rigorous foundation for understanding the capabilities and limitations of spike-based neuromorphic systems.
[LG-3] Tiny Recursive Control: Iterative Reasoning for Efficient Optimal Control
链接: https://arxiv.org/abs/2512.16824
作者: Amit Jain,Richard Linares
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:Neural network controllers increasingly demand millions of parameters, and language model approaches push into the billions. For embedded aerospace systems with strict power and latency constraints, this scaling is prohibitive. We present Tiny Recursive Control (TRC), a neural architecture based on a counterintuitive principle: capacity can emerge from iteration depth rather than parameter count. TRC applies compact networks (approximately 1.5M parameters) repeatedly through a two-level hierarchical latent structure, refining control sequences by simulating trajectories and correcting based on tracking error. Because the same weights process every refinement step, adding iterations increases computation without increasing memory. We evaluate TRC on nonlinear control problems including oscillator stabilization and powered descent with fuel constraints. Across these domains, TRC achieves near-optimal control costs while requiring only millisecond-scale inference on GPU and under 10~MB memory, two orders of magnitude smaller than language model baselines. These results demonstrate that recursive reasoning, previously confined to discrete tasks, transfers effectively to continuous control synthesis.
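The core idea, capacity from iteration depth rather than parameter count, can be sketched as one small network applied repeatedly to refine a control plan from rollout error. The toy double-integrator, feature construction, and step size below are illustrative assumptions, not the TRC architecture:

```python
import torch

torch.manual_seed(0)
refiner = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 1))   # small net, reused every pass

def rollout(x0, u):
    """Toy double-integrator rollout; returns the final state (we want it at zero)."""
    pos, vel = x0[0], x0[1]
    for ut in u:
        vel = vel + 0.1 * ut
        pos = pos + 0.1 * vel
    return torch.stack([pos, vel])

x0, u = torch.tensor([1.0, 0.0]), torch.zeros(20)        # initial state, 20-step plan
with torch.no_grad():
    for k in range(8):                                   # depth comes from iteration count
        err = rollout(x0, u)                             # simulate, measure tracking error
        feats = torch.cat([err, u.mean().view(1), u.abs().max().view(1)])
        u = u + 0.5 * refiner(feats)                     # same weights correct the whole plan
        print(f"pass {k}: |final state| = {err.norm():.3f}")
```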
[LG-4] MEPIC: Memory Efficient Position Independent Caching for LLM Serving
链接: https://arxiv.org/abs/2512.16822
作者: Qian Wang,Zahra Yousefijamarani,Morgan Lindsay Heisler,Rongzhi Gu,Bai Xiaolong,Shan Yizhou,Wei Zhang,Wang Lan,Ying Xiong,Yong Zhang,Zhenan Fan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems, repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching partially alleviates some of these costs by reusing KV cache for previously processed tokens, but limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. However, because these operations vary across queries, KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from token- to block-level so only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without any model changes.
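A toy sketch of position-independent, page-aligned chunk reuse: KV pages are keyed by content hash so identical chunks are shared across requests, and only the first block of a chunk is recomputed per request. Every name and the "compute" step are placeholders; MEPIC's actual kernels and RoPE fusion are not modeled:

```python
import hashlib

PAGE = 16  # tokens per KV page

class ChunkKVCache:
    """Toy content-addressed store for chunk KV pages."""
    def __init__(self):
        self.pages = {}                                  # chunk hash -> shared page list

    def get(self, chunk_tokens):
        key = hashlib.sha256(" ".join(chunk_tokens).encode()).hexdigest()
        if key not in self.pages:                        # first request pays full prefill
            n_pages = -(-len(chunk_tokens) // PAGE)      # ceil(len / PAGE)
            self.pages[key] = [f"kv-page-{key[:6]}-{i}" for i in range(n_pages)]
        first = "recomputed-boundary-block"              # request-specific first block
        return [first] + self.pages[key][1:]             # remaining pages shared as-is

cache = ChunkKVCache()
doc = ["tok"] * 40                                       # a shared 40-token document chunk
a, b = cache.get(doc), cache.get(doc)                    # two requests reusing the chunk
print("shared pages:", sum(x is y for x, y in zip(a[1:], b[1:])), "of", len(a) - 1)
```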
[LG-5] Pattern recognition in complex systems via vector-field representations of spatio-temporal data
链接: https://arxiv.org/abs/2512.16763
作者: Ingrid Amaranta Membrillo Solis,Maria van Rossem,Tristan Madeleine,Tetiana Orlova,Nina Podoliak,Giampaolo D’Alessandro,Jacek Brodzki,Malgosia Kaczmarek
类目: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Chaotic Dynamics (nlin.CD); Pattern Formation and Solitons (nlin.PS)
*备注: 24 pages, 10 figures
Abstract:A complex system comprises multiple interacting entities whose interdependencies form a unified whole, exhibiting emergent behaviours not present in individual components. Examples include the human brain, living cells, soft matter, Earth’s climate, ecosystems, and the economy. These systems exhibit high-dimensional, non-linear dynamics, making their modelling, classification, and prediction particularly challenging. Advances in information technology have enabled data-driven approaches to studying such systems. However, the sheer volume and complexity of spatio-temporal data often hinder traditional methods like dimensionality reduction, phase-space reconstruction, and attractor characterisation. This paper introduces a geometric framework for analysing spatio-temporal data from complex systems, grounded in the theory of vector fields over discrete measure spaces. We propose a two-parameter family of metrics suitable for data analysis and machine learning applications. The framework supports time-dependent images, image gradients, and real- or vector-valued functions defined on graphs and simplicial complexes. We validate our approach using data from numerical simulations of biological and physical systems on flat and curved domains. Our results show that the proposed metrics, combined with multidimensional scaling, effectively address key analytical challenges. They enable dimensionality reduction, mode decomposition, phase-space reconstruction, and attractor characterisation. Our findings offer a robust pathway for understanding complex dynamical systems, especially in contexts where traditional modelling is impractical but abundant experimental data are available.
[LG-6] NRGPT: An Energy-based Alternative for GPT
链接: https://arxiv.org/abs/2512.16762
作者: Nima Dehmamy,Benjamin Hoover,Bishwajit Saha,Leo Kozachkov,Jean-Jacques Slotine,Dmitry Krotov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although they don’t necessarily lead to the best performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, doing so only during very long training.
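The "inference as descent on an energy landscape" view can be illustrated with a convex toy energy; NRGPT's actual energy function and its coupling to the transformer are in the paper, so treat this purely as a conceptual sketch:

```python
import torch

torch.manual_seed(0)
d = 32
W = torch.randn(d, d)
W = W @ W.T / d                              # PSD matrix -> convex toy energy
context = torch.randn(d)                     # stand-in for the conditioning context

def energy(z):
    """E(z) = 1/2 z^T W z - <context, z>; minimized by the 'inferred' token state."""
    return 0.5 * z @ W @ z - context @ z

z = torch.zeros(d, requires_grad=True)       # free variable explored on the landscape
opt = torch.optim.SGD([z], lr=0.1)
for step in range(50):                       # inference = gradient descent on E
    opt.zero_grad()
    e = energy(z)
    e.backward()
    opt.step()
print(f"final energy: {energy(z).item():.3f}")
```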
[LG-7] Machine Learning Algorithms: Detection Official Hajj and Umrah Travel Agency Based on Text and Metadata Analysis
链接: https://arxiv.org/abs/2512.16742
作者: Wisnu Uriawan,Muhamad Veva Ramadhan,Firman Adi Nugraha,Hasbi Nur Wahid,M Dantha Arianvasya,Muhammad Zaki Alghifari
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid digitalization of Hajj and Umrah services in Indonesia has significantly facilitated pilgrims but has concurrently opened avenues for digital fraud through counterfeit mobile applications. These fraudulent applications not only inflict financial losses but also pose severe privacy risks by harvesting sensitive personal data. This research aims to address this critical issue by implementing and evaluating machine learning algorithms to verify application authenticity automatically. Using a comprehensive dataset comprising both official applications registered with the Ministry of Religious Affairs and unofficial applications circulating on app stores, we compare the performance of three robust classifiers: Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB). The study utilizes a hybrid feature extraction methodology that combines Textual Analysis (TF-IDF) of application descriptions with Metadata Analysis of sensitive access permissions. The experimental results indicate that the SVM algorithm achieves the highest performance with an accuracy of 92.3%, a precision of 91.5%, and an F1-score of 92.0%. Detailed feature analysis reveals that specific keywords related to legality and high-risk permissions (e.g., READ PHONE STATE) are the most significant discriminators. This system is proposed as a proactive, scalable solution to enhance digital trust in the religious tourism sector, potentially serving as a prototype for a national verification system.
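A compact stand-in for the text branch of such a system: TF-IDF over app descriptions feeding an SVM, in scikit-learn. The four-line dataset is fabricated, and the paper's permission/metadata features are omitted here:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Fabricated toy descriptions; 1 = official, 0 = fraudulent.
texts = [
    "official hajj umrah travel agency registered ministry religious affairs",
    "cheap umrah package send id card photo and bank pin to register",
    "licensed pilgrimage operator with verified legal permit",
    "free hajj lottery install now and grant all permissions read phone state",
]
labels = [1, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),  # text features
    ("svm", LinearSVC(C=1.0)),                                          # linear SVM
])
clf.fit(texts, labels)
print(clf.predict(["verified official agency with ministry permit"]))
```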
[LG-8] KOSS: Kalman-Optimal Selective State Spaces for Long-Term Sequence Modeling
链接: https://arxiv.org/abs/2512.16723
作者: Lei Wang,Xin Tan,Mingwei Wang,Ying Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent selective state space models (SSMs), such as Mamba and Mamba-2, have demonstrated strong performance in sequence modeling owing to input-dependent selection mechanisms. However, these mechanisms lack theoretical grounding and cannot support context-aware selection from latent state dynamics. To address these limitations, we propose KOSS, a Kalman-optimal Selective State Space model that formulates selection as latent state uncertainty minimization. Derived from estimation theory, KOSS adopts a continuous-time latent update driven by a Kalman gain that dynamically modulates information propagation based on content and context, enabling a closed-loop, context-aware selectivity mechanism. To ensure stable computation and near-linear scalability, KOSS employs global spectral differentiation for frequency-domain derivative estimation, along with a segment-wise scan for hardware-efficient processing. On a selective copying task with distractors, KOSS achieves over 79% accuracy while baselines drop below 20%, demonstrating robust context-aware selection. Furthermore, across nine long-term forecasting benchmarks, KOSS reduces MSE by 2.92–36.23% and consistently outperforms state-of-the-art models in both accuracy and stability. To assess real-world applicability, a case study on secondary surveillance radar (SSR) tracking confirms KOSS’s robustness under irregular intervals and noisy conditions and demonstrates its effectiveness in real-world applications. Finally, supplementary experiments verify Kalman gain convergence and the frequency response of spectral differentiation, providing theoretical support for the proposed closed-loop design.
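For context, below is the standard Kalman measurement update whose gain KOSS builds its selection mechanism around: the gain weighs each observation against latent-state uncertainty, so low-information inputs are attenuated. Matrices and measurements are generic placeholders:

```python
import numpy as np

def kalman_update(x, P, y, H, R):
    """One measurement update; returns posterior state, covariance, and gain."""
    S = H @ P @ H.T + R                       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain: the "selection" weight
    x_new = x + K @ (y - H @ x)               # correct state by the gated innovation
    P_new = (np.eye(len(x)) - K @ H) @ P      # uncertainty shrinks after the update
    return x_new, P_new, K

x, P = np.zeros(2), np.eye(2)                 # prior state and covariance
H, R = np.array([[1.0, 0.0]]), np.array([[0.5]])
for y in (np.array([1.0]), np.array([1.1])):  # two noisy scalar observations
    x, P, K = kalman_update(x, P, y, H, R)
    print(f"gain={K.ravel()}  state={x.round(3)}")
```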
[LG-9] Polyharmonic Spline Packages: Composition Efficient Procedures for Computation and Differentiation
链接: https://arxiv.org/abs/2512.16718
作者: Yuriy N. Bakhvalov
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Part 2 of 4 in the “Polyharmonic Cascade” cycle. Continues the theory from arXiv.2512.12731. Source code is available at: this https URL
Abstract:In a previous paper it was shown that a machine learning regression problem can be solved within the framework of random function theory, with the optimal kernel analytically derived from symmetry and indifference principles and coinciding with a polyharmonic spline. However, a direct application of that solution is limited by O(N^3) computational cost and by a breakdown of the original theoretical assumptions when the input space has excessive dimensionality. This paper proposes a cascade architecture built from packages of polyharmonic splines that simultaneously addresses scalability and is theoretically justified for problems with unknown intrinsic low dimensionality. Efficient matrix procedures are presented for forward computation and end-to-end differentiation through the cascade.
[LG-10] Phishing Detection System: An Ensemble Approach Using Character-Level CNN and Feature Engineering
链接: https://arxiv.org/abs/2512.16717
作者: Rudra Dubey,Arpit Mani Tripathi,Archit Srivastava,Sarvpal Singh
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 7 pages, 8 figures
Abstract:In actuality, phishing attacks remain one of the most prevalent cybersecurity risks in existence today, with malevolent actors constantly changing their strategies to successfully trick users. This paper presents an AI model for a phishing detection system that uses an ensemble approach to combine character-level Convolutional Neural Networks (CNN) and LightGBM with engineered features. Our system uses a character-level CNN to extract sequential features after extracting 36 lexical, structural, and domain-based features from the URLs. On a test dataset of 19,873 URLs, the ensemble model achieves an accuracy of 99.819 percent, precision of 100 percent, recall of 99.635 percent, and ROC-AUC of 99.947 percent. Through a FastAPI-based service with an intuitive user interface, the suggested system has been utilised to offer real-time detection. In contrast, the results demonstrate that the suggested solution performs better than individual models; LightGBM contributes 40 percent and character-CNN contributes 60 percent to the final prediction. The suggested method maintains extremely low false positive rates while doing a good job of identifying contemporary phishing techniques. Index Terms - Phishing detection, machine learning, deep learning, CNN, ensemble methods, cybersecurity, URL analysis
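The 60/40 score-level fusion described above reduces to a convex combination of the two models' probabilities; here is a minimal sketch with simulated scores (training the character-CNN and LightGBM themselves is omitted):

```python
import numpy as np

def fuse(p_cnn: np.ndarray, p_lgbm: np.ndarray,
         w_cnn: float = 0.6, w_lgbm: float = 0.4, thresh: float = 0.5):
    """Weighted score-level fusion: CNN weighted 0.6, LightGBM 0.4, then thresholded."""
    p = w_cnn * p_cnn + w_lgbm * p_lgbm       # convex combination of probabilities
    return (p >= thresh).astype(int), p

rng = np.random.default_rng(7)
p_cnn, p_lgbm = rng.uniform(size=5), rng.uniform(size=5)  # simulated per-URL scores
labels, scores = fuse(p_cnn, p_lgbm)
print(np.round(scores, 3), labels)
```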
[LG-11] Olaf: Bringing an Animated Character to Life in the Physical World
链接: https://arxiv.org/abs/2512.16705
作者: David Müller,Espen Knoop,Dario Mylonopoulos,Agon Serifi,Michael A. Hopkins,Ruben Grandia,Moritz Bächer
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf’s feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character’s slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.
[LG-12] CLARiTy: A Vision Transformer for Multi-Label Classification and Weakly-Supervised Localization of Chest X-ray Pathologies
链接: https://arxiv.org/abs/2512.16700
作者: John M. Statheros,Hairong Wang,Richard Klein
类目: Machine Learning (cs.LG)
*备注: 23 pages, 11 figures, submitted to Medical Image Analysis
Abstract:The interpretation of chest X-rays (CXRs) poses significant challenges, particularly in achieving accurate multi-label pathology classification and spatial localization. These tasks demand different levels of annotation granularity but are frequently constrained by the scarcity of region-level (dense) annotations. We introduce CLARiTy (Class Localizing and Attention Refining Image Transformer), a vision transformer-based model for joint multi-label classification and weakly-supervised localization of thoracic pathologies. CLARiTy employs multiple class-specific tokens to generate discriminative attention maps, and a SegmentCAM module for foreground segmentation and background suppression using explicit anatomical priors. Trained on image-level labels from the NIH ChestX-ray14 dataset, it leverages distillation from a ConvNeXtV2 teacher for efficiency. Evaluated on the official NIH split, the CLARiTy-S-16-512 (a configuration of CLARiTy), achieves competitive classification performance across 14 pathologies, and state-of-the-art weakly-supervised localization performance on 8 pathologies, outperforming prior methods by 50.7%. In particular, pronounced gains occur for small pathologies like nodules and masses. The lower-resolution variant of CLARiTy, CLARiTy-S-16-224, offers high efficiency while decisively surpassing baselines, thereby having the potential for use in low-resource settings. An ablation study confirms contributions of SegmentCAM, DINO pretraining, orthogonal class token loss, and attention pooling. CLARiTy advances beyond CNN-ViT hybrids by harnessing ViT self-attention for global context and class-specific localization, refined through convolutional background suppression for precise, noise-reduced heatmaps.
[LG-13] Blog Data Showdown: Machine Learning vs Neuro-Symbolic Models for Gender Classification
链接: https://arxiv.org/abs/2512.16687
作者: Natnael Tilahun Sinshaw,Mengmei He,Tadesse K. Bahiru,Sudhir Kumar Mohapatra
类目: Machine Learning (cs.LG)
*备注: 6 pages
Abstract:Text classification problems, such as gender classification from a blog, have been a well-matured research area that has been well studied using machine learning algorithms. It has several application domains in market analysis, customer recommendation, and recommendation systems. This study presents a comparative analysis of the widely used machine learning algorithms, namely Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), AdaBoost, XGBoost, and an SVM variant (SVM_R) with neuro-symbolic AI (NeSy). The paper also explores the effect of text representations such as TF-IDF, the Universal Sentence Encoder (USE), and RoBERTa. Additionally, various feature extraction techniques, including Chi-Square, Mutual Information, and Principal Component Analysis, are explored. Building on these, we introduce a comparative analysis of the machine learning and deep learning approaches in comparison to the NeSy. The experimental results show that the use of the NeSy approach matched strong MLP results despite a limited dataset. Future work on this research will expand the knowledge base, the scope of embedding types, and the hyperparameter configuration to further study the effectiveness of the NeSy approach.
[LG-14] Exploiting Radio Frequency Fingerprints for Device Identification: Tackling Cross-receiver Challenges in the Source-data-free Scenario
链接: https://arxiv.org/abs/2512.16648
作者: Liu Yang,Qiang Li,Luxiong Wen,Jian Yang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: IEEE Transactions on Mobile Computing
Abstract:With the rapid proliferation of edge computing, Radio Frequency Fingerprint Identification (RFFI) has become increasingly important for secure device authentication. However, practical deployment of deep learning-based RFFI models is hindered by a critical challenge: their performance often degrades significantly when applied across receivers with different hardware characteristics due to distribution shifts introduced by receiver variation. To address this, we investigate the source-data-free cross-receiver RFFI (SCRFFI) problem, where a model pretrained on labeled signals from a source receiver must adapt to unlabeled signals from a target receiver, without access to any source-domain data during adaptation. We first formulate a novel constrained pseudo-labeling-based SCRFFI adaptation framework, and provide a theoretical analysis of its generalization performance. Our analysis highlights a key insight: the target-domain performance is highly sensitive to the quality of the pseudo-labels generated during adaptation. Motivated by this, we propose Momentum Soft pseudo-label Source Hypothesis Transfer (MS-SHOT), a new method for SCRFFI that incorporates momentum-center-guided soft pseudo-labeling and enforces global structural constraints to encourage confident and diverse predictions. Notably, MS-SHOT effectively addresses scenarios involving label shift or unknown, non-uniform class distributions in the target domain – a significant limitation of prior methods. Extensive experiments on real-world datasets demonstrate that MS-SHOT consistently outperforms existing approaches in both accuracy and robustness, offering a practical and scalable solution for source-data-free cross-receiver adaptation in RFFI.
[LG-15] Abacus: Self-Supervised Event Counting-Aligned Distributional Pretraining for Sequential User Modeling
链接: https://arxiv.org/abs/2512.16581
作者: Sullivan Castro,Artem Betlei,Thomas Di Martino,Nadir El Manouzi
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
Abstract:Modeling user purchase behavior is a critical challenge in display advertising systems, necessary for real-time bidding. The difficulty arises from the sparsity of positive user events and the stochasticity of user actions, leading to severe class imbalance and irregular event timing. Predictive systems usually rely on hand-crafted “counter” features, overlooking the fine-grained temporal evolution of user intent. Meanwhile, current sequential models extract direct sequential signal, missing useful event-counting statistics. We enhance deep sequential models with self-supervised pretraining strategies for display advertising. Especially, we introduce Abacus, a novel approach of predicting the empirical frequency distribution of user events. We further propose a hybrid objective unifying Abacus with sequential learning objectives, combining stability of aggregated statistics with the sequence modeling sensitivity. Experiments on two real-world datasets show that Abacus pretraining outperforms existing methods accelerating downstream task convergence, while hybrid approach yields up to +6.1% AUC compared to the baselines.
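The Abacus pretraining target, the empirical frequency distribution of user events, is simple to construct; here is a sketch with an invented event vocabulary and sequence (the paper's actual feature pipeline is not shown):

```python
import numpy as np

EVENTS = ["view", "click", "add_to_cart", "purchase"]   # hypothetical vocabulary

def count_target(seq: list[str]) -> np.ndarray:
    """Normalized event-count histogram over a fixed event vocabulary."""
    counts = np.array([seq.count(e) for e in EVENTS], dtype=float)
    return counts / max(counts.sum(), 1.0)

seq = ["view", "view", "click", "view", "add_to_cart", "click", "view"]
target = count_target(seq)                    # the distribution a model would regress
print(dict(zip(EVENTS, target.round(3))))     # stable aggregate statistics of the sequence
```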
[LG-16] Persistent Multiscale Density-based Clustering
链接: https://arxiv.org/abs/2512.16558
作者: Daniël Bot,Leland McInnes,Jan Aerts
类目: Machine Learning (cs.LG)
*备注: 21 pages, 11 figures, submitted to the Journal of Machine Learning Research
Abstract:Clustering is a cornerstone of modern data analysis. Detecting clusters in exploratory data analyses (EDA) requires algorithms that make few assumptions about the data. Density-based clustering algorithms are particularly well-suited for EDA because they describe high-density regions, assuming only that a density exists. Applying density-based clustering algorithms in practice, however, requires selecting appropriate hyperparameters, which is difficult without prior knowledge of the data distribution. For example, DBSCAN requires selecting a density threshold, and HDBSCAN* relies on a minimum cluster size parameter. In this work, we propose Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). This novel density-based clustering algorithm efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. We compare its performance to HDBSCAN* on several real-world datasets, demonstrating that it achieves a higher average ARI and is less sensitive to changes in the number of mutual reachability neighbours. Additionally, we compare PLSCAN’s computational costs to k-Means, demonstrating competitive run-times on low-dimensional datasets. At higher dimensions, run times scale more similarly to HDBSCAN*.
[LG-17] A Systematic Study of Code Obfuscation Against LLM-based Vulnerability Detection
链接: https://arxiv.org/abs/2512.16538
作者: Xiao Li,Yue Li,Hao Wu,Yue Zhang,Yechao Zhang,Fengyuan Xu,Sheng Zhong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:As large language models (LLMs) are increasingly adopted for code vulnerability detection, their reliability and robustness across diverse vulnerability types have become a pressing concern. In traditional adversarial settings, code obfuscation has long been used as a general strategy to bypass auditing tools, preserving exploitability without tampering with the tools themselves. Numerous efforts have explored obfuscation methods and tools, yet their capabilities differ in terms of supported techniques, granularity, and programming languages, making it difficult to systematically assess their impact on LLM-based vulnerability detection. To address this gap, we provide a structured systematization of obfuscation techniques and evaluate them under a unified framework. Specifically, we categorize existing obfuscation methods into three major classes (layout, data flow, and control flow) covering 11 subcategories and 19 concrete techniques. We implement these techniques across four programming languages (Solidity, C, C++, and Python) using a consistent LLM-driven approach, and evaluate their effects on 15 LLMs spanning four model families (DeepSeek, OpenAI, Qwen, and LLaMA), as well as on two coding agents (GitHub Copilot and Codex). Our findings reveal both positive and negative impacts of code obfuscation on LLM-based vulnerability detection, highlighting conditions under which obfuscation leads to performance improvements or degradations. We further analyze these outcomes with respect to vulnerability characteristics, code properties, and model attributes. Finally, we outline several open problems and propose future directions to enhance the robustness of LLMs for real-world vulnerability detection.
[LG-18] Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders
链接: https://arxiv.org/abs/2512.16519
作者: Nikolaos Ellinas,Alexandra Vioni,Panos Kakoulidis,Georgios Vamvoukakis,Myrsini Christidou,Konstantinos Markopoulos,Junkwang Oh,Gunu Jho,Inchul Hwang,Aimilios Chalamandaris,Pirros Tsiakoulis
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, this method is compatible with any mel-based vocoder without requiring any additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space in order to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform, then converted to the cepstrum by applying DCT. In this domain, the cepstral peak is shifted without having to estimate its position and the modified mel is recomputed by applying IDCT and mel-filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders as well as in comparison with traditional pitch modification methods.
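A minimal NumPy sketch of the pipeline the abstract describes: pseudo-inverse mel transform to an approximate spectrogram, DCT into the cepstral domain, a shift along the quefrency axis, then IDCT and the mel filterbank. Note that the np.roll shift here is a crude stand-in for the paper's peak shift (which avoids estimating the peak position and would not wrap around), and the mel settings and shift amount are illustrative.

```python
import numpy as np
import librosa
from scipy.fftpack import dct, idct

sr, n_fft, n_mels = 22050, 1024, 80
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, 1+n_fft//2)

def pitch_shift_mel(mel_spec, shift_bins=3):
    """mel_spec: (n_mels, T) magnitude mel-spectrogram."""
    # 1) Recover an approximate linear spectrogram via the pseudo-inverse mel transform.
    spec = np.maximum(np.linalg.pinv(mel_fb) @ mel_spec, 1e-8)
    # 2) Log-magnitude -> cepstrum along the frequency axis.
    cep = dct(np.log(spec), axis=0, norm="ortho")
    # 3) Shift the harmonic (cepstral-peak) structure toward the target pitch.
    cep = np.roll(cep, shift_bins, axis=0)
    # 4) Back to the spectrum, then re-apply the mel filterbank.
    return mel_fb @ np.exp(idct(cep, axis=0, norm="ortho"))

shifted = pitch_shift_mel(np.random.rand(n_mels, 50) + 0.1)
print(shifted.shape)  # (80, 50), ready for any mel-based vocoder
```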
[LG-19] Batch Normalization-Free Fully Integer Quantized Neural Networks via Progressive Tandem Learning
链接: https://arxiv.org/abs/2512.16476
作者: Pengfei Sun,Wenyu Jiang,Piew Yoong Chee,Paul Devos,Dick Botteldooren
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Quantised neural networks (QNNs) shrink models and reduce inference energy through low-bit arithmetic, yet most still depend on a running statistics batch normalisation (BN) layer, preventing true integer-only deployment. Prior attempts remove BN by parameter folding or tailored initialisation; while helpful, they rarely recover BN’s stability and accuracy and often impose bespoke constraints. We present a BN-free, fully integer QNN trained via a progressive, layer-wise distillation scheme that slots into existing low-bit pipelines. Starting from a pretrained BN-enabled teacher, we use layer-wise targets and progressive compensation to train a student that performs inference exclusively with integer arithmetic and contains no BN operations. On ImageNet with AlexNet, the BN-free model attains competitive Top-1 accuracy under aggressive quantisation. The procedure integrates directly with standard quantisation workflows, enabling end-to-end integer-only inference for resource-constrained settings such as edge and embedded devices.
[LG-20] A Novel Proposal in Wind Turbine Blade Failure Detection: An Integrated Approach to Energy Efficiency and Sustainability
链接: https://arxiv.org/abs/2512.16437
作者: Jordan Abarca-Albores,Danna Cristina Gutiérrez Cabrera,Luis Antonio Salazar-Licea,Dante Ruiz-Robles,Jesus Alejandro Franco,Alberto-Jesus Perea-Moreno,David Muñoz-Rodríguez,Quetzalcoatl Hernandez-Escobedo
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 21 pages, 10 figures, 9 tables
Abstract:This paper presents a novel methodology for detecting faults in wind turbine blades using computational learning techniques. The study evaluates two models: the first employs logistic regression, which outperformed neural networks, decision trees, and the naive Bayes method, demonstrating its effectiveness in identifying fault-related patterns. The second model leverages clustering and achieves superior performance in terms of precision and data segmentation. The results indicate that clustering may better capture the underlying data characteristics compared to supervised methods. The proposed methodology offers a new approach to early fault detection in wind turbine blades, highlighting the potential of integrating different computational learning techniques to enhance system reliability. The use of accessible tools like Orange Data Mining underscores the practical application of these advanced solutions within the wind energy sector. Future work will focus on combining these methods to improve detection accuracy further and extend the application of these techniques to other critical components in energy infrastructure.
[LG-21] Multi-Fidelity Delayed Acceptance: hierarchical MCMC sampling for Bayesian inverse problems combining multiple solvers through deep neural networks
链接: https://arxiv.org/abs/2512.16430
作者: Filippo Zacchei,Paolo Conti,Attilio Alberto Frangi,Andrea Manzoni
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 28 pages, 8 tables, 3 algorithms, 16 figures
Abstract:Inverse uncertainty quantification (UQ) tasks such as parameter estimation are computationally demanding whenever dealing with physics-based models, and typically require repeated evaluations of complex numerical solvers. When partial differential equations are involved, full-order models such as those based on the Finite Element Method can make traditional sampling approaches like Markov Chain Monte Carlo (MCMC) computationally infeasible. Although data-driven surrogate models may help reduce evaluation costs, their utility is often limited by the expense of generating high-fidelity data. In contrast, low-fidelity data can be produced more efficiently, although relying on them alone may degrade the accuracy of the inverse UQ solution. To address these challenges, we propose a Multi-Fidelity Delayed Acceptance scheme for Bayesian inverse problems. Extending the Multi-Level Delayed Acceptance framework, the method introduces multi-fidelity neural networks that combine the predictions of solvers of varying fidelity, with high-fidelity evaluations restricted to an offline training stage. During the online phase, likelihood evaluations are obtained by evaluating the coarse solvers and passing their outputs to the trained neural networks, thereby avoiding additional high-fidelity simulations. This construction allows heterogeneous coarse solvers to be incorporated consistently within the hierarchy, providing greater flexibility than standard Multi-Level Delayed Acceptance. The proposed approach improves the approximation accuracy of the low-fidelity solvers, leading to longer sub-chain lengths, better mixing, and accelerated posterior inference. The effectiveness of the strategy is demonstrated on two benchmark inverse problems involving (i) steady isotropic groundwater flow, (ii) an unsteady reaction-diffusion system, for which substantial computational savings are obtained.
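The delayed-acceptance mechanism the method extends can be shown in a few lines: a cheap surrogate posterior screens each proposal, and only survivors pay for a fine evaluation, with a second-stage correction keeping the chain exactly targeting the fine posterior. The two placeholder log-posteriors below stand in for the coarse-solver-plus-network and high-fidelity likelihoods, and the step size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
log_post_fine = lambda x: -0.5 * (x - 1.0) ** 2    # placeholder "expensive" target
log_post_coarse = lambda x: -0.5 * (x - 1.1) ** 2  # cheap approximation of it

def da_step(x, step=0.5):
    y = x + step * rng.normal()
    # Stage 1: screen the proposal with the coarse posterior only.
    a1 = min(1.0, np.exp(log_post_coarse(y) - log_post_coarse(x)))
    if rng.random() >= a1:
        return x                                   # rejected without a fine solve
    # Stage 2: correct with the fine posterior so the chain stays exact.
    a2 = min(1.0, np.exp((log_post_fine(y) - log_post_fine(x))
                         - (log_post_coarse(y) - log_post_coarse(x))))
    return y if rng.random() < a2 else x

x, samples = 0.0, []
for _ in range(5000):
    x = da_step(x)
    samples.append(x)
print(np.mean(samples[1000:]))  # close to 1.0, the fine posterior's mean
```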
[LG-22] Geometric Laplace Neural Operator
链接: https://arxiv.org/abs/2512.16409
作者: Hao Tang,Jiongyu Zhu,Zimeng Feng,Hao Li,Chao Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural operators have emerged as powerful tools for learning mappings between function spaces, enabling efficient solutions to partial differential equations across varying inputs and domains. Despite the success, existing methods often struggle with non-periodic excitations, transient responses, and signals defined on irregular or non-Euclidean geometries. To address this, we propose a generalized operator learning framework based on a pole-residue decomposition enriched with exponential basis functions, enabling expressive modeling of aperiodic and decaying dynamics. Building on this formulation, we introduce the Geometric Laplace Neural Operator (GLNO), which embeds the Laplace spectral representation into the eigen-basis of the Laplace-Beltrami operator, extending operator learning to arbitrary Riemannian manifolds without requiring periodicity or uniform grids. We further design a grid-invariant network architecture (GLNONet) that realizes GLNO in practice. Extensive experiments on PDEs/ODEs and real-world datasets demonstrate our robust performance over other state-of-the-art models.
[LG-23] NDRL: Cotton Irrigation and Nitrogen Application with Nested Dual-Agent Reinforcement Learning ICONIP2025
链接: https://arxiv.org/abs/2512.16408
作者: Ruifeng Xu,Liang He
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Accepted by ICONIP 2025
Abstract:Effective irrigation and nitrogen fertilization have a significant impact on crop yield. However, existing research faces two limitations: (1) the high complexity of optimizing water-nitrogen combinations during crop growth and poor yield optimization results; and (2) the difficulty in quantifying mild stress signals and the delayed feedback, which results in less precise dynamic regulation of water and nitrogen and lower resource utilization efficiency. To address these issues, we propose a Nested Dual-Agent Reinforcement Learning (NDRL) method. The parent agent in NDRL identifies promising macroscopic irrigation and fertilization actions based on projected cumulative yield benefits, reducing ineffective exploration while maintaining alignment between objectives and yield. The child agent's reward function incorporates quantified Water Stress Factor (WSF) and Nitrogen Stress Factor (NSF), and uses a mixed probability distribution to dynamically optimize daily strategies, thereby enhancing both yield and resource efficiency. We used field experiment data from 2023 and 2024 to calibrate and validate the Decision Support System for Agrotechnology Transfer (DSSAT) to simulate real-world conditions and interact with NDRL. Experimental results demonstrate that, compared to the best baseline, the simulated yield increased by 4.7% in both 2023 and 2024, the irrigation water productivity increased by 5.6% and 5.1% respectively, and the nitrogen partial factor productivity increased by 6.3% and 1.0% respectively. Our method advances the development of cotton irrigation and nitrogen fertilization, providing new ideas for addressing the complexity and precision issues in agricultural resource management and for sustainable agricultural development.
[LG-24] Quantitative Verification of Fairness in Tree Ensembles
链接: https://arxiv.org/abs/2512.16386
作者: Zhenjiang Zhao,Takahisa Toda,Takashi Kitamura
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work focuses on quantitative verification of fairness in tree ensembles. Unlike traditional verification approaches that merely return a single counterexample when the fairness is violated, quantitative verification estimates the ratio of all counterexamples and characterizes the regions where they occur, which is important information for diagnosing and mitigating bias. To date, quantitative verification has been explored almost exclusively for deep neural networks (DNNs). Representative methods, such as DeepGemini and FairQuant, all build on the core idea of Counterexample-Guided Abstraction Refinement, a generic framework that could be adapted to other model classes. We extended the framework into a model-agnostic form, but discovered two limitations: (i) it can provide only lower bounds, and (ii) its performance scales poorly. Exploiting the discrete structure of tree ensembles, our work proposes an efficient quantification technique that delivers any-time upper and lower bounds. Experiments on five widely used datasets demonstrate its effectiveness and efficiency. When applied to fairness testing, our quantification method significantly outperforms state-of-the-art testing techniques.
[LG-25] Multivariate Uncertainty Quantification with Tomographic Quantile Forests
链接: https://arxiv.org/abs/2512.16383
作者: Takuya Kanazawa
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 27 pages, 23 figures
Abstract:Quantifying predictive uncertainty is essential for safe and trustworthy real-world AI deployment. Yet, fully nonparametric estimation of conditional distributions remains challenging for multivariate targets. We propose Tomographic Quantile Forests (TQF), a nonparametric, uncertainty-aware, tree-based regression model for multivariate targets. TQF learns conditional quantiles of directional projections $\mathbf{n}^\top\mathbf{y}$ as functions of the input $\mathbf{x}$ and the unit direction $\mathbf{n}$. At inference, it aggregates quantiles across many directions and reconstructs the multivariate conditional distribution by minimizing the sliced Wasserstein distance via an efficient alternating scheme with convex subproblems. Unlike classical directional-quantile approaches that typically produce only convex quantile regions and require training separate models for different directions, TQF covers all directions with a single model without imposing convexity restrictions. We evaluate TQF on synthetic and real-world datasets, and release the source code on GitHub.
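The core training idea, learning conditional quantiles of the projection $\mathbf{n}^\top\mathbf{y}$ as a function of $(\mathbf{x}, \mathbf{n})$ with one model, can be sketched with off-the-shelf quantile regression. Using gradient-boosted trees here is our substitution for the paper's forest; the direction sampling and the 0.9 quantile level are likewise assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_samples, d_x, d_y = 2000, 3, 2
X = rng.normal(size=(n_samples, d_x))
Y = X[:, :d_y] + 0.3 * rng.normal(size=(n_samples, d_y))   # toy multivariate target

# Pair each sample with a random unit direction n and the projected target n^T y.
N = rng.normal(size=(n_samples, d_y))
N /= np.linalg.norm(N, axis=1, keepdims=True)
features = np.hstack([X, N])                 # the model sees (x, n)
proj = np.sum(N * Y, axis=1)

q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(features, proj)
# At inference, TQF aggregates such directional quantiles over many directions
# and reconstructs the conditional law by sliced-Wasserstein minimization.
print(q90.predict(features[:3]))
```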
[LG-26] In-Context Probing for Membership Inference in Fine-Tuned Language Models
链接: https://arxiv.org/abs/2512.16292
作者: Zhexi Lu,Hongliang Chi,Nathalie Baracaldo,Swanand Ravindra Kadhe,Yuseok Jeon,Lei Yu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample’s intrinsic properties - such as content difficulty or rarity - leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
[LG-27] Sharpness-aware Second-order Latent Factor Model for High-dimensional and Incomplete Data
链接: https://arxiv.org/abs/2512.16277
作者: Jialiang Wang,Xueyan Bao,Hao Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Second-order Latent Factor (SLF) model, a class of low-rank representation learning methods, has proven effective at extracting node-to-node interaction patterns from High-dimensional and Incomplete (HDI) data. However, its optimization is notoriously difficult due to its bilinear and non-convex nature. Sharpness-aware Minimization (SAM) was recently proposed to find flat local minima when minimizing non-convex objectives, thereby improving the generalization of representation-learning models. To address this challenge, we propose a Sharpness-aware SLF (SSLF) model. SSLF embodies two key ideas: (1) acquiring second-order information via Hessian-vector products; and (2) injecting a sharpness term into the curvature (Hessian) through the designed Hessian-vector products. Experiments on multiple industrial datasets demonstrate that the proposed model consistently outperforms state-of-the-art baselines.
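The Hessian-vector product primitive mentioned in the abstract is cheap to realize with reverse-mode autodiff, without ever forming the Hessian. A minimal PyTorch sketch with a placeholder loss follows; how SSLF injects the sharpness term into this curvature is not specified by the abstract and is not reproduced here.

```python
import torch

w = torch.randn(10, requires_grad=True)
loss = (w ** 2).sum() + w[0] * w[1]          # placeholder non-convex-ish loss
v = torch.randn(10)                          # probe direction

# First backward pass keeps the graph so we can differentiate the gradient.
(grad,) = torch.autograd.grad(loss, w, create_graph=True)
# Differentiating grad . v yields the Hessian-vector product H v.
(hvp,) = torch.autograd.grad(grad @ v, w)
print(hvp)
```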
[LG-28] Sharpness-aware Federated Graph Learning WSDM’26
链接: https://arxiv.org/abs/2512.16247
作者: Ruiyu Li,Peige Zhao,Guangxia Li,Pengcheng Wu,Xingyu Gao,Zhiqiang Xu
类目: Machine Learning (cs.LG)
*备注: Accepted by WSDM’26
Abstract:One of many impediments to applying graph neural networks (GNNs) to large-scale real-world graph data is the challenge of centralized training, which requires aggregating data from different organizations, raising privacy concerns. Federated graph learning (FGL) addresses this by enabling collaborative GNN model training without sharing private data. However, a core challenge in FGL systems is the variation in local training data distributions among clients, known as the data heterogeneity problem. Most existing solutions suffer from two problems: (1) The typical optimizer based on empirical risk minimization tends to cause local models to fall into sharp valleys and weakens their generalization to out-of-distribution graph data. (2) The prevalent dimensional collapse in the learned representations of local graph data has an adverse impact on the classification capacity of the GNN model. To this end, we formulate a novel optimization objective that is aware of the sharpness (i.e., the curvature of the loss surface) of local GNN models. By minimizing the loss function and its sharpness simultaneously, we seek out model parameters in a flat region with uniformly low loss values, thus improving the generalization over heterogeneous data. By introducing a regularizer based on the correlation matrix of local representations, we relax the correlations of representations generated by individual local graph samples, so as to alleviate the dimensional collapse of the learned model. The proposed Sharpness-aware fEderated grAph Learning (SEAL) algorithm can enhance the classification accuracy and generalization ability of local GNN models in federated graph learning. Experimental studies on several graph classification benchmarks show that SEAL consistently outperforms SOTA FGL baselines and provides gains for more participants.
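The correlation-matrix regularizer can be illustrated with a Barlow-Twins-style off-diagonal penalty that relaxes correlations between representation dimensions to counteract dimensional collapse. The exact form SEAL uses is not given in the abstract, so treat this sketch as an assumption.

```python
import torch

def decorrelation_loss(z):                    # z: (batch, dim) local representations
    z = (z - z.mean(0)) / (z.std(0) + 1e-6)   # standardize each dimension
    corr = (z.T @ z) / z.shape[0]             # empirical correlation matrix
    off_diag = corr - torch.diag(torch.diagonal(corr))
    return (off_diag ** 2).sum()              # push cross-dimension correlations to zero

print(decorrelation_loss(torch.randn(128, 32)))
```

In a federated setting this term would be added to each client's local objective alongside the sharpness-aware loss.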
[LG-29] Explicit and Non-asymptotic Query Complexities of Rank-Based Zeroth-order Algorithms on Smooth Functions
链接: https://arxiv.org/abs/2512.16200
作者: Haishan Ye
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Rank-based zeroth-order (ZO) optimization – which relies only on the ordering of function evaluations – offers strong robustness to noise and monotone transformations, and underlies many successful algorithms such as CMA-ES, natural evolution strategies, and rank-based genetic algorithms. Despite its widespread use, the theoretical understanding of rank-based ZO methods remains limited: existing analyses provide only asymptotic insights and do not yield explicit convergence rates for algorithms selecting the top-$k$ directions. This work closes this gap by analyzing a simple rank-based ZO algorithm and establishing the first explicit and non-asymptotic query complexities. For a $d$-dimensional problem, if the function is $L$-smooth and $\mu$-strongly convex, the algorithm achieves a query complexity of $\widetilde{\mathcal{O}}\left(\frac{dL}{\mu}\log\frac{dL}{\mu\delta}\log\frac{1}{\varepsilon}\right)$ to find an $\varepsilon$-suboptimal solution, and for smooth nonconvex objectives it reaches $\mathcal{O}\left(\frac{dL}{\varepsilon}\log\frac{1}{\varepsilon}\right)$. Here $\mathcal{O}(\cdot)$ hides constant terms and $\widetilde{\mathcal{O}}(\cdot)$ hides an extra $\log\log\frac{1}{\varepsilon}$ term. These query complexities hold with probability at least $1-\delta$ for $0<\delta<1$. The analysis in this paper is novel and avoids classical drift and information-geometric techniques. Our analysis offers new insight into why rank-based heuristics lead to efficient ZO optimization.
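A minimal NumPy sketch of the kind of rank-based ZO step analyzed here: probe random unit directions, use only the ranking of function values (never the values themselves), and move along the average of the top-$k$ directions. The step size, $k$, probe count, and objective are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sum((x - 1.0) ** 2)          # placeholder smooth objective

x, step, n_probes, k = np.zeros(20), 0.1, 32, 4
for _ in range(500):
    dirs = rng.normal(size=(n_probes, x.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    vals = np.array([f(x + step * u) for u in dirs])
    best = dirs[np.argsort(vals)[:k]]          # only the ordering is used
    x = x + step * best.mean(axis=0)           # step along the top-k average
print(f(x))                                    # drops far below f(0) = 20
```

Because the update depends only on rankings, the same iterates would result under any monotone transformation of $f$, which is the robustness property the abstract highlights.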
[LG-30] A Multi-scale Fused Graph Neural Network with Inter-view Contrastive Learning for Spatial Transcriptomics Data Clustering
链接: https://arxiv.org/abs/2512.16188
作者: Jianping Mei,Siqi Ai,Ye Yuan
类目: Machine Learning (cs.LG)
*备注: 15 pages, 3 figures
Abstract:Spatial transcriptomics enables genome-wide expression analysis within native tissue context, yet identifying spatial domains remains challenging due to complex gene-spatial interactions. Existing methods typically process spatial and feature views separately, fusing only at the output level - an "encode-separately, fuse-late" paradigm that limits multi-scale semantic capture and cross-view interaction. Accordingly, we propose stMFG, a multi-scale interactive fusion graph network that introduces layer-wise cross-view attention to dynamically integrate spatial and gene features after each convolution. The model combines cross-view contrastive learning with spatial constraints to enhance discriminability while maintaining spatial continuity. On DLPFC and breast cancer datasets, stMFG outperforms state-of-the-art methods, achieving up to 14% ARI improvement on certain slices.
[LG-31] A Multimodal Approach to Alzheimer's Diagnosis: Geometric Insights from Cube Copying and Cognitive Assessments
链接: https://arxiv.org/abs/2512.16184
作者: Jaeho Yang,Kijung Yoon
类目: Machine Learning (cs.LG)
*备注:
Abstract:Early and accessible detection of Alzheimer’s disease (AD) remains a critical clinical challenge, and cube-copying tasks offer a simple yet informative assessment of visuospatial function. This work proposes a multimodal framework that converts hand-drawn cube sketches into graph-structured representations capturing geometric and topological properties, and integrates these features with demographic information and neuropsychological test (NPT) scores for AD classification. Cube drawings are modeled as graphs with node features encoding spatial coordinates, local graphlet-based topology, and angular geometry, which are processed using graph neural networks and fused with age, education, and NPT features in a late-fusion model. Experimental results show that graph-based representations provide a strong unimodal baseline and substantially outperform pixel-based convolutional models, while multimodal integration further improves performance and robustness to class imbalance. SHAP-based interpretability analysis identifies specific graphlet motifs and geometric distortions as key predictors, closely aligning with clinical observations of disorganized cube drawings in AD. Together, these results establish graph-based analysis of cube copying as an interpretable, non-invasive, and scalable approach for Alzheimer’s disease screening.
[LG-32] Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference
链接: https://arxiv.org/abs/2512.16134
作者: Jian Tian,Shuailong Li,Yang Cao,Wenbo Cui,Minghan Zhu,Wenkang Wu,Jianming Zhang,Yanpeng Wang,Zhiwen Xiao,Zhenyu Hou,Dou Shen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:The evolution of Large Language Model (LLM) serving towards complex, distributed architectures–specifically the P/D-separated, large-scale DP+EP paradigm–introduces distinct scheduling challenges. Unlike traditional deployments where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units for both Prefill and Decode phases. Deployed on a production H800 cluster serving Deepseek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.
[LG-33] BUILD with Precision: Bottom-Up Inference of Linear DAGs
链接: https://arxiv.org/abs/2512.16111
作者: Hamed Ajorlou,Samuel Rey,Gonzalo Mateos,Geert Leus,Antonio G. Marques
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Learning the structure of directed acyclic graphs (DAGs) from observational data is a central problem in causal discovery, statistical signal processing, and machine learning. Under a linear Gaussian structural equation model (SEM) with equal noise variances, the problem is identifiable and we show that the ensemble precision matrix of the observations exhibits a distinctive structure that facilitates DAG recovery. Exploiting this property, we propose BUILD (Bottom-Up Inference of Linear DAGs), a deterministic stepwise algorithm that identifies leaf nodes and their parents, then prunes the leaves by removing incident edges to proceed to the next step, exactly reconstructing the DAG from the true precision matrix. In practice, precision matrices must be estimated from finite data, and ill-conditioning may lead to error accumulation across BUILD steps. As a mitigation strategy, we periodically re-estimate the precision matrix (with fewer variables as leaves are pruned), trading off runtime for enhanced robustness. Reproducible results on challenging synthetic benchmarks demonstrate that BUILD compares favorably to state-of-the-art DAG learning algorithms, while offering an explicit handle on complexity.
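A minimal sketch of the bottom-up loop described above, using two standard facts for equal-variance linear Gaussian SEMs: leaves attain the minimal diagonal entry of the precision matrix, and a leaf's parents appear as its nonzero off-diagonal entries. "Pruning" a leaf is done here by marginalizing it out with a Schur complement; BUILD's actual leaf test, thresholds, and re-estimation schedule may differ, so this is an assumption-laden illustration.

```python
import numpy as np

def build_like(theta, tol=1e-6):
    """Recover edges of a linear Gaussian SEM DAG from its true precision matrix."""
    theta = theta.copy()
    nodes = list(range(theta.shape[0]))
    edges = []
    while len(nodes) > 1:
        i = int(np.argmin(np.diag(theta)))               # leaf = minimal diagonal
        parents = [nodes[j] for j in range(len(nodes))
                   if j != i and abs(theta[i, j]) > tol]  # nonzero off-diagonals
        edges += [(p, nodes[i]) for p in parents]
        # Prune the leaf: marginalize it out via a Schur complement.
        keep = [j for j in range(len(nodes)) if j != i]
        theta = (theta[np.ix_(keep, keep)]
                 - np.outer(theta[keep, i], theta[i, keep]) / theta[i, i])
        nodes = [nodes[j] for j in keep]
    return edges

# Chain DAG 0 -> 1 -> 2 with unit edge weights and unit noise variance.
B = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
theta = (np.eye(3) - B) @ (np.eye(3) - B).T
print(build_like(theta))   # [(1, 2), (0, 1)]
```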
[LG-34] Privacy Blur: Quantifying Privacy and Utility for Image Data Release
链接: https://arxiv.org/abs/2512.16086
作者: Saeed Mahloujifar,Narine Kokhlikyan,Chuan Guo,Kamalika Chaudhuri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Image data collected in the wild often contains private information such as faces and license plates, and responsible data release must ensure that this information stays hidden. At the same time, released data should retain its usefulness for model-training. The standard method for private information obfuscation in images is Gaussian blurring. In this work, we show that practical implementations of Gaussian blurring are reversible enough to break privacy. We then take a closer look at the privacy-utility tradeoffs offered by three other obfuscation algorithms – pixelization, pixelization and noise addition (DP-Pix), and cropping. Privacy is evaluated by reversal and discrimination attacks, while utility by the quality of the learnt representations when the model is trained on data with obfuscated faces. We show that the most popular industry-standard method, Gaussian blur is the least private of the four – being susceptible to reversal attacks in its practical low-precision implementations. In contrast, pixelization and pixelization plus noise addition, when used at the right level of granularity, offer both privacy and utility for a number of computer vision tasks. We make our proposed methods together with suggested parameters available in a software package called Privacy Blur.
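The pixelization-plus-noise mechanism (DP-Pix style) is easy to sketch: average each b x b block, add Laplace noise calibrated to the block sensitivity, and upsample back. The sensitivity bound below follows the usual DP-Pix formulation with m the number of pixels one person can change; all parameter values are illustrative and the paper's exact calibration is not reproduced here.

```python
import numpy as np

def dp_pixelize(img, b=8, eps=1.0, m=16):
    """img: (H, W) grayscale in [0, 255]; m bounds pixels changed per person."""
    h, w = (img.shape[0] // b) * b, (img.shape[1] // b) * b
    blocks = img[:h, :w].reshape(h // b, b, w // b, b).mean(axis=(1, 3))
    sensitivity = 255.0 * m / (b * b)        # standard DP-Pix sensitivity bound
    noisy = blocks + np.random.laplace(scale=sensitivity / eps, size=blocks.shape)
    # Upsample each noisy block back to b x b pixels.
    return np.clip(np.kron(noisy, np.ones((b, b))), 0, 255)

out = dp_pixelize(np.random.randint(0, 256, (64, 64)).astype(float))
print(out.shape)  # (64, 64)
```

Larger blocks lower the sensitivity (less noise) but discard more detail, which is exactly the privacy-utility knob the paper studies.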
[LG-35] In-Context Multi-Operator Learning with DeepOSets
链接: https://arxiv.org/abs/2512.16074
作者: Shao-Ting Chiu,Aditya Nambiar,Ali Syed,Jonathan W. Siegel,Ulisses Braga-Neto
类目: Machine Learning (cs.LG)
*备注:
Abstract:In-context Learning (ICL) is the remarkable capability displayed by some machine learning models to learn from examples in a prompt, without any further weight updates. ICL had originally been thought to emerge from the self-attention mechanism in autoregressive transformer architectures. DeepOSets is a non-autoregressive, non-attention based neural architecture that combines set learning via the DeepSets architecture with operator learning via Deep Operator Networks (DeepONets). In a previous study, DeepOSets was shown to display ICL capabilities in supervised learning problems. In this paper, we show that the DeepOSets architecture, with the appropriate modifications, is a multi-operator in-context learner that can recover the solution operator of a new PDE, not seen during training, from example pairs of parameter and solution placed in a user prompt, without any weight updates. Furthermore, we show that DeepOSets is a universal uniform approximator over a class of continuous operators, which we believe is the first result of its kind in the literature of scientific machine learning. This means that a single DeepOSets architecture exists that approximates in-context any continuous operator in the class to any fixed desired degree of accuracy, given an appropriate number of examples in the prompt. Experiments with Poisson and reaction-diffusion forward and inverse boundary-value problems demonstrate the ability of the proposed model to use in-context examples to predict accurately the solutions corresponding to parameter queries for PDEs not seen during training.
[LG-36] Explainable AI in Big Data Fraud Detection
链接: https://arxiv.org/abs/2512.16037
作者: Ayush Jain,Rahul Kulkarni,Siyi Lin
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, research project
Abstract:Big Data has become central to modern applications in finance, insurance, and cybersecurity, enabling machine learning systems to perform large-scale risk assessments and fraud detection. However, the increasing dependence on automated analytics introduces important concerns about transparency, regulatory compliance, and trust. This paper examines how explainable artificial intelligence (XAI) can be integrated into Big Data analytics pipelines for fraud detection and risk management. We review key Big Data characteristics and survey major analytical tools, including distributed storage systems, streaming platforms, and advanced fraud detection models such as anomaly detectors, graph-based approaches, and ensemble classifiers. We also present a structured review of widely used XAI methods, including LIME, SHAP, counterfactual explanations, and attention mechanisms, and analyze their strengths and limitations when deployed at scale. Based on these findings, we identify key research gaps related to scalability, real-time processing, and explainability for graph and temporal models. To address these challenges, we outline a conceptual framework that integrates scalable Big Data infrastructure with context-aware explanation mechanisms and human feedback. The paper concludes with open research directions in scalable XAI, privacy-aware explanations, and standardized evaluation methods for explainable fraud detection systems.
[LG-37] Techno-economic optimization of a heat-pipe microreactor part I: theory and cost optimization
链接: https://arxiv.org/abs/2512.16032
作者: Paul Seurin,Dean Price,Luis Nunez
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Microreactors, particularly heat-pipe microreactors (HPMRs), are compact, transportable, self-regulated power systems well-suited for access-challenged remote areas where costly fossil fuels dominate. However, they suffer from diseconomies of scale, and their financial viability remains unconvincing. One step in addressing this shortcoming is to design these reactors with comprehensive economic and physics analyses informing early-stage design iteration. In this work, we present a novel unifying geometric design optimization approach that accounts for techno-economic considerations. We start by generating random samples to train surrogate models, including Gaussian processes (GPs) and multi-layer perceptrons (MLPs). We then deploy these surrogates within a reinforcement learning (RL)-based optimization framework to optimize the levelized cost of electricity (LCOE), all the while imposing constraints on the fuel lifetime, shutdown margin (SDM), peak heat flux, and rod-integrated peaking factor. We study two cases: one in which the axial reflector cost is very high, and one in which it is inexpensive. We found that the operation and maintenance and capital costs are the primary contributors to the overall LCOE particularly the cost of the axial reflectors (for the first case) and the control drum materials. The optimizer cleverly changes the design parameters so as to minimize one of them while still satisfying the constraints, ultimately reducing the LCOE by more than 57% in both instances. A comprehensive integration of fuel and HP performance with multi-objective optimization is currently being pursued to fully understand the interaction between constraints and cost performance.
[LG-38] Information theory and discriminative sampling for model discovery
链接: https://arxiv.org/abs/2512.16000
作者: Yuxuan Bao,J. Nathan Kutz
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Dynamical Systems (math.DS)
*备注:
Abstract:Fisher information and Shannon entropy are fundamental tools for understanding and analyzing dynamical systems from complementary perspectives. They can characterize unknown parameters by quantifying the information contained in variables, or measure how different initial trajectories or temporal segments of a trajectory contribute to learning or inferring system dynamics. In this work, we leverage the Fisher Information Matrix (FIM) within the data-driven framework of sparse identification of nonlinear dynamics (SINDy). We visualize information patterns in chaotic and non-chaotic systems for both single trajectories and multiple initial conditions, demonstrating how information-based analysis can improve sampling efficiency and enhance model performance by prioritizing more informative data. The benefits of statistical bagging are further elucidated through spectral analysis of the FIM. We also illustrate how Fisher information and entropy metrics can promote data efficiency in three scenarios: when only a single trajectory is available, when a tunable control parameter exists, and when multiple trajectories can be freely initialized. As data-driven model discovery continues to gain prominence, principled sampling strategies guided by quantifiable information metrics offer a powerful approach for improving learning efficiency and reducing data requirements.
[LG-39] Higher-Order LaSDI: Reduced Order Modeling with Multiple Time Derivatives
链接: https://arxiv.org/abs/2512.15997
作者: Robert Stephany,William Michael Anderson,Youngsoo Choi
类目: Machine Learning (cs.LG)
*备注: 38 pages, 14 figures
Abstract:Solving complex partial differential equations is vital in the physical sciences, but often requires computationally expensive numerical methods. Reduced-order models (ROMs) address this by exploiting dimensionality reduction to create fast approximations. While modern ROMs can solve parameterized families of PDEs, their predictive power degrades over long time horizons. We address this by (1) introducing a flexible, high-order, yet inexpensive finite-difference scheme and (2) proposing a Rollout loss that trains ROMs to make accurate predictions over arbitrary time horizons. We demonstrate our approach on the 2D Burgers equation.
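The Rollout loss idea can be illustrated in a few lines of PyTorch: unroll the learned latent dynamics for several steps and penalize the accumulated trajectory error rather than a single-step prediction. The forward-Euler integrator, horizon, and network below are stand-ins; the paper's high-order finite-difference scheme is not reproduced here.

```python
import torch
import torch.nn as nn

dyn = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 8))  # learned latent ODE

def rollout_loss(z_traj, dt=0.01, horizon=10):
    """z_traj: (T, d) ground-truth latent trajectory with T > horizon."""
    loss, z = 0.0, z_traj[0]
    for t in range(1, horizon + 1):
        z = z + dt * dyn(z)                  # integrate the learned dynamics forward
        loss = loss + ((z - z_traj[t]) ** 2).mean()
    return loss / horizon                    # error accumulated over the whole rollout

print(rollout_loss(torch.randn(20, 8)))
```

Training against multi-step rollouts directly penalizes the error growth that otherwise degrades ROM predictions over long horizons.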
[LG-40] Time-Frequency Analysis for Neural Networks
链接: https://arxiv.org/abs/2512.15992
作者: Ahmed Abdeljawad,Elena Cordero
类目: Numerical Analysis (math.NA); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:We develop a quantitative approximation theory for shallow neural networks using tools from time-frequency analysis. Working in weighted modulation spaces $M^{p,q}_m(\mathbf{R}^d)$, we prove dimension-independent approximation rates in Sobolev norms $W^{n,r}(\Omega)$ for networks whose units combine standard activations with localized time-frequency windows. Our main result shows that for $f \in M^{p,q}_m(\mathbf{R}^d)$ one can achieve $$\|f - f_N\|_{W^{n,r}(\Omega)} \lesssim N^{-1/2}\,\|f\|_{M^{p,q}_m(\mathbf{R}^d)}$$ on bounded domains, with explicit control of all constants. We further obtain global approximation theorems on $\mathbf{R}^d$ using weighted modulation dictionaries, and derive consequences for Feichtinger's algebra, Fourier-Lebesgue spaces, and Barron spaces. Numerical experiments in one and two dimensions confirm that modulation-based networks achieve substantially better Sobolev approximation than standard ReLU networks, consistent with the theoretical estimates.
[LG-41] Hierarchical Neural Surfaces for 3D Mesh Compression
链接: https://arxiv.org/abs/2512.15985
作者: Sai Karthikey Pentapati,Gregoire Phillips,Alan Bovik
类目: Computational Geometry (cs.CG); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:
Abstract:Implicit Neural Representations (INRs) have been demonstrated to achieve state-of-the-art compression of a broad range of modalities such as images, videos, 3D surfaces, and audio. Most studies have focused on building neural counterparts of traditional implicit representations of 3D geometries, such as signed distance functions. However, the triangle mesh-based representation of geometry remains the most widely used representation in the industry, while building INRs capable of generating them has been sparsely studied. In this paper, we present a method for building compact INRs of zero-genus 3D manifolds. Our method relies on creating a spherical parameterization of a given 3D mesh - mapping the surface of a mesh to that of a unit sphere - then constructing an INR that encodes the displacement vector field defined continuously on its surface that regenerates the original shape. The compactness of our representation can be attributed to its hierarchical structure, wherein it first recovers the coarse structure of the encoded surface before adding high-frequency details to it. Once the INR is computed, 3D meshes of arbitrary resolution/connectivity can be decoded from it. The decoding can be performed in real time while achieving a state-of-the-art trade-off between reconstruction quality and the size of the compressed representations.
[LG-42] Tracking Wildfire Assets with Commodity RFID and Gaussian Process Modeling
链接: https://arxiv.org/abs/2512.15956
作者: John Hateley,Sriram Narasimhan,Omid Abari
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a novel, cost-effective, and scalable approach to track numerous assets distributed in forested environments using commodity Radio Frequency Identification (RFID), targeting wildfire response applications. Commodity RFID systems suffer from poor tag localization when dispersed in forested environments due to signal attenuation, multi-path effects, and environmental variability. Current methods address this issue via fingerprinting and rely on dispersing tags at known locations a priori. In this paper, we address the case when it is not possible to tag known locations and show that it is possible to localize tags to accuracies comparable to global positioning systems (GPS) without such a constraint. For this, we propose Gaussian Processes to model various environments solely based on RF signal response signatures, without the aid of additional sensors such as GPS or cameras, and match an unknown RF signature to the closest match in a model dictionary. We utilize a new weighted log-likelihood method to associate an unknown environment with the closest environment in a dictionary of previously modeled environments, which is a crucial step in being able to use our approach. Our results show that it is possible to achieve localization accuracies of the order of GPS, but with passive commodity RFID; this allows dozens of wildfire assets in the vicinity of mobile readers to be tracked simultaneously, does not require known positions to be tagged a priori, and comes at a fraction of the cost of GPS.
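The dictionary-matching idea can be sketched with scikit-learn: fit one GP per known environment mapping reader position to RF response, then score a new environment's readings under each GP and pick the best match. The plain Gaussian log-likelihood below stands in for the paper's weighted log-likelihood, whose weighting is not specified in the abstract; the synthetic "RF signatures" are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
def make_env(offset):
    X = rng.uniform(0, 10, (60, 1))                       # reader positions
    y = np.sin(X[:, 0] + offset) + 0.1 * rng.normal(size=60)  # RF signature
    return X, y

# One GP model per previously characterized environment.
dictionary = {name: GaussianProcessRegressor(RBF() + WhiteKernel()).fit(*make_env(o))
              for name, o in [("forest_A", 0.0), ("forest_B", 1.5)]}

X_new, y_new = make_env(1.5)                 # unknown environment, closest to B
scores = {}
for name, gp in dictionary.items():
    mu, sd = gp.predict(X_new, return_std=True)
    scores[name] = np.sum(-0.5 * ((y_new - mu) / sd) ** 2 - np.log(sd))
print(max(scores, key=scores.get))           # expected: forest_B
```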
[LG-43] Governance by Evidence: Regulated Predictors in Decision-Tree Models
链接: https://arxiv.org/abs/2512.15955
作者: Alexios Veskoukis,Dimitris Kalles
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decision-tree methods are widely used on structured tabular data and are valued for interpretability across many sectors. However, published studies often list the predictors they use (for example age, diagnosis codes, location). Privacy laws increasingly regulate such data types. We use published decision-tree papers as a proxy for real-world use of legally governed data. We compile a corpus of decision-tree studies and assign each reported predictor to a regulated data category (for example health data, biometric identifiers, children’s data, financial attributes, location traces, and government IDs). We then link each category to specific excerpts in European Union and United States privacy laws. We find that many reported predictors fall into regulated categories, with the largest shares in healthcare and clear differences across industries. We analyze prevalence, industry composition, and temporal patterns, and summarize regulation-aligned timing using each framework’s reference year. Our evidence supports privacy-preserving methods and governance checks, and can inform ML practice beyond decision trees.
[LG-44] AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines
链接: https://arxiv.org/abs/2512.15946
作者: Dimitrios Danopoulos,Enrico Lupi,Chang Sun,Sebastian Dittmeier,Michael Kagan,Vladimir Loncar,Maurizio Pierini
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:Efficient AI inference on AMD’s Versal AI Engine (AIE) is challenging due to tightly coupled VLIW execution, explicit datapaths, and local memory management. Prior work focused on first-generation AIE kernel optimizations, without tackling full neural network execution across the 2D array. In this work, we present AIE4ML, the first comprehensive framework for converting AI models automatically into optimized firmware targeting the AIE-ML generation devices, also with forward compatibility for the newer AIE-MLv2 architecture. At the single-kernel level, we attain performance close to the architectural peak. At the graph and system levels, we provide a structured parallelization method that can scale across the 2D AIE-ML fabric and exploit its dedicated memory tiles to stay entirely on-chip throughout the model execution. As a demonstration, we designed a generalized and highly efficient linear-layer implementation with intrinsic support for fused bias addition and ReLU activation. Also, as our framework necessitates the generation of multi-layer implementations, our approach systematically derives deterministic, compact, and topology-optimized placements tailored to the physical 2D grid of the device through a novel graph placement and search algorithm. Finally, the framework seamlessly accepts quantized models imported from high-level tools such as hls4ml or PyTorch while preserving bit-exactness. In layer scaling benchmarks, we achieve up to 98.6% efficiency relative to the single-kernel baseline, utilizing 296 of 304 AIE tiles (97.4%) of the device with entirely on-chip data movement. With evaluations across real-world model topologies, we demonstrate that AIE4ML delivers GPU-class throughput under microsecond latency constraints, making it a practical companion for ultra-low-latency environments such as trigger systems in particle physics experiments.
[LG-45] In-Context Semi-Supervised Learning
链接: https://arxiv.org/abs/2512.15934
作者: Jiashuo Fan,Paul Rosu,Aaron T. Wang,Michael Li,Lawrence Carin,Xiang Cheng
类目: Machine Learning (cs.LG)
*备注:
Abstract:There has been significant recent interest in understanding the capacity of Transformers for in-context learning (ICL), yet most theory focuses on supervised settings with explicitly labeled pairs. In practice, Transformers often perform well even when labels are sparse or absent, suggesting crucial structure within unlabeled contextual demonstrations. We introduce and study in-context semi-supervised learning (IC-SSL), where a small set of labeled examples is accompanied by many unlabeled points, and show that Transformers can leverage the unlabeled context to learn a robust, context-dependent representation. This representation enables accurate predictions and markedly improves performance in low-label regimes, offering foundational insights into how Transformers exploit unlabeled context for representation learning within the ICL framework.
[LG-46] BarcodeMamba: Advancing State-Space Models for Fungal Biodiversity Research NEURIPS2025
链接: https://arxiv.org/abs/2512.15931
作者: Tiancheng Gao,Scott C. Lowe,Brendan Furneaux,Angel X Chang,Graham W. Taylor
类目: Machine Learning (cs.LG)
*备注: 11 pages, accepted at the 3rd Workshop on Imageomics: Discovering Biological Knowledge from Images Using AI (NeurIPS 2025)
Abstract:Accurate taxonomic classification from DNA barcodes is a cornerstone of global biodiversity monitoring, yet fungi present extreme challenges due to sparse labelling and long-tailed taxa distributions. Conventional supervised learning methods often falter in this domain, struggling to generalize to unseen species and to capture the hierarchical nature of the data. To address these limitations, we introduce BarcodeMamba+, a foundation model for fungal barcode classification built on a powerful and efficient state-space model architecture. We employ a pretrain-and-fine-tune paradigm that utilizes partially labelled data, and we demonstrate that it is substantially more effective than traditional fully-supervised methods in this data-sparse environment. During fine-tuning, we systematically integrate and evaluate a suite of enhancements–including hierarchical label smoothing, a weighted loss function, and a multi-head output layer from MycoAI–to specifically tackle the challenges of fungal taxonomy. Our experiments show that each of these components yields significant performance gains. On a challenging fungal classification benchmark with distinct taxonomic distribution shifts from the broad training set, our final model outperforms a range of existing methods across all taxonomic levels. Our work provides a powerful new tool for genomics-based biodiversity research and establishes an effective and scalable training paradigm for this challenging domain. Our code is publicly available at this https URL.
[LG-47] A Unification of Discrete Gaussian and Simplicial Diffusion
链接: https://arxiv.org/abs/2512.15923
作者: Nuria Alina Chandra,Yucen Lily Li,Alan N. Amin,Alex Ali,Joshua Rollins,Sebastian W. Ober,Aniruddh Raghu,Andrew Gordon Wilson
类目: Machine Learning (cs.LG)
*备注:
Abstract:To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from a numerically unstable stochastic processes. Ideally we could see each of these models as instances of the same underlying framework, and enable practitioners to switch between models for downstream applications. However previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.
[LG-48] Introduction to Symbolic Regression in the Physical Sciences
链接: https://arxiv.org/abs/2512.15920
作者: Deaglan J. Bartlett,Harry Desmond,Pedro G. Ferreira,Gabriel Kronberger
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 8 pages, no figures; accepted in Royal Society Philosophical Transactions A special issue “Symbolic regression in the physical sciences”
Abstract:Symbolic regression (SR) has emerged as a powerful method for uncovering interpretable mathematical relationships from data, offering a novel route to both scientific discovery and efficient empirical modelling. This article introduces the Special Issue on Symbolic Regression for the Physical Sciences, motivated by the Royal Society discussion meeting held in April 2025. The contributions collected here span applications from automated equation discovery and emergent-phenomena modelling to the construction of compact emulators for computationally expensive simulations. The introductory review outlines the conceptual foundations of SR, contrasts it with conventional regression approaches, and surveys its main use cases in the physical sciences, including the derivation of effective theories, empirical functional forms and surrogate models. We summarise methodological considerations such as search-space design, operator selection, complexity control, feature selection, and integration with modern AI approaches. We also highlight ongoing challenges, including scalability, robustness to noise, overfitting and computational complexity. Finally we emphasise emerging directions, particularly the incorporation of symmetry constraints, asymptotic behaviour and other theoretical information. Taken together, the papers in this Special Issue illustrate the accelerating progress of SR and its growing relevance across the physical sciences.
[LG-49] Boosting t-SNE Efficiency for Sequencing Data: Insights from Kernel Selection
链接: https://arxiv.org/abs/2512.15900
作者: Avais Jan,Prakash Chourasia,Sarwan Ali,Murray Patterson
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dimensionality reduction techniques are essential for visualizing and analyzing high-dimensional biological sequencing data. t-distributed Stochastic Neighbor Embedding (t-SNE) is widely used for this purpose, traditionally employing the Gaussian kernel to compute pairwise similarities. However, the Gaussian kernel’s lack of data-dependence and computational overhead limit its scalability and effectiveness for categorical biological sequences. Recent work proposed the isolation kernel as an alternative, yet it may not optimally capture sequence similarities. In this study, we comprehensively evaluate nine different kernel functions for t-SNE applied to molecular sequences, using three embedding methods: One-Hot Encoding, Spike2Vec, and minimizers. Through both subjective visualization and objective metrics (including neighborhood preservation scores), we demonstrate that the cosine similarity kernel in general outperforms other kernels, including Gaussian and isolation kernels, achieving superior runtime efficiency and better preservation of pairwise distances in low-dimensional space. We further validate our findings through extensive classification and clustering experiments across six diverse biological datasets (Spike7k, Host, ShortRead, Rabies, Genome, and Breast Cancer), employing multiple machine learning algorithms and evaluation metrics. Our results show that kernel selection significantly impacts not only visualization quality but also downstream analytical tasks, with the cosine similarity kernel providing the most robust performance across different data types and embedding strategies, making it particularly suitable for large-scale biological sequence analysis.
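The practical takeaway is easy to demonstrate: scikit-learn's t-SNE accepts a cosine metric directly, so swapping the Gaussian-on-Euclidean default for cosine similarity on sequence embeddings is a one-line change. The random matrix below is an illustrative stand-in for the One-Hot, Spike2Vec, or minimizer embeddings the paper uses.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.random((300, 512))          # stand-in biological sequence embeddings

# metric="cosine" replaces the default Euclidean distances in the affinities.
coords = TSNE(n_components=2, metric="cosine", init="random",
              perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)                          # (300, 2)
```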
[LG-50] Secure AI-Driven Super-Resolution for Real-Time Mixed Reality Applications
链接: https://arxiv.org/abs/2512.15823
作者: Mohammad Waquas Usmani,Sankalpa Timilsina,Michael Zink,Susmit Shannigrahi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注:
Abstract:Immersive formats such as 360° and 6DoF point cloud videos require high bandwidth and low latency, posing challenges for real-time AR/VR streaming. This work focuses on reducing bandwidth consumption and encryption/decryption delay, two key contributors to overall latency. We design a system that downsamples point cloud content at the origin server and applies partial encryption. At the client, the content is decrypted and upscaled using an ML-based super-resolution model. Our evaluation demonstrates a nearly linear reduction in bandwidth/latency, and encryption/decryption overhead with lower downsampling resolutions, while the super-resolution model effectively reconstructs the original full-resolution point clouds with minimal error and modest inference time.
[LG-51] An empirical analysis of zero-day vulnerabilities disclosed by the zero day initiative
链接: https://arxiv.org/abs/2512.15803
作者: Apurva Shet,Izzat Alsmadi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Zero-day vulnerabilities represent some of the most critical threats in cybersecurity, as they correspond to previously unknown flaws in software or hardware that are actively exploited before vendors can develop and deploy patches. During this exposure window, affected systems remain defenseless, making zero-day attacks particularly damaging and difficult to mitigate. This study analyzes the Zero Day Initiative (ZDI) vulnerability disclosures reported between January and April 2024 (Cole [2025]), comprising a total of 415 vulnerabilities. The dataset includes vulnerability identifiers, Common Vulnerability Scoring System (CVSS) v3.0 scores, publication dates, and short textual descriptions. The primary objectives of this work are to identify trends in zero-day vulnerability disclosures, examine severity distributions across vendors, and investigate which vulnerability characteristics are most indicative of high severity. In addition, this study explores predictive modeling approaches for severity classification, comparing classical machine learning techniques with deep learning models using both structured metadata and unstructured textual descriptions. The findings aim to support improved patch prioritization strategies, more effective vulnerability management, and enhanced organizational preparedness against emerging zero-day threats.
[LG-52] Hyperparameter Tuning-Based Optimized Performance Analysis of Machine Learning Algorithms for Network Intrusion Detection
链接: https://arxiv.org/abs/2512.15779
作者: Sudhanshu Sekhar Tripathy,Bichitrananda Behera
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Network Intrusion Detection Systems (NIDS) are essential for securing networks by identifying and mitigating unauthorized activities indicative of cyberattacks. As cyber threats grow increasingly sophisticated, NIDS must evolve to detect both emerging threats and deviations from normal behavior. This study explores the application of machine learning (ML) methods to improve NIDS accuracy by analyzing intricate structures in deep-featured network traffic records. Leveraging the 1999 KDD CUP intrusion dataset as a benchmark, this research evaluates and optimizes several ML algorithms, including Support Vector Machines (SVM), Naïve Bayes variants (MNB, BNB), Random Forest (RF), k-Nearest Neighbors (k-NN), Decision Trees (DT), AdaBoost, XGBoost, Logistic Regression (LR), Ridge Classifier, Passive-Aggressive (PA) Classifier, Rocchio Classifier, Artificial Neural Networks (ANN), and Perceptron (PPN). Initial evaluations without hyper-parameter optimization demonstrated suboptimal performance, highlighting the importance of tuning to enhance classification accuracy. After hyper-parameter optimization using grid and random search techniques, the SVM classifier achieved 99.12% accuracy with a 0.0091 False Alarm Rate (FAR), outperforming its default configuration (98.08% accuracy, 0.0123 FAR) and all other classifiers. This result confirms that SVM achieves the highest accuracy among the evaluated classifiers. We validated the effectiveness of all classifiers using a tenfold cross-validation approach, incorporating Recursive Feature Elimination (RFE) for feature selection to enhance the classifiers' accuracy and efficiency. Our outcomes indicate that ML classifiers are both adaptable and reliable, contributing to enhanced accuracy in systems for detecting network intrusions.
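摘要的核心流程是"网格搜索调优 SVM 超参数";下面是该流程的通用最小示意(KDD CUP 1999 数据需自行加载,此处以合成数据代替;参数网格仅为示例,非论文的搜索空间):

```python
# 示意:对 SVM 做网格搜索超参数调优(流程示意)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = GridSearchCV(pipe,
                    param_grid={"svm__C": [0.1, 1, 10, 100],
                                "svm__gamma": ["scale", 0.01, 0.001],
                                "svm__kernel": ["rbf"]},
                    cv=5, n_jobs=-1)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```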
[LG-53] RAMBO: Reliability Analysis for Mamba through Bit-flip attack Optimization
链接: https://arxiv.org/abs/2512.15778
作者: Sanjay Das,Swastik Bhattacharya,Shamik Kundu,Arnab Raha,Souvik Kundu,Kanad Basu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:State-space models (SSMs), exemplified by the Mamba architecture, have recently emerged as state-of-the-art sequence-modeling frameworks, offering linear-time scalability together with strong performance in long-context settings. Owing to their unique combination of efficiency, scalability, and expressive capacity, SSMs have become compelling alternatives to transformer-based models, which suffer from the quadratic computational and memory costs of attention mechanisms. As SSMs are increasingly deployed in real-world applications, it is critical to assess their susceptibility to both software- and hardware-level threats to ensure secure and reliable operation. Among such threats, hardware-induced bit-flip attacks (BFAs) pose a particularly severe risk by corrupting model parameters through memory faults, thereby undermining model accuracy and functional integrity. To investigate this vulnerability, we introduce RAMBO, the first BFA framework specifically designed to target Mamba-based architectures. Through experiments on the Mamba-1.4b model with the LAMBADA benchmark, a cloze-style word-prediction task, we demonstrate that flipping merely a single critical bit can catastrophically reduce accuracy from 74.64% to 0% and increase perplexity from 18.94 to 3.75 x 10^6. These results demonstrate the pronounced fragility of SSMs to adversarial perturbations.
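比特翻转本身的机制可以用几行 NumPy 演示:把 float32 权重的内存以整数视图访问并异或一位即可(RAMBO 如何搜索"最关键比特"属于论文贡献,下面只演示翻转本身的破坏力):

```python
# 示意:对 float32 权重做单比特翻转(bit-flip),模拟内存故障
import numpy as np

w = np.array([0.5], dtype=np.float32)   # 某个模型参数
bits = w.view(np.uint32)                # 同一块内存的整数视图,可直接操作比特
bits[0] ^= (1 << 30)                    # 翻转指数域中的一位(第 30 位)
print(w[0])                             # 0.5 -> 约 1.7e38,单比特即可摧毁数值
```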
[LG-54] Data Valuation for LLM Fine-Tuning: Efficient Shapley Value Approximation via Language Model Arithmetic
链接: https://arxiv.org/abs/2512.15765
作者: Mélissa Tamine,Otmane Sakhi,Benjamin Heymann
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注: 11 pages, 2 figures
Abstract:Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled workers. While some training data is publicly available, substantial investment is required to generate proprietary datasets, such as human preference annotations or to curate new ones from existing sources. As larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and data sources investment? Second, how can multiple data owners collaboratively pool their resources to train superior models while fairly distributing the benefits? This problem, data valuation, which is not specific to large language models, has been addressed by the machine learning community through the lens of cooperative game theory, with the Shapley value being the prevalent solution concept. However, computing Shapley values is notoriously expensive for data valuation, typically requiring numerous model retrainings, which can become prohibitive for large machine learning models. In this work, we demonstrate that this computational challenge is dramatically simplified for LLMs trained with Direct Preference Optimization (DPO). We show how the specific mathematical structure of DPO enables scalable Shapley value computation. We believe this observation unlocks many applications at the intersection of data valuation and large language models.
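作为背景,Shapley 值最常见的近似是基于随机排列的蒙特卡洛估计,即对每个排列累加各数据点的边际贡献(通用写法;论文的贡献在于利用 DPO 的数学结构避免对每个子集反复重训,下面的效用函数 u 仅为占位假设):

```python
# 示意:随机排列蒙特卡洛近似 Shapley 值
import random

def shapley_mc(players, utility, n_perm=2000, seed=0):
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_perm):
        perm = players[:]
        rng.shuffle(perm)
        coalition = []
        prev = utility(frozenset())
        for p in perm:
            coalition.append(p)
            cur = utility(frozenset(coalition))
            phi[p] += (cur - prev) / n_perm   # 边际贡献的平均即 Shapley 值
            prev = cur
    return phi

u = lambda S: len(S) + (2 if 2 in S else 0)   # 占位效用:数据点 2 额外贡献 2
print(shapley_mc([0, 1, 2, 3], u))            # 期望约 {0: 1, 1: 1, 2: 3, 3: 1}
```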
[LG-55] Machine Learning Framework for Thrombosis Risk Prediction in Rotary Blood Pumps
链接: https://arxiv.org/abs/2512.15761
作者: Christopher Blum,Michael Neidlin
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Thrombosis in rotary blood pumps arises from complex flow conditions that remain difficult to translate into reliable and interpretable risk predictions using existing computational models. This limitation reflects an incomplete understanding of how specific flow features contribute to thrombus initiation and growth. This study introduces an interpretable machine learning framework for spatial thrombosis assessment based directly on computational fluid dynamics-derived flow features. A logistic regression (LR) model combined with a structured feature-selection pipeline is used to derive a compact and physically interpretable feature set, including nonlinear feature combinations. The framework is trained using spatial risk patterns from a validated, macro-scale thrombosis model for two representative scenarios. The model reproduces the labeled risk distributions and identifies distinct sets of flow features associated with increased thrombosis risk. When applied to a centrifugal pump, despite training on a single axial pump operating point, the model predicts plausible thrombosis-prone regions. These results show that interpretable machine learning can link local flow features to thrombosis risk while remaining computationally efficient and mechanistically transparent. The low computational cost enables rapid thrombogenicity screening without repeated or costly simulations. The proposed framework complements physics-based thrombosis modeling and provides a methodological basis for integrating interpretable machine learning into CFD-driven thrombosis analysis and device design workflows.
[LG-56] A Tutorial on Dimensionless Learning: Geometric Interpretation and the Effect of Noise
链接: https://arxiv.org/abs/2512.15760
作者: Zhengtao Jake Gan,Xiaoyu Xie
类目: Machine Learning (cs.LG)
*备注: 27 pages, 14 figures, a tutorial with a github link and a website link for GUI
Abstract:Dimensionless learning is a data-driven framework for discovering dimensionless numbers and scaling laws from experimental measurements. This tutorial introduces the method, explaining how it transforms experimental data into compact physical laws that reveal the dimensional invariance between variables. The approach combines classical dimensional analysis with modern machine learning techniques. Starting from measurements of physical quantities, the method identifies the fundamental ways to combine variables into dimensionless groups, then uses neural networks to discover which combinations best predict the experimental output. A key innovation is a regularization technique that encourages the learned coefficients to take simple, interpretable values like integers or half-integers, making the discovered laws both accurate and physically meaningful. We systematically investigate how measurement noise and discrete sampling affect the discovery process, demonstrating that the regularization approach provides robustness to experimental uncertainties. The method successfully handles cases with single or multiple dimensionless numbers, revealing how different but equivalent representations can capture the same underlying physics. Despite recent progress, key challenges remain, including managing the computational cost of identifying multiple dimensionless groups, understanding the influence of data characteristics, automating the selection of relevant input variables, and developing user-friendly tools for experimentalists. This tutorial serves as both an educational resource and a practical guide for researchers seeking to apply dimensionless learning to their experimental data.
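量纲学习的出发点可以用经典 Buckingham Pi 来理解:无量纲组合的指数就是量纲矩阵的零空间向量。下面以管流为例做通用演示(仅为该思路的示意,变量选取为经典教材例子,并非教程的官方代码):

```python
# 示意:由量纲矩阵的零空间求无量纲组合(Buckingham Pi)
import numpy as np
from scipy.linalg import null_space

# 行 = 基本量纲 [M, L, T];列 = 变量 [密度 rho, 速度 U, 特征长度 D, 粘度 mu]
Dmat = np.array([[ 1,  0,  0,  1],    # 质量 M
                 [-3,  1,  1, -1],    # 长度 L
                 [ 0, -1,  0, -1]])   # 时间 T

ns = null_space(Dmat)                   # 每一列零空间向量给出一组无量纲指数
pi = ns[:, 0] / np.abs(ns[:, 0]).min()  # 缩放到整数附近;整体符号可翻转(Re 或 1/Re)
print(np.round(pi, 2))                  # ≈ ±[1, 1, 1, -1],即 Re = rho*U*D/mu
```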
[LG-57] Semantic-Constrained Federated Aggregation: Convergence Theory and Privacy-Utility Bounds for Knowledge-Enhanced Distributed Learning
链接: https://arxiv.org/abs/2512.15759
作者: Jahidul Arafat
类目: Machine Learning (cs.LG)
*备注: 13 pages, 4 figures, 9 tables; Federated Machine Learning
Abstract:Federated learning enables collaborative model training across distributed data sources but suffers from slow convergence under non-IID data conditions. Existing solutions employ algorithmic modifications treating all client updates identically, ignoring semantic validity. We introduce Semantic-Constrained Federated Aggregation (SCFA), a theoretically-grounded framework incorporating domain knowledge constraints into distributed optimization. We prove SCFA achieves convergence rate O(1/sqrt(T) + rho) where rho represents the constraint violation rate, establishing the first convergence theory for constraint-based federated learning. Our analysis shows constraints reduce effective data heterogeneity by 41% and improve privacy-utility tradeoffs through hypothesis space reduction by factor theta=0.37. Under (epsilon,delta)-differential privacy with epsilon=10, constraint regularization maintains utility within 3.7% of the non-private baseline versus 12.1% degradation for standard federated learning, representing a 2.7x improvement. We validate our framework on manufacturing predictive maintenance using Bosch production data with 1.18 million samples and 968 sensor features, constructing knowledge graphs encoding 3,000 constraints from ISA-95 and MASON ontologies. Experiments demonstrate 22% faster convergence, 41.3% model divergence reduction, and constraint violation thresholds where rho < 0.05 maintains 90% of optimal performance while rho > 0.18 causes catastrophic failure. Our theoretical predictions match empirical observations with R^2 > 0.90 across convergence, privacy, and violation-performance relationships.
[LG-58] Yantra AI – An intelligence platform which interacts with manufacturing operations
链接: https://arxiv.org/abs/2512.15758
作者: Varshini Krishnamurthy
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid growth of Industry 4.0 has transformed smart production by encouraging the use of real-time tracking, machine learning, and AI-driven systems to streamline operations. This dissertation focuses on designing and testing an intelligent production system for XRIT that addresses key problems such as energy management, predictive maintenance, and AI-powered decision support. Machine learning models are built into the system, including a Random Forest Classifier for proactive maintenance and an Isolation Forest for outlier detection; these models support decision-making and reduce downtime. Streamlit enables real-time data visualisation, giving operators interactive dashboards with live views of the production data. The system was tested with synthetic data and is designed to be scalable, so it can be used in real time in XRIT's production setting. An AI-powered virtual assistant built with GPT-4 gives operators real-time, actionable information, simplifying complex queries and improving operational decisions. Testing shows that the system substantially improves operational efficiency, energy management, and maintenance planning. Future work will focus on migrating the system to real-time data integration and exploring further improvements.
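摘要中两个模型的最小用法如下:Isolation Forest 找传感器异常读数,Random Forest 做"是否需要维护"的分类(数据为合成占位,标签规则亦为假设,并非 XRIT 的真实配置):

```python
# 示意:预测性维护中的异常检测 + 分类
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)
sensors = rng.normal(size=(500, 6))                          # 模拟温度/振动/能耗等读数
labels = (sensors[:, 0] + sensors[:, 1] > 1.2).astype(int)   # 模拟"需要维护"标签

iso = IsolationForest(contamination=0.05, random_state=0).fit(sensors)
anomaly = iso.predict(sensors)            # -1 = 异常读数, 1 = 正常

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(sensors, labels)
print(anomaly[:10], rf.predict(sensors[:5]))
```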
[LG-59] Twin Restricted Kernel Machines for Multiview Classification
链接: https://arxiv.org/abs/2512.15757
作者: A. Quadir,M. Sajid,Mushir Akhtar,M. Tanveer
类目: Machine Learning (cs.LG)
*备注: pp. 1-8
Abstract:Multi-view learning (MVL) is an emerging field in machine learning that focuses on improving generalization performance by leveraging complementary information from multiple perspectives or views. Various multi-view support vector machine (MvSVM) approaches have been developed, demonstrating significant success. However, these models face challenges in effectively capturing decision boundaries in high-dimensional spaces using the kernel trick. They are also prone to errors and struggle with view inconsistencies, which are common in multi-view datasets. In this work, we introduce the multiview twin restricted kernel machine (TMvRKM), a novel model that integrates the strengths of kernel machines with the multiview framework, addressing key computational and generalization challenges associated with traditional kernel-based approaches. Unlike traditional methods that rely on solving large quadratic programming problems (QPPs), the proposed TMvRKM efficiently determines an optimal separating hyperplane through a regularized least squares approach, enhancing both computational efficiency and classification performance. The primal objective of TMvRKM includes a coupling term designed to balance errors across multiple views effectively. By integrating early and late fusion strategies, TMvRKM leverages the collective information from all views during training while remaining flexible to variations specific to individual views. The proposed TMvRKM model is rigorously tested on UCI, KEEL, and AwA benchmark datasets. Both experimental results and statistical analyses consistently highlight its exceptional generalization performance, outperforming baseline models in every scenario.
[LG-60] KAN-Matrix: Visualizing Nonlinear Pairwise and Multivariate Contributions for Physical Insight
链接: https://arxiv.org/abs/2512.15755
作者: Luis A. De la Fuente,Hernan A. Moreno,Laura V. Alvarez,Hoshin V. Gupta
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 20 pages, 5 figures, 4 tables, and supplementary information
Abstract:Interpreting complex datasets remains a major challenge for scientists, particularly due to high dimensionality and collinearity among variables. We introduce a novel application of Kolmogorov-Arnold Networks (KANs) to enhance interpretability and parsimony beyond what traditional correlation analyses offer. We present two interpretable, color-coded visualization tools: the Pairwise KAN Matrix (PKAN) and the Multivariate KAN Contribution Matrix (MKAN). PKAN characterizes nonlinear associations between pairs of variables, while MKAN serves as a nonlinear feature-ranking tool that quantifies the relative contributions of inputs in predicting a target variable. These tools support pre-processing (e.g., feature selection, redundancy analysis) and post-processing (e.g., model explanation, physical insights) in model development workflows. Through experimental comparisons, we demonstrate that PKAN and MKAN yield more robust and informative results than Pearson Correlation and Mutual Information. By capturing the strength and functional forms of relationships, these matrices facilitate the discovery of hidden physical patterns and promote domain-informed model development.
[LG-61] A Special Case of Quadratic Extrapolation Under the Neural Tangent Kernel
链接: https://arxiv.org/abs/2512.15749
作者: Abiel Kim
类目: Machine Learning (cs.LG)
*备注: 13 pages
Abstract:It has been demonstrated both theoretically and empirically that the ReLU MLP tends to extrapolate linearly for an out-of-distribution evaluation point. The machine learning literature provides ample analysis of the mechanisms by which this linearity is induced. However, extrapolation at the origin under the NTK regime remains a largely unexplored special case. In particular, the infinite-dimensional feature map induced by the neural tangent kernel is not translationally invariant. This means that the study of an out-of-distribution evaluation point very far from the origin is not equivalent to the evaluation of a point very near the origin. Since the feature map is rotation invariant, these two special cases may represent the most canonically extreme bounds of ReLU NTK extrapolation. Ultimately, it is this loose recognition of the two special cases of extrapolation that motivates the discovery of quadratic extrapolation for an evaluation point close to the origin.
[LG-62] A Unified Generative-Predictive Framework for Deterministic Inverse Design
链接: https://arxiv.org/abs/2512.15746
作者: Reza T. Batley,Sourav Saha
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:Inverse design of heterogeneous material microstructures is a fundamentally ill-posed and famously computationally expensive problem. This is exacerbated by the high-dimensional design spaces associated with finely resolved images, multimodal input property streams, and a highly nonlinear forward physics. Whilst modern generative models excel at accurately modeling such complex forward behavior, most of them are not intrinsically structured to support fast, stable deterministic inversion with a physics-informed bias. This work introduces Janus, a unified generative-predictive framework to address this problem. Janus couples a deep encoder-decoder architecture with a predictive KHRONOS head, a separable neural architecture. Topologically speaking, Janus learns a latent manifold that is simultaneously isometric for generative inversion and pruned for physical prediction, with the joint objective inducing disentanglement of the latent space. Janus is first validated on the MNIST dataset, demonstrating high-fidelity reconstruction, accurate classification and diverse generative inversion of all ten target classes. It is then applied to the inverse design of heterogeneous microstructures labeled with thermal conductivity. It achieves a forward prediction accuracy of R^2=0.98 (2% relative error) and sub-5% pixelwise reconstruction error. Inverse solutions satisfy target properties to within 1% relative error. Inverting a sweep through properties reveals smooth traversal of the latent manifold, and UMAP visualization confirms the emergence of a low-dimensional, disentangled manifold. By unifying prediction and generation within a single latent space, Janus enables real-time, physics-informed inverse microstructure generation at a lower computational cost than that typically associated with classical optimization-based approaches.
[LG-63] How Do Graph Signals Affect Recommendation: Unveiling the Mystery of Low and High-Frequency Graph Signals
链接: https://arxiv.org/abs/2512.15744
作者: Feng Liu,Hao Cang,Huanhuan Yuan,Jiaqing Fan,Yongjing Hao,Fuzhen Zhuang,Guanfeng Liu,Pengpeng Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spectral graph neural networks (GNNs) are highly effective in modeling graph signals, with their success in recommendation often attributed to low-pass filtering. However, recent studies highlight the importance of high-frequency signals. The role of low-frequency and high-frequency graph signals in recommendation remains unclear. This paper aims to bridge this gap by investigating the influence of graph signals on recommendation performance. We theoretically prove that the effects of low-frequency and high-frequency graph signals are equivalent in recommendation tasks, as both contribute by smoothing the similarities between user-item pairs. To leverage this insight, we propose a frequency signal scaler, a plug-and-play module that adjusts the graph signal filter function to fine-tune the smoothness between user-item pairs, making it compatible with any GNN model. Additionally, we identify and prove that graph embedding-based methods cannot fully capture the characteristics of graph signals. To address this limitation, a space flip method is introduced to restore the expressive power of graph embeddings. Remarkably, we demonstrate that either low-frequency or high-frequency graph signals alone are sufficient for effective recommendations. Extensive experiments on four public datasets validate the effectiveness of our proposed methods. Code is available at this https URL.
[LG-64] SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference
链接: https://arxiv.org/abs/2512.15742
作者: Jeff Smith
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Kolmogorov-Arnold Networks (KANs) face a fundamental memory wall: their learned basis functions create parameter counts that impose extreme bandwidth demands, hindering deployment in memory-constrained environments. We show that Vision KANs exhibit a holographic topology, where information is distributed across the interference of splines rather than localized to specific edges. Consequently, traditional pruning fails (10% sparsity degrades mAP from 85.23% to 45%, a ~40-point drop). To address this, we present SHARe-KAN, a framework utilizing Gain-Shape-Bias Vector Quantization to exploit functional redundancy while preserving the dense topology. Coupled with LUTHAM, a hardware-aware compiler with static memory planning, we achieve an 88x runtime memory reduction (1.13 GB -> 12.91 MB) and match uncompressed baseline accuracy on PASCAL VOC. Profiling on the NVIDIA Ampere architecture confirms 90% L2 cache residency, demonstrating that the workload is decoupled from the DRAM bandwidth constraints inherent to spline-based architectures.
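Gain-Shape 向量量化的核心分解可以用几行 NumPy 说明:把权重向量拆成幅度(gain)与单位方向(shape),仅对 shape 查码本(通用示意;SHARe-KAN 的 Gain-Shape-Bias 设计与 LUTHAM 编译细节以原文为准):

```python
# 示意:Gain-Shape 向量量化
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 16))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)   # 码字均为单位向量

def gs_quantize(w):
    g = np.linalg.norm(w)                       # gain:幅度
    idx = int(np.argmax(codebook @ (w / g)))    # shape:按余弦找最近码字
    return g, idx

def gs_dequantize(g, idx):
    return g * codebook[idx]

w = rng.normal(size=16)
g, idx = gs_quantize(w)
print(np.linalg.norm(w - gs_dequantize(g, idx)))   # 重构误差
```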
[LG-65] Cartesian-nj: Extending e3nn to Irreducible Cartesian Tensor Product and Contraction
链接: https://arxiv.org/abs/2512.16882
作者: Zemin Xu,Chenyu Wu,Wenbo Xie,Daiqian Xie,P. Hu
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Equivariant atomistic machine learning models have brought substantial gains in both extrapolation capability and predictive accuracy. Depending on the basis of the space, two distinct types of irreducible representations are utilized. From architectures built upon spherical tensors (STs) to more recent formulations employing irreducible Cartesian tensors (ICTs), STs have remained dominant owing to their compactness, elegance, and theoretical completeness. Nevertheless, questions have persisted regarding whether ST constructions are the only viable design principle, motivating continued development of Cartesian networks. In this work, we introduce the Cartesian-3j and Cartesian-nj symbols, which serve as direct analogues of the Wigner-3j and Wigner-nj symbols defined for tensor coupling. These coefficients enable the combination of any two ICTs into a new ICT. Building on this foundation, we extend e3nn to support irreducible Cartesian tensor product, and we release the resulting Python package as cartnn. Within this framework, we implement Cartesian counterparts of MACE, NequIP, and Allegro, allowing the first systematic comparison of Cartesian and spherical models to assess whether Cartesian formulations may offer advantages under specific conditions. Using TACE as a representative example, we further examine whether architectures constructed from irreducible Cartesian tensor product and contraction (ICTP and ICTC) are conceptually well-founded in Cartesian space and whether opportunities remain for improving their design.
[LG-66] Few-Shot Specific Emitter Identification via Integrated Complex Variational Mode Decomposition and Spatial Attention Transfer
链接: https://arxiv.org/abs/2512.16786
作者: Chenyu Zhu,Zeyang Li,Ziyi Xie,Jie Zhang
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 14 pages, 12 Figures, 5 Table
Abstract:Specific emitter identification (SEI) utilizes passive hardware characteristics to authenticate transmitters, providing a robust physical-layer security solution. However, most deep-learning-based methods rely on extensive data or require prior information, which poses challenges in real-world scenarios with limited labeled data. We propose an integrated complex variational mode decomposition algorithm that decomposes and reconstructs complex-valued signals to approximate the original transmitted signals, thereby enabling more accurate feature extraction. We further utilize a temporal convolutional network to effectively model the sequential signal characteristics, and introduce a spatial attention mechanism to adaptively weight informative signal segments, significantly enhancing identification performance. Additionally, the branch network allows leveraging pre-trained weights from other data while reducing the need for auxiliary datasets. Ablation experiments on the simulated data demonstrate the effectiveness of each component of the model. An accuracy comparison on a public dataset reveals that our method achieves 96% accuracy using only 10 symbols without requiring any prior knowledge.
[LG-67] Non-Linear Strong Data-Processing for Quantum Hockey-Stick Divergences
链接: https://arxiv.org/abs/2512.16778
作者: Theshani Nuradha,Ian George,Christoph Hirche
类目: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Data-processing is a desired property of classical and quantum divergences and information measures. In information theory, the contraction coefficient measures how much the distinguishability of quantum states decreases when they are transmitted through a quantum channel, establishing linear strong data-processing inequalities (SDPI). However, these linear SDPI are not always tight and can be improved in most of the cases. In this work, we establish non-linear SDPI for quantum hockey-stick divergence for noisy channels that satisfy a certain noise criterion. We also note that our results improve upon existing linear SDPI for quantum hockey-stick divergences and also non-linear SDPI for classical hockey-stick divergence. We define F_\gamma curves generalizing Dobrushin curves for the quantum setting while characterizing SDPI for the sequential composition of heterogeneous channels. In addition, we derive reverse-Pinsker type inequalities for f-divergences with additional constraints on hockey-stick divergences. We show that these non-linear SDPI can establish tighter finite mixing times that cannot be achieved through linear SDPI. Furthermore, we find applications of these in establishing stronger privacy guarantees for the composition of sequential private quantum channels when privacy is quantified by quantum local differential privacy.
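作为背景,经典(概率分布情形)的 hockey-stick 散度与线性 SDPI 可写成如下标准形式,量子情形以态 rho、sigma 代入;以下只是标准定义的参照,并非论文原文公式:

```latex
% 经典 hockey-stick 散度(gamma >= 1);量子情形以态 rho, sigma 替换 P, Q
E_\gamma(P \,\|\, Q) \;=\; \sup_{A}\,\bigl[\, P(A) - \gamma\, Q(A) \,\bigr]
                    \;=\; \int \bigl(\mathrm{d}P - \gamma\, \mathrm{d}Q\bigr)_{+}

% 线性 SDPI:经过信道 N 后,可分辨性按收缩系数 eta_gamma 衰减;
% 本文将右端换成非线性的 F_gamma 曲线,得到更紧的刻画
E_\gamma\bigl(N(P) \,\|\, N(Q)\bigr) \;\le\; \eta_\gamma\, E_\gamma(P \,\|\, Q)
```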
[LG-68] On The Hidden Biases of Flow Matching Samplers
链接: https://arxiv.org/abs/2512.16768
作者: Soon Hoe Lim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 20 pages
Abstract:We study the implicit bias of flow matching (FM) samplers via the lens of empirical flow matching. Although population FM may produce gradient-field velocities resembling optimal transport (OT), we show that the empirical FM minimizer is almost never a gradient field, even when each conditional flow is. Consequently, empirical FM is intrinsically energetically suboptimal. In view of this, we analyze the kinetic energy of generated samples. With Gaussian sources, both instantaneous and integrated kinetic energies exhibit exponential concentration, while heavy-tailed sources lead to polynomial tails. These behaviors are governed primarily by the choice of source distribution rather than the data. Overall, these notes provide a concise mathematical account of the structural and energetic biases arising in empirical FM.
[LG-69] How accurate are foundational machine learning interatomic potentials for heterogeneous catalysis?
链接: https://arxiv.org/abs/2512.16702
作者: Luuk H. E. Kempen,Raffaele Cheula,Mie Andersen
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 16 pages, 5 figures, 1 table + supplementary information (37 pages, 16 figures, 15 tables)
[LG-70] Riemannian Stochastic Interpolants for Amorphous Particle Systems
链接: https://arxiv.org/abs/2512.16607
作者: Louis Grenioux,Leonardo Galliano,Ludovic Berthier,Giulio Biroli,Marylou Gabrié
类目: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Modern generative models hold great promise for accelerating diverse tasks involving the simulation of physical systems, but they must be adapted to the specific constraints of each domain. Significant progress has been made for biomolecules and crystalline materials. Here, we address amorphous materials (glasses), which are disordered particle systems lacking atomic periodicity. Sampling equilibrium configurations of glass-forming materials is a notoriously slow and difficult task. This obstacle could be overcome by developing a generative framework capable of producing equilibrium configurations with well-defined likelihoods. In this work, we address this challenge by leveraging an equivariant Riemannian stochastic interpolation framework which combines Riemannian stochastic interpolant and equivariant flow matching. Our method rigorously incorporates periodic boundary conditions and the symmetries of multi-component particle systems, adapting an equivariant graph neural network to operate directly on the torus. Our numerical experiments on model amorphous systems demonstrate that enforcing geometric and symmetry constraints significantly improves generative performance.
[LG-71] Muon is Provably Faster with Momentum Variance Reduction
链接: https://arxiv.org/abs/2512.16598
作者: Xun Qian,Hussein Rammal,Dmitry Kovalev,Peter Richtárik
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 31 pages, 4 figures
Abstract:Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen Non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum by momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which captures Muon, Scion and other specific Non-Euclidean LMO-based methods as special cases, and at the same time works with a more general smoothness assumption which better captures the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways. All of them improve the convergence rate from \mathcal{O}(1/K^{1/4}) to \mathcal{O}(1/K^{1/3}). Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that verify the superior performance of our proposed algorithms in terms of iteration complexity.
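摘要所说的"用 MVR 动量替换普通动量"可以用下面的对照式理解(通用记号的示意,beta 为动量参数,xi_t 为当前小批量;并非论文原式):

```latex
% 普通动量(Gluon/Muon/Scion 默认):
m_t \;=\; (1-\beta)\, m_{t-1} \;+\; \beta\, \nabla f(x_t;\, \xi_t)

% MVR(STORM 型)动量:用同一批样本在旧点处的梯度做方差修正
m_t \;=\; \nabla f(x_t;\, \xi_t) \;+\; (1-\beta)\,\bigl(m_{t-1} - \nabla f(x_{t-1};\, \xi_t)\bigr)

% 两者都再经非欧范数球上的 LMO 产生更新方向:
x_{t+1} \;=\; x_t \;+\; \gamma_t\, \arg\min_{\|z\| \le 1}\, \langle m_t,\, z \rangle
```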
[LG-72] Non-Asymptotic Global Convergence of PPO-Clip
链接: https://arxiv.org/abs/2512.16565
作者: Yin Liu,Qiming Dai,Junyu Zhang,Zaiwen Wen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) has gained attention for aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF). The actor-only variants of Proximal Policy Optimization (PPO) are widely applied for their efficiency. These algorithms incorporate a clipping mechanism to improve stability. Besides, a regularization term, such as the reverse KL-divergence or a more general f-divergence, is introduced to prevent policy drift. Despite their empirical success, a rigorous theoretical understanding of the problem and the algorithm's properties is limited. This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting with f-divergence regularization under the softmax policy parameterization. We derive a non-uniform Lipschitz smoothness condition and a Łojasiewicz inequality for the considered problem. Based on these, a non-asymptotic linear convergence rate to the globally optimal policy is established for the forward KL-regularizer. Furthermore, stationary convergence and local linear convergence are derived for the reverse KL-regularizer.
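便于对照,标准的 actor-only PPO-Clip 目标加散度正则可写为如下形式(通用写法:epsilon 为裁剪参数,lambda 为正则强度,正、反 KL 对应交换 D_f 的两个参数;并非论文原式):

```latex
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

L(\theta) \;=\; \mathbb{E}_t\Bigl[\min\bigl(r_t(\theta)\,\hat{A}_t,\;
  \mathrm{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t\bigr)\Bigr]
  \;-\; \lambda\, D_f\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)
```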
[LG-73] Predictive Inorganic Synthesis based on Machine Learning using Small Data sets: a case study of size-controlled Cu Nanoparticles
链接: https://arxiv.org/abs/2512.16545
作者: Brent Motmans,Digvijay Ghogare,Thijs G.I. van Wijk,An Hardy,Danny E.P. Vanpoucke
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 22 pages, 16 figures, 12 tables (including SI)
Abstract:Copper nanoparticles (Cu NPs) have a broad applicability, yet their synthesis is sensitive to subtle changes in reaction parameters. This sensitivity, combined with the time- and resource-intensive nature of experimental optimization, poses a major challenge in achieving reproducible and size-controlled synthesis. While Machine Learning (ML) shows promise in materials research, its application is often limited by scarcity of large high-quality experimental data sets. This study explores ML to predict the size of Cu NPs from microwave-assisted polyol synthesis using a small data set of 25 in-house performed syntheses. Latin Hypercube Sampling is used to efficiently cover the parameter space while creating the experimental data set. Ensemble regression models, built with the AMADEUS framework, successfully predict particle sizes with high accuracy (R^2 = 0.74), outperforming classical statistical approaches (R^2 = 0.60). Overall, this study highlights that, for lab-scale synthesis optimization, high-quality small datasets combined with classical, interpretable ML models outperform traditional statistical methods and are fully sufficient for quantitative synthesis prediction. This approach provides a sustainable and experimentally realistic pathway toward data-driven inorganic synthesis design.
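摘要中"用 Latin Hypercube Sampling 以 25 组实验覆盖参数空间"的做法可用 scipy 的 qmc 模块示意如下(参数名与取值范围均为假设,并非论文的实验设定):

```python
# 示意:Latin Hypercube Sampling 生成 25 组实验设计
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=4, seed=0)   # 4 个合成参数
unit = sampler.random(n=25)                 # [0, 1)^4 内的 25 个分层样本
# 假设的范围:温度(°C)、反应时间(min)、前驱体浓度(mM)、还原剂比例
lo, hi = [140, 5, 10, 0.5], [200, 60, 100, 2.0]
design = qmc.scale(unit, lo, hi)
print(design[:3])
```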
[LG-74] Advantages and limitations in the use of transfer learning for individual treatment effects in causal machine learning
链接: https://arxiv.org/abs/2512.16489
作者: Seyda Betul Aydin,Holger Brandt
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Generalizing causal knowledge across diverse environments is challenging, especially when estimates from large-scale datasets must be applied to smaller or systematically different contexts, where external validity is critical. Model-based estimators of individual treatment effects (ITE) from machine learning require large sample sizes, limiting their applicability in domains such as behavioral sciences with smaller datasets. We demonstrate how estimation of ITEs with Treatment Agnostic Representation Networks (TARNet; Shalit et al., 2017) can be improved by leveraging knowledge from source datasets and adapting it to new settings via transfer learning (TL-TARNet; Aloui et al., 2023). In simulations that vary source and sample sizes and consider both randomized and non-randomized intervention target settings, the transfer-learning extension TL-TARNet improves upon standard TARNet, reducing ITE error and attenuating bias when a large unbiased source is available and target samples are small. In an empirical application using the India Human Development Survey (IHDS-II), we estimate the effect of mothers’ firewood collection time on children’s weekly study time; transfer learning pulls the target mean ITEs toward the source ITE estimate, reducing bias in the estimates obtained without transfer. These results suggest that transfer learning for causal models can improve the estimation of ITE in small samples.
[LG-75] Global universal approximation with Brownian signatures
链接: https://arxiv.org/abs/2512.16396
作者: Mihriban Ceylan,David J. Prömel
类目: Probability (math.PR); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)
*备注:
Abstract:We establish L^p-type universal approximation theorems for general and non-anticipative functionals on suitable rough path spaces, showing that linear functionals acting on signatures of time-extended rough paths are dense with respect to an L^p-distance. To that end, we derive global universal approximation theorems for weighted rough path spaces. We demonstrate that these L^p-type universal approximation theorems apply in particular to Brownian motion. As a consequence, linear functionals on the signature of the time-extended Brownian motion can approximate any p-integrable stochastic process adapted to the Brownian filtration, including solutions to stochastic differential equations.
[LG-76] Can Transformers overcome the lack of data in the simulation of history-dependent flows?
链接: https://arxiv.org/abs/2512.16305
作者: P. Urdeitx,I. Alfaro,D. Gonzalez,F. Chinesta,E. Cueto
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:It is well known that the lack of information about certain variables necessary for the description of a dynamical system leads to the introduction of historical dependence (lack of Markovian character of the model) and noise. Traditionally, scientists have made up for these shortcomings by designing phenomenological variables that take into account this historical dependence (typically, conformational tensors in fluids). Often, these phenomenological variables are not easily measurable experimentally. In this work, we study to what extent Transformer architectures are able to cope with the lack of experimental data on these variables. The methodology is evaluated on three benchmark problems: a cylinder flow with no history dependence, a viscoelastic Couette flow modeled via the Oldroyd-B formalism, and a non-linear polymeric fluid described by the FENE model. Our results show that the Transformer outperforms a thermodynamically consistent, structure-preserving neural network with metriplectic bias in systems with missing experimental data, providing lower errors even in low-dimensional latent spaces. In contrast, for systems whose state variables can be fully known, the metriplectic model achieves superior performance.
[LG-77] DAG Learning from Zero-Inflated Count Data Using Continuous Optimization
链接: https://arxiv.org/abs/2512.16233
作者: Noriaki Sato,Marco Scutari,Shuichi Kawano,Rui Yamaguchi,Seiya Imoto
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We address network structure learning from zero-inflated count data by casting each node as a zero-inflated generalized linear model and optimizing a smooth, score-based objective under a directed acyclic graph constraint. Our Zero-Inflated Continuous Optimization (ZICO) approach uses node-wise likelihoods with canonical links and enforces acyclicity through a differentiable surrogate constraint combined with sparsity regularization. ZICO achieves superior performance with faster runtimes on simulated data. It also performs comparably to or better than common algorithms for reverse engineering gene regulatory networks. ZICO is fully vectorized and mini-batched, enabling learning on larger variable sets with practical runtimes in a wide range of domains.
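score-based DAG 学习里最常用的可微无环约束是 NOTEARS 提出的 h(W) = tr(exp(W∘W)) - d;ZICO 采用的"可微无环代理约束"属于这一类思路,具体形式以原文为准,下面仅演示该约束的性质:

```python
# 示意:可微无环约束 h(W)(NOTEARS 形式)
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """h(W) = tr(exp(W * W)) - d;当且仅当 W 对应的有向图无环时 h(W) = 0。"""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_dag = np.array([[0., 1.], [0., 0.]])   # 无环:h = 0
W_cyc = np.array([[0., 1.], [1., 0.]])   # 有环:h > 0
print(acyclicity(W_dag), acyclicity(W_cyc))
```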
[LG-78] Physics-Informed Neural Networks for Modeling the Martian Induced Magnetosphere
链接: https://arxiv.org/abs/2512.16175
作者: Jiawei Gao,Chuanfei Dong,Chi Zhang,Yilan Qin,Simin Shekarpaz,Xinmin Li,Liang Wang,Hongyang Zhou,Abigail Tadlock
类目: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG); Space Physics (physics.space-ph)
*备注:
[LG-79] Artificial Intelligence-Enabled Holistic Design of Catalysts Tailored for Semiconducting Carbon Nanotube Growth
链接: https://arxiv.org/abs/2512.16151
作者: Liu Qian,Yue Li,Ying Xie,Jian Zhang,Pai Li,Yue Yu,Zhe Liu,Feng Ding,Jin Zhang
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 16 pages and 4 figures in main text
[LG-80] BayesSum: Bayesian Quadrature in Discrete Spaces
链接: https://arxiv.org/abs/2512.16105
作者: Sophia Seulkee Kang,François-Xavier Briol,Toni Karvonen,Zonghao Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This paper addresses the challenging computational problem of estimating intractable expectations over discrete domains. Existing approaches, including Monte Carlo and Russian Roulette estimators, are consistent but often require a large number of samples to achieve accurate results. We propose a novel estimator, BayesSum, which is an extension of Bayesian quadrature to discrete domains. It is more sample efficient than alternatives due to its ability to make use of prior information about the integrand through a Gaussian process. We show this through theory, deriving a convergence rate significantly faster than Monte Carlo in a broad range of settings. We also demonstrate empirically that our proposed method does indeed require fewer samples on several synthetic settings as well as for parameter estimation for Conway-Maxwell-Poisson and Potts models.
[LG-81] Graph Neural Networks for Interferometer Simulations
链接: https://arxiv.org/abs/2512.16051
作者: Sidharth Kannan,Pooyan Goodarzi,Evangelos E. Papalexakis,Jonathan W. Richardson
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:
Abstract:In recent years, graph neural networks (GNNs) have shown tremendous promise in solving problems in high energy physics, materials science, and fluid dynamics. In this work, we introduce a new application for GNNs in the physical sciences: instrumentation design. As a case study, we apply GNNs to simulate models of the Laser Interferometer Gravitational-Wave Observatory (LIGO) and show that they are capable of accurately capturing the complex optical physics at play, while achieving runtimes 815 times faster than state of the art simulation packages. We discuss the unique challenges this problem provides for machine learning models. In addition, we provide a dataset of high-fidelity optical physics simulations for three interferometer topologies, which can be used as a benchmarking suite for future work in this direction.
[LG-82] Concurrence: A dependence criterion for time series applied to biological data
链接: https://arxiv.org/abs/2512.16001
作者: Evangelos Sariyanidi,John D. Herrington,Lisa Yankowitz,Pratik Chaudhari,Theodore D. Satterthwaite,Casey J. Zampella,Jeffrey S. Morris,Edward Gunning,Robert T. Schultz,Russell T. Shinohara,Birkan Tunc
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Measuring the statistical dependence between observed signals is a primary tool for scientific discovery. However, biological systems often exhibit complex non-linear interactions that currently cannot be captured without a priori knowledge or large datasets. We introduce a criterion for dependence, whereby two time series are deemed dependent if one can construct a classifier that distinguishes between temporally aligned vs. misaligned segments extracted from them. We show that this criterion, concurrence, is theoretically linked with dependence, and can become a standard approach for scientific analyses across disciplines, as it can expose relationships across a wide spectrum of signals (fMRI, physiological and behavioral data) without ad-hoc parameter tuning or large amounts of data.
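concurrence 判据的最小实现如下:训练一个分类器区分"时间对齐 vs 错位"的片段对,若交叉验证准确率显著高于 0.5,即判定两条序列相关(窗口长度、错位方式与模型选择均为示例设定,并非论文的精确流程):

```python
# 示意:用"对齐 vs 错位"分类检验两条时间序列的相关性
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
T, w = 2000, 20
x = rng.normal(size=T)
y = np.sin(x) + 0.3 * rng.normal(size=T)     # y 与 x 存在非线性依赖

pairs, labels = [], []
for s in rng.integers(0, T - w, size=400):
    m = (s + rng.integers(w, T - w)) % (T - w)                  # 错位后的起点
    pairs.append(np.r_[x[s:s+w], y[s:s+w]]); labels.append(1)   # 对齐
    pairs.append(np.r_[x[s:s+w], y[m:m+w]]); labels.append(0)   # 错位
acc = cross_val_score(RandomForestClassifier(random_state=0),
                      np.array(pairs), labels, cv=5).mean()
print(acc)   # 明显高于 0.5 => 判定 x 与 y 相关
```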
[LG-83] Consensus dimension reduction via multi-view learning
链接: https://arxiv.org/abs/2512.15802
作者: Bingxue An,Tiffany M. Tang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:A plethora of dimension reduction methods have been developed to visualize high-dimensional data in low dimensions. However, different dimension reduction methods often output different and possibly conflicting visualizations of the same data. This problem is further exacerbated by the choice of hyperparameters, which may substantially impact the resulting visualization. To obtain a more robust and trustworthy dimension reduction output, we advocate for a consensus approach, which summarizes multiple visualizations into a single consensus dimension reduction visualization. Here, we leverage ideas from multi-view learning in order to identify the patterns that are most stable or shared across the many different dimension reduction visualizations, or views, and subsequently visualize this shared structure in a single low-dimensional plot. We demonstrate that this consensus visualization effectively identifies and preserves the shared low-dimensional data structure through both simulated and real-world case studies. We further highlight our method’s robustness to the choice of dimension reduction method and hyperparameters – a highly-desirable property when working towards trustworthy and reproducible data science.
[LG-84] he Red Queens Trap: Limits of Deep Evolution in High-Frequency Trading
链接: https://arxiv.org/abs/2512.15732
作者: Yijia Chen
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Finance (q-fin.CP)
*备注:
Abstract:The integration of Deep Reinforcement Learning (DRL) and Evolutionary Computation (EC) is frequently hypothesized to be the “Holy Grail” of algorithmic trading, promising systems that adapt autonomously to non-stationary market regimes. This paper presents a rigorous post-mortem analysis of “Galaxy Empire,” a hybrid framework coupling LSTM/Transformer-based perception with a genetic “Time-is-Life” survival mechanism. Deploying a population of 500 autonomous agents in a high-frequency cryptocurrency environment, we observed a catastrophic divergence between training metrics (Validation APY > 300%) and live performance (Capital Decay > 70%). We deconstruct this failure through a multi-disciplinary lens, identifying three critical failure modes: the overfitting of aleatoric uncertainty in low-entropy time-series, the survivor bias inherent in evolutionary selection under high variance, and the mathematical impossibility of overcoming microstructure friction without order-flow data. Our findings provide empirical evidence that increasing model complexity in the absence of information asymmetry exacerbates systemic fragility.
[LG-85] Random matrix theory of sparse neuronal networks with heterogeneous timescales
链接: https://arxiv.org/abs/2512.12767
作者: Thiparat Chotibut,Oleg Evnin,Weerawit Horinouchi
类目: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Probability (math.PR)
*备注:
Abstract:Training recurrent neuronal networks consisting of excitatory (E) and inhibitory (I) units with additive noise for working memory computation slows and diversifies inhibitory timescales, leading to improved task performance that is attributed to emergent marginally stable equilibria [PNAS 122 (2025) e2316745122]. Yet the link between trained network characteristics and their roles in shaping desirable dynamical landscapes remains unexplored. Here, we investigate the Jacobian matrices describing the dynamics near these equilibria and show that they are sparse, non-Hermitian rectangular-block matrices modified by heterogeneous synaptic decay timescales and activation-function gains. We specify a random matrix ensemble that faithfully captures the spectra of trained Jacobian matrices, arising from the inhibitory core - excitatory periphery network motif (pruned E weights, broadly distributed I weights) observed post-training. An analytic theory of this ensemble is developed using statistical field theory methods: a Hermitized resolvent representation of the spectral density processed with a supersymmetry-based treatment in the style of Fyodorov and Mirlin. In this manner, an analytic description of the spectral edge is obtained, relating statistical parameters of the Jacobians (sparsity, weight variances, E/I ratio, and the distributions of timescales and gains) to near-critical features of the equilibria essential for robust working memory computation.
信息检索
[IR-0] InfoDCL: Informative Noise Enhanced Diffusion Based Contrastive Learning
链接: https://arxiv.org/abs/2512.16576
作者: Xufeng Liang,Zhida Qin,Chong Zhang,Tianyu Huang,Gangyi Ding
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Contrastive learning has demonstrated promising potential in recommender systems. Existing methods typically construct sparser views by randomly perturbing the original interaction graph, since the authentic user preferences are unknown. Owing to the sparse nature of recommendation data, this paradigm can only capture insufficient semantic information. To address the issue, we propose InfoDCL, a novel diffusion-based contrastive learning framework for recommendation. Rather than injecting randomly sampled Gaussian noise, we employ a single-step diffusion process that integrates noise with auxiliary semantic information to generate signals and feed them to the standard diffusion process to generate authentic user preferences as contrastive views. Besides, based on a comprehensive analysis of the mutual influence between generation and preference learning in InfoDCL, we build a collaborative training objective strategy to transform the interference between them into mutual collaboration. Additionally, we employ multiple GCN layers only during inference stage to incorporate higher-order co-occurrence information while maintaining training efficiency. Extensive experiments on five real-world datasets demonstrate that InfoDCL significantly outperforms state-of-the-art methods. Our InfoDCL offers an effective solution for enhancing recommendation performance and suggests a novel paradigm for applying diffusion methods in contrastive learning frameworks.
[IR-1] From Flows to Functions: Macroscopic Behavioral Fingerprinting of IoT Devices via Network Services
链接: https://arxiv.org/abs/2512.16348
作者: Shayan Azizi,Norihiro Okui,Masataka Nakahara,Ayumu Kubota,Hassan Habibi Gharakheili
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 3 figures, 1 table, and 1 algorithm
Abstract:Identifying devices such as cameras, printers, voice assistants, or health monitoring sensors, collectively known as the Internet of Things (IoT), within a network is a critical operational task, particularly to manage the cyber risks they introduce. While behavioral fingerprinting based on network traffic analysis has shown promise, most existing approaches rely on machine learning (ML) techniques applied to fine-grained features of short-lived traffic units (packets and/or flows). These methods tend to be computationally expensive, sensitive to traffic measurement errors, and often produce opaque inferences. In this paper, we propose a macroscopic, lightweight, and explainable alternative to behavioral fingerprinting focusing on the network services (e.g., TCP/80, UDP/53) that IoT devices use to perform their intended functions over extended periods. Our contributions are threefold. (1) We demonstrate that IoT devices exhibit stable and distinguishable patterns in their use of network services over a period of time. We formalize the notion of service-level fingerprints and derive a generalized method to represent network behaviors using a configurable granularity parameter. (2) We develop a procedure to extract service-level fingerprints, apply it to traffic from 13 consumer IoT device types in a lab testbed, and evaluate the resulting representations in terms of their convergence and recurrence properties. (3) We validate the efficacy of service-level fingerprints for device identification in closed-set and open-set scenarios. Our findings are based on a large dataset comprising about 10 million IPFIX flow records collected over a 1.5-year period.
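"服务集合指纹 + 相似度匹配"的思路可以用几行代码示意:把一段时间内的流记录聚合成 (协议, 目的端口) 集合,再用 Jaccard 相似度做识别,低于阈值时判为未知设备以覆盖开集场景(字段名、示例端口与阈值均为假设,并非论文的精确定义):

```python
# 示意:基于网络服务集合的设备指纹与识别
def service_fingerprint(flows):
    """flows: (proto, dport) 二元组的可迭代集合 -> 服务集合指纹"""
    return frozenset(flows)

known = {
    "camera":  service_fingerprint([("tcp", 443), ("udp", 123), ("tcp", 554)]),
    "speaker": service_fingerprint([("tcp", 443), ("udp", 53), ("tcp", 8009)]),
}

def identify(observed, known, threshold=0.6):
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    best, score = max(((dev, jaccard(observed, fp)) for dev, fp in known.items()),
                      key=lambda t: t[1])
    return best if score >= threshold else "unknown"   # 开集场景:低于阈值判为未知

obs = service_fingerprint([("tcp", 443), ("udp", 123), ("tcp", 554), ("udp", 53)])
print(identify(obs, known))   # -> camera
```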
[IR-2] On Recommending Category: A Cascading Approach
链接: https://arxiv.org/abs/2512.16033
作者: Qihao Wang,Pritom Saha Akash,Varvara Kollia,Kevin Chen-Chuan Chang,Biwei Jiang,Vadim Von Brzeski
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recommendation plays a key role in e-commerce, enhancing user experience and boosting commercial success. Existing works mainly focus on recommending a set of items, but online e-commerce platforms have recently begun to pay attention to exploring users’ potential interests at the category level. Category-level recommendation allows e-commerce platforms to promote users’ engagements by expanding their interests to different types of items. In addition, it complements item-level recommendations when the latter becomes extremely challenging for users with little-known information and past interactions. Furthermore, it facilitates item-level recommendations in existing works. The predicted category, which is called intention in those works, aids the exploration of item-level preference. However, such category-level preference prediction has mostly been accomplished through applying item-level models. Some key differences between item-level recommendations and category-level recommendations are ignored in such a simplistic adaptation. In this paper, we propose a cascading category recommender (CCRec) model with a variational autoencoder (VAE) to encode item-level information to perform category-level recommendations. Experiments show the advantages of this model over methods designed for item-level recommendations.

