This blog post lists the latest papers retrieved from arXiv.org on 2026-03-04. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: if the list is not updated on a given day, either arXiv published no new papers that day or the update script failed. Failures will be fixed the same day whenever possible.
Table of Contents
Overview (2026-03-04)
597 papers were updated today, including:
- Natural Language Processing: 63 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 188 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 136 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 193 papers (Machine Learning (cs.LG))
- Multiagent Systems: 8 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 14 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 30 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] Generative adversarial imitation learning for robot swarms: Learning from human demonstrations and trained policies ICRA2026
【Quick Read】: This paper addresses the problem of learning collective behaviors from human demonstrations in swarm robotics, focusing on achieving high-quality collective tasks through imitation learning when no prior policy is available. The key to the solution is a framework based on Generative Adversarial Imitation Learning (GAIL) that extracts meaningful collective behavior patterns directly from human demonstrations or from demonstrations produced by a PPO-trained policy. The framework is validated both in simulation and on real robots (TurtleBot 4), showing that the learned policies preserve visually recognizable swarm behaviors while performing on par with the original demonstrations.
Link: https://arxiv.org/abs/2603.02783
Authors: Mattes Kraus, Jonas Kuckling
Affiliations: unknown
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Abstract:In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most of the work in imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform similarly well as the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserved their visually recognizable character and their performance is comparable to the one achieved in simulation.
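For readers unfamiliar with GAIL, the two core ingredients are a discriminator trained to tell expert data from policy rollouts, and a surrogate reward derived from the discriminator's score. The minimal sketch below uses scalar features and plain gradient steps; it is a generic illustration under stated simplifications, not the authors' implementation, and all function names and numbers are illustrative.

```python
import math

def gail_reward(d_score):
    """GAIL surrogate reward: -log(1 - D(s, a)).

    d_score is the discriminator's probability that a state-action
    pair came from the expert demonstrations; higher scores yield
    higher rewards, pushing the policy toward expert-like behavior.
    """
    eps = 1e-8  # avoid log(0)
    return -math.log(max(1.0 - d_score, eps))

def discriminator_step(w, b, expert_x, policy_x, lr=0.1):
    """One gradient step of a logistic discriminator D(x) = sigmoid(w*x + b)
    on scalar features: expert samples are labeled 1, policy samples 0."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    dw = db = 0.0
    for x in expert_x:   # push D(x) toward 1 on expert data
        err = sigmoid(w * x + b) - 1.0
        dw += err * x
        db += err
    for x in policy_x:   # push D(x) toward 0 on policy rollouts
        err = sigmoid(w * x + b)
        dw += err * x
        db += err
    n = len(expert_x) + len(policy_x)
    return w - lr * dw / n, b - lr * db / n
```

In a full pipeline, the policy (here, the swarm controller) would be updated with an RL algorithm using `gail_reward` in place of an environment reward, alternating with discriminator steps.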
[MA-1] EvoSkill: Automated Skill Discovery for Multi-Agent Systems
【Quick Read】: This paper targets the lack of domain knowledge in current coding agents: although they are general-purpose, they struggle with specialized tasks that require domain-specific reasoning and execution. Existing approaches rely on hand-crafted agent skills, and evolutionary methods mostly optimize low-level, model-coupled artifacts (e.g., prompts or code snippets), limiting transferability and generalization. The core innovation of the proposed EvoSkill framework is to automatically discover and refine reusable skill modules through iterative failure analysis: the system diagnoses execution failures, proposes new skills or edits to existing ones, and stores them in a structured form. A Pareto-frontier selection mechanism retains only skills that improve held-out validation performance, without updating the underlying model, enabling self-evolution at the skill level. Experiments show accuracy gains of 7.3% on OfficeQA and 12.1% on SealQA, together with zero-shot cross-task transfer, demonstrating that skill-level optimization yields general capabilities beyond the training task.
Link: https://arxiv.org/abs/2603.02766
Authors: Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, Tu Vu
Affiliations: Sentient; Virginia Tech
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g., prompts and code) that are tightly coupled to specific models and tasks. We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, where it improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a 12.1% gain (26.6% → 38.7%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to the other; in particular, skills evolved on SealQA transfer zero-shot to BrowseComp, improving accuracy by 5.3% without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.
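The Pareto-frontier selection the abstract describes can be sketched in a few lines. The candidate tuples and the two axes chosen here (held-out accuracy to maximize, cost to minimize) are hypothetical stand-ins, not EvoSkill's actual artifacts.

```python
def pareto_frontier(candidates):
    """Return the names of candidates not dominated by any other.

    Each candidate is (name, accuracy, cost); higher accuracy is
    better, lower cost is better. A candidate is dominated if some
    other candidate is at least as good on both axes and strictly
    better on at least one.
    """
    frontier = []
    for name, acc, cost in candidates:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for n, a, c in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical skill candidates after a round of failure-driven edits.
skills = [
    ("base",    0.60, 10),
    ("skill_a", 0.68, 12),   # dominated by skill_c (same acc, cheaper)
    ("skill_b", 0.58, 15),   # dominated by base
    ("skill_c", 0.68, 11),
]
```

Only non-dominated skills survive into the next evolution round; everything else is discarded while the underlying model stays frozen.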
[MA-2] Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization AAMAS2026
【Quick Read】: This paper addresses low sample efficiency and weak coordination in multi-agent reinforcement learning (MARL). Its core contribution is the Generalized Per-Agent Advantage Estimator (GPAE), which computes precise per-agent advantages via a per-agent value iteration operator, enabling stable off-policy learning. By estimating values indirectly through action probabilities, the method avoids direct Q-function estimation and improves stability. The paper further introduces a double-truncated importance sampling ratio scheme that balances sensitivity to an agent's own policy changes against robustness to non-stationarity caused by other agents, refining credit assignment on off-policy trajectories. Experiments show the approach significantly outperforms existing methods in complex scenarios, excelling in coordination and sample efficiency.
Link: https://arxiv.org/abs/2603.02654
Authors: Seongmin Kim, Giseung Park, Woojun Kim, Jiwon Jeon, Seungyeol Han, Youngchul Sung
Affiliations: KAIST; University of Toronto; Carnegie Mellon University; UNIST
Subjects: Multiagent Systems (cs.MA)
Comments: Accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract:In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by indirectly estimating values via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme. This scheme improves credit assignment for off-policy trajectories by balancing sensitivity to the agent’s own policy changes with robustness to non-stationarity from other agents. Experiments on benchmarks demonstrate that our approach outperforms existing approaches, excelling in coordination and sample efficiency for complex scenarios.
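In isolation, a double-truncated importance sampling ratio is just a two-sided clip on the usual off-policy ratio. The sketch below shows that mechanic only; the clip bounds and helper names are illustrative assumptions, and the paper's full estimator is considerably richer.

```python
def double_truncated_ratio(pi_new, pi_old, low=0.5, high=2.0):
    """Importance ratio pi_new/pi_old truncated on both sides.

    The upper clip bounds the variance introduced by the agent's own
    large policy updates; the lower clip keeps samples generated
    while *other* agents behaved differently from being discounted
    to nothing (a simple stand-in for robustness to non-stationarity).
    """
    ratio = pi_new / max(pi_old, 1e-8)
    return min(max(ratio, low), high)

def weighted_advantage(pi_new, pi_old, advantage, low=0.5, high=2.0):
    """Per-agent advantage reweighted by the truncated ratio."""
    return double_truncated_ratio(pi_new, pi_old, low, high) * advantage
```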
[MA-3] Credibility Governance: A Social Mechanism for Collective Self-Correction under Weak Truth Signals
【Quick Read】: This paper addresses the fragility of collective judgment on online platforms that aggregate opinions via easily manipulated signals (such as upvotes or capital-weighted commitments), especially under weak truth signals, noisy or delayed feedback, early popularity surges, and strategic manipulation. The core of the solution is Credibility Governance (CG), which reallocates influence by learning which agents and viewpoints consistently track evolving public evidence: CG maintains dynamic credibility scores for both agents and opinions, updates opinion influence through credibility-weighted endorsements, and adjusts agent credibility based on the long-run performance of the opinions they support, thereby rewarding early and persistent alignment with emerging evidence while filtering short-lived noise.
Link: https://arxiv.org/abs/2603.02640
Authors: Wanying He, Yanxi Lin, Ziheng Zhou, Xue Feng, Min Peng, Qianqian Xie, Zilong Zheng, Yipeng Kang
Affiliations: Wuhan University; BIGAI (Beijing Institute for General Artificial Intelligence); Tsinghua University; University of California, Los Angeles; State Key Laboratory of General Artificial Intelligence, BIGAI
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:
Abstract:Online platforms increasingly rely on opinion aggregation to allocate real-world attention and resources, yet common signals such as engagement votes or capital-weighted commitments are easy to amplify and often track visibility rather than reliability. This makes collective judgments brittle under weak truth signals, noisy or delayed feedback, early popularity surges, and strategic manipulation. We propose Credibility Governance (CG), a mechanism that reallocates influence by learning which agents and viewpoints consistently track evolving public evidence. CG maintains dynamic credibility scores for both agents and opinions, updates opinion influence via credibility-weighted endorsements, and updates agent credibility based on the long-run performance of the opinions they support, rewarding early and persistent alignment with emerging evidence while filtering short-lived noise. We evaluate CG in POLIS, a socio-physical simulation environment that models coupled belief dynamics and downstream feedback under uncertainty. Across settings with initial majority misalignment, observation noise and contamination, and misinformation shocks, CG outperforms vote-based, stake-weighted, and no-governance baselines, yielding faster recovery to the true state, reduced lock-in and path dependence, and improved robustness under adversarial pressure. Our implementation and experimental scripts are publicly available at this https URL.
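The credibility-weighted update loop can be sketched in a few lines. The learning rate, the [0, 1] outcome scale, and the data structures here are illustrative assumptions, not the paper's parameterization.

```python
def opinion_influence(endorsements, credibility):
    """Influence of each opinion = sum of its endorsers' credibility
    (a vote weighted by reputation rather than raw counts)."""
    return {
        op: sum(credibility[a] for a in agents)
        for op, agents in endorsements.items()
    }

def update_credibility(credibility, endorsements, outcome, lr=0.2):
    """Nudge each agent's credibility toward the observed long-run
    performance (in [0, 1]) of the opinions they endorsed."""
    new = dict(credibility)
    for op, agents in endorsements.items():
        for a in agents:
            new[a] += lr * (outcome[op] - new[a])
    return new
```

Iterating these two steps is what lets influence drift toward agents whose endorsements keep tracking the evidence, regardless of early popularity.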
[MA-4] StitchCUDA: An Automated Multi-Agent End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning
【Quick Read】: This paper tackles the difficulty of achieving high end-to-end GPU performance for modern machine learning (ML) workloads: existing LLM-based approaches focus on single-kernel optimization and cannot generate complete GPU programs end to end, limiting practical deployment. The key to the solution is StitchCUDA, a multi-agent framework with three specialized agents: a Planner for system-level design and orchestration, a Coder for step-by-step implementation, and a Verifier for correctness checking and performance profiling via Nsys/NCU. The Coder is trained with rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven optimization, combining rubric-scored rewards with rule-based rewards from real executions. This teaches the Coder advanced CUDA techniques (e.g., custom kernel fusion, cuBLAS epilogues) while preventing reward hacking (e.g., copying PyTorch code or hardcoding outputs). On KernelBench, StitchCUDA achieves a nearly 100% success rate and substantially better speedups than the baselines.
Link: https://arxiv.org/abs/2603.02637
Authors: Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments:
Abstract:Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise for automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment. To address this challenge, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation with three specialized agents: a Planner to orchestrate whole-system design, a Coder dedicated to implementing it step by step, and a Verifier for correctness checking and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with combined rubric rewards and rule-based rewards from real executions. The Coder thus learns to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cuBLAS epilogues), and we also effectively prevent the Coder's reward hacking (e.g., simply copying PyTorch code or hardcoding outputs) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves a nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x over the RL model baselines.
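A toy sketch of how a rubric score might be combined with execution-gated rule rewards, including a simple reward-hacking guard. The weights, the gating logic, and the `copied_reference` flag are assumptions for illustration, not StitchCUDA's actual reward function.

```python
def combined_reward(compiled, outputs_match, rubric_score,
                    copied_reference=False, w_rule=0.7, w_rubric=0.3):
    """Combine a rule-based reward from real execution with a
    rubric score in [0, 1].

    Correctness gates everything: a kernel that fails to compile or
    produces wrong outputs earns nothing, and so does one that merely
    copies the reference implementation (a simple reward-hacking
    guard in the spirit of the paper's checks).
    """
    if copied_reference or not compiled or not outputs_match:
        return 0.0
    return w_rule + w_rubric * rubric_score
```

Gating on execution results rather than rubric scores alone is what keeps the learned policy from satisfying the rubric with code that never actually runs.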
[MA-5] CUCo: An Agentic Framework for Compute and Communication Co-design
【Quick Read】: This paper addresses the underutilization of GPUs in large-scale distributed large language model (LLM) training and inference, in particular the complexity and error-proneness of manually writing CUDA kernels that jointly optimize computation and communication. Existing work focuses almost exclusively on computation and neglects communication kernels, even though communication accounts for a large share of total execution time. The proposed solution is CUCo, a training-free, agent-driven workflow that automatically generates high-performance CUDA kernels co-scheduling computation and communication. The key insight is that jointly optimizing these traditionally separate components unlocks optimization opportunities unreachable by existing methods, reducing end-to-end latency by up to 1.57x.
Link: https://arxiv.org/abs/2603.02376
Authors: Bodun Hu, Yoga Sri Varshan V, Saurabh Agarwal, Aditya Akella
Affiliations: UT Austin
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to 1.57×.
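As a back-of-the-envelope model (not from the paper; all numbers are made up), the benefit of co-scheduling can be seen by comparing sequential execution with a schedule that hides part of the communication behind computation:

```python
def end_to_end_latency(compute_ms, comm_ms, overlap=0.0):
    """Estimate per-step latency when a fraction `overlap` of the
    communication is hidden behind computation.

    overlap = 0.0 models the sequential baseline (compute, then
    communicate); overlap = 1.0 models perfect co-scheduling, where
    only the longer of the two phases is exposed.
    """
    hidden = overlap * min(comm_ms, compute_ms)
    return compute_ms + comm_ms - hidden

def speedup(compute_ms, comm_ms, overlap):
    return (end_to_end_latency(compute_ms, comm_ms, 0.0)
            / end_to_end_latency(compute_ms, comm_ms, overlap))
```

With an illustrative 6 ms of compute and 4 ms of communication per step, perfect overlap cuts latency from 10 ms to 6 ms, roughly the scale of improvement the abstract reports.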
[MA-6] RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection
【Quick Read】: This paper addresses the difficulty of ensuring compliance and consistency of infrastructure as code (IaC) in cloud environments under configuration drift, in particular the unreliability of existing LLM-based agentic systems when the tools they invoke return erroneous or misleading outputs. The key to the solution is RIVA (Robust Infrastructure by Verification Agents), a multi-agent system in which a verifier agent and a tool generation agent collaborate through iterative cross-validation, multi-perspective verification, and tool-call history tracking, enabling robust IaC verification even in the presence of anomalous tool responses and thereby significantly improving task accuracy while reducing false positives and false negatives.
Link: https://arxiv.org/abs/2603.02345
Authors: Sami Abuzakuk, Lucas Crijns, Anne-Marie Kermarrec, Rafael Pires, Martijn de Vos
Affiliations: EPFL Lausanne
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Infrastructure as code (IaC) tools automate cloud provisioning but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model (LLM)-based agentic AI systems can automate the analysis of large volumes of telemetry data, making them suitable for the detection of configuration drift. However, existing agentic systems implicitly assume that the tools they invoke always return correct outputs, making them vulnerable to erroneous tool responses. Since agents cannot distinguish whether an anomalous tool output reflects a real infrastructure problem or a broken tool, such errors may cause missed drift or false alarms, reducing reliability precisely when it is most needed. We introduce RIVA (Robust Infrastructure by Verification Agents), a novel multi-agent system that performs robust IaC verification even when tools produce incorrect or misleading outputs. RIVA employs two specialized agents, a verifier agent and a tool generation agent, that collaborate through iterative cross-validation, multi-perspective verification, and tool call history tracking. Evaluation on the AIOpsLab benchmark demonstrates that RIVA, in the presence of erroneous tool responses, recovers task accuracy from 27.3% when using a baseline ReAct agent to 50.0% on average. RIVA also improves task accuracy 28% to 43.8% without erroneous tool responses. Our results show that cross-validation of diverse tool calls enables more reliable autonomous infrastructure verification in production cloud environments.
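Cross-validating redundant tool calls can be sketched as a quorum vote over tools that should report the same value. The tool names and quorum rule below are hypothetical; RIVA's actual protocol is iterative and multi-perspective rather than a single vote.

```python
from collections import Counter

def cross_validate(tool_outputs, quorum=2):
    """Cross-validate redundant tool calls that should agree.

    tool_outputs maps tool names to their reported value. If at
    least `quorum` tools agree, that value is trusted and dissenting
    tools are flagged as suspect; otherwise no verdict is returned
    and the discrepancy is escalated for further verification.
    """
    counts = Counter(tool_outputs.values())
    value, n = counts.most_common(1)[0]
    if n < quorum:
        return None, sorted(tool_outputs)  # no consensus: escalate
    suspects = sorted(t for t, v in tool_outputs.items() if v != value)
    return value, suspects
```

The point of the consensus step is that a single broken tool (here, the hypothetical `broken_probe`) gets flagged instead of triggering a false drift alarm.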
[MA-7] The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety
【Quick Read】: This paper addresses the difficulty of guaranteeing the post-deployment safety of highly autonomous decision components (such as generative AI), whose safety behavior is deeply entangled with training, making it opaque, hard to audit, and costly to update. The key to the solution is the Alignment Flywheel, a governance-centric hybrid multi-agent system (MAS) architecture that decouples decision generation from safety governance: a Proposer generates candidate decision trajectories, a Safety Oracle provides a stable interface for safety signals, an enforcement layer applies explicit risk policy, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is "patch locality": most newly observed safety failures can be fixed by updating the governed Oracle and its release pipeline, without retracting or retraining the underlying decision component, enabling efficient, auditable, version-controlled safety governance.
Link: https://arxiv.org/abs/2603.02259
Authors: Elias Malomgré, Pieter Simoens
Affiliations: IDLab, Ghent University - imec, Belgium
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Multi-agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight.
Natural Language Processing
[NLP-0] Using Learning Progressions to Guide AI Feedback for Science Learning
【Quick Read】: This paper addresses the scalability bottleneck of AI-generated formative feedback, which typically depends on task-specific rubrics hand-authored by domain experts, a time-consuming process that is hard to generalize across instructional contexts. The key to the solution is using learning progressions (LPs) to automatically construct task-specific rubrics, replacing manual authoring and enabling efficient, scalable feedback generation. Comparing two feedback pipelines, one guided by an expert-designed rubric and one guided by an LP-derived rubric, the study finds no significant difference in feedback quality, suggesting that LP-driven rubric generation can serve as an effective alternative to expert rubrics.
Link: https://arxiv.org/abs/2603.03249
Authors: Xin Xia (1), Nejla Yuruk (2), Yun Wang (1), Xiaoming Zhai (1) ((1) University of Georgia, (2) Gazi University)
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 4 figures
Abstract:Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students’ developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen’s kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
[NLP-1] Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals
【Quick Read】: This paper addresses how language models can adapt to the varied social, cultural, and domain-specific norms of online communities, especially in low-resource settings lacking explicit preference annotations or institutional support, where conventional alignment methods do not apply. The central challenge is learning a community's implicit acceptance behavior without manual labels or predefined principles. The key insight is that community acceptance behavior induces geometric structure in representation space: accepted content occupies coherent, high-density regions of the embedding space, while rejected content falls in sparser or misaligned areas. Building on this, the authors propose Density-Guided Response Optimization (DGRO), which uses local density as an implicit preference signal to optimize model outputs, achieving community alignment without explicit labels. Experiments show DGRO outperforms supervised and prompt-based baselines across communities spanning languages, platforms, and topics.
Link: https://arxiv.org/abs/2603.03242
Authors: Patrick Gerard, Svitlana Volkova
Affiliations: Information Sciences Institute, University of Southern California; Aptima Inc.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 27 pages
Abstract:Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings but exclude most online communities – particularly those without institutional backing, annotation infrastructure, or organized around sensitive topics – where preference elicitation is costly, ethically fraught, or culturally misaligned. We observe that communities already express preferences implicitly through what content they accept, engage with, and allow to persist. We show that this acceptance behavior induces measurable geometric structure in representation space: accepted responses occupy coherent, high-density regions that reflect community-specific norms, while rejected content falls in sparser or misaligned areas. We operationalize this structure as an implicit preference signal for alignment and introduce density-guided response optimization (DGRO), a method that aligns language models to community norms without requiring explicit preference labels. Using labeled preference data, we demonstrate that local density recovers pairwise community judgments, indicating that geometric structure encodes meaningful preference signal. We then apply DGRO in annotation-scarce settings across diverse communities spanning platform, topic, and language. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines. We position DGRO as a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned with situated practices, and discuss the implications and risks of learning from emergent acceptance behavior. 
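The density signal can be illustrated with a k-nearest-neighbor density over accepted embeddings. The 2-D toy points below are purely illustrative; DGRO operates on high-dimensional representations and its exact density estimator may differ.

```python
import math

def knn_density(point, accepted, k=3):
    """Local density of an embedding relative to a community's
    accepted content: inverse of the mean distance to the k nearest
    accepted embeddings (higher = deeper inside the accepted region)."""
    dists = sorted(math.dist(point, a) for a in accepted)[:k]
    return 1.0 / (sum(dists) / len(dists) + 1e-8)

def prefer(resp_a, resp_b, accepted, k=3):
    """Implicit pairwise preference: the response whose embedding
    sits in the denser accepted region wins."""
    da = knn_density(resp_a, accepted, k)
    db = knn_density(resp_b, accepted, k)
    return "a" if da >= db else "b"
```

Pairwise preferences induced this way are what DGRO-style training could consume in place of explicit human preference labels.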
[NLP-2] Understanding and Mitigating Dataset Corruption in LLM Steering
【Quick Read】: This paper studies the robustness of contrastive steering when its training data is noisy or maliciously corrupted, a question that matters for generative AI safety applications but had remained unclear. The key finding is that the high-dimensional mean computation used to learn the steering direction is sensitive to corrupted data; replacing it with a recently developed robust mean estimator mitigates most of the unwanted side effects of malicious corruption.
Link: https://arxiv.org/abs/2603.03206
Authors: Cullen Anderson, Narmeen Oozeer, Foad Namjoo, Remy Ogasawara, Amirali Abdullah, Jeff M. Phillips
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Contrastive steering has been shown as a simple and effective method to adjust the generative behavior of LLMs at inference time. It uses examples of prompt responses with and without a trait to identify a direction in an intermediate activation layer, and then shifts activations in this 1-dimensional subspace. However, despite its growing use in AI safety applications, the robustness of contrastive steering to noisy or adversarial data corruption is poorly understood. We initiate a study of the robustness of this process with respect to corruption of the dataset of examples used to train the steering direction. Our first observation is that contrastive steering is quite robust to a moderate amount of corruption, but unwanted side effects can be clearly and maliciously manifested when a non-trivial fraction of the training data is altered. Second, we analyze the geometry of various types of corruption, and identify some safeguards. Notably, a key step in learning the steering direction involves high-dimensional mean computation, and we show that replacing this step with a recently developed robust mean estimator often mitigates most of the unwanted effects of malicious corruption.
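The failure mode and its fix can be illustrated on scalars: a handful of poisoned values drags the ordinary mean far off, while a trimmed mean, used here as a simple stand-in for the robust high-dimensional estimator the paper employs, ignores them.

```python
def mean(xs):
    return sum(xs) / len(xs)

def trimmed_mean(xs, trim=0.2):
    """Scalar stand-in for a robust mean: drop the most extreme
    `trim` fraction from each tail before averaging. (The paper
    uses a stronger high-dimensional robust estimator; this only
    illustrates the idea.)"""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

# Illustrative activation differences: mostly ~1.0, two poisoned at 50.0.
data = [1.0] * 8 + [50.0] * 2
```

With 20% of the samples poisoned, the ordinary mean lands at 10.8 while the trimmed mean recovers 1.0, mirroring how a corrupted steering direction can be pulled far from the clean one.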
[NLP-3] Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
【Quick Read】: This paper addresses the safety of agentic language models in multi-step tool use: because these models plan, call tools, and execute long-horizon tasks, a single wrong decision (such as accessing sensitive files or entering credentials) can cause irreversible harm. Conventional alignment methods, optimized for static generation and task completion, break down under sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. The key to the proposed MOSAIC framework is making safety decisions explicit and learnable within the reasoning process: inference is structured as a plan, check, then act-or-refuse loop, with explicit safety reasoning and refusal as first-class actions at every step. Training uses preference-based reinforcement learning over pairwise trajectory comparisons, capturing fine-grained safety distinctions without trajectory-level labels and generalizing robustly across models, domains, and tasks.
Link: https://arxiv.org/abs/2603.03205
Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
Affiliations: Microsoft Research
Subjects: Computation and Language (cs.CL)
Comments: 24 pages, 5 figures
Abstract:Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
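At its core, the plan, check, then act-or-refuse loop reduces to gating every proposed tool call on an explicit safety verdict, with refusal as a normal outcome rather than a failure. The toy hard-coded policy below is illustrative only; MOSAIC's safety check is learned, not a static list.

```python
def run_step(action, safety_check, execute):
    """One iteration of the plan -> check -> act-or-refuse loop:
    every proposed tool call passes an explicit safety check, and
    refusal is a first-class action rather than an error."""
    verdict = safety_check(action)
    if verdict != "safe":
        return ("refused", action, verdict)
    return ("executed", action, execute(action))

def toy_safety_check(action):
    # Toy stand-in policy: block credential entry and sensitive reads.
    risky = ("enter_credentials", "delete_file", "read_ssh_key")
    return "unsafe: irreversible or sensitive" if action in risky else "safe"
```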
[NLP-4] No Memorization No Detection: Output Distribution-Based Contamination Detection in Small Language Models
【Quick Read】: This paper addresses data contamination detection, focusing on the risks posed by contaminated training data in small language models. CDD (Contamination Detection via output Distribution) identifies contaminated samples from the peakedness of a model's output distribution, but its effectiveness across fine-tuning strategies had not been systematically evaluated. The core finding is that CDD's success hinges on whether fine-tuning induces verbatim memorization of the contaminated data. With parameter-efficient techniques such as Low-Rank Adaptation (LoRA), models can learn from contaminated data without memorizing it, and CDD degrades to chance-level detection even when contamination is verified. The paper thus characterizes a "memorization threshold" governing detectability and shows that parameter-efficient fine-tuning can hide contamination from existing output-distribution-based methods.
Link: https://arxiv.org/abs/2603.03203
Authors: Omer Sela
Affiliations: Tel Aviv University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages main text, 5 pages appendix, 9 figures, 7 tables. Code available at this https URL
Abstract:CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model’s sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD’s effectiveness depends critically on whether fine-tuning produces verbatim memorization. With low-rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine-tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter-efficient fine-tuning can produce contamination that output-distribution methods do not detect. Our code is available at this https URL
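One simple proxy for output-distribution peakedness is the frequency of the single most common output among repeated samples; CDD's exact statistic may differ, and the threshold below is an assumption for illustration. The intuition is that verbatim memorization makes sampling collapse onto one string.

```python
from collections import Counter

def peakedness(samples):
    """Fraction of sampled outputs equal to the single most frequent
    output. Verbatim memorization of a contaminated example makes
    repeated sampling collapse onto one string, pushing this
    statistic toward 1.0."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)

def flag_contaminated(samples, threshold=0.8):
    return peakedness(samples) >= threshold
```

The paper's finding, in these terms: LoRA-style fine-tuning can raise benchmark accuracy without driving peakedness above any such threshold, so the contamination stays invisible to this class of detector.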
[NLP-5] Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration? ICML2026
【Quick Read】: This paper addresses a bottleneck in advancing large language models (LLMs) toward IMO-level mathematics: the scarcity of high-quality, high-difficulty problems for training and evaluation. The key to the solution is exploiting the autonomous evolution capability of code agents: a multi-agent framework performs structured mutations of existing math problems within a scalable computational environment, generating new variants that are solvable yet more challenging. Code execution serves as the validation mechanism, ensuring the soundness and increased difficulty of the generated problems and thereby providing richer training resources for LLMs.
Link: https://arxiv.org/abs/2603.03202
Authors: Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Under review at ICML 2026
Abstract:As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at this https URL.
[NLP-6] ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments
【Quick Read】: This paper addresses robust generalization of universal embodied intelligence across heterogeneous embodiments (such as autonomous driving, robotics, and UAVs), where long-tail data distributions, gradient interference, and catastrophic forgetting make it difficult for a unified model to balance cross-domain generalization with domain-specific proficiency. The key to the solution is the Scaffold-Specialize-Reconcile (SSR) paradigm: first establish a shared spatial foundation, using spatial intelligence as a universal scaffold across embodiments; then train domain-specialized experts; and finally reconcile the experts through data-free model merging. Group Relative Policy Optimization (GRPO) is further adopted to strengthen the model's overall capability.
Link: https://arxiv.org/abs/2603.03198
Authors: Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, Gen Luo, Haolan Kang, Shuang Luo, Yue Zhou, Yong Luo, Li Shen, Xiaosong Jia, Yao Mu, Xue Yang, Chunxiao Liu, Junchi Yan, Hengshuang Zhao, Dacheng Tao, Xiaogang Wang
Affiliations: unknown
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL; Hugging Face: this https URL
Abstract:Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, existing embodied-brain approaches that train a unified model over diverse embodiments frequently trigger long-tail data issues, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain-specific proficiency. In this report, we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer. Building on this insight, we propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Furthermore, we adopt Group Relative Policy Optimization (GRPO) to strengthen the model's comprehensive capability. Extensive experiments demonstrate that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.
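Data-free model merging, in its simplest form, is a parameter-wise weighted average of expert checkpoints that share an architecture. The sketch below illustrates only that basic idea; the paper's SSR merging strategy may be considerably more sophisticated, and the expert names and weights here are hypothetical.

```python
def merge_experts(experts, weights=None):
    """Data-free merging: parameter-wise weighted average of expert
    checkpoints sharing one architecture. Each expert is a dict
    mapping parameter names to (flattened) lists of values."""
    names = list(experts)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}  # uniform merge
    first = experts[names[0]]
    merged = {}
    for p in first:
        merged[p] = [
            sum(weights[n] * experts[n][p][i] for n in names)
            for i in range(len(first[p]))
        ]
    return merged
```

No training data is touched: the reconcile step operates purely on the experts' weights, which is what makes it "data-free".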
[NLP-7] BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?
【速读】: 该论文旨在解决当前代码智能体(Code Agent)评估基准过于狭窄的问题,即现有评测主要聚焦于单一代码库内的局部修复任务,忽略了现实场景中更为复杂的挑战,如跨代码库推理、领域专业化问题求解、依赖驱动的迁移以及整个代码库的生成等。为应对这一局限,作者提出了BeyondSWE基准,通过在两个维度——任务解决范围(resolution scope)与知识覆盖范围(knowledge scope)——上扩展评估维度,构建了包含500个真实世界实例的多场景测试集。其关键创新在于引入SearchSWE框架,该框架将深度搜索能力与代码生成能力深度融合,以系统性探究外部知识在代码任务中的作用。实验表明,即使是最先进的模型在BeyondSWE上也难以突破45%的成功率,且无模型能在所有任务类型中保持稳定表现;更重要的是,搜索增强带来的性能提升具有高度不确定性,甚至可能降低效果,揭示了模拟开发者在编码过程中交替进行检索与推理的复杂性,从而为未来更强大的代码智能体研究提供了更具挑战性的评估标准和可扩展的开发框架。
链接: https://arxiv.org/abs/2603.03194
作者: Guoxin Chen,Fanzhe Meng,Jiale Zhao,Minghao Li,Daixuan Cheng,Huatong Song,Jie Chen,Yuzhi Lin,Hui Chen,Xin Zhao,Ruihua Song,Chang Liu,Cheng Chen,Kai Jia,Ji-Rong Wen
机构: 未知
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Benchmark: this https URL . Repo: this https URL . Scaffold: this https URL
点击查看摘要
Abstract:Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
[NLP-8] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization CVPR2026
【速读】: 该论文旨在解决多模态大语言模型(omni LLMs)在跨模态理解任务中易受虚假相关性和语言先验主导影响而导致的跨模态幻觉问题。其解决方案的关键在于提出一种名为“模态解耦直接偏好优化”(Modality-Decoupled Direct Preference Optimization, MoD-DPO)的框架,通过引入模态感知的正则化项,显式强制模型对无关模态的扰动保持不变、对相关模态的扰动保持敏感,从而减少非预期的跨模态交互;同时结合语言先验去偏惩罚机制,抑制仅依赖文本信息导致的幻觉响应,实现更忠实于模态输入的对齐与鲁棒性提升。
链接: https://arxiv.org/abs/2603.03192
作者: Ashutosh Chaubey,Jiacheng Pang,Mohammad Soleymani
机构: Institute for Creative Technologies, University of Southern California(南加州大学创意技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: CVPR 2026. Project Page: this https URL
点击查看摘要
Abstract:Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
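MoD-DPO 的具体损失形式未在摘要中给出。以下用标准 DPO 损失加上两项模态正则(对无关模态扰动的不变性惩罚、对相关模态扰动的敏感性铰链项)给出一个标量化的假设性示意,正则系数与铰链间隔均为笔者虚构,仅用于说明目标函数的结构:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO objective on a chosen/rejected response pair."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(sigmoid(margin))

def mod_dpo_loss(logp_w, logp_l, ref_w, ref_l,
                 logp_w_irrelevant_corrupt, logp_w_relevant_corrupt,
                 lam_inv=0.5, lam_sens=0.5, beta=0.1):
    """Hypothetical MoD-DPO-style objective: DPO plus
    (a) invariance penalty: the chosen response's log-prob should NOT
        move when an irrelevant modality is corrupted;
    (b) sensitivity term: the log-prob SHOULD drop when the relevant
        modality is corrupted (hinge with margin 1.0)."""
    base = dpo_loss(logp_w, logp_l, ref_w, ref_l, beta)
    invariance = abs(logp_w - logp_w_irrelevant_corrupt)            # want ~0
    sensitivity = max(0.0, logp_w_relevant_corrupt - logp_w + 1.0)  # want large drop
    return base + lam_inv * invariance + lam_sens * sensitivity

loss = mod_dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_w=-1.2, ref_l=-2.5,
                    logp_w_irrelevant_corrupt=-1.0,
                    logp_w_relevant_corrupt=-4.0)
print(round(loss, 4))
```

该例中无关模态扰动未改变输出概率(不变性惩罚为 0),相关模态扰动使概率大幅下降(敏感性铰链为 0),因此总损失退化为标准 DPO 项。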
[NLP-9] Type-Aware Retrieval-Augmented Generation with Dependency Closure for Solver-Executable Industrial Optimization Modeling
【速读】: 该论文旨在解决将自然语言需求可靠地转化为求解器可执行代码的问题,尤其针对大语言模型在工业优化建模中常因缺少声明、类型不一致和依赖关系不完整而导致生成非编译通过模型的挑战。其解决方案的关键在于提出一种类型感知的检索增强生成(type-aware retrieval-augmented generation, RAG)方法,通过构建领域特定的类型化知识库(将学术论文与求解器代码等异构源解析为带类型的单元并编码数学依赖关系至知识图谱),并在给定自然语言指令时执行混合检索并基于图上的依赖传播计算最小依赖闭包上下文——即生成求解器可执行代码所需的最小类型符号集合,从而确保模型的可执行性与结构正确性。
链接: https://arxiv.org/abs/2603.03180
作者: Y. Zhong,R. Huang,M. Wang,Z. Guo,YC. Li,M. Yu,Z. Jin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Automated industrial optimization modeling requires reliable translation of natural-language requirements into solver-executable code. However, large language models often generate non-compilable models due to missing declarations, type inconsistencies, and incomplete dependency contexts. We propose a type-aware retrieval-augmented generation (RAG) method that enforces modeling entity types and minimal dependency closure to ensure executability. Unlike existing RAG approaches that index unstructured text, our method constructs a domain-specific typed knowledge base by parsing heterogeneous sources, such as academic papers and solver code, into typed units and encoding their mathematical dependencies in a knowledge graph. Given a natural-language instruction, it performs hybrid retrieval and computes a minimal dependency-closed context, the smallest set of typed symbols required for solver-executable code, via dependency propagation over the graph. We validate the method on two constraint-intensive industrial cases: demand response optimization in battery production and flexible job shop scheduling. In the first case, our method generates an executable model incorporating demand-response incentives and load-reduction constraints, achieving peak shaving while preserving profitability; conventional RAG baselines fail. In the second case, it consistently produces compilable models that reach known optimal solutions, demonstrating robust cross-domain generalization; baselines fail entirely. Ablation studies confirm that enforcing type-aware dependency closure is essential for avoiding structural hallucinations and ensuring executability, addressing a critical barrier to deploying large language models in complex engineering optimization tasks.
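摘要中的“最小依赖闭包”本质上是类型化知识图谱上从检索命中符号出发的可达集。下面给出一个假设性的玩具实现(`con_peak_shaving` 等符号名均为笔者虚构,仅示意约束 → 变量 → 集合的依赖传播):

```python
def dependency_closure(seeds, deps):
    """Compute a minimal dependency-closed context: starting from the
    typed symbols retrieved for the instruction (`seeds`), follow
    dependency edges until closure, so every symbol the generated
    model references is also declared."""
    closed, stack = set(), list(seeds)
    while stack:
        sym = stack.pop()
        if sym in closed:
            continue
        closed.add(sym)
        stack.extend(deps.get(sym, ()))
    return closed

# Toy knowledge graph: a constraint depends on variables and parameters,
# a variable depends on an index set.
deps = {
    "con_peak_shaving": ["var_load", "par_peak_limit"],
    "var_load": ["set_time"],
    "obj_profit": ["var_load", "par_price"],
}
ctx = dependency_closure(["con_peak_shaving"], deps)
print(sorted(ctx))
```

注意闭包只包含被检索约束真正依赖的符号(此例中不含 `obj_profit`),这正是“最小”二字的含义:上下文足以编译,又不引入无关声明。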
[NLP-10] APRES: An Agentic Paper Revision and Evaluation System
【速读】:该论文旨在解决当前科学论文在同行评审过程中因评审意见不一致而导致稿件质量提升受限、潜在影响力不足的问题。其核心挑战在于如何在不改变论文核心科学内容的前提下,系统性地提升论文的可读性与传播潜力。解决方案的关键在于提出一种名为APRES(Agentic Paper Revision and Evaluation System)的新方法,该方法基于大语言模型(Large Language Models, LLMs)构建,并引入一个能高度预测未来引用次数的评价指标体系(rubric),实现对论文文本的自动化修订。实验证明,APRES在降低未来引用预测误差方面优于现有最佳基线19.6%,且人类专家评审者79%的情况下更偏好经APRES修订后的版本,从而为作者提供了一种有效的“压力测试”工具,辅助其优化投稿前稿件质量。
链接: https://arxiv.org/abs/2603.03142
作者: Bingchen Zhao,Jenny Zhang,Chenxi Whitehouse,Minqi Jiang,Michael Shvartsman,Abhishek Charnalia,Despoina Magka,Tatiana Shavrina,Derek Dunfield,Oisin Mac Aodha,Yoram Bachrach
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce a novel method APRES powered by Large Language Models (LLMs) to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts, and integrates it with APRES in an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean averaged error over the next best baseline, and show that our paper revision process yields papers that are preferred over the originals by human expert evaluators 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work seeks to augment, not replace, the essential role of human expert reviewers, for it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.
[NLP-11] UniSkill: A Dataset for Matching University Curricula to Professional Competencies LREC2026
【速读】: 该论文旨在解决技能提取与推荐系统中因缺乏公开标注数据而导致的“指令技能侧”(instructed skills side)研究不足问题,尤其聚焦于大学课程与职业技能之间的匹配任务。其关键解决方案是构建并发布两个数据集——人工标注和合成的来自欧洲技能、能力、资格与职业分类(ESCO)体系的大学课程与技能配对数据,并提供详细的标注指南;在此基础上训练基于BERT的语言模型,实现课程到技能及技能到课程的匹配,实验表明该模型在F1-score上达到87%,验证了课程与技能匹配任务的可行性。
链接: https://arxiv.org/abs/2603.03134
作者: Nurlan Musazade,Joszef Mezei,Mike Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: LREC 2026
点击查看摘要
Abstract:Skill extraction and recommendation systems have been studied from recruiter, applicant, and education perspectives. While AI applications in job advertisements have received broad attention, deficiencies in the instructed skills side remain a challenge. In this work, we address the scarcity of publicly available datasets by releasing both manually annotated and synthetic datasets of skills from the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and university course pairs and publishing corresponding annotation guidelines. Specifically, we match graduate-level university courses with skills from the Systems Analysts and Management and Organization Analyst ESCO occupation groups at two granularities: course title with a skill, and course sentence with a skill. We train language models on this dataset to serve as a baseline for retrieval and recommendation systems for course-to-skill and skill-to-course matching. We evaluate the models on a portion of the annotated data. Our BERT model achieves 87% F1-score, showing that course and skill matching is a feasible task.
[NLP-12] Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems
【速读】:该论文旨在解决多轮大语言模型(Large Language Model, LLM)系统在运行过程中因模型切换(如升级、跨提供商路由或回退机制)导致的上下文不匹配问题,这种不匹配可能引发无声性能漂移(silent performance drift),从而影响下游任务的稳定性和可靠性。解决方案的关键在于提出一种名为“switch-matrix benchmark”的评估框架,通过在早期轮次使用前缀模型(prefix model)、最终轮次使用后缀模型(suffix model),并以无切换基线为参照,利用成对的episode-level bootstrap置信区间进行统计比较,量化切换引起的性能变化;同时进一步将切换引起的漂移分解为每模型的前缀影响(prefix influence)与后缀敏感性(suffix susceptibility)两个维度,揭示出系统性的兼容模式,并证明其能解释约70%的方差,从而将切换鲁棒性确立为单模型基准所忽略的重要操作可靠性维度。
链接: https://arxiv.org/abs/2603.03111
作者: Raad Khraishi,Iman Zafar,Katie Myles,Greig A Cowan
机构: NatWest AI Research (NatWest人工智能研究); University College London (伦敦大学学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. Such handoffs create a context mismatch: the model generating later turns must condition on a dialogue prefix authored by a different model, potentially inducing silent performance drift. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn, and comparing against the no-switch baseline using paired episode-level bootstrap confidence intervals. Across CoQA conversational QA and Multi-IF benchmarks, even a single-turn handoff yields prevalent and statistically significant, directional effects and may swing outcomes by -8 to +13 percentage points in Multi-IF strict success rate and +/- 4 absolute F1 on CoQA, comparable to the no-switch gap between common model tiers (e.g., GPT-5-nano vs GPT-5-mini). We further find systematic compatibility patterns: some suffix models degrade under nearly any non-self dialogue history, while others improve under nearly any foreign prefix. To enable compressed handoff risk monitoring, we decompose switch-induced drift into per-model prefix influence and suffix susceptibility terms, accounting for ~70% of variance across benchmarks. These results position handoff robustness as an operational reliability dimension that single-model benchmarks miss, motivating explicit monitoring and handoff-aware mitigation in multi-turn systems.
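论文采用成对的 episode 级 bootstrap 置信区间来判断切换引起的漂移是否显著。下面给出该统计流程的一个最小示意(分数为虚构玩具数据,重采样次数与分位数取法为常见默认设置,未必与论文一致):

```python
import random

def paired_bootstrap_ci(switch_scores, baseline_scores,
                        n_boot=2000, alpha=0.05, seed=0):
    """Paired episode-level bootstrap CI on the mean score difference
    (switched run minus no-switch baseline). Episodes are resampled
    as pairs, so per-episode difficulty cancels out."""
    rng = random.Random(seed)
    diffs = [s - b for s, b in zip(switch_scores, baseline_scores)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: the switched system loses a few points on most episodes.
baseline = [0.8, 0.7, 0.9, 0.6, 0.75, 0.85, 0.7, 0.8]
switched = [0.75, 0.65, 0.9, 0.55, 0.7, 0.8, 0.65, 0.78]
lo, hi = paired_bootstrap_ci(switched, baseline)
print(f"95% CI for switch-induced drift: [{lo:.3f}, {hi:.3f}]")
```

若区间完全落在 0 的一侧(如本例为负),即可认为该前缀/后缀组合的漂移在统计上显著。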
[NLP-13] Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection
【速读】: 该论文旨在解决论点挖掘(Argument Mining, AM)中的论点组件检测(Argumentative Component Detection, ACD)问题,该任务要求同时识别论点片段并将其分类为声明(claim)或前提(premise)等组件,是AM中最具挑战性的子任务之一。传统方法通常将其简化为序列标注或分阶段的分割与分类流程,难以有效捕捉组件间的语义关联。本文的关键解决方案是将ACD重构为语言生成任务,利用指令微调的大语言模型(Instruction-tuned Large Language Models, LLMs)和紧凑的指令提示(instruction-based prompts),直接从原始文本中生成结构化的论点组件,无需依赖预分割的片段,从而更高效地建模复杂论点结构。实验表明,该方法在标准基准上优于现有最先进系统,首次实现了对ACD任务的端到端生成式建模。
链接: https://arxiv.org/abs/2603.03095
作者: Sofiane Elguendouze,Erwan Hain,Elena Cabrio,Serena Villata
机构: Université Côte d’Azur, I3S, CNRS, Inria (Marianne), France
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review (COLM 2026)
点击查看摘要
Abstract:Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.
[NLP-14] TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐机制下易受越狱攻击(jailbreak attacks)的问题,即攻击者通过精心设计的提示词绕过模型的安全约束,诱导其生成有害内容。现有基于优化的攻击方法虽具有效性,但普遍存在频繁拒绝响应、伪有害输出及token级更新效率低下的缺陷。论文提出的解决方案核心是TAO-Attack,其关键创新在于:首先设计两阶段损失函数,第一阶段抑制模型拒绝以维持有害前缀,第二阶段惩罚伪有害输出并引导模型生成更严重的危害性内容;其次引入方向优先token优化(Direction-Priority Token Optimization, DPTO)策略,在更新候选token时优先考虑梯度方向而非仅依赖更新幅度,从而显著提升优化效率。实验表明,TAO-Attack在多种LLM上均优于当前最先进方法,攻击成功率更高,甚至在特定场景下达到100%。
链接: https://arxiv.org/abs/2603.03081
作者: Zhi Xu,Jiaqi Li,Xiaotong Zhang,Hong Yu,Han Liu
机构: Dalian University of Technology (大连理工大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.
[NLP-15] TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在科学绘图任务中从文本描述生成高质量 TikZ 程序的难题,特别是现有数据集规模小、噪声大,导致文本与渲染图像之间语义不一致,且仅依赖监督微调(Supervised Fine-Tuning, SFT)无法有效捕捉图形的语义结构,常出现循环错误、无关内容和空间关系错误等问题。解决方案的关键在于构建了一个规模更大(超过 DaTikZ-V3 四倍)、质量更高的数据集 DaTikZ-V4,并采用两阶段训练策略:首先进行 SFT,随后引入强化学习(Reinforcement Learning, RL),利用通过逆向图形(inverse graphics)训练的图像编码器提供语义忠实的奖励信号,从而显著提升生成 TikZ 代码的准确性与一致性。
链接: https://arxiv.org/abs/2603.03072
作者: Christian Greisinger,Steffen Eger
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
[NLP-16] Incremental Graph Construction Enables Robust Spectral Clustering of Texts
【速读】: 该论文旨在解决标准k近邻(k-NN)图在文本嵌入谱聚类中的脆弱性问题,特别是在低稀疏度(小k值)下容易出现多个不连通组件,导致谱聚类失效且对超参数敏感。其解决方案的关键在于提出一种增量式k-NN图构建方法:新节点仅连接到已插入的k个最近邻节点,从而保证图的连通性,无论k取何值。作者通过归纳法证明了该方法的连通性,并在SentenceTransformer嵌入上验证其在低k条件下优于传统k-NN图,在高k时性能相当,显著提升了谱聚类在实际文本数据上的鲁棒性。
链接: https://arxiv.org/abs/2603.03056
作者: Marko Pranjić,Boshko Koloski,Nada Lavrač,Senja Pollak,Marko Robnik-Šikonja
机构: Institute Jožef Stefan (乔泽夫·斯蒂芬研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: MP and BK contributed equally
点击查看摘要
Abstract:Neighborhood graphs are a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard k-NN graphs can contain many disconnected components at practical sparsity levels (small k), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental k-NN graph construction that preserves connectivity by design: each new node is linked to its k nearest previously inserted nodes, which guarantees a connected graph for any k. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard k-NN graphs, our method outperforms in the low-k regime where disconnected components are prevalent, and matches standard k-NN at larger k.
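增量式 k-NN 图的构造可以直接按摘要描述实现:第 i 个节点只与前 i-1 个已插入节点中最近的 k 个相连,因此每个新节点都挂在已有连通分量上。下面是一个 NumPy 示意(用随机向量代替真实句向量),并附一个简单的连通性检查:

```python
import numpy as np

def incremental_knn_graph(X, k=2):
    """Build a k-NN graph incrementally: node i is linked only to its
    k nearest neighbours among nodes 0..i-1. Every new node attaches
    to the existing component, so the graph is connected for any k>=1."""
    edges = set()
    for i in range(1, len(X)):
        dists = np.linalg.norm(X[:i] - X[i], axis=1)
        for j in np.argsort(dists)[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

def is_connected(n, edges):
    """Plain DFS connectivity check over an undirected edge set."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, stack = {0}, [0]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))          # e.g. 50 sentence embeddings (toy)
edges = incremental_knn_graph(X, k=2)
print(is_connected(len(X), edges))
```

与标准 k-NN 图不同,这里即使 k=1 也必然连通,这正是摘要中归纳证明所保证的性质。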
[NLP-17] PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗对话场景中应用时的隐私泄露风险问题。当前基于医生-患者对话数据进行监督微调(Supervised Fine-Tuning, SFT)和人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)的方法容易导致训练数据中的敏感信息被记忆并可通过成员推断攻击(Membership Inference Attack)提取,从而危及患者隐私。其解决方案的关键在于提出PrivMedChat框架,实现端到端的差分隐私强化学习(Differentially Private RLHF, DP-RLHF),并在所有直接访问对话监督信号的训练阶段引入差分隐私机制:一是使用差分隐私随机梯度下降(Differential Private Stochastic Gradient Descent, DP-SGD)对医学SFT和奖励模型训练施加隐私保护;二是针对对齐过程中的PPO优化器,在处理对话提示时也应用DP-SGD以限制额外隐私损耗,同时固定训练后的奖励模型以减少隐私预算消耗;此外,还设计了一种无需人工标注的偏好数据构建策略,通过将医师回答与过滤后的非专家生成内容配对,实现可扩展且隐私友好的偏好数据生成。实验表明,该方法在ε=7时达到最优ROUGE-L得分(0.156),显著降低临床幻觉(1.4%)和有害建议(0.4%),且成员推断攻击性能接近随机猜测(AUC 0.510–0.555)。
链接: https://arxiv.org/abs/2603.03054
作者: Sudip Bhujel
机构: University of Kentucky (肯塔基大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differential Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at \varepsilon=7 achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at this https URL.
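PrivMedChat 各训练阶段使用的 DP-SGD,核心是“逐样本梯度裁剪 + 高斯噪声”。下面给出单步更新的 NumPy 示意(超参数与梯度均为虚构玩具值,且省略了隐私会计即 ε 预算的计算):

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One DP-SGD update: clip each per-sample gradient to `clip_norm`,
    sum, add Gaussian noise with scale noise_multiplier * clip_norm,
    then average over the batch and take a gradient step."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=params.shape)
    return params - lr * noisy_sum / len(per_sample_grads)

params = np.zeros(3)
grads = [np.array([3., 0., 0.]),    # norm 3 -> clipped to norm 1
         np.array([0., 0.5, 0.])]   # norm 0.5 -> unchanged
new_params = dp_sgd_step(params, grads)
print(new_params.shape)
```

裁剪限制了任一患者对话样本对更新的最大影响,噪声则掩盖残余信号;实际训练中通常用 Opacus 等库代替手写实现,并由隐私会计折算 (ε, δ)。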
[NLP-18] TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在心理健康领域部署时面临的可信性(trustworthiness)不足问题。由于心理健康场景具有高风险和安全性敏感的特点,现有针对通用LLMs的评估范式无法充分覆盖其特定需求,导致模型在实际应用中存在可靠性、安全性、公平性等方面的显著缺陷。解决方案的关键在于提出TrustMH-Bench框架,该框架通过建立领域特有规范到可量化评估指标的深度映射,系统性地从八个核心维度——可靠性、危机识别与升级、安全性、公平性、隐私保护、鲁棒性、反谄媚性及伦理合规性——对心理健康LLMs进行全面评估,从而为提升模型可信性提供结构化依据与实证基础。
链接: https://arxiv.org/abs/2603.03047
作者: Zixin Xiong,Ziteng Wang,Haotian Fan,Xinjie Zhang,Wenxuan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific requirements, highlighting an urgent need to prioritize and enhance their trustworthiness. To address this, we propose TrustMH-Bench, a holistic framework designed to systematically quantify the trustworthiness of mental health LLMs. By establishing a deep mapping from domain-specific norms to quantitative evaluation metrics, TrustMH-Bench evaluates models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics. We conduct extensive experiments across six general-purpose LLMs and six specialized mental health models. Experimental results indicate that the evaluated models underperform across various trustworthiness dimensions in mental health scenarios, revealing significant deficiencies. Notably, even generally powerful models (e.g., GPT-5.1) fail to maintain consistently high performance across all dimensions. Consequently, systematically improving the trustworthiness of LLMs has become a critical task. Our data and code are released.
[NLP-19] MaBERT: A Padding-Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling
【速读】: 该论文旨在解决自注意力编码器(如BERT)在处理长序列时计算复杂度呈二次增长、导致长上下文建模成本高昂的问题,同时克服线性时间状态空间模型(如Mamba)在建模全局依赖关系上的局限性及因填充导致的状态污染问题。解决方案的关键在于提出MaBERT,一种混合编码器架构,通过交替堆叠Transformer层与Mamba层,实现全局依赖建模与线性时间状态更新的协同优化;此外,引入padding-safe masking机制以阻止填充位置的状态传播,并采用mask-aware attention pooling仅聚合有效token的信息,从而提升可变长度批处理的稳定性,显著降低训练时间和推理延迟,在长序列场景下展现出高效性和优越性能。
链接: https://arxiv.org/abs/2603.03001
作者: Jinwoong Kim,Sangjin Park
机构: Hanyang University (汉阳大学)
类目: Computation and Language (cs.CL)
备注: 8 pages
点击查看摘要
Abstract:Self attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long context modeling expensive. Linear time state space models, such as Mamba, are efficient; however, they show limitations in modeling global interactions and can suffer from padding induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on the CoLA and sentence pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical long context efficient encoder.
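padding-safe masking 与 mask-aware pooling 的直觉可以用一个极简的线性状态递推来演示:填充位置既不更新状态,也不参与池化。以下为笔者虚构的玩具实现(简化的指数衰减递推),并非 MaBERT 中 Mamba 层的真实结构:

```python
import numpy as np

def padding_safe_scan(x, mask, decay=0.9):
    """Toy linear-time state update in the spirit of an SSM layer:
    h_t = decay * h_{t-1} + x_t, but padded steps neither update nor
    propagate state, so padding cannot contaminate h."""
    B, T, D = x.shape
    h = np.zeros((B, D))
    outs = np.zeros_like(x)
    for t in range(T):
        m = mask[:, t][:, None].astype(float)
        h = m * (decay * h + x[:, t]) + (1 - m) * h   # freeze state on padding
        outs[:, t] = h
    return outs

def mask_aware_pool(hidden, mask):
    """Mask-aware mean pooling: aggregate hidden states only over
    valid (non-padded) positions."""
    m = mask[..., None].astype(float)                 # (B, T, 1)
    return (hidden * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1e-9)

x = np.ones((1, 4, 2))
mask = np.array([[1, 1, 0, 0]])        # last two positions are padding
pooled = mask_aware_pool(padding_safe_scan(x, mask), mask)
print(pooled)
```

可以验证,无论在序列尾部再补多少填充,状态与池化结果都不变,这正是“padding-safe”想要的批处理稳定性。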
[NLP-20] Contextualized Privacy Defense for LLM Agents
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLM)代理在多步执行过程中对用户个人隐私信息的不当处理问题,现有隐私防御机制多为静态或被动策略(如提示词引导或限制),难以支持上下文感知的主动决策。其解决方案的关键在于提出一种新的隐私防御范式——情境化防御指导(Contextualized Defense Instructing, CDI),该范式通过一个教学模型在代理执行过程中生成步骤特定、上下文敏感的隐私指导,主动塑造行为而非事后约束或否决;同时,CDI结合基于经验的强化学习优化框架,将包含隐私违规的失败轨迹转化为训练环境,从而实现持续迭代优化。实验表明,CDI在隐私保护率(94.2%)与有用性(80.6%)之间实现了更优平衡,并展现出更强的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2603.02983
作者: Yule Wen,Yanzhe Zhang,Jianxun Lian,Xiaoyuan Yi,Xing Xie,Diyi Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 25 pages
点击查看摘要
Abstract:LLM agents increasingly act on users’ personal information, yet existing privacy defenses remain limited in both design and adaptability. Most prior approaches rely on static or passive defenses, such as prompting and guarding. These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi-step agent execution. We propose Contextualized Defense Instructing (CDI), a new privacy defense paradigm in which an instructor model generates step-specific, context-aware privacy guidance during execution, proactively shaping actions rather than merely constraining or vetoing them. Crucially, CDI is paired with an experience-driven optimization framework that trains the instructor via reinforcement learning (RL), where we convert failure trajectories with privacy violations into learning environments. We formalize baseline defenses and CDI as distinct intervention points in a canonical agent loop, and compare their privacy-helpfulness trade-offs within a unified simulation framework. Results show that our CDI consistently achieves a better balance between privacy preservation (94.2%) and helpfulness (80.6%) than baselines, with superior robustness to adversarial conditions and generalization.
[NLP-21] ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation CVPR2026
【速读】: 该论文旨在解决多任务专家模型(expert models)在无数据、无需重新训练或修改架构的前提下进行合并时,因任务间干扰导致性能显著下降的问题。解决方案的关键在于提出了一种基于参数差异的输入协方差(input covariance)隐式估计方法,该方法可在完全数据自由的设置下准确捕捉每个任务的最优合并信息;进而构建了自适应协方差估计框架(Adaptive Covariance Estimation, ACE-Merging),其核心是提供了一个理论严谨且闭式解(closed-form solution)的优化策略,从而有效缓解任务间的干扰,相较以往依赖迭代或启发式方法的方案具有更高的效率和性能表现。
链接: https://arxiv.org/abs/2603.02945
作者: Bo Xu,Haotian Wu,Hehai Lin,Weiquan Huang,Beier Zhu,Yao Shu,Chengwei Qin
机构: The Hong Kong University of Science and Technology (Guangzhou); University of Science and Technology of China
类目: Computation and Language (cs.CL)
备注: Accepted to CVPR 2026 (Main Track)
点击查看摘要
Abstract:Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.
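若把每层视为线性变换,“以输入协方差加权的闭式合并”有一个经典形式:W* = (Σᵢ Cᵢ)⁻¹ Σᵢ Cᵢ Wᵢ,其中 Cᵢ = Xᵢᵀ Xᵢ。下面的示意直接给定玩具协方差来演示这一闭式解的效果;ACE-Merging 的关键贡献在于从参数差异中估计 Cᵢ 而无需数据,此处从略:

```python
import numpy as np

def covariance_weighted_merge(weights, covs, eps=1e-6):
    """Closed-form merge of linear-layer weights: minimize
    sum_i ||X_i W - X_i W_i||^2, whose solution is
    W* = (sum_i C_i)^{-1} sum_i C_i W_i with C_i = X_i^T X_i.
    Per-task covariances are assumed given here (toy setting)."""
    d = weights[0].shape[0]
    total = sum(covs) + eps * np.eye(d)          # regularize for invertibility
    acc = sum(C @ W for C, W in zip(covs, weights))
    return np.linalg.solve(total, acc)

# Two toy tasks whose inputs excite different directions.
W1, W2 = np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])
C1 = np.diag([4.0, 0.1])   # task 1 mostly uses dim 0
C2 = np.diag([0.1, 4.0])   # task 2 mostly uses dim 1
W = covariance_weighted_merge([W1, W2], [C1, C2])
print(np.round(W, 3))
```

协方差加权使每个任务在自己“真正使用”的输入方向上主导合并结果,这直观解释了为何输入协方差是最优合并的关键量。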
[NLP-22] Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction AAAI2026
【速读】: 该论文旨在解决零样本文档级事件论元抽取(Zero-shot Document-level Event Argument Extraction, ZS-DEAE)中因标注数据稀缺导致的性能瓶颈问题,尤其针对现有基于大语言模型(Large Language Models, LLMs)生成合成数据的方法难以准确捕捉未见事件的上下文与结构关系,且缺乏对合成数据质量的有效评估机制这一挑战。其解决方案的关键在于提出一种多智能体协作框架,模拟人类“提出-评估-修正”的认知过程:通过生成代理(generation agent)利用已知事件知识合成未见事件的合成数据,同时由评估代理(evaluation agent)提取论元并基于语义一致性及事件结构约束进行质量评估,将评估结果转化为奖励信号,从而通过强化学习实现两者的迭代优化。该方法在RAMS和WikiEvents数据集构建的三个零样本场景中均显著提升了合成数据质量和论元抽取性能,并有效增强其他DEAE模型的零样本表现。
链接: https://arxiv.org/abs/2603.02909
作者: Guangjun Zhang,Hu Zhang,Yazhou Han,Yue Fan,Yuhang Shao,Ru Li,Hongye Tan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on Event-type-only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE), which simulates the human collaborative cognitive process of “Propose-Evaluate-Revise.” Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning. On three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements both in data generation quality and argument extraction performance, while the generated data also effectively enhances the zero-shot performance of other DEAE models.
[NLP-23] Eval4Sim: An Evaluation Framework for Persona Simulation
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的对话模拟评估中缺乏对人类真实对话行为的忠实反映问题。现有方法多依赖LLM作为评判者(LLM-as-a-judge),仅提供难以解释的标量评分,且未充分锚定于可观察的人类对话模式。解决方案的关键在于提出Eval4Sim评估框架,该框架通过三个互补维度——一致性(Consistency)、自然性(Naturalness)和依从性(Adherence)——量化模拟对话与人类对话模式的接近程度:其中依从性利用带说话人感知的密集检索衡量角色背景信息在话语中的隐式编码;一致性通过作者身份验证判断角色身份是否稳定;自然性则基于对话导向的自然语言推理分布捕捉对话流的自然度,避免过度优化或机械结构。该框架以PersonaChat等标注说话人级别的对话语料为基准,双向惩罚偏离,从而区分角色编码不足与过度优化的问题,提升了评估的客观性和可解释性。
链接: https://arxiv.org/abs/2603.02876
作者: Eliseo Bao,Anxo Perez,Xi Wang,Javier Parapar
机构: Universidade da Coruña(拉科鲁尼亚大学); University of Sheffield(谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.
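Eval4Sim"以人类语料为基准、双向惩罚偏离"的打分思想可示意如下(假设的简化实现,真实指标分别基于密集检索、作者身份验证与 NLI 分布):

```python
def two_sided_score(metric_value, human_reference, scale=1.0):
    """以人类语料上的统计量为参照:无论统计量高于还是低于参照,
    偏离越大得分越低;恰好等于参照时得满分 1.0。
    这区分了"角色编码不足"(过低)与"过度优化"(过高)两类问题。"""
    deviation = abs(metric_value - human_reference) / scale
    return max(0.0, 1.0 - deviation)

# 高于与低于人类基准相同幅度,扣分相同
over = two_sided_score(1.5, human_reference=1.0)
under = two_sided_score(0.5, human_reference=1.0)
```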
[NLP-24] LaTeX Compilation: Challenges in the Era of LLMs
【速读】: 该论文旨在解决传统排版系统TeX在大语言模型(Large Language Models, LLMs)辅助科学写作场景下的局限性问题,具体表现为编译效率低、语义生成不清晰、错误定位困难以及工具生态薄弱。其解决方案的核心在于引入一种基于所见即所得(WYSIWYG)结构化编辑理念的新型编辑器Mogan STEM,该方案通过高效的数据结构设计、快速渲染机制和按需插件加载策略,在编译/渲染时间、LLM任务性能等方面显著优于TeX。此外,研究发现Mogan使用的文档格式.tmu具有更低的信息熵,更利于LLM微调训练,因而呼吁开展更大规模实验以验证.tmu格式在LLM训练中的优势。
链接: https://arxiv.org/abs/2603.02873
作者: Tianyou Liu,Ziqiang Li,Yansong Li,Xurui Liu
机构: Southern University of Science and Technology (南方科技大学); Alibaba (阿里巴巴); Liii Network (利利网络); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: 25 pages, 12 figures
点击查看摘要
Abstract:As large language models (LLMs) increasingly assist scientific writing, limitations and the significant token cost of TeX become more and more visible. This paper analyzes TeX’s fundamental defects in compilation and user experience design to illustrate its limitations on compilation efficiency, generated semantics, error localization, and tool ecosystem in the era of LLMs. As an alternative, Mogan STEM, a WYSIWYG structured editor, is introduced. Mogan outperforms TeX in the above aspects by its efficient data structure, fast rendering, and on-demand plugin loading. Extensive experiments are conducted to verify the benefits on compilation/rendering time and performance in LLM tasks. What’s more, we show that due to Mogan’s lower information entropy, it is more efficient to use .tmu (the document format of Mogan) to fine-tune LLMs than TeX. Therefore, we launch an appeal for larger experiments on LLM training using the .tmu format.
[NLP-25] Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在理解图表中元素间关系时的局限性,尤其是对由节点和有向边(如箭头和线条)表示的关系理解能力不足的问题。其解决方案的关键在于通过构建基于有向图的合成图表数据集,对LVLM内部表征进行探针实验,发现节点信息和全局结构特征在视觉编码器的隐藏状态中已呈线性可分,而边信息则仅在语言模型的文本标记中才呈现线性编码,表明边信息的表征形成存在延迟。这一发现揭示了不同视觉信息类型在模型处理阶段中线性可分性的差异,从而为解释LVLM在关系推理上的困难提供了机制层面的依据。
链接: https://arxiv.org/abs/2603.02865
作者: Haruto Yoshida,Keito Kudo,Yoichi Aoki,Ryota Tanaka,Itsumi Saito,Keisuke Sakaguchi,Kentaro Inui
机构: Tohoku University(东北大学); Human Informatics Labs., NTT, Inc.(人类信息学实验室,NTT公司)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.
[NLP-26] The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models
【速读】: 该论文旨在解决语言中音位(phoneme)频率分布的跨语言规律性问题,即如何从宏观和微观两个层面解释不同语言中音位出现频率的差异与共性。其解决方案的关键在于:在宏观层面,发现音位频次排名分布符合对称狄利克雷分布(symmetric Dirichlet distribution)的顺序统计特性,且其浓度参数随音位库规模系统性变化,揭示了音位库越大相对熵越低的补偿效应;在微观层面,构建了一个最大熵模型(Maximum Entropy model),引入发音、音系及词汇结构约束,能够准确预测特定语言的音位概率分布。二者结合提供了一个统一的信息论框架来刻画音位频率结构。
链接: https://arxiv.org/abs/2603.02860
作者: Fermín Moscoso del Prado Martín,Suchir Salhan
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.
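文中的对称狄利克雷分布可以在纯 Python 中用标准的 Gamma 归一化法采样,并计算其相对熵(示意代码,音位库规模 k 与浓度参数 alpha 的取值均为假设):

```python
import math
import random

def symmetric_dirichlet(k, alpha, rng):
    """对称 Dirichlet(alpha) 采样:k 个独立 Gamma(alpha, 1) 样本归一化即可。"""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

def relative_entropy(p):
    """相对熵 H(p) / log k,取值范围 (0, 1];均匀分布时为 1。"""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

rng = random.Random(0)
# 将分量降序排列,即得"音位排名-频率"曲线的一次模拟
freqs = sorted(symmetric_dirichlet(k=30, alpha=0.5, rng=rng), reverse=True)
```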
[NLP-27] A Browser-based Open Source Assistant for Multimodal Content Verification
【速读】: 该论文旨在解决生成式 AI(Generative AI)产生的虚假信息和误导性内容对新闻从业者与事实核查人员构成的挑战,即如何高效、准确地验证数字媒体信息的真实性。现有自然语言处理(Natural Language Processing, NLP)模型虽能识别诸如说服技巧、主观性或机器生成文本等可信度信号,但这些方法通常难以被非专家用户使用,且未嵌入其日常工作中。解决方案的关键在于开发了一个名为 VERIFICATION ASSISTANT 的浏览器端工具,作为广泛采用的 VERIFICATION PLUGIN 的核心组件,该工具提供统一接口以提交 URL 或媒体文件,并自动调用后端多个 NLP 分类器,输出清晰易懂的可信度信号、AI 生成内容概率估计及其他验证建议,从而实现可操作、集成化的信息验证流程。
链接: https://arxiv.org/abs/2603.02842
作者: Rosanna Milner,Michael Foster,Olesya Razuvayevskaya,Ian Roberts,Valentin Porcellini,Denis Teyssou,Kalina Bontcheva
机构: University of Sheffield (谢菲尔德大学); AFP Medialab (法新社媒体实验室)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.
[NLP-28] Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs
【速读】: 该论文旨在解决如何在教育学习平台中有效预测学生未来答题行为的问题,核心关注点在于比较大型语言模型(Large Language Models, LLMs)与知识追踪(Knowledge Tracing, KT)模型在该任务上的性能差异。解决方案的关键在于系统性地对比两类模型在预测准确性、推理速度和部署成本三个维度的表现:研究发现,KT模型在特定教育领域任务上显著优于LLMs,在准确率和F1分数上更具优势,且推理速度更快、部署成本更低;而LLMs则表现出明显的效率劣势,表明针对教育预测任务应优先采用领域专用的小型时序模型,而非通用型的大模型。
链接: https://arxiv.org/abs/2603.02830
作者: Prarthana Bhattacharyya,Joshua Mitton,Ralph Abboud,Simon Woodhead
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures. Prarthana Bhattacharyya and Joshua Mitton contributed equally to this work
点击查看摘要
Abstract:Predicting future student responses to questions is particularly valuable for educational learning platforms where it enables effective interventions. One of the key approaches to do this has been through the use of knowledge tracing (KT) models. These are small, domain-specific, temporal models trained on student question-response data. KT models are optimised for high accuracy on specific educational domains and have fast inference and scalable deployments. The rise of Large Language Models (LLMs) motivates us to ask the following questions: (1) How well can LLMs perform at predicting students’ future responses to questions? (2) Are LLMs scalable for this domain? (3) How do LLMs compare to KT models on this domain-specific task? In this paper, we compare multiple LLMs and KT models across predictive performance, deployment cost, and inference speed to answer the above questions. We show that KT models outperform LLMs with respect to accuracy and F1 scores on this domain-specific task. Further, we demonstrate that LLMs are orders of magnitude slower than KT models and cost orders of magnitude more to deploy. This highlights the importance of domain-specific models for education prediction tasks and the fact that current closed source LLMs should not be used as a universal solution for all tasks.
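作为参照,知识追踪中最经典的 BKT(Bayesian Knowledge Tracing)更新只需几行代码,这也直观说明了此类小型领域模型的推理开销为何远低于 LLM(下例为经典 BKT 公式,参数取值为假设,并非该论文所用的具体 KT 模型):

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.3):
    """经典 BKT:先根据作答结果做贝叶斯后验更新,再叠加学习迁移概率。
    slip=失误率, guess=猜对率, learn=单步学会概率(均为假设值)。"""
    if correct:
        num = p_know * (1 - slip)
        den = num + (1 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1 - p_know) * (1 - guess)
    posterior = num / den
    return posterior + (1 - posterior) * learn

# 沿作答序列逐步更新"已掌握"概率
p = 0.5
for answer in [True, True, False]:
    p = bkt_update(p, answer)
```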
[NLP-29] Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在高风险决策场景(如临床诊断)中缺乏可靠验证机制的问题,以提升其部署的可信度。解决方案的关键在于提出GLEAN框架,该框架基于指南的证据累积机制,将专家制定的诊疗协议转化为轨迹感知且校准良好的正确性信号;通过评估每一步与领域指南的一致性,并聚合多指南评分作为代理特征,利用贝叶斯逻辑回归对这些特征进行轨迹累积和概率校准,从而生成高精度的决策置信度;此外,估计的不确定性可触发主动验证机制,通过扩展指南覆盖范围和执行差异性检查来针对性收集额外证据,显著提升了模型在判别能力(AUROC提升12%)和校准性能(Brier分数降低50%)上的表现。
链接: https://arxiv.org/abs/2603.02798
作者: Yichi Zhang,Nabeel Seedat,Yinpeng Dong,Peng Cui,Jun Zhu,Mihaela van de Schaar
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As LLM-powered agents have been used for high-stakes decision-making, such as clinical diagnosis, it becomes critical to develop reliable verification of their decisions to facilitate trustworthy deployment. Yet, existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. To address this, we establish GLEAN, an agent verification framework with Guideline-grounded Evidence Accumulation that compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. GLEAN evaluates the step-wise alignment with domain guidelines and aggregates multi-guideline ratings into surrogate features, which are accumulated along the trajectory and calibrated into correctness probabilities using Bayesian logistic regression. Moreover, the estimated uncertainty triggers active verification, which selectively collects additional evidence for uncertain cases via expanding guideline coverage and performing differential checks. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset, surpassing the best baseline by 12% in AUROC and 50% in Brier score reduction, which confirms the effectiveness in both discrimination and calibration. In addition, the expert study with clinicians recognizes GLEAN’s utility in practice.
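GLEAN 中"逻辑回归校准 + 不确定性触发主动验证"的骨架可示意如下(点估计的简化版本,真实方法使用贝叶斯逻辑回归并沿轨迹累积特征;权重与阈值均为假设值):

```python
import math

def calibrated_confidence(features, weights, bias):
    """逻辑回归式校准:把指南符合度等代理特征映射为正确概率。"""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def needs_active_verification(p, band=0.2):
    """概率落在 0.5 附近的不确定带内时,触发主动验证以收集更多证据
    (对应论文中扩展指南覆盖、执行差异性检查的机制)。"""
    return abs(p - 0.5) < band

p = calibrated_confidence(features=[0.9, 0.4, 0.7],
                          weights=[2.0, 1.0, 1.5], bias=-2.0)
```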
[NLP-30] OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在商业文档信息抽取任务中的实际效能问题,特别是验证仅使用图像输入的MLLM-only管道是否能与传统OCR+MLLM架构相媲美。其解决方案的关键在于:首先通过大规模基准测试评估多种现成MLLMs的表现;其次提出一种基于大语言模型(Large Language Models, LLMs)的自动化分层错误分析框架,系统性诊断失败模式;最终发现,对于强大的MLLM而言,OCR可能并非必要,图像输入即可达到与OCR增强方法相当的性能,同时强调精心设计的schema、示例和指令可进一步提升模型表现。
链接: https://arxiv.org/abs/2603.02789
作者: Jiyuan Shen,Peiyue Yuan,Atin Ghosh,Yifan Mai,Daniel Dahlmeier
机构: SAP; Stanford University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline–while simpler–can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schema, exemplars, and instructions can further enhance MLLMs performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.
[NLP-31] From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在数学教学场景中评估不足的问题,尤其是现有评价体系多依赖简单指标或局限于单一教学情境,无法全面衡量模型在多轮互动中的综合教学能力。其解决方案的关键在于构建了一个面向K-8年级的综合性数学教学评估基准——KMP-Bench,该基准包含两个互补模块:KMP-Dialogue用于从六个核心教学原则(如挑战性、解释力、反馈机制等)出发,通过一个融合多样教学要素的多轮对话数据集评估模型的整体教学素养;KMP-Skills则聚焦于基础教学能力的细粒度测评,包括多轮解题、错误识别与修正、题目生成等任务。此外,研究还提出了KMP-Pile这一大规模(15万条)教学对话数据集,实验证明基于该数据集微调后的模型在KMP-Bench上表现显著提升,凸显了高质量教学语料对增强AI数学助教能力的重要性。
链接: https://arxiv.org/abs/2603.02775
作者: Weikang Shi,Houxing Ren,Junting Pan,Aojun Zhou,Ke Wang,Zimu Lu,Yunqiao Yang,Yuxuan Hu,Linda Wei,Mingjie Zhan,Hongsheng Li
机构: 1. Tsinghua University (清华大学); 2. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 3. Peking University (北京大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.
[NLP-32] Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration
【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)因非序列化、双向掩码生成机制导致的质量评估困难问题,尤其是缺乏有效的自评估方法来衡量生成内容的可靠性。解决方案的关键在于提出DiSE(Diffusion Self-Evaluation),该方法通过计算在给定完整上下文条件下重新生成整个生成序列中所有token的概率来量化置信度,从而实现更高效且可靠的自我评估。这一机制不仅支持似然估计,还提供了鲁棒的不确定性量化能力,并进一步构建了一个基于模型自评估结果动态调整生成长度的灵活长度生成框架,实验证明DiSE与语义连贯性和答案准确性呈正相关,显著提升了dLLMs的质量控制能力。
链接: https://arxiv.org/abs/2603.02760
作者: Linhao Zhong,Linyu Wu,Wen Wang,Yuling Xi,Chenchen Jing,Jiaheng Zhang,Hao Chen,Chunhua Shen
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学); Zhejiang University of Technology (浙江工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model’s self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.
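DiSE 的核心量,即"在完整上下文条件下重新生成整个序列的概率",可以用逐 token 再生成概率的几何平均来示意(简化实现,具体的聚合与归一化方式以论文为准):

```python
import math

def regeneration_confidence(token_probs):
    """给定完整上下文下逐 token 的再生成概率,
    先取对数求均值再取指数,得到与长度无关的序列级置信度(几何平均)。"""
    log_p = sum(math.log(p) for p in token_probs)
    return math.exp(log_p / len(token_probs))

# 某个位置再生成概率骤降(0.2),整体置信度随之明显下降
high = regeneration_confidence([0.9, 0.8, 0.95, 0.9])
low = regeneration_confidence([0.9, 0.2, 0.95, 0.9])
```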
[NLP-33] Sensory-Aware Sequential Recommendation via Review-Distilled Representations
【速读】: 该论文旨在解决传统序列推荐模型在捕捉商品细粒度感知特征方面的局限性,即如何有效利用用户评论中蕴含的感官属性(sensory attributes)来增强物品表征,从而提升推荐效果。其解决方案的关键在于提出了一种两阶段框架——ASEGR(Attribute-based Sensory Enhanced Generative Recommendation),首先使用大语言模型作为教师模型从非结构化评论文本中提取结构化的感官属性-值对(如颜色:哑光黑、气味:香草),随后将这些结构化信息蒸馏至轻量级学生Transformer模型,生成固定维度的感官嵌入(sensory embeddings)。这些嵌入以可复用的形式编码体验语义,并作为额外的物品级表示集成到主流序列推荐架构(如SASRec、BERT4Rec等)中,实验证明该方法在多个Amazon领域均显著优于仅依赖行为交互的基线模型,且所提取属性与人类感知高度一致,增强了推荐的可解释性。
链接: https://arxiv.org/abs/2603.02709
作者: Yeo Chan Yoon
机构: Jeju National University (济州国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We propose a novel framework for sensory-aware sequential recommendation that enriches item representations with linguistically extracted sensory attributes from product reviews. Our approach, ASEGR (Attribute-based Sensory Enhanced Generative Recommendation), introduces a two-stage pipeline in which a large language model is first fine-tuned as a teacher to extract structured sensory attribute–value pairs, such as "color: matte black" and "scent: vanilla", from unstructured review text. The extracted structures are then distilled into a compact student transformer that produces fixed-dimensional sensory embeddings for each item. These embeddings encode experiential semantics in a reusable form and are incorporated into standard sequential recommender architectures as additional item-level representations. We evaluate our method on four Amazon domains and integrate the learned sensory embeddings into representative sequential recommendation models, including SASRec, BERT4Rec, and BSARec. Across domains, sensory-enhanced models consistently outperform their identifier-based counterparts, indicating that linguistically grounded sensory representations provide complementary signals to behavioral interaction patterns. Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior. Overall, this work demonstrates that sensory attribute distillation offers a principled and scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning.
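教师模型产出的"属性-值对"(如 color: matte black)结构非常简单,可用如下玩具级解析示意(真实系统由微调的 LLM 教师抽取,再蒸馏为固定维度嵌入;此处的正则规则仅作说明):

```python
import re

# 匹配 "attr: value" 形式的片段,值允许包含空格(如 "matte black")
ATTR_PATTERN = re.compile(r"(\w+)\s*:\s*([\w ]+)")

def parse_attribute_pairs(text):
    """从评论式文本中解析感官属性-值对,返回字典。"""
    return {k.strip(): v.strip() for k, v in ATTR_PATTERN.findall(text)}

pairs = parse_attribute_pairs("color: matte black; scent: vanilla")
```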
[NLP-34] Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的多智能体系统(Multi-Agent System, MAS)中通信拓扑结构优化问题,尤其是现有基于强化学习的方法因依赖单样本策略梯度与绝对奖励(如二值正确性)而导致梯度方差大、信用分配困难的问题。其解决方案的关键在于提出Graph-GRPO框架,通过引入组相对策略优化(Group Relative Policy Optimization),在每次查询时采样一组多样化的通信图,并基于这些图中边的相对表现计算优势值,从而实现奖励归一化,有效抑制任务难度差异带来的噪声并支持细粒度的信用分配,最终提升训练稳定性和识别关键通信路径的能力。
链接: https://arxiv.org/abs/2603.02701
作者: Yueyang Cang,Xiaoteng Zhang,Erlu Zhao,Zehua Ji,Yuhang Liu,Yuchen He,Zhiyuan Ning,Chen Yijun,Wenge Que,Li Shi
机构: Tsinghua University (清华大学); Donghua University (东华大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
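组相对优势的计算本身很短:对同一查询采样的一组通信图,用组内均值与标准差归一化各自的奖励(标准的 GRPO 式归一化示意):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO 式优势:按组内相对表现归一化奖励,
    使简单查询的"全对"与困难查询的"全错"都不再淹没学习信号。"""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 同一查询下 4 个候选拓扑的二值正确性奖励
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```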
[NLP-35] HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse LREC2026
【速读】: 该论文旨在解决在线安全研究中一个长期被忽视的问题:隐性仇恨言论(subtle and indirect hate speech)的识别与解释难题,尤其是当有害意图嵌套在误导或操纵性叙事中时,现有数据集因主要聚焦于显性毒性而难以捕捉此类复杂现象。解决方案的关键在于构建 HateMirage 数据集,该数据集由广泛被辟谣的虚假信息(faux information)相关的 YouTube 用户评论组成,共 4,530 条,并引入三个可解释维度——目标对象(Target)、意图(Intent)和潜在社会影响(Implication),从而形成多维解释框架(multi-dimensional explanation framework),突破以往仅提供词级别或单维度解释的局限,使模型能够更全面地理解虚假信息如何引发或合理化仇恨行为。这一设计为可解释仇恨检测与负责任的人工智能研究提供了新的基准。
链接: https://arxiv.org/abs/2603.02684
作者: Sai Kartheek Reddy Kasu,Shankar Biradar,Sunil Saumya,Md. Shad Akhtar
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted at LREC 2026
点击查看摘要
Abstract:Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data rather than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
[NLP-36] ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理任务中,尤其是在多语言情境下所表现出的内容偏差(content effects)问题。其解决方案的关键在于引入一种显式的结构抽象方法,将三段论(syllogisms)转化为规范化的逻辑表示,并通过确定性解析(deterministic parsing)来判定推理的有效性,从而有效降低内容相关偏差,且无需依赖复杂的微调或激活层干预。
链接: https://arxiv.org/abs/2603.02676
作者: Wicaksono Leksono Muhamad,Joanito Agili Lopo,Tack Hwa Wong,Muhammad Ravi Shulthan Habibi,Samuel Cahyawijaya
机构: SEACrowd; Mantera Studio; Kreasof AI; Universiti Teknologi PETRONAS; Universitas Indonesia; Cohere
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.
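摘要所述"结构抽象 + 确定性解析"可以用如下小例示意(仅枚举 Barbara、Celarent 两个经典有效式,函数与变量命名为假设,并非该系统的完整实现):

```python
# 有效式以 (量词, 主项, 谓项) 三元组的规范形式查表,判定完全确定性
VALID_FORMS = {
    # Barbara: 所有 M 是 P;所有 S 是 M ⊢ 所有 S 是 P
    (("all", "M", "P"), ("all", "S", "M"), ("all", "S", "P")),
    # Celarent: 没有 M 是 P;所有 S 是 M ⊢ 没有 S 是 P
    (("no", "M", "P"), ("all", "S", "M"), ("no", "S", "P")),
}

def normalize(premise1, premise2, conclusion):
    """以结论的主项、谓项确定 S 与 P,其余词项记为中项 M,
    从而剥离具体内容;正是这种抽象抑制了内容偏差。"""
    s, p = conclusion[1], conclusion[2]
    rename = lambda t: {s: "S", p: "P"}.get(t, "M")
    return tuple((q, rename(a), rename(b))
                 for q, a, b in (premise1, premise2, conclusion))

def is_valid(premise1, premise2, conclusion):
    return normalize(premise1, premise2, conclusion) in VALID_FORMS

valid = is_valid(("all", "men", "mortal"),
                 ("all", "greeks", "men"),
                 ("all", "greeks", "mortal"))
```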
[NLP-37] Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory ICLR2026
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)评估基准中存在的“捷径问题”(shortcut questions),即部分题目仅依赖单一模态即可作答,导致评估结果无法真实反映模型的跨模态推理能力,且增加了不必要的计算成本。其解决方案的关键在于提出一种多模态多维项目反应理论框架(Multi-modal and Multidimensional Item Response Theory, M3IRT),该框架通过将模型能力与题目难度分解为图像独有、文本独有和跨模态三个分量,从而量化模型的跨模态推理能力及每道题目的跨模态难度,进而筛选出高质量、紧凑的测试子集,确保在保留排名保真度的同时显著降低评估成本。
链接: https://arxiv.org/abs/2603.02663
作者: Shunki Uebayashi,Kento Masui,Kyohei Atarashi,Han Bao,Hisashi Kashima,Naoto Inoue,Mayu Otani,Koh Takeuchi
机构: Kyoto University (京都大学); CyberAgent. (CyberAgent); The Institute of Statistical Mathematics (统计数学研究所); Tohoku University (东北大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 24pages, 20 figures, accepted to ICLR2026
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question’s cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.
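M3IRT 把能力与难度各分解为 image / text / cross 三个分量的思路可示意如下(简化的 logit 叠加形式,具体的参数化与估计方法以论文为准;数值均为假设):

```python
import math

def m3irt_probability(theta, difficulty):
    """多维 IRT 的简化形式:三个分量上的 (能力 - 难度) 累加为 logit,
    再经 sigmoid 得到答对概率。"""
    z = sum(theta[d] - difficulty[d] for d in ("image", "text", "cross"))
    return 1.0 / (1.0 + math.exp(-z))

# 同一道跨模态难度较高的题目:跨模态能力强弱直接拉开答对概率
item = {"image": 0.0, "text": 0.0, "cross": 1.0}
strong = m3irt_probability({"image": 1.0, "text": 1.0, "cross": 1.5}, item)
weak = m3irt_probability({"image": 1.0, "text": 1.0, "cross": -0.5}, item)
```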
[NLP-38] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches LREC2026
【速读】: 该论文旨在解决实时视频解说生成中“说什么”和“何时说”的双重挑战,尤其关注现有基于提示(prompting)的多模态大语言模型(Multimodal Large Language Models, MLLMs)方法在时间节奏控制上的不足。其解决方案的关键在于提出两种无需微调的提示解码策略:一是固定间隔法,二是创新性的动态间隔解码法,后者根据前一语句的预计时长自适应调整下一次预测的时间点,从而实现对停顿敏感的、与人类解说节奏更一致的文本生成。实验表明,动态间隔策略在日语和英语的竞速类与格斗类游戏数据集上显著提升了内容相关性和时间准确性。
链接: https://arxiv.org/abs/2603.02655
作者: Anum Afzal,Yuki Saito,Hiroya Takamura,Katsuhito Sudoh,Shinnosuke Takamichi,Graham Neubig,Florian Matthes,Tatsuya Ishigaki
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at LREC2026
点击查看摘要
Abstract:Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
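动态间隔解码"按上一句预计播报时长决定下一次预测时间点"的调度逻辑可示意如下(chars_per_second 与 min_gap 为假设的语速与最小间隔参数):

```python
def schedule_next_commentary(now, utterance, chars_per_second=8.0, min_gap=0.5):
    """用字符数除以语速估计上一句的播报时长,
    下一次生成安排在该时长之后(不短于最小间隔)。"""
    duration = len(utterance) / chars_per_second
    return now + max(duration, min_gap)

# 依次生成两条解说,记录各自的触发时间点
t = 0.0
timeline = []
for line in ["Red car takes the lead!", "Huge crash in turn 3!"]:
    timeline.append(t)
    t = schedule_next_commentary(t, line)
```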
[NLP-39] Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models
【速读】: 该论文旨在解决代理型大语言模型(agentic large language model)工作负载中因重复推理步骤和多调用循环导致的预填充(prefill)成本过高问题,尤其是在缺乏同族小型草稿模型(draft model)的情况下。解决方案的关键在于提出跨家族推测性预填充(cross-family speculative prefill)机制,即利用来自不同模型家族的轻量级草稿模型对目标模型进行提示压缩,通过注意力机制估计token重要性来实现训练-free的prompt压缩。实验证明,尽管草稿模型与目标模型在架构和分词器上存在差异,该方法仍能稳定迁移token重要性信息,在保持90~100%全提示基准性能的同时显著降低首次标记时间(TTFT),表明该方法依赖于任务先验和语义结构,具备良好的泛化能力。
链接: https://arxiv.org/abs/2603.02631
作者: Shubhangi Upasani,Ravi Shanker Raju,Bo Li,Mengmeing Ji,John Long,Chen Wu,Urmish Thakker,Guangtao Wang
机构: SambaNova AI; Microsoft AI; Meta Inc
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad diversity of tasks, we find that attention-based token importance estimation transfers reliably across different model families despite differences in model architectures and tokenizers between draft and target models. Cross-model prompt compression largely retains 90~100% of full-prompt baseline performance and, in some cases, slightly improves accuracy due to denoising effects, while delivering substantial reductions in time to first token (TTFT). These results suggest that speculative prefill depends mainly on task priors and semantic structure, thus serving as a generalizable prompt compression primitive. We discuss the implications of our findings for agentic systems, where repeated long-context inference and heterogeneous model stacks make cross-model prompt compression both necessary and practical.
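推测性预填充的核心操作,即按草稿模型给出的 token 重要性裁剪提示,可示意如下(重要性分数在此为假设输入,真实分数来自草稿模型的注意力):

```python
def prune_prompt(tokens, importance, keep_ratio=0.5):
    """仅保留重要性得分最高的一部分 token,并维持其在原提示中的顺序,
    以压缩后的提示喂给目标模型做预填充。"""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(range(len(tokens)),
                  key=lambda i: importance[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]

tokens = ["The", "quick", "brown", "fox", "jumps", "high"]
scores = [0.1, 0.9, 0.2, 0.8, 0.7, 0.05]
compressed = prune_prompt(tokens, scores, keep_ratio=0.5)
```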
[NLP-40] Think But Don't Overthink: Reproducing Recursive Language Models
【Quick Read】: This paper addresses the high compute cost and low inference efficiency of large language models (LLMs) on long contexts. It builds on the Recursive Language Models (RLMs) framework, which offloads the prompt into an external REPL (Read-Eval-Print Loop) environment so the context can grow almost without bound instead of failing at the input-length limit. The study finds a non-linear trade-off between recursion depth and efficiency: moderate recursion (depth 1) improves accuracy on complex reasoning tasks, but deeper recursion (depth 2) makes the model "overthink", degrading performance while sharply inflating execution time and token cost.
Link: https://arxiv.org/abs/2603.02615
Authors: Daren Wang
Affiliations: The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This project reproduces and extends the recently proposed “Recursive Language Models” (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to “overthink”. While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: this https URL
[NLP-41] GPUTOK: GPU Accelerated Byte Level BPE Tokenization
【Quick Read】: This paper tackles the bottleneck that CPU tokenizers create for large language models (LLMs) at million-token context lengths: serial tokenization leaves powerful GPUs idle. The authors design a GPU byte-level BPE tokenizer that follows GPT-2's merge rules, with optimizations including a cuCollections static map, CUB reductions, and a pybind11 interface. On the longest inputs the GPU tokenizer is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer, while its outputs stay within about one percentage point of the baseline tokenizers on similarity and overlap metrics, preserving output quality while making long-context inference more practical.
Link: https://arxiv.org/abs/2603.02597
Authors: Venu Gopal Kadamba, Kanishkha Jaisankar
Affiliations: New York University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that follows GPT-2’s merge rules. It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python. On WikiText103 sequences up to 131k tokens, the optimized GPU tokenizer produces the same tokens as a CPU version and, for the longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time goes to memory allocation, so adding memory pooling should give the biggest speed boost next. Tests on generation tasks using WikiText103 prompts show that our GPU tokenizer’s outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, meaning it keeps output quality while making long-context inference more practical.
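For reference, the serial merge loop that such GPU kernels parallelise looks roughly like this; a minimal byte-pair-encoding sketch with a hypothetical three-entry merge table, not GPT-2's actual vocabulary:

```python
def bpe_encode(word, merges):
    """Reference byte-level BPE merge loop (the CPU baseline a GPU kernel
    would parallelise): repeatedly apply the highest-priority merge
    until no mergeable pair remains."""
    tokens = list(word)
    while len(tokens) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Hypothetical merge table (lower rank = higher priority).
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode("lower", merges))  # ['low', 'er']
```

The data dependence between successive merges is exactly what makes this loop hard to parallelise, and why the paper's kernels and hash-map lookups matter.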
[NLP-42] ExpGuard: LLM Content Moderation in Specialized Domains ICLR2026
【Quick Read】: This paper addresses the weakness of general-purpose guardrail models against harmful or adversarial content laden with technical jargon and specialized concepts in the financial, medical, and legal domains. The key contribution is ExpGuard, a robust, domain-specialized guardrail model, together with ExpGuardMix, a carefully curated dataset of 58,928 labeled prompts paired with refusal and compliant responses, split into a training set (ExpGuardTrain) and an expert-annotated test set (ExpGuardTest). This combination markedly improves recognition of and defense against domain-specific attacks, surpassing state-of-the-art models such as WildGuard on multiple benchmarks.
Link: https://arxiv.org/abs/2603.02588
Authors: Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak, Juyoung Oh, Jaegul Choo, Jungmin Son
Affiliations: KAIST AI; Financial Tech Lab, KakaoBank Corp
Subjects: Computation and Language (cs.CL)
Comments: ICLR 2026
Abstract:With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
[NLP-43] Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs ICLR2026
【Quick Read】: This paper targets visual hallucinations in reasoning paths generated by vision language models (VLMs), which are hard to verify or rectify and thus limit self-improvement of reasoning ability. The key idea is visual contrast: given a contrastive VQA pair (two visually similar images with synonymous questions), VLMs identify the relevant visual cues more precisely. Building on this, the authors propose the Visual Contrastive Self-Taught Reasoner (VC-STaR), a self-improving framework that uses visual contrast to generate high-quality rationales, and use it to build VisCoR-55K, a large visual reasoning dataset for further fine-tuning VLMs. Experiments show VC-STaR outperforms existing self-improving approaches and even models fine-tuned on state-of-the-art visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning.
Link: https://arxiv.org/abs/2603.02556
Authors: Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye
Affiliations: Huazhong University of Science and Technology; Alibaba Cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures, accepted to ICLR 2026 (oral)
Abstract:Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: this https URL.
[NLP-44] CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think
【Quick Read】: This paper asks why continuous diffusion language models (DLMs) lag behind discrete diffusion approaches in generation quality, and identifies token rounding — the final projection from denoised embeddings to discrete tokens — as the primary bottleneck. The proposed CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder) is a two-stage framework: the diffusion process stays entirely in embedding space, and a contextual autoregressive decoder then cross-attends to the denoised embedding sequence to perform context-conditioned token rounding. On LM1B and OpenWebText, CoDAR substantially outperforms latent diffusion and becomes competitive with strong discrete DLMs, while a decoder-temperature knob offers a flexible fluency-diversity trade-off.
Link: https://arxiv.org/abs/2603.02547
Authors: Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token–recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two–stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context–conditional discretizer: an autoregressive Transformer decoder that cross–attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder–temperature knob to navigate the fluency–diversity trade off.
[NLP-45] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models ACL2026
【Quick Read】: This paper addresses the fact that safety evaluation and red-teaming of large language models (LLMs) remain text-centric, with no systematic way to test whether alignment generalizes across audio, image, and video inputs. The solution is MUSE (Multimodal Unified Safety Evaluation), a platform integrating automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy. A dual-metric framework separates hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance) to capture the partial information leakage that binary metrics miss. A key addition is Inter-Turn Modality Switching (ITMS), which rotates the input modality on every attack turn; experiments show ITMS does not raise final ASR on already-saturated baselines but accelerates convergence by destabilizing early-turn defenses, and that the direction of modality effects is model-family-specific, underscoring the need for provider-aware cross-modal safety testing.
Link: https://arxiv.org/abs/2603.02482
Authors: Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen
Affiliations: Duke University; Virtue AI
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to ACL 2026 System Demonstration Track
Abstract:Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
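The dual-metric framework reduces to a simple count over judge verdicts; a sketch assuming verdict labels drawn from the paper's five-level taxonomy (only the three shown here affect the two rates):

```python
def attack_success_rates(verdicts):
    """Dual-metric framing: hard ASR counts only full Compliance;
    soft ASR also counts Partial Compliance, capturing partial
    information leakage that a binary success metric would miss."""
    n = len(verdicts)
    hard = sum(v == "Compliance" for v in verdicts) / n
    soft = sum(v in ("Compliance", "Partial Compliance") for v in verdicts) / n
    return hard, soft

verdicts = ["Refusal", "Compliance", "Partial Compliance", "Refusal", "Compliance"]
print(attack_success_rates(verdicts))  # (0.4, 0.6)
```

The gap between the two rates is itself informative: a large soft-minus-hard gap flags models that avoid full compliance but still leak.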
[NLP-46] GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR ICASSP2026
【Quick Read】: This paper targets automatic speech recognition (ASR) in dialect-heavy settings, where strong regional variation meets scarce labeled data. The key idea is GLoRIA, a parameter-efficient adaptation framework that uses metadata such as geographic coordinates to modulate low-rank updates in a pre-trained encoder: low-rank matrices are injected into each feed-forward layer, and a gating MLP determines the non-negative contribution of each rank-1 LoRA component from the location metadata, yielding efficient, interpretable, and well-generalizing dialectal ASR adaptation.
Link: https://arxiv.org/abs/2603.02464
Authors: Pouya Mehralian, Melissa Farasyn, Anne Breitbarth, Anne-Sophie Ghyselen, Hugo Van hamme
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2026. 5 pages
Abstract:Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framework that leverages metadata (e.g., coordinates) to modulate low-rank updates in a pre-trained encoder. GLoRIA injects low-rank matrices into each feed-forward layer, with a gating MLP determining the non-negative contribution of each LoRA rank-1 component based on location metadata. On the GCND corpus, GLoRIA outperforms geo-conditioned full fine-tuning, LoRA, and both dialect-specific and unified full fine-tuning, achieving state-of-the-art word error rates while updating under 10% of parameters. GLoRIA also generalizes well to unseen dialects, including in extrapolation scenarios, and enables interpretable adaptation patterns that can be visualized geospatially. These results show metadata-gated low-rank adaptation is an effective, interpretable, and efficient solution for dialectal ASR.
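The metadata-gated low-rank update can be sketched as below; a toy NumPy version in which the gating "MLP" is a single linear layer with a ReLU to keep contributions non-negative — the layer sizes, single-layer gate, and coordinate values are all assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 4          # feed-forward width, number of rank-1 LoRA components

W = rng.normal(size=(d, d))          # frozen pre-trained feed-forward weight
A = rng.normal(size=(r, d)) * 0.1    # down-projection rows (one per component)
B = rng.normal(size=(d, r)) * 0.1    # up-projection columns (one per component)
G = rng.normal(size=(r, 2)) * 0.1    # gate weights acting on 2-D location metadata

def gloria_forward(x, coords):
    """x: (d,) activation; coords: (2,) location metadata, e.g. (lat, lon)."""
    g = np.maximum(G @ coords, 0.0)   # non-negative contribution per rank-1 component
    delta = B @ (g * (A @ x))         # gated low-rank update: sum_i g_i * b_i (a_i . x)
    return W @ x + delta

x = rng.normal(size=d)
y = gloria_forward(x, np.array([51.05, 3.72]))
print(y.shape)  # (8,)
```

Because the gates depend only on location, inspecting g across a map of coordinates gives the geospatially interpretable adaptation patterns the paper describes.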
[NLP-47] RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks
【Quick Read】: This paper addresses poor ASR generalization in low-resource and out-of-distribution (OOD) conditions. The key contribution is RO-N3WS, a diverse Romanian speech benchmark spanning broadcast news, literary audiobooks, film dialogue, children's stories, and conversational podcasts, enabling robust training and fine-tuning across stylistically distinct domains. Experiments show that even limited fine-tuning on real speech yields substantial improvements over zero-shot baselines, validating the dataset's value for adapting ASR models.
Link: https://arxiv.org/abs/2603.02368
Authors: Alexandra Diaconu, Mădălina Vînaga, Bogdan Alexe
Affiliations: University of Bucharest
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments:
Abstract:We introduce RO-N3WS, a benchmark Romanian speech dataset designed to improve generalization in automatic speech recognition (ASR), particularly in low-resource and out-of-distribution (OOD) conditions. RO-N3WS comprises over 126 hours of transcribed audio collected from broadcast news, literary audiobooks, film dialogue, children’s stories, and conversational podcast speech. This diversity enables robust training and fine-tuning across stylistically distinct domains. We evaluate several state-of-the-art ASR systems (Whisper, Wav2Vec 2.0) in both zero-shot and fine-tuned settings, and conduct controlled comparisons using synthetic data generated with expressive TTS models. Our results show that even limited fine-tuning on real speech from RO-N3WS yields substantial WER improvements over zero-shot baselines. We will release all models, scripts, and data splits to support reproducible research in multilingual ASR, domain adaptation, and lightweight deployment.
[NLP-48] Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
【Quick Read】: This chapter addresses the academic-integrity challenge posed by high-quality essays generated with large language models (LLMs): determining whether a submitted essay was AI-generated or AI-assisted. The key contribution is a systematic evaluation of how well existing detectors generalize to essays produced by different LLMs, based on essays generated in response to public GRE writing prompts, yielding practical guidance for developing, training, and retraining detectors so they remain applicable and reliable in real-world settings.
Link: https://arxiv.org/abs/2603.02353
Authors: Jiangang Hao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 2 figures
Abstract:Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.
[NLP-49] Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects
【Quick Read】: This paper studies whether diffusion language models (DLMs) memorize training data, a question previously unexplored because diffusion differs fundamentally from autoregressive generation dynamics. The key contribution is a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns, together with a proof (Theorem 4.3) of a monotonic relationship between sampling resolution and memorization leakage: the probability of exact training-data extraction strictly increases with sampling resolution, making autoregressive decoding the limiting case of diffusion generation at maximal resolution. Experiments validate these predictions and further show that, under aligned prefix-conditioned evaluation, DLMs leak substantially less personally identifiable information (PII) than autoregressive language models (ARMs).
Link: https://arxiv.org/abs/2603.02333
Authors: Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 9 figures
Abstract:Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that autoregressive decoding corresponds to a limiting case of diffusion-based generation by setting the sampling resolution maximal. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.
[NLP-50] Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry
【Quick Read】: This paper asks whether neural machine translation models learn language-universal conceptual representations or merely cluster languages by surface similarity. Through six experiments bridging NLP interpretability with cognitive-science theories of multilingual lexical organization, the authors probe the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer. Embedding distances correlate significantly with phylogenetic distances between languages (ρ = 0.13, p = 0.020), and frequently colexified concept pairs from the CLICS database are far more similar in embedding space than non-colexified pairs (U = 42656, p = 1.33 × 10⁻¹¹, d = 0.96), indicating that the model has internalized universal conceptual associations. Per-language mean-centering of embeddings improves the between-concept to within-concept distance ratio by a factor of 1.19, geometric evidence for a language-neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Finally, semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) are highly consistent across languages (mean cosine = 0.84), showing that second-order relational structure is preserved across typologically diverse languages. Together, these results indicate that NLLB-200 captures not only surface features of languages but also conceptual structure shared across them.
Link: https://arxiv.org/abs/2603.02258
Authors: Kyle Elliott Mathewson
Affiliations: University of Alberta; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 figures; code and interactive toolkit available at this https URL
Abstract:Do neural machine translation models learn language-universal conceptual representations, or do they merely cluster languages by surface similarity? We investigate this question by probing the representation geometry of Meta’s NLLB-200, a 200-language encoder-decoder Transformer, through six experiments that bridge NLP interpretability with cognitive science theories of multilingual lexical organization. Using the Swadesh core vocabulary list embedded across 135 languages, we find that the model’s embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program (ρ = 0.13, p = 0.020), demonstrating that NLLB-200 has implicitly learned the genealogical structure of human languages. We show that frequently colexified concept pairs from the CLICS database exhibit significantly higher embedding similarity than non-colexified pairs (U = 42656, p = 1.33 × 10⁻¹¹, d = 0.96), indicating that the model has internalized universal conceptual associations. Per-language mean-centering of embeddings improves the between-concept to within-concept distance ratio by a factor of 1.19, providing geometric evidence for a language-neutral conceptual store analogous to the anterior temporal lobe hub identified in bilingual neuroimaging. Semantic offset vectors between fundamental concept pairs (e.g., man to woman, big to small) show high cross-lingual consistency (mean cosine = 0.84), suggesting that second-order relational structure is preserved across typologically diverse languages. We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena, alongside a fully reproducible analysis pipeline.
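Two of the probes — per-language mean-centering and cross-lingual offset consistency — are easy to express directly; a toy sketch in which each "language" is the same concept matrix plus a language-specific shift, so offset consistency is exactly 1.0 by construction (real NLLB-200 embeddings would of course be noisier):

```python
import numpy as np

def mean_center_by_language(emb):
    """emb: dict lang -> (n_concepts, d) matrix with rows aligned across
    languages. Subtracting each language's mean removes language-identity
    components, leaving a more language-neutral concept geometry."""
    return {lang: X - X.mean(axis=0, keepdims=True) for lang, X in emb.items()}

def offset_consistency(emb, i, j):
    """Mean pairwise cosine between concept-i -> concept-j offset vectors
    across languages (e.g. 'man' -> 'woman')."""
    offsets = np.stack([X[j] - X[i] for X in emb.values()])
    offsets /= np.linalg.norm(offsets, axis=1, keepdims=True)
    sims = offsets @ offsets.T
    n = len(offsets)
    return (sims.sum() - n) / (n * (n - 1))   # average of off-diagonal cosines

# Toy embeddings: shared concept vectors plus a per-language shift vector.
rng = np.random.default_rng(2)
concepts = rng.normal(size=(5, 16))
emb = {lang: concepts + rng.normal(size=16) for lang in ("eng", "fra", "deu")}
centered = mean_center_by_language(emb)
print(round(offset_consistency(emb, 0, 1), 3))
```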
[NLP-51] Safety Training Persists Through Helpfulness Optimization in LLM Agents
【Quick Read】: This paper studies LLM safety in multi-step, tool-use ("agentic") settings, where safety means avoiding harmful actions taken directly by the model rather than merely refusing harmful requests in single-turn chat. Running direct preference optimization (DPO) on safety and helpfulness separately, sequentially, or simultaneously, the authors find that safety training persists through subsequent helpfulness training, and that all training configurations end up near a linear Pareto frontier (R² = 0.77). Even training on both metrics at once yields just another point on that frontier rather than a "best of both worlds" strategy, underscoring the need for a deeper understanding of post-training dynamics.
Link: https://arxiv.org/abs/2603.02229
Authors: Benjamin Plaut
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Under submission
Abstract:Safety post-training has been studied extensively in single-step “chat” settings where safety typically refers to refusing harmful requests. We study an “agentic” (i.e., multi-step, tool-use) setting where safety refers to harmful actions directly taken by the LLM. We compare the effects of running direct preference optimization (DPO) on safety or helpfulness alone vs both metrics sequentially. As expected, training on one metric alone results in an extreme point along this frontier. However, unlike prior work, we find that safety training persists through subsequent helpfulness training. We also find that all training configurations end up near a linear Pareto frontier with R² = 0.77. Even post-training on both metrics simultaneously simply results in another point on the frontier rather than finding a “best of both worlds” strategy, despite the presence of such strategies in our DPO dataset. Overall, our findings underscore the need for better understanding of post-training dynamics.
[NLP-52] Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
【Quick Read】: This paper diagnoses "routing absorption" in end-to-end training of sparse attention: the model's Q/K/V projections co-adapt to whatever sparse mask is imposed, absorbing the routing signal so that learned gates perform little better than frozen random gates and fail to identify the attention entries that truly matter. The key implication is that post-hoc sparsification, which decouples representation learning from sparsification, sidesteps this co-adaptation between the gate and the Q/K/V parameters, enabling efficient and stable sparse attention.
Link: https://arxiv.org/abs/2603.02227
Authors: Keston Aquino-Michaels
Affiliations: No Way Labs
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 4 figures
Abstract:Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model’s Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against oracle masks but catastrophic perplexity when deployed (601.6 vs. 48.6 on mask-agnostic Q/K/V); and (4) stochastic mask randomization during training fails to prevent co-adaptation (78.2 ppl deployed dense vs. 37.3 baseline). We connect routing absorption to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, but show that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE, where experts are self-contained modules. The implication is that end-to-end sparse attention methods employing per-query token-level gating face absorption pressure proportional to the parameter asymmetry between the gate and the model, and that post-hoc approaches, which decouple representation learning from sparsification, sidestep this entirely.
[NLP-53] Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
【Quick Read】: This paper addresses why many self-evolving LLM loops — which are in essence self-play — plateau quickly: the loop synthesizes more data without increasing the learnable information available to the next iteration. The key solution is a self-synthesized data pipeline whose learnable information grows across iterations, designed from a triadic-roles perspective: the Proposer generates tasks, the Solver attempts solutions, and the Verifier provides training signals. Three system designs jointly target learnable information gain: asymmetric co-evolution closes the weak-to-strong-to-weak loop across roles, capacity growth matches rising learnable information, and proactive information seeking brings in external context and new task sources to prevent saturation — together providing a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.
Link: https://arxiv.org/abs/2603.02218
Authors: Wei Liu, Siya Qi, Yali Du, Yulan He
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: 10 pages, 6 figures, 7 formulas
Abstract:Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.
[NLP-54] A Zipf-preserving long-range correlated surrogate for written language and other symbolic sequences
【Quick Read】: This paper addresses the limitation that existing surrogate models for symbolic sequences preserve either the symbol-frequency distribution or the long-range correlations, but not both, hampering structural analysis of complex systems. The key solution maps fractional Gaussian noise (FGN) onto the sequence's empirical histogram via a frequency-preserving assignment, yielding surrogates that retain the original symbol frequencies (first-order statistics) while reproducing the long-range correlation structure quantified by the detrended fluctuation analysis (DFA) exponent. By cleanly separating short-range dependencies from long-memory effects, the method provides a controlled tool for dissecting the structure of language, genomes, and other symbolic systems and for testing hypotheses about the origin of scaling laws.
Link: https://arxiv.org/abs/2603.02213
Authors: Marcelo A. Montemurro, Mirko Degli Esposti
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Genomics (q-bio.GN)
Comments:
Abstract:Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf’s law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies. We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains. Journal reference: Physica A 683 (2026) 131227. https://doi.org/10.1016/j.physa.2025.131227
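The frequency-preserving assignment can be sketched as follows; note the correlated series here uses a simple 1/f^beta spectral filter as a stand-in for proper fractional Gaussian noise, and the contiguous rank-band assignment is one plausible reading of the mapping, not the paper's exact construction:

```python
import numpy as np
from collections import Counter

def correlated_gaussian(n, beta=0.6, rng=None):
    """Stand-in for FGN: shape white noise in the frequency domain with a
    1/f^beta power-law spectrum (an assumed simplification)."""
    rng = rng if rng is not None else np.random.default_rng()
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                                   # avoid division by zero at DC
    spectrum = rng.normal(size=f.size) + 1j * rng.normal(size=f.size)
    return np.fft.irfft(spectrum / f ** (beta / 2), n=n)

def surrogate(sequence, beta=0.6, rng=None):
    """Zipf-preserving surrogate: rank the correlated series, then assign
    symbols so that each symbol keeps its exact empirical count, placed at
    contiguous rank bands of the series (frequency-preserving mapping)."""
    n = len(sequence)
    order = np.argsort(correlated_gaussian(n, beta, rng))
    out = np.empty(n, dtype=object)
    pos = 0
    for sym, count in Counter(sequence).most_common():
        out[order[pos:pos + count]] = sym          # exact count preserved
        pos += count
    return list(out)

text = list("abracadabra_abracadabra")
surr = surrogate(text, rng=np.random.default_rng(3))
print(Counter(text) == Counter(surr))  # True: symbol frequencies preserved exactly
```

Because the assignment only permutes which positions carry each symbol, first-order statistics are preserved by construction, while the ordering inherits the series' long-range structure.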
[NLP-55] ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost ICLR2025
【Quick Read】: This paper targets the high cost and inefficiency of the post-training stage of large language models (LLMs): dependence on high-quality data, overfitting risk, and the heavy compute of repeating post-training and evaluation after every base-model update. The key solution, ParamΔ, transfers post-training capabilities at zero additional training cost by computing the difference between post-trained weights (Θ_post) and base weights (Θ_base) and adding it to the updated base weights (Θ'_base), i.e. Θ_ParamΔ = Θ_post − Θ_base + Θ'_base. Experiments on Llama3, Llama3.1, Qwen, and DeepSeek-distilled models show the ParamΔ model effectively replicates traditional post-training, reaching about 95% of the post-trained model's performance on average, offering the open-weight community a cost-free knowledge-transfer framework that markedly accelerates model development cycles on frequently updated base models.
Link: https://arxiv.org/abs/2504.21023
Authors: Sheng Cao, Mingrui Wu, Karthik Prasad, Yuandong Tian, Zechun Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published as a conference paper at ICLR 2025
Abstract:The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces ParamΔ, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights (Θ_post) and base model weights (Θ_base), and adding this to the updated base model (Θ'_base), we define the ParamΔ model as: Θ_ParamΔ = Θ_post − Θ_base + Θ'_base. This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We did analysis on LLama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate the ParamΔ model effectively replicates traditional post-training. For example, the ParamΔ model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models attains approximately 95% of the Llama3.1-inst model’s performance on average. ParamΔ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.
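The weight-mixing rule is a single tensor-wise operation; a toy sketch over NumPy "state dicts" (real checkpoints would iterate over matching parameter tensors in exactly the same way):

```python
import numpy as np

def param_delta(theta_post, theta_base, theta_base_new):
    """ParamDelta weight mixing: Theta_post - Theta_base + Theta_base_new,
    applied independently to every parameter tensor, transferring the
    post-training delta onto the updated base at zero training cost."""
    return {
        name: theta_post[name] - theta_base[name] + theta_base_new[name]
        for name in theta_base
    }

# Toy state dicts standing in for model checkpoints.
base = {"w": np.array([1.0, 2.0])}
post = {"w": np.array([1.5, 2.5])}       # base plus a post-training delta of +0.5
base_new = {"w": np.array([2.0, 3.0])}   # updated base model

merged = param_delta(post, base, base_new)
print(merged["w"])  # [2.5 3.5]: the +0.5 delta carried onto the new base
```

The operation assumes the three checkpoints share the same architecture and parameter names, which is why it applies so naturally to base/instruct pairs in the open-weight community.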
[NLP-56] Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
【速读】: 该论文旨在解决自监督学习(Self-Supervised Learning, SSL)语音模型中,个体特征维度是否编码了具体语音属性(如说话人信息)的问题。以往研究多关注信息在不同网络层的分布,而忽视了单个维度上的语义表征能力。本文的关键解决方案是采用主成分分析(Principal Component Analysis, PCA)对话语平均表示进行降维,发现第一主成分主要编码基频(pitch)及其相关特征(如性别),其他主成分则分别对应响度、噪声水平、第二共振峰(formant)及高频特性等。进一步的语音合成实验表明,通过调整这些主成分对应的维度,可有效控制合成语音的相应声学特征,从而为语音合成中的可控性提供了简单且有效的手段。
链接: https://arxiv.org/abs/2603.03096
作者: Kyle Janse van Rensburg,Benjamin van Niekerk,Herman Kamper
机构: Stellenbosch University (斯泰伦博斯大学); Concordia University (康考迪亚大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 5 pages, 7 figures, submitted to IEEE Signal Processing Letters
点击查看摘要
Abstract:How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. Using WavLM, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. Finally, in synthesis experiments we show that most characteristics can be controlled by changing the corresponding dimensions. This provides a simple method to control characteristics of the output voice in synthesis applications.
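摘要中"对句级平均表示做 PCA、发现主维度编码音高"的分析流程,可用纯 NumPy 粗略复现。下面的数据为人造模拟:我们假设某个一维"音高"因素沿固定方向主导表示的方差,维度数与噪声水平均为示例假设。

```python
import numpy as np

rng = np.random.default_rng(0)
# 模拟 200 条语音的句级平均表示(32 维), 人为混入一个强"音高"方向
pitch = rng.normal(0.0, 3.0, size=(200, 1))          # 模拟的基频变化
direction = rng.normal(size=(1, 32))
direction /= np.linalg.norm(direction)
X = pitch @ direction + rng.normal(0.0, 0.3, size=(200, 32))

# PCA: 去均值后做 SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]                  # 样本在第一主成分上的投影
explained = S ** 2 / np.sum(S ** 2)      # 各主成分解释的方差比例

# 若某一声学属性主导方差, 它会与第一主成分高度相关
corr = abs(np.corrcoef(pc1_scores, pitch[:, 0])[0, 1])
```

在这种构造下,第一主成分的投影与模拟"音高"几乎完全相关,对应论文中 WavLM 表示的观察。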
信息检索
[IR-0] The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment
【速读】:该论文旨在解决学术数据在多个孤立数据库中分散存储、元数据不一致且缺乏关联性的问题,从而阻碍了跨源的统一分析与研究。其解决方案的关键在于构建一个名为“Science Data Lake”的本地可部署基础设施,基于DuckDB和Parquet文件格式,通过DOI标准化整合八个开放数据源(如Semantic Scholar、OpenAlex等),同时保留各来源的原始Schema;并通过基于BGE-large句子嵌入的语义对齐方法,将OpenAlex的4,516个主题映射到13个科学本体(约130万术语),实现高精度的知识图谱融合(F1=0.77,阈值≥0.85),显著优于TF-IDF、BM25和Jaro-Winkler等基线方法,为多源异构科研数据的统一建模与智能分析提供了可扩展、可验证的基础资源。
链接: https://arxiv.org/abs/2603.03126
作者: Jonas Wilinski
机构: Hamburg University of Technology (TUHH)
类目: Digital Libraries (cs.DL); Databases (cs.DB); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注: 18 pages, 8 figures, 7 tables. Dataset DOI: https://doi.org/10.57967/hf/7850 . Code: this https URL
点击查看摘要
Abstract:Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas. The resource comprises approximately 960GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics (\geq 0.65 threshold) with F1 = 0.77 at the recommended \geq 0.85 operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson r = 0.76-0.87), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
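其中"基于句向量余弦相似度 + 阈值"的本体对齐步骤可以抽象成如下示意:对每个 topic 取相似度最高的本体术语,低于阈值则不产生映射。示例中的向量为虚构数据,真实系统使用 BGE-large 句嵌入与 0.85 的推荐阈值。

```python
import numpy as np

def align_topics(topic_vecs, term_vecs, threshold=0.85):
    """为每个 topic 选出余弦相似度最高的本体术语; 低于阈值则不映射 (None)。"""
    t = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    o = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    sims = t @ o.T                       # 所有 topic-术语对的余弦相似度
    best = sims.argmax(axis=1)
    return [(int(j), float(sims[i, j])) if sims[i, j] >= threshold
            else (None, float(sims[i, j]))
            for i, j in enumerate(best)]

topics = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.0]])   # 虚构的 topic 嵌入
terms = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])    # 虚构的本体术语嵌入
mappings = align_topics(topics, terms)
# topic0 与 term0 几乎同向 -> 产生映射; topic1 与两个术语都不相似 -> None
```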
[IR-1] Proactive Guiding Strategy for Item-side Fairness in Interactive Recommendation
【速读】:该论文旨在解决交互式推荐系统中长尾物品(long-tail items)公平曝光不足的问题。现有方法通过直接将长尾物品纳入推荐结果来提升其曝光,但这种做法导致用户偏好与推荐内容之间产生错位,进而削弱长期用户参与度和推荐效果。解决方案的关键在于提出一种主动的公平引导策略——HRL4PFG,该框架基于分层强化学习(hierarchical reinforcement learning, HRL),在宏观层面根据多步反馈生成公平引导目标,在微观层面实时调整推荐内容以兼顾目标导向与用户偏好演化,从而实现对长尾物品的渐进式引导,同时维持用户满意度。
链接: https://arxiv.org/abs/2603.03094
作者: Chongjun Xia,Xiaoyu Shi,Hong Xie,Xianzhi Wang,Yun Lu,Mingsheng Shang
机构: Chinese Academy of Sciences (中国科学院); University of Technology Sydney (悉尼科技大学); University of Science and Technology of China (中国科学技术大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Item-side fairness is crucial for ensuring the fair exposure of long-tail items in interactive recommender systems. Existing approaches promote the exposure of long-tail items by directly incorporating them into recommended results. This causes misalignment between user preferences and the recommended long-tail items, which hinders long-term user engagement and reduces the effectiveness of recommendations. We aim for a proactive fairness-guiding strategy, which actively guides user preferences toward long-tail items while preserving user satisfaction during the interactive recommendation process. To this end, we propose HRL4PFG, an interactive recommendation framework that leverages hierarchical reinforcement learning to guide user preferences toward long-tail items progressively. HRL4PFG operates through a macro-level process that generates fairness-guided targets based on multi-step feedback, and a micro-level process that fine-tunes recommendations in real time according to both these targets and evolving user preferences. Extensive experiments show that HRL4PFG improves cumulative interaction rewards and maximum user interaction length by a larger margin when compared with state-of-the-art methods in interactive recommendation environments.
[IR-2] Reproducing and Comparing Distillation Techniques for Cross-Encoders
【速读】:该论文旨在解决当前信息检索(Information Retrieval, IR)领域中关于交叉编码器(cross-encoder)训练策略的不明确性问题,特别是知识蒸馏(knowledge distillation)与监督学习目标选择对模型性能的影响缺乏系统性比较。此前研究虽表明通过合适策略可使传统交叉编码器达到大语言模型(LLM)重排序器的效果,但未在统一实验环境下对比不同教师模型(如LLM或交叉编码器集成)和多种损失函数的有效性,且未涵盖BERT之后的主流预训练模型架构(如RoBERTa、ELECTRA、DeBERTa-v3及ModernBERT)。解决方案的关键在于:在受控环境中,系统地复现基于LLM的蒸馏策略,并将其与基于交叉编码器集成教师的蒸馏方法及其他监督目标(如点对点、成对和列表级损失)进行公平比较,结果表明强调相对排序关系的损失函数(如成对MarginMSE和列表级InfoNCE)在所有骨干网络和评估场景下均显著优于点对点基线,且其优化效果可媲美模型架构规模扩展带来的收益。
链接: https://arxiv.org/abs/2603.03010
作者: Victor Morand,Mathias Vast,Basile Van Cooten,Laure Soulier,Josiane Mothe,Benjamin Piwowarski
机构: Sorbonne Université, CNRS, ISIR (索邦大学,法国国家科学研究中心,机器人与智能系统研究所); Sinequa by ChapsVision (Sinequa by ChapsVision); University of Toulouse, IRIT (图卢兹大学,信息与推理技术研究所)
类目: Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Recent advances in Information Retrieval have established transformer-based cross-encoders as a keystone in IR. Recent studies have focused on knowledge distillation and showed that, with the right strategy, traditional cross-encoders could reach the level of effectiveness of LLM re-rankers. Yet, comparisons with previous training strategies, including distillation from strong cross-encoder teachers, remain unclear. In addition, few studies cover a similar range of backbone encoders, while substantial improvements have been made in this area since BERT. This lack of comprehensive studies in controlled environments makes it difficult to identify robust design choices. In this work, we reproduce Schlatt et al.'s (2025) LLM-based distillation strategy and compare it to Hofstätter et al.'s (2020) approach based on an ensemble of cross-encoder teachers, as well as other supervised objectives, to fine-tune a large range of cross-encoders, from the original BERT and its follow-ups RoBERTa, ELECTRA and DeBERTa-v3, to the more recent ModernBERT. We evaluate all models on both in-domain (TREC-DL and MS MARCO dev) and out-of-domain datasets (BEIR, LoTTE, and Robust04). Our results show that objectives emphasizing relative comparisons – pairwise MarginMSE and listwise InfoNCE – consistently outperform pointwise baselines across all backbones and evaluation settings, and that objective choice can yield gains comparable to scaling the backbone architecture.
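文中强调"相对比较"的 pairwise MarginMSE 目标,其核心是让学生模型的正负文档分差逼近教师分差,而不监督绝对分数。下面是一个极简 NumPy 示意,三元组分数均为虚构示例:

```python
import numpy as np

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """MarginMSE: 对 (查询, 正文档, 负文档) 三元组, 用 MSE 监督学生的分差。"""
    margin_student = student_pos - student_neg
    margin_teacher = teacher_pos - teacher_neg
    return float(np.mean((margin_student - margin_teacher) ** 2))

t_pos, t_neg = np.array([3.0, 2.0]), np.array([1.0, 0.5])   # 教师分数(虚构)
# 学生分数整体平移不改变分差, 损失为 0: 该目标只关心相对排序关系
loss_perfect = margin_mse(t_pos - 1.0, t_neg - 1.0, t_pos, t_neg)
# 学生把正负文档分差压得过小时, 损失变大
loss_off = margin_mse(np.array([2.0, 2.0]), np.array([1.5, 1.5]), t_pos, t_neg)
```

对分差而非绝对分数做蒸馏,使不同量纲的教师(LLM 或交叉编码器集成)都能用同一目标训练学生。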
[IR-3] OneRanker: Unified Generation and Ranking with One Model in Industrial Advertising Recommendation
【速读】:该论文旨在解决生成式广告推荐系统在实际部署中面临的三大核心问题:兴趣目标与业务价值之间的错位、生成过程的无目标性(target-agnostic)以及生成与排序阶段的脱节。现有方案往往陷入单阶段融合导致优化冲突或阶段解耦造成不可逆信息损失的两难境地。其解决方案的关键在于提出OneRanker架构,实现生成与排序的架构级深度融合:首先设计基于任务token序列和因果掩码的价值感知多任务解耦结构,在共享表示空间内分离兴趣覆盖与价值优化区域,缓解目标冲突;其次构建粗粒度到细粒度的协同目标感知机制,利用Fake Item Tokens实现生成阶段的隐式目标意识,结合排序解码器完成候选集层面的显式价值对齐;最后通过输入输出双侧一致性保障,引入Key/Value传递机制与分布一致性(Distribution Consistency, DC)约束损失,实现生成与排序端到端协同优化。
链接: https://arxiv.org/abs/2603.02999
作者: Dekai Sun,Yiming Liu,Jiafan Zhou,Xun Liu,Chenchen Yu,Yi Li,Huan Yu,Jun Zhang
机构: Tencent Inc.(腾讯公司)
类目: Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:The end-to-end generative paradigm is revolutionizing advertising recommendation systems, driving a shift from traditional cascaded architectures towards unified modeling. However, practical deployment faces three core challenges: the misalignment between interest objectives and business value, the target-agnostic limitation of generative processes, and the disconnection between generation and ranking stages. Existing solutions often fall into a dilemma where single-stage fusion induces optimization tension, while stage decoupling causes irreversible information loss. To address this, we propose OneRanker, achieving architectural-level deep integration of generation and ranking. First, we design a value-aware multi-task decoupling architecture. By leveraging task token sequences and causal mask, we separate interest coverage and value optimization spaces within shared representations, effectively alleviating target conflicts. Second, we construct a coarse-to-fine collaborative target awareness mechanism, utilizing Fake Item Tokens for implicit awareness during generation and a ranking decoder for explicit value alignment at the candidate level. Finally, we propose input-output dual-side consistency guarantees. Through Key/Value pass-through mechanisms and Distribution Consistency (DC) Constraint Loss, we achieve end-to-end collaborative optimization between generation and ranking. The full deployment on Tencent’s WeiXin channels advertising system has shown a significant improvement in key business metrics (GMV - Normal +1.34%), providing a new paradigm with industrial feasibility for generative advertising recommendations.
[IR-4] Timehash: Hierarchical Time Indexing for Efficient Business Hours Search VLDB2026
【速读】:该论文旨在解决大规模搜索系统中时间范围过滤(temporal range filtering)的效率与存储成本之间的矛盾问题,尤其针对基于位置的服务中按营业时间筛选商户的需求。传统方法要么因粗粒度索引导致查询精度下降(scope filtering),要么因细粒度分钟级索引引发索引空间爆炸(minute-level indexing)。其解决方案的核心是提出一种名为Timehash的分层时间索引算法,通过灵活的多分辨率策略构建可定制层级结构,在保持100%精确性的同时实现索引规模的显著压缩——实验证明其相比分钟级索引减少99.1%的索引项(平均每个文档仅5.6个索引项),且支持复杂场景如休息时段和不规则排班,同时具备从10万到1260万POI的线性可扩展性。
链接: https://arxiv.org/abs/2603.02941
作者: Jinoh Kim,Jaewon Son
机构: Naver Corporation(纳维亚公司)
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注: 12 pages, 2 figures, 8 tables. Submitted to VLDB 2026 Industry Track
点击查看摘要
Abstract:Temporal range filtering is a critical operation in large-scale search systems, particularly for location-based services that need to filter businesses by operating hours. Traditional approaches either suffer from poor query performance (scope filtering) or index size explosion (minute-level indexing). We present Timehash, a novel hierarchical time indexing algorithm that achieves over 99% reduction in index size compared to minute-level indexing while maintaining 100% precision. Timehash employs a flexible multi-resolution strategy with customizable hierarchical levels. Through empirical analysis on distributions from 12.6 million business records of a production location search service, we demonstrate a data-driven methodology for selecting optimal hierarchies tailored to specific data distributions. We evaluated Timehash on up to 12.6 million synthetic POIs generated from production distributions. Experimental results show that a five-level hierarchy reduces index terms to 5.6 per document (99.1% reduction versus minute-level indexing), with zero false positives and zero false negatives. Scalability benchmarks confirm constant per-document cost from 100K to 12.6M POIs, while supporting complex scenarios such as break times and irregular schedules. Our approach is generalizable to various temporal filtering problems in search systems, e-commerce, and reservation platforms.
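"整块区间用粗粒度索引项、边界处退化为细粒度项"的思路,可以用一个两层(小时/分钟)的简化版示意。论文实际采用五层层级,此处的层级数与编码格式均为示例假设:

```python
def interval_tokens(start_min, end_min):
    """把一天内的营业区间 [start, end) 编码为小时级 + 分钟级两层索引项。"""
    tokens = []
    for h in range(24):
        h0, h1 = h * 60, (h + 1) * 60
        if start_min <= h0 and h1 <= end_min:
            tokens.append(f"H{h}")        # 整小时被覆盖: 一个粗粒度项即可
        else:
            for m in range(max(h0, start_min), min(h1, end_min)):
                tokens.append(f"M{m}")    # 边界小时退化为分钟级项
    return tokens

def query_tokens(minute):
    """查询某一时刻: 同时探测它所在的小时项与分钟项。"""
    return {f"H{minute // 60}", f"M{minute}"}

open_tokens = set(interval_tokens(9 * 60 + 30, 18 * 60))   # 营业时间 09:30-18:00
open_at_10 = bool(query_tokens(10 * 60) & open_tokens)     # 10:00 营业
open_at_9 = bool(query_tokens(9 * 60) & open_tokens)       # 09:00 未营业
n_terms = len(open_tokens)  # 38 项, 远少于逐分钟索引的 510 项
```

即便只有两层,索引项也从 510 压缩到 38;层级越多、粒度划分越贴合数据分布,压缩率越接近论文报告的 99.1%。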
[IR-5] Model Editing for New Document Integration in Generative Information Retrieval WWW
【速读】:该论文旨在解决生成式检索(Generative Retrieval, GR)模型在面对新增文档时泛化能力差的问题,即模型难以正确生成新文档的标识符(docIDs)。现有方法如增量训练虽可缓解此问题,但存在计算开销大、资源消耗高及灾难性遗忘等局限。解决方案的关键在于识别并优化模型中影响docID映射的核心层,并通过一种名为DOME(docID-oriented model editing)的方法实现高效、精准的参数调整。DOME的核心创新是采用混合标签自适应训练策略,结合软标签保留查询语义差异以区分编辑向量,以及硬标签确保精确的映射修改,从而在不显著增加计算成本的前提下,大幅提升GR模型对未见文档的适应能力。
链接: https://arxiv.org/abs/2603.02773
作者: Zhen Zhang,Zihan Wang,Xinyu Ma,Shuaiqiang Wang,Dawei Yin,Xin Xin,Pengjie Ren,Maarten de Rijke,Zhaochun Ren
机构: Shandong University (山东大学); University of Amsterdam (阿姆斯特丹大学); Baidu Inc. (百度公司); Leiden University (莱顿大学)
类目: Information Retrieval (cs.IR)
备注: Accepted to The Web Conference (WWW) 2026
点击查看摘要
Abstract:Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs). Despite its promise, existing GR models exhibit poor generalization to newly added documents, often failing to generate the correct docIDs. While incremental training offers a straightforward remedy, it is computationally expensive, resource-intensive, and prone to catastrophic forgetting, thereby limiting the scalability and practicality of GR. In this paper, we identify the core bottleneck as the decoder’s ability to map hidden states to the correct docIDs of newly added documents. Model editing, which enables targeted parameter modifications for docID mapping, represents a promising solution. However, applying model editing to current GR models is not trivial, which is severely hindered by indistinguishable edit vectors across queries, due to the high overlap of shared docIDs in retrieval results. To address this, we propose DOME (docID-oriented model editing), a novel method that effectively and efficiently adapts GR models to unseen documents. DOME comprises three stages: (1) identification of critical layers, (2) optimization of edit vectors, and (3) construction and application of updates. At its core, DOME employs a hybrid-label adaptive training strategy that learns discriminative edit vectors by combining soft labels, which preserve query-specific semantics for distinguishable updates, with hard labels that enforce precise mapping modifications. Experiments on widely used benchmarks, including NQ and MS MARCO, show that our method significantly improves retrieval performance on new documents while maintaining effectiveness on the original collection. Moreover, DOME achieves this with only about 60% of the training time required by incremental training, considerably reducing computational cost and enabling efficient, frequent model updates.
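其中"软标签保留查询间差异、硬标签保证精确映射"的混合训练目标,可粗略写成两个交叉熵的加权和。混合权重 alpha 与具体实现细节为本示意的假设,并非论文给出的公式:

```python
import numpy as np

def hybrid_label_loss(logits, hard_idx, soft_labels, alpha=0.5):
    """混合标签目标: 硬标签交叉熵强制正确 docID 的映射,
    软标签交叉熵保留查询特定的分布差异, 使编辑向量可区分。"""
    z = logits - logits.max()            # 数值稳定的 softmax
    p = np.exp(z) / np.exp(z).sum()
    hard_ce = -np.log(p[hard_idx])
    soft_ce = -np.sum(soft_labels * np.log(p))
    return float(alpha * hard_ce + (1 - alpha) * soft_ce)

logits = np.array([2.0, 0.5, 0.1])       # 模型在 3 个候选 docID 上的打分(虚构)
soft = np.array([0.7, 0.2, 0.1])         # 查询特定的软分布(虚构)
loss = hybrid_label_loss(logits, hard_idx=0, soft_labels=soft)
```

当软标签退化为 one-hot 时,该目标退化为普通交叉熵;软标签的作用正是在不同查询间引入可区分的监督信号。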
[IR-6] APAO: Adaptive Prefix-Aware Optimization for Generative Recommendation
【速读】:该论文旨在解决生成式推荐(Generative Recommendation)中训练与推理阶段的不一致性问题。现有方法通常使用基于token级别的似然目标(如交叉熵损失)进行训练,但在推理时采用多步束搜索(beam search)生成候选物品序列,导致模型在训练时假设始终能访问真实历史序列,而束搜索会因低概率前缀被剪枝,从而提前丢弃正确物品。为缓解此问题,作者提出自适应前缀感知优化(Adaptive Prefix-Aware Optimization, APAO)框架,其核心在于引入前缀级别的优化损失,使训练目标更贴近束搜索的推理机制;进一步设计了自适应最差前缀优化策略,动态聚焦于最易被剪枝的脆弱前缀,提升模型在束搜索约束下保留正确候选的能力。理论分析与大量实验验证了该方法的有效性与通用性。
链接: https://arxiv.org/abs/2603.02730
作者: Yuanqing Yu,Yifan Wang,Weizhi Ma,Zhiqiang Guo,Min Zhang
机构: Tsinghua University (清华大学)
类目: Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Generative recommendation has recently emerged as a promising paradigm in sequential recommendation. It formulates the task as an autoregressive generation process, predicting discrete tokens of the next item conditioned on user interaction histories. Existing generative recommendation models are typically trained with token-level likelihood objectives, such as cross-entropy loss, while employing multi-step beam search during inference to generate ranked item candidates. However, this leads to a fundamental training-inference inconsistency: standard training assumes ground-truth history is always available, ignoring the fact that beam search prunes low-probability branches during inference. Consequently, the correct item may be prematurely discarded simply because its initial tokens (prefixes) have low scores. To address this issue, we propose the Adaptive Prefix-Aware Optimization (APAO) framework, which introduces prefix-level optimization losses to better align the training objective with the inference setting. Furthermore, we design an adaptive worst-prefix optimization strategy that dynamically focuses on the most vulnerable prefixes during training, thereby enhancing the model’s ability to retain correct candidates under beam search constraints. We provide theoretical analyses to demonstrate the effectiveness and efficiency of our framework. Extensive experiments on multiple datasets further show that APAO consistently alleviates the training-inference inconsistency and improves performance across various generative recommendation backbones. Our codes are publicly available at this https URL.
[IR-7] S2CDR: Smoothing-Sharpening Process Model for Cross-Domain Recommendation WWW’2026
【速读】:该论文旨在解决推荐系统中的用户冷启动(user cold-start)问题,尤其是现有基于扩散模型(diffusion models, DMs)的跨域推荐(cross-domain recommendation, CDR)方法因忽略源域与目标域间物品关联性、且在前向扩散过程中引入高斯噪声导致用户个性化偏好受损的问题。解决方案的关键在于提出一种新颖的平滑-锐化过程模型(Smoothing-Sharpening Process Model for CDR, S2CDR),其核心是构建一个基于常微分方程(ODE)求解的腐蚀-恢复架构:平滑过程通过在物品-物品相似图上引入热方程(heat equation)以捕捉跨域物品相关性,并设计低通滤波器(low-pass filter)去除高频噪声,从而提取用户内在偏好;锐化过程则迭代重构冷启动用户的未知交互行为,实现无需训练即可有效迁移用户偏好。
链接: https://arxiv.org/abs/2603.02725
作者: Xiaodong Li,Juwei Yue,Xinghua Zhang,Jiawei Sheng,Wenyuan Zhang,Taoyu Su,Zefeng Zhang,Tingwen Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, UCAS (中国科学院大学网络空间安全学院); Alibaba Group (阿里巴巴集团)
类目: Information Retrieval (cs.IR)
备注: This paper is accepted by WWW’2026
点击查看摘要
Abstract:User cold-start problem is a long-standing challenge in recommendation systems. Fortunately, cross-domain recommendation (CDR) has emerged as a highly effective remedy for the user cold-start challenge, with recently developed diffusion models (DMs) demonstrating exceptional performance. However, these DMs-based CDR methods focus on dealing with user-item interactions, overlooking correlations between items across the source and target domains. Meanwhile, the Gaussian noise added in the forward process of diffusion models would hurt user’s personalized preference, leading to the difficulty in transferring user preference across domains. To this end, we propose a novel paradigm of Smoothing-Sharpening Process Model for CDR to cold-start users, termed as S2CDR which features a corruption-recovery architecture and is solved with respect to ordinary differential equations (ODEs). Specifically, the smoothing process gradually corrupts the original user-item/item-item interaction matrices derived from both domains into smoothed preference signals in a noise-free manner, and the sharpening process iteratively sharpens the preference signals to recover the unknown interactions for cold-start users. Wherein, for the smoothing process, we introduce the heat equation on the item-item similarity graph to better capture the correlations between items across domains, and further build the tailor-designed low-pass filter to filter out the high-frequency noise information for capturing user’s intrinsic preference, in accordance with the graph signal processing (GSP) theory. Extensive experiments on three real-world CDR scenarios confirm that our S2CDR significantly outperforms previous SOTA methods in a training-free manner.
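"图上热方程等价于低通滤波"的直觉可以用离散热扩散来示意:常量(低频)信号在扩散下保持不变,而在相邻节点间来回翻转的高频信号被迅速衰减。图结构、步长与步数均为示例假设,并非论文的具体滤波器设计:

```python
import numpy as np

def heat_smooth(A, x, tau=0.3, steps=5):
    """离散热扩散: x <- x + tau * (A_rw - I) x, 其中 A_rw 为行归一化邻接矩阵。"""
    A_rw = A / A.sum(axis=1, keepdims=True)
    for _ in range(steps):
        x = x + tau * (A_rw @ x - x)
    return x

# 4 节点环形相似图
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
low = heat_smooth(A, np.ones(4))                          # 低频分量: 保持不变
high = heat_smooth(A, np.array([1.0, -1.0, 1.0, -1.0]))   # 高频分量: 每步乘以 0.4
```

高频(噪声)分量按指数衰减、低频(内在偏好)分量被保留,这正是摘要中"过滤高频噪声以捕捉用户内在偏好"的图信号处理含义。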
[IR-8] AlphaFree: Recommendation Free from Users IDs and GNNs WWW
【速读】:该论文旨在解决当前推荐系统中对用户嵌入(user embeddings)、原始ID特征(raw IDs)和图神经网络(GNNs)的强依赖所引发的问题,包括高内存开销、冷启动与过平滑(over-smoothing)现象以及在未见交互场景下的泛化能力不足。其解决方案的关键在于提出AlphaFree框架,实现“三无”设计:通过动态推断偏好而无需存储用户嵌入(user-free),用预训练语言模型生成的语言表示(language representations, LRs)替代原始ID以降低偏差并增强语义理解(ID-free),并通过相似物品增强与对比学习捕获协同信号,从而摆脱对GNN的依赖(GNN-free)。实验证明该方法在多个真实数据集上显著优于现有方法,并大幅降低GPU内存占用。
链接: https://arxiv.org/abs/2603.02653
作者: Minseo Jeon,Junwoo Jung,Daewon Gwak,Jinhong Jung
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 13 pages, The Web Conference (WWW) 2026
点击查看摘要
Abstract:Can we design effective recommender systems free from users, IDs, and GNNs? Recommender systems are central to personalized content delivery across domains, with top-K item recommendation being a fundamental task to retrieve the most relevant items from historical interactions. Existing methods rely on entrenched design conventions, often adopted without reconsideration, such as storing per-user embeddings (user-dependent), initializing features from raw IDs (ID-dependent), and employing graph neural networks (GNN-dependent). These dependencies incur several limitations, including high memory costs, cold-start and over-smoothing issues, and poor generalization to unseen interactions. In this work, we propose AlphaFree, a novel recommendation method free from users, IDs, and GNNs. Our main ideas are to infer preferences on-the-fly without user embeddings (user-free), replace raw IDs with language representations (LRs) from pre-trained language models (ID-free), and capture collaborative signals through augmentation with similar items and contrastive learning, without GNNs (GNN-free). Extensive experiments on various real-world datasets show that AlphaFree consistently outperforms its competitors, achieving up to around 40% improvements over non-LR-based methods and up to 5.7% improvements over LR-based methods, while significantly reducing GPU memory usage by up to 69% under high-dimensional LRs.
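其中用于捕获协同信号的对比学习,核心是 InfoNCE 式目标:拉近物品与其相似物品(正例)的表示、推远随机负例。温度系数与向量均为示例假设,并非论文的具体配置:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temp=0.1):
    """InfoNCE 对比损失: 正例相似度越高于负例, 损失越小。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temp
    logits -= logits.max()               # 数值稳定
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))      # 正例位于第 0 位

a = np.array([1.0, 0.0])                                     # anchor 物品表示(虚构)
loss_good = info_nce(a, np.array([0.9, 0.1]), [np.array([0.0, 1.0])])
loss_bad = info_nce(a, np.array([0.0, 1.0]), [np.array([0.9, 0.1])])
# 正例与 anchor 对齐时损失明显更小
```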
[IR-9] FlashEvaluator: Expanding Search Space with Parallel Evaluation
【速读】:该论文旨在解决生成式评估框架(Generator-Evaluator, G-E)中传统评估器存在的两大问题:一是缺乏显式的跨序列比较,导致选择准确率不足;二是评估过程并行性差,计算复杂度为线性O(K),造成资源利用效率低下,影响系统吞吐量和延迟。解决方案的关键在于提出FlashEvaluator,其通过在单次前向传播中实现跨序列的token信息共享,从而获得亚线性计算复杂度,显著提升系统效率,并支持直接的序列间对比,进而提高最终选中的序列质量。
链接: https://arxiv.org/abs/2603.02565
作者: Chao Feng,Yuanhao Pu,Chenghao Zhang,Shanqi Liu,Shuchang Liu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages, 2 figures
点击查看摘要
Abstract:The Generator-Evaluator (G-E) framework, i.e., evaluating K sequences from a generator and selecting the top-ranked one according to evaluator scores, is a foundational paradigm in tasks such as Recommender Systems (RecSys) and Natural Language Processing (NLP). Traditional evaluators process sequences independently, suffering from two major limitations: (1) lack of explicit cross-sequence comparison, leading to suboptimal accuracy; (2) poor parallelization with linear complexity of O(K), resulting in inefficient resource utilization and negative impact on both throughput and latency. To address these challenges, we propose FlashEvaluator, which enables cross-sequence token information sharing and processes all sequences in a single forward pass. This yields sublinear computational complexity that improves the system’s efficiency and supports direct inter-sequence comparisons that improve selection accuracy. The paper also provides theoretical proofs and extensive experiments on recommendation and NLP tasks, demonstrating clear advantages over conventional methods. Notably, FlashEvaluator has been deployed in online recommender system of Kuaishou, delivering substantial and sustained revenue gains in practice.
[IR-10] SOLAR: SVD-Optimized Lifelong Attention for Recommendation
【速读】:该论文旨在解决Transformer中注意力机制在长序列建模时面临的高时间和内存复杂度问题(O(N²d)),这通常导致模型不得不截断序列或采用启发式策略,从而影响推荐系统的性能。其解决方案的关键在于提出SVD-Attention,该方法利用推荐系统中用户行为矩阵普遍具有低秩结构的归纳偏置,通过奇异值分解(SVD)实现对注意力计算的理论无损压缩,在保持softmax归一化机制的同时将复杂度降至O(Ndr),其中r为矩阵的低秩近似维度。基于此,作者进一步构建了SOLAR框架,支持十万级行为序列与数千项候选集的端到端建模,无需过滤,显著提升了在线推荐效果。
链接: https://arxiv.org/abs/2603.02561
作者: Chenghao Zhang,Chao Feng,Yuanhao Pu,Xunyong Yang,Wenhui Yu,Xiang Li,Yongqi Liu,Lantao Hu,Kaiqiao Zhan,Han Li,Kun Gai
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 4 figures
点击查看摘要
Abstract:Attention mechanism remains the defining operator in Transformers since it provides expressive global credit assignment, yet its O(N^2 d) time and memory cost in sequence length N makes long-context modeling expensive and often forces truncation or other heuristics. Linear attention reduces complexity to O(N d^2) by reordering computation through kernel feature maps, but this reformulation drops the softmax mechanism and shifts the attention score distribution. In recommender systems, low-rank structure in matrices is not a rare case, but rather the default inductive bias in its representation learning, particularly explicit in the user behavior sequence modeling. Leveraging this structure, we introduce SVD-Attention, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from O(N^2 d) to O(Ndr) . With SVD-Attention, we propose SOLAR, SVD-Optimized Lifelong Attention for Recommendation, a sequence modeling framework that supports behavior sequences of ten-thousand scale and candidate sets of several thousand items in cascading process without any filtering. In Kuaishou’s online recommendation scenario, SOLAR delivers a 0.68% Video Views gain together with additional business metrics improvements.
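其"在低秩矩阵上理论无损"的关键性质可以直接用数值验证:当 Key 矩阵的秩为 r 时,经截断 SVD 因子分步计算的注意力分数与直接计算完全一致。此处只演示这一无损性;论文中达到 O(Ndr) 的完整算法还涉及对 softmax 的处理,细节见原文:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 64, 32, 4
# 构造秩为 r 的 Key 矩阵 (推荐场景中行为矩阵常近似低秩)
K = rng.normal(size=(N, r)) @ rng.normal(size=(r, d))
Q = rng.normal(size=(N, d))

# 截断 SVD: K = U_r diag(S_r) V_r, 秩为 r 时严格成立
U, S, Vt = np.linalg.svd(K, full_matrices=False)
Ur, Sr, Vr = U[:, :r], S[:r], Vt[:r]

scores_full = Q @ K.T                    # 直接计算: O(N^2 d)
scores_lowrank = (Q @ Vr.T * Sr) @ Ur.T  # 经低秩因子分步计算, 结果无损
```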
[IR-11] Relevance Matters: A Multi-Task and Multi-Stage Large Language Model Approach for E-commerce Query Rewriting ICDE2026
【速读】:该论文旨在解决电子商务搜索中因用户查询与商品描述之间存在词汇鸿沟而导致的检索相关性不足和用户转化率低的问题。其核心挑战在于如何通过查询重写(Query Rewriting)策略同时提升搜索结果的相关性和用户购买转化率。解决方案的关键在于提出一个基于大语言模型(Large Language Models, LLMs)的多任务、多阶段查询重写框架:首先通过多任务监督微调(Multi-task Supervised Fine-Tuning, SFT),联合优化查询生成任务与查询-重写相关性标注任务,从而在初始阶段就注入相关性建模能力;随后采用分组相对策略优化(Group Relative Policy Optimization, GRPO)对模型目标进行对齐,进一步增强相关性并促进用户转化行为。该方法已在某中国头部电商平台上线应用,显著提升了搜索相关性和单位用户购买转化率(UCVR)。
链接: https://arxiv.org/abs/2603.02555
作者: Aijun Dai,Jixiang Zhang,Haiqing Hu,Guoyu Tang,Lin Liu,Ziguang Cheng
机构: JD.com(京东); Tsinghua University(清华大学)
类目: Information Retrieval (cs.IR)
备注: Accepted for publication at ICDE 2026
点击查看摘要
Abstract:For e-commerce search, user experience is measured by users’ behavioral responses to returned products, like click-through rate and conversion rate, as well as the relevance between returned products and search queries. Consequently, relevance and user conversion constitute the two primary objectives in query rewriting, a strategy to bridge the lexical gap between user expressions and product descriptions. This research proposes a multi-task and multi-stage query rewriting framework grounded in large language models (LLMs). Critically, in contrast to previous works that primarily emphasized rewritten query generation, we inject the relevance task into query rewriting. Specifically, leveraging a pretrained model on user data and product information from this http URL, the approach initiates with multi-task supervised fine-tuning (SFT) comprising of the rewritten query generation task and the relevance tagging task between queries and rewrites. Subsequently, we employ Group Relative Policy Optimization (GRPO) for the model’s objective alignment oriented toward enhancing the relevance and stimulating user conversions. Through offline evaluation and online A/B test, our framework illustrates substantial improvements in the effectiveness of e-commerce query rewriting, resulting in elevating the search results’ relevance and boosting the number of purchases made per user (UCVR). Since August 2025, our approach has been implemented on this http URL, one of China’s leading online shopping platforms.
[IR-12] Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling
【速读】:该论文旨在解决多源混合模态虚假信息检测(Multi-Source Multi-Modal Misinformation Detection, M3D)任务中,单一视觉语言模型(Vision-Language Model, VLM)能力不足的问题,尤其是在零样本场景下,现有方法难以有效识别源自文本、图像或两者不一致的复杂虚假信息。解决方案的关键在于提出一个名为AgentM3D的多代理框架,其核心创新包括:引入自适应测试时扩展(adaptive test-time scaling)范式,使每个模态特异性VLM代理采用Best-of-N机制并结合任务对齐评分的评判代理(critic agent),从而增强推理鲁棒性;通过级联的模态特定决策链减少冗余计算并控制错误传播;同时,规划代理动态确定推理路径数量,自适应停止机制避免单个代理过度推理,显著提升了VLM在复杂M3D任务中的推理能力和可扩展性。
链接: https://arxiv.org/abs/2603.02519
作者: Wei Jiang,Tong Chen,Wei Yuan,Quoc Viet Hung Nguyen,Hongzhi Yin
机构: The University of Queensland(昆士兰大学); Griffith University(格里菲斯大学)
类目: Multimedia (cs.MM); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM’s capacity falls short in the more complex mixed-source multi-modal misinformation detection (M3D) task. Taking captioned images as an example, in M3D, false information can originate from untruthful texts, forged images, or mismatches between the two modalities. Although recent agentic systems can handle zero-shot M3D by connecting modality-specific VLM agents, their effectiveness is still bottlenecked by their architecture. In existing agentic M3D solutions, for any input sample, each agent performs only one forward reasoning pass, making decisions prone to model randomness and reasoning errors in challenging cases. Moreover, the lack of exploration over alternative reasoning paths prevents modern VLMs from fully utilizing their reasoning capacity. In this work, we present AgentM3D, a multi-agent framework for zero-shot M3D. To amplify the reasoning capability of VLMs, we introduce an adaptive test-time scaling paradigm in which each modality-specific VLM agent applies a Best-of-N mechanism, coupled with a critic agent for task-aligned scoring. The agents are organized in a cascading, modality-specific decision chain to reduce unnecessary computation and limit error propagation. To ensure scalability, a planning agent dynamically determines the maximum number of reasoning paths based on sample difficulty, and an adaptive stopping mechanism prevents excessive reasoning within each agent. Extensive experiments on two M3D benchmarks demonstrate that AgentM3D achieves state-of-the-art zero-shot detection performance compared with various VLM-based and agentic baselines.
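其中"Best-of-N 采样 + 评判代理打分 + 自适应提前停止"的控制流可以抽象为如下骨架。generate/critic 的接口、停止阈值与玩具打分函数均为本示意的假设:

```python
def best_of_n(generate, critic, sample, n_max=8, stop_threshold=0.9):
    """采样至多 n_max 条推理路径, 由 critic 打分取最优; 分数足够高时提前停止。"""
    best_answer, best_score = None, float("-inf")
    for _ in range(n_max):
        answer = generate(sample)             # 一条新的推理路径
        score = critic(sample, answer)        # 评判代理的任务对齐打分
        if score > best_score:
            best_answer, best_score = answer, score
        if best_score >= stop_threshold:      # 自适应停止, 避免过度推理
            break
    return best_answer, best_score

# 玩具示例: 生成器依次给出质量递增的答案, critic 按长度打分(纯属示意)
answers = iter(["a", "abc", "abcdefghij"])
result, score = best_of_n(lambda s: next(answers),
                          lambda s, a: len(a) / 10.0,
                          sample=None, n_max=3)
```

规划代理在框架中进一步按样本难度动态设定 n_max,从而把计算集中在困难样本上。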
[IR-13] HELIOS: Harmonizing Early Fusion Late Fusion and LLM Reasoning for Multi-Granular Table-Text Retrieval ACL2025
[Quick Read]: This paper aims to overcome the limitations of existing early- and late-fusion strategies for table-text retrieval: early fusion pre-aligns table rows with their associated passages into "stars," introducing irrelevant context and missing query-dependent relationships, while late fusion aligns nodes dynamically but risks dropping key information; both families also struggle with advanced reasoning tasks such as column-wise aggregation and multi-hop reasoning. The key to the proposed HELIOS framework is threefold: first, edge-based bipartite subgraph retrieval identifies fine-grained edges between table segments and text passages, avoiding redundant context; second, query-relevant node expansion dynamically mines the most promising nodes and their associated edges, growing a more targeted subgraph that reduces the risk of missing important context; finally, star-level reasoning with a large language model (LLM) effectively supports complex reasoning tasks.
Link: https://arxiv.org/abs/2603.02248
Authors: Sungho Park, Joohyung Yun, Jongwuk Lee, Wook-Shin Han
Affiliations: POSTECH, Republic of Korea; Sungkyunkwan University, Republic of Korea
Subjects: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures. Accepted at ACL 2025 main. Project page: this https URL
Click to view abstract
Abstract:Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming “stars,” which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose HELIOS, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star graph level rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that HELIOS outperforms state-of-the-art models with a significant improvement up to 42.6% and 39.9% in recall and nDCG, respectively, on the OTT-QA benchmark.
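The edge-based retrieval idea, scoring fine-grained (table-segment, passage) edges against the query instead of retrieving whole pre-aligned "stars", can be illustrated with a toy scorer. The embeddings, identifiers, and additive scoring rule below are our own simplifications, not HELIOS's actual model.

```python
def dot(u, v):
    """Inner product of two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_edges(query, edges, k):
    """Score each (table-segment, passage) edge against the query and keep
    the top-k edges, so an off-topic passage attached to a relevant table
    segment is not dragged in the way a whole 'star' would be."""
    scored = []
    for seg_id, psg_id, seg_vec, psg_vec in edges:
        score = dot(query, seg_vec) + dot(query, psg_vec)
        scored.append((score, seg_id, psg_id))
    scored.sort(reverse=True)
    return scored[:k]

query = [1.0, 0.0]
edges = [
    ("t1", "p1", [0.9, 0.1], [0.8, 0.0]),  # both endpoints relevant
    ("t1", "p2", [0.9, 0.1], [0.0, 1.0]),  # passage off-topic
    ("t2", "p3", [0.0, 1.0], [0.1, 0.9]),  # irrelevant edge
]
top = retrieve_top_edges(query, edges, k=1)
```

Note that the two edges sharing table segment `t1` are scored independently: edge-level granularity is what lets the retriever drop the off-topic passage `p2`.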
Human-Computer Interaction
[HC-0] MIBURI: Towards Expressive Interactive Gesture Synthesis CVPR2026
[Quick Read]: This paper addresses the lack of embodiment and natural, diverse co-speech gestures in current large language model (LLM)-based conversational agents. Existing methods either produce rigid, low-diversity motion or depend on future speech context and therefore cannot run in real time. The key to the proposed solution, MIBURI, the first online, causal framework for generating full-body gestures and facial expressions synchronized with real-time spoken dialogue, is twofold: body-part-aware gesture codecs encode multi-level motion detail into discrete tokens, which a two-dimensional causal mechanism generates autoregressively, conditioned on LLM-derived speech-text embeddings; auxiliary objectives further encourage expressive, diverse motion and prevent convergence to static poses, enabling efficient, natural, and contextually consistent embodied interaction.
Link: https://arxiv.org/abs/2603.03282
Authors: M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt
Affiliations: Max Planck Institute for Informatics; Saarland University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments: CVPR 2026. Project page: this https URL
Click to view abstract
Abstract:Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions, that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines. We urge the reader to explore demo videos on this https URL.
[HC-1] Deception by Design: A Temporal Dark Patterns Audit of McDonald's Self-Ordering Kiosk Flow
[Quick Read]: This paper examines how self-ordering kiosks (SOKs), while improving efficiency and order value, can steer consumer decisions through manipulative "dark pattern" designs. The key to the approach is a structured audit of a McDonald's SOK in Germany using the Temporal Analysis of Dark Patterns (TADP) framework: by walking through a time-pressured user scenario, the authors identify high-level strategies that accumulate across pages and system levels (adding steps, false hierarchy, bad defaults) together with low-level patterns (visual prominence, confirmshaming, scarcity framing, feedforward ambiguity, emotional-sensory manipulation, and partitioned pricing). The study shows how these dark patterns are amplified by the kiosk's linear task flow and physical context, and argues that hybrid physical-digital consumer interfaces warrant inclusion in future regulatory discussions.
Link: https://arxiv.org/abs/2603.03218
Authors: Aditya Kumar Purohit, Yuwei Liu, Manon Berney, Hendrik Heuer, Adrian Holzer
Affiliations: Center for Advanced Internet Studies (CAIS); University of Neuchâtel
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: Accepted at the CHI'26 workshop on Bridge Over Troubled Water, April 16, 2026, Barcelona, Spain
Click to view abstract
Abstract:Self-ordering kiosks (SOKs) are widely deployed in fast food restaurants, transforming food ordering into digitally mediated, self-navigated interactions. While these systems enhance efficiency and average order value, they also create opportunities for manipulative interface design practices known as dark patterns. This paper presents a structured audit of the McDonald’s self-ordering kiosk in Germany using the Temporal Analysis of Dark Patterns (TADP) framework. Through a scenario-based walkthrough simulating a time-pressured user, we reconstructed and analyzed 12 interface steps across intra-page, inter-page, and system levels. We identify recurring high-level strategies implemented through meso-level patterns such as adding steps, false hierarchy, bad defaults, hiding information, and pressured selling, and low-level patterns including visual prominence, confirmshaming, scarcity framing, feedforward ambiguity, emotional sensory manipulation, and partitioned pricing. Our findings demonstrate how these patterns accumulate across the interaction flow and may be amplified by the kiosk’s linear task structure and physical context. These findings suggest that hybrid physical–digital consumer interfaces warrant closer scrutiny within emerging regulatory discussions on dark patterns.
[HC-2] How to Model AI Agents as Personas?: Applying the Persona Ecosystem Playground to 41,300 Posts on Moltbook for Behavioral Insights
[Quick Read]: This paper tackles the limited understanding of behavioral diversity among AI agents on social media platforms, and the lack of effective methods for identifying distinct agent types and analyzing how they interact around shared topics. The key to the solution is applying the Persona Ecosystem Playground (PEP) framework to Moltbook, a social platform for AI agents, using 41,300 posts to construct and validate conversational personas via k-means clustering and retrieval-augmented generation (RAG). Cross-persona validation shows that each persona is significantly more semantically similar to its own source cluster than to others (t(61) = 17.85, p < .001, d = 2.20), and in a nine-turn structured discussion, simulated messages could be attributed to their source persona above chance (binomial test, p < .001), indicating that persona-based ecosystem modeling can effectively capture behavioral diversity in AI agent populations.
Link: https://arxiv.org/abs/2603.03140
Authors: Danial Amin, Joni Salminen, Bernard J. Jansen
Affiliations: University of Vaasa; Qatar Computing Research Institute, HBKU
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:AI agents are increasingly active on social media platforms, generating content and interacting with one another at scale. Yet the behavioral diversity of these agents remains poorly understood, and methods for characterizing distinct agent types and studying how they engage with shared topics are largely absent from current research. We apply the Persona Ecosystem Playground (PEP) to Moltbook, a social platform for AI agents, to generate and validate conversational personas from 41,300 posts using k-means clustering and retrieval-augmented generation. Cross-persona validation confirms that personas are semantically closer to their own source cluster than to others (t(61) = 17.85, p < .001, d = 2.20; own-cluster M = 0.71 vs. other-cluster M = 0.35). These personas are then deployed in a nine-turn structured discussion, and simulation messages were attributed to their source persona significantly above chance (binomial test, p < .001). The results indicate that persona-based ecosystem modeling can represent behavioral diversity in AI agent populations.
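The own-cluster vs. other-cluster validation logic can be sketched with cosine similarity over toy embeddings. The vectors and cluster layout below are invented purely for illustration; the paper reports the aggregate statistics quoted above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def cluster_similarity(persona, cluster):
    """Mean similarity between a persona embedding and a cluster's posts."""
    return sum(cosine(persona, post) for post in cluster) / len(cluster)

persona = [1.0, 0.1]
own_cluster = [[0.9, 0.2], [1.0, 0.0]]    # posts the persona was built from
other_cluster = [[0.1, 1.0], [0.0, 0.9]]  # posts from a different cluster

own_sim = cluster_similarity(persona, own_cluster)
other_sim = cluster_similarity(persona, other_cluster)
```

A persona passes this sanity check when `own_sim` clearly exceeds `other_sim`; the paper's t-test is the aggregate version of this comparison across all personas.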
[HC-3] Surveillance, Spacing, Screaming, and Scabbing: How Digital Technology Facilitates Union Busting
[Quick Read]: This paper asks why, despite high public approval of unions and growing worker interest in organizing, U.S. employees still face significant barriers to securing collective bargaining agreements, with employer counter-organizing (rule changes, retaliation, and disruption) as a key factor. Designing sociotechnical tools and strategies to resist these tactics first requires a deeper understanding of the role computing technologies play in counter-organizing. The key contribution is an analysis of three high-profile organizing efforts, at Amazon, Starbucks, and a university, identifying four recurring technological tactics: surveillance, spacing, screaming (emotionally charged messaging), and scabbing (deploying replacement labor), and showing how their digital dimensions are strategically deployed across contexts, grounding both theory and practice for labor organizing in digitally mediated workplaces.
Link: https://arxiv.org/abs/2603.03130
Authors: Frederick Reiber, Nathan Kim, Allison McDonald, Dana Calacci
Affiliations: Boston University; University of Michigan; Pennsylvania State University
Subjects: Human-Computer Interaction (cs.HC)
Comments: To appear in CHI 2026
Click to view abstract
Abstract:Despite high approval ratings for unions and growing worker interest in organizing, employees in the United States still face significant barriers to securing collective bargaining agreements. A key factor is employer counter-organizing: efforts to suppress unionization through rule changes, retaliation, and disruption. Designing sociotechnical tools and strategies to resist these tactics requires a deeper understanding of the role computing technologies play in counter-organizing against unionization. In this paper, we examine three high-profile organizing efforts – at Amazon, Starbucks, and a university – using publicly available sources to identify four recurring technological tactics: surveillance, spacing, screaming and scabbing. We analyze how these tactics operate across contexts, highlighting their digital dimensions and strategic deployment. We conclude with implications for organizing in digitally-mediated workplaces, directions for future research, and emergent forms of worker resistance.
[HC-4] Design Generative AI for Practitioners: Exploring Interaction Approaches Aligned with Creative Practice
[Quick Read]: This paper addresses the intent-alignment problem that arises when generative AI (GenAI) enters professional design practice: interaction centered on prompts or whole-image manipulation fits poorly with designers' non-linear, reflective workflows, forcing visual thinkers into verbal reasoning or post-hoc adjustment. The key to the solution is three interaction approaches, drawn from DesignPrompt, FusAIn, and DesignTrace, that distribute control across intent, input, and process, letting designers steer AI output at different stages of creation and keep intent and outcome continuously aligned. The authors further argue that alignment is a dynamic negotiation between designer and AI, with the AI flexibly switching between proactive and reactive roles according to the designer's instrumental and inspirational needs.
Link: https://arxiv.org/abs/2603.03074
Authors: Xiaohan Peng, Wendy E. Mackay, Janin Koch
Affiliations: Inria; Université Paris-Saclay, CNRS; Univ. Lille, Inria, CNRS, Centrale Lille
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM CHI 2026 Workshop on Bidirectional Human-AI Alignment
Click to view abstract
Abstract:Design is a non-linear, reflective process in which practitioners engage with visual, semantic, and other expressive materials to explore, iterate, and refine ideas. As Generative AI (GenAI) becomes integrated into professional design practice, traditional interaction approaches focusing on prompts or whole-image manipulation can misalign AI output with designers’ intent, forcing visual thinkers into verbal reasoning or post-hoc adjustments. We present three interaction approaches from DesignPrompt, FusAIn, and DesignTrace that distribute control across intent, input, and process, enabling designers to guide AI alignment at different stages of interaction. We further argue that alignment is a dynamic negotiation, with AI adopting proactive or reactive roles according to designers’ instrumental and inspirational needs and the creative stage.
[HC-5] An HCI Perspective on Sustainable GenAI Integration in Architectural Design Education
[Quick Read]: This paper addresses a paradox in introducing generative AI into architectural education: critical understanding of GenAI often requires deeper use of the very tools in question, yet teaching contexts lack effective methods for estimating their environmental cost. The key to the solution is a methodological lens from human-computer interaction (HCI), yielding three directions for more sustainable integration: contextual eco-feedback that makes the environmental impact of tool use tangible to learners; participatory stakeholder scoping that clarifies the needs and responsibilities of the parties involved; and reframing data centres as an interdisciplinary focus, so that GenAI's impact is understood along technical, social, and environmental dimensions. The paper accordingly argues that GenAI should be treated as a socio-technical process, not merely a design tool, demanding critical engagement from architectural education and design education at large.
Link: https://arxiv.org/abs/2603.03059
Authors: Alex Binh Vinh Duc Nguyen
Affiliations: University of Antwerp; KU Leuven
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Click to view abstract
Abstract:Generative AI (genAI) is increasingly influencing architectural design practice and is expected to affect, or even transform, the profession, even though its benefits and costs remain unresolved. In response, design schools are increasingly integrating genAI into their curricula. Yet this integration creates a paradox: critical engagement with genAI often requires increased use of the tools in question, despite limited methods for estimating their environmental cost in teaching contexts. In this paper, we argue that HCI offers a useful methodological lens for addressing this tension. We propose three HCI-informed directions for more sustainable genAI integration in architectural education: contextual eco-feedback, participatory stakeholder scoping, and reframing data centres as an interdisciplinary focus. We therefore argue that genAI should be understood not only as a new architectural design tool, but also as a socio-technical process that architectural education, and design education in general, must engage with critically.
[HC-6] Architectural HRI: Towards a Robotic Paradigm Shift in Human-Building Interaction
[Quick Read]: This paper addresses the limitation that Human-Building Interaction (HBI) research remains confined to adaptive building services and facade control, falling short of physical co-adaptation of architectural space and lacking an integrated understanding of multi-layer building systems across temporal, spatial, and social dimensions, which constrains how well buildings can support occupant needs and sustainability goals. The key to the proposed paradigm shift is using robotic furniture, swarm robotics, and shape-changing spaces to let building layers (such as walls, floors, and furniture) physically adapt in synchrony and dynamically reconfigure space, meeting human-environment interaction needs more holistically; the agenda further calls for interdisciplinary integration across HCI, environmental psychology, cognitive science, and architecture to systematically unify the why, what, and how of robotically actuated architectural form.
Link: https://arxiv.org/abs/2603.03052
Authors: Alex Binh Vinh Duc Nguyen
Affiliations: University of Antwerp; KU Leuven
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments:
Click to view abstract
Abstract:Recent advances in sensing, communication, interfaces, control, and robotics are expanding Human-Building Interaction (HBI) beyond adaptive building services and facades toward the physical actuation of architectural space. In parallel, research in robotic furniture, swarm robotics, and shape-changing spaces shows that architectural elements can now be robotically augmented to move, reconfigure, and adapt space. We propose that these advances promise a paradigm shift in HBI, in which multiple building layers physically adapt in synchrony to support occupant needs and sustainability goals more holistically. Conversely, we argue that this emerging paradigm also provides an ideal case for transferring HRI knowledge to unconventional robotic morphologies, including the interpretation of the robot as multiple architectural layers or even as a building. However, this research agenda remains challenged by the temporal, spatial, and social complexity of architectural HRI, and by fragmented knowledge across HCI, environmental psychology, cognitive science, and architecture. We therefore call for interdisciplinary research that unifies the why, what, and how of robotic actuation in architectural forms.
[HC-7] Changing Pedagogical Paradigms: Integrating Generative AI in Mathematics to Enhance Digital Literacy through Mathematical Battles with AI ICCS2026
[Quick Read]: This paper addresses the risk that generative AI in mathematics education breeds over-reliance, with students trusting AI output blindly and weakening their critical-thinking and verification skills. The key to the solution is a competitive format called 'Math Battles with AI': an AI agent with intentionally increased hallucination likelihood in specific modes pushes students to actively verify results, training core skills such as digital hygiene and prompt engineering. The format rests on a three-stage tournament structure and a specialized assessment system that rewards critical verification rather than blind reliance on AI answers, effectively redefining AI's role in teaching.
Link: https://arxiv.org/abs/2603.02955
Authors: Maria Moskalenko, Alexander Trifanov, Roman Popkov, Arina Tabieva, Maria Smirnova, Konstantin Pravdin, Daniil Bakalin
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: Submitted for the conference ICCS 2026
Click to view abstract
Abstract:This paper introduces 'Math Battles with AI', an innovative competitive format designed at ITMO University to redefine the role of generative AI in mathematics education. Moving away from a purely defensive stance, the authors propose an AI agent with intentionally increased hallucination likelihood in specific modes to train verification skills. We describe the three-stage tournament structure and a specialized assessment system that rewards critical verification over blind reliance. Initial results indicate a significant shift in student mindsets, fostering essential skills in digital hygiene and prompt engineering. This work serves as a practical guide for academic institutions aiming to leverage AI for enhancing, rather than undermining, intellectual development.
[HC-8] Speech recognition assisted by large language models to command software orally – Application to an augmented and virtual reality web app for immersive molecular graphics
[Quick Read]: This paper addresses the problem that, in immersive molecular graphics applications, users manipulate molecules with their hands and therefore cannot simultaneously control the interface with a conventional mouse and keyboard. The key to the solution is a speech-based Voice User Interface (VUI) with two core components: Chrome's native Speech API for stable, fast, and reliable automated speech recognition (ASR), chosen over OpenAI's Whisper v3, which tends to "hallucinate" on scientific jargon; and LLM-driven function calling (based on GPT-4o-mini) that translates natural-language commands into predefined function calls rather than generating complex code, improving safety, efficiency, and reliability. The resulting system enables efficient control of molecular manipulation features in AR/VR through natural-language commands.
Link: https://arxiv.org/abs/2603.02901
Authors: Fabio Cortes Rodriguez, Luciano Abriata
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Click to view abstract
Abstract:This project successfully developed, evaluated and integrated a Voice User Interface (VUI) into a web application that we are developing for immersive molecular graphics. Said app provides augmented and virtual reality (AR and VR) environments where users manipulate molecules with their hands, but this means the hands can’t be used to control the app through a regular mouse- and keyboard-based GUI. The speech-based VUI system developed here alleviates this problem, making it easy to control the app via natural spoken (or typed) commands. To achieve this VUI we evaluated two distinct Automated Speech Recognition (ASR) systems: Chrome’s native Speech API and OpenAI’s Whisper v3. While Whisper offered broader browser compatibility, its tendency to “hallucinate” with specialized scientific jargon proved very problematic. Consequently, we selected Chrome’s ASR for its stability, speed, and reliability. For translating transcribed speech into software commands, we tested two Large Language Model (LLM)-driven approaches: either generating executable code, or calling predefined functions. The function call method, powered by OpenAI’s GPT-4o-mini, was ultimately adopted due to its superior safety, efficiency, and reliability over the more complex and error-prone code-generation approach. The resulting VUI is then based on an integration of Chrome’s ASR with our LLM-based function-calling module, enabling users to command the application using natural language as shown in a video linked inside this report. We provide links to live examples demonstrating all the intermediate components, and details on how we crafted the LLM’s prompt in order to teach it the function calls as well as ways to clean up the transcribed speech and to explain itself while generating function calls. For best demonstration of the final system, we provide a video example.
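The function-call approach can be sketched as a dispatcher that only executes whitelisted, predefined functions named in the LLM's structured reply. The function names, arguments, and JSON reply shape below are hypothetical stand-ins, not the app's actual API.

```python
import json

# Hypothetical registry of app functions the LLM may call; the model returns
# a JSON object naming one function and its arguments, never raw code.
REGISTRY = {
    "rotate_molecule": lambda axis, degrees: f"rotated {degrees} deg about {axis}",
    "set_representation": lambda style: f"representation set to {style}",
}

def dispatch(llm_reply: str) -> str:
    """Parse the LLM's structured reply and run only predefined functions."""
    call = json.loads(llm_reply)
    name, args = call["name"], call.get("arguments", {})
    if name not in REGISTRY:  # safety: unknown names are refused, not executed
        return f"unknown function: {name}"
    return REGISTRY[name](**args)

result = dispatch('{"name": "rotate_molecule", "arguments": {"axis": "y", "degrees": 90}}')
```

Restricting execution to a fixed registry is what makes function calling safer than having the model emit executable code, which was the comparison the authors ran.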
[HC-9] “It's Messy…But I Feel Balanced”: Unpacking Flexible Workers' Rhythm-Making Practices Using an Asset-Based Approach
[Quick Read]: This paper addresses the gap that HCI research on flexible work has centered on productivity and organizational perspectives, overlooking how flexibility intersects with workers' personal and family responsibilities. The key insight, developed through an asset-based lens on a qualitative study of 20 flexible workers in Singapore, is that flexibility is not a static benefit but an ongoing practice of "rhythm-making": participants draw on temporal and spatial assets, negotiate relational and institutional dynamics, and sustain balance between work and care through intrapersonal assets such as self-care and positive reframing. The study reframes blurred boundaries as resources rather than disruptions, offering design implications for technologies that support flexible workers' everyday rhythm-making.
Link: https://arxiv.org/abs/2603.02841
Authors: Tse Pei Ng, Daniel Campos-Muniz, Yiyang He, Ker Wey Aw, Jung-Joo Lee, Janghee Cho
Affiliations: National University of Singapore
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Click to view abstract
Abstract:Flexible work is increasingly pursued as a means of achieving work-life balance, particularly as growing caregiving responsibilities for children and aging family members shape workers’ lives. Yet most HCI research has examined flexibility primarily through productivity and organizational perspectives, with less attention to how it intersects with workers’ personal and family responsibilities. To address this gap, we conducted a qualitative study with 20 workers in Singapore engaging in flexible arrangements to manage paid work and care responsibilities. Using an asset-based lens, we show that flexibility is not a static benefit but a continual practice of rhythm-making. Participants maintained rhythms by drawing on temporal and spatial assets, negotiated them through relational and institutional dynamics, and sustained them through intrapersonal assets such as self-care and positive reframing. Our study reframes blurred boundaries as resources rather than disruptions and offers design implications for technologies that support flexible workers’ everyday rhythm-making practices.
[HC-10] Causal Learning Should Embrace the Wisdom of the Crowd
[Quick Read]: This paper addresses the long-standing difficulty of learning causal structures (DAGs) from observational data, where the combinatorial explosion of candidate graphs and the inherent ambiguity of observations make recovering a global causal graph hard. The key to the solution is a new paradigm fusing human causal knowledge with AI capabilities: causal discovery is framed as a distributed decision-making task that combines scalable crowdsourcing platforms for data collection, interactive knowledge elicitation for modeling expert opinions, robust aggregation for reconciling inconsistent expert judgments, and large language model (LLM)-based simulation to augment information acquisition. By systematically integrating the fragmented, imperfect knowledge of human experts and LLM agents, the framework enables recovery of a global causal structure unachievable by any individual alone, pushing causal learning toward a human-AI collaborative research frontier.
Link: https://arxiv.org/abs/2603.02678
Authors: Ryan Feng Lin, Yuantao Wei, Huiling Liao, Xiaoning Qian, Shuai Huang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Click to view abstract
Abstract:Learning causal structures typically represented by directed acyclic graphs (DAGs) from observational data is notoriously challenging due to the combinatorial explosion of possible graphs and inherent ambiguities in observations. This paper argues that causal learning is now ready for the emergence of a new paradigm supported by rapidly advancing technologies, fulfilling the long-standing vision of leveraging human causal knowledge. This paradigm integrates scalable crowdsourcing platforms for data collection, interactive knowledge elicitation for expert opinion modeling, robust aggregation techniques for expert reconciliation, and large language model (LLM)-based simulation for augmenting AI-driven information acquisition. In this paper, we focus on DAG learning for causal discovery and frame the problem as a distributed decision-making task, recognizing that each participant (human expert or LLM agent) possesses fragmented and imperfect knowledge about different subsets of the variables of interest in the causal graph. By proposing a systematic framework to synthesize these insights, we aim to enable the recovery of a global causal structure unachievable by any individual agent alone. We advocate for a new research frontier and outline a comprehensive framework for new research thrusts that range from eliciting, modeling, aggregating, and optimizing human causal knowledge contributions.
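One simple way to reconcile fragmented expert judgments, a majority vote over directed edges restricted to the experts who actually saw both endpoints, can be sketched as follows. This is a toy aggregation rule of our own for illustration, not the paper's proposed framework.

```python
from collections import defaultdict

def aggregate_edges(expert_graphs, threshold=0.5):
    """Keep a directed edge u->v if, among the experts whose variable subset
    contains both u and v, at least `threshold` of them voted for the edge.
    Each expert contributes (set_of_variables, set_of_directed_edges)."""
    votes = defaultdict(int)  # experts who included the edge
    seen = defaultdict(int)   # experts who could have judged the edge
    for nodes, edges in expert_graphs:
        for u in nodes:
            for v in nodes:
                if u != v:
                    seen[(u, v)] += 1
        for edge in edges:
            votes[edge] += 1
    return {e for e, n in votes.items() if n / seen[e] >= threshold}

# Three experts with fragmented views over subsets of {A, B, C}:
experts = [
    ({"A", "B"}, {("A", "B")}),
    ({"A", "B", "C"}, {("A", "B"), ("B", "C")}),
    ({"B", "C"}, set()),
]
consensus = aggregate_edges(experts)
```

Normalizing by `seen` rather than by the total number of experts is the important detail: an expert who never observed a variable pair abstains from that edge instead of implicitly voting against it, which matches the paper's premise that each participant holds only partial knowledge.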
[HC-11] Behavior-Aware Anthropometric Scene Generation for Human-Usable 3D Layouts
[Quick Read]: This paper addresses the problem that existing 3D scene generation methods prioritize visual and semantic plausibility while neglecting behavioral usability, i.e., whether a scene lets people comfortably walk, sit, or manipulate objects. The key to the proposed Behavior-Aware Anthropometric Scene Generation framework is using vision-language models (VLMs) to analyze object-behavior relationships and translating spatial requirements into parametric layout constraints adapted to user-specific anthropometric data, yielding indoor scenes that better match how people actually use space.
Link: https://arxiv.org/abs/2603.02662
Authors: Semin Jin, Donghyuk Kim, Jeongmin Ryu, Kyung Hoon Hyun
Affiliations: Hanyang University; Design Informatics Lab; Human-Centered AI Design Institute
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Click to view abstract
Abstract:Well-designed indoor scenes should prioritize how people can act within a space rather than merely what objects to place. However, existing 3D scene generation methods emphasize visual and semantic plausibility, while insufficiently addressing whether people can comfortably walk, sit, or manipulate objects. To bridge this gap, we present a Behavior-Aware Anthropometric Scene Generation framework. Our approach leverages vision-language models (VLMs) to analyze object-behavior relationships, translating spatial requirements into parametric layout constraints adapted to user-specific anthropometric data. We conducted comparative studies with state-of-the-art models using geometric metrics and a user perception study (N=16). We further conducted in-depth human-scale studies (individuals, N=20; groups, N=18). The results showed improvements in task completion time, trajectory efficiency, and human-object manipulation space. This study contributes a framework that bridges VLM-based interaction reasoning with anthropometric constraints, validated through both technical metrics and real-scale human usability studies.
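The idea of turning body measurements into layout constraints can be illustrated with two minimal checks, one for walkway width and one for seat height. The parameter names, margins, and tolerances here are assumptions for illustration, not the paper's constraint set.

```python
def passage_ok(gap_width_m, shoulder_width_m, margin_m=0.15):
    """A walkway between furniture is usable if its clear width exceeds the
    user's shoulder width plus a comfort margin (all in metres)."""
    return gap_width_m >= shoulder_width_m + margin_m

def seat_height_ok(seat_height_m, lower_leg_length_m, tolerance_m=0.05):
    """A seat is comfortable if its height is within a tolerance of the
    user's lower-leg (popliteal) length."""
    return abs(seat_height_m - lower_leg_length_m) <= tolerance_m

# Example: a 0.55 m gap for a user with 0.45 m shoulders fails the check,
# so a layout solver would need to widen the walkway for this user.
wide_enough = passage_ok(0.55, 0.45)
sit_ok = seat_height_ok(0.45, 0.42)
```

A constraint-based generator would evaluate predicates like these for a specific user's measurements and adjust object positions until all of them hold, which is the role the parametric layout constraints play in the framework.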
[HC-12] How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
[Quick Read]: This paper addresses the risks posed by unpredictable behavior when large language models (LLMs) are deployed in socially sensitive settings, including misaligned intent and inconsistent personality. The key to the solution is SteerEval, a hierarchical benchmark that systematically evaluates LLM controllability across three domains (language features, sentiment, and personality), each structured into three specification levels (L1: what to express; L2: how to express it; L3: how to instantiate it), connecting high-level behavioral intent to concrete textual output and enabling interpretable, structured evaluation of behavior control.
Link: https://arxiv.org/abs/2603.02578
Authors: Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng
Affiliations: Zhejiang University; Alibaba Group; National University of Singapore, NUS-NCS Joint Lab, Singapore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Work in progress
Click to view abstract
Abstract:Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
[HC-13] An LLM-Assisted Toolkit for Inspectable Multimodal Emotion Data Annotation
[Quick Read]: This paper addresses the difficulty of scaling fine-grained, evidence-grounded annotation for multimodal emotion recognition (MER), especially when cues are dynamic and temporally misaligned across modalities. The key to the solution is an LLM-assisted, inspectable, event-centered annotation toolkit: heterogeneous multimodal recordings are preprocessed and aligned, with all modality signals visualized on an interactive shared timeline; candidate events are detected, and synchronized keyframes and time windows are packaged into event packets with traceable pointers to the source data; finally, an LLM equipped with modality-specific tools and prompt templates drafts structured annotations for analysts to verify and edit, yielding an efficient, consistent, and traceable multimodal emotion annotation workflow.
Link: https://arxiv.org/abs/2603.02569
Authors: Zheyuan Kuang, Weiwei Jiang, Nicholas Koemel, Matthew Ahmadi, Emmanuel Stamatakis, Benjamin Tag, Anusha Withana, Zhanna Sarsenbayeva
Affiliations: The University of Sydney; Nanjing University of Information Science and Technology; Charles Perkins Centre, Faculty of Medicine and Health; Turner Institute for Brain and Mental Health; University of New South Wales
Subjects: Human-Computer Interaction (cs.HC)
Comments: 5 pages, 1 figure
Click to view abstract
Abstract:Multimodal Emotion Recognition (MER) increasingly depends on fine grained, evidence grounded annotations, yet inspection and label construction are hard to scale when cues are dynamic and misaligned across modalities. We present an LLM-assisted toolkit that supports multimodal emotion data annotation through an inspectable, event centered workflow. The toolkit preprocesses and aligns heterogeneous recordings, visualizes all modalities on an interactive shared timeline, and renders structured signals as video tracks for cross modal consistency checks. It then detects candidate events and packages synchronized keyframes and time windows as event packets with traceable pointers to the source data. Finally, the toolkit integrates an LLM with modality specific tools and prompt templates to draft structured annotations for analyst verification and editing. We demonstrate the workflow on multimodal VR emotion recordings with representative examples.
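The event-packet unit at the core of the workflow can be sketched as a small data structure carrying a time window, keyframes, and traceable pointers back into each source modality. The field names and example values are our own illustration of the concept, not the toolkit's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EventPacket:
    """One candidate emotion event: a time window, synchronized keyframes,
    and per-modality pointers back into the source recordings so that
    annotations remain inspectable against the raw evidence."""
    event_id: str
    t_start: float                       # seconds into the recording
    t_end: float
    keyframes: List[str]                 # ids/paths of extracted frames
    source_pointers: Dict[str, float] = field(default_factory=dict)

    def duration(self) -> float:
        return self.t_end - self.t_start

packet = EventPacket(
    event_id="ev-012",
    t_start=12.0,
    t_end=14.5,
    keyframes=["frame_120.png", "frame_132.png"],
    source_pointers={"video": 12.0, "eda": 11.8, "audio": 12.1},
)
```

Keeping per-modality offsets (here `video`, `eda`, `audio`) rather than a single timestamp is what allows an analyst to audit an LLM-drafted label even when the modalities are slightly misaligned.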
[HC-14] Show It, Don't Just Say It: The Complementary Effects of Instruction Multimodality for Software Guidance
[Quick Read]: This paper investigates how adaptive software-tutoring systems should choose instructional modalities to improve learning while preserving learner agency. By observing ten teacher-student pairs in graphic design software lessons, the authors find that annotations complement speech with spatial precision, and remote screen control adds both spatial and temporal precision, yet both can intrude on learner agency. Teachers adaptively switch modalities, balancing instruction progress against students' cognitive engagement and sense of digital territoriality; the study proposes the precision-agency trade-off and digital territoriality as new design constraints for adaptive software guidance systems.
Link: https://arxiv.org/abs/2603.02567
Authors: Emran Poh, Yueyue Hou, Tianyi Zhang, Jiannan Li
Affiliations: Singapore Management University
Subjects: Human-Computer Interaction (cs.HC)
Comments: Proceedings of the 2026 ACM Conference on Human Factors in Computing Systems (CHI '26)
Click to view abstract
Abstract:Designing adaptive tutoring systems for software learning presents challenges in determining appropriate instructional modalities. To inform the design of such systems, we conducted an observational study of ten human teacher-student pairs (N=10), where experienced design software users taught novices two new graphic design software features through multi-step procedures. These lessons were limited to three communication channels (speech, visual annotations, and remote screen control) to mimic possible AI tutor modalities. We found that annotations complement speech with spatial precision and remote control complements it with spatial and temporal precision, but both cause intrusion to learner agency. Teachers adaptively select modalities to balance the need for instruction progress with students’ cognitive engagement and sense of digital territory ownership. Our results provide further support to the contiguity principles and the value of agency in learning, while suggesting precision-agency trade-off and digital territoriality as new design constraints for adaptive software guidance.
[HC-15] Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery ICRA
[Quick Read]: This paper addresses the problems of sterile-field disruption, operator fatigue, and inefficiency in manual surgical instrument delivery, and in particular the limited generalizability and safety in dynamic environments of existing robotic scrub nurses that rely on predefined paths. The key to the solution is an autonomous dual-arm surgical assistive robot: a vision-language model generates grasping and delivery trajectories from surgeons' instructions in a zero-shot manner, and a real-time obstacle minimum-distance perception method is integrated into a unified quadratic programming (QP) framework, enabling reactive obstacle avoidance and self-collision prevention in dynamic environments while keeping motion smooth and safe. Experiments show an 83.33% success rate on instrument delivery with no collisions across all trials.
Link: https://arxiv.org/abs/2603.02553
Authors: Xuejin Luo, Shiquan Sun, Runshi Zhang, Ruizhi Zhang, Junchen Wang
Affiliations: Beihang University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 8 pages, 10 figures. Accepted by IEEE International Conference on Robotics and Automation (ICRA), 2026
Click to view abstract
Abstract:During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot’s grasping and delivery trajectories in a zero-shot manner based on surgeons’ instructions. A real-time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot’s autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at this https URL.
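A much-simplified stand-in for the minimum-distance constraint inside a QP controller is a distance-based velocity damper: find the closest robot-obstacle point pair, then scale the commanded velocity down as that distance shrinks. The thresholds, sample points, and units below are illustrative assumptions, not the paper's controller.

```python
import math

def min_distance(robot_points, obstacle_points):
    """Minimum Euclidean distance between sampled robot and obstacle points."""
    return min(math.dist(p, q) for p in robot_points for q in obstacle_points)

def damp_velocity(v_cmd, d, d_safe, d_stop):
    """Full speed beyond d_safe, linear slowdown inside the band, and a
    full stop at d_stop, mimicking the effect of a QP distance inequality."""
    if d >= d_safe:
        scale = 1.0
    elif d <= d_stop:
        scale = 0.0
    else:
        scale = (d - d_stop) / (d_safe - d_stop)
    return [scale * v for v in v_cmd]

d = min_distance([(0.0, 0.0), (0.5, 0.0)], [(2.0, 0.0), (3.0, 4.0)])
v = damp_velocity([0.4, 0.2], d, d_safe=2.0, d_stop=1.0)
```

In the paper this safety behavior is expressed as hard inequality constraints inside a unified QP solved every control cycle; the damper above only conveys the qualitative shape of that constraint.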
[HC-16] Orality: A Semantic Canvas for Externalizing and Clarifying Thoughts with Speech
【速读】:该论文试图解决的问题是:语音转文字(Speech-to-text)生成的文本转录往往因不流畅、重复和缺乏组织而难以阅读与理解,从而限制了用户通过口头表达来梳理和深化思维的效果。解决方案的关键在于提出 Orality 系统,该系统通过大语言模型(LLM)对口语内容进行语义分析,提取关键信息并生成可交互的节点-链接图(node-link diagram),使用户能够以图形化方式操作语义簇,并通过自然语言指令重新组织内容;同时提供AI生成的启发式问题和逻辑冲突检测功能,从而显著提升用户在思考过程中的清晰度与深度。
链接: https://arxiv.org/abs/2603.02544
作者: Wengxi Li,Jingze Tian,Can Liu
机构: City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:People speak aloud to externalize thoughts as one way to help clarify and organize them. Although Speech-to-text can capture these thoughts, transcripts can be difficult to read and make sense of due to disfluencies, repetitions and potential disorganization. To support thinking through verbalization, we introduce Orality, which extracts key information from spoken content, performs semantic analysis through LLMs to form a node-link diagram in an interactive canvas. Instead of reading and working with transcripts, users could manipulate clusters of nodes and give verbal instructions to re-extract and organize the content in other ways. It also provides AI-generated inspirational questions and detection of logical conflicts. We conducted a lab study with twelve participants comparing Orality against speech interaction with ChatGPT. We found that Orality can better support users in clarifying and developing their thoughts. The findings also identified the affordances of both graphical and conversational thought clarification tools and derived design implications.
[HC-17] Understanding the Effects of Interaction on Emotional Experiences in VR
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)情绪诱发研究中长期存在的问题,即现有研究多聚焦于情绪强度的测量,而忽视了交互行为对情绪体验的影响机制。为填补这一空白,作者通过两个关键扩展改进了一个已验证的VR情绪诱发数据集:首先新增一个高唤醒度、高愉悦度的情绪场景并经由被试内实验(N=24)验证其有效性;其次在每个场景中引入交互元素,构建交互与非交互版本,以系统考察交互对情绪反应的作用。解决方案的核心在于采用多模态评估方法,融合主观评分与生理信号,从而同时捕捉意识层面和无意识层面的情绪响应。实证研究表明,交互不仅增强情绪强度,还能根据情境调节情绪体验——在负面场景中促进应对,在正面场景中提升愉悦感,凸显了场景定制化交互在情绪调控中的重要价值。
链接: https://arxiv.org/abs/2603.02535
作者: Zheyuan Kuang,Tinghui Li,Weiwei Jiang,Sven Mayer,Flora Salim,Benjamin Tag,Anusha Withana,Zhanna Sarsenbayeva
机构: The University of Sydney(悉尼大学); Nanjing University of Information Science and Technology(南京信息工程大学); TU Dortmund University(多特蒙德工业大学); University of New South Wales(新南威尔士大学); School of Computer Science and Engineering(计算机科学与工程学院); School of Computer Science(计算机科学学院)
类目: Human-Computer Interaction (cs.HC)
备注: 20 pages, 17 figures, to be published in the Proceedings of the 2026 ACM CHI Conference on Human Factors in Computing Systems
点击查看摘要
Abstract:Virtual reality has been effectively used for eliciting emotions, yet most research focuses on the intensity of affective responses rather than on how interaction influences those experiences. To address this gap, we advance a validated VR emotion-elicitation dataset through two key extensions. First, we add a new high-arousal, high-valence scene and validate its effectiveness in a within-subject study (N=24). Second, we incorporate interactive elements into each scene, creating both interactive and non-interactive versions to examine the impact of interaction on emotional responses. We evaluate interaction through a multimodal approach combining subjective ratings and physiological signals to capture both conscious and unconscious affective responses. Our evaluation study (N=84) shows that interaction not only amplifies emotions but modulates them in context, supporting coping in negative scenes and enhancing enjoyment in positive scenes. These findings highlight the potential of scene-tailored interaction for different applications, where regulating emotions is as important as eliciting them.
[HC-18] Probing More-Than-Human Representation in Crisis Resilience Planning : An HCI Researcher Perspective
【速读】:该论文试图解决危机韧性规划中如何将非人类物种和生态系统纳入参与式过程的问题,而当前的参与式实践仍以人类为中心。解决方案的关键在于通过人机交互(Human-Computer Interaction, HCI)领域的设计探针——包括基于语音的对话代理和沉浸式具身原型——来激发对非人类视角呈现方式的持续讨论。研究发现,赋予非人类“声音”并非中立的翻译行为,而是一个涉及合法性、权威性和真实性之间张力的设计挑战,从而揭示了在危机韧性规划中,AI与沉浸技术中介下的多物种表征具有重要的实践意义与伦理维度。
链接: https://arxiv.org/abs/2603.02514
作者: Tram Thi Minh Tran,Adrian Wong,Callum Parker,Carlos Alfredo Tirado Cortes,Marius Hoggenmueller,Soojeong Yoo,Nate Zettna,Joel Fredericks
机构: The University of Sydney (悉尼大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted into CHI 2026 Posters track
点击查看摘要
Abstract:Crisis resilience planning raises urgent questions about how to include non-human species and ecological systems in participatory processes, which remain largely human-centred. This paper reports on a workshop with HCI researchers examining how more-than-human representation is approached in crisis contexts. The workshop combined scenario-based discussion with two design probes – a voice-based conversational agent and an immersive embodied prototype – to support sustained discussion of how emerging technologies shape engagement with non-human perspectives. Participants focused not on system usability, but on deliberating representational choices, such as voice, embodiment, and realism, and their potential role within participatory planning processes. The findings suggest that giving ‘voice’ to non-humans is not a neutral act of translation, but a design challenge that introduces tensions between legitimacy, authority, and authenticity. This paper provides empirical insight into how HCI researchers conceptualise more-than-human representation and positions crisis resilience planning as a critical site for examining AI- and immersion-mediated representation.
[HC-19] Improving Low-Vision Chart Accessibility via On-Cursor Visual Context
【速读】:该论文旨在解决低视力人群(Low-Vision Individuals, LVI)在使用图表时面临的可访问性问题,特别是由于视野受限或依赖放大功能导致难以理解数据点与整体上下文之间的关系。其解决方案的关键在于通过两种基于指针的交互方法提供关键的视觉上下文信息:一是提出一种新颖的“焦点+上下文”交互方式——动态上下文(Dynamic Context),用于增强对数据点的局部访问同时保持全局感知;二是引入小地图(Mini-map)机制,借鉴“概览+细节”原理来提升空间理解能力。实验结果表明,动态上下文显著改善了LVI的访问效率、可用性和操作负担,但增加了视觉负荷;而小地图虽强化了空间认知,但在该任务中用户偏好较低。研究为未来面向LVI的可视化系统设计提供了重要设计洞见,强调在提供上下文支持的同时需平衡视觉负荷。
链接: https://arxiv.org/abs/2603.02498
作者: Yotam Sechayk,Hennes Rave,Max Rädler,Mark Colley,Zhongyi Zhou,Ariel Shamir,Takeo Igarashi
机构: The University of Tokyo(东京大学); University of Münster(明斯特大学); Ulm University(乌尔姆大学); University College London(伦敦大学学院); Google(谷歌); Reichman University(里奇曼大学); The University of Tokyo(东京大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI '26, Barcelona, Spain
点击查看摘要
Abstract:Despite widespread use, charts remain largely inaccessible for Low-Vision Individuals (LVI). Reading charts requires viewing data points within a global context, which is difficult for LVI who may rely on magnification or experience a partial field of vision. We aim to improve exploration by providing visual access to critical context. To inform this, we conducted a formative study with five LVI. We identified four fundamental contextual elements common across chart types: axes, legend, grid lines, and the overview. We propose two pointer-based interaction methods to provide this context: Dynamic Context, a novel focus+context interaction, and Mini-map, which adapts overview+detail principles for LVI. In a study with N=22 LVI, we compared both methods and evaluated their integration to current tools. Our results show that Dynamic Context had significant positive impact on access, usability, and effort reduction; however, worsened visual load. Mini-map strengthened spatial understanding, but was less preferred for this task. We offer design insights to guide the development of future systems that support LVI with visual context while balancing visual load.
[HC-20] The Perceptual Gap: Why We Need Accessible XAI for Assistive Technologies
【速读】:该论文试图解决当前可解释人工智能(Explainable AI, XAI)方法在服务感官障碍人群时存在的显著不足问题,即现有XAI技术大多未考虑残障用户的需求,且其典型解释方式对视觉或听觉障碍者难以理解。解决方案的关键在于推动面向人类中心(human-centered)和无障碍(accessibility-centered)的XAI研究范式,强调从残障用户视角出发设计可访问、易理解的解释机制,并提出未来需开展以残障群体为参与主体的评估与创新,从而实现真正包容性的可解释人工智能系统。
链接: https://arxiv.org/abs/2603.02486
作者: Shadab H. Choudhury
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Human-Computer Interaction (cs.HC)
备注: CHI '26 Poster
点击查看摘要
Abstract:Artificial intelligence systems are widely used by people with sensory disabilities, like loss of vision or hearing, to help perceive or navigate the world around them. This includes tasks like describing an image or object they cannot touch, reading documents, automatically captioning speech, and so on. Presently, models used for these tasks are based on deep neural networks and are thusly black boxes. Explainable AI (XAI) describes methods that can explain why a model gave the output it did. However, existing XAI methodologies are rarely accessible or designed with disabled users in mind. In this paper, we survey existing work in XAI with a focus on human-centered and accessibility-centered approaches or evaluations. We show that there is next-to-no XAI work that accounts for people with sensory disabilities, that many typical explanations are difficult for them to comprehend, and propose possible avenues for future work in Accessible Human-Centered XAI.
[HC-21] Break the Window: Exploring Spatial Decomposition of Webpages in XR
【速读】:该论文旨在解决当前扩展现实(Extended Reality, XR)网络浏览器仍沿用桌面端单窗口布局的问题,这种设计未能充分利用沉浸式空间的特性。其解决方案的关键在于提出“破窗”(Break-the-Window, BTW)原型,将网页内容空间化分解为可移动的UI模块(panels),支持空中与表面贴附放置,并允许直接触控和光线交互,从而实现注意力分散与空间意义建构的新模式。
链接: https://arxiv.org/abs/2603.02471
作者: Chenyang Zhang,Tianjian Wei,Haoyang Yang,Mar Gonzalez-Franco,Yalong Yang,Eric J Gonzalez
机构: Georgia Institute of Technology (佐治亚理工学院); Google (谷歌)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 5 figures. Accepted as a CHI 2026 Extended Abstract (Poster). To appear in Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI '26)
点击查看摘要
Abstract:Most XR web browsers still present webpages as a single floating window, carrying over desktop design assumptions into immersive space. We explore an alternative by breaking the browser window and distributing a webpage into spatial UI chunks within a mixed-reality workspace. We present Break-the-Window (BTW), an exploratory prototype that spatially decomposes live, fully functional webpages into movable panels supporting mid-air and surface-attached placement, as well as direct touch and ray-based interaction. Through a formative study with XR practitioners and an exploratory qualitative study with 15 participants, we observed how spatial decomposition supports distributed attention and spatial meaning-making, while also surfacing challenges around coordination effort, interaction precision, and the lack of shared spatial UI conventions. This work invites discussion on how web interfaces might be reimagined for spatial computing beyond the single-window paradigm.
[HC-22] Safe Whole-Body Loco-Manipulation via Combined Model and Learning-based Control ICRA
【速读】:该论文旨在解决足式机器人在同时进行运动(locomotion)与操作(manipulation)时,如何协调腿部运动与臂部操作以实现安全、柔顺的接触交互问题。解决方案的关键在于提出一种全身控制器(whole-body controller),其核心由两部分组成:一是基于模型的导纳控制(admittance control)用于机械臂,可将外部力旋量(wrench,如人机物理交互中由人施加的力)映射为末端执行器期望速度,从而实现柔顺行为;二是采用强化学习(Reinforcement Learning, RL)策略控制腿部运动,确保动态环境下的鲁棒性。此外,通过引入参考调节器(Reference Governor, RG)保障力控精度与安全性,并利用融合神经网络的卡尔曼滤波提升基座速度估计可靠性,最终实现了6自由度(6-DoF)统一力响应和可靠的安全性能。
链接: https://arxiv.org/abs/2603.02443
作者: Alexander Schperberg,Yeping Wang,Stefano Di Cairano
机构: Mitsubishi Electric Research Laboratories (三菱电机研究实验室)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注: Accepted to IEEE International Conference on Robotics and Automation (ICRA), June 2026, in Vienna, Austria
点击查看摘要
Abstract:Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation, while considering safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines a model-based admittance control for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches–such as those applied by a human during physical interaction–into desired end-effector velocities, allowing for compliant behavior. The velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a Kalman filter enhanced with neural networks for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot with a 6-DoF arm and wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
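文中机械臂侧的 admittance control(导纳控制)思想,可以用一阶导纳律 M·v̇ + D·v = F_ext 来最小化示意:外部力旋量被滤波为末端期望速度。以下为一个玩具数值例子(虚拟惯量 M、阻尼 D 均为假设值,与论文实现无关):

```python
import numpy as np

# 一阶导纳律 M * dv/dt + D * v = F_ext 的欧拉积分示意。
# M(虚拟惯量)、D(虚拟阻尼)均为假设值,仅用于说明原理。
def admittance_step(v, wrench, M, D, dt):
    dv = np.linalg.solve(M, wrench - D @ v)
    return v + dv * dt

M = np.diag([2.0] * 3 + [0.5] * 3)    # 平动 / 转动虚拟惯量
D = np.diag([20.0] * 3 + [5.0] * 3)   # 平动 / 转动虚拟阻尼

v = np.zeros(6)
push = np.array([10.0, 0, 0, 0, 0, 0])  # 沿 x 方向持续施加 10 N 推力
for _ in range(2000):
    v = admittance_step(v, push, M, D, dt=0.001)

# 稳态时 v -> D^{-1} F_ext,即 x 方向速度趋于 10/20 = 0.5 m/s
print(round(float(v[0]), 3))
```

持续的外力被转化为平滑收敛的期望速度,这就是"柔顺行为"的来源;论文中该速度再由臂、腿控制器联合跟踪。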
[HC-23] Learning to Pay Attention: Unsupervised Modeling of Attentive and Inattentive Respondents in Survey Data
【速读】:该论文旨在解决行为与社会科学调查中不专注受访者(inattentive respondents)识别问题,这类受访者提供随机或低努力的回答,严重威胁数据完整性。传统方法如注意力检查(attention checks)存在成本高、反应滞后且一致性差的局限。解决方案的关键在于提出一种统一的、无需标签的检测框架,通过互补的无监督视角——几何重构(Autoencoders)和概率依赖建模(Chow-Liu trees)来评分响应的一致性。其核心贡献是识别出实现无监督质量控制的结构条件:调查工具中具有连贯且重叠的题项组合(item batteries)时,即使使用线性模型也能可靠区分专注与不专注受访者,揭示了“心理测量-机器学习对齐”(Psychometric-ML Alignment)现象——即提升测量信度的设计原则(如内部一致性)同时增强了算法可检测性。该框架为调查平台提供了可扩展、领域无关的数据质量诊断工具,直接将数据质量与问卷设计关联,无需增加被试负担。
链接: https://arxiv.org/abs/2603.02427
作者: Ilias Triantafyllopoulos,Panos Ipeirotis
机构: New York University (纽约大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The integrity of behavioral and social-science surveys depends on detecting inattentive respondents who provide random or low-effort answers. Traditional safeguards, such as attention checks, are often costly, reactive, and inconsistent. We propose a unified, label-free framework for inattentiveness detection that scores response coherence using complementary unsupervised views: geometric reconstruction (Autoencoders) and probabilistic dependency modeling (Chow-Liu trees). While we introduce a “Percentile Loss” objective to improve Autoencoder robustness against anomalies, our primary contribution is identifying the structural conditions that enable unsupervised quality control. Across nine heterogeneous real-world datasets, we find that detection effectiveness is driven less by model complexity than by survey structure: instruments with coherent, overlapping item batteries exhibit strong covariance patterns that allow even linear models to reliably separate attentive from inattentive respondents. This reveals a critical “Psychometric-ML Alignment”: the same design principles that maximize measurement reliability (e.g., internal consistency) also maximize algorithmic detectability. The framework provides survey platforms with a scalable, domain-agnostic diagnostic tool that links data quality directly to instrument design, enabling auditing without additional respondent burden.
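上文"连贯题组下线性模型即可区分专注与不专注受访者"的结论,可以用一个玩具实验示意(数据为随机生成的假设情形,并非论文数据集或其 Percentile Loss 实现):专注受访者的回答近似落在低维子空间上,线性重构误差小;随机作答者的重构误差明显更大。

```python
import numpy as np

# 玩具数据:200 名专注受访者由单一潜在态度因子驱动,20 名随机作答。
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                                   # 潜在态度因子
attentive = latent @ rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(200, 8))
random_resp = rng.normal(size=(20, 8))                               # 不专注:纯噪声
X = np.vstack([attentive, random_resp])

# 用第一主成分做线性重构(相当于最简单的线性自编码器)。
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:1]                        # 第一主方向
recon = (Xc @ W.T) @ W
errors = ((Xc - recon) ** 2).sum(axis=1)

# 随机作答者的重构误差应显著更高。
print(bool(errors[:200].mean() < errors[200:].mean()))
```

这正对应论文所说的"心理测量-机器学习对齐":题组内部一致性越强,协方差结构越明显,线性模型越容易识别异常作答。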
[HC-24] A Directed Graph Model and Experimental Framework for Design and Study of Time-Dependent Text Visualisation
【速读】:该论文旨在解决人类在面对海量动态文本数据(如新闻、社交媒体内容)时,难以有效理解其演化叙事的问题。为提升可视化工具对文本间关系的可解释性,研究提出了一种基于有向图结构的时间依赖文本可视化抽象模型,并从中提炼出描述文本随时间关联模式的“模体”(motif)。解决方案的关键在于:通过现代大语言模型(LLM)生成结构化的合成文本数据集,构建受控用户实验环境(n=30),以测试用户识别预设文本关联模式的能力。结果表明,用户识别任务具有挑战性,且存在多样化的误判原因,揭示了当前可视化方法在通用性上的局限,暗示未来文本话语(discourse)可视化需向个性化适配方向发展。
链接: https://arxiv.org/abs/2603.02422
作者: Songhai Fan,Simon Angus,Tim Dwyer,Ying Yang,Sarah Goodwin,Helen Purchase
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: preprint version for TVCG submission
点击查看摘要
Abstract:Exponential growth in the quantity of digital news, social media, and other textual sources makes it difficult for humans to keep up with rapidly evolving narratives about world events. Various visualisation techniques have been touted to help people to understand such discourse by exposing relationships between texts (such as news articles) as topics and themes evolve over time. Arguably, the understandability of such visualisations hinges on the assumption that people will be able to easily interpret the relationships in such visual network structures. To test this assumption, we begin by defining an abstract model of time-dependent text visualisation based on directed graph structures. From this model we distill motifs that capture the set of possible ways that texts can be linked across changes in time. We also develop a controlled synthetic text generation methodology that leverages the power of modern LLMs to create fictional, yet structured sets of time-dependent texts that fit each of our patterns. Therefore, we create a clean user study environment (n=30) for participants to identify patterns that best represent a given set of synthetic articles. We find that it is a challenging task for the user to identify and recover the predefined motif. We analyse qualitative data to map an unexpectedly rich variety of user rationales when divergences from expected interpretation occur. A deeper analysis also points to unexpected complexities inherent in the formation of synthetic datasets with LLMs that undermine the study control in some cases. Furthermore, analysis of individual decision-making in our study hints at a future where text discourse visualisation may need to dispense with a one-size-fits-all approach and, instead, should be more adaptable to the specific user who is exploring the visualisation in front of them.
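文中"文本随时间关联的模式(motif)"可以用带时间顺序的有向图最小化示意(节点与模式命名均为假设,仅说明概念):边由较早文本指向较晚文本,入度 ≥ 2 的节点对应多条叙事线索汇聚(merge),出度 ≥ 2 的节点对应叙事分叉(split)。

```python
from collections import defaultdict

# 玩具有向图:t1..t5 按时间排序,边表示"后文承接前文"的语义关联。
edges = [("t1", "t3"), ("t2", "t3"), ("t3", "t4"), ("t3", "t5")]

indeg, outdeg = defaultdict(int), defaultdict(int)
for src, dst in edges:
    outdeg[src] += 1
    indeg[dst] += 1

merges = sorted(n for n in indeg if indeg[n] >= 2)   # 线索汇聚
splits = sorted(n for n in outdeg if outdeg[n] >= 2)  # 叙事分叉
print(merges, splits)  # ['t3'] ['t3']
```

论文的用户实验正是让被试根据一组合成文章,判断其背后对应的是哪一种此类链接模式。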
[HC-25] Strategic Shaping of Human Prosociality: A Latent-State POMDP Framework
【速读】:该论文旨在解决机器人在重复交互中如何通过策略性行为影响人类 prosocial(亲社会)状态以提升协作效率的问题。其核心挑战在于人类的亲社会性是一个随时间演变的隐变量(latent state),且机器人只能通过有限观测推断该状态。解决方案的关键在于构建一个带有隐状态的部分可观测马尔可夫决策过程(latent-state POMDP),利用期望最大化(expectation maximization)学习状态转移与观测动态,并设计基于信念(belief-based)的策略,使机器人在任务目标与社交目标之间取得平衡,从而选择能最大化长期合作结果的动作。实验表明,该策略在团队绩效和促进人类合作行为方面均优于基线方法。
链接: https://arxiv.org/abs/2603.02379
作者: Zahra Zahedi,Xinyue Hu,Shashank Mehrotra,Mark Steyvers,Kumar Akash
机构: Honda Research Institute USA, Inc.(本田研究院美国公司); University of California, Irvine(加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO); Systems and Control (eess.SY)
备注: This article has been published in IEEE Robotics and Automation Letters. this https URL
点击查看摘要
Abstract:We propose a decision-theoretic framework in which a robot can strategically shape the inferred human’s prosocial state during repeated interactions. Modeling the human’s prosociality as a latent state that evolves over time, the robot learns to infer and influence this state through its own actions, including helping and signaling. We formalize this as a latent-state POMDP with limited observations and learn the transition and observation dynamics using expectation maximization. The resulting belief-based policy balances task and social objectives, selecting actions that maximize long-term cooperative outcomes. We evaluate the model using data from user studies and show that the learned policy outperforms baseline strategies in both team performance and increasing observed human cooperative behavior.
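论文将人的亲社会性建模为隐状态,由信念(belief)驱动决策;其底层的贝叶斯滤波更新可以用一个两状态玩具模型示意(转移矩阵 T 与观测矩阵 O 均为假设值,论文中由期望最大化学得):

```python
import numpy as np

# 两状态隐变量:索引 0 = 亲社会(prosocial),索引 1 = 利己(selfish)。
T = np.array([[0.9, 0.1],    # P(s' | s):亲社会状态倾向于保持
              [0.3, 0.7]])
O = np.array([[0.8, 0.2],    # P(o | s'):观测到"合作"(0) 或"不合作"(1)
              [0.4, 0.6]])

def belief_update(b, obs):
    """标准贝叶斯滤波:经 T 预测,再用 O 的似然修正并归一化。"""
    predicted = b @ T
    updated = predicted * O[:, obs]
    return updated / updated.sum()

b = np.array([0.5, 0.5])
for obs in [0, 0, 0]:        # 连续观测到三次合作行为
    b = belief_update(b, obs)
print(round(float(b[0]), 2))  # 对"亲社会"的信念升至约 0.9
```

论文中的策略正是在这样的信念之上选取动作(帮助、发信号等),以最大化长期合作收益。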
[HC-26] PlayWrite: A Multimodal System for AI Supported Narrative Co-Authoring Through Play in XR
【速读】:该论文旨在解决当前基于文本提示的AI写作工具难以支持叙事创作中空间性与交互性的核心问题,这类工具无法有效捕捉用户通过直接操作和游戏化探索所激发的创意流动。其解决方案的关键在于提出PlayWrite这一混合现实系统,通过多智能体AI流水线将用户对虚拟角色和道具的直接操作转化为结构化的“意图框架”(Intent Frames),以可重排的“故事弹珠”形式呈现于时间轴上,并由大型语言模型将其转化为最终叙事文本,从而实现以直接操纵和游戏化交互为核心的新一代协同创作范式。
链接: https://arxiv.org/abs/2603.02366
作者: Esen K. Tütüncü,Qian Zhou,Frederik Brudy,George Fitzmaurice,Fraser Anderson
机构: University of Barcelona (巴塞罗那大学); Autodesk Research (欧特克研究)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Current AI writing tools, which rely on text prompts, poorly support the spatial and interactive nature of storytelling where ideas emerge from direct manipulation and play. We present PlayWrite, a mixed-reality system where users author stories by directly manipulating virtual characters and props. A multi-agent AI pipeline interprets these actions into Intent Frames -structured narrative beats visualized as rearrangeable story marbles on a timeline. A large language model then transforms the user’s assembled sequence into a final narrative. A user study (N=13) with writers from varying domains found that PlayWrite fosters a highly improvisational and playful process. Users treated the AI as a collaborative partner, using its unexpected responses to spark new ideas and overcome creative blocks. PlayWrite demonstrates an approach for co-creative systems that move beyond text to embrace direct manipulation and play as core interaction modalities.
[HC-27] Pulli Kolam: A Traditional South Indian Craft Practice for Representing Data
【速读】:该论文试图解决如何将数据可视化从数字界面迁移至物理空间,同时不破坏传统手工艺的文化内涵与实践体验的问题。解决方案的关键在于识别并利用Pulli Kolam这一南印度传统粉画技艺中的五种映射策略(点、图案、填充、线条和色彩),将其作为物理数据表示的媒介,在不干扰原有仪式性与创造性实践的前提下,实现数据的嵌入式表达。通过日常福祉追踪的示例场景,论文展示了如何将数据感知融入例行手工活动中,从而拓展了传统技艺在人机交互与可持续数据呈现中的应用潜力。
链接: https://arxiv.org/abs/2603.02343
作者: Shri Harini Ramesh,Fateme Rajabiyazdi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to the CHI 2026 Workshop Craft-Based Data Physicalization: Opportunities and Challenges, co-located with the ACM Conference on Human Factors in Computing Systems (CHI 2026)
点击查看摘要
Abstract:This paper introduces Pulli Kolam, a traditional South Indian craft, as a medium for physical data representation. Grounded in its cultural meaning and embodied practice, Pulli Kolam follows structured geometric rules while allowing creative variation. We identify five mapping strategies within Kolam (dots, patterns, fills, lines, and color) that can be used for representing data physically, without disrupting traditional practice. Through an illustrative scenario of daily well-being tracking, we demonstrate how data representation can be embedded within routine craft practice. We conclude by outlining potential material adaptations that extend Kolam beyond its ephemeral form while maintaining its embodied and ritual qualities.
[HC-28] In the Arms of a Robot: Designing Autonomous Hugging Robots with Intra-Hug Gestures
【速读】:该论文旨在解决如何设计具有情感交互能力的拥抱机器人(hugging robot)以实现更自然、愉悦的人机互动问题。其核心解决方案在于提出六条新的交互设计指南,并通过两阶段实验验证了基于感知算法与概率行为模型的闭环控制系统:首先利用压力传感器和麦克风采集用户在拥抱过程中的四种手势(握持、摩擦、拍打、挤压)数据,构建准确率达88%的动作分类算法;随后基于用户对机器人响应的偏好建立概率性行为决策机制,使机器人能实时选择多样化且主动发起互动的手势(如机器人-initiated intra-hug gestures),从而显著提升用户体验的自然度、愉悦感及智能感知。
链接: https://arxiv.org/abs/2202.09935
作者: Alexis E. Block,Hasti Seifi,Otmar Hilliges,Roger Gassert,Katherine J. Kuchenbecker
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 48 pages, 22 figures, 4 supplementary videos
点击查看摘要
Abstract:Hugs are complex affective interactions that often include gestures like squeezes. We present six new guidelines for designing interactive hugging robots, which we validate through two studies with our custom robot. To achieve autonomy, we investigated robot responses to four human intra-hug gestures: holding, rubbing, patting, and squeezing. Thirty-two users each exchanged and rated sixteen hugs with an experimenter-controlled HuggieBot 2.0. The robot’s inflated torso’s microphone and pressure sensor collected data of the subjects’ demonstrations that were used to develop a perceptual algorithm that classifies user actions with 88% accuracy. Users enjoyed robot squeezes regardless of their performed action; they valued variety in the robot response and appreciated robot-initiated intra-hug gestures. From average user ratings, we created a probabilistic behavior algorithm that chooses robot responses in real time. We implemented improvements to the robot platform to create HuggieBot 3.0 and then validated its gesture perception system and behavior algorithm with sixteen users. The robot’s responses and proactive gestures were greatly enjoyed. Users found the robot more natural, enjoyable, and intelligent in the last phase of the experiment than in the first. After the study, they felt more understood by the robot and thought robots were nicer to hug.
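文中"基于平均用户评分、实时选择机器人响应的概率性行为算法",其机制可以用按评分加权的随机采样来示意(评分数值为虚构,仅说明原理;论文使用的是用户研究中的真实均值):

```python
import random

# 虚构的四种拥抱内手势响应的平均用户评分(假设值)。
ratings = {"hold": 6.2, "rub": 7.1, "pat": 6.8, "squeeze": 8.0}

def choose_response(rng):
    """按评分加权随机选取响应:高评分动作更常被选中,但保留多样性。"""
    actions = list(ratings)
    weights = [ratings[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {a: 0 for a in ratings}
for _ in range(10_000):
    counts[choose_response(rng)] += 1
print(counts["squeeze"] > counts["hold"])  # 评分更高的响应被选中更多
```

相比总是选最高评分的动作,这种加权采样既偏向用户喜爱的响应,又满足了用户"重视响应多样性"的反馈。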
[HC-29] The Six Hug Commandments: Design and Evaluation of a Human-Sized Hugging Robot with Visual and Haptic Perception
【速读】:该论文旨在解决人与机器人之间缺乏自然、令人愉悦的社交触觉互动的问题,尤其是如何通过机器人拥抱来提升个体的社会支持感和心理福祉。研究表明,社会触觉(social touch)对个体健康至关重要,而现有机器人在模拟人类拥抱时往往忽视了触觉反馈与环境适应性。解决方案的关键在于提出并验证了六条“行为准则”(tenets),其中前两条(柔软性和温暖性)已有文献支持,后四条为创新性提出:机器人应具备人形尺寸、视觉感知用户、根据用户体型和位置自适应调整拥抱力度,并在用户希望结束时可靠释放。研究基于这些准则开发了HuggieBot 2.0平台,结合视觉与触觉传感实现闭环控制,实验表明加入触觉反应显著提升了用户体验,验证了新四项准则的有效性,为未来具身交互机器人设计提供了重要方向。
链接: https://arxiv.org/abs/2101.07679
作者: Alexis E. Block,Sammy Christen,Roger Gassert,Otmar Hilliges,Katherine J. Kuchenbecker
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 9 pages, 6 Figures, 2 Tables, ACM/IEEE Human-Robot Interaction (HRI) Conference 2021
点击查看摘要
Abstract:Receiving a hug is one of the best ways to feel socially supported, and the lack of social touch can have severe negative effects on an individual’s well-being. Based on previous research both within and outside of HRI, we propose six tenets (“commandments”) of natural and enjoyable robotic hugging: a hugging robot should be soft, be warm, be human sized, visually perceive its user, adjust its embrace to the user’s size and position, and reliably release when the user wants to end the hug. Prior work validated the first two tenets, and the final four are new. We followed all six tenets to create a new robotic platform, HuggieBot 2.0, that has a soft, warm, inflated body (HuggieChest) and uses visual and haptic sensing to deliver closed-loop hugging. We first verified the outward appeal of this platform in comparison to the previous PR2-based HuggieBot 1.0 via an online video-watching study involving 117 users. We then conducted an in-person experiment in which 32 users each exchanged eight hugs with HuggieBot 2.0, experiencing all combinations of visual hug initiation, haptic sizing, and haptic releasing. The results show that adding haptic reactivity definitively improves user perception of a hugging robot, largely verifying our four new tenets and illuminating several interesting opportunities for further improvement.
计算机视觉
[CV-0] Utonia: Toward One Encoder for All Point Clouds
【速读】:该论文旨在解决跨域点云数据难以统一建模的问题,即如何在不同传感几何、密度和先验条件下(如遥感、室外LiDAR、室内RGB-D序列、以物体为中心的CAD模型及仅从RGB视频中提取的点云),训练出一个通用的自监督点变换器编码器,从而实现一致的表示空间并提升多场景下的感知能力。其解决方案的关键在于提出Utonia框架,通过联合训练多个异构点云域,使模型学习到跨域共享的特征表示,并在此基础上展现出仅在多域协同训练时才会涌现的新兴行为,例如增强机器人操作能力和改善视觉-语言-动作策略的推理性能,为稀疏3D数据领域的基础模型奠定基础。
链接: https://arxiv.org/abs/2603.03283
作者: Yujia Zhang,Xiaoyang Wu,Yunhan Yang,Xianzhe Fan,Han Li,Yuechen Zhang,Zehao Huang,Naiyan Wang,Hengshuang Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: produced by Pointcept, project page: this https URL
点击查看摘要
Abstract:We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.
[CV-1] CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance CVPR2026
【速读】:该论文旨在解决Classifier-Free Guidance (CFG) 在流式扩散模型中因依赖线性控制策略而导致的不稳定性、超调现象及大引导尺度下语义保真度下降的问题。其解决方案的关键在于提出一种基于滑模控制(Sliding Mode Control, SMC)的新型引导机制——SMC-CFG,通过定义指数型滑模面来刻画语义预测误差,并引入非线性切换控制项实现反馈引导的修正,从而确保生成流在有限时间内收敛至快速收敛的滑模流形上;同时,作者提供了李雅普诺夫稳定性分析以理论证明该方法的收敛性与鲁棒性,实验表明其在多类文本到图像生成模型中显著优于传统CFG,在广泛引导尺度下均保持更强的语义对齐能力与稳定性。
链接: https://arxiv.org/abs/2603.03281
作者: Hanyang Wang,Yiyang Liu,Jiawei Chi,Fangfu Liu,Ran Xue,Yueqi Duan
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVPR 2026; Project Page: this https URL
点击查看摘要
Abstract:Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: this https URL
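把 vanilla CFG 解读为固定增益比例控制器(P-control)这一点,可以用几行代码直观示意(玩具速度场取值为假设,并非论文的 SMC-CFG 实现):条件与无条件速度场之差即"误差信号",引导尺度 w 即比例增益。

```python
import numpy as np

# v_guided = v_uncond + w * (v_cond - v_uncond):固定增益的 P 控制。
def cfg_velocity(v_cond, v_uncond, w):
    error = v_cond - v_uncond       # 条件-无条件差异,即误差信号
    return v_uncond + w * error

v_uncond = np.array([0.2, -0.1])   # 玩具无条件速度场取值
v_cond = np.array([0.5, 0.3])      # 玩具条件速度场取值

print(cfg_velocity(v_cond, v_uncond, w=0.0))  # w=0 退化为无条件场
print(cfg_velocity(v_cond, v_uncond, w=1.0))  # w=1 恰好恢复条件场
```

w > 1 时输出沿误差方向外推(2·v_cond − v_uncond 等),正是论文所指出的线性控制在大引导尺度下易超调、不稳定的根源;SMC-CFG 用非线性切换项替换这一固定增益修正。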
[CV-2] How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
【速读】:该论文旨在解决自主机器人在执行高难度、接触密集且对力敏感的操作任务(如食品准备、外科手术和手工艺)时面临的挑战,这些问题通常缺乏明确的成功标准,导致定量评估与奖励工程困难。解决方案的关键在于提出一个两阶段学习框架:第一阶段通过力感知的数据采集与模仿学习获得鲁棒的初始策略,实现跨物体变化的泛化;第二阶段利用结合定量任务指标与定性人类反馈的奖励模型进行基于偏好的微调,从而将策略行为与人类对任务质量的主观认知对齐。该方法仅需50–200条操作轨迹即可在黄瓜、苹果和土豆等复杂食材上实现超过90%的平均成功率,并通过偏好微调进一步提升性能达40%,同时展现出对同类别未见实例及分布外类别的零样本泛化能力。
链接: https://arxiv.org/abs/2603.03280
作者: Toru Lin,Shuying Deng,Zhao-Heng Yin,Pieter Abbeel,Jitendra Malik
机构: University of California, Berkeley (加州大学伯克利分校); Tsinghua University (清华大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Project page can be found at this https URL
点击查看摘要
Abstract:Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their “implicit” success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.
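第二阶段"结合人类偏好反馈学习奖励模型"的一类常见做法是 Bradley-Terry 式的成对偏好建模;下面是一个与论文实现无关的最小示意(分数为虚构):人类偏好第一条轨迹的概率建模为 sigmoid(r_a − r_b),训练目标是其负对数似然。

```python
import math

# Bradley-Terry 偏好损失:P(偏好 a) = sigmoid(r_a - r_b),损失为负对数似然。
def preference_loss(r_preferred, r_rejected):
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = preference_loss(2.0, 0.5)   # 奖励模型与人类偏好一致:损失小
bad = preference_loss(0.5, 2.0)    # 奖励模型与人类偏好相反:损失大
print(good < bad)                  # True
```

奖励模型学会给"削得更好"的轨迹更高分后,即可作为偏好微调阶段的反馈信号,替代难以手工设计的连续、主观任务质量指标。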
[CV-3] ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
【速读】:该论文旨在解决人形机器人在实现自主且多功能的全身运动与操作(whole-body loco-manipulation)过程中面临的核心挑战:现有方法受限于数据稀缺或质量低、难以扩展至大规模技能库,以及依赖预定义运动参考而非基于感知和高层任务指令生成行为。解决方案的关键在于提出ULTRA框架,其包含两个核心组件:一是基于物理驱动的神经重定向算法(physics-driven neural retargeting algorithm),可将大规模动作捕捉数据高效映射到人形机器人本体并保持接触密集交互中的物理合理性;二是统一的多模态控制器(unified multimodal controller),支持从精确状态到噪声视觉输入的不同感知模式,并通过强化学习微调提升泛化能力和鲁棒性,从而实现仅凭稀疏意图即可协调全身行为,无需测试时的参考运动。
链接: https://arxiv.org/abs/2603.03279
作者: Xialin He,Sirui Xu,Xinyao Li,Runpei Dong,Liuyu Bian,Yu-Xiong Wang,Liang-Yan Gui
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
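摘要提到将运动技能"压缩到紧凑潜空间"。作为一个最粗粒度的类比(假设性示意,并非 ULTRA 的实际蒸馏方法),可以用 PCA 把教师策略的动作集合投影到低维潜空间,再由潜变量解码回动作;数据与维度均为虚构。

```python
import numpy as np

def distill_latent_skills(teacher_actions, k):
    # Compress teacher action vectors into a k-dim latent space via PCA,
    # a crude stand-in for the learned compact skill latent.
    mu = teacher_actions.mean(0)
    X = teacher_actions - mu
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    B = Vt[:k]  # latent basis, shape (k, action_dim)

    def encode(a):
        return (a - mu) @ B.T

    def decode(z):
        return z @ B + mu

    return encode, decode

# Hypothetical teacher actions lying in a 2-d subspace of R^4.
actions = np.array([[1., 2., 0., 0.],
                    [3., 1., 0., 0.],
                    [0., 5., 0., 0.],
                    [2., 2., 0., 0.]])
encode, decode = distill_latent_skills(actions, k=2)
recon = decode(encode(actions))
```

当动作本身落在低维子空间时,2 维潜变量即可无损重建;实际系统中这一压缩由神经网络蒸馏完成,并配合强化学习微调扩大覆盖范围。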
[CV-4] Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping ICLR
【速读】:该论文旨在解决机器人自主交互与学习中的核心挑战:如何在无需大量人工示范的情况下,通过自主探索(即“玩”)实现高效、鲁棒的技能习得。具体而言,问题包括(1)设计对多样化、可能分布外环境状态具有鲁棒性的策略,以及(2)构建持续生成高质量机器人经验的机制。解决方案的关键在于提出Tether方法,其核心创新为:首先设计一种基于语义关键点对应关系的开环策略,将少量源示范(≤10次)的动作映射到目标场景中,从而实现极高的数据效率和跨空间与语义变化的鲁棒性;其次,利用视觉-语言模型驱动的闭环流程(任务选择、执行、评估与改进),在真实世界中实现多任务自主功能性玩耍,生成高质量、多样化的数据集,显著提升闭合回路模仿策略的性能,最终获得超过1000条专家级轨迹,并训练出媲美人类示范所学策略的模型。
链接: https://arxiv.org/abs/2603.03278
作者: William Liang,Sam Wang,Hung-Ju Wang,Osbert Bastani,Yecheng Jason Ma,Dinesh Jayaraman
机构: University of Pennsylvania (宾夕法尼亚大学); University of California, Berkeley (加州大学伯克利分校); Dyna Robotics (Dyna Robotics)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Learning Representations (ICLR), 2026. Project website and code: this https URL
点击查看摘要
Abstract:The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such “play” requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (≤10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.
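摘要中"将源示范的动作锚定到目标场景的语义关键点对应关系上进行扭曲",其几何内核可以用 Umeyama 最小二乘相似变换来示意(假设性草图,非论文实现):由源/目标关键点对估计尺度、旋转与平移,再把整条示范轨迹映射到目标场景;下面的关键点坐标均为虚构示例。

```python
import numpy as np

def fit_similarity(src, tgt):
    # Umeyama least-squares similarity transform (scale, rotation,
    # translation) mapping source keypoints onto their correspondences.
    mu_s, mu_t = src.mean(0), tgt.mean(0)
    S, T = src - mu_s, tgt - mu_t
    cov = T.T @ S / len(src)
    U, D, Vt = np.linalg.svd(cov)
    d = np.ones(len(D))
    d[-1] = np.sign(np.linalg.det(U) * np.linalg.det(Vt))  # avoid reflections
    R = U @ np.diag(d) @ Vt
    scale = (D * d).sum() / S.var(0).sum()
    t = mu_t - scale * R @ mu_s
    return scale, R, t

def warp_trajectory(traj, scale, R, t):
    # Apply the fitted transform to every waypoint of a demo trajectory.
    return scale * traj @ R.T + t

# Hypothetical correspondences: target scene is the source scaled by 2
# and shifted by (1, -1).
src_kp = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
tgt_kp = 2.0 * src_kp + np.array([1.0, -1.0])
scale, R, t = fit_similarity(src_kp, tgt_kp)
warped = warp_trajectory(np.array([[0.5, 0.5]]), scale, R, t)
```

源轨迹点 (0.5, 0.5) 被映射到 (2, 0),与关键点所定义的场景变化一致;实际系统中关键点对应由视觉模型给出,使同一套示范能泛化到空间与语义都发生变化的新场景。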
[CV-5] Beyond Language Modeling: An Exploration of Multimodal Pretraining
【速读】:该论文旨在解决当前基础模型在多模态(尤其是视觉与语言融合)预训练过程中设计空间不清晰、缺乏系统性实证研究的问题,特别是在无语言预训练干扰的前提下,厘清视觉与语言模态协同作用机制及高效扩展策略。其解决方案的关键在于采用Transfusion框架进行从头预训练实验,通过分离视觉和语言的建模方式(语言使用next-token预测,视觉采用扩散模型),并引入Representation Autoencoder(RAE)构建统一视觉表征,同时利用Mixture-of-Experts(MoE)架构实现高效且具有模态特化的多模态扩展。该方案不仅揭示了视觉与语言数据的互补性与协同效应,还发现视觉对数据量的需求显著高于语言,而MoE能有效调和这种尺度不对称性,从而推动真正统一的多模态基础模型发展。
链接: https://arxiv.org/abs/2603.03276
作者: Shengbang Tong,David Fan,John Nguyen,Ellis Brown,Gaoyue Zhou,Shengyi Qian,Boyang Zheng,Théophane Vallaeys,Junlin Han,Rob Fergus,Naila Murray,Marjan Ghazvininejad,Mike Lewis,Nicolas Ballas,Amir Bar,Michael Rabbat,Jakob Verbeek,Luke Zettlemoyer,Koustuv Sinha,Yann LeCun,Saining Xie
机构: 未知
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; Microsoft AI; Google AI; Meta AI; NVIDIA AI; Apple AI; Amazon AI; IBM AI; Huawei AI; Baidu AI; Alibaba AI; Tencent AI; Xiaomi AI; Samsung AI; Intel AI; AMD AI; Qualcomm AI; Tesla AI; Waymo AI; Uber AI; Lyft AI; Cruise AI; Zoox AI; Nuro AI; Aurora AI; Argo AI; Motional AI; Pony.ai AI; Momenta AI; Horizon Robotics AI; Hikvision AI; Dahua Technology AI; SenseTime AI; Megvii AI; Yitu Technology AI; Cloudwalk Technology AI; iFlytek AI; Baidu Research AI; Alibaba DAMO Academy AI; Tencent AI Lab AI; 
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website at this https URL
点击查看摘要
Abstract:The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
[CV-6] LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
【速读】:该论文旨在解决当前前馈式几何基础模型在处理分钟级长视频时面临的两大瓶颈问题:一是由于二次注意力复杂度导致的计算效率低下,二是递归设计中有效记忆容量有限。其核心解决方案是提出LoGeR(Long-context Geometric Reconstruction)架构,关键创新在于引入一种基于学习的混合记忆模块——该模块由参数化测试时训练(Test-Time Training, TTT)记忆和非参数化滑动窗口注意力(Sliding Window Attention, SWA)机制组成,分别用于锚定全局坐标系以防止尺度漂移、并保留未压缩的上下文信息以实现高精度相邻帧对齐,从而实现无需后优化即可在数千帧长序列上进行高质量稠密3D重建。
链接: https://arxiv.org/abs/2603.03269
作者: Junyi Zhang,Charles Herrmann,Junhwa Hur,Chen Sun,Ming-Hsuan Yang,Forrester Cole,Trevor Darrell,Deqing Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL
点击查看摘要
Abstract:Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods–reducing ATE on KITTI by over 74%–and achieves robust, globally consistent reconstruction over unprecedented horizons.
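LoGeR 的混合记忆可以用一个极简的"分块处理 + 滑动窗口上下文 + 压缩全局状态"流程来示意。以下为纯 Python 草图,其中块大小、窗口大小以及用"求和"代表的全局记忆更新均为本文假设的玩具简化,并非论文实现:

```python
from collections import deque

def chunked_stream(frames, chunk_size=4, window=2):
    """按块消费帧流: 每块处理时附带滑动窗口中最近 window 帧的
    未压缩上下文(示意 SWA 的作用), 全局记忆则只保留一个
    被不断更新的紧凑状态(示意参数化 TTT 记忆)。"""
    sliding = deque(maxlen=window)   # 非参数化: 原样保留最近帧
    global_state = 0                 # 参数化: 压缩的全局摘要(此处用求和示意)
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        context = list(sliding)      # 跨块边界的未压缩上下文
        outputs.append((chunk, context, global_state))
        for f in chunk:
            sliding.append(f)        # 超出窗口的旧帧被自动丢弃
            global_state += f        # 更新全局记忆
    return outputs

result = chunked_stream(list(range(10)), chunk_size=4, window=2)
```

如上,每个新块既能访问紧凑的全局状态(防止漂移),又能访问相邻块的原始帧(保证局部对齐精度),这正是摘要中双组件记忆想表达的分工。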
[CV-7] DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction CVPR2026
【速读】:该论文旨在解决从噪声大或观测不完整的非约束视频中恢复世界坐标系下的人体运动问题,其核心挑战在于如何在多样且嘈杂的视频输入中实现泛化能力的同时保持全局运动一致性。解决方案的关键在于提出一种双扩散模型(DuoMo)架构:首先通过相机坐标系模型估计视频中的运动,再由世界坐标系模型将初始估计提升至世界坐标并进行全局一致性优化,从而实现跨场景和轨迹的高精度运动重建。该方法直接生成网格顶点的运动,跳过参数化建模,显著提升了重建精度与鲁棒性。
链接: https://arxiv.org/abs/2603.03265
作者: Yufu Wang,Evonne Ng,Soyong Shin,Rawal Khirodkar,Yuan Dong,Zhaoen Su,Jinhyung Park,Kris Kitani,Alexander Richard,Fabian Prada,Michael Zollhofer
机构: Meta Reality Labs (Meta); University of Pennsylvania; Carnegie Mellon University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL
点击查看摘要
Abstract:We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: this https URL
[CV-8] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
【速读】:该论文旨在解决统一多模态模型中生成式能力(generation)是否以及在何种条件下能够提升理解性能(understanding)这一关键问题。现有基准测试缺乏对生成到理解(G2U)任务的系统性划分与评估,导致对生成机制如何影响理解的认知不足。为应对这一挑战,作者提出了UniG2U-Bench,一个涵盖7类任务和30个子任务的综合性评测基准,要求模型在不同层次上进行隐式或显式的视觉转换。解决方案的关键在于通过大规模模型评估揭示三方面核心发现:(1) 统一模型通常弱于基础视觉-语言模型(VLMs),且“先生成再回答”(GtA)推理模式往往劣于直接推理;(2) 在空间智能、视觉错觉和多轮推理等子任务中,生成过程能显著增强性能,尤其受益于空间感知能力和中间图像状态的构建;(3) 具有相似推理结构的任务与共享架构的模型表现出行为相关性,表明生成-理解耦合会诱导类别一致的归纳偏置(inductive biases),从而提示未来需引入更丰富的训练数据和创新范式以充分释放统一多模态建模潜力。
链接: https://arxiv.org/abs/2603.03241
作者: Zimo Wen,Boxiu Li,Wanbo Zhang,Junxiang Lei,Xiaoyu Chen,Yijia Fan,Qi Zhang,Yujiang Wang,Lili Qiu,Bo Li,Ziwei Liu,Caihua Shan,Yifan Yang,Yifei Shen
机构: Microsoft Research Asia (微软亚洲研究院); Shanghai Jiao Tong University (上海交通大学); Nanyang Technological University (南洋理工大学); Fudan University (复旦大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
[CV-9] COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data – Generation Stochastic by Design
【速读】:该论文旨在解决多模态地球观测(Earth Observation, EO)数据融合中因条件映射非单射性(non-injective)而导致的不确定性建模难题,即相同条件信息可能对应多个物理上合理的观测结果,而传统确定性模型往往仅输出条件均值,无法有效捕捉这种多样性与不确定性,从而限制了其在数据补全、跨传感器翻译等任务中的表现。解决方案的关键在于提出COP-GEN——一种基于潜在扩散Transformer的多模态生成模型,通过将跨模态映射参数化为条件概率分布而非确定性函数,实现了任意模态间的灵活条件生成(包括零样本模态翻译、光谱波段填补及部分/缺失输入下的生成),且无需针对特定任务重新训练,从而在保持高保真度的同时生成多样且物理一致的实现实例。
链接: https://arxiv.org/abs/2603.03239
作者: Miguel Espinosa,Eva Gmelich Meijling,Valerio Marsocci,Elliot J. Crowley,Mikolaj Czerkawski
机构: European Space Agency (ESA); School of Engineering, University of Edinburgh; Asterisk Labs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: this https URL
[CV-10] Specificity-aware reinforcement learning for fine-grained open-world classification CVPR2026
【速读】:该论文旨在解决开放世界场景下细粒度图像分类中模型预测过于泛化的问题,即如何在不牺牲准确性的前提下提升模型预测的特异性(specificity)。其核心挑战在于:尽管推理型大视觉语言模型(reasoning Large Multimodal Models, LMMs)具备内在的细粒度领域知识,但现有方法难以有效引导其输出更具体、区分度更高的类别标签。解决方案的关键在于提出一种新颖的特异性感知强化学习框架 SpeciaRL,该框架通过引入基于验证器的动态奖励信号,以在线滚动过程中最优预测为锚点,既鼓励模型生成更具特异性的预测,又避免因过度追求特异性而导致错误预测,从而在正确性(correctness)与特异性之间实现更优平衡。
链接: https://arxiv.org/abs/2603.03197
作者: Samuele Angheben,Davide Berasi,Alessandro Conti,Elisa Ricci,Yiming Wang
机构: University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺·凯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
点击查看摘要
Abstract:Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model’s capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at this https URL.
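摘要中"以在线 rollout 内的最优预测为锚点"的验证器奖励可以粗略示意如下。这里用标签在细粒度层级中的深度来度量特异性;奖励形式与层级数据均为本文假设的示意,并非论文公式:

```python
def rollout_reward(predictions, hierarchy):
    """hierarchy: 标签 -> 在细粒度层级中的深度(越深越特异)。
    正确预测按其特异性相对于本批 rollout 中最特异的正确预测打分;
    错误预测(不在层级中)奖励为 0, 避免为追求特异性而牺牲正确性。"""
    depths = [hierarchy.get(p, -1) for p in predictions]
    correct = [d for d in depths if d >= 0]
    if not correct:
        return [0.0] * len(predictions)
    anchor = max(correct) or 1       # 本批 rollout 的最优(最特异)预测作为锚点
    return [d / anchor if d >= 0 else 0.0 for d in depths]

# 假设的层级: "bird" < "sparrow" < "house sparrow"
h = {"bird": 1, "sparrow": 2, "house sparrow": 3}
rewards = rollout_reward(["bird", "house sparrow", "cat"], h)
```

这样,过于泛化但正确的预测("bird")得到部分奖励,最特异且正确的预测得到满分,而错误预测("cat")得零分——与摘要描述的"促进特异性同时尊重模型能力上限"的动机一致。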
[CV-11] Chain of World: World Model Thinking in Latent Motion CVPR2026
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在具身智能任务中对视觉动态的预测能力和时序因果结构建模不足的问题。现有方法如世界模型VLA虽能预测未来帧,但冗余重建背景造成计算资源浪费;而潜在动作VLA虽压缩了帧间变化表示,却缺乏连续时序动态建模与世界知识表达。其解决方案的关键在于提出CoWVLA(Chain-of-World VLA)新范式,通过预训练视频变分自编码器(video VAE)显式分离视频片段为结构与运动潜变量,并在预训练阶段让VLA从指令和初始帧推断连续潜运动链并预测终帧;随后在联合微调阶段,将该潜动态与离散动作预测对齐,通过统一自回归解码器联合建模稀疏关键帧与动作序列,从而兼顾世界模型的时序推理能力与潜在动作的紧凑性及可解释性,实现高效视觉运动学习。
链接: https://arxiv.org/abs/2603.03195
作者: Fuxiang Yang,Donglin Di,Lulu Tang,Xuancheng Zhang,Lei Fan,Hao Li,Chen Wei,Tonghua Su,Baorui Ma
机构: Harbin Institute of Technology (哈尔滨工业大学); Li Auto (理想汽车); Beijing Academy of Artificial Intelligence (北京智源人工智能研究院); University of New South Wales (新南威尔士大学); Chongqing Research Institute of HIT (哈尔滨工业大学重庆研究院); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted by CVPR2026. Project page: this https URL
点击查看摘要
Abstract:Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new “Chain of World” paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment’s terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at this https URL.
[CV-12] ProSMA-UNet: Decoder Conditioning for Proximal-Sparse Skip Feature Selection
【速读】:该论文旨在解决医学图像分割中传统U-Net类架构因跳接路径(skip connections)引入低级纹理、背景杂波和采集噪声,导致无关信息绕过深层语义过滤的问题,尤其在低对比度临床成像场景下影响显著。解决方案的关键在于提出ProSMA-UNet(Proximal-Sparse Multi-Scale Attention U-Net),其将跳接门控重构为解码器条件下的稀疏特征选择问题:首先利用轻量级深度可分离空洞卷积构建多尺度兼容性场以捕捉局部与上下文尺度的相关性;其次通过ℓ₁近端算子施加显式稀疏性,结合可学习的通道级阈值实现闭式软阈值门控,从而有效移除噪声响应;此外,进一步引入由全局解码器上下文驱动的通道门控机制以抑制语义无关通道,显著提升分割精度,尤其在复杂3D任务中性能提升达约20%。
链接: https://arxiv.org/abs/2603.03187
作者: Chun-Wun Cheng,Yanqi Cheng,Peiyuan Jing,Guang Yang,Carola-Bibiane Schönlieb,Angelica I. Aviles-Rivero
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image segmentation commonly relies on U-shaped encoder-decoder architectures such as U-Net, where skip connections preserve fine spatial detail by injecting high-resolution encoder features into the decoder. However, these skip pathways also propagate low-level textures, background clutter, and acquisition noise, allowing irrelevant information to bypass deeper semantic filtering – an issue that is particularly detrimental in low-contrast clinical imaging. Although attention gates have been introduced to address this limitation, they typically produce dense sigmoid masks that softly reweight features rather than explicitly removing irrelevant activations. We propose ProSMA-UNet (Proximal-Sparse Multi-Scale Attention U-Net), which reformulates skip gating as a decoder-conditioned sparse feature selection problem. ProSMA constructs a multi-scale compatibility field using lightweight depthwise dilated convolutions to capture relevance across local and contextual scales, then enforces explicit sparsity via an \ell_1 proximal operator with learnable per-channel thresholds, yielding a closed-form soft-thresholding gate that can remove noisy responses. To further suppress semantically irrelevant channels, ProSMA incorporates decoder-conditioned channel gating driven by global decoder context. Extensive experiments on challenging 2D and 3D benchmarks demonstrate state-of-the-art performance, with particularly large gains ( \approx20 %) on difficult 3D segmentation tasks. Project page: this https URL
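摘要中的 ℓ₁ 近端算子有闭式解,即软阈值(soft-thresholding)操作:prox(x) = sign(x)·max(|x|−τ, 0)。以下为其逐通道门控的纯 Python 示意(阈值数值与特征形状均为本文假设,并非论文官方实现):

```python
def soft_threshold(x, tau):
    """ℓ1 近端算子的闭式解: sign(x) * max(|x| - tau, 0)。"""
    if x >= 0:
        return max(x - tau, 0.0)
    return min(x + tau, 0.0)

def prox_gate(features, thresholds):
    """对跳接特征逐通道施加稀疏门控: thresholds 对应论文中
    每通道可学习的阈值; 低于阈值的响应被显式置零(移除噪声),
    其余响应幅度收缩, 而非稠密 sigmoid 式的软加权。"""
    return [
        [soft_threshold(v, tau) for v in channel]
        for channel, tau in zip(features, thresholds)
    ]

# 两个通道、各 4 个激活值; 第二通道阈值更高, 置零更多响应
feats = [[0.9, -0.2, 0.05, -1.1], [0.3, -0.6, 1.5, 0.1]]
gated = prox_gate(feats, [0.25, 0.5])
```

与注意力门的稠密掩码不同,这里小幅激活被精确置零,体现了摘要强调的"显式稀疏性"与"移除噪声响应"。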
[CV-13] Conditioned Activation Transport for T2I Safety Steering
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成模型在推理阶段易产生不安全和有害内容的问题,同时避免因干预措施导致良性提示生成的图像质量下降。其核心挑战在于如何在不损害正常生成效果的前提下实现对潜在有害内容的有效抑制。解决方案的关键在于提出Conditioned Activation Transport (CAT)框架,该框架基于构建的SafeSteerDataset(包含2300对高余弦相似度的安全与不安全提示对),采用几何条件机制和非线性传输映射,仅在检测到不安全激活区域时触发干预,从而最小化对良性查询的干扰,实现在降低攻击成功率的同时保持图像保真度。
链接: https://arxiv.org/abs/2603.03163
作者: Maciej Chrabąszcz,Aleksander Szymczyk,Jan Dubiński,Tomasz Trzciński,Franziska Boenisch,Adam Dziedzic
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset containing 2300 safe and unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that employs a geometry-based conditioning mechanism and nonlinear transport maps. By conditioning transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures: Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity compared to unsteered generations. Warning: This paper contains potentially offensive text and images.
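摘要中"仅在不安全激活区域内触发传输映射"的几何条件机制,可以用一个极简的余弦相似度判断来示意。其中不安全方向、阈值与这里用投影移除充当的"传输映射"均为本文假设的简化,并非论文的非线性传输实现:

```python
import math

def cosine(u, v):
    """两向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def conditioned_steer(activation, unsafe_dir, transport, threshold=0.5):
    """几何条件门控: 仅当激活落在"不安全区域"
    (与不安全方向的余弦相似度超过阈值)时才施加传输映射,
    否则原样返回, 以尽量不干扰良性查询的生成质量。"""
    if cosine(activation, unsafe_dir) > threshold:
        return transport(activation)
    return activation

unsafe = [1.0, 0.0]                  # 假设的单位化不安全方向

def remove_unsafe_component(x):
    """假设的传输映射: 移除激活在不安全方向上的投影分量。"""
    coef = sum(a * b for a, b in zip(x, unsafe))
    return [a - coef * b for a, b in zip(x, unsafe)]

benign = [0.1, 1.0]                  # 与不安全方向几乎正交 -> 不干预
unsafe_act = [2.0, 0.2]              # 与不安全方向高度对齐 -> 触发传输
out_benign = conditioned_steer(benign, unsafe, remove_unsafe_component)
out_unsafe = conditioned_steer(unsafe_act, unsafe, remove_unsafe_component)
```

良性激活原样通过,而不安全激活被映射出不安全区域——这对应摘要中 CAT 相比线性全局干预能保持良性图像质量的核心思路。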
[CV-14] Kling-MotionControl Technical Report
【速读】:该论文旨在解决当前生成式AI在角色动画(Character Animation)中面临的挑战,即如何实现高保真、鲁棒且具表现力的全身角色动作迁移,同时确保跨身份泛化能力和外观一致性。其核心解决方案是提出Kling-MotionControl,一个基于DiT(Diffusion Transformer)的统一框架,采用“分而治之”的策略,针对身体、面部和手部等不同部位设计异构运动表示,从而在保持大尺度结构稳定性的同时实现精细关节动作的表达性控制;此外,通过自适应的身份无关学习(identity-agnostic learning)增强跨身份运动重定向能力,并结合身份注入与融合机制及主体库(subject library)机制保障外观忠实还原;最后,引入多阶段蒸馏加速框架,使推理速度提升超10倍,显著提升实用性。
链接: https://arxiv.org/abs/2603.03160
作者: Kling Team:Jialu Chen,Yikang Ding,Zhixue Fang,Kun Gai,Kang He,Xu He,Jingyun Hua,Mingming Lao,Xiaohan Li,Hui Liu,Jiwen Liu,Xiaoqiang Liu,Fan Shi,Xiaoyu Shi,Peiqin Sun,Songlin Tang,Pengfei Wan,Tiancheng Wen,Zhiyong Wu,Haoxian Zhang,Runze Zhao,Yuanxing Zhang,Yan Zhou
机构: Kuaishou Technology(快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Access: this https URL
点击查看摘要
Abstract:Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.
[CV-15] Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
【速读】:该论文旨在解决3D编辑中多视角一致性难以维持的问题,尤其是在缺乏足够3D一致编辑配对数据的情况下,传统监督微调(SFT)方法难以实施。其解决方案的关键在于引入强化学习(Reinforcement Learning, RL)框架,并设计基于3D基础模型VGGT的新型奖励机制:利用VGGT从海量真实世界数据中学到的强先验知识,通过分析编辑图像输出的置信度图和姿态估计误差作为奖励信号,引导2D编辑先验锚定在3D一致性流形上,从而实现高效且稳定的多视角一致性编辑效果。
链接: https://arxiv.org/abs/2603.03143
作者: Jiyuan Wang,Chunyu Lin,Lei Sun,Zhi Cao,Yuyang Yin,Lang Nie,Zhenlong Yuan,Xiangxiang Chu,Yunchao Wei,Kang Liao,Guosheng Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures
点击查看摘要
Abstract:Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT’s robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
[CV-16] AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis ICASSP2026
【速读】:该论文旨在解决肺部超声(Lung Ultrasound, LUS)图像数据稀缺问题,从而限制了机器学习方法在图像解读和疾病监测中的应用。现有生成式增强方法(如生成对抗网络 GAN 和扩散模型)常因降采样导致细微诊断特征(如B线和胸膜不规则)的丢失。其解决方案的关键在于提出一种基于扩散机制的增强框架——A trous Wavelet Diffusion (AWDiff),该框架通过引入a trous小波变换来保留细尺度结构,避免破坏性下采样;同时结合BioMedCLIP这一在大规模生物医学语料上训练的视觉-语言基础模型进行语义条件控制,确保生成图像与临床有意义标签对齐,从而在保持结构保真度的同时提升临床多样性。
链接: https://arxiv.org/abs/2603.03125
作者: Maryam Heidari(1),Nantheera Anantrasirichai(1),Steven Walker(2),Rahul Bhatnagar(2),Alin Achim(1) ((1) University of Bristol, UK, (2) Bristol Medical School, University of Bristol, UK)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages5 pages, 4 figures. Accepted to ICASSP 2026
点击查看摘要
Abstract:Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues due to resolution reduction, particularly B-lines and pleural irregularities. We propose A trous Wavelet Diffusion (AWDiff), a diffusion based augmentation framework that integrates the a trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision language foundation model trained on large scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.
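摘要中保留细尺度结构的 a trous(带孔)小波变换,可用如下一维示意实现理解:每一层用膨胀系数递增的核做平滑(全程不降采样,此处用 np.roll 实现循环边界),细节带为相邻两层平滑结果之差,因此各尺度都保持全分辨率。核系数与层数为常见示例取值,并非论文实现:

```python
import numpy as np

def atrous_decompose(signal, levels=3, kernel=(0.25, 0.5, 0.25)):
    """一维 a trous 平稳小波分解:第 level 层用膨胀 2**level 的
    三抽头核平滑,细节带 = 上一层近似 - 本层平滑结果。"""
    approx = np.asarray(signal, dtype=float)
    details = []
    for level in range(levels):
        dilation = 2 ** level
        smoothed = np.zeros_like(approx)
        for tap, w in zip((-1, 0, 1), kernel):
            smoothed += w * np.roll(approx, tap * dilation)
        details.append(approx - smoothed)
        approx = smoothed
    return details, approx

x = np.sin(np.linspace(0, 4 * np.pi, 64))
details, residual = atrous_decompose(x)
recon = residual + sum(details)   # 细节带逐级相消,可精确重建
```

由于没有降采样,B 线等细微结构不会在分解过程中被抽取掉,这是摘要所述"避免破坏性下采样"的核心。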
[CV-17] MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection CVPR2026
【速读】:该论文旨在解决零样本异常检测(Zero-Shot Anomaly Detection, ZSAD)中如何在保持CLIP模型强大泛化能力的同时,实现对异常检测任务的精细化适配问题。现有方法因采用无Patch感知的设计,将所有图像块统一处理,难以捕捉不同区域的局部特征差异。其解决方案的关键在于提出MoECLIP架构,通过引入Mixture-of-Experts (MoE)机制,基于每张图像块的独特特征动态路由至对应的低秩适配(Low-Rank Adaptation, LoRA)专家模块,从而实现patch级自适应;同时设计Frozen Orthogonal Feature Separation (FOFS)和等角紧框架(simplex equiangular tight frame, ETF)损失函数,以抑制专家间功能冗余并优化表示分布,提升模型判别力。
链接: https://arxiv.org/abs/2603.03101
作者: Jun Yeong Park,JunYoung Seo,Minji Kang,Yu Rang Park
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:The CLIP model’s outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP’s powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at this https URL.
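"将每个图像块动态路由到一个 LoRA 专家"的基本结构可以用如下 NumPy 示意理解(维度、top-1 路由方式均为假设性简化,论文的实际路由与训练细节以原文为准):

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, n_experts, n_patches = 8, 2, 4, 6

W = rng.normal(size=(d, d))                     # 冻结的主干投影
lora_A = rng.normal(size=(n_experts, rank, d))  # 每个专家一组低秩矩阵
lora_B = rng.normal(size=(n_experts, d, rank))
router = rng.normal(size=(d, n_experts))        # 门控/路由权重

patches = rng.normal(size=(n_patches, d))

def moe_lora_forward(x):
    """Top-1 路由:每个图像块按门控分数选择一个 LoRA 专家,
    只有被选中专家的低秩增量被加到冻结主干的输出上。"""
    gates = x @ router             # (n_patches, n_experts)
    chosen = gates.argmax(axis=1)  # 每个 patch 的专家编号
    out = x @ W.T                  # 冻结主干路径
    for i, e in enumerate(chosen):
        out[i] += lora_B[e] @ (lora_A[e] @ x[i])
    return out, chosen

out, chosen = moe_lora_forward(patches)
```

主干 W 保持冻结、只训练低秩增量与路由器,对应摘要中"在适配异常检测的同时保留 CLIP 泛化能力"的思路。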
[CV-18] nyIceNet: Low-Power SAR Sea Ice Segmentation for On-Board FPGA Inference
【速读】:该论文旨在解决极地海域中海冰制图的实时性与可靠性问题,传统地面处理方式受限于下行链路带宽、延迟及能耗,难以满足快速变化冰情下的导航需求。解决方案的关键在于提出一种面向星载部署的轻量化语义分割网络TinyIceNet,其通过SAR感知的架构简化与低精度量化技术,在严格硬件和功耗约束下实现高精度海冰分级(Stage of Development, SOD)识别;同时采用高层次综合(High-Level Synthesis)方法将其部署于Xilinx Zynq UltraScale+ FPGA平台,实现了近实时推理并降低2倍能耗,验证了芯片级软硬件协同设计在空间边缘AI系统中的可行性与优势。
链接: https://arxiv.org/abs/2603.03075
作者: Mhd Rashed Al Koutayni,Mohamed Selim,Gerd Reis,Alain Pagani,Didier Stricker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: undergoing publication at CVC 2026
点击查看摘要
Abstract:Accurate sea ice mapping is essential for safe maritime navigation in polar regions, where rapidly changing ice conditions require timely and reliable information. While Sentinel-1 Synthetic Aperture Radar (SAR) provides high-resolution, all-weather observations of sea ice, conventional ground-based processing is limited by downlink bandwidth, latency, and energy costs associated with transmitting large volumes of raw data. On-board processing, enabled by dedicated inference chips integrated directly within the satellite payload, offers a transformative alternative by generating actionable sea ice products in orbit. In this context, we present TinyIceNet, a compact semantic segmentation network co-designed for on-board Stage of Development (SOD) mapping from dual-polarized Sentinel-1 SAR imagery under strict hardware and power constraints. Trained on the AI4Arctic dataset, TinyIceNet combines SAR-aware architectural simplifications with low-precision quantization to balance accuracy and efficiency. The model is synthesized using High-Level Synthesis and deployed on a Xilinx Zynq UltraScale+ FPGA platform, demonstrating near-real-time inference with significantly reduced energy consumption. Experimental results show that TinyIceNet achieves 75.216% F1 score on SOD segmentation while reducing energy consumption by 2x compared to full-precision GPU baselines, underscoring the potential of chip-level hardware-algorithm co-design for future spaceborne and edge AI systems.
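摘要提到的低精度量化,可以用对称 int8 量化这一常见做法来示意(TinyIceNet 实际采用的位宽与量化方案为论文细节,此处仅为假设性的通用例子):

```python
import numpy as np

def quantize_int8(w):
    """对称 per-tensor int8 量化:按最大幅值映射到 127 确定
    缩放因子,取整并截断;返回 int8 权重与反量化所需的 scale。"""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # 量化误差不超过一个量化步长
```

在 FPGA 上,整数乘加远比浮点便宜,这也是此类模型能把能耗降低约 2 倍的主要来源之一。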
[CV-19] EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education
【速读】:该论文旨在解决生成式 AI (Generative AI) 在教育场景中,特别是面向幼儿数学概念(如数字和几何)的视觉叙事学习方面的应用潜力尚未被充分挖掘的问题。其核心挑战在于现有文本到视频(Text-to-Video, T2V)模型生成的内容在感知质量和与教学提示的语义对齐度方面缺乏系统评估手段。解决方案的关键在于提出首个专门针对教育用途的AI生成视频(AIGV)基准数据集EduAIGV-1k及其配套的多维质量评估框架EduVQA:前者包含1,130个由十种先进T2V模型生成的教学视频,并附带细粒度的空间/时间感知质量与提示对齐度(词级和句级标注)双重维度标注;后者引入结构化二维混合专家(Structured 2D Mixture-of-Experts, S2D-MoE)模块,通过共享专家和动态二维门控矩阵增强整体质量与各子维度之间的依赖关系,从而实现更精准、可解释的AIGV质量评估。
链接: https://arxiv.org/abs/2603.03066
作者: Baoliang Chen,Xinlong Bu,Lingyu Zhu,Hanwei Zhu,Xiangjie Sui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) Perceptual quality, disentangled into spatial and temporal fidelity, and (2) Prompt alignment, labeled at the word-level and sentence-level to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations transform each video into a multi-dimensional, interpretable supervision signal, far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension by shared experts and dynamic 2D gating matrix. Extensive experiments show our EduVQA consistently outperforms existing VQA baselines. Both our dataset and code will be publicly available.
[CV-20] IoUCert: Robustness Verification for Anchor-based Object Detectors
【速读】:该论文旨在解决在目标检测任务中实现形式化鲁棒性验证(formal robustness verification)的难题,尤其是在基于锚框(anchor-based)的目标检测架构中,由于非线性坐标变换和交并比(Intersection-over-Union, IoU)度量的复杂性,传统方法难以提供可靠的鲁棒性保证。其解决方案的关键在于提出了一种名为IoUCert的新颖形式化验证框架,通过设计一种新的坐标变换方式,绕过对非线性边界框预测函数进行精度损失的松弛处理,从而能够直接以锚框偏移量为优化变量,构建出一种新型区间边界传播(Interval Bound Propagation)方法,进而推导出最优的IoU边界。这一方法首次实现了对真实场景下SSD、YOLOv2和YOLOv3等典型锚框模型在多种输入扰动下的鲁棒性验证。
链接: https://arxiv.org/abs/2603.03043
作者: Benedikt Brückner,Alejandro Mercado,Yanghao Zhang,Panagiotis Kouvaros,Alessio Lomuscio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While formal robustness verification has seen significant success in image classification, scaling these guarantees to object detection remains notoriously difficult due to complex non-linear coordinate transformations and Intersection-over-Union (IoU) metrics. We introduce IoUCert, a novel formal verification framework designed specifically to overcome these bottlenecks in foundational anchor-based object detection architectures. Focusing on the object localisation component in single-object settings, we propose a coordinate transformation that enables our algorithm to circumvent precision-degrading relaxations of non-linear box prediction functions. This allows us to optimise bounds directly with respect to the anchor box offsets which enables a novel Interval Bound Propagation method that derives optimal IoU bounds. We demonstrate that our method enables, for the first time, the robustness verification of realistic, anchor-based models including SSD, YOLOv2, and YOLOv3 variants against various input perturbations.
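论文推导的是经网络传播的最优 IoU 界;这里只给出一个保守(非最优)的区间下界示意,说明"对坐标区间取极端情形可得可靠 IoU 下界"这一基本思路,所有数值与函数名均为示例:

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection_area(a, b):
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def iou_lower_bound(gt, pred_lo, pred_hi):
    """当预测框每个坐标落在区间 [pred_lo, pred_hi] 内时,给出
    IoU 的一个可靠(但不一定紧)下界:交集取最坏情形(框向内
    收缩),并集取最坏情形(框向外扩张)。由于
    IoU = I / (A_g + A_p - I) 对 I 单调递增、对 A_p 单调递减,
    分别取两个极端情形组合后的估计是保守的。"""
    shrunk = [pred_hi[0], pred_hi[1], pred_lo[2], pred_lo[3]]
    grown = [pred_lo[0], pred_lo[1], pred_hi[2], pred_hi[3]]
    inter_lo = intersection_area(gt, shrunk)
    union_hi = box_area(gt) + box_area(grown) - inter_lo
    return inter_lo / union_hi if union_hi > 0 else 0.0

gt = [0.0, 0.0, 10.0, 10.0]
lb = iou_lower_bound(gt, pred_lo=[0.0, 0.0, 9.0, 9.0],
                         pred_hi=[1.0, 1.0, 11.0, 11.0])
```

任意落在该区间内的预测框,其真实 IoU 都不低于 lb;验证即检查该下界是否高于所需阈值。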
[CV-21] BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology
【速读】:该论文旨在解决通用病理基础模型(Pathology Foundation Models, PFMs)在特定器官系统(乳腺病理)中临床任务性能不足的问题,尤其是缺乏针对单一器官的大规模验证队列和能够将广泛组织形态学知识有效转化为器官特异性专家级解读能力的训练范式。解决方案的关键在于提出BRIGHT——首个专为乳腺病理设计的基础模型,其采用协同通用-专科框架(collaborative generalist-specialist framework),在约21000万张来自超51000例患者、覆盖19家医院的乳腺全切片图像(Whole-Slide Images, WSIs)上进行预训练,并通过构建迄今为止最大的多中心乳腺病理验证队列(涵盖10家医院、超过25000张WSI及24项临床任务)来全面评估模型性能。实验表明,BRIGHT在21/24项内部验证任务和5/10项外部验证任务中达到当前最优(SOTA)水平,且具备优异的热图可解释性,从而验证了该协同框架在提升器官特异性病理AI模型性能方面的有效性与可扩展性。
链接: https://arxiv.org/abs/2603.03030
作者: Xiaojing Guo,Jiatai Lin,Yumian Jia,Jingqi Huang,Zeyan Xu,Weidong Li,Longfei Wang,Jingjing Chen,Qin Li,Weiwei Wang,Lifang Cui,Wen Yue,Zhiqiang Cheng,Xiaolong Wei,Jianzhong Yu,Xia Jin,Baizhou Li,Honghong Shen,Jing Li,Chunlan Li,Yanfen Cui,Yi Dai,Yiling Yang,Xiaolong Qian,Liu Yang,Yang Yang,Guangshen Gao,Yaqing Li,Lili Zhai,Chenying Liu,Tianhua Zhang,Zhenwei Shi,Cheng Lu,Xingchen Zhou,Jing Xu,Miaoqing Zhao,Fang Mei,Jiaojiao Zhou,Ning Mao,Fangfang Liu,Chu Han,Zaiyi Liu
机构: Tianjin Medical University Cancer Institute Hospital (天津医科大学肿瘤医院); National Clinical Research Center for Cancer (国家临床研究中心); Key Laboratory of Breast Cancer Prevention and Therapy (乳腺癌防治重点实验室); Tianjin Medical University (天津医科大学); Tianjin’s Clinical Research Center for Cancer (天津临床研究中心); Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application (广东省人工智能医学图像分析与应用重点实验室); Guangdong Provincial People’s Hospital (广东省级人民医院); Guangdong Academy of Medical Sciences (广东省医学科学院); Southern Medical University (南方医科大学); Department of Radiology (放射科); School of Biomedical Engineering (生物医学工程学院); Yunnan Cancer Hospital (云南省肿瘤医院); The Third Affiliated Hospital of Kunming Medical University (昆明医科大学第三附属医院); Peking University Cancer Hospital Yunnan (北京大学肿瘤医院云南医院); Qingdao University Hospital (青岛大学附属医院); Weifang Hospital of Traditional Chinese Medicine (潍坊市中医院); Affiliated Hospital of Jining Medical University (济宁医学院附属医院); China-Japan Friendship Hospital (中日友好医院); The Obstetrics and Gynecology Hospital of FuDan University (复旦大学妇产科医院); ShenZhen Third People’s Hospital (深圳市第三人民医院); Cancer Hospital of ShanTou University Medical College (汕头大学医学院附属肿瘤医院); DongGuan People’s Hospital (东莞市人民医院); Hunan Cancer Hospital (湖南省肿瘤医院); The Affiliated Cancer Hospital of Xiangya School of Medicine, Central South University (中南大学湘雅医学院附属肿瘤医院); The Second Affiliated Hospital Zhejiang University School of Medicine (浙江大学医学院附属第二医院); The Fourth Affiliated Hospital, Zhejiang University School of Medicine (浙江大学医学院附属第四医院); Second hospital of ShanXi Medical University (山西医科大学第二医院); ZiBo Central Hospital (淄博市中心医院); QiLu Hospital of Shandong University (山东大学齐鲁医院); Shanxi Province Cancer Hospital (山西省肿瘤医院); Cancer Hospital Affiliated to Shanxi Medical University (山西医科大学附属肿瘤医院); Peking University Shenzhen Hospital (北京大学深圳医院); The Second Hospital, Cheeloo College of Medicine, Shandong University (山东大学齐鲁医学院第二医院); Qingdao Central Hospital, University of Health and Rehabilitation Sciences (青岛中心医院,健康与康复科学大学); Shandong First Medical University Affiliated Tumor Hospital (山东第一医科大学附属肿瘤医院); Peking University Third Hospital, School of Basic Medical Sciences, Peking University Health Science Center (北京大学第三医院,基础医学院,北京大学医学部); the Second Affiliated Hospital, Zhejiang University School of Medicine (浙江大学医学院附属第二医院); Yantai Yuhuangding Hospital, Qingdao University (烟台毓璜顶医院,青岛大学); Shandong Provincial Key Medical and Health Laboratory of Intelligent Diagnosis and Treatment for Women’s Diseases (山东省智能诊疗妇女疾病重点实验室); Faculty of Applied Sciences, Macao Polytechnic University (澳门理工大学应用科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generalist pathology foundation models (PFMs), pretrained on large-scale multi-organ datasets, have demonstrated remarkable predictive capabilities across diverse clinical applications. However, their proficiency on the full spectrum of clinically essential tasks within a specific organ system remains an open question due to the lack of large-scale validation cohorts for a single organ as well as the absence of a tailored training paradigm that can effectively translate broad histomorphological knowledge into the organ-specific expertise required for specialist-level interpretation. In this study, we propose BRIGHT, the first PFM specifically designed for breast pathology, trained on approximately 210 million histopathology tiles from over 51,000 breast whole-slide images derived from a cohort of over 40,000 patients across 19 hospitals. BRIGHT employs a collaborative generalist-specialist framework to capture both universal and organ-specific features. To comprehensively evaluate the performance of PFMs on breast oncology, we curate the largest multi-institutional cohorts to date for downstream task development and evaluation, comprising over 25,000 WSIs across 10 hospitals. The validation cohorts cover the full spectrum of breast pathology across 24 distinct clinical tasks spanning diagnosis, biomarker prediction, treatment response and survival prediction. Extensive experiments demonstrate that BRIGHT outperforms three leading generalist PFMs, achieving state-of-the-art (SOTA) performance in 21 of 24 internal validation tasks and in 5 of 10 external validation tasks with excellent heatmap interpretability. By evaluating on large-scale validation cohorts, this study not only demonstrates BRIGHT’s clinical utility in breast oncology but also validates a collaborative generalist-specialist paradigm, providing a scalable template for developing PFMs on a specific organ system.
[CV-22] Any Resolution Any Geometry: From Multi-View To Multi-Patch
【速读】:该论文旨在解决单目高分辨率深度图与表面法向量联合估计中面临的挑战,即如何在保持局部细节精细度的同时确保全局几何一致性。其核心解决方案是提出超分辨率几何Transformer(URGT),该方法将视觉几何基础Transformer(VGGT)改造为统一的多块Transformer架构,通过将输入图像划分为多个补丁并引入预训练模型提供的粗粒度深度和法向先验信息,在单一前向传播中实现几何输出的精细化预测。关键创新在于利用跨补丁注意力机制增强全局一致性,并结合GridMix补丁采样策略提升空间鲁棒性与泛化能力,从而在多个基准测试上显著优于现有方法,同时具备良好的零样本迁移能力和高分辨率扩展性。
链接: https://arxiv.org/abs/2603.03026
作者: Wenqing Cui,Zhenyu Li,Mykola Lavreniuk,Jian Shi,Ramzi Idoughi,Xiangjun Tang,Peter Wonka
机构: KAUST (King Abdullah University of Science and Technology); Space Research Institute NASU-SSAU
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL
点击查看摘要
Abstract:Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth–normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
[CV-23] VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats
【速读】:该论文旨在解决基于3D Gaussian Splatting (3DGS) 的场景在保持视点依赖效应(如高光)的同时,实现高效且逼真的颜色编辑问题。现有方法缺乏对场景外观进行快速、精准编辑的能力,尤其难以同时保留复杂的光照响应特性。解决方案的关键在于提出一种新颖的架构,将颜色分解为漫反射(diffuse)与视点依赖(view-dependent)两个独立分量,并引入多视角训练策略,利用来自多个视角的图像块增强重建精度和编辑鲁棒性。此外,该方法仅需用户输入一张手动编辑图像,通过微调单个MLP权重并结合单次分割模块,即可在两秒内完成整个场景的颜色传播,实现交互式实时编辑并控制视点依赖效果的强度。
链接: https://arxiv.org/abs/2603.02986
作者: Alessio Mazzucchelli,Ivan Ojeda-Martin,Fernando Rivas-Manzaneque,Elena Garces,Adrian Penate-Sanchez,Francesc Moreno-Noguer
机构: Arquimea Research Center (阿基米德研究中心); Universidad Politécnica de Catalunya (加泰罗尼亚理工大学); Volinga AI (Volinga AI); Universidad Politécnica de Madrid (马德里理工大学); Universidad Rey Juan Carlos (雷伊胡安卡洛斯大学); IUSANI, Universidad de Las Palmas de Gran Canaria (拉斯帕尔马斯大加那利大学); Institut de Robòtica i Informàtica Industrial (IRI), CSIC-UPC (工业机器人与信息研究所,CSIC-加泰罗尼亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: IEEE Transactions on Pattern Analysis and Machine Intelligence. 2026 Feb 24
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has recently transformed the fields of novel view synthesis and 3D reconstruction due to its ability to accurately model complex 3D scenes and its unprecedented rendering performance. However, a significant challenge persists: the absence of an efficient and photorealistic method for editing the appearance of the scene’s content. In this paper we introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS while preserving view-dependent effects such as specular highlights. Key to our method are a novel architecture that separates color into diffuse and view-dependent components, and a multi-view training strategy that integrates image patches from multiple viewpoints. Improving over the conventional single-view batch training, our 3DGS representation provides more accurate reconstruction and serves as a solid representation for the recoloring task. For 3DGS recoloring, we then introduce a rapid scheme requiring only one manually edited image of the scene from the end-user. By fine-tuning the weights of a single MLP, alongside a module for single-shot segmentation of the editable area, the color edits are seamlessly propagated to the entire scene in just two seconds, facilitating real-time interaction and providing control over the strength of the view-dependent effects. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors based on Neural Radiance Fields representations.
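"将颜色拆成漫反射分量与视点依赖分量"的思想可以用如下玩具模型示意:重新着色只需修改与视角无关的漫反射基色,视点依赖残差(此处用一个虚构的线性系数矩阵代替论文中的 MLP)保持不变,strength 对应摘要中控制视点依赖效果强度的旋钮。函数名与数值均为假设:

```python
import numpy as np

def shade(diffuse_rgb, view_coeffs, view_dir, strength=1.0):
    """最终颜色 = 漫反射基色 + strength * 视点依赖残差;
    残差由视线方向经一个玩具式线性模型得到。"""
    v = view_dir / np.linalg.norm(view_dir)
    view_term = view_coeffs @ v              # (3,3) 系数 -> RGB 残差
    return np.clip(diffuse_rgb + strength * view_term, 0.0, 1.0)

diffuse = np.array([0.6, 0.2, 0.2])  # 可被编辑/重着色的基色
coeffs = 0.1 * np.eye(3)             # 玩具式视点依赖系数
front = shade(diffuse, coeffs, np.array([1.0, 0.0, 0.0]))
flat = shade(diffuse, coeffs, np.array([1.0, 0.0, 0.0]), strength=0.0)
```

strength=0 时退化为纯漫反射渲染;增大 strength 则增强高光等视点依赖效果,而编辑始终只作用于 diffuse 分量。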
[CV-24] he Dresden Dataset for 4D Reconstruction of Non-Rigid Abdominal Surgical Scenes
【速读】:该论文旨在解决在真实手术条件下对变形腹腔软组织进行三维重建的挑战,尤其是针对非刚性运动、大形变及视场外更新等复杂场景下的算法鲁棒性问题。解决方案的关键在于构建了一个高质量的D4D数据集,该数据集包含经光学跟踪与人工迭代配准方法精确对齐的内窥镜视频和结构光几何信息,涵盖三种不同类型的序列(整体变形、增量变形和移动相机片段),并提供逐帧工具掩膜、立体深度图、结构光点云、优化后的相机位姿与内参等丰富标注信息,从而支持对非刚性SLAM、4D重建和深度估计方法的定量几何评估与视觉合成基准测试。
链接: https://arxiv.org/abs/2603.02985
作者: Reuben Docea,Rayan Younis,Yonghao Long,Maxime Fleury,Jinjing Xu,Chenyang Li,André Schulze,Ann Wierick,Johannes Bender,Micha Pfeiffer,Qi Dou,Martin Wagner,Stefanie Speidel
机构: National Center for Tumor Diseases (NCT), NCT/UCC Dresden, a partnership between DKFZ, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, and Helmholtz-Zentrum Dresden-Rossendorf (HZDR); Technische Universität Dresden; The Chinese University of Hong Kong; Centre for Tactile Internet with Human-in-the-Loop (CeTI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, accompanying data descriptor for dataset, submitted to Scientific Data
点击查看摘要
Abstract:The D4D Dataset provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue in realistic surgical conditions. Data were acquired from six porcine cadaver sessions using a da Vinci Xi stereo endoscope and a Zivid structured-light camera, registered via optical tracking and manually curated iterative alignment methods. Three sequence types - whole deformations, incremental deformations, and moved-camera clips - probe algorithm robustness to non-rigid motion, deformation magnitude, and out-of-view updates. Each clip provides rectified stereo images, per-frame instrument masks, stereo depth, start/end structured-light point clouds, curated camera poses and camera intrinsics. In postprocessing, ICP and semi-automatic registration techniques are used to register data, and instrument masks are created. The dataset enables quantitative geometric evaluation in both visible and occluded regions, alongside photometric view-synthesis baselines. Comprising over 300,000 frames and 369 point clouds across 98 curated recordings, this resource can serve as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation methods.
[CV-25] Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection
【速读】:该论文旨在解决当前基于DINO模型的无监督异常检测(Unsupervised Anomaly Detection, UAD)方法中存在的两个关键问题:一是现有方法通常独立建模patch嵌入,忽略了patch之间的空间和邻域关系;二是正常分布的建模多依赖于内存库或原型聚类,导致推理时存在显著的内存和计算开销。解决方案的关键在于提出一种简单高效的框架,通过二维自回归(2D Autoregressive, AR)模型显式建模patch嵌入间的空间与上下文依赖关系,并采用卷积神经网络(CNN)学习一个紧凑的参数化正常分布模型,从而在测试阶段仅需一次前向传播即可完成异常检测,实现快速且内存高效的推理。
链接: https://arxiv.org/abs/2603.02974
作者: Ertunc Erdil,Nico Schulthess,Guney Tombak,Ender Konukoglu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled as memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network and enables fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: this https URL.
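二维自回归 CNN 的核心是带因果掩码的卷积(PixelCNN 式做法):按光栅扫描顺序,每个位置的预测只依赖其"之前"的邻居。下面是一个假设性的最小示意(掩码形状与卷积实现为通用做法,并非论文的具体网络):

```python
import numpy as np

def causal_mask(k):
    """构造 k x k 卷积核的光栅扫描因果掩码:某位置的预测只能
    依赖其上方与左侧的 patch,不能看到自身及"未来"位置。"""
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c:] = 0       # 中心行:自身及右侧置零
    mask[c + 1:, :] = 0   # 下方所有行置零
    return mask

def masked_conv2d(grid, kernel):
    """对 (H, W) 的 patch 得分网格做 valid 填充的掩码卷积。"""
    k = kernel.shape[0]
    mk = kernel * causal_mask(k)
    H, W = grid.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(grid[i:i + k, j:j + k] * mk)
    return out

m = causal_mask(3)
res = masked_conv2d(np.arange(9.0).reshape(3, 3), np.ones((3, 3)))
```

测试时异常分数可取各位置在该自回归模型下的(负)似然,一次前向即可,无需任何内存库比对。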
[CV-26] agaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation
【速读】:该论文旨在解决大型视觉语言模型(VLM)在视觉-语言导航(VLN)任务中因架构不匹配而导致的性能瓶颈问题,即VLM主要基于静态、非具身的视觉-语言预训练任务,难以适应导航所需的动态、具身及空间结构化特性。解决方案的关键在于提出TagaVLM框架,通过显式注入拓扑结构增强模型的空间推理能力:一是引入Spatial Topology Aware Residual Attention(STAR-Att),将拓扑边信息直接嵌入VLM的自注意力机制以实现内在空间推理;二是设计Interleaved Navigation Prompt,强化节点级视觉与文本对齐;最终借助嵌入的拓扑图实现全局动作推理,提升路径修正能力。实验证明,该方法在R2R基准上显著优于现有大模型方法,表明针对小型开源VLM的针对性改进比单纯扩大模型规模更有效。
链接: https://arxiv.org/abs/2603.02972
作者: Jiaxing Liu,Zexi Zhang,Xiaoyan Li,Boyue Wang,Yongli Hu,Baocai Yin
机构: Beijing University of Technology (北京工业大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM’s self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released at this http URL. Project page: this https URL
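"将拓扑边信息直接注入自注意力"最简单的形式是在注意力 logits 上加一个由邻接矩阵导出的偏置,使图中相连节点互相关注更强。以下为假设性的单头示意(固定标量偏置代替可学习参数,并非 STAR-Att 的实际结构):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topology_biased_attention(Q, K, V, adjacency, bias_scale=2.0):
    """单头注意力:对存在拓扑边的节点对,在 logits 上加偏置,
    softmax 后相连节点获得更高的注意力权重。"""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits = logits + bias_scale * adjacency   # 注入边结构
    weights = softmax(logits)
    return weights @ V, weights

n, d = 4, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
adj = np.zeros((n, n))
adj[0, 1] = adj[1, 0] = 1.0   # 节点 0、1 之间存在拓扑边

out_biased, w_biased = topology_biased_attention(Q, K, V, adj)
out_plain, w_plain = topology_biased_attention(Q, K, V, np.zeros((n, n)))
```

加性偏置不改变预训练权重本身,这与摘要中"在保留预训练知识的同时实现内在空间推理"的动机一致。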
[CV-27] Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention
【速读】:该论文旨在解决工业异常检测中因异常样本稀缺及现实场景下异常复杂性导致的检测性能瓶颈问题。其核心解决方案是提出一种基于基础模型的异常合成流水线(Foundation Model-based Anomaly Synthesis pipeline, FMAS),无需微调或类别特定训练即可生成高度逼真的异常样本;关键创新在于引入小波域注意力模块(Wavelet Domain Attention Module, WDAM),通过自适应子带处理增强异常特征提取能力,从而显著提升检测灵敏度并保持计算效率。
链接: https://arxiv.org/abs/2603.02964
作者: Wensheng Wu,Zheming Lu,Ziqian Lu,Zewei He,Xuecheng Sun,Zhao Wang,Jungong Han,Yunlong Yu
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Industrial anomaly detection faces significant challenges due to the scarcity of anomalous samples and the complexity of real-world anomalies. In this paper, we propose a foundation model-based anomaly synthesis pipeline (FMAS) that generates highly realistic anomalous samples without fine-tuning or class-specific training. Motivated by the distinct frequency-domain characteristics of anomalies, we introduce a Wavelet Domain Attention Module (WDAM), which exploits adaptive sub-band processing to enhance anomaly feature extraction. The combination of FMAS and WDAM significantly improves anomaly detection sensitivity while maintaining computational efficiency. Comprehensive experiments on MVTec AD and VisA datasets demonstrate that WDAM, as a plug-and-play module, achieves substantial performance gains against existing baselines.
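The abstract does not spell out WDAM's internals, so as a rough, hypothetical illustration of "adaptive sub-band processing", the sketch below performs the single-level 2D Haar decomposition that such a module would operate on; the function name and the averaging convention are our own choices, not the paper's:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform: split an (H, W) map (H, W even)
    into LL/LH/HL/HH sub-bands, using an averaging (non-orthonormal)
    convention so that each 2x2 block satisfies x[0,0] = LL+LH+HL+HH."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average of row pairs
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical detail of row pairs
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse content
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return LL, LH, HL, HH
```

An attention module could then weight the four sub-bands independently before recombining them, which is the general shape of frequency-domain attention.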
[CV-28] Semi-Supervised Few-Shot Adaptation of Vision-Language Models
【速读】:该论文旨在解决医学影像领域中在极低样本量(extremely low-shot regimes)下视觉语言模型(Vision-Language Models, VLMs)因类别不平衡导致的性能下降问题。其解决方案的关键在于引入一种高效的半监督求解器,通过在少样本适配过程中传播由文本信息引导的伪标签(text-informed pseudo-labels),从而利用未标注数据增强模型泛化能力,显著降低标注成本,在低样本场景下可减少50%的标注工作量。
链接: https://arxiv.org/abs/2603.02959
作者: Julio Silva-Rodríguez,Ender Konukoglu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
点击查看摘要
Abstract:Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by 50% in low-shot regimes.
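The abstract leaves the semi-supervised solver unspecified; the following minimal sketch (the function name and temperature value are our assumptions, not the paper's) shows the basic primitive it builds on — scoring unlabeled image embeddings against per-class text embeddings to obtain soft, text-informed pseudo-labels:

```python
import numpy as np

def text_informed_pseudo_labels(img_emb, txt_emb, tau=0.1):
    """Soft pseudo-labels for unlabeled images from cosine similarity to
    per-class text embeddings (a hypothetical primitive, not the paper's
    full propagation solver).
    img_emb: (N, d) image features; txt_emb: (C, d) class-prompt features."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau                    # (N, C) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # rows sum to 1
```

Such soft labels can then supplement the handful of annotated shots when fitting the multi-modal linear probe.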
[CV-29] Leveraging Label Proportion Prior for Class-Imbalanced Semi-Supervised Learning
【速读】:该论文旨在解决半监督学习(Semi-supervised Learning, SSL)在类别不平衡场景下的性能下降问题,尤其是伪标签(pseudo-labeling)会放大多数类偏差并抑制少数类表现的局限性。其解决方案的关键在于首次将来自标签比例学习(Learning from Label Proportions, LLP)的Proportion Loss引入SSL框架作为正则化项,该损失函数通过约束模型预测与全局类别分布的一致性,有效缓解了多数类与少数类之间的偏差;同时,为进一步稳定训练过程,作者提出了一个考虑小批量样本组成波动的随机变体,从而提升算法鲁棒性。实验表明,该方法在长尾CIFAR-10基准上显著优于基线模型,并在标签稀缺条件下展现出竞争力。
链接: https://arxiv.org/abs/2603.02957
作者: Kohki Akiba,Shinnosuke Matsuo,Shota Harada,Ryoma Bise
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Semi-supervised learning (SSL) often suffers under class imbalance, where pseudo-labeling amplifies majority bias and suppresses minority performance. We address this issue with a lightweight framework that, to our knowledge, is the first to introduce Proportion Loss from learning from label proportions (LLP) into SSL as a regularization term. Proportion Loss aligns model predictions with the global class distribution, mitigating bias across both majority and minority classes. To further stabilize training, we formulate a stochastic variant that accounts for fluctuations in mini-batch composition. Experiments on the Long-tailed CIFAR-10 benchmark show that integrating Proportion Loss into FixMatch and ReMixMatch consistently improves performance over the baselines across imbalance severities and label ratios, and achieves competitive or superior results compared to existing CISSL methods, particularly under scarce-label conditions.
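The Proportion Loss borrowed from LLP has a standard form: the cross-entropy between a known class-prior vector and the mini-batch mean of predicted class probabilities. A minimal sketch (the paper's stochastic variant is not shown):

```python
import numpy as np

def proportion_loss(probs, target_prop, eps=1e-8):
    """LLP Proportion Loss: cross-entropy between a target class-prior
    vector and the mini-batch mean of predicted class probabilities.
    probs: (N, C) softmax outputs; target_prop: (C,) class priors."""
    batch_prop = probs.mean(axis=0)               # predicted class proportions
    return float(-(target_prop * np.log(batch_prop + eps)).sum())
```

The loss is minimized (down to the entropy of `target_prop`) when the batch-level prediction matches the prior, which is why adding it as a regularizer counteracts majority-class collapse.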
[CV-30] CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
【速读】:该论文旨在解决GUI持续学习(Continual GUI Learning, CGL)中的灾难性遗忘问题,即在适应新GUI应用任务时容易丢失旧任务的知识。其核心挑战在于如何在快速适应新任务与保留历史交互逻辑之间取得平衡。解决方案的关键在于提出一种动态协同SFT(监督微调)与RL(强化学习)的框架:首先通过策略熵引导的SFT比例调整机制,实时优化SFT与RL训练阶段的权重分配;其次设计了一种专用梯度手术(gradient surgery)策略,将探索性SFT梯度投影到GRPO(Generalized Reward Policy Optimization)锚定梯度上,显式裁剪与GRPO冲突的SFT梯度分量,从而缓解显式梯度干扰。这一方法显著提升了模型在多任务连续学习场景下的适应效率与技能保留能力。
链接: https://arxiv.org/abs/2603.02951
作者: Zhenquan Yao,Zitong Huang,Yihan Zeng,Jianhua Han,Hang Xu,Chun-Mei Feng,Jianwei Ma,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University College Dublin (都柏林大学); Peking University (北京大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy. By projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. On top of that, we establish an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of continual GUI learning. Experimental results demonstrate the effectiveness of our proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.
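The exact GRPO-anchored projection is not given in the abstract; the sketch below is a PCGrad-style stand-in consistent with the description — the component of the SFT gradient that points against the anchor (RL) gradient is removed:

```python
import numpy as np

def clip_conflicting_gradient(g_sft, g_anchor):
    """Remove the part of the SFT gradient that opposes the anchor (RL)
    gradient; a PCGrad-style sketch of the described gradient surgery."""
    dot = float(g_sft @ g_anchor)
    if dot >= 0:                        # no conflict: keep SFT gradient as-is
        return g_sft
    # subtract the negatively aligned component along the anchor direction
    return g_sft - dot / float(g_anchor @ g_anchor) * g_anchor
```

After surgery the returned gradient is guaranteed to have a non-negative inner product with the anchor, so an SFT step can no longer directly undo what the RL phase protects.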
[CV-31] TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration CVPR2026
【速读】:该论文旨在解决扩散模型在低步数(20–30步)采样时因特征缓存技术失效而导致的生成质量下降问题。现有方法如基于多项式的外推器(如TaylorSeer)在步长增大时易出现误差累积和轨迹漂移,而传统缓存策略未考虑不同去噪阶段的动力学差异。解决方案的关键在于提出轨迹一致的Padé近似(Trajectory-Consistent Padé approximation, TC-Padé),其通过有理函数建模特征演化,更准确捕捉渐近与过渡行为;并引入两个核心机制:(1) 自适应系数调制,利用历史缓存残差检测微小轨迹变化;(2) 步长感知预测策略,针对早期、中期和晚期采样阶段的差异化动力学设计专用预测方式,从而实现高效且稳定的低步数采样。
链接: https://arxiv.org/abs/2603.02943
作者: Benlei Cui,Shaoxuan He,Bukun Huang,Zhizeng Ye,Yunyun Sun,Longtao Huang,Hui Xue,Yang Yang,Jingqun Tang,Zhou Zhao,Haiwen Hong
机构: Alibaba Group (阿里巴巴集团); Zhejiang University (浙江大学); Zhejiang Gongshang University (浙江工商大学); ByteDance Intelligent Creation (字节跳动智能创作)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
点击查看摘要
Abstract:Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
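As a hedged numeric illustration of why a rational extrapolant can outrun a polynomial one on decaying feature trajectories (this is the textbook [1/1] Padé construction, not TC-Padé's actual predictor), compare both on f(t) = exp(-t) extrapolated to t = 2 from derivatives at t = 0:

```python
import math

def taylor2(c0, c1, c2, t):
    """Second-order Taylor extrapolation from value/derivatives at t = 0."""
    return c0 + c1 * t + 0.5 * c2 * t * t

def pade11(c0, c1, c2, t):
    """[1/1] Padé extrapolant matching the same three Taylor coefficients
    (assumes c1 != 0); the rational form decays instead of turning upward."""
    b1 = -c2 / (2.0 * c1)
    a1 = c1 + c0 * b1
    return (c0 + a1 * t) / (1.0 + b1 * t)

# Decaying trajectory f(t) = exp(-t): f(0) = 1, f'(0) = -1, f''(0) = 1
truth = math.exp(-2.0)
err_taylor = abs(taylor2(1.0, -1.0, 1.0, 2.0) - truth)  # polynomial drifts up
err_pade = abs(pade11(1.0, -1.0, 1.0, 2.0) - truth)     # rational stays close
```

Over a large step the Taylor polynomial overshoots badly while the Padé form tracks the decay, which mirrors the trajectory-drift argument above.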
[CV-32] RACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval
【速读】:该论文旨在解决通用多模态检索(Universal Multimodal Retrieval)中如何有效建模复杂用户意图的问题,特别是现有基于静态编码器的多模态大语言模型(Multimodal Large Language Models, MLLMs)难以应对需要逻辑推理而非表面模式匹配的复杂查询。解决方案的关键在于提出TRACE(Task-adaptive Reasoning And Compressing Embeddings),其核心机制是将生成式推理与判别式表征学习统一:首先通过Chain-of-Thought(CoT)生成结构化推理路径以显式解析查询意图,随后利用专用标记将该推理过程压缩为紧凑嵌入向量。该框架在M-BEIR-CoT数据集上训练,并展现出隐式路由能力——自动识别并激活复杂查询的推理模块,同时跳过简单查询的冗余步骤,在检索准确率与推理吞吐量之间实现最优平衡,且具备出色的零样本迁移能力。
链接: https://arxiv.org/abs/2603.02929
作者: Xiangzhao Hao,Shijie Wang,Tianyu Yang,Tianyue Wang,Haiyun Guo,JinQiao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.
[CV-33] GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights
【速读】:该论文旨在解决肾小球病理学中两个核心挑战:一是肾小球病变评估的准确性与泛化能力不足,二是缺乏从组织层面形态参数到患者临床指标的系统性关联发现。针对这些问题,作者提出了一种以实体为中心的预训练基础模型GloPath,其关键创新在于利用超过一百万个肾小球图像(来自14,049例肾活检标本)进行多尺度、多视角的自监督学习训练,从而实现对细微病变模式的精准识别和跨模态诊断性能的显著提升。在52项任务上的实证表明,GloPath在42项任务中优于现有最优方法(80.8%),且在真实世界场景中达到91.51%的ROC-AUC,同时系统揭示了224对肾小球形态学变量与临床指标之间的统计学显著关联,展现出连接组织病理与临床结局的强大潜力。
链接: https://arxiv.org/abs/2603.02926
作者: Qiming He,Jing Li,Tian Guan,Yifei Ma,Zimo Zhao,Yanxia Wang,Hongjing Chen,Yingming Xu,Shuang Ge,Yexing Zhang,Yizhi Wang,Xinrui Chen,Lianghui Zhu,Yiqing Liu,Qingxia Hou,Shuyan Zhao,Xiaoqin Wang,Lili Ma,Peizhen Hu,Qiang Huang,Zihan Wang,Zhiyuan Shen,Junru Cheng,Siqi Zeng,Jiurun Chen,Zhen Song,Chao He,Zhe Wang,Yonghong He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Glomerular pathology is central to the diagnosis and prognosis of renal diseases, yet the heterogeneity of glomerular morphology and fine-grained lesion patterns remain challenging for current AI approaches. We present GloPath, an entity-centric foundation model trained on over one million glomeruli extracted from 14,049 renal biopsy specimens using multi-scale and multi-view self-supervised learning. GloPath addresses two major challenges in nephropathology: glomerular lesion assessment and clinicopathological insights discovery. For lesion assessment, GloPath was benchmarked across three independent cohorts on 52 tasks, including lesion recognition, grading, few-shot classification, and cross-modality diagnosis, outperforming state-of-the-art methods in 42 tasks (80.8%). In the large-scale real-world study, it achieved an ROC-AUC of 91.51% for lesion recognition, demonstrating strong robustness in routine clinical settings. For clinicopathological insights, GloPath systematically revealed statistically significant associations between glomerular morphological parameters and clinical indicators across 224 morphology-clinical variable pairs, demonstrating its capacity to connect tissue-level pathology with patient-level outcomes. Together, these results position GloPath as a scalable and interpretable platform for glomerular lesion assessment and clinicopathological discovery, representing a step toward clinically translatable AI in renal pathology.
[CV-34] HDINO: A Concise and Efficient Open-Vocabulary Detector
【速读】:该论文旨在解决当前开放词汇目标检测(Open-Vocabulary Object Detection)方法对人工精心构建的细粒度训练数据集及资源密集型分层跨模态特征提取的高度依赖问题。其核心解决方案在于提出一种简洁高效的检测框架HDINO,采用两阶段训练策略:第一阶段通过将噪声样本视为额外正样本,构建视觉与文本模态间的“一对多语义对齐机制”(One-to-Many Semantic Alignment Mechanism, O2M),并设计基于初始检测难度的加权分类损失(Difficulty Weighted Classification Loss, DWCL)以挖掘难例样本,从而提升语义对齐效果;第二阶段引入轻量级特征融合模块增强对语言语义的敏感性。该方法无需人工数据标注或 grounding 数据,在仅使用220万张图像的情况下即达到优于Grounding DINO-T和T-Rex2的性能,验证了其高效性与可扩展性。
链接: https://arxiv.org/abs/2603.02924
作者: Hao Zhang,Yiqun Wang,Qinran Lin,Runze Fan,Yong Li
机构: Chongqing University(重庆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or use of grounding data, surpassing Grounding DINO-T and T-Rex2 by 0.8 mAP and 2.8 mAP, respectively, which are trained on 5.4M and 6.5M images. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at this https URL.
[CV-35] Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers CVPR2026
【速读】:该论文旨在解决视频扩散变换器(Video Diffusion Transformers, DiTs)在将运动类文本描述转化为高质量视频时,其内部如何具体实现运动语义理解的问题,尤其是缺乏对运动相关行为的可解释性分析。解决方案的关键在于提出一种无需梯度计算或参数更新即可生成可解释性更强的显著图的方法:首先引入GramCol模块以自适应地生成针对任意文本概念(包括运动与非运动)的逐帧显著图;其次设计了一种运动特征选择算法,构建出能够同时在空间和时间维度上定位运动行为的可解释运动注意力图(Interpretable Motion-Attentive Map, IMAP),从而有效揭示DiTs中运动概念的具体表达机制。
链接: https://arxiv.org/abs/2603.02919
作者: Youngjun Jun,Seil Kang,Woojung Han,Seong Jae Hwang
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026
点击查看摘要
Abstract:Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.
[CV-36] Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement ICLR2026
【速读】:该论文旨在解决现有方法在重建、分割独立运动部件及分析关节结构时对先验知识(如部件数量)的高度依赖问题,以及在物体在不同状态中不可见或遮挡情况下鲁棒性不足的局限。其关键解决方案是提出一种名为Articulation in Motion (AiM)的新框架,通过结合用户-物体交互视频与初始状态扫描,利用双高斯场景表示(dual-Gaussian scene representation)从运动线索中自动推断部件分解与关节位置;进一步采用鲁棒的顺序RANSAC算法,在无需部件级结构先验的前提下,将运动基元聚类为刚性部件并估计其运动学参数,同时自动确定部件数量,从而实现高质量的部件分割与可交互3D数字孪生重建。
链接: https://arxiv.org/abs/2603.02910
作者: Hao Ai,Wenjie Chang,Jianbo Jiao,Ales Leonardis,Ofek Eyal
机构: University of Birmingham (伯明翰大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026. Project Page: this https URL
点击查看摘要
Abstract:Articulated objects are ubiquitous in daily life. Our goal is to achieve a high-quality reconstruction, segmentation of independent moving parts, and analysis of articulation. Recent methods analyse two different articulation states and perform per-point part segmentation, optimising per-part articulation using cross-state correspondences, given a priori knowledge of the number of parts. Such assumptions greatly limit their applications and performance. Their robustness is reduced when objects cannot be clearly visible in both states. To address these issues, in this paper, we present a new framework, Articulation in Motion (AiM). We infer part-level decomposition, articulation kinematics, and reconstruct an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object and a video that shows the movement of separate parts. It uses motion cues to segment the object into parts and assign articulation joints. Subsequently, a robust, sequential RANSAC is employed to achieve part mobility analysis without any part-level structural priors, which clusters moving primitives into rigid parts and estimates kinematics while automatically determining the number of parts. The proposed approach separates the object into parts, each represented as a 3D Gaussian set, enabling high-quality rendering. Our approach yields higher quality part segmentation than previous methods, without prior knowledge. Extensive experimental analysis on both simple and complex objects validates the effectiveness and strong generalisation ability of our approach. Project page: this https URL.
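A toy 2D version of the sequential-RANSAC idea (AiM operates on 3D Gaussian primitives; the SE(2) fitting, thresholds, and stopping rule below are our simplifications) that peels off rigid parts one at a time and determines their number automatically:

```python
import numpy as np

def fit_rigid_2d(P, Q):
    """Least-squares 2D rigid transform (R, t) with P @ R.T + t ~= Q (Kabsch)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # keep a proper rotation
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    return R, cq - R @ cp

def sequential_ransac(P, Q, thresh=0.05, min_part=10, iters=200, seed=0):
    """Greedily peel rigidly moving parts off correspondences P[i] -> Q[i].
    The part count is decided automatically: stop once the best model
    explains fewer than `min_part` points."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(P))
    parts = []
    while len(idx) >= min_part:
        best = None
        for _ in range(iters):
            s = rng.choice(len(idx), size=2, replace=False)  # minimal sample
            R, t = fit_rigid_2d(P[idx[s]], Q[idx[s]])
            resid = np.linalg.norm(P[idx] @ R.T + t - Q[idx], axis=1)
            inliers = resid < thresh
            if best is None or inliers.sum() > best.sum():
                best = inliers
        if best.sum() < min_part:
            break
        parts.append(idx[best])                 # one rigid part found
        idx = idx[~best]                        # remove it and repeat
    return parts
```

On correspondences containing a translated cluster and a rotated cluster, the loop recovers both parts without being told there are two.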
[CV-37] Harmonic Beltrami Signature Network: a Shape Prior Module in Deep Learning Framework
【速读】:该论文旨在解决从二值图像中高效提取具有几何不变性的形状表示问题,特别是如何将形状先验信息嵌入到计算机视觉模型中以提升性能。其核心解决方案是提出Harmonic Beltrami Signature Network (HBSN),该网络通过三个关键模块实现:预空间变换网络(pre-STN)用于形状归一化,基于UNet的主干网络用于预测Harmonic Beltrami Signature (HBS),以及后置空间变换网络(post-STN)进行角度正则化。HBSN利用神经网络的函数逼近能力,实现了对复杂形状的高精度HBS计算,并可作为通用模块集成至现有分割模型中,从而有效引入几何形状先验信息,提升模型性能。
链接: https://arxiv.org/abs/2603.02907
作者: Chenran Lin,Lok Ming Lui
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents the Harmonic Beltrami Signature Network (HBSN), a novel deep learning architecture for computing the Harmonic Beltrami Signature (HBS) from binary-like images. HBS is a shape representation that provides a one-to-one correspondence with 2D simply connected shapes, with invariance to translation, scaling, and rotation. By exploiting the function approximation capacity of neural networks, HBSN enables efficient extraction and utilization of shape prior information. The proposed network architecture incorporates a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet-based backbone for HBS prediction, and a post-STN for angle regularization. Experiments show that HBSN accurately computes HBS representations, even for complex shapes. Furthermore, we demonstrate how HBSN can be directly incorporated into existing deep learning segmentation models, improving their performance through the use of shape priors. The results confirm the utility of HBSN as a general-purpose module for embedding geometric shape information into computer vision pipelines.
[CV-38] ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization
【速读】:该论文旨在解决生成式图像压缩(Generative Image Compression, GIC)中模型规模庞大、结构僵化导致的灵活性不足与低比特率场景下实用性差的问题。其解决方案的关键在于提出一种基于残差向量量化(Residual Vector Quantization, RVQ)的渐进式生成图像压缩(Progressive Generative Image Compression, ProGIC)框架,通过多阶段编码器逐层压缩残差并生成粗到细的重建结果和可分段传输的比特流,同时结合轻量化骨干网络(采用深度可分离卷积与小尺寸注意力模块),在保证感知质量的同时显著提升编码/解码效率,并支持在GPU和仅CPU设备上的实际部署。
链接: https://arxiv.org/abs/2603.02897
作者: Hao Cao,Chengbin Liang,Wenqi Guo,Zhijin Qin,Jungong Han
机构: Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); State Key Laboratory of Space Network and Communications (空间网络与通信国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains compression performance comparable to previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility, and delivers over 10 times faster encoding and decoding than MS-ILLM on GPUs for efficiency.
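The RVQ mechanism the abstract describes can be sketched in a few lines; the toy codebooks in the test are invented for illustration. Decoding any prefix of the stages yields the coarse-to-fine, progressively refinable reconstruction:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous
    stages with its own codebook; returns one index array per stage."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                           # cb: (K, d) stage codebook
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)                    # nearest codeword per vector
        indices.append(idx)
        residual = residual - cb[idx]              # pass residual onward
    return indices

def rvq_decode(indices, codebooks, n_stages=None):
    """Sum codewords over the first n_stages: decoding a prefix of the
    bitstream gives a coarse preview, more stages refine it."""
    n = len(indices) if n_stages is None else n_stages
    return sum(codebooks[s][indices[s]] for s in range(n))
```

This prefix-decodability is exactly what makes the bitstream progressive: a receiver can render `rvq_decode(indices, codebooks, k)` for any k stages received so far.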
[CV-39] 3D-DRES: Detailed 3D Referring Expression Segmentation AAAI2026
【速读】:该论文旨在解决当前3D视觉定位任务仅支持句子级检测或分割,难以利用自然语言表达中丰富的组合性上下文推理的问题。其核心解决方案是提出了一种新的任务——详细3D指代表达分割(Detailed 3D Referring Expression Segmentation, 3D-DRES),该任务通过建立短语到3D实例的映射关系,提升细粒度的3D视觉语言理解能力。关键创新在于构建了DetailRefer数据集,采用开创性的短语-实例标注范式,将每个被引用的名词短语显式映射至对应的3D元素;同时设计了轻量高效的DetailBase基线架构,支持句子与短语双模式分割,实验证明该方案不仅在新任务上表现优异,还显著提升了传统3D-RES基准的表现。
链接: https://arxiv.org/abs/2603.02896
作者: Qi Chen,Changli Wu,Jiayi Ji,Yiwei Ma,Liujuan Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI2026
点击查看摘要
Abstract:Current 3D visual grounding tasks operate only at sentence-level detection or segmentation, which fails to leverage the rich compositional contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming at enhancing fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.
[CV-40] Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
【速读】:该论文旨在解决从单张图像中重建高质量三维人体模型(3D human reconstruction)的问题,尤其针对现有方法在生成过程中易出现结构扁平化、过度平滑等伪影,以及在真实场景下泛化能力不足的挑战。其解决方案的关键在于提出MVD-HuGaS框架,通过一个增强的多视角扩散模型(multi-view diffusion model)从单图生成多视角图像,并利用高精度3D人体数据集微调以引入几何先验和人体结构先验;进一步设计对齐模块联合优化3D高斯表示与相机位姿,提升重建准确性;同时引入基于深度的面部畸变缓解模块(depth-based Facial Distortion Mitigation module)优化面部区域,从而实现高保真自由视角渲染。
链接: https://arxiv.org/abs/2603.02893
作者: Kaiqiang Xiong,Rui Peng,Jiahao Wu,Zhanke Wang,Jie Liang,Xiaoyun Zheng,Feng Gao,Ronggang Wang
机构: Peking University (北京大学); Peng Cheng Laboratory (鹏城实验室); Migu Culture Technology Co., Ltd (咪咕文化科技有限公司); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D human reconstruction from a single image is a challenging problem that has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a back-view image for facilitating reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothing results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.
[CV-41] LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval AAAI2026
【速读】:该论文旨在解决视频检索系统在面对多样化、大规模视频数据时,难以实现多模态理解、自适应推理及领域知识融合的问题。其核心解决方案是提出LLandMark框架,该框架采用模块化多智能体架构,在查询解析与规划、地标推理、多模态检索和重排序答案合成四个阶段协同工作;其中关键创新在于引入Landmark Knowledge Agent,能够识别文化或空间地标并将其转化为描述性视觉提示,从而增强CLIP(Contrastive Language–Image Pre-training)模型在越南场景下的语义匹配能力;此外,通过集成大语言模型(Gemini 2.5 Flash)驱动的图像到图像检索管道,实现了无需人工输入图像即可自动检测地标、生成搜索查询、获取代表性图像并进行视觉相似度匹配,显著提升了系统的自动化与泛化能力。
链接: https://arxiv.org/abs/2603.02888
作者: Minh-Chi Phung,Thien-Bao Le,Cam-Tu Tran-Thi,Thu-Dieu Nguyen-Thi,Vu-Hung Dao
机构: 1: University of Science, Ho Chi Minh City (胡志明市科学技术大学); 2: Vietnam National University, Ho Chi Minh City (胡志明市国家大学); 3: Institute of Information Technology, Vietnam Academy of Science and Technology (越南科学技术院信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026 Workshop on New Frontiers in Information Retrieval
点击查看摘要
Abstract:The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval designed to handle complex real-world queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.
[CV-42] Generalized non-exponential Gaussian splatting
【速读】:该论文旨在解决3D Gaussian splatting (3DGS) 中因传统指数衰减透射率模型导致的过度绘制(overdraw)问题,从而限制渲染效率的问题。其解决方案的关键在于将图像形成模型从经典的指数透射率推广至非指数形式,具体通过引入二次透射率函数定义了亚线性、线性和超线性版本的3DGS,这些新变体在保持与原始3DGS相当的重建质量的同时,实现了更快的衰减特性,显著减少了过量绘制,在基于光线追踪的渲染器中实现在复杂真实场景下高达4倍的速度提升。
链接: https://arxiv.org/abs/2603.02887
作者: Sébastien Speierer,Adrian Jarabo
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures, 4 tables
点击查看摘要
Abstract:In this work we generalize 3D Gaussian splatting (3DGS) to a wider family of physically-based alpha-blending operators. 3DGS has become the de-facto standard for radiance field rendering and reconstruction, given its flexibility and efficiency. At its core, it is based on alpha-blending sorted semitransparent primitives, which in the limit converges to the classic radiative transfer function with exponential transmittance. Inspired by recent research on non-exponential radiative transfer, we generalize the image formation model of 3DGS to non-exponential regimes. Based on this generalization, we use a quadratic transmittance to define sub-linear, linear, and super-linear versions of 3DGS, which exhibit faster-than-exponential decay. We demonstrate that these new non-exponential variants achieve similar quality to the original 3DGS but significantly reduce the number of overdraws, which results in speed-ups of up to 4× in complex real-world captures, on a ray-tracing-based renderer.
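A hedged toy (not the paper's exact quadratic-transmittance operator) showing why a transmittance that reaches zero at finite optical depth reduces overdraw: front-to-back blending can terminate exactly, instead of chasing an exponential tail:

```python
import math

def composite(alphas, mode="exp", stop=1e-3):
    """Front-to-back accumulation over sorted primitive opacities under two
    transmittance laws. exp: T = exp(-tau), which never reaches zero;
    linear: T = max(0, 1 - tau), which hits zero at finite optical depth.
    Returns (accumulated opacity, number of primitives processed)."""
    tau, processed, T = 0.0, 0, 1.0
    for a in alphas:
        tau += -math.log(max(1e-12, 1.0 - a))   # optical depth of one splat
        processed += 1
        T = math.exp(-tau) if mode == "exp" else max(0.0, 1.0 - tau)
        if T <= stop:                           # early-termination test
            break
    return 1.0 - T, processed
```

With many stacked low-opacity splats, the linear law saturates after a handful of primitives while the exponential law keeps most of them alive, which is the overdraw reduction the abstract reports.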
[CV-43] StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting
【TL;DR】: This paper addresses data-privacy leakage of Face Forgery Detection (FFD) models deployed in a client-server architecture. Existing privacy-preserving approaches such as anonymization, encryption, or distortion only partly mitigate leakage: they introduce conspicuous semantic distortion that alerts attackers and provokes more aggressive attacks, and they degrade image quality in ways that interfere with the FFD model's perception of subtle forgery traces. The key to the proposed steganography-based framework, StegaFFD, is threefold: Low-Frequency-Aware Decomposition (LFAD) suppresses interference from low-frequency cover semantics, Spatial-Frequency Differential Attention (SFDA) enhances perception of hidden facial features, and Steganographic Domain Alignment (SDA) aligns hidden-face representations with raw-image features, achieving high-fidelity privacy protection without raising suspicion while preserving FFD detection accuracy.
Link: https://arxiv.org/abs/2603.02886
Authors: Guoqing Ma, Xun Lin, Hui Ma, Ajian Liu, Yizhong Liu, Wenzhong Tang, Shan Yu, Chenqi Kong, Yi Yu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by Machine Intelligence Research
View abstract
Abstract:Most existing Face Forgery Detection (FFD) models assume access to raw face images. In practice, under a client-server framework, private facial data may be intercepted during transmission or leaked by untrusted servers. Previous privacy protection approaches, such as anonymization, encryption, or distortion, partly mitigate leakage but often introduce severe semantic distortion, making images appear obviously protected. This alerts attackers, provoking more aggressive strategies and turning the process into a cat-and-mouse game. Moreover, these methods heavily manipulate image contents, introducing degradation or artifacts that may confuse FFD models, which rely on extremely subtle forgery traces. Inspired by advances in image steganography, which enable high-fidelity hiding and recovery, we propose a Steganography-based Face Forgery Detection framework (StegaFFD) to protect privacy without raising suspicion. StegaFFD hides facial images within natural cover images and directly conducts forgery detection in the steganographic domain. However, the hidden forgery-specific features are extremely subtle and interfered with by cover semantics, posing significant challenges. To address this, we propose Low-Frequency-Aware Decomposition (LFAD) and Spatial-Frequency Differential Attention (SFDA), which suppress interference from low-frequency cover semantics and enhance hidden facial feature perception. Furthermore, we introduce Steganographic Domain Alignment (SDA) to align the representations of hidden faces with those of their raw counterparts, enhancing the model’s ability to perceive subtle facial cues in the steganographic domain. Extensive experiments on seven FFD datasets demonstrate that StegaFFD achieves strong imperceptibility, avoids raising attackers’ suspicion, and better preserves FFD accuracy compared to existing facial privacy protection methods.
[CV-44] SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
【TL;DR】: This paper addresses the difficulty of deploying Diffusion Transformers (DiT) for video generation on edge devices due to their high memory and compute costs; existing quantization methods degrade video quality under large activation variation and struggle to preserve semantic and temporal coherence. The key of the proposed SemanticDialect framework is a formatbook with lookup tables that enables low-overhead per-block optimal format selection (a "dialect"), an activation-decomposition mechanism that reduces quantization error, and semantic-aware dialect assignment (SeDA), which shares a sub-formatbook among semantically correlated tokens to improve quantized-value consistency. The method clearly outperforms existing video-DiT quantization schemes and approaches FP16 quality on Open-Sora 2.0.
Link: https://arxiv.org/abs/2603.02883
Authors: Wonsuk Jang, Thierry Tambe
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
View abstract
Abstract:Diffusion Transformers (DiT) achieve strong video generation quality, but their memory and compute costs hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality under high activation variation and the need to preserve semantic/temporal coherence. We propose SemanticDialect, which advances recent block-wise mixed-format quantization, selecting a per-block optimal format (a dialect) from multiple candidates (a formatbook), by scaling the formatbook with lookup tables for quantization error and quantized values, enabling efficient per-block format selection and quantization at low online cost. We also introduce activation decomposition that reduces quantization error by re-quantizing and adding back residual errors, with attention-guided salient token selection. We further propose semantic-aware dialect assignment (SeDA) to improve quantized value consistency by sharing a sub-formatbook among semantically correlated tokens. Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.
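The per-block format selection ("dialect") idea can be sketched with toy number grids. The candidate formats and selection rule below are hypothetical stand-ins for the paper's formatbook; they only illustrate choosing, per activation block, the format with the lowest quantization error.

```python
import random

# Hypothetical "formatbook": each format is represented by its grid of
# representable values. Real mixed-format quantizers use e.g. INT/FP variants;
# these toy grids only illustrate the per-block selection mechanism.
FORMATBOOK = {
    "uniform": [i / 7 for i in range(-7, 8)],                          # INT-like
    "nonlinear": [x * abs(x) for x in (i / 7 for i in range(-7, 8))],  # dense near 0, FP-like
}

def quantize(value, grid):
    """Round to the nearest representable value in the format's grid."""
    return min(grid, key=lambda g: abs(g - value))

def block_error(block, grid):
    return sum((v - quantize(v, grid)) ** 2 for v in block)

def select_dialect(block):
    """Pick the per-block optimal format (the block's 'dialect')."""
    return min(FORMATBOOK, key=lambda name: block_error(block, FORMATBOOK[name]))

random.seed(0)
small_block = [random.gauss(0, 0.05) for _ in range(64)]  # activations clustered near 0
wide_block = [random.uniform(-1, 1) for _ in range(64)]   # spread-out activations
```

Blocks with activations clustered near zero favor the grid that is dense near zero, while spread-out blocks favor the uniform grid, which is exactly the per-block trade-off the formatbook resolves.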
[CV-45] SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion ICLR2026
【TL;DR】: This paper tackles two core problems of watermarking AI-generated video: existing in-generation watermarking methods must store all message-key pairs and use template matching for extraction, incurring prohibitive cost at scale, and they are extremely fragile to temporal disturbance in modern video diffusion models built on causal 3D Variational Autoencoders (VAEs). The key of the proposed SIGMark framework is twofold: (1) a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC) enables blind extraction, drastically cutting storage and compute while preserving the initial-noise distribution and diversity for distortion-free watermarking; (2) a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs makes watermark recovery robust to temporal disturbance, safeguarding extraction accuracy. Experiments show SIGMark maintains high bit accuracy under spatio-temporal disturbances with minimal overhead, demonstrating scalability and robustness.
Link: https://arxiv.org/abs/2603.02882
Authors: Xinjie Zhu, Zijing Zhao, Hui Jin, Qingxiao Guo, Yilong Ma, Yunhao Wang, Xiaobing Guo, Weifeng Zhang
Affiliations: Lenovo Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2026
View abstract
Abstract:Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has been advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind-extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at this https URL.
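The advantage of regenerable frame-wise pseudorandom keys is that extraction needs only the global key, not a stored template per message. The sketch below is a minimal, generic illustration of keyed blind watermarking via sign correlation; the actual GF-PRC construction operates on diffusion initial noise and differs in detail.

```python
import random

def frame_code(key, frame_idx, n=256):
    """Regenerable pseudorandom +/-1 code for a frame. No per-message storage
    is needed: only the global key, which is the idea behind blind extraction."""
    rng = random.Random(key * 100003 + frame_idx)
    return [rng.choice((-1, 1)) for _ in range(n)]

def embed_bit(key, frame_idx, bit, noise):
    """Align the latent-noise signs with the code (bit=1) or its negation (bit=0),
    keeping magnitudes (and hence the noise distribution) untouched."""
    code = frame_code(key, frame_idx, len(noise))
    sign = 1 if bit else -1
    return [sign * c * abs(x) for c, x in zip(code, noise)]

def extract_bit(key, frame_idx, noise):
    """Blind extraction: correlate recovered noise with the regenerated code."""
    code = frame_code(key, frame_idx, len(noise))
    corr = sum(c * x for c, x in zip(code, noise))
    return 1 if corr > 0 else 0

rng = random.Random(42)
noise = [rng.gauss(0, 1) for _ in range(256)]
bits = [1, 0, 1, 1, 0]
marked = [embed_bit(7, i, b, noise) for i, b in enumerate(bits)]
decoded = [extract_bit(7, i, m) for i, m in enumerate(marked)]
```

Because the code is regenerated from (key, frame index), extraction scales to any number of messages without a template database.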
[CV-46] Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
【TL;DR】: This paper addresses the reasoning latency of Large Vision Language Models (LVLMs) on video streams: existing approaches mostly assume the full video is available before inference, which mismatches the streaming nature of real-world video, where information arrives frame by frame over time. The proposed Think-as-You-See (TaYS) framework achieves true concurrent streaming reasoning by combining parallelized Chain-of-Thought (CoT) generation, stream-constrained training, and stream-parallel inference, together with temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. TaYS substantially reduces time-to-first-token (TTFT) and overall reasoning delay while maintaining strong reasoning performance.
Link: https://arxiv.org/abs/2603.02872
Authors: Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen
Affiliations: Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Ocean University of China; Shanghai Jiao Tong University; The Hong Kong Polytechnic University; The University of Nottingham Ningbo China; Munich Center for Machine Learning, LMU Munich; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
View abstract
Abstract:Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at this https URL.
[CV-47] Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis
【TL;DR】: This paper addresses the degraded reconstruction quality of sparse-view novel view synthesis, in particular how to recover fine detail robustly from limited observations. The key contribution is a multimodal-prior-guided importance sampling mechanism that fuses photometric rendering residuals, semantic priors, and geometric priors into a reliable local-recoverability estimate, which precisely guides where fine Gaussians are injected. Built around this sampling core, the hierarchical 3D Gaussian Splatting (3DGS) framework prioritizes regions supported by consistent multimodal evidence within its coarse-to-fine representation while protecting newly added Gaussians in underconstrained areas from premature pruning, mitigating texture-induced overfitting and noise from pose/appearance inconsistencies, and achieving state-of-the-art reconstruction on several sparse-view benchmarks.
Link: https://arxiv.org/abs/2603.02866
Authors: Kaiqiang Xiong, Zhanke Wang, Ronggang Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
View abstract
Abstract:We present multimodal-prior-guided importance sampling as the central mechanism for hierarchical 3D Gaussian Splatting (3DGS) in sparse-view novel view synthesis. Our sampler fuses complementary cues – photometric rendering residuals, semantic priors, and geometric priors – to produce a robust, local recoverability estimate that directly drives where to inject fine Gaussians. Built around this sampling core, our framework comprises (1) a coarse-to-fine Gaussian representation that encodes global shape with a stable coarse layer and selectively adds fine primitives where the multimodal metric indicates recoverable detail; and (2) a geometric-aware sampling and retention policy that concentrates refinement on geometrically critical and complex regions while protecting newly added primitives in underconstrained areas from premature pruning. By prioritizing regions supported by consistent multimodal evidence rather than raw residuals alone, our method alleviates overfitting texture-induced errors and suppresses noise from pose/appearance inconsistencies. Experiments on diverse sparse-view benchmarks demonstrate state-of-the-art reconstructions, with up to +0.3 dB PSNR on DTU.
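The core of the method is turning several fused cues into a sampling distribution over where to inject fine Gaussians. The sketch below uses a simple product as the fusion rule, which is our own illustrative choice, not the paper's exact formula; it only shows why requiring consistent multimodal evidence downweights cells flagged by a single noisy cue.

```python
import random

random.seed(0)

def recoverability(residual, semantic, geometric):
    """Fuse the three cues into one local recoverability score per cell.
    Requiring consistent evidence (a product) rather than raw residuals alone
    downweights texture-induced errors that only one cue flags."""
    return [r * s * g for r, s, g in zip(residual, semantic, geometric)]

def sample_injection_sites(scores, k):
    """Importance-sample k cells (with replacement) where fine Gaussians go."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return random.choices(range(len(scores)), weights=probs, k=k)

residual = [0.9, 0.8, 0.1, 0.1]   # high photometric error in cells 0-1
semantic = [0.9, 0.1, 0.9, 0.1]   # but only cell 0 is semantically salient
geometric = [0.8, 0.2, 0.2, 0.9]  # and geometrically complex
scores = recoverability(residual, semantic, geometric)
sites = sample_injection_sites(scores, k=1000)
```

Cell 0, supported by all three cues, dominates the samples, while cell 1 (a texture-induced residual spike) is largely ignored.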
[CV-48] Scale-invariant Gaussian derivative residual networks
【TL;DR】: This paper addresses a fundamental challenge for deep networks, generalization across image scales: performance drops sharply at scales not seen during training (the out-of-distribution problem). The key solution is provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), built from cascaded scale-covariant Gaussian derivative residual blocks; residual skip connections enable deeper and more accurate networks while preserving excellent scale invariance. The paper proves the underlying scale-covariance and scale-invariance properties in arbitrary dimensions, and experiments on rescaled versions of STL-10, Fashion-MNIST, and CIFAR-10 demonstrate strong scale generalization and scale selection.
Link: https://arxiv.org/abs/2603.02843
Authors: Andrzej Perzanowski, Tony Lindeberg
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 39 pages, 23 figures, 5 tables
View abstract
Abstract:Generalisation across image scales remains a fundamental challenge for deep networks, which often fail to handle images at scales not seen during training (the out-of-distribution problem). In this paper, we present provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing this problem. By adding residual skip connections to the previous notion of Gaussian derivative layers, deeper networks with substantially increased accuracy can be constructed, while preserving very good scale generalisation properties at the higher level of accuracy. Explicit proofs are provided regarding the underlying scale-covariant and scale-invariant properties in arbitrary dimensions. To analyse the ability of GaussDerResNets to generalise to new scales, we apply them on the new rescaled version of the STL-10 dataset, where training is done at a single fixed scale and evaluation is performed on multiple copies of the test set, each rescaled to a single distinct spatial scale, with scale factors extending over a range of 4. We also conduct similar systematic experiments on the rescaled versions of Fashion-MNIST and CIFAR-10 datasets. Experimentally, we demonstrate that the GaussDerResNets have strong scale generalisation and scale selection properties on all the three rescaled datasets. In our ablation studies, we investigate different architectural variants of GaussDerResNets, demonstrating that basing the architecture on depthwise-separable convolutions allows for decreasing both the number of parameters and the amount of computations, with reasonably maintained accuracy and scale generalisation. 
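The scale-covariance and scale-selection properties underlying Gaussian derivative networks can be checked in closed form for a 1-D Gaussian blob: the scale-normalized second-derivative response peaks at a filter scale proportional to the blob width, so rescaling the input simply shifts the selected scale. A stdlib-only sketch, using the standard gamma = 1 normalization (our own illustration, not the paper's network):

```python
import math

def scale_norm_second_deriv_response(signal_sigma, filter_sigma):
    """Magnitude of the scale-normalized second-derivative response at the
    centre of a Gaussian blob of width signal_sigma, filtered at filter_sigma.
    Convolving two Gaussians gives a Gaussian of variance s1^2 + s2^2, whose
    second derivative at x=0 has magnitude 1 / (sqrt(2*pi) * s_c^3); the
    gamma-normalized response multiplies this by filter_sigma^2."""
    sc2 = signal_sigma ** 2 + filter_sigma ** 2
    return filter_sigma ** 2 / (math.sqrt(2 * math.pi) * sc2 ** 1.5)

def selected_scale(signal_sigma, scales):
    """Scale selection: the filter scale that maximizes the normalized response.
    Analytically the maximum is at filter_sigma = sqrt(2) * signal_sigma."""
    return max(scales, key=lambda s: scale_norm_second_deriv_response(signal_sigma, s))

scales = [i / 10 for i in range(1, 51)]
s1 = selected_scale(1.0, scales)  # near sqrt(2) ~ 1.4 on this grid
s2 = selected_scale(2.0, scales)  # doubling the blob width doubles the selected scale
```

This doubling of the selected scale under a doubling of the input structure is exactly the covariance property that the networks preserve by construction.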
[CV-49] Toward Early Quality Assessment of Text-to-Image Diffusion Models
【TL;DR】: This paper addresses the high resource cost of the "generate-then-select" mode of text-to-image (T2I) systems, in which many candidate images each require tens to hundreds of denoising steps and are filtered only afterwards by post-hoc metrics such as CLIPScore and ImageReward. The key of the proposed Probe-Select module is that early denoiser activations in diffusion and flow-matching models already encode a stable coarse structure, object layout, and spatial arrangement that strongly correlate with final image fidelity; predicting the final quality score directly from these early activations allows unpromising seeds to be terminated early, cutting sampling cost by over 60% while improving the quality of the retained images. The module plugs into existing pipelines without modifying the underlying generative model.
Link: https://arxiv.org/abs/2603.02829
Authors: Huanlei Guo, Hongxin Wei, Bingyi Jing
Affiliations: Southern University of Science and Technology; The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
View abstract
Abstract:Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure, object layout and spatial arrangement, that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at this https URL.
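The selection logic can be sketched as ranking seeds with a probe on early features and keeping only the top candidates. The toy below is entirely synthetic (the "probe", features, and quality model are our own assumptions): it only demonstrates that when early features correlate strongly with the final score, early ranking nearly matches final ranking, so unpromising seeds can be stopped at a fraction of the trajectory.

```python
import random

random.seed(0)

# Toy stand-in: assume a linear probe true_w has been fit offline to map
# early-step denoiser features to a final quality score.
DIM = 8
true_w = [random.gauss(0, 1) for _ in range(DIM)]

def early_features(seed):
    """Pretend features extracted at ~20% of the denoising trajectory."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(DIM)]

def probe_score(seed):
    """Early quality prediction from early activations alone."""
    return sum(w * f for w, f in zip(true_w, early_features(seed)))

def final_quality(seed):
    """Final quality = early structure signal plus small late-stage noise."""
    late = random.Random(seed + 10**6).gauss(0, 0.1)
    return probe_score(seed) + late

seeds = list(range(100))
top_by_probe = set(sorted(seeds, key=probe_score, reverse=True)[:10])
top_by_final = set(sorted(seeds, key=final_quality, reverse=True)[:10])
overlap = len(top_by_probe & top_by_final)  # high overlap => early exit is safe
```

With a strong early signal, the 90 seeds outside the probe's top-10 could have been terminated early at little risk to final selection quality.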
[CV-50] BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation
【TL;DR】: This paper addresses seamless brand integration in text-to-video (T2V) models: automatically embedding advertiser brands into prompt-generated videos without breaking the semantics of user intent. The task faces three core challenges: preserving prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. The key of the proposed BrandFusion multi-agent framework is its two phases: an offline (advertiser-facing) phase builds a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning, while an online (user-facing) phase has five cooperating agents iteratively refine the user prompt, leveraging the shared knowledge base and real-time contextual tracking to guarantee brand visibility and semantic consistency, significantly improving brand recognizability, semantic fidelity, and integration naturalness.
Link: https://arxiv.org/abs/2603.02816
Authors: Zihao Zhu, Ruotong Wang, Siwei Lyu, Min Zhang, Baoyuan Wu
Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; State University of New York at Buffalo; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
View abstract
Abstract:The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework comprising two synergistic phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.
[CV-51] ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink
【TL;DR】: This paper addresses the lack of a unified representation for digital ink: continuous vector representations produce long sequences and suffer from training instability, while existing tokenizations face large vocabularies, out-of-vocabulary issues, and weaker recognition than vectors. The key of ScribeTokens is to decompose pen movement into unit pixel steps; together with two pen-state tokens, a fixed base vocabulary of 10 tokens suffices to represent any digital ink and enables aggressive BPE compression. This design markedly improves both generation and recognition; notably, ScribeTokens is the only tokenization to outperform vectors without pretraining, and the proposed next-ink-token prediction, used as self-supervised pretraining, further accelerates convergence and improves recognition accuracy.
Link: https://arxiv.org/abs/2603.02805
Authors: Douglass Wang
Affiliations: Independent Researcher
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
View abstract
Abstract:Digital ink – the coordinate stream captured from stylus or touch input – lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
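A 10-token base vocabulary for arbitrary ink follows from 8 unit pixel steps (4-connected plus diagonals) and two pen-state tokens. The sketch below is our own reading of that decomposition; the token names are hypothetical, and the paper's exact encoding may differ in detail.

```python
# 8 unit pixel-step tokens plus two pen-state tokens give a fixed 10-token
# base vocabulary. Token names here are our own labels.
STEPS = {(1, 0): "R", (-1, 0): "L", (0, 1): "U", (0, -1): "D",
         (1, 1): "RU", (-1, 1): "LU", (1, -1): "RD", (-1, -1): "LD"}
PEN_DOWN, PEN_UP = "PD", "PU"

def tokenize_stroke(points):
    """Decompose a stroke (integer pixel coordinates) into unit steps.
    Non-unit jumps are broken into a chain of unit (possibly diagonal) steps."""
    tokens = [PEN_DOWN]
    x, y = points[0]
    for nx, ny in points[1:]:
        while (x, y) != (nx, ny):
            dx = (nx > x) - (nx < x)   # sign of remaining x distance
            dy = (ny > y) - (ny < y)   # sign of remaining y distance
            tokens.append(STEPS[(dx, dy)])
            x, y = x + dx, y + dy
    tokens.append(PEN_UP)
    return tokens

toks = tokenize_stroke([(0, 0), (2, 0), (2, 3)])
# ["PD", "R", "R", "U", "U", "U", "PU"]
```

Because every ink is a string over this fixed alphabet, there are no out-of-vocabulary cases, and frequent step runs are natural merge candidates for BPE.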
[CV-52] Structure-Aware Text Recognition for Ancient Greek Critical Editions
【TL;DR】: This paper addresses the limited ability of current visual language models (VLMs) to interpret the complex layout semantics of historical scholarly texts, in particular the dense reference hierarchies and extensive marginal annotations of Ancient Greek critical editions. The key contribution is two new resources: a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and a carefully curated benchmark of real scanned editions spanning more than a century of editorial and typographic practice. Evaluating three state-of-the-art VLMs on these datasets in both zero-shot and fine-tuned regimes exposes substantial limitations of current VLM architectures on highly structured historical documents, while the Qwen3VL-8B model reaches a 1.0% median Character Error Rate (CER) on real scans, showing the strong potential of structure-aware text recognition.
Link: https://arxiv.org/abs/2603.02803
Authors: Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
View abstract
Abstract:Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
[CV-53] NOVA: Sparse Control Dense Synthesis for Pair-Free Video Editing CVPR2026
【TL;DR】: This paper addresses the bottleneck of current video editing models' dependence on large-scale paired datasets, which are especially hard to collect as naturally aligned pairs for local video editing. Existing workarounds transfer image editing to video via global motion control but struggle with background and temporal consistency. The key of the proposed NOVA framework is its "sparse control, dense synthesis" paradigm: a sparse branch provides semantic guidance from user-edited keyframes, while a dense branch continuously incorporates motion and texture from the original video to maintain high fidelity and coherence. A degradation-simulation training strategy further trains the model on artificially degraded videos to learn motion reconstruction and temporal consistency, enabling high-quality video editing without any real paired data.
Link: https://arxiv.org/abs/2603.02802
Authors: Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu, Caifeng Shan, Chenyang Si
Affiliations: Nanjing University; University of Chinese Academy of Sciences; WeChat, Tencent; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
View abstract
Abstract:Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
[CV-54] R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild
【TL;DR】: This paper addresses two limitations of 3D Gaussian Splatting (3DGS) for in-the-wild scenes: the absence of an illumination model, which rules out relighting, and degraded reconstruction quality under the varying lighting of unconstrained photo collections. The key of R3GW is to separate the scene into a relightable foreground and a non-reflective background (the sky) using two distinct sets of Gaussians, and to explicitly model view-dependent lighting effects in the foreground with Physically Based Rendering (PBR), enabling photorealistic novel view synthesis under arbitrary illumination while improving depth reconstruction and rendering quality at the sky-foreground boundary.
Link: https://arxiv.org/abs/2603.02801
Authors: Margherita Lea Corona, Wieland Morgenstern, Peter Eisert, Anna Hilsmann
Affiliations: Fraunhofer Heinrich Hertz Institute; Humboldt Universität zu Berlin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at VISAPP 2026
View abstract
Abstract:3D Gaussian Splatting (3DGS) has established itself as a leading technique for 3D reconstruction and novel view synthesis of static scenes, achieving outstanding rendering quality and fast training. However, the method does not explicitly model the scene illumination, making it unsuitable for relighting tasks. Furthermore, 3DGS struggles to reconstruct scenes captured in the wild by unconstrained photo collections featuring changing lighting conditions. In this paper, we present R3GW, a novel method that learns a relightable 3DGS representation of an outdoor scene captured in the wild. Our approach separates the scene into a relightable foreground and a non-reflective background (the sky), using two distinct sets of Gaussians. R3GW models view-dependent lighting effects in the foreground reflections by combining Physically Based Rendering with the 3DGS scene representation in a varying illumination setting. We evaluate our method quantitatively and qualitatively on the NeRF-OSR dataset, offering state-of-the-art performance and enhanced support for physically-based relighting of unconstrained scenes. Our method synthesizes photorealistic novel views under arbitrary illumination conditions. Additionally, our representation of the sky mitigates depth reconstruction artifacts, improving rendering quality at the sky-foreground boundary
[CV-55] VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning
【TL;DR】: This paper addresses the inability of static multimodal models to acquire information dynamically: despite strong perception, they cannot access or exploit up-to-date web information, which limits their use in complex multi-turn tool-calling tasks. The key of the proposed VSearcher, a reinforcement-learning-based multimodal search agent, is an Iterative Injection Data Synthesis pipeline that generates large-scale, challenging multimodal QA data, and an SFT-then-RL training pipeline that equips a base multimodal model for multi-turn tool calling in real web environments, including text search, image search, and web browsing. The paper also contributes MM-SearchExam, a benchmark for systematically evaluating multimodal search agents; experiments show VSearcher outperforms existing approaches on several metrics and even surpasses some proprietary models.
Link: https://arxiv.org/abs/2603.02795
Authors: Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 6 figures
View abstract
Abstract:Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. On the other hand, multimodal large models, while offering stronger perceptual capabilities, remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, turning static multimodal model into multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing, via reinforcement learning. Specifically, we introduce Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models to agent capable of multi-turn tool calling in real-world web environments. Besides, we propose a multimodal search benchmark MM-SearchExam dedicated to evaluating search capabilities of multimodal search agents, which proves highly challenging for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks reveal effectiveness of our method. VSearcher achieves superior performance compared to recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.
[CV-56] Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology Radiology and Natural Language
【TL;DR】: This paper addresses the lack of a unified, standardized, and reproducible evaluation framework for medical foundation models: existing public benchmarks are typically fragmented across task-, organ-, or modality-specific settings and cannot assess cross-task generalization. The key design of the UNICORN benchmark is a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation, together with indirectly accessible sequestered test sets derived from clinically relevant cohorts, standardized evaluation code, and a submission interface on an open platform, enabling fair comparison of multi-task, multi-modality medical models. Performance is aggregated into the newly proposed UNICORN Score, supporting direct comparison of models across medical domains, modalities, and task types.
Link: https://arxiv.org/abs/2603.02790
Authors: Michelle Stegeman, Lena Philipp, Fennie van der Graaf, Marina D’Amato, Clément Grisi, Luc Builtjes, Joeran S. Bosma, Judith Lefkes, Rianne A. Weber, James A. Meakin, Thomas Koopman, Anne Mickan, Mathias Prokop, Ewoud J. Smit, Geert Litjens, Jeroen van der Laak, Bram van Ginneken, Maarten de Rooij, Henkjan Huisman, Colin Jacobs, Francesco Ciompi, Alessa Hering (and on behalf of the UNICORN consortium)
Affiliations: Radboud University Medical Center; Oncode Institute; University of Amsterdam; University Medical Center Amsterdam
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper describes the dataset and design of the UNICORN challenge and provides the link to Grand Challenge
View abstract
Abstract:Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via this http URL.
[CV-57] HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning CVPR
【TL;DR】: This paper addresses a weakness of LoRA-based federated fine-tuning of Vision Transformers (ViTs): existing methods overlook the latent client structure of real-world federations, which limits shared representation learning and hinders effective adaptation to unseen clients. The key of the proposed hierarchical LoRA framework (HiLoRA) is to place adapters at three tiers (root, cluster, and leaf) capturing global, subgroup, and client-specific knowledge, respectively; cross-tier orthogonality and cascaded optimization separate the update subspaces, and a LoRA-subspace adaptive clustering mechanism infers latent client groups from subspace-similarity analysis, promoting knowledge sharing among structurally aligned clients, with a tier-wise generalization analysis supporting the design theoretically.
Link: https://arxiv.org/abs/2603.02785
Authors: Zihao Peng, Nan Zou, Jiandian Zeng, Guo Li, Ke Chen, Boyuan Li, Tian Wang
Affiliations: Beijing Normal University; Zhengzhou University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
View abstract
Abstract:Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA’s design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.
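The three-tier adapter layout can be sketched as a sum of low-rank updates on a frozen weight; the ranks, the initialization, and the form of the cross-tier orthogonality penalty below are assumptions for illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4

# Frozen pretrained weight shared by all clients.
W0 = rng.standard_normal((d_out, d_in))

def lora_pair(rank):
    # Standard LoRA init: A random, B zero, so each adapter starts as a no-op.
    A = rng.standard_normal((rank, d_in)) * 0.01
    B = np.zeros((d_out, rank))
    return B, A

# Root, cluster, and leaf adapters capture global, subgroup,
# and client-specific residuals, respectively.
B_root, A_root = lora_pair(r)
B_clu,  A_clu  = lora_pair(r)
B_leaf, A_leaf = lora_pair(r)

def effective_weight():
    # Hierarchical update: tier contributions are summed onto the frozen backbone.
    return W0 + B_root @ A_root + B_clu @ A_clu + B_leaf @ A_leaf

def ortho_penalty(A1, A2):
    # One plausible cross-tier orthogonality penalty: discourage overlap
    # between the row spaces of two tiers' A matrices.
    return np.sum((A1 @ A2.T) ** 2)
```

With the zero-initialized B matrices, the effective weight equals the frozen backbone before any training, so adding tiers never perturbs the starting point.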
[CV-58] ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
【Quick Read】: This paper addresses the fact that current image-text contrastive pretraining still leaves learned representations partially organized by modality, i.e., visual representations remain entangled with the image or text modality. The key to the solution is the ITO framework, built on two synergistic mechanisms: multimodal multiple alignment, which enriches supervision by mining diverse image-text correspondences, and a lightweight training-time multimodal fusion module that enforces structured cross-modal interaction. The fusion module acts as a critical structural regularizer during training, eliminating the modality gap and stabilizing training dynamics, and is discarded at inference, preserving the efficiency of the dual-encoder architecture.
Link: https://arxiv.org/abs/2603.02767
Authors: HanZpeng Liu,Yaqian Li,Zidan Wang,Shuoxi Zhang,Zonglin Zhao,Zihao Bo,Rinyoichi Takezoe,Kaiwen Long,Kun He
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer – eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
[CV-59] Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing
【Quick Read】: This paper addresses the pronounced hallucinations of multimodal large language models (MLLMs) in remote sensing visual question answering (RS-VQA), especially factual hallucinations caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. The key to the solution is Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages the intrinsic attention of MLLMs to guide progressive localization and fine-grained local reasoning at test time, thereby mitigating grounding-induced factual hallucinations while also reducing logical hallucinations.
Link: https://arxiv.org/abs/2603.02754
Authors: Yi Liu,Jing Zhang,Di Wang,Xiaoyu Tian,Haonan Guo,Bo Du
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: this https URL
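The attention-guided progressive localization idea, using the model's own attention map to pick a sub-region and then reasoning on the crop, can be illustrated on a toy map. The thresholding rule and crop logic here are illustrative assumptions, not the published method:

```python
import numpy as np

def localize_from_attention(attn, keep_ratio=0.1):
    """Return the bounding box (r0, r1, c0, c1) enclosing the
    top-`keep_ratio` fraction of attention cells."""
    thresh = np.quantile(attn, 1.0 - keep_ratio)
    rows, cols = np.where(attn >= thresh)
    return rows.min(), rows.max() + 1, cols.min(), cols.max() + 1

# Toy 8x8 attention map with a hot spot: the small target the
# coarse pass should localize before a fine-grained second pass.
attn = np.full((8, 8), 0.1)
attn[4:7, 1:4] = 1.0

r0, r1, c0, c1 = localize_from_attention(attn)
crop = attn[r0:r1, c0:c1]  # region a second, fine-grained pass would inspect
```

Cropping to the high-attention region and re-querying on the crop is what lets a test-time method sharpen grounding without any retraining.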
[CV-60] iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
【Quick Read】: This paper addresses the representation bottleneck in Large Vision-Language Models (LVLMs): existing architectures rely on static, instruction-agnostic vision encoders, so visual representations are used in an invariant manner across different textual tasks, limiting fine-grained reasoning. The key to the solution is the iGVLM framework, built on a decoupled dual-branch design: a frozen representation branch preserves the task-agnostic visual representations learned during pre-training, while a dynamic conditioning branch performs affine feature modulation via Adaptive Layer Normalization (AdaLN), enabling a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors.
Link: https://arxiv.org/abs/2603.02748
Authors: HanZpeng Liu,Yaqian Li,Zidan Wang,Shuoxi Zhang,Zihao Bo,Rinyoichi Takezoe,Kaiwen Long,Kun He
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Despite the success of Large Vision–Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
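AdaLN-style affine modulation, where a conditioning vector predicts per-channel scale and shift applied to normalized features, can be sketched as follows; the shapes and the zero-init convention (which makes the modulation start as an identity map) are assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_modulate(x, cond, W, b):
    """AdaLN: a conditioning vector predicts per-channel scale and shift.

    x:    (tokens, d) visual features from the frozen branch
    cond: (d_c,) instruction embedding
    W, b: parameters of the conditioning projection (zero-init keeps
          the modulation close to identity at the start of training).
    """
    scale_shift = cond @ W + b            # (2d,)
    d = x.shape[-1]
    scale, shift = scale_shift[:d], scale_shift[d:]
    return (1.0 + scale) * layer_norm(x) + shift

rng = np.random.default_rng(0)
d, d_c, n = 8, 4, 3
x = rng.standard_normal((n, d))
cond = rng.standard_normal(d_c)
W = np.zeros((d_c, 2 * d))   # zero-init: modulation starts as identity
b = np.zeros(2 * d)

y = adaln_modulate(x, cond, W, b)
```

The zero-init makes the conditioning branch a no-op initially, which is one common way such a design can preserve pre-trained visual priors while gradually learning instruction sensitivity.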
[CV-61] CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model
【Quick Read】: This paper addresses multi-object shadow generation, i.e., how to synthesize physically plausible shadows that remain jointly consistent in geometry, attachment, and location when multiple foreground objects are composited into an image. Existing methods mostly target single-object insertion and generalize poorly to multi-object composites. The key to the solution is exploiting the multimodal capabilities of a pretrained text-to-image diffusion model: an image pathway injects dense, multi-scale features for fine-grained spatial guidance, while a text pathway encodes each object's shadow bounding box as learned positional tokens fused via cross-attention; an attention-alignment loss further grounds these tokens to their corresponding shadow regions, enabling consistent multi-object shadow generation.
Link: https://arxiv.org/abs/2603.02743
Authors: Waqas Ahmed,Dean Diepeveen,Ferdous Sohel
Affiliations: Murdoch University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.
[CV-62] Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
【Quick Read】: This paper addresses the need for medical image segmentation models that preserve fine anatomical boundaries while remaining efficient enough for clinical deployment. Transformers capture long-range dependencies but suffer from quadratic attention cost, heavy data requirements, and training instability, while CNNs are compute-friendly yet lack global reasoning. The key innovation of the proposed PVT-GDLA is Gated Differential Linear Attention (GDLA): two kernelized attention paths are computed on complementary query/key subspaces and subtracted, with a learnable channel-wise scale that cancels common-mode noise and amplifies relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity to mitigate attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions and boundary fidelity, all while retaining linear O(N) time complexity and low parameter overhead.
Link: https://arxiv.org/abs/2603.02727
Authors: Hongbo Zheng,Afshin Bozorgpour,Dorit Merhof,Minjia Zhang
Affiliations: University of Illinois Urbana-Champaign; University of Regensburg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers \mathcal{O}(N) scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining \mathcal{O}(N) complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
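The core subtraction mechanism, two kernelized linear-attention paths on complementary query/key halves combined with a learnable channel-wise scale, can be sketched as follows. The ELU+1 feature map, the half-split of the subspaces, and the scale value are assumptions, and the gate and local token-mixing branch are omitted:

```python
import numpy as np

def phi(x):
    # Positive feature map commonly used in linear attention: ELU(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # O(N) form: accumulate K^T V once, then read out per query.
    kv = phi(K).T @ V                        # (d_k, d_v)
    z = phi(Q) @ phi(K).sum(axis=0)          # (n,) normalizer
    return (phi(Q) @ kv) / (z[:, None] + eps)

def gdla(Q, K, V, lam):
    """Differential linear attention sketch: split Q/K into two halves,
    run two attention paths, and subtract them with a channel-wise
    scale `lam` to cancel common-mode noise."""
    d = Q.shape[1] // 2
    a1 = linear_attention(Q[:, :d], K[:, :d], V)
    a2 = linear_attention(Q[:, d:], K[:, d:], V)
    return a1 - lam * a2

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 5
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
lam = np.full(d_v, 0.5)   # learnable channel-wise differential scale

out = gdla(Q, K, V, lam)
```

Because `kv` is accumulated once over all keys, the cost grows linearly in the number of tokens rather than quadratically as in softmax attention.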
[CV-63] Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement
【Quick Read】: This paper addresses key challenges in cross-view geo-localization (CVGL), including severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of local information. Existing methods rely mainly on spatial-domain feature alignment, which is sensitive to large viewpoint changes and local disturbances, making robust cross-view matching difficult. The key to the solution is the Spatial and Frequency Domain Enhancement Network (SFDE), which adopts a three-branch parallel architecture to model global semantic context, local geometric structure, and statistical stability in the frequency domain, respectively, characterizing cross-domain consistency in terms of scene topology, multiscale structural patterns, and frequency invariance. The complementary features are jointly optimized in a unified embedding space via progressive enhancement and coupled constraints, learning cross-view representations that are consistent across multiple granularities.
Link: https://arxiv.org/abs/2603.02726
Authors: Hongying Zhang,ShuaiShuai Ma
Affiliations: Civil Aviation University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints and constitutes a fundamental technique for visual localization in GNSS-denied environments. Nevertheless, CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information. Existing methods predominantly rely on spatial domain feature alignment, which is inherently sensitive to large scale viewpoint variations and local disturbances. To alleviate these limitations, this paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from spatial and frequency domains. SFDE adopts a three branch parallel architecture to model global semantic context, local geometric structure, and statistical stability in the frequency domain, respectively, thereby characterizing consistency across domains from the perspectives of scene topology, multiscale structural patterns, and frequency invariance. The resulting complementary features are jointly optimized in a unified embedding space via progressive enhancement and coupled constraints, enabling the learning of cross-view representations with consistency across multiple granularities. Comprehensive experiments show that SFDE achieves competitive performance and in many cases even surpasses state-of-the-art methods, while maintaining a lightweight and computationally efficient design. Our code is available at this https URL
[CV-64] TenExp: Mixture-of-Experts-Based Tensor Decomposition Structure Search Framework
【Quick Read】: This paper addresses a challenge current tensor decomposition methods face in capturing the low-rank structure of data: selecting a suitable tensor decomposition from a fixed factor-interaction family (e.g., tensor contraction) to exactly model complex data structures, with no way to express mixtures of multiple decompositions. The key to the solution is TenExp, a mixture-of-experts-based tensor decomposition structure search framework that dynamically selects and activates the best single decomposition or mixture of decompositions in an unsupervised fashion, going beyond the fixed factor-interaction structures of traditional methods, with a theoretical approximation error bound establishing its approximation capability.
Link: https://arxiv.org/abs/2603.02720
Authors: Ting-Wei Zhou,Xi-Le Zhao,Sheng Liu,Wei-Hao Wu,Yu-Bang Zheng,Deyu Meng
Affiliations: University of Electronic Science and Technology of China; Southwest Jiaotong University; Xi'an Jiaotong University; Pazhou Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Recently, tensor decompositions continue to emerge and receive increasing attention. Selecting a suitable tensor decomposition to exactly capture the low-rank structures behind the data is at the heart of the tensor decomposition field, which remains a challenging and relatively under-explored problem. Current tensor decomposition structure search methods are still confined by a fixed factor-interaction family (e.g., tensor contraction) and cannot deliver the mixture of decompositions. To address this problem, we elaborately design a mixture-of-experts-based tensor decomposition structure search framework (termed as TenExp), which allows us to dynamically select and activate suitable tensor decompositions in an unsupervised fashion. This framework enjoys two unique advantages over the state-of-the-art tensor decomposition structure search methods. Firstly, TenExp can provide a suitable single decomposition beyond a fixed factor-interaction family. Secondly, TenExp can deliver a suitable mixture of decompositions beyond a single decomposition. Theoretically, we also provide the approximation error bound of TenExp, which reveals the approximation capability of TenExp. Extensive experiments on both synthetic and realistic datasets demonstrate the superiority of the proposed TenExp compared to the state-of-the-art tensor decomposition-based methods.
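The idea of gating over candidate decompositions can be illustrated with two classical experts, a CP model and a Tucker model, mixed by a softmax gate. The expert choices, ranks, and gate logits below are placeholders, not the paper's search procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, r = 4, 5, 6, 2

# Expert 1: CP decomposition (sum of rank-1 terms).
A, B, C = (rng.standard_normal((n, r)) for n in (I, J, K))
cp = np.einsum('ir,jr,kr->ijk', A, B, C)

# Expert 2: Tucker decomposition (core tensor times factor matrices).
G = rng.standard_normal((r, r, r))
U1, U2, U3 = (rng.standard_normal((n, r)) for n in (I, J, K))
tucker = np.einsum('pqr,ip,jq,kr->ijk', G, U1, U2, U3)

# A softmax gate selects/mixes decompositions; the logits would be
# learned during structure search, so the values here are placeholders.
logits = np.array([0.3, -0.1])
w = np.exp(logits) / np.exp(logits).sum()
mixture = w[0] * cp + w[1] * tucker
```

A gate close to one-hot recovers a single decomposition, while an intermediate gate yields a genuine mixture, which is the extra expressiveness the abstract claims over fixed-family search.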
[CV-65] From “What” to “How”: Constrained Reasoning for Autoregressive Image Generation
【Quick Read】: This paper addresses a core problem of autoregressive image generation: implausible spatial structure. Existing models merely specify "What" to depict by rewriting the input prompt, but cannot reason about "How" to structure the image, leading to ambiguous spatial relationships and even unrealistic object overlaps. The key to the solution is a new "How-to-What" paradigm, the CoR-Painter framework, whose core innovation is Constrained Reasoning: a set of visual constraints is first derived from the input prompt to explicitly govern spatial layout, key attributes, and compositional rules, and these constraints then guide the generation of a structurally sound and semantically coherent detailed description, enabling high-quality image synthesis. In addition, a Dual-Objective GRPO optimization strategy jointly improves the consistency and quality of the textual constrained reasoning and visual projection processes.
Link: https://arxiv.org/abs/2603.02712
Authors: Ruxue Yan,Xubo Liu,Wenya Guo,Zhengkun Zhang,Ying Zhang,Xiaojie Yuan
Affiliations: Nankai University; Baidu Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments:
Click to view abstract
Abstract:Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify “What” details to depict by rewriting the input prompt, yet fundamentally fail to reason about “How” to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity directly causing unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a “How-to-What” paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces “How to draw” by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints steer the subsequent generation of a detailed description “What to draw”, providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
[CV-66] MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration
【Quick Read】: This paper addresses the difficulty of handling multiple coexisting degradation types (such as haze, blur, noise, and low light) with a single model, since different degradations impose markedly different requirements on restoration strategies, limiting traditional unified models in complex real-world scenes. The key to the solution is a unified image restoration framework that combines a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model: the Inter-MoE layer adaptively combines expert groups for coarse-grained adaptation to major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type, achieving both high specialization across categories and fine-grained modulation for complex real-world corruptions.
Link: https://arxiv.org/abs/2603.02710
Authors: Lingshun Kong,Jiawei Zhang,Zhengpeng Duan,Xiaohe Wu,Yueqi Yang,Xiaotao Wang,Dongqing Zou,Lei Lei,Jinshan Pan
Affiliations: Nanjing University of Science and Technology; Nankai University; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL
Click to view abstract
Abstract:All-in-one image restoration is challenging because different degradation types, such as haze, blur, noise, and low-light, impose diverse requirements on restoration strategies, making it difficult for a single model to handle them effectively. In this paper, we propose a unified image restoration framework that integrates a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model. The framework operates at two levels: the Inter-MoE layer adaptively combines expert groups to handle major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type. This design enables the model to achieve coarse-grained adaptation across diverse degradation categories while performing fine-grained modulation for specific intra-class variations, ensuring both high specialization in handling complex, real-world corruptions. Extensive experiments demonstrate that the proposed method performs favorably against the state-of-the-art approaches on multiple image restoration task.
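The dual-level routing can be sketched as a soft gate over expert groups followed by top-k selection of sub-experts within each group; the router form and the top-k rule are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_level_route(feat, inter_W, intra_Ws, top_k=1):
    """Two-level MoE routing sketch.

    Level 1 (Inter-MoE): soft weights over expert *groups*
    (one group per major degradation type).
    Level 2 (Intra-MoE): within each group, pick the top-k
    specialized sub-experts for fine-grained variations.
    """
    group_w = softmax(feat @ inter_W)                 # (n_groups,)
    routes = []
    for g, W in enumerate(intra_Ws):
        scores = feat @ W                             # (n_sub,)
        chosen = np.argsort(scores)[::-1][:top_k]     # top-k sub-experts
        routes.append((g, group_w[g], chosen))
    return routes

rng = np.random.default_rng(0)
d, n_groups, n_sub = 8, 3, 4
feat = rng.standard_normal(d)
inter_W = rng.standard_normal((d, n_groups))
intra_Ws = [rng.standard_normal((d, n_sub)) for _ in range(n_groups)]

routes = dual_level_route(feat, inter_W, intra_Ws)
```

The coarse gate decides how much each degradation-type group contributes, while the inner top-k selection keeps only a few sub-experts active per group, which is what keeps such designs specialized yet efficient.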
[CV-67] Intelligent Pathological Diagnosis of Gestational Trophoblastic Diseases via Visual-Language Deep Learning Model
【Quick Read】: This paper addresses the problems that pathological diagnosis of gestational trophoblastic disease (GTD) is time-consuming, relies heavily on pathologist experience, and shows low consistency at initial diagnosis, seriously threatening maternal health and reproductive outcomes. The key to the solution is an expert model named GTDoctor that performs pixel-level lesion segmentation on pathological slides and outputs diagnostic conclusions along with personalized pathological analyses. The clinical software system GTDiagnosis built on this technology achieved high precision (mean precision > 0.91) and efficiency (diagnostic time reduced from 56 to 16 seconds) in both retrospective and prospective studies, significantly improving diagnostic performance and efficiency while maintaining clinical interpretability.
Link: https://arxiv.org/abs/2603.02704
Authors: Yuhang Liu,Yueyang Cang,Wenge Que,Xinru Bai,Xingtong Wang,Kuisheng Chen,Jingya Li,Xiaoteng Zhang,Xinmin Li,Lixia Zhang,Pingge Hu,Qiaoting Xie,Peiyu Xu,Xianxu Zeng,Li Shi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 29 pages, 3 figures
Click to view abstract
Abstract:The pathological diagnosis of gestational trophoblastic disease(GTD) takes a long time, relies heavily on the experience of pathologists, and the consistency of initial diagnosis is low, which seriously threatens maternal health and reproductive outcomes. We developed an expert model for GTD pathological diagnosis, named GTDoctor. GTDoctor can perform pixel-based lesion segmentation on pathological slides, and output diagnostic conclusions and personalized pathological analysis results. We developed a software system, GTDiagnosis, based on this technology and conducted clinical trials. The retrospective results demonstrated that GTDiagnosis achieved a mean precision of over 0.91 for lesion detection in pathological slides (n=679 slides). In prospective studies, pathologists using GTDiagnosis attained a Positive Predictive Value of 95.59% (n=68 patients). The tool reduced average diagnostic time from 56 to 16 seconds per case (n=285 patients). GTDoctor and GTDiagnosis offer a novel solution for GTD pathological diagnosis, enhancing diagnostic performance and efficiency while maintaining clinical interpretability.
[CV-68] ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling
【Quick Read】: This paper addresses the lack of support in existing video generation methods for unified shared world modeling under multi-agent interaction, i.e., how multiple agents in the same environment can jointly perceive and generate consistent video content. The key to the solution: first, a large-scale multi-agent interactive world dataset is built on the CARLA simulation platform, covering diverse scenes, weather conditions, and paired multi-view videos (front/rear/left/right views per agent); second, a spatial concatenation strategy fuses the four-view videos of independent agents to extend the modeled environment while guaranteeing internal multi-view geometric consistency; finally, cross-agent attention blocks are integrated into the pretrained video model to exchange spatio-temporal information across agents, ensuring consistency in overlapping regions and reasonable generation in non-overlapping regions, ultimately supporting 49-frame large-scale video generation with accurate perception of dynamic agent positions.
Link: https://arxiv.org/abs/2603.02697
Authors: Jiayi Zhu,Jianing Zhang,Yiying Yang,Wei Cheng,Xiaoyun Yuan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/ rear/ left/ right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.
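The spatial concatenation of one agent's four views can be sketched as tiling the camera frames into a single canvas; the 2x2 layout below is an assumption, as the abstract does not specify the arrangement:

```python
import numpy as np

def concat_four_views(front, rear, left, right):
    """Tile one agent's four camera views into a single frame so a
    video model can attend over them jointly (layout is an assumption)."""
    top = np.concatenate([front, rear], axis=1)
    bottom = np.concatenate([left, right], axis=1)
    return np.concatenate([top, bottom], axis=0)

H, W, C = 4, 6, 3
# Constant-valued stand-ins for the four camera frames.
views = [np.full((H, W, C), v, dtype=np.float32) for v in range(4)]
frame = concat_four_views(*views)
```

Once the views share one canvas, standard self-attention inside the video model can enforce geometric consistency across them without any per-view machinery.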
[CV-69] FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution CVPR2026
【Quick Read】: This paper addresses the difficulty current diffusion-based image super-resolution (SR) methods have in simultaneously preserving fine details and maintaining reconstruction fidelity in practice, which often results in suboptimal visual quality. The key to the solution is the FiDeSR framework: during training, a detail-aware weighting strategy adaptively emphasizes regions with larger prediction errors; at inference, low- and high-frequency adaptive enhancers flexibly refine the reconstruction without retraining; in addition, a residual-in-residual noise refinement mechanism corrects noise prediction errors in the diffusion process, markedly improving fine-grained structure recovery. Together, these designs let FiDeSR achieve both high fidelity and detail preservation in real-world image super-resolution.
Link: https://arxiv.org/abs/2603.02692
Authors: Aro Kim,Myeongjin Jang,Chaewon Moon,Youngjin Shin,Jinwoo Jeong,Sang-hyo Park
Affiliations: Kyungpook National University; Korea Electronics Technology Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration. The source code will be released at: this https URL.
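A detail-aware weighting that emphasizes high-error regions can be sketched as an error-dependent per-pixel weight on the reconstruction loss; the exact weighting formula below is an illustrative assumption, not the paper's definition:

```python
import numpy as np

def detail_aware_loss(pred, target, alpha=1.0, eps=1e-8):
    """Weighted reconstruction loss: pixels with larger error receive
    larger weight, emphasizing regions the model currently gets wrong.
    (In a real training loop the weight would be treated as a constant,
    i.e., detached from the gradient.)"""
    err = np.abs(pred - target)
    w = 1.0 + alpha * err / (err.max() + eps)   # weights in (1, 2]
    return np.mean(w * err ** 2)

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 8))
pred = target + 0.1 * rng.standard_normal((8, 8))

plain = np.mean((pred - target) ** 2)
weighted = detail_aware_loss(pred, target)
```

Because every weight is at least 1, the weighted loss upper-bounds the plain MSE, and the extra penalty concentrates exactly on the high-error (typically high-detail) regions.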
[CV-70] ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT MICCAI2026
【Quick Read】: This paper addresses the error accumulation and sampling instability of cold diffusion models in sparse-view computed tomography (CT) reconstruction, which stem from sampling strategies that rely on heuristic controls or fixed schedules. The key to the solution is a residual-conditioned diffusion framework (ReCo-Diff) with residual-conditioned self-guided sampling: at each sampling step, an unconditioned baseline reconstruction is produced first, and subsequent predictions are then conditioned on the observation residual between the predicted image and the sparse-view measurements, providing continuous, measurement-aware correction while preserving a deterministic sampling schedule without manual intervention.
Link: https://arxiv.org/abs/2603.02691
Authors: Yong Eun Choi,Hyoung Suk Park,Kiwan Jeon,Hyun-Cheol Park,Sung Ho Kang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures. Submitted to MICCAI 2026
Click to view abstract
Abstract:Cold and generalized diffusion models have recently shown strong potential for sparse-view CT reconstruction by explicitly modeling deterministic degradation processes. However, existing sampling strategies often rely on ad hoc sampling controls or fixed schedules, which remain sensitive to error accumulation and sampling instability. We propose ReCo-Diff, a residual-conditioned diffusion framework that leverages observation residuals through residual-conditioned self-guided sampling. At each sampling step, ReCo-Diff first produces a null (unconditioned) baseline reconstruction and then conditions subsequent predictions on the observation residual between the predicted image and the measured sparse-view input. This residual-driven guidance provides continuous, measurement-aware correction while preserving a deterministic sampling schedule, without requiring heuristic interventions. Experimental results demonstrate that ReCo-Diff consistently outperforms existing cold diffusion sampling baselines, achieving higher reconstruction accuracy, improved stability, and enhanced robustness under severe sparsity.
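The measurement-aware correction driven by the observation residual can be illustrated on a toy linear inverse problem, replacing the CT projector and the diffusion model with a random matrix and plain gradient steps; only the residual-correction idea carries over, everything else is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
x_true = rng.standard_normal(n)

# Toy "sparse-view" forward operator: only a subset of linear
# measurements is observed (stand-in for a sparse-view CT projector).
A = rng.standard_normal((12, n)) / np.sqrt(n)
y = A @ x_true

def residual_correct(x, steps=500, eta=0.5):
    """Measurement-aware correction: repeatedly push the estimate
    toward consistency with the observed sparse measurements."""
    for _ in range(steps):
        r = A @ x - y          # observation residual
        x = x - eta * A.T @ r  # gradient step on ||Ax - y||^2 / 2
    return x

x0 = np.zeros(n)               # stand-in for a network's baseline prediction
x_hat = residual_correct(x0)
res0 = np.linalg.norm(A @ x0 - y)
res1 = np.linalg.norm(A @ x_hat - y)
```

The residual `A @ x - y` is exactly the quantity the abstract conditions on: it measures how far the current prediction is from explaining the sparse-view data, so driving it down gives continuous correction without stochastic sampling.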
[CV-71] VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation
【Quick Read】: This paper addresses two problems in visual content creation: general-purpose models lack a nuanced understanding of design conventions and creative workflows, while workflow-based agents lack the specialized knowledge needed for autonomous creative planning. The key to the solution is VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework, optimized via Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) in a high-fidelity simulated environment, enabling stable and efficient multi-step generation for complex visual creation tasks.
Link: https://arxiv.org/abs/2603.02681
Authors: Jinxiang Lai,Zexin Lu,Jiajun He,Rongwei Quan,Wenzhe Zhao,Qinyu Yang,Qi Chen,Qin Lin,Chuyue Li,Tao Gao,Yuhao Shan,Shuai Shao,Song Guo,Qinglin Lu
Affiliations: Tencent Hunyuan; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows-capabilities challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) The VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) Remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.
[CV-72] DREAM: Where Visual Understanding Meets Text-to-Image Generation
【Quick Read】: This paper addresses the central challenge of unifying visual representation learning and text-to-image (T2I) generation within a single model. The key to the solution is the DREAM framework, which jointly optimizes discriminative and generative objectives so that representation quality and generation quality improve synergistically. During training, a progressive masking schedule (Masking Warmup) starts with minimal masking to establish contrastive alignment before gradually transitioning to full masking for stable generative training; at inference, Semantically Aligned Decoding selects the partially masked image candidate that best matches the target text, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM reaches 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), demonstrating the synergy between discriminative and generative objectives.
Link: https://arxiv.org/abs/2603.02667
Authors: Chao Li,Tianhong Li,Sai Vidyaranya Nuthalapati,Hong-You Chen,Satya Narayan Shukla,Yonghuan Yang,Jun Xiao,Xiangjun Fan,Aashu Singh,Dina Katabi,Shlok Kumar Mishra
Affiliations: MIT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
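A progressive masking schedule of the kind described, near-minimal masking early for contrastive alignment, ramping to full masking for generative training, can be sketched as follows; the linear ramp and the starting ratio are assumptions:

```python
def masking_ratio(step, total_steps, r_min=0.1):
    """Progressive masking warmup: start near-minimal masking (so the
    contrastive objective sees mostly visible images), then ramp to
    full masking for generative training. The linear ramp and r_min
    value are illustrative assumptions."""
    t = min(step / total_steps, 1.0)
    return r_min + (1.0 - r_min) * t

# Ratios at five checkpoints of a 100-step warmup.
schedule = [masking_ratio(s, 100) for s in range(0, 101, 25)]
```

A monotone schedule like this lets one objective hand over to the other smoothly, instead of switching the training regime abruptly.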
[CV-73] OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning
【Quick Read】: This paper addresses the inconsistent visual-semantic structures in fashion intelligence caused by fragmented supervision and incomplete fashion annotations, which prevent current vision-language models (VLMs) from serving as a unified engine for fashion understanding and reasoning. The key to the solution is a million-scale dataset, FashionX, which exhaustively annotates whole outfits and their part-level attributes, and the OmniFashion framework built on it: a multi-task vision-language architecture under a unified fashion dialogue paradigm that integrates retrieval, recommendation, recognition, and dialogue, enabling cross-task reasoning and interactive dialogue with significantly improved task accuracy and cross-task generalization.
Link: https://arxiv.org/abs/2603.02658
Authors: Zhengwei Yang,Andi Long,Hao Li,Zechao Hu,Kui Jiang,Zheng Wang
Affiliations: Wuhan University; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures
Click to view abstract
Abstract:Fashion intelligence spans multiple tasks, i.e., retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from global to part-level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multi-subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, highlighting its offering of a scalable path toward universal, dialogue-oriented fashion intelligence.
[CV-74] SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation ISCAS2026
【速读】:该论文旨在解决透明物体实例分割(transparent object instance segmentation)中存在的边界模糊、对比度低及对背景依赖性强等问题,这些问题导致现有方法因依赖强外观特征和清晰边界而性能受限。解决方案的关键在于提出SEP-YOLO框架,其核心创新包括:1)频域细节增强模块(Frequency Domain Detail Enhancement Module),通过可学习的复权重分离并增强弱高频边界成分;2)多尺度空间精修流(multi-scale spatial refinement stream),包含内容感知对齐颈部(Content-Aware Alignment Neck)与多尺度门控精修块(Multi-scale Gated Refinement Block),以实现深层语义特征中的精确特征对齐与边界定位。该方法在Trans10K和GVD数据集上取得当前最优(SOTA)性能。
链接: https://arxiv.org/abs/2603.02648
作者: Fengming Zhang,Tao Yan,Jianchao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, accepted to ISCAS 2026
点击查看摘要
Abstract:Transparent object instance segmentation presents significant challenges in computer vision due to the inherent properties of transparent objects, including boundary blur, low contrast, and high dependence on background context. Existing methods often fail as they depend on strong appearance cues and clear boundaries. To address these limitations, we propose SEP-YOLO, a novel framework that integrates a dual-domain collaborative mechanism for transparent object instance segmentation. Our method incorporates a Frequency Domain Detail Enhancement Module, which separates and enhances weak high-frequency boundary components via learnable complex weights. We further design a multi-scale spatial refinement stream, which consists of a Content-Aware Alignment Neck and a Multi-scale Gated Refinement Block, to ensure precise feature alignment and boundary localization in deep semantic features. We also provide high-quality instance-level annotations for the Trans10K dataset, filling the critical data gap in transparent object instance segmentation. Extensive experiments on the Trans10K and GVD datasets show that SEP-YOLO achieves state-of-the-art (SOTA) performance.
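频域细节增强模块「以可学习复权重增强高频分量」的思路,可用一维离散傅里叶变换粗略示意:变换到频域后逐频点乘以复权重,再逆变换回空域。此处用朴素 DFT 代替 FFT,权重仅为可学习参数的占位,并非论文实现:

```python
import cmath

def dft(x):
    """朴素一维离散傅里叶变换(实际系统会用 FFT)。"""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """逆变换,取实部并归一化。"""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def enhance_high_freq(signal, weights):
    """每个频率分量乘以一个复数权重(可学习参数的占位)后重建信号。"""
    X = dft(signal)
    X = [w * c for w, c in zip(weights, X)]
    return idft(X)
```

将高频分量对应的权重模长设为大于 1,即可起到放大弱边界高频成分的作用;全 1 权重则还原原信号。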
[CV-75] Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective
【速读】:该论文旨在解决增量统一多模态异常检测(Incremental Unified Multimodal Anomaly Detection)中的灾难性遗忘问题,即在学习新类别或对象时如何有效保留已有知识。其核心挑战在于,传统方法常忽略伪特征(spurious features)和冗余特征(redundant features)对灾难性遗忘的负面影响,导致模型性能下降。解决方案的关键是提出一种名为IB-IUMAD的新颖去噪框架,其创新点在于结合Mamba解码器与信息瓶颈融合模块:前者用于解耦不同物体间的特征耦合,抑制伪特征干扰;后者则过滤融合特征中的冗余信息,显式保留判别性特征,从而显著缓解灾难性遗忘并提升模型在MVTec 3D-AD和Eyecandies数据集上的性能表现。
链接: https://arxiv.org/abs/2603.02629
作者: Kaifang Long,Lianbo Ma,Jiaqi Liu,Liming Liu,Guoyang Xie
机构: Software College, Northeastern University (东北大学软件学院); CATL
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former dedicated to disentangle inter-object feature coupling, preventing spurious feature interference between objects; the latter serves to filter out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.
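论文的信息瓶颈融合模块旨在过滤融合特征中的冗余信息、显式保留判别性特征。下面用「按维度方差筛选」做一个非常粗略的类比(这不是信息瓶颈的变分实现,仅用于直观理解「滤除冗余维度、保留信息量大的维度」这一动作,阈值与准则均为假设):

```python
def variance_filter(features, keep: int):
    """对一批特征向量,仅保留跨样本方差最大的 keep 个维度,其余置零。
    方差在此充当"信息量"的粗略代理,低方差维度视为冗余。"""
    n, d = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(d)]
    vars_ = [sum((f[j] - means[j]) ** 2 for f in features) / n for j in range(d)]
    kept = set(sorted(range(d), key=lambda j: -vars_[j])[:keep])
    return [[f[j] if j in kept else 0.0 for j in range(d)] for f in features]
```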
[CV-76] Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild ICLR2026
【速读】:该论文旨在解决单视角3D人体重建中生成的人体姿态不自然的问题,尤其是针对动态或复杂姿态场景下的重建质量下降问题。作者认为这一现象主要源于现有3D人体数据集在姿态多样性上的局限性。解决方案的关键在于提出DrPose(Direct Reward fine-tuning algorithm on Poses),其核心是通过仅使用单视图图像与人体姿态配对的数据进行后训练(post-training),利用一个可微分的奖励函数——PoseScore——来最大化生成多视角潜在图像与真实人体姿态之间的一致性。该方法无需昂贵的3D人体资产,且借助新构建的DrPose15K数据集(基于现有动作捕捉数据和姿态条件视频生成模型构建)显著扩展了姿态分布范围,从而有效提升了模型在挑战性姿态下的重建性能。
链接: https://arxiv.org/abs/2603.02619
作者: Seunguk Do,Minwoo Huh,Joonghyuk Shin,Jaesik Park
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026, Project webpage: this https URL
点击查看摘要
Abstract:Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing a direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: this https URL.
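PoseScore 是一个量化生成结果与真值姿态一致性的可微奖励,其具体形式摘要未给出。这里按「关键点平均欧氏距离经高斯核映射到 (0, 1] 区间」写一个假设性示意(函数名、核形式与 sigma 均为本文假设):

```python
import math

def pose_score(pred_kpts, gt_kpts, sigma: float = 1.0) -> float:
    """示意性的姿态一致性奖励:关键点平均距离越小,得分越接近 1。"""
    d = [math.dist(p, g) for p, g in zip(pred_kpts, gt_kpts)]
    mean_d = sum(d) / len(d)
    return math.exp(-(mean_d ** 2) / (2 * sigma ** 2))
```

奖励微调即沿最大化该得分的方向更新扩散模型参数(真实实现需对生成的多视角潜变量可微)。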
[CV-77] Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs CVPR2026
【速读】:该论文旨在解决当前基于视觉-语言模型(Vision-Language Models, VLMs)的分布外(Out-of-Distribution, OOD)检测方法中存在的模态内距离(intra-modal distance)与模型优化目标不一致的问题。现有方法常通过比较负文本与已知类别标签或测试图像与图像代理来实现OOD检测,这种设计违背了CLIP类VLM在跨模态距离(inter-modal distance)上优化的本质,导致性能受限。解决方案的关键在于提出InterNeg框架,其核心思想是系统性地增强一致的跨模态距离:一方面从文本角度设计跨模态准则以筛选负文本;另一方面从视觉角度动态识别高置信度OOD图像并将其反演至文本空间,生成由跨模态距离引导的额外负文本嵌入,从而提升OOD检测的准确性与鲁棒性。
链接: https://arxiv.org/abs/2603.02618
作者: Zhikang Xu,Qianqian Xu,Zitai Wang,Cong Hua,Sicong Li,Zhiyong Yang,Qingming Huang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所人工智能安全重点实验室); School of Computer Science and Technology, University of Chinese Academy of Sciences (中国科学院大学计算机科学与技术学院); BDKM, University of Chinese Academy of Sciences (中国科学院大学BDKM)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the main track of CVPR 2026
点击查看摘要
Abstract:Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
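从文本角度按跨模态距离筛选负文本的做法,可示意为:候选负文本嵌入与所有 ID 图像嵌入的最大余弦相似度低于阈值时入选(准则的具体形式与阈值均为示意假设,并非论文公式):

```python
import math

def cosine(u, v):
    """余弦相似度,CLIP 类模型的标准跨模态度量。"""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def select_negative_texts(cand_text_embs, id_image_embs, tau: float):
    """与任一 ID 图像都不够相似(跨模态距离足够远)的候选文本入选为负文本。"""
    selected = []
    for i, t in enumerate(cand_text_embs):
        if max(cosine(t, im) for im in id_image_embs) < tau:
            selected.append(i)
    return selected
```

关键点在于比较的是文本嵌入与图像嵌入(跨模态),而非文本与 ID 标签文本之间的模态内距离,从而与 CLIP 的优化目标保持一致。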
[CV-78] VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction
【速读】:该论文旨在解决当前基于体素(voxel)的3D语义占用预测模型在稀疏几何网格中存在语义模糊性以及恶劣天气条件下性能下降的问题。解决方案的关键在于提出一个鲁棒的多模态框架VLMFusionOcc3D,其核心创新包括:1)引入基于实例的视觉-语言注意力机制(InstVLM),通过门控交叉注意力和LoRA适配的CLIP嵌入,将高层语义与地理先验注入3D体素;2)设计天气感知自适应融合机制(WeathFusion),利用车辆元数据和天气条件提示动态调整传感器贡献权重,提升环境可靠性下的融合效果;3)采用深度感知几何对齐损失(DAGA)确保相机推导的稠密几何与LiDAR稀疏但空间准确的点云之间的一致性。这些模块共同提升了模型在复杂城市场景中的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2603.02609
作者: A. Enes Doruk,Hasan F. Ates
机构: Ozyegin University (奥泽京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
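WeathFusion 根据实时环境可靠性对传感器贡献做动态加权,可示意为对相机/LiDAR 特征的 softmax 门控;logits 如何由天气提示与车辆元数据产生此处不展开,以下均为假设性写法:

```python
import math

def weather_gated_fusion(cam_feat, lidar_feat, weather_logits):
    """用天气条件产生的两个 logits 做 softmax,得到相机/LiDAR 权重后加权融合。
    例如浓雾下可学到压低相机权重、提高 LiDAR 权重。"""
    m = max(weather_logits)
    e = [math.exp(l - m) for l in weather_logits]
    s = sum(e)
    w_cam, w_lidar = e[0] / s, e[1] / s
    return [w_cam * c + w_lidar * l for c, l in zip(cam_feat, lidar_feat)]
```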
[CV-79] Synthetic-Child: An AIGC-Based Synthetic Data Pipeline for Privacy-Preserving Child Posture Estimation
【速读】:该论文旨在解决儿童姿态估计中因真实数据采集成本高且涉及隐私伦理问题而难以获得大规模标注数据集的难题。其核心解决方案是提出一种基于生成式 AI (Generative AI) 的合成数据流水线 Synthetic-Child,关键在于通过四阶段流程实现高质量、带真值标注的儿童姿态图像生成:首先利用可编程的3D儿童人体模型(SMPL-X)在Blender中生成符合解剖学约束的多样化坐姿;其次借助双ControlNet(姿态+深度)条件控制的FLUX-1 Dev模型合成12,000张逼真图像并保持低标注偏差;再通过ViTPose置信度过滤与针对性增强提升鲁棒性;最终在合成数据上微调RTMPose-M模型,并结合几何特征工程和轻量MLP进行分类,量化至INT8后实现在边缘设备上的实时推理(22 FPS),在真实儿童测试集上达到71.2 AP,显著优于基于成人数据预训练的基线模型。
链接: https://arxiv.org/abs/2603.02598
作者: Taowen Zeng
机构: Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 3 figures, 5 tables
点击查看摘要
Abstract:Accurate child posture estimation is critical for AI-powered study companion devices, yet collecting large-scale annotated datasets of children is both expensive and ethically prohibitive due to privacy concerns. We present Synthetic-Child, an AIGC-based synthetic data pipeline that produces photorealistic child posture training images with ground-truth-projected keypoint annotations, requiring zero real child photographs. The pipeline comprises four stages: (1) a programmable 3D child body model (SMPL-X) in Blender generates diverse desk-study poses with IK-constrained anatomical plausibility and automatic COCO-format ground-truth export; (2) a custom PoseInjectorNode feeds 3D-derived skeletons into a dual ControlNet (pose + depth) conditioned on FLUX-1 Dev, synthesizing 12,000 photorealistic images across 10 posture categories with low annotation drift; (3) ViTPose-based confidence filtering and targeted augmentation remove generation failures and improve robustness; (4) RTMPose-M (13.6M params) is fine-tuned on the synthetic data and paired with geometric feature engineering and a lightweight MLP for posture classification, then quantized to INT8 for real-time edge deployment. On a real-child test set (n~300), the FP16 model achieves 71.2 AP – a +12.5 AP improvement over the COCO-pretrained adult-data baseline at identical model capacity. After INT8 quantization the model retains 70.4 AP while running at 22 FPS on a 0.8-TOPS Rockchip RK3568 NPU. In a single-subject controlled comparison with a commercial posture corrector, our system achieves substantially higher recognition rates across most tested categories and responds ~1.8x faster on average. These results demonstrate that carefully designed AIGC pipelines can substantially reduce dependence on real child imagery while achieving deployment-ready accuracy, with potential applications to other privacy-sensitive domains.
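流水线第三阶段的置信度过滤可示意如下:每张合成图由姿态估计器(文中为 ViTPose)输出各关键点置信度,达标关键点数量不足的样本即视为生成失败而剔除(阈值与数量均为假设):

```python
def filter_by_confidence(samples, min_conf: float, min_valid: int):
    """samples 为每张图的关键点置信度列表;
    只保留"置信度 >= min_conf 的关键点数 >= min_valid"的样本索引。"""
    kept = []
    for idx, confs in enumerate(samples):
        if sum(c >= min_conf for c in confs) >= min_valid:
            kept.append(idx)
    return kept
```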
[CV-80] Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification
【速读】:该论文旨在解决资源受限语言(如孟加拉语)在手写字符识别任务中因训练数据稀缺而导致深度学习模型性能下降的问题。其关键解决方案是采用图像数据增强技术,通过提升数据集的规模与多样性来缓解过拟合现象,并结合轻量级模型EfficientViT进行高效训练。实验表明,随机仿射变换(Random Affine)与色彩抖动(Color Jitter)的组合策略在Ekush和AIBangla两个孟加拉语手写字符数据集上表现最优,分别达到97.48%和97.57%的准确率,显著优于其他单一或组合增强方法,验证了该方案在小样本场景下的有效性。
链接: https://arxiv.org/abs/2603.02591
作者: Rafi Hassan Chowdhury,Naimul Haque,Kaniz Fatiha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning models have proven to be highly effective in computer vision, with deep convolutional neural networks achieving impressive results across various computer vision tasks. However, these models rely heavily on large datasets to avoid overfitting. When a model learns features with either low or high variance, it can lead to underfitting or overfitting on the training data. Unfortunately, large-scale datasets may not be available in many domains, particularly for resource-limited languages such as Bengali. In this experiment, a series of tests were conducted in the field of image data augmentation as an approach to addressing the limited data problem for Bengali handwritten characters. The study also provides an in-depth analysis of the performance of different augmentation techniques. Data augmentation refers to a set of techniques applied to data to increase its size and diversity, making it more suitable for training deep learning models. The image augmentation techniques evaluated in this study include CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations. The study further explores the use of augmentation methods with a lightweight model such as EfficientViT. Among the different augmentation strategies, the combination of Random Affine and Color Jitter produced the best accuracy on the Ekush [1] and AIBangla [2] datasets, achieving accuracies of 97.48% and 97.57%, respectively. This combination outperformed all other individual and combined augmentation techniques. Overall, this analysis presents a thorough examination of the impact of image data augmentation in resource-scarce languages, particularly in the context of Bengali handwritten character recognition using lightweight models.
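文中表现最好的组合包含 Color Jitter;其亮度抖动部分可用纯 Python 示意(实际实验通常调用 torchvision 等库实现,此处仅演示「随机缩放像素亮度并截断」这一原理,参数为假设):

```python
import random

def brightness_jitter(pixels, brightness, seed=None):
    """在 [1-b, 1+b] 内随机采样一个缩放因子,作用于灰度像素并截断到 [0, 255]。"""
    rng = random.Random(seed)
    factor = rng.uniform(1 - brightness, 1 + brightness)
    return [min(255, max(0, round(p * factor))) for p in pixels]
```

每次调用采样不同因子,即可在有限数据上产生亮度多样性;brightness=0 时退化为恒等变换。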
[CV-81] Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction
【速读】:该论文旨在解决构建功能型数字孪生(Digital Twin)的核心挑战,即如何从非接触、非侵入式传感数据中重建场景中每个点的物理材料属性(如介电常数、电导率),从而实现可模拟的三维数字孪生体。当前方法(如NeRF)仅能生成视觉逼真的模型,缺乏功能性。其关键障碍在于该问题是一个典型的病态物理反演问题,标准遥感信号(如图像和射频RF)深度耦合了未知几何、环境场与目标材料特性。解决方案的关键在于提出一种系统性的解耦策略:利用高保真图像获取的几何信息作为锚点,首先解析环境场;在此基础上,仅用非侵入式数据约束几何与场,将原病态问题转化为一个有良好定义的物理监督学习任务,进而通过一个基于射频信号引导、嵌入可微分物理反射模型的解码器模块,显式输出连续的空间变化材料参数场,最终实现高精度材料重建与物理仿真能力。
链接: https://arxiv.org/abs/2603.02582
作者: Zhe Chen,Peilin Zheng,Wenshuo Chen,Xiucheng Wang,Yutao Yue,Nan Cheng
机构: Xidian University (西安电子科技大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:Creating functional Digital Twins, simulatable 3D replicas of the real world, is a central challenge in computer vision. Current methods like NeRF produce visually rich but functionally incomplete twins. The key barrier is the lack of underlying material properties (e.g., permittivity, conductivity). Acquiring this information for every point in a scene via non-contact, non-invasive sensing is a primary goal, but it demands solving a notoriously ill-posed physical inversion problem. Standard remote signals, like images and radio frequencies (RF), deeply entangle the unknown geometry, ambient field, and target materials. We introduce NEMF, a novel framework for dense, non-invasive physical inversion designed to build functional digital twins. Our key insight is a systematic disentanglement strategy. NEMF leverages high-fidelity geometry from images as a powerful anchor, which first enables the resolution of the ambient field. By constraining both geometry and field using only non-invasive data, the original ill-posed problem transforms into a well-posed, physics-supervised learning task. This transformation unlocks our core inversion module: a decoder. Guided by ambient RF signals and a differentiable layer incorporating physical reflection models, it learns to explicitly output a continuous, spatially-varying field of the scene’s underlying material parameters. We validate our framework on high-fidelity synthetic datasets. Experiments show our non-invasive inversion reconstructs these material maps with high accuracy, and the resulting functional twin enables high-fidelity physical simulation. This advance moves beyond passive visual replicas, enabling the creation of truly functional and simulatable models of the physical world.
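论文的可微反射模型把材质参数与观测的 RF 信号联系起来。作为其中最简单的物理环节示意:非磁性介质在法向入射下的反射率只由相对介电常数决定,n = sqrt(eps_r),R = ((n-1)/(n+1))^2。这是标准菲涅耳公式的特例,并非论文的完整模型:

```python
import math

def normal_incidence_reflectance(eps_r: float) -> float:
    """非磁性介质、法向入射的功率反射率(菲涅耳公式特例)。
    可微的反射模型使得从观测信号反推 eps_r 成为可能。"""
    n = math.sqrt(eps_r)
    return ((n - 1) / (n + 1)) ** 2
```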
[CV-82] ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration
【速读】:该论文旨在解决基于Transformer的图像恢复模型在性能与计算复杂度之间难以平衡的问题,尤其是现有方法因自注意力机制具有二次时间复杂度而限制注意力范围至局部窗口,导致感受野受限、性能不佳。其解决方案的关键在于提出一种新颖的自适应令牌字典(Adaptive Token Dictionary, ATD)架构,通过引入一个可学习的令牌字典来总结外部图像先验信息(即典型图像结构),并设计令牌字典交叉注意力(Token Dictionary Cross-Attention, TDCA)机制,实现输入特征与字典之间的交互增强;同时利用TDCA注意力图中嵌入的类别信息对输入特征进行分组,形成多个注意力组以建模全局依赖关系,并将类别信息融入前馈网络以提升特征融合能力,从而在保持线性计算复杂度的同时显著提升图像恢复效果。
链接: https://arxiv.org/abs/2603.02581
作者: Leheng Zhang,Wei Long,Yawei Li,Xingyu Zhou,Xiaorui Zhao,Shuhang Gu
机构: University of Electronic Science and Technology of China (电子科技大学); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures
点击查看摘要
Abstract:Recently, Transformers have gained significant popularity in image restoration tasks such as image super-resolution and denoising, owing to their superior performance. However, balancing performance and computational burden remains a long-standing problem for transformer-based architectures. Due to the quadratic complexity of self-attention, existing methods often restrict attention to local windows, resulting in limited receptive field and suboptimal performance. To address this issue, we propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. The ATD model incorporates a learnable token dictionary, which summarizes external image priors (i.e., typical image structures) during the training process. To utilize this information, we introduce a token dictionary cross-attention (TDCA) mechanism that enhances the input features via interaction with the learned dictionary. Furthermore, we exploit the category information embedded in the TDCA attention maps to group input features into multiple categories, each representing a cluster of similar features across the image and serving as an attention group. We also integrate the learned category information into the feed-forward network to further improve feature fusion. ATD and its lightweight version, ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks. Moreover, we develop ATD-U, a multi-scale variant of ATD, to address other image restoration tasks, including image denoising and JPEG compression artifacts removal. Extensive experiments demonstrate the superiority of our proposed models, both quantitatively and qualitatively.
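TDCA 的核心是把输入特征当作 query、对可学习的字典令牌做交叉注意力。以下为省略线性投影与多头机制的极简示意(并非论文实现):

```python
import math

def tdca(queries, dictionary, scale=None):
    """令牌字典交叉注意力示意:每个 query 与所有字典令牌算缩放点积,
    softmax 后对字典令牌加权求和。复杂度相对图像尺寸为线性,
    因为字典大小固定、不随像素数增长。"""
    d = len(dictionary[0])
    scale = scale or 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, t)) for t in dictionary]
        m = max(logits)
        e = [math.exp(l - m) for l in logits]
        s = sum(e)
        attn = [x / s for x in e]
        out.append([sum(attn[k] * dictionary[k][j] for k in range(len(dictionary)))
                    for j in range(d)])
    return out
```

注意力权重向量同时携带了「该 query 属于哪类典型结构」的类别信息,论文正是利用它对特征分组。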
[CV-83] Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels
【速读】:该论文旨在解决从单目视频中高效估计每个像素的3D轨迹这一关键问题,以实现对视频中场景动态的全面理解。现有方法要么仅能跟踪第一帧的稀疏点,要么依赖缓慢的优化框架进行密集跟踪,难以满足实时性和精度需求。解决方案的关键在于提出一种前馈模型Track4World,其基于VGGT风格的视觉Transformer(Vision Transformer, ViT)构建全局3D场景表示,并引入一种新颖的3D相关性机制,可同时估计任意帧对之间的像素级2D与3D稠密光流(dense flow)。该方法结合重建的3D几何信息,实现了世界坐标系下的高效全像素3D跟踪,显著优于现有方法在2D/3D光流估计和3D跟踪任务上的性能表现,展现出良好的鲁棒性和可扩展性,适用于真实世界的4D重建任务。
链接: https://arxiv.org/abs/2603.02573
作者: Jiahao Lu,Jiayi Xu,Wenbo Hu,Ruijie Zhu,Chengfeng Zhao,Sai-Kit Yeung,Ying Shan,Yuan Liu
机构: The Hong Kong University of Science and Technology (香港科技大学); ARC Lab, Tencent PCG (腾讯PCGARC实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited to either tracking sparse points on the first frame or a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
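帧对之间的稠密相关(correlation)本质上是两帧特征向量的两两点积,得到的代价矩阵用于后续的稠密光流/跟踪估计。极简示意(省略归一化、3D 坐标嵌入与多尺度细节,均为论文实现内容):

```python
def correlation_volume(feats_a, feats_b):
    """两帧像素特征的两两点积矩阵:entry[i][j] 越大,
    表示 a 帧第 i 个像素与 b 帧第 j 个像素越可能匹配。"""
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in feats_b]
            for fa in feats_a]
```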
[CV-84] CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration
【速读】:该论文旨在解决多模态图像融合(Multimodal Image Fusion, MMIF)在复合恶劣天气条件下的性能瓶颈问题,即现有方法通常仅针对单一类型退化(如雾霾、雨滴或积雪)设计,难以应对多种退化同时存在的情况(如雾霾+雨、雨+雪)。其核心解决方案是提出首个端到端框架——Compound Adverse Weather Mamba (CAWM-Mamba),通过统一共享权重联合执行图像融合与复合天气恢复。关键创新包括:(1) 天气感知预处理模块(Weather-Aware Preprocess Module, WAPM)用于增强退化可见特征并提取全局天气嵌入;(2) 跨模态特征交互模块(Cross-modal Feature Interaction Module, CFIM)促进异构模态对齐与互补特征交换;(3) 小波域状态块(Wavelet Space State Block, WSSB),利用小波分解解耦多频段退化,其中频域状态空间模块(Freq-SSM)无冗余建模各向异性高频退化,并引入统一退化表示机制以提升复杂复合天气场景下的泛化能力。
链接: https://arxiv.org/abs/2603.02560
作者: Huichun Liu,Xiaosong Li,Zhuangfan Huang,Tao Ye,Yang Liu,Haishu Tan
机构: Foshan University (佛山大学); China University of Mining and Technology (中国矿业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally only tackle single types of degradation such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM-Mamba), the first end-to-end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather-Aware Preprocess Module (WAPM) to enhance degraded visible features and extracts global weather embeddings; (2) a Cross-modal Feature Interaction Module (CFIM) to facilitate the alignment of heterogeneous modalities and exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet-domain decomposition to decouple multi-frequency degradations. WSSB includes Freq-SSM, a module that models anisotropic high-frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets demonstrate that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming the practical value in real-world adverse weather perception. The source code will be available at this https URL.
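WSSB 依赖小波分解把特征解耦为不同频段。一层一维 Haar 分解即可说明「低频取均值、高频取差分」的拆分思想(论文作用于二维特征图并配合 Freq-SSM 建模,此处仅示意分解本身):

```python
import math

def haar_1d(x):
    """一层一维 Haar 小波分解:相邻样本对的归一化和为低频子带,
    归一化差为高频子带。雨痕、雪点等退化主要落在高频子带。"""
    s = math.sqrt(2)
    low = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return low, high
```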
[CV-85] CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment CVPR2026
【速读】:该论文旨在解决视觉-语言模型(如CLIP)在跨模态表示学习中因系统性误分类而导致的细粒度判别能力不足问题,特别是针对视觉和语义相似类别间持续存在的混淆模式。解决方案的关键在于提出一种名为CAPT(Confusion-Aware Prompt Tuning)的框架,其核心创新包括:构建一个显式建模类别间稳定混淆关系的“混淆库”(Confusion Bank),引入语义混淆挖掘器(SEM)与样本混淆挖掘器(SAM)分别从全局语义差异和局部样本特征中提取混淆线索,并设计多粒度差异专家模块(MGDE)以统一不同层次的混淆信息,从而实现更鲁棒的混淆感知推理。
链接: https://arxiv.org/abs/2603.02557
作者: Maoyuan Shao,Yutong Gao,Xinyang Huang,Chuang Zhu,Lijuan Sun,Guoshun Nan
机构: Minzu University of China (中央民族大学); Beijing University of Posts and Telecommunications (北京邮电大学); National Library of China (中国国家图书馆)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR2026
点击查看摘要
Abstract:Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model’s intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at this https URL.
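混淆库(Confusion Bank)显式统计类别对之间稳定的误分关系。其计数部分可示意为一个简单的计数器(论文中还会保存误分样本本身供 SAM 检索,此处从略):

```python
from collections import Counter

class ConfusionBank:
    """记录 (真实类, 预测类) 误分对的出现次数,并检索最常混淆的类别对。"""

    def __init__(self):
        self.pairs = Counter()

    def update(self, y_true, y_pred):
        # 只有误分类才计入混淆关系
        if y_true != y_pred:
            self.pairs[(y_true, y_pred)] += 1

    def top_confusions(self, k=1):
        return [p for p, _ in self.pairs.most_common(k)]
```

持续出现的高频混淆对即反映了模型的系统性偏差,后续模块据此构造差异提示与样本级线索。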
[CV-86] Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation CVPR2026
【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在语义分割任务中忽视域外泛化能力的问题,尤其是在视觉基础模型(Vision Foundation Models, VFMs)蒸馏过程中,传统方法虽能保持域内准确率,却会削弱VFMs对分布偏移的鲁棒性。解决方案的关键在于提出一种多阶段的可泛化知识蒸馏(Generalizable Knowledge Distillation, GKD)框架:首先通过选择性特征蒸馏使学生模型学习领域无关表示(domain-agnostic representations),并冻结这些表示以避免对可见域过拟合;其次引入基于查询的软蒸馏机制,利用学生特征作为查询从教师模型中检索可迁移的空间知识,从而增强跨域适应能力。
链接: https://arxiv.org/abs/2603.02554
作者: Chonghua Lv,Dong Zhao,Shuang Wang,Dou Quan,Ning Huyan,Nicu Sebe,Zhun Zhong
机构: Xidian University (西安电子科技大学); University of Trento (特伦托大学); Tsinghua University (清华大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
点击查看摘要
Abstract:Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at this https URL.
[CV-87] SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding ICRA2026
【速读】:该论文旨在解决现有语义场景重建与语义感知的新视角合成方法依赖密集多视角输入并需针对特定场景进行优化的问题,从而限制了其在真实环境中的实用性与可扩展性。解决方案的关键在于提出一种前馈式框架SemGS,其核心创新包括:采用双分支结构分别提取颜色与语义特征,共享浅层卷积网络以使语义推理能够利用颜色外观中的纹理和结构线索;引入相机感知注意力机制显式建模不同相机视角间的几何关系;通过解码为共享几何一致性的双高斯(dual-Gaussians)表示来保留各分支特有属性,并进一步栅格化生成新视角下的语义图;同时设计区域平滑损失增强语义连贯性。该方法实现了快速推理与跨多样合成及真实场景的强泛化能力。
链接: https://arxiv.org/abs/2603.02548
作者: Sheng Ye,Zhen-Hui Dong,Ruoyu Fan,Tian Lv,Yong-Jin Liu
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2026
点击查看摘要
Abstract:Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.
[CV-88] On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding ICLR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在封闭集动作理解任务中,基于生成式分类器(generative classifiers)存在的效率低下和语义歧义问题。生成式方法通过自回归生成动作标签文本,导致共享子词引入语义重叠,进而引发解码歧义;同时其逐 token 生成的特性也显著增加推理延迟。解决方案的关键在于提出一种生成辅助判别式分类器(Generation-Assisted Discriminative, GAD),该方法在微调阶段利用生成式建模增强特征表示,但推理时采用判别式的一次性分类机制,从而在不破坏预训练兼容性的前提下,实现准确率与推理效率的同步提升。实验表明,GAD 在多个基准上均达到最优性能,平均准确率提升2.5%,推理速度提高3倍。
链接: https://arxiv.org/abs/2603.02546
作者: Zhanzhong Pang,Dibyadip Chatterjee,Fadime Sener,Angela Yao
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 9 figures, 16 tables. Accepted by ICLR2026
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose Generation-Assisted Discriminative~(GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.
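为直观理解摘要中生成式与判别式两类分类器的差异,下面给出一个纯 Python 极简示意(非论文实现,标签与词表概率均为假设数值):判别式分类器一次前向得到类别 logits 后取 argmax 即可;生成式分类器需逐 token 累加标签文本的对数概率,而标签间共享子词(如 "open")正是摘要所述语义歧义的来源。

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# 判别式: 一步前向得到各动作类别 logits, 直接 argmax
def discriminative_predict(logits):
    probs = softmax(logits)
    return max(range(len(probs)), key=lambda i: probs[i])

# 生成式: 逐 token 累加标签文本的对数概率 (标签间共享子词会引入语义重叠)
def generative_score(label_tokens, token_logprob):
    return sum(token_logprob[t] for t in label_tokens)

token_logprob = {"open": math.log(0.4), "door": math.log(0.3),
                 "close": math.log(0.2), "book": math.log(0.1)}
labels = {"open door": ["open", "door"], "open book": ["open", "book"],
          "close door": ["close", "door"]}
scores = {name: generative_score(toks, token_logprob) for name, toks in labels.items()}
best_generative = max(scores, key=scores.get)

print(discriminative_predict([2.0, 0.5, -1.0]))  # -> 0
print(best_generative)  # -> open door
```

判别式路径无需自回归解码,这对应摘要中报告的推理加速;GAD 的具体微调机制以论文为准。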
[CV-89] ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection ICLR2026
【速读】:该论文旨在解决森林环境中失踪人员检测难题,特别是由于密集树冠遮蔽导致传统无人机(UAV)顶部或斜向航拍图像难以捕捉地表目标的问题。其核心挑战在于现有视觉检测模型在复杂林下遮挡、低光照及多变环境条件下的泛化能力不足。解决方案的关键是构建了ForestPersons数据集——一个包含96,482张图像和204,078个标注的大型基准数据集,涵盖多种环境与时间条件,每个标注均包含边界框、姿态信息及可见性标签,以支持遮挡感知的行人检测研究。该数据集提供贴近微型飞行器(MAV)在搜救任务中实际视角的地表和低空视角图像,填补了当前缺乏真实林下场景训练与评估基准的空白,并揭示了主流目标检测模型在此类场景中的性能局限,为提升现实搜救场景中的人员检测能力提供了关键支撑。
链接: https://arxiv.org/abs/2603.02541
作者: Deokyun Kim,Jeongjun Lee,Jungwon Choi,Jonggeon Park,Giyoung Lee,Yookyung Kim,Myungseok Ki,Juho Lee,Jihun Cha
机构: ETRI; KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026 Accepted
点击查看摘要
Abstract:Detecting missing persons in forest environments remains a challenge, as dense canopy cover often conceals individuals from detection in top-down or oblique aerial imagery typically captured by Unmanned Aerial Vehicles (UAVs). While UAVs are effective for covering large, inaccessible areas, their aerial perspectives often miss critical visual cues beneath the forest canopy. This limitation underscores the need for under-canopy perspectives better suited for detecting missing persons in such environments. To address this gap, we introduce ForestPersons, a novel large-scale dataset specifically designed for under-canopy person detection. ForestPersons contains 96,482 images and 204,078 annotations collected under diverse environmental and temporal conditions. Each annotation includes a bounding box, pose, and visibility label for occlusion-aware analysis. ForestPersons provides ground-level and low-altitude perspectives that closely reflect the visual conditions encountered by Micro Aerial Vehicles (MAVs) during forest Search and Rescue (SAR) missions. Our baseline evaluations reveal that standard object detection models, trained on prior large-scale object detection datasets or SAR-oriented datasets, show limited performance on ForestPersons. This indicates that prior benchmarks are not well aligned with the challenges of missing person detection under the forest canopy. We offer this benchmark to support advanced person detection capabilities in real-world SAR scenarios. The dataset is publicly available at this https URL.
[CV-90] Functional Properties of the Focal-Entropy AISTATS2026
【速读】:该论文旨在解决焦点损失(focal-loss)在类别不平衡分类问题中缺乏系统信息论分析的问题,特别是其理论性质与实际行为之间的不一致性。解决方案的关键在于从分布视角出发,提出并研究焦点熵(focal-entropy),作为交叉熵(cross-entropy)的类比形式;通过严格数学推导,揭示了焦点熵的有限性、凸性与连续性条件,并证明其最小化器的存在性与唯一性,同时阐明其结构可显著偏离数据分布。研究表明,焦点损失会放大中等概率事件、抑制高概率结果,并在极端类别失衡下引发对极小概率的过度抑制现象,从而为理解焦点损失机制提供了坚实的理论基础并厘清其在不平衡学习任务中的权衡特性。
链接: https://arxiv.org/abs/2603.02533
作者: Jaimin Shah,Martina Cardone,Alex Dytso
机构: 未知
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注: Accepted to AISTATS 2026
点击查看摘要
Abstract:The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification problems, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of the focal-loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal-entropy, a focal-loss analogue of the cross-entropy. Our analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy, and provides various asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which are also experimentally validated, offer a theoretical foundation for understanding the focal-loss and clarify the trade-offs that it introduces when applied to imbalanced learning tasks.
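摘要中的焦点熵是交叉熵的焦点损失类比。下面按焦点损失的常见调制形式给出一个示意实现(具体定义以论文为准,此处公式为假设的简化):γ=0 时退化为普通交叉熵,γ>0 时高概率项被 (1−q)^γ 因子压低,对应摘要所述"抑制高概率结果"的行为。

```python
import math

def focal_cross_entropy(p, q, gamma):
    """焦点交叉熵的一种常见形式 (示意, 具体定义以论文为准):
    H_γ(p, q) = -Σ_i p_i (1 - q_i)^γ ln q_i
    γ = 0 时退化为普通交叉熵。"""
    return -sum(pi * (1 - qi) ** gamma * math.log(qi)
                for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]

ce = focal_cross_entropy(p, q, gamma=0.0)  # 普通交叉熵
fe = focal_cross_entropy(p, q, gamma=2.0)  # 焦点熵: 高概率项被 (1-q)^γ 压低
plain = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
assert abs(ce - plain) < 1e-12
print(ce, fe)
```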
[CV-91] EIMC: Efficient Instance-aware Multi-modal Collaborative Perception
【速读】:该论文旨在解决多模态协同感知中因“本地融合到通信”序列导致的高带宽消耗与协作效率低下的问题,尤其在自动驾驶场景下,现有方法需传输大量个体特征数据以实现跨车辆协同检测,造成资源浪费且难以有效处理遮挡目标。其解决方案的关键在于提出一种早期协同范式(Early Collaborative Paradigm),通过在本地模态融合阶段注入轻量级协同体素(collaborative voxels)来构建紧凑而信息丰富的3D协同先验,从而增强跨模态对齐;进一步引入基于热图驱动的一致性协议,仅在低置信度、高差异区域查询邻近代理的Top-K实例向量,并利用交叉注意力机制完成补全,最后通过自注意力增强各代理最置信实例的特征表示。该实例中心的消息传递机制显著降低冗余通信,同时保障关键遮挡物体的恢复,实现在OPV2V和DAIR-V2X数据集上AP@0.5达73.01%,且相比最优基线减少87.98%的带宽使用。
链接: https://arxiv.org/abs/2603.02532
作者: Kang Yang,Peng Wang,Lantao Li,Tianci Bu,Chen Sun,Deying Li,Yongcai Wang
机构: Renmin University of China (中国人民大学); Sony Research and Development Center China (索尼中国研究院); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures, 7 tables
点击查看摘要
Abstract:Multi-modal collaborative perception calls for great attention to enhancing the safety of autonomous driving. However, current multi-modal approaches remain a "local fusion to communication" sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual's feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that involves collecting the top-K most confident instances from each agent and enhancing their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code publicly released at this https URL.
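其中"基于置信度热图仅查询 Top-K 低置信区域"这一步可用如下极简示意理解(热图数值为虚构,非论文实现):只对置信度最低的少数位置向邻车请求实例向量,从而避免传输整幅特征图。

```python
def topk_low_confidence(heatmap, k):
    """从逐像素置信度热图中挑出置信度最低的 K 个位置 (示意),
    仅对这些区域向邻车查询实例向量, 以降低通信带宽。"""
    flat = [(conf, (r, c)) for r, row in enumerate(heatmap)
            for c, conf in enumerate(row)]
    flat.sort(key=lambda t: t[0])  # 按置信度升序
    return [pos for _, pos in flat[:k]]

heatmap = [[0.9, 0.2, 0.8],
           [0.1, 0.7, 0.95],
           [0.6, 0.05, 0.85]]
queries = topk_low_confidence(heatmap, k=3)
print(queries)  # 置信度最低的三个位置
```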
[CV-92] NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
【速读】:该论文旨在解决当前掩码图像建模(Masked Image Modeling, MIM)在地球观测(Earth Observation, EO)图像自监督学习中忽视邻近图像空间依赖性的问题。由于地球表面具有连续性,相邻区域的图像之间存在强相关性并蕴含丰富的上下文信息,而现有方法未有效利用这一特性。解决方案的关键在于提出NeighborMAE框架,通过联合重建邻近地球观测图像来显式建模空间依赖关系;同时引入启发式策略动态调整掩码比例和像素级损失权重,以维持重建任务的挑战性并提升学习效果。实验表明,该方法在多种预训练数据集和下游任务上显著优于现有基线,验证了邻近图像在MIM中的价值及设计的有效性。
链接: https://arxiv.org/abs/2603.02522
作者: Liang Zeng,Valerio Marsocci,Wufan Zhao,Andrea Nascetti,Maarten Vergauwen
机构: KU Leuven; ESA Φ-lab; HKUST(GZ); KTH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remains largely overlooked. Since the Earth’s surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.
[CV-93] Beyond Anatomy: Explainable ASD Classification from rs-fMRI via Functional Parcellation and Graph Attention Networks
【速读】:该论文旨在解决自闭症谱系障碍(Autism Spectrum Disorder, ASD)分类中因使用解剖学脑分区(如AAL,116个区域)导致的刚性边界限制,从而难以捕捉ASD个体特异性的功能连接模式的问题。其关键解决方案在于采用基于图神经网络(Graph Neural Network, GNN)的深度学习框架,并对比分析解剖学分区(AAL)与功能衍生分区(MSDL,39个区域)对分类性能的影响;研究发现,将分区策略从AAL替换为MSDL即可带来10.7个百分点的准确率提升(从73.3%至84.0%),进一步通过图注意力网络(Graph Attention Network, GAT)集成模型实现95.0%的准确率(AUC=0.98),显著优于当前ABIDE I数据集上的所有GNN基准方法,表明功能分区是建模ASD分类最具影响力的决策因素。
链接: https://arxiv.org/abs/2603.02518
作者: Syeda Hareem Madani,Noureen Bibi,Adam Rafiq Jeraj,Sumra Khan,Anas Zafar,Rizwan Qureshi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
点击查看摘要
Abstract:Anatomical brain parcellations dominate rs-fMRI-based Autism Spectrum Disorder (ASD) classification, yet their rigid boundaries may fail to capture the idiosyncratic connectivity patterns that characterise ASD. We present a graph-based deep learning framework comparing anatomical (AAL, 116 ROIs) and functionally-derived (MSDL, 39 ROIs) parcellation strategies on the ABIDE I dataset. Our FSL preprocessing pipeline handles multi-site heterogeneity across 400 balanced subjects, with site-stratified 70/15/15 splits to prevent data leakage. Gaussian noise augmentation within training folds expands samples from 280 to 1,680. A three phase pipeline progresses from a baseline GCN with AAL (73.3% accuracy, AUC=0.74), to an optimised GCN with MSDL (84.0%, AUC=0.84), to a Graph Attention Network ensemble achieving 95.0% accuracy (AUC=0.98), outperforming all recent GNN-based benchmarks on ABIDE I. The 10.7-point gain from atlas substitution alone demonstrates that functional parcellation is the most impactful modelling decision. Gradient-based saliency and GNNExplainer analyses converge on the Posterior Cingulate Cortex and Precuneus as core Default Mode Network hubs, validating that model decisions reflect ASD neuropathology rather than acquisition artefacts. All code and datasets will be publicly released upon acceptance.
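论文所用图注意力网络的核心是按邻居归一化的注意力系数 α_ij = softmax_j(LeakyReLU(aᵀ[Wh_i ‖ Wh_j]))。下面是一个单头 GAT 层的纯 Python 玩具实现(脑区特征、邻接与权重均为假设数值,仅示意机制):

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(h, adj, W, a):
    """单头图注意力层 (示意): 先线性变换 Wh, 再对每个节点的邻居
    计算注意力系数并 softmax 归一化, 最后加权聚合邻居特征。"""
    Wh = [[sum(W[r][c] * hi[c] for c in range(len(hi)))
           for r in range(len(W))] for hi in h]
    out = []
    for i, nbrs in enumerate(adj):
        e = [leaky_relu(sum(av * x for av, x in zip(a, Wh[i] + Wh[j])))
             for j in nbrs]
        m = max(e)
        exp_e = [math.exp(v - m) for v in e]
        s = sum(exp_e)
        alpha = [v / s for v in exp_e]  # Σ_j α_ij = 1
        out.append([sum(al * Wh[j][k] for al, j in zip(alpha, nbrs))
                    for k in range(len(Wh[0]))])
    return out

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 个脑区节点的功能连接特征
adj = [[0, 1, 2], [0, 1], [1, 2]]         # 含自环的邻接表
W = [[0.5, -0.3], [0.2, 0.8]]             # 2x2 线性变换
a = [0.1, -0.2, 0.3, 0.4]                 # 注意力向量 (拼接后 4 维)
out = gat_layer(h, adj, W, a)
print(len(out), len(out[0]))  # -> 3 2
```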
[CV-94] SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data
【速读】:该论文旨在解决不完整多模态语义分割(Incomplete Multimodal Semantic Segmentation, IMSS)问题,其核心挑战包括多模态不平衡(主导模态压制脆弱模态)、类内跨模态尺度/形状/方向差异(intra-class variation)以及跨模态异质性(cross-modal heterogeneity)导致的语义响应不一致。现有方法依赖对比学习或联合优化,易引发过对齐、忽略模态特异性特征或训练失衡,难以有效处理上述问题。解决方案的关键在于提出语义引导的模态感知(Semantic-Guided Modality-Aware, SGMA)框架,通过两个可插拔模块实现:(1) 语义引导融合(Semantic-Guided Fusion, SGF)模块提取多尺度类别语义原型以捕捉跨模态一致性表征,并基于原型-特征对齐估计各模态鲁棒性,进而加权自适应融合以缓解类内变化与跨模态冲突;(2) 模态感知采样(Modality-Aware Sampling, MAS)模块利用SGF提供的鲁棒性估计动态重加权训练样本,优先关注脆弱模态中的困难样本,从而缓解模态不平衡问题。该方案在多个数据集和骨干网络上均显著优于当前最优方法,尤其提升了脆弱模态的分割性能。
链接: https://arxiv.org/abs/2603.02505
作者: Lekang Wen,Liang Liao,Jing Xiao,Mi Wang
机构: Wuhan University (武汉大学); Hangzhou Institute of Technology, Xidian University (西安电子科技大学杭州研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. However, practical systems often encounter missing modalities due to sensor failures or incomplete coverage, termed Incomplete Multimodal Semantic Segmentation (IMSS). IMSS faces three key challenges: (1) multimodal imbalance, where dominant modalities suppress fragile ones; (2) intra-class variation in scale, shape, and orientation across modalities; and (3) cross-modal heterogeneity with conflicting cues producing inconsistent semantic responses. Existing methods rely on contrastive learning or joint optimization, which risk over-alignment, discarding modality-specific cues or imbalanced training, favoring robust modalities, while largely overlooking intra-class variation and cross-modal heterogeneity. To address these limitations, we propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance. SGMA introduces two complementary plug-and-play modules: (1) Semantic-Guided Fusion (SGF) module extracts multi-scale, class-wise semantic prototypes that capture consistent categorical representations across modalities, estimates per-modality robustness based on prototype-feature alignment, and performs adaptive fusion weighted by robustness scores to mitigate intra-class variation and cross-modal heterogeneity; (2) Modality-Aware Sampling (MAS) module leverages robustness estimations from SGF to dynamically reweight training samples, prioritizing challenging samples from fragile modalities to address modality imbalance. Extensive experiments across multiple datasets and backbones demonstrate that SGMA consistently outperforms state-of-the-art methods, with particularly significant improvements in fragile modalities.
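SGF 模块"基于原型-特征对齐估计各模态鲁棒性并加权融合"的思路,可用如下简化示意理解(余弦对齐度量与 softmax 加权均为本文假设的示例形式,非论文原实现):与类别原型冲突的脆弱模态被自动赋予更小的融合权重。

```python
import math

def cosine(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (du * dv)

def robustness_weighted_fusion(feats, prototype):
    """按 原型-特征 对齐程度估计各模态鲁棒性并加权融合 (示意)。"""
    scores = [cosine(f, prototype) for f in feats]  # 每个模态的鲁棒性分数
    m = max(scores)
    exp_s = [math.exp(s - m) for s in scores]
    total = sum(exp_s)
    w = [e / total for e in exp_s]                  # softmax 归一化权重
    fused = [sum(wi * f[k] for wi, f in zip(w, feats))
             for k in range(len(feats[0]))]
    return fused, w

optical = [1.0, 0.9, 0.1]        # 与类别原型对齐良好的模态
sar_degraded = [-0.5, 0.2, 1.0]  # 与原型冲突的脆弱/退化模态
prototype = [1.0, 1.0, 0.0]      # 类别语义原型 (假设数值)
fused, w = robustness_weighted_fusion([optical, sar_degraded], prototype)
print(w)  # 对齐更好的模态获得更大权重
```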
[CV-95] WTHaar-Net: a Hybrid Quantum-Classical Approach
【速读】:该论文旨在解决传统卷积神经网络(Convolutional Neural Networks, CNNs)中线性滤波操作计算效率低、参数冗余的问题,同时探索如何利用量子计算资源提升深度学习模型的性能。其核心解决方案是提出WTHaar-Net,一种将Haar Wavelet Transform(HWT)替代原有混合架构中使用的Hadamard Transform的新型CNN结构;关键在于HWT能够提供空间局部化和多分辨率表示,更符合视觉任务的归纳偏置(inductive bias),且其数学结构可被分解为适合量子电路实现的酉操作(unitary operations),从而支持在近中期量子设备上部署,实验表明该方法在CIFAR-10和Tiny-ImageNet数据集上实现了显著的参数压缩并保持了竞争力甚至优于ResNet与基于Hadamard的基线模型。
链接: https://arxiv.org/abs/2603.02497
作者: Vittorio Palladino,Tsai Idden,Ahmet Enis Cetin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 5 images
点击查看摘要
Abstract:Convolutional neural networks rely on linear filtering operations that can be reformulated efficiently in suitable transform domains. At the same time, advances in quantum computing have shown that certain structured linear transforms can be implemented with shallow quantum circuits, opening the door to hybrid quantum-classical approaches for enhancing deep learning models. In this work, we introduce WTHaar-Net, a convolutional neural network that replaces the Hadamard Transform used in prior hybrid architectures with the Haar Wavelet Transform (HWT). Unlike the Hadamard Transform, the Haar transform provides spatially localized, multi-resolution representations that align more closely with the inductive biases of vision tasks. We show that the HWT admits a quantum realization using structured Hadamard gates, enabling its decomposition into unitary operations suitable for quantum circuits. Experiments on CIFAR-10 and Tiny-ImageNet demonstrate that WTHaar-Net achieves substantial parameter reduction while maintaining competitive accuracy. On Tiny-ImageNet, our approach outperforms both ResNet and Hadamard-based baselines. We validate the quantum implementation on IBM Quantum cloud hardware, demonstrating compatibility with near-term quantum devices.
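Haar 小波变换本身是标准运算:一层一维变换由归一化的相邻元素和/差构成,整体为正交(酉)变换,这也是其可分解为酉操作、适合量子电路实现的原因。极简示意(标准公式,非论文代码):

```python
def haar_1d(x):
    """一层一维 Haar 小波变换 (正交归一化):
    近似系数 = (a + b)/√2, 细节系数 = (a - b)/√2。"""
    s = 2 ** 0.5
    approx = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return approx, detail

def ihaar_1d(approx, detail):
    """逆变换: 正交性保证完美重构。"""
    s = 2 ** 0.5
    x = []
    for a, d in zip(approx, detail):
        x += [(a + d) / s, (a - d) / s]
    return x

x = [4.0, 2.0, 5.0, 7.0]
approx, detail = haar_1d(x)
rec = ihaar_1d(approx, detail)
assert all(abs(u - v) < 1e-12 for u, v in zip(x, rec))  # 完美重构
print(approx, detail)
```

近似系数给出低分辨率平滑表示,细节系数捕捉局部变化,对应摘要中"空间局部化、多分辨率"的特性。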
[CV-96] ModalPatch: A Plug-and-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop
【速读】:该论文旨在解决多模态三维目标检测在自动驾驶场景中因传感器瞬时中断(如LiDAR或摄像头因硬件故障、恶劣天气或遮挡导致数据缺失)而引发的可靠性问题,尤其关注多模态同时失效时可能导致车辆“失明”的高风险情境。解决方案的关键在于提出一种即插即用的模块——ModalPatch,其核心机制包括两个方面:一是利用传感器数据的时间连续性,通过历史信息预测并补偿暂时不可用的特征;二是引入不确定性引导的跨模态融合策略,动态评估补偿特征的可靠性,抑制偏差信号并增强有效信息,从而在不改变原有检测框架结构或重新训练的前提下,显著提升检测模型在各类模态缺失条件下的鲁棒性和准确性。
链接: https://arxiv.org/abs/2603.02481
作者: Shuangzhi Li,Lei Ma,Xingyu Li
机构: University of Alberta, Canada (阿尔伯塔大学, 加拿大); The University of Tokyo, Japan (东京大学, 日本)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions.
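ModalPatch 的两个核心机制——基于历史的特征补偿与不确定性引导融合——可用如下简化示意理解(EMA 预测与 exp(−u) 加权均为本文假设的示例形式,非论文实现):模态瞬时丢失时用历史帧预测其特征,融合时对不确定性更高的补偿特征降权。

```python
import math

def ema_predict(history, beta=0.7):
    """用历史帧的指数滑动平均预测暂时缺失的模态特征 (示意)。"""
    pred = history[0][:]
    for frame in history[1:]:
        pred = [beta * p + (1 - beta) * x for p, x in zip(pred, frame)]
    return pred

def uncertainty_fuse(feat_a, feat_b, u_a, u_b):
    """不确定性引导的融合: 不确定性 u 越高, 权重 exp(-u) 越小。"""
    wa, wb = math.exp(-u_a), math.exp(-u_b)
    s = wa + wb
    wa, wb = wa / s, wb / s
    return [wa * a + wb * b for a, b in zip(feat_a, feat_b)]

lidar_history = [[1.0, 2.0], [1.2, 2.1], [1.4, 2.2]]  # LiDAR 最近几帧特征
lidar_pred = ema_predict(lidar_history)  # 当前帧 LiDAR 丢失时的补偿特征
camera_now = [1.5, 2.0]                  # 当前帧可用的相机特征
# 补偿特征不确定性更高 (u_a=1.0), 真实观测更可信 (u_b=0.2)
fused = uncertainty_fuse(lidar_pred, camera_now, u_a=1.0, u_b=0.2)
print(fused)
```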
[CV-97] E2E-GNet: An End-to-End Skeleton-based Geometric Deep Neural Network for Human Motion Recognition
【速读】:该论文旨在解决基于骨架的人体动作识别中,如何有效捕捉非欧几里得空间(non-Euclidean space)内数据的几何结构特征以提升识别准确率的问题。解决方案的关键在于提出一种端到端的几何深度神经网络 E2E-GNet,其核心创新包括:引入几何变换层(geometric transformation layer),在非欧空间上联合优化骨架运动序列,并通过可微对数映射激活函数(differentiable logarithm map activation)将其投影至线性空间;进一步设计畸变感知优化层(distortion-aware optimization layer),限制投影引起的形状畸变,从而保留关键的几何判别信息,最终实现更高的动作识别性能与更低的计算成本。
链接: https://arxiv.org/abs/2603.02477
作者: Mubarak Olaoluwa,Hassen Drira
机构: University of Strasbourg, ICube Laboratory, UMR 7357, F-67412 Strasbourg, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Geometric deep learning has recently gained significant attention in the computer vision community for its ability to capture meaningful representations of data lying in a non-Euclidean space. To this end, we propose E2E-GNet, an end-to-end geometric deep neural network for skeleton-based human motion recognition. To enhance the discriminative power between different motions in the non-Euclidean space, E2E-GNet introduces a geometric transformation layer that jointly optimizes skeleton motion sequences on this space and applies a differentiable logarithm map activation to project them onto a linear space. Building on this, we further design a distortion-aware optimization layer that limits skeleton shape distortions caused by this projection, enabling the network to retain discriminative geometric cues and achieve a higher motion recognition rate. We demonstrate the impact of each layer through ablation studies and extensive experiments across five datasets spanning three domains show that E2E-GNet outperforms other methods with lower cost.
[CV-98] Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild
【速读】:该论文旨在解决皮肤色调(skin tone)公平性评估中的关键挑战,包括现有数据集缺乏细粒度标注、训练-测试泄露、数据不平衡以及模型泛化能力不足等问题。其解决方案的关键在于:首先构建了一个大规模、公开可获取的STW数据集(42,313张图像,来自3,564人,采用10色调MST量表标注),解决了数据稀缺与代表性不足的问题;其次通过对比经典计算机视觉方法(SkinToneCCV)与深度学习方法,验证了深度学习模型在皮肤色调分类任务中能接近人工标注精度;最后提出SkinToneNet——一个微调的Vision Transformer(ViT)模型,在跨域数据上实现了最先进的泛化性能,从而支持对公共数据集(如CelebA和VGGFace2)进行可靠公平性审计。
链接: https://arxiv.org/abs/2603.02475
作者: Vitor Pereira Matias,Márcus Vinícius Lobo Costa,João Batista Neto,Tiago Novello de Brito
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 11 figures
点击查看摘要
Abstract:Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness, or use small, private datasets that prevent reproducibility, or often rely on classic computer vision pipelines, with a few using deep learning. They overlook issues like train-test leakage and dataset imbalance, and are limited by small or unavailable datasets. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce the STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone MST scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning reaches nearly annotator accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data available soon
[CV-99] Deep Learning Based Wildfire Detection for Peatland Fires Using Transfer Learning
【速读】:该论文旨在解决传统基于深度学习(Deep Learning, DL)的野火检测方法在识别泥炭地火灾时效果不佳的问题。泥炭地火灾具有闷烧燃烧、低火焰强度、持续烟雾和地下燃烧等独特物理与视觉特征,使得针对开放火焰森林火灾训练的模型难以有效泛化。解决方案的关键在于采用迁移学习(Transfer Learning)策略:首先使用通用野火检测模型的预训练权重初始化网络,再通过马来西亚泥炭地图像和视频数据集进行微调(Fine-tuning),从而在标注数据有限的情况下显著提升模型对泥炭地火灾的检测准确性和鲁棒性,尤其在低对比度烟雾、部分遮挡和光照变化等复杂场景中表现优异。
链接: https://arxiv.org/abs/2603.02465
作者: Emadeldeen Hamdan,Ahmad Faiz Tharima,Mohd Zahirasri Mohd Tohir,Dayang Nur Sakinah Musa,Erdem Koyuncu,Adam J. Watts,Ahmet Enis Cetin
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Fire and Rescue Department of Malaysia (马来西亚消防与救援局); Universiti Putra Malaysia (马来西亚农业大学); Universiti Malaysia Sabah (马来西亚沙巴大学); USDA Forest Service Pacific Wildland Fire Sciences Laboratory (美国农业部森林服务太平洋野火科学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics – such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning – that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.
[CV-100] ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
【速读】:该论文旨在解决文档视觉问答(Document Visual Question Answering, DocVQA)中现有视觉语言模型(Vision-Language Models, VLMs)在复杂推理和多步骤工作流下的性能瓶颈问题,尤其是模型难以将复杂问题分解为可执行的子任务,以及无法针对文档不同元素(如文本、表格、图像等)激活专用处理路径。其解决方案的关键在于提出一种名为ORCA(Orchestrated Reasoning with Collaborative Agents)的多智能体框架,通过策略性地协调多个专业化AI代理(agent)实现协同推理:首先由推理代理分解问题逻辑步骤,再通过路由机制调用对应模态的专用代理(如文本理解、表格解析等),并引入辩论机制与压力测试确保答案可靠性,辅以论点-反论点仲裁和合理性校验保障输出格式一致性,从而显著提升DocVQA的准确性与鲁棒性。
链接: https://arxiv.org/abs/2603.02438
作者: Aymen Lassoued,Mohamed Ali Souibgui,Yousri Kessentini
机构: Digital Research Center of Sfax, SMARTS Laboratory (数字研究实验室); École Polytechnique de Tunisie, University of Carthage (突尼斯国立理工学院,迦太基大学); Computer Vision Center, Universitat Autònoma de Barcelona (计算机视觉中心,巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.
[CV-101] MIRAGE: Knowledge Graph-Guided Cross-Cohort MRI Synthesis for Alzheimer's Disease Prediction
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)诊断中因磁共振成像(MRI)模态缺失导致的模型部署瓶颈问题,尤其是在缺乏真实MRI数据的患者队列中如何有效利用电子健康记录(Electronic Health Records, EHR)进行可靠分类。解决方案的关键在于提出MIRAGE框架,其核心创新是将缺失MRI问题重构为一种解剖结构引导的跨模态潜在空间蒸馏任务:首先通过生物医学知识图谱(Biomedical Knowledge Graph, KG)与图注意力网络(Graph Attention Networks)将异构EHR变量映射到统一嵌入空间,并实现跨队列传播;其次,引入一个冻结的预训练3D U-Net解码器作为辅助正则化机制,强制1D潜在表示编码符合生物学合理性的宏观病理语义,从而无需重建3D体素图像即可生成可用于AD分类的“诊断代理”表示,显著提升了无MRI队列中的分类性能(较单模态基线提升13%)。
链接: https://arxiv.org/abs/2603.02434
作者: Guanchen Wu,Zhe Huang,Yuzhang Xie,Runze Yan,Akul Chopra,Deqiang Qiu,Xiao Hu,Fei Wang,Carl Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reliable Alzheimer’s disease (AD) diagnosis increasingly relies on multimodal assessments combining structural Magnetic Resonance Imaging (MRI) and Electronic Health Records (EHR). However, deploying these models is bottlenecked by modality missingness, as MRI scans are expensive and frequently unavailable in many patient cohorts. Furthermore, synthesizing de novo 3D anatomical scans from sparse, high-dimensional tabular records is technically challenging and poses severe clinical risks. To address this, we introduce MIRAGE, a novel framework that reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task. First, MIRAGE leverages a Biomedical Knowledge Graph (KG) and Graph Attention Networks to map heterogeneous EHR variables into a unified embedding space that can be propagated from cohorts with real MRIs to cohorts without them. To bridge the semantic gap and enforce physical spatial awareness, we employ a frozen pre-trained 3D U-Net decoder strictly as an auxiliary regularization engine. Supported by a novel cohort-aggregated skip feature compensation strategy, this decoder acts as a rigorous structural penalty, forcing 1D latent representations to encode biologically plausible, macro-level pathological semantics. By exclusively utilizing this distilled “diagnostic-surrogate” representation during inference, MIRAGE completely bypasses computationally expensive 3D voxel reconstruction. Experiments demonstrate that our framework successfully bridges the missing-modality gap, improving the AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs.
[CV-102] A Unified Revisit of Temperature in Classification-Based Knowledge Distillation
【速读】:该论文试图解决知识蒸馏(Knowledge Distillation)中温度参数(temperature parameter)选择缺乏理论指导的问题,即如何根据训练配置(如优化器、教师模型预训练/微调策略等)合理设定温度值,以提升学生模型的学习效果。现有方法通常依赖网格搜索或沿用前人经验值,效率低且在不同训练设置下可能表现不佳。论文的关键解决方案在于提出一个统一的研究框架,系统性地分析温度与各类训练组件之间的交互关系,并识别出对温度选择具有显著影响的常见场景,从而为实践者提供可操作的指导原则。
链接: https://arxiv.org/abs/2603.02430
作者: Logan Frank,Jim Davis
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:A central idea of knowledge distillation is to expose relational structure embedded in the teacher’s weights for the student to learn, which is often facilitated using a temperature parameter. Despite its widespread use, there remains limited understanding on how to select an appropriate temperature value, or how this value depends on other training elements such as optimizer, teacher pretraining/finetuning, etc. In practice, temperature is commonly chosen via grid search or by adopting values from prior work, which can be time-consuming or may lead to suboptimal student performance when training setups differ. In this work, we posit that temperature is closely linked to these training components and present a unified study that systematically examines such interactions. From analyzing these cross-connections, we identify and present common situations that have a pronounced impact on temperature selection, providing valuable guidance for practitioners employing knowledge distillation in their work.
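作为背景,论文研究的温度即经典 Hinton 式蒸馏损失中的 T:教师与学生的 logits 各除以 T 后做 softmax,再计算 KL 散度并乘 T² 保持梯度尺度。T 越高,教师分布越平滑,类间关系结构暴露得越充分。标准公式的极简示意(数值为示例):

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)
    es = [math.exp((x - m) / T) for x in logits]
    s = sum(es)
    return [e / s for e in es]

def kd_loss(teacher_logits, student_logits, T):
    """经典蒸馏损失: T^2 * KL(softmax(z_t/T) || softmax(z_s/T))。
    乘 T^2 是为了让梯度尺度不随温度变化。"""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

t = [3.0, 1.0, 0.2]   # 教师 logits
s = [2.5, 1.2, 0.1]   # 学生 logits
assert kd_loss(t, t, T=4.0) < 1e-12  # 学生与教师一致时损失为 0
print(kd_loss(t, s, T=1.0), kd_loss(t, s, T=4.0))
```

论文的贡献正在于指出:合适的 T 并非孤立超参数,而与优化器、教师训练方式等组件耦合。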
[CV-103] DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting
【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models)在蓝莓机器人采摘场景中的实际应用效能与性能边界不明确的问题。其解决方案的关键在于将DINOv3作为冻结的语义骨干网络(semantic backbone),通过轻量级解码器统一评估其在果实分割、 bruise 分割及果实和果簇检测等任务中的表现,发现分割任务受益于稳定的局部patch表示并随骨干网络规模提升,而检测任务受限于目标尺度变化、patch离散化以及定位兼容性问题,尤其是果簇检测失败揭示了模型在空间聚合关系建模上的局限性,从而表明DINOv3更适合作为依赖下游空间建模的语义特征提取器,而非端到端任务模型。
链接: https://arxiv.org/abs/2603.02419
作者: Rui-Feng Wang,Daniel Petti,Yue Chen,Changying Li
机构: University of Florida, Gainesville, FL 32611, USA; Georgia Institute of Technology, Atlanta, GA 30332, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures, 5 tables
点击查看摘要
Abstract:Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.
[CV-104] TruckDrive: Long-Range Autonomous Highway Driving Dataset
【速读】:该论文旨在解决重卡(heavy trucks)在高速公路上实现安全自动驾驶的挑战,核心问题在于现有感知系统难以覆盖长距离(>100米)场景,而重卡因制动距离长,需对数百米范围内的交通环境进行前瞻感知与规划。解决方案的关键在于构建一个名为TruckDrive的高速公路尺度多模态驾驶数据集,其传感器配置专为远距离感知设计:包括七组长距离调频连续波(FMCW)激光雷达(LiDAR)用于测量距离和径向速度、三组高分辨率短距LiDAR、十一台8MP全景摄像头及十组4D FMCW雷达。该数据集提供47.5万样本,其中16.5万帧密集标注,支持长达1000米的2D检测和400米的3D检测、深度估计、目标跟踪、路径规划及端到端驾驶任务,揭示了当前主流自动驾驶模型在超过150米距离时性能显著下降(3D感知任务下降31%至99%),暴露出当前架构与训练信号无法弥合的“长距离感知鸿沟”。
链接: https://arxiv.org/abs/2603.02413
作者: Filippo Ghilotti,Edoardo Palladin,Samuel Brucker,Adam Sigal,Mario Bijelic,Felix Heide
机构: Torc Robotics; Princeton University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long-range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousand samples with 165 thousand densely annotated frames for driving perception benchmarking up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end-to-end driving over 20-second sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.
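论文强调重卡制动距离长、感知范围必须远超 100 米,这一点可以用一个粗略的停车距离计算来体会(减速度与反应时间为假设的典型值,非论文数据):

```python
def stopping_distance(speed_kmh, decel_mps2=3.0, reaction_s=1.5):
    """粗略的重卡停车距离:反应距离 + 制动距离 v^2 / (2a)。
    参数为假设的典型值,仅用于直观说明量级。"""
    v = speed_kmh / 3.6                      # 换算为 m/s
    return reaction_s * v + v * v / (2.0 * decel_mps2)

d = stopping_distance(90.0)   # 90 km/h 高速工况
# 约 142 m:直观说明为何仅 ~100 m 的感知范围对重卡不足
```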
[CV-105] From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness CVPR2026
【速读】:该论文旨在解决现有数据蒸馏(Dataset Distillation, DD)方法在压缩数据集时仅关注样本数量减少,而忽视数据精度对训练效率影响的问题。其解决方案的关键在于提出量化感知的数据蒸馏框架(Quantization-aware Dataset Distillation, QuADD),通过在蒸馏过程中嵌入可微分的量化模块,实现合成样本与量化参数的端到端联合优化,在固定比特预算下同时提升数据集的紧凑性和精度。该框架支持均匀和自适应非均匀量化策略,其中自适应量化能从数据中学习量化级别,更好地表示信息密集区域,从而显著提高每比特的准确率表现。
链接: https://arxiv.org/abs/2603.02411
作者: My H. Dinh,Aditya Sant,Akshay Malhotra,Keya Patani,Shahab Hamidi-Rad
机构: InterDigital Communications, Inc. (InterDigital通信公司); Walmart Inc (沃尔玛公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026 - Findings Workshop
点击查看摘要
Abstract:Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.
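QuADD 的核心是在固定比特预算下联合权衡样本数与数值精度。以下纯 Python 草图示意均匀量化与比特预算的计算方式(函数与参数均为示意性假设,非论文实现):

```python
def uniform_quantize(values, n_bits, v_min=0.0, v_max=1.0):
    """把 [v_min, v_max] 内的值均匀量化到 2**n_bits 个离散级别。"""
    levels = 2 ** n_bits - 1
    out = []
    for v in values:
        v = min(max(v, v_min), v_max)                       # 截断到取值范围
        code = round((v - v_min) / (v_max - v_min) * levels)
        out.append(v_min + code * (v_max - v_min) / levels)
    return out

def bit_budget(num_samples, dims, bits_per_value):
    """蒸馏数据集的总比特开销:QuADD 在此固定预算下
    权衡样本数量与每个数值的精度。"""
    return num_samples * dims * bits_per_value

q4 = uniform_quantize([0.07, 0.49, 0.86], n_bits=4)
# 精度从 8 bit 降到 4 bit,可在同一预算下容纳两倍样本:
same_budget = bit_budget(100, 1024, 4) == bit_budget(50, 1024, 8)
```

论文中的量化模块是可微的(以便与合成样本端到端联合优化),这里的 `round` 仅展示其前向量化行为。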
[CV-106] OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments CVPR2026
【速读】:该论文旨在解决智能制造环境中工人活动识别的难题,以实现生产效率的量化评估与优化,并提升作业安全性。其解决方案的关键在于构建并公开了一个大规模多模态数据集——OpenMarcie,该数据集涵盖穿戴式传感器与环境摄像头等多种感知模态,覆盖37小时以上的第一人称和第三人称视角数据,包含8种数据类型及200多个独立信息通道。通过两个实验场景(自行车拆装任务与3D打印机组装任务)模拟真实制造流程中的个体操作与协作行为,为人类活动识别(如活动分类、开放词汇描述生成和跨模态对齐)提供了高质量基准,从而推动生成式AI在工业场景中对工人行为理解与干预能力的发展。
链接: https://arxiv.org/abs/2603.02390
作者: Hymalai Bello,Lala Ray,Joanna Sorysz,Sungho Suh,Paul Lukowicz
机构: DFKI Kaiserslautern (德国人工智能研究中心凯撒斯劳滕分部); RPTU Kaiserslautern-Landau (凯撒斯劳滕-兰道大学); Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Accepted in CVPR 2026
点击查看摘要
Abstract:Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the biggest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer’s instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other’s progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment.
[CV-107] Advancing Earth Observation Through Machine Learning: A TorchGeo Tutorial ICLR
【速读】:该论文旨在解决地球观测(Earth Observation, EO)机器学习流程与标准计算机视觉工作流之间的根本差异问题,具体表现为遥感影像通常以大型地理参考场景形式提供、标签可能为不同坐标参考系统下的栅格掩膜或矢量几何数据,且训练和评估常需空间感知的采样与分割策略。解决方案的关键在于提出并实现TorchGeo——一个基于PyTorch的领域专用库,其核心抽象包括适配地理空间数据的Dataset、Sampler、Transforms以及预训练模型,从而简化在机器学习流程中使用地理空间数据的复杂性,并通过端到端案例研究(多光谱水体分割任务)验证其有效性。
链接: https://arxiv.org/abs/2603.02386
作者: Caleb Robinson,Nils Lehmann,Adam J. Stewart,Burak Ekim,Heng Fang,Isaac A. Corley,Mauricio Cordeiro
机构: Microsoft AI for Good Research Lab; Chair of Data Science in Earth Observation, Technical University of Munich; Chair of Earth Observation, University of the Bundeswehr Munich; Department of Robotics, Perception and Learning, KTH Royal Institute of Technology; Wherobots; National Water and Sanitation Agency, Brazil, Federal District, Brasilia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR ML4RS 2026 Tutorial Track
点击查看摘要
Abstract:Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows. Imagery is typically delivered as large, georeferenced scenes, labels may be raster masks or vector geometries in distinct coordinate reference systems, and both training and evaluation often require spatially aware sampling and splitting strategies. TorchGeo is a PyTorch-based domain library that provides datasets, samplers, transforms and pre-trained models with the goal of making it easy to use geospatial data in machine learning pipelines. In this paper, we introduce a tutorial that demonstrates 1.) the core TorchGeo abstractions through code examples, and 2.) an end-to-end case study on multispectral water segmentation from Sentinel-2 imagery using the Earth Surface Water dataset. This demonstrates how to train a semantic segmentation model using TorchGeo datasets, apply the model to a Sentinel-2 scene over Rio de Janeiro, Brazil, and save the resulting predictions as a GeoTIFF for further geospatial analysis. The tutorial code itself is distributed as two Python notebooks: this https URL and this https URL.
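TorchGeo 的核心抽象之一是空间感知采样。下面不依赖 torchgeo 包,用纯 Python 勾勒 RandomGeoSampler 背后的思想——直接在地理参考场景的坐标范围内随机抽取图块边界框,而不是索引预先切好的图像文件(概念草图,非 TorchGeo 实际 API):

```python
import random

def random_geo_samples(bounds, patch_size, n, seed=0):
    """在场景的 CRS 坐标范围 bounds = (minx, miny, maxx, maxy) 内,
    随机采样 n 个边长为 patch_size(CRS 单位)的图块边界框。"""
    minx, miny, maxx, maxy = bounds
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        x = rng.uniform(minx, maxx - patch_size)
        y = rng.uniform(miny, maxy - patch_size)
        boxes.append((x, y, x + patch_size, y + patch_size))
    return boxes

# 一个 10 km x 10 km 的 Sentinel-2 场景,图块边长 2560 m(10 m 分辨率下 256 像素)
patches = random_geo_samples((0.0, 0.0, 10000.0, 10000.0), 2560.0, n=4)
```

TorchGeo 实际实现还会处理坐标参考系转换与栅格/矢量标签的对齐,细节以教程 notebook 为准。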
[CV-108] Authenticated Contradictions from Desynchronized Provenance and Watermarking
【速读】:该论文旨在解决数字内容认证中两个独立验证层——C2PA(Content Credentials for Provenance and Authenticity)元数据和隐形水印(invisible watermarking)之间存在的“完整性冲突”(Integrity Clash)问题,即同一数字资产可同时携带合法的C2PA声明(如人类创作)和AI生成水印信号,且两者在孤立验证时均通过检测。解决方案的关键在于提出一种跨层审计协议(cross-layer audit protocol),联合评估元数据与水印检测状态,从而在不依赖密码学突破的前提下,实现对伪造内容的100%准确识别,有效弥合了现有验证机制的技术缝隙。
链接: https://arxiv.org/abs/2603.02378
作者: Alexander Nemecek,Hengzhi He,Guang Cheng,Erman Ayday
机构: Case Western Reserve University (凯斯西储大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: 11 pages
点击查看摘要
Abstract:Cryptographic provenance standards such as C2PA and invisible watermarking are positioned as complementary defenses for content authentication, yet the two verification layers are technically independent: neither conditions on the output of the other. This work formalizes and empirically demonstrates the "Integrity Clash", a condition in which a digital asset carries a cryptographically valid C2PA manifest asserting human authorship while its pixels simultaneously carry a watermark identifying it as AI-generated, with both signals passing their respective verification checks in isolation. We construct metadata washing workflows that produce these authenticated fakes through standard editing pipelines, requiring no cryptographic compromise, only the semantic omission of a single assertion field permitted by the current C2PA specification. To close this gap, we propose a cross-layer audit protocol that jointly evaluates provenance metadata and watermark detection status, achieving 100% classification accuracy across 3,500 test images spanning four conflict-matrix states and three realistic perturbation conditions. Our results demonstrate that the gap between these verification layers is unnecessary and technically straightforward to close.
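跨层审计协议的联合判定逻辑可以用一个简单的状态表示意(状态命名为本文之外的假设性示例,仅用于说明思路):

```python
def cross_layer_audit(c2pa_valid, c2pa_claims_human, watermark_detected):
    """联合评估来源元数据层与水印层。两层各自孤立验证均可通过,
    只有联合检查才能暴露论文所述的 "Integrity Clash"。"""
    if c2pa_valid and c2pa_claims_human and watermark_detected:
        return "integrity_clash"    # 认证的伪造:两层信号各自均通过
    if c2pa_valid and not c2pa_claims_human and watermark_detected:
        return "consistent_ai"      # 元数据与水印一致:AI 生成
    if c2pa_valid and c2pa_claims_human and not watermark_detected:
        return "consistent_human"   # 一致:人类创作
    return "inconclusive"           # 清单无效或信号不足,需进一步取证

verdict = cross_layer_audit(True, True, True)   # 认证伪造被联合检查捕获
```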
[CV-109] Aligning Fetal Anatomy with Kinematic Tree Log-Euclidean PolyRigid Transforms
【速读】:该论文旨在解决医学影像中对可动人体结构(articulated bodies)进行自动分析时,现有基于表面的模型忽略内部体积结构且变形方法缺乏解剖一致性保障的问题。其核心解决方案是提出一种基于Skinned Multi-Person Linear(SMPL)框架的可微分体积体模型,并引入一种新的基于运动学树的对数欧几里得多刚体(Kinematic Tree-based Log-Euclidean PolyRigid, KTPolyRigid)变换。KTPolyRigid通过消除大范围非局部运动相关的李代数歧义,确保生成平滑且双射的体积映射,从而显著减少形变场中的折叠伪影,在53例胎儿MRI数据上验证了其优越性,并支持鲁棒的群体图像配准与标签高效模板分割,为医学影像中可动结构的标准化体积分析提供了坚实基础。
链接: https://arxiv.org/abs/2603.02371
作者: Yingcheng Liu,Athena Taymourtash,Yang Liu,Esra Abaci Turk,William M. Wells,Leo Joskowicz,P. Ellen Grant,Polina Golland
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Automated analysis of articulated bodies is crucial in medical imaging. Existing surface-based models often ignore internal volumetric structures and rely on deformation methods that lack anatomical consistency guarantees. To address this problem, we introduce a differentiable volumetric body model based on the Skinned Multi-Person Linear (SMPL) formulation, driven by a new Kinematic Tree-based Log-Euclidean PolyRigid (KTPolyRigid) transform. KTPolyRigid resolves Lie algebra ambiguities associated with large, non-local articulated motions, and encourages smooth, bijective volumetric mappings. Evaluated on 53 fetal MRI volumes, KTPolyRigid yields deformation fields with significantly fewer folding artifacts. Furthermore, our framework enables robust groupwise image registration and a label-efficient, template-based segmentation of fetal organs. It provides a robust foundation for standardized volumetric analysis of articulated bodies in medical imaging.
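KTPolyRigid 的出发点是:在李代数(对数域)中加权混合刚体变换,而非直接平均变换矩阵。以下用二维情形做一个简化示意(仅分别混合旋转角与平移,省略了论文中完整的 SE(3) 处理与运动学树结构):

```python
import math

def log_euclidean_blend(angles, translations, weights):
    """在"对数"坐标下加权混合二维刚体运动。对 SO(2),矩阵对数就是
    旋转角,角度的加权平均即旋转的精确 log-Euclidean 均值;平移此处
    单独平均以求简洁(论文的 KTPolyRigid 处理完整的耦合情形)。"""
    theta = sum(w * a for w, a in zip(weights, angles))
    tx = sum(w * t[0] for w, t in zip(weights, translations))
    ty = sum(w * t[1] for w, t in zip(weights, translations))
    return theta, (tx, ty)

# 两根"骨骼"以相等蒙皮权重作用于一点:30° 与 90° 旋转
theta, t = log_euclidean_blend(
    [math.radians(30), math.radians(90)],
    [(1.0, 0.0), (0.0, 1.0)],
    [0.5, 0.5],
)
# 混合结果恰为 60° 旋转,仍是合法刚体旋转,避免直接平均矩阵的退化
```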
[CV-110] Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中因文化差异(如宗教、国籍和经济地位)引发的偏见问题,而这类偏见在以往研究中因难以从个体外貌直接识别而被忽视。其解决方案的关键在于构建一个高质量的合成数据集——Cultural Counterfactuals,该数据集包含近6万张反事实图像,通过图像编辑模型将不同人口统计学特征的人置于真实的文化语境图像中,从而生成同一人物在多种文化背景下的图像对,实现对LVLM输出受文化语境影响的精准量化分析。
链接: https://arxiv.org/abs/2603.02370
作者: Phillip Howard,Xin Su,Kathleen C. Fraser
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have grown increasingly powerful in recent years, but can also exhibit harmful biases. Prior studies investigating such biases have primarily focused on demographic traits related to the visual characteristics of a person depicted in an image, such as their race or gender. This has left biases related to cultural differences (e.g., religion, socioeconomic status), which cannot be readily discerned from an individual’s appearance alone, relatively understudied. A key challenge in measuring cultural biases is that determining which group an individual belongs to often depends upon cultural context cues in images, and datasets annotated with cultural context cues are lacking. To address this gap, we introduce Cultural Counterfactuals: a high-quality synthetic dataset containing nearly 60k counterfactual images for measuring cultural biases related to religion, nationality, and socioeconomic status. To ensure that cultural contexts are accurately depicted, we generate our dataset using an image-editing model to place people of different demographics into real cultural context images. This enables the construction of counterfactual image sets which depict the same person in multiple different contexts, allowing for precise measurement of the impact that cultural context differences have on LVLM outputs. We demonstrate the utility of Cultural Counterfactuals for quantifying cultural biases in popular LVLMs.
[CV-111] Retrieving Patient-Specific Radiomic Feature Sets for Transparent Knee MRI Assessment
【速读】:该论文旨在解决传统放射组学(radiomic)方法在特征选择上依赖于人群层面预定义特征集而导致性能受限的问题,以及现有自适应放射组学方法因仅基于边际排名而可能引入冗余描述符、忽略特征间互补交互的局限性。其解决方案的关键在于提出一种面向患者的特征集选择框架,通过两阶段检索策略——首先随机采样多样化的候选特征集,再利用学习到的评分函数对这些特征集进行排序,从而为每个患者生成一个紧凑且具有互补性和多样性证据的个性化特征子集,而非简单选取前k个边际最优特征。此方法在保持高透明度的同时显著提升了诊断性能,并能提供可审计的特征集以关联临床结果与特定解剖区域及定量描述符。
链接: https://arxiv.org/abs/2603.02367
作者: Yaxi Chen,Simin Ni,Jingjing Zhang,Shaheer U. Saeed,Yipei Wang,Aleksandra Ivanova,Rikin Hargunani,Chaozong Liu,Jie Huang,Yipeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Classical radiomic features are designed to quantify image appearance and intensity patterns. Compared with end-to-end deep learning (DL) models trained for disease classification, radiomics pipelines with low-dimensional parametric classifiers offer enhanced transparency and interpretability, yet often underperform because of the reliance on population-level predefined feature sets. Recent work on adaptive radiomics uses DL to predict feature weights over a radiomic pool, then thresholds these weights to retain the top-k features from large radiomic pool F (often ~10^3). However, such marginal ranking can over-admit redundant descriptors and overlook complementary feature interactions. We propose a patient-specific feature-set selection framework that predicts a single compact feature set per subject, targeting complementary and diverse evidence rather than marginal top-k features. To overcome the intractable combinatorial search space of F choose k features, our method utilizes a 2-stage retrieval strategy: randomly sample diverse candidate feature sets, then rank these sets with a learned scoring function to select a high-performing feature set for the specific patient. The system consists of a feature-set scorer and a classifier that performs the final diagnosis. We empirically show that the proposed two-stage retrieval approximates an exhaustive search over all k-feature sets. Validated on tasks including ACL tear detection and KL grading for osteoarthritis, the method achieves strong diagnostic performance, outperforming the top-k approach with the same k values and remaining competitive with end-to-end DL models while maintaining high transparency. The model generates auditable feature sets that link clinical outcomes to specific anatomical regions and radiomic families, allowing clinicians to inspect which anatomical structures and quantitative descriptors drive the prediction.
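两阶段检索策略(随机采样候选特征集 → 评分排序)可以用如下纯 Python 草图示意(打分函数用简单启发式代替论文中学习得到的评分网络):

```python
import random

def two_stage_retrieval(pool_size, k, n_candidates, score_fn, seed=0):
    """阶段一:从大小为 pool_size 的放射组学特征池中随机采样
    n_candidates 个 k 特征候选集;阶段二:用打分函数对候选集整体
    排序,返回得分最高的特征集。以此近似 C(pool_size, k) 的穷举搜索。"""
    rng = random.Random(seed)
    candidates = [tuple(sorted(rng.sample(range(pool_size), k)))
                  for _ in range(n_candidates)]
    return max(candidates, key=score_fn)

# 玩具打分器:偏好包含低序号特征的集合(代替学习得到的评分网络)
best = two_stage_retrieval(pool_size=1000, k=5, n_candidates=200,
                           score_fn=lambda s: -min(s))
```

关键区别在于:打分对象是整个特征集而非单个特征,因此互补性与多样性能被显式纳入排序。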
[CV-112] Beyond Caption-Based Queries for Video Moment Retrieval CVPR2026
【速读】:该论文旨在解决现有视频定位与检索(Video Moment Retrieval, VMR)方法,特别是DETR类架构在训练时使用描述性文本查询(caption-based queries)但评估时采用搜索类查询(search queries)所面临的性能下降问题。研究发现,这种性能退化主要源于两个关键的泛化挑战:一是语言层面的“语言差距”(language gap),即搜索查询通常比描述性查询更模糊、信息不足;二是时间维度上的“多时刻差距”(multi-moment gap),即从单时刻查询向多时刻查询转变导致的分布偏移。论文进一步识别出一个核心架构缺陷——“主动解码器-查询坍缩”(active decoder-query collapse),该现象显著削弱了模型对多时刻实例的建模能力。解决方案的关键在于通过结构改进增强解码器中活跃查询的数量,从而缓解坍缩问题,提升模型在搜索查询场景下的泛化性能,实验表明该方法在搜索查询上最高提升14.82% mAP_m,在多时刻搜索查询上提升高达21.83% mAP_m。
链接: https://arxiv.org/abs/2603.02363
作者: David Pujol-Perich,Albert Clapés,Dima Damen,Sergio Escalera,Michael Wray
机构: University of Barcelona (巴塞罗那大学); Computer Vision Center (计算机视觉中心); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Camera-ready version
点击查看摘要
Abstract:In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets – i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) A language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures – an active decoder-query collapse – as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available in the project webpage: this https URL
[CV-113] MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
【速读】:该论文旨在解决当前基于Transformer的神经视觉几何模型(如VGGT和Pi3)在大规模无序图像集合上进行三维重建时,因依赖全注意力机制而导致GPU显存瓶颈、难以扩展的问题。解决方案的关键在于提出一种无需训练的“分而治之”框架MERG3R:首先将无序图像重新排序并划分为具有重叠区域和几何多样性的子集,实现独立局部重建;随后通过高效的全局对齐与置信度加权的捆绑调整(bundle adjustment)过程,融合各局部重建结果,生成全局一致的3D模型。该方法具备模型无关性,显著提升了内存效率与可扩展性,同时保持高重建精度。
链接: https://arxiv.org/abs/2603.02351
作者: Leo Kaixuan Cheng,Abdus Shaikh,Ruofan Liang,Zhijie Wu,Yushi Guan,Nandita Vijaykumar
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.
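将独立的局部重建合并到统一全局坐标系,核心是利用重叠区域的对应点估计片段间变换。以下为最小二乘"尺度+平移"对齐的简化示意(省略旋转;MERG3R 实际采用全局对齐加置信度加权捆绑调整):

```python
def align_fragments(src_pts, dst_pts):
    """用重叠区域的对应 3D 点,最小二乘估计把一个局部重建映射到
    另一局部重建坐标系的尺度 s 与平移 t(简化:不含旋转)。"""
    n = len(src_pts)
    def mean(pts):
        return [sum(p[i] for p in pts) / n for i in range(3)]
    ms, md = mean(src_pts), mean(dst_pts)
    num = sum(sum((p[i] - ms[i]) * (q[i] - md[i]) for i in range(3))
              for p, q in zip(src_pts, dst_pts))
    den = sum(sum((p[i] - ms[i]) ** 2 for i in range(3)) for p in src_pts)
    s = num / den                          # 最小二乘尺度
    t = [md[i] - s * ms[i] for i in range(3)]
    return s, t

src = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
dst = [(1, 2, 3), (3, 2, 3), (1, 4, 3), (1, 2, 5)]   # dst = 2*src + (1,2,3)
s, t = align_fragments(src, dst)
```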
[CV-114] Preconditioned Score and Flow Matching
【速读】:该论文旨在解决生成式模型中基于向量场的训练方法(如流匹配和基于得分的扩散模型)在中间分布 $ p_t $ 几何结构不良时导致的优化偏差问题,即当 $ p_t $ 的协方差矩阵 $ \Sigma_t $ 条件数不佳时,梯度下降会优先拟合高方差方向而系统性地忽略低方差模式,从而造成学习过程过早停滞于次优权重。解决方案的关键在于提出一种可逆的、标签条件化的预条件映射(preconditioning map),通过改善 $ \Sigma_t $ 的条件数来重塑 $ p_t $ 的几何结构,而不改变原始生成模型的本质;该方法不以加速早期收敛为目标,而是通过释放此前被抑制的方向实现持续优化,从而避免次优平台现象,在MNIST潜在空间流匹配及高分辨率数据集上均验证了其有效性。
链接: https://arxiv.org/abs/2603.02337
作者: Shadab Ahamed,Eshed Gal,Simon Ghyselincks,Md Shahriar Rahim Siddiqui,Moshe Eliasof,Eldad Haber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 12 figures, 5 tables
点击查看摘要
Abstract:Flow matching and score-based diffusion train vector fields under intermediate distributions p_t , whose geometry can strongly affect their optimization. We show that the covariance \Sigma_t of p_t governs optimization bias: when \Sigma_t is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning that plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional preconditioning maps that reshape the geometry of p_t by improving the conditioning of \Sigma_t without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching, and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics and show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
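协方差条件数不佳导致低方差方向欠优化的现象,可以在一个二维二次目标上直接观察(示意实验,非论文设置):

```python
def gd_residuals(lam_hi, lam_lo, steps):
    """在二次目标 0.5*(lam_hi*x^2 + lam_lo*y^2) 上从 (1, 1) 出发做
    梯度下降。步长受高曲率方向限制为 1/lam_hi,因此低方差方向 y
    以 (1 - lam_lo/lam_hi) 的速率缓慢收敛——即 Sigma_t 条件数
    不佳所引起的优化偏差的最小模型。"""
    lr = 1.0 / lam_hi
    x = y = 1.0
    for _ in range(steps):
        x -= lr * lam_hi * x      # 高方差方向:一步即几乎收敛
        y -= lr * lam_lo * y      # 低方差方向:缓慢爬行
    return abs(x), abs(y)

rx, ry = gd_residuals(100.0, 1.0, steps=50)   # 条件数 100
px, py = gd_residuals(1.0, 1.0, steps=50)     # 理想预条件化后
# 条件数不佳时 y 的残差约为 0.99**50 ≈ 0.6;预条件化后两方向都收敛
```

预条件化映射的作用即相当于把第一种情形变换为第二种,使此前被压制的方向可以继续取得进展。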
[CV-115] HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding CVPR2026
【速读】:该论文旨在解决3D物体交互意图驱动的 affordance(可操作性)定位问题,即如何从图像中提取物体与环境之间的潜在交互信息,并将其精准映射到三维空间中以实现对新物体的泛化能力。解决方案的关键在于提出了一种名为 HAMMER 的新框架,其核心创新包括:1)通过聚合图像中描绘的交互意图生成接触感知嵌入(contact-aware embedding),并引导模型推断文本形式的 affordance 标签,从而深度挖掘物体语义与上下文线索;2)设计分层跨模态融合机制,充分利用多模态大语言模型(MLLM)提供的互补信息以优化3D表示;3)引入多粒度几何提升模块(multi-granular geometry lifting module),将空间特征注入意图嵌入中,从而实现高精度的3D affordance 定位。
链接: https://arxiv.org/abs/2603.02329
作者: Lei Yao,Yong Chen,Yuejiao Su,Yi Wang,Moyun Liu,Lap-Pui Chau
机构: The Hong Kong Polytechnic University (香港理工大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project Page: this https URL
点击查看摘要
Abstract:Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.
[CV-116] AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning
【速读】:该论文旨在解决面部 feminization surgery(FFS)术前规划中缺乏定量且可重复的解剖学指导的问题,当前主要依赖主观临床评估。其解决方案的关键在于提出AutoFFS框架,该框架通过对抗性自由形变(adversarial free-form deformations)生成反事实颅骨形态,具体方法是对一组预训练的二分类性别判别器进行基于变形的目标对抗攻击,从而将个体颅骨形状有效转化为目标性别的特征形态,为FFS提供可量化的术前规划依据。
链接: https://arxiv.org/abs/2603.02288
作者: Paul Friedrich,Florentin Bieder,Florian M. Thieringer,Philippe C. Cattin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Code: this https URL
点击查看摘要
Abstract:Facial feminization surgery (FFS) is a key component of gender affirmation for transgender and gender diverse patients, aiming to reshape craniofacial structures toward a female morphology. Current surgical planning procedures largely rely on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance. We therefore propose AutoFFS, a novel data-driven framework that generates counterfactual skull morphologies through adversarial free-form deformations. Our method performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers that learned sexual dimorphism, effectively transforming individual skull shapes toward the target sex. The generated counterfactual skull morphologies provide a quantitative foundation for preoperative planning in FFS, driving advances in this largely overlooked patient group. We validate our approach through classifier-based evaluation and a human perceptual study, confirming that the generated morphologies exhibit target sex characteristics.
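对抗性定向攻击的基本循环——沿使目标类得分上升的梯度方向更新参数,直至分类器输出翻转——可用一个线性打分器的玩具例子示意(以输入坐标代替论文中的自由形变参数,仅为概念说明,非 AutoFFS 的变形攻击实现):

```python
def targeted_attack(x, w, b, target_sign, lr=0.1, max_steps=1000):
    """对线性打分器 s = w·x + b 做定向攻击:沿梯度方向更新输入,
    直到得分符号与目标一致。线性情形下得分梯度恰为 w。"""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    for _ in range(max_steps):
        if (s > 0) == (target_sign > 0):
            break
        x = [xi + lr * target_sign * wi for xi, wi in zip(x, w)]
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return x, s

# 初始得分为负(源类别一侧);攻击后得分翻正(目标类别一侧)
x_adv, s_adv = targeted_attack([-1.0, -1.0], [1.0, 2.0], 0.0, target_sign=+1)
```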
[CV-117] Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection CVPR2026
【速读】:该论文针对增量目标检测(Incremental Object Detection, IOD)中因提示耦合(prompt coupling)和提示漂移(prompt drift)导致的提示退化问题展开研究。现有基于提示的方法虽具备无需回放(replay-free)和参数高效的优势,但在持续学习过程中容易因任务间提示干扰及监督信号不一致(如旧类别在后续任务中被误判为背景)而性能下降。解决方案的关键在于提出一种新颖的提示解耦框架 PDP(Prompt-Decoupled Paradigm),其核心创新包括:1)设计双池提示解耦机制,将共享池用于捕获任务通用知识以支持前向迁移,私有池用于学习任务特定判别特征,从而显式分离通用与特定提示,缓解提示耦合;2)引入原型伪标签生成(Prototypical Pseudo-Label Generation, PPG)模块,动态更新类别原型空间并利用原型过滤高质量伪标签,维持训练过程中的监督一致性,有效抑制提示漂移。实验表明,PDP 在 MS-COCO 和 PASCAL VOC 基准上分别实现 9.2% 和 3.3% 的 AP 提升,验证了其在稳定性和可塑性之间的良好平衡能力。
链接: https://arxiv.org/abs/2603.02286
作者: Yaoteng Zhang,Zhou Qing,Junyu Gao,Qi Wang
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Our paper has been accepted to CVPR 2026
点击查看摘要
Abstract:Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: this https URL_IOD/tree/main
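PPG 模块用类别原型过滤伪标签的思想,可用如下草图示意(距离阈值与特征均为假设的玩具数值,非论文实现):

```python
def filter_pseudo_labels(features, prototypes, tau):
    """原型伪标签过滤:只有当检测特征与最近的旧类别原型距离不超过
    阈值 tau 时,才接受该伪标签,避免旧类前景在后续任务中被当作背景。"""
    kept = []
    for idx, f in enumerate(features):
        dists = {c: sum((a - b) ** 2 for a, b in zip(f, p)) ** 0.5
                 for c, p in prototypes.items()}
        c_star = min(dists, key=dists.get)   # 最近原型对应的类别
        if dists[c_star] <= tau:
            kept.append((idx, c_star))
    return kept

protos = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
kept = filter_pseudo_labels([[0.9, 0.1], [5.0, 5.0]], protos, tau=0.5)
# 第一个特征距 "dog" 原型约 0.14,被接受;第二个特征距离过远,被丢弃
```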
[CV-118] From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification WWW
【速读】:该论文旨在解决宠物在大规模场景下自动识别(automated animal identification)的难题,尤其针对现有系统因数据集规模有限和过度依赖单一视觉线索而导致的性能瓶颈。其解决方案的关键在于构建一个融合视觉特征与合成文本语义先验的多模态验证框架(multimodal verification framework),通过引入基于合成文本描述的语义信息来增强视觉表征,从而更精确地区分不同个体。研究中采用SigLIP2-Giant和E5-Small-v2作为最优的视觉与文本编码器,并设计了门控融合机制(gated fusion mechanism),有效提升了模型在大型宠物重识别任务中的判别能力,最终实现Top-1准确率达84.28%、等错误率(Equal Error Rate)为0.0422的显著性能提升。
链接: https://arxiv.org/abs/2603.02270
作者: Vasiliy Kudryavtsev,Kirill Borodin,German Berezin,Kirill Bubenchikov,Grach Mkrtchian,Alexander Ryzhkov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at MDPI Journal of Imaging (see at this https URL )
点击查看摘要
Abstract:Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
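门控融合机制的基本形式是用 sigmoid 门逐维决定视觉与文本嵌入的混合比例,如下纯 Python 草图所示(门控参数为假设值,实际为端到端学习所得):

```python
import math

def gated_fusion(visual, textual, gate_w, gate_b):
    """自适应门控融合:sigmoid 门逐维决定视觉特征与文本语义特征的
    混合比例。gate_w / gate_b 代表学习得到的门控参数(此处为假设值)。"""
    fused = []
    for v, t, w, b in zip(visual, textual, gate_w, gate_b):
        g = 1.0 / (1.0 + math.exp(-(w * (v + t) + b)))  # sigmoid 门
        fused.append(g * v + (1.0 - g) * t)
    return fused

f = gated_fusion([1.0, -1.0], [0.5, 0.5], [10.0, 10.0], [0.0, 0.0])
# 第一维门接近 1,几乎完全保留视觉特征;第二维门接近 0,偏向文本特征
```

相比简单拼接,门控允许模型在视觉证据不可靠时自适应地倚重文本语义先验。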
[CV-119] Social-JEPA: Emergent Geometric Isomorphism
【速读】:该论文旨在解决多视角视觉系统中不同代理(agent)所学习的表征空间缺乏对齐的问题,从而阻碍了跨视角的知识迁移与协同。其核心挑战在于:在无参数共享或协调机制下,如何使来自不同视角的独立训练模型在潜在空间中形成几何一致的表示。解决方案的关键在于,通过预测性学习目标(predictive learning objectives)诱导出一种近似线性等距映射(approximate linear isometry)——即两个独立训练的代理所学的潜在空间之间存在可解析的线性关系,这种几何一致性使得一个代理学到的分类器可直接迁移到另一个代理上而无需额外梯度更新,同时支持以类蒸馏的方式加速后续学习并显著降低计算开销。这一发现揭示了预测性目标对表征几何结构的强大约束力,为去中心化视觉系统的轻量级互操作性提供了新路径。
链接: https://arxiv.org/abs/2603.02263
作者: Haoran Zhang,Youjin Wang,Yi Duan,Rong Fu,Dianyu Zhao,Sicheng Fan,Shuaishuai Cao,Wentao Guo,Xiao Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at this https URL.
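论文发现两个独立训练的潜空间近似相差一个线性等距映射。求解这种对齐的经典工具是正交 Procrustes 问题,其闭式解来自 SVD。以下用合成数据做一个极简演示(非论文原实验设置):

```python
import numpy as np

# 构造两个“独立学到”的潜空间:B 是 A 经过未知正交变换 Q 再加小噪声得到
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 16))             # 代理 1 的潜表示(合成数据)
Q_true, _ = np.linalg.qr(rng.normal(size=(16, 16)))
B = A @ Q_true + 0.01 * rng.normal(size=(100, 16))

# 正交 Procrustes:在正交矩阵中求 min ||A Q - B||_F,闭式解 Q = U V^T
U, _, Vt = np.linalg.svd(A.T @ B)
Q_hat = U @ Vt

err = np.linalg.norm(A @ Q_hat - B) / np.linalg.norm(B)
```

恢复出的 Q_hat 即可把一个代理的分类器权重“翻译”到另一个代理的潜空间中使用。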
[CV-120] CamDirector: Towards Long-Term Coherent Video Trajectory Editing
【速读】:该论文旨在解决视频轨迹编辑(Video Trajectory Editing, VTE)中面临的两个核心挑战:一是难以实现精确的相机控制,二是缺乏长时一致性。现有方法要么通过容量有限的嵌入注入目标相机位姿,要么依赖单帧变形且在视频扩散模型中仅隐式聚合跨帧信息,导致生成结果在空间一致性和时间连贯性上表现不佳。其解决方案的关键在于提出一种新型VTE框架:首先,采用混合变形策略显式聚合整个源视频的信息——静态区域逐步融合至世界缓存(world cache)并渲染到目标相机位姿,动态区域则直接变形,二者融合生成全局一致的粗略帧以指导后续精修;其次,利用历史引导的自回归扩散模型联合处理视频片段及其历史状态,同时增量更新世界缓存以强化已填充内容,从而实现长期的时间一致性。
链接: https://arxiv.org/abs/2603.02256
作者: Zhihao Shi,Kejia Yin,Weilin Wan,Yuhongze Zhou,Yuanhao Yu,Xinxin Zuo,Qiang Sun,Juwei Lu
机构: McMaster University (麦克马斯特大学); University of Toronto (多伦多大学); The University of Hong Kong (香港大学); McGill University (麦吉尔大学); Concordia University (康考迪亚大学); MBZUAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video (camera) trajectory editing aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme. Specifically, static regions are progressively fused into a world cache then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement. 2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.
[CV-121] Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting
【速读】:该论文旨在解决时间序列预测(Time Series Forecasting, TSF)中因周期内波动(intraperiod-fluctuations)与周期间趋势(interperiod-trends)复杂交织而导致的建模难题。现有方法将一维序列重塑为二维周期-相位表示时,存在两个关键缺陷:一是将重构张量视为静态图像导致拓扑不匹配,标准空间算子在网格边界破坏时间连续性;二是依赖固定尺寸的均匀表示,无法自适应分配建模资源以应对可压缩、非平稳的时间模式。解决方案的核心在于提出TimeGS框架,其将预测任务从回归转向二维生成式渲染(2D generative rendering),通过将未来序列建模为连续潜在表面,并利用高斯核的各向异性特性实现灵活几何对齐,从而自适应捕捉复杂变化。关键技术包括:Multi-Basis Gaussian Kernel Generation (MB-GKG) 块,从固定词典合成核函数以稳定优化;以及 Multi-Period Chronologically Continuous Rasterization (MP-CCR) 块,确保跨周期边界严格的时间连续性。
链接: https://arxiv.org/abs/2603.02220
作者: Yixin Wang,Yifan Hu,Peiyuan Liu,Naiqi Li,Dai Tao,Shu-Tao Xia
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen (清华大学深圳国际研究生院,清华大学,深圳); College of Computer Science and Software Engineering, Shenzhen University, Shenzhen (深圳大学计算机科学与软件工程学院,深圳)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Time series forecasting (TSF) remains a challenging problem due to the intricate entanglement of intraperiod-fluctuations and interperiod-trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a continuous latent surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art performance.
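“把预测视为二维渲染”的基本组件是将若干各向异性高斯核栅格化到周期-相位网格上。以下是一个与论文无关的最小栅格化草图(核参数均为随意设定),用于说明各向异性协方差如何控制核的形状与朝向:

```python
import numpy as np

def render_gaussians(H, W, mus, covs, weights):
    """将若干各向异性二维高斯核叠加栅格化到 H×W 网格(示意实现)。"""
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    out = np.zeros(H * W)
    for mu, cov, w in zip(mus, covs, weights):
        d = grid - mu
        inv = np.linalg.inv(cov)
        m = np.einsum("ni,ij,nj->n", d, inv, d)   # 马氏距离平方
        out += w * np.exp(-0.5 * m)
    return out.reshape(H, W)

surface = render_gaussians(
    H=8, W=24,
    mus=[np.array([4.0, 6.0]), np.array([3.0, 18.0])],
    covs=[np.diag([1.0, 9.0]),                       # 沿相位轴拉长的核
          np.array([[2.0, 1.0], [1.0, 2.0]])],       # 带旋转的各向异性核
    weights=[1.0, 0.5],
)
```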
[CV-122] ALARM: Automated MLLM -Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Model, MLLM)在复杂环境中进行视觉异常检测(Visual Anomaly Detection, VAD)时,因异常具有高度情境依赖性和模糊性而导致的不确定性问题。解决方案的关键在于提出一个基于不确定性量化(Uncertainty Quantification, UQ)支持的MLLM-VAD框架ALARM,其核心创新在于将UQ与推理链(reasoning chain)、自我反思(self-reflection)及MLLM集成(ensemble)等质量保障技术深度融合,并构建了严格的概率推断流程和计算机制,从而实现鲁棒且准确的异常检测性能。
链接: https://arxiv.org/abs/2512.03101
作者: Congjing Zhang,Feng Lin,Xinyi Zhao,Pei Guo,Wei Li,Lin Chen,Chaoyue Zhao,Shuai Huang
机构: University of Washington (华盛顿大学); Wyze Labs, Inc.
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The advance of Large Language Models (LLMs) has greatly stimulated research interest in developing multi-modal LLM (MLLM)-based visual anomaly detection (VAD) algorithms that can be deployed in complex environments. The challenge is that in these complex environments, the anomalies are sometimes highly contextual and also ambiguous, and thereby, uncertainty quantification (UQ) is a crucial capacity for an MLLM-based VAD system to succeed. In this paper, we introduce our UQ-supported MLLM-based VAD framework called ALARM. ALARM integrates UQ with quality-assurance techniques like reasoning chain, self-reflection, and MLLM ensemble for robust and accurate performance and is designed based on a rigorous probabilistic inference pipeline and computational process. Extensive empirical evaluations are conducted using the real-world smart-home benchmark data and wound image classification data, which shows ALARM’s superior performance and its generic applicability across different domains for reliable decision-making.
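ALARM 将 MLLM 集成与不确定性量化结合。一个常见的简化做法(此处仅为示意,并非论文的概率推断流程)是取多个模型的异常概率均值作为判定,用预测熵衡量集成内部的分歧:

```python
import numpy as np

def ensemble_anomaly(probs):
    """多模型集成:均值作为异常概率,二值预测熵作为不确定性度量(示意)。"""
    p = float(np.mean(probs))
    entropy = 0.0 if p in (0.0, 1.0) else -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return p, entropy

p1, u1 = ensemble_anomaly([0.9, 0.85, 0.95])   # 各模型意见一致 → 低不确定性
p2, u2 = ensemble_anomaly([0.9, 0.1, 0.5])     # 意见分歧 → 均值接近 0.5,熵最大
```

高熵样本即可触发自我反思或人工复核等质量保障路径。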
[CV-123] Biomechanically Accurate Gait Analysis: A 3d Human Reconstruction Framework for Markerless Estimation of Gait Parameters
【速读】:该论文旨在解决传统基于关键点的步态分析方法在生物力学可解释性(biomechanical interpretability)和临床适用性方面的局限性,尤其针对现有视觉方法难以提供可解释的关节运动学参数的问题。其解决方案的关键在于:从视频数据中重建三维人体模型后,提取与运动捕捉系统(motion capture system)类似的、具有生物力学意义的标记点,并将其集成到 OpenSim 平台中进行关节运动学估计,从而实现无需标记、可解释且高精度的步态评估。
链接: https://arxiv.org/abs/2603.02499
作者: Akila Pemasiri,Ethan Goan,Glen Lichtwark,Robert Schuster,Luke Kelly,Clinton Fookes
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents a biomechanically interpretable framework for gait analysis using 3D human reconstruction from video data. Unlike conventional keypoint based approaches, the proposed method extracts biomechanically meaningful markers analogous to motion capture systems and integrates them within OpenSim for joint kinematic estimation. To evaluate performance, both spatiotemporal and kinematic gait parameters were analysed against reference marker-based data. Results indicate strong agreement with marker-based measurements, with considerable improvements when compared with pose-estimation methods alone. The proposed framework offers a scalable, markerless, and interpretable approach for accurate gait assessment, supporting broader clinical and real world deployment of vision based biomechanics
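从三维标记点估计关节运动学的最基础一步,是由三个标记点(如髋-膝-踝)求关节夹角。以下为通用的几何计算示意(与论文的 OpenSim 流程无关,仅说明原理):

```python
import numpy as np

def joint_angle(a, b, c):
    """由三个标记点计算关节角,b 为关节顶点(如髋-膝-踝求膝角),单位为度。"""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

ang = joint_angle([0, 1, 0], [0, 0, 0], [1, 0, 0])   # 两段互相垂直 → 90°
```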
[CV-124] Geometric structures and deviations on James symmetric positive-definite matrix bicone domain
【速读】:该论文旨在解决对称正定(Symmetric Positive-Definite, SPD)矩阵数据在流形空间中几何结构建模与度量学习的问题,尤其关注如何构建更自然、便于计算且具有良好理论性质的几何框架。传统方法如仿射不变黎曼结构和对偶信息几何对数行列式障碍结构虽被广泛使用,但其测地线性质不够直观。本文的关键解决方案在于引入两种新结构:基于James双圆锥重参数化的Finsler结构和对偶信息几何结构,二者均确保测地线在适当坐标系下为直线,从而简化了优化与分析过程;同时,闭合双圆锥域包含谱单纯形(spectraplex)作为仿射子空间,并证明希尔伯特VPM距离可推广希尔伯特单纯形距离,增强了其在机器学习中的适用性。
链接: https://arxiv.org/abs/2603.02483
作者: Jacek Karwowski,Frank Nielsen
机构: University of Oxford (牛津大学); Sony Computer Science Laboratories Inc. (索尼计算机科学实验室公司)
类目: Machine Learning (stat.ML); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 35 pages, 4 figures
点击查看摘要
Abstract:Symmetric positive-definite (SPD) matrix datasets play a central role across numerous scientific disciplines, including signal processing, statistics, finance, computer vision, information theory, and machine learning among others. The set of SPD matrices forms a cone which can be viewed as a global coordinate chart of the underlying SPD manifold. Rich differential-geometric structures may be defined on the SPD cone manifold. Among the most widely used geometric frameworks on this manifold are the affine-invariant Riemannian structure and the dual information-geometric log-determinant barrier structure, each associated with dissimilarity measures (distance and divergence, respectively). In this work, we introduce two new structures, a Finslerian structure and a dual information-geometric structure, both derived from James’ bicone reparameterization of the SPD domain. Those structures ensure that geodesics correspond to straight lines in appropriate coordinate systems. The closed bicone domain includes the spectraplex (the set of positive semi-definite diagonal matrices with unit trace) as an affine subspace, and the Hilbert VPM distance is proven to generalize the Hilbert simplex distance which found many applications in machine learning. Finally, we discuss several applications of these Finsler/dual Hessian structures and provide various inequalities between the new and traditional dissimilarities.
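摘要中提到 Hilbert VPM 距离推广了单纯形上的 Hilbert 距离。后者在开单纯形上有简洁的闭式:d(p,q) = log max_i(p_i/q_i) − log min_i(p_i/q_i)。以下实现并验证其对称性与自距离为零:

```python
import numpy as np

def hilbert_simplex_distance(p, q):
    """开单纯形上的 Hilbert 投影度量:log max_i(p_i/q_i) - log min_i(p_i/q_i)。"""
    r = np.asarray(p, float) / np.asarray(q, float)
    return float(np.log(r.max()) - np.log(r.min()))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
d_pq = hilbert_simplex_distance(p, q)   # 比值为 {2, 1.2, 0.4} → log(2/0.4) = log 5
```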
[CV-125] Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification
【速读】:该论文旨在解决多标签胸部X光(CXR)分类中长尾分布带来的挑战,即罕见但临床重要的疾病类别因样本严重不足而难以准确识别。其解决方案的关键在于系统性地评估损失函数、卷积神经网络(CNN)骨干架构及训练后策略,并发现LDAM-DRW(带延迟重加权的类不平衡损失函数)在稀有类别识别上显著优于标准二元交叉熵(BCE)和不对称损失;同时,采用ConvNeXt-Large骨干网络结合分类器重新训练与测试时增强技术,进一步提升了模型性能,在CXR-LT 2026基准上实现了0.5220 mAP和0.3765 F1的开发集表现,并在官方测试榜单中取得第五名(0.3950 mAP)。
链接: https://arxiv.org/abs/2603.02294
作者: Nikhileswara Rao Sulake
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is part of the CXR Long-Tail Challenge at ISBI 2026. This is my team's report of its work during the challenge
点击查看摘要
Abstract:Long-tailed class distributions pose a significant challenge for multi-label chest X-ray (CXR) classification, where rare but clinically important findings are severely underrepresented. In this work, we present a systematic empirical evaluation of loss functions, CNN backbone architectures and post-training strategies on the CXR-LT 2026 benchmark, comprising approximately 143K images with 30 disease labels from PadChest. Our experiments demonstrate that LDAM with deferred re-weighting (LDAM-DRW) consistently outperforms standard BCE and asymmetric losses for rare class recognition. Amongst the architectures evaluated, ConvNeXt-Large achieves the best single-model performance with 0.5220 mAP and 0.3765 F1 on our development set, whilst classifier re-training and test-time augmentation further improve ranking metrics. On the official test leaderboard, our submission achieved 0.3950 mAP, ranking 5th amongst all 68 participating teams with total of 1528 submissions. We provide a candid analysis of the development-to-test performance gap and discuss practical insights for handling class imbalance in clinical imaging settings. Code is available at this https URL.
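LDAM 的核心是按类别样本数 n_j 的 −1/4 次幂分配分类边际,使稀有类获得更大边际。以下以单样本多类情形给出一个 numpy 示意(论文处理的是多标签胸片任务,此处仅演示边际与损失的计算逻辑,类分布为假设值):

```python
import numpy as np

def ldam_margins(class_counts, max_m=0.5):
    """LDAM 边际:m_j ∝ n_j^{-1/4},并缩放使最大边际为 max_m。"""
    m = 1.0 / np.power(np.asarray(class_counts, float), 0.25)
    return m * (max_m / m.max())

def ldam_loss(logits, y, margins, s=30.0):
    """单样本 LDAM:对真实类 logit 减去其边际后做缩放交叉熵。"""
    z = logits.astype(float).copy()
    z[y] -= margins[y]
    z = s * z
    z -= z.max()                          # 数值稳定
    logp = z - np.log(np.exp(z).sum())
    return float(-logp[y])

counts = [10000, 500, 20]                 # 长尾类分布(假设)
m = ldam_margins(counts)
loss_rare = ldam_loss(np.array([1.0, 1.0, 1.0]), 2, m)   # 稀有类边际大 → 损失大
loss_head = ldam_loss(np.array([1.0, 1.0, 1.0]), 0, m)
```

DRW(延迟重加权)则是在训练后期再对该损失按类频率的有效样本数加权。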
人工智能
[AI-0] Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals ICLR2026
【速读】:该论文旨在解决生成式 AI(Generative AI)代理在长上下文任务中因目标漂移(goal drift)而导致的行为偏离原始目标的问题。研究发现,尽管当前最先进的语言模型在对抗性压力下表现出一定鲁棒性,但其鲁棒性具有脆弱性:当模型基于较弱代理的预填充轨迹进行条件输入时,仍会继承显著的目标漂移。关键解决方案在于识别出漂移行为与指令层次遵循能力之间无强相关性,并提出需开发更精细的后训练技术以应对情境压力下的目标稳定性问题。
链接: https://arxiv.org/abs/2603.03258
作者: Achyutha Menon,Magnus Saebo,Tyler Crosse,Spencer Gibson,Eyon Jang,Diogo Cruz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures. Accepted at ICLR 2026 Lifelong Agents Workshop
点击查看摘要
Abstract:The accelerating adoption of language models (LMs) as agents for deployment in long-context tasks motivates a thorough understanding of goal drift: agents’ tendency to deviate from an original objective. While prior-generation language model agents have been shown to be susceptible to drift, the extent to which drift affects more recent models remains unclear. In this work, we provide an updated characterization of the extent and causes of goal drift. We investigate drift in state-of-the-art models within a simulated stock-trading environment (Arike et al., 2025). These models are largely shown to be robust even when subjected to adversarial pressure. We show, however, that this robustness is brittle: across multiple settings, the same models often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies significantly by model family, with only GPT-5.1 maintaining consistent resilience among tested models. We find that drift behavior is inconsistent between prompt variations and correlates poorly with instruction hierarchy following behavior, with strong hierarchy following failing to reliably predict resistance to drift. Finally, we run analogous experiments in a new emergency room triage environment to show preliminary evidence for the transferability of our results across qualitatively different settings. Our findings underscore the continued vulnerability of modern LM agents to contextual pressures and the need for refined post-training techniques to mitigate this.
[AI-1] Valet: A Standardized Testbed of Traditional Imperfect-Information Card Games
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在不完美信息博弈(imperfect-information games)领域中缺乏统一、多样且可比较的基准测试平台的问题,从而难以评估算法在不同游戏环境下的鲁棒性与泛化能力。解决方案的关键在于构建一个名为 Valet 的综合性测试平台,包含 21 种传统不完美信息卡牌游戏,覆盖多种文化背景、玩家数量、牌组结构和信息隐藏机制,并通过 RECYCLE 这一标准化卡牌游戏描述语言统一规则实现跨系统兼容;同时利用随机模拟对每款游戏的分支因子和持续时间进行量化分析,并提供蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)玩家对抗随机对手的基线得分分布,验证其作为基准套件的有效性和多样性。
链接: https://arxiv.org/abs/2603.03252
作者: Mark Goadrich,Achille Morenville,Éric Piette
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 1 table, 4 figures
点击查看摘要
Abstract:AI algorithms for imperfect-information games are typically compared using performance metrics on individual games, making it difficult to assess robustness across game choices. Card games are a natural domain for imperfect information due to hidden hands and stochastic draws. To facilitate comparative research on imperfect-information game-playing algorithms and game systems, we introduce Valet, a diverse and comprehensive testbed of 21 traditional imperfect-information card games. These games span multiple genres, cultures, player counts, deck structures, mechanics, winning conditions, and methods of hiding and revealing information. To standardize implementations across systems, we encode the rules of each game in RECYCLE, a card game description language. We empirically characterize each game’s branching factor and duration using random simulations, reporting baseline score distributions for a Monte Carlo Tree Search player against random opponents to demonstrate the suitability of Valet as a benchmarking suite.
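Valet 用随机模拟来刻画每款游戏的分支因子与时长。这一做法可以用一个极简的玩具出牌游戏来说明(并非 Valet 或 RECYCLE 的实现):每回合合法着法数即当前手牌数,对多局随机对局取平均:

```python
import random

def random_rollout(hand_size, rng):
    """随机对局:每回合任选一张手牌打出,记录平均分支因子与对局长度。"""
    hand = list(range(hand_size))
    branching, length = [], 0
    while hand:
        branching.append(len(hand))        # 当前合法着法数
        hand.pop(rng.randrange(len(hand)))
        length += 1
    return sum(branching) / len(branching), length

rng = random.Random(0)
stats = [random_rollout(13, rng) for _ in range(200)]
avg_branch = sum(b for b, _ in stats) / len(stats)   # 此玩具中恒为 (13+…+1)/13 = 7
avg_len = sum(l for _, l in stats) / len(stats)
```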
[AI-2] AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科学计算代码生成任务中面临的可靠性不足、多智能体工作流中的错误传播以及缺乏明确成功指标的评估难题。其解决方案的核心在于提出一种基于贝叶斯对抗机制的多智能体框架,集成任务管理器、代码生成器与评估器三个LLM驱动的智能体,并通过贝叶斯原理动态更新提示分布,融合功能正确性、结构一致性与静态分析等代码质量指标,实现测试用例与代码的协同优化。该设计显著降低了对单一LLM可靠性的依赖,有效缓解了科学任务中固有的评估不确定性,同时通过低代码平台(Low-code Platform, LCP)简化非专业用户的人机协作流程,无需人工进行复杂的提示工程即可生成鲁棒性强的科学计算代码。
链接: https://arxiv.org/abs/2603.03233
作者: Zihang Zeng,Jiaquan Zhang,Pengze Li,Yuan Qi,Xi Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) demonstrate potentials for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP’s effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.
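“用贝叶斯原理动态更新提示分布”的一种最简形式,是把代码质量得分当作似然,对候选提示模板上的分布做乘性更新并归一化。以下草图纯属示意(得分、温度参数均为假设,非论文的具体公式):

```python
import numpy as np

def bayes_update(prior, scores, temperature=1.0):
    """后验 ∝ 先验 × exp(得分/温度):得分高的提示模板获得更大采样概率。"""
    like = np.exp(np.asarray(scores, float) / temperature)
    post = np.asarray(prior, float) * like
    return post / post.sum()

prior = np.array([1 / 3, 1 / 3, 1 / 3])          # 三个候选提示模板的均匀先验
post = bayes_update(prior, scores=[0.9, 0.5, 0.1])  # 综合正确性/结构/静态分析的得分
```

迭代地重复该更新,即可让生成过程逐步偏向历史表现更好的提示配置。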
[AI-3] SynthCharge: An Electric Vehicle Routing Instance Generator with Feasibility Screening to Enable Learning-Based Optimization and Benchmarking
【速读】:该论文旨在解决现有电动车辆路径问题带时间窗(EVRPTW)基准数据集静态且缺乏可验证可行性的问题,从而限制了基于学习的路径规划模型的可复现评估。其解决方案的关键在于提出SynthCharge——一个参数化生成器,能够根据不同的时空配置和客户规模生成多样化且经过可行性筛选的EVRPTW实例;该生成器通过将实例几何结构与自适应能量容量缩放及基于续航能力的充电站布设相结合,并引入快速可行性筛选机制以剔除不可解实例,从而为新兴神经网络路径规划和数据驱动方法提供动态、结构有效的基准测试基础设施。
链接: https://arxiv.org/abs/2603.03230
作者: Mertcan Daysalilar,Fuat Uyguroglu,Gabriel Nicolosi,Adam Meyers
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:The electric vehicle routing problem with time windows (EVRPTW) extends the classical VRPTW by introducing battery capacity constraints and charging station decisions. Existing benchmark datasets are often static and lack verifiable feasibility, which restricts reproducible evaluation of learning-based routing models. We introduce SynthCharge, a parametric generator that produces diverse, feasibility-screened EVRPTW instances across varying spatiotemporal configurations and scalable customer counts. While SynthCharge can currently generate large-scale instances of up to 500 customers, we focus our experiments on sizes ranging from 5 to 100 customers. Unlike static benchmark suites, SynthCharge integrates instance geometry with adaptive energy capacity scaling and range-aware charging station placement. To guarantee structural validity, the generator systematically filters out unsolvable instances through a fast feasibility screening process. Ultimately, SynthCharge provides the dynamic benchmarking infrastructure needed to systematically evaluate the robustness of emerging neural routing and data-driven approaches.
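SynthCharge 的可行性筛选可以还原为一个简单的沿途电量检查:逐段扣除能耗,电量不足即判不可行。以下为高度简化的示意(假设充电站一次充满、能耗与距离成正比,并非生成器的真实规则):

```python
def route_feasible(route, dist, battery_capacity, consumption_rate, chargers):
    """沿路径检查电量:每段消耗 = 距离×能耗率;到达充电站即充满(简化假设)。"""
    charge = battery_capacity
    for a, b in zip(route, route[1:]):
        need = dist[(a, b)] * consumption_rate
        if need > charge:
            return False                   # 电量不足以到达下一节点
        charge -= need
        if b in chargers:
            charge = battery_capacity      # 简化:充电站一次充满
    return True

dist = {("D", "C1"): 40, ("C1", "S"): 40, ("S", "C2"): 40, ("C2", "D"): 40,
        ("D", "C2"): 120}
ok = route_feasible(["D", "C1", "S", "C2", "D"], dist, 100, 1.0, {"S"})   # 中途充电可行
bad = route_feasible(["D", "C2", "D"], dist, 100, 1.0, {"S"})             # 单段 120 > 100
```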
[AI-4] Stabilized Adaptive Loss and Residual-Based Collocation for Physics-Informed Neural Networks
【速读】:该论文旨在解决传统物理信息神经网络(Physics-Informed Neural Networks, PINNs)在处理高刚度或冲击主导动力学问题时存在的局限性,包括训练不平衡和解的不准确性,即使物理残差较小也难以获得高精度解。其解决方案的关键在于提出两种改进策略:一是基于平滑梯度范数的自适应损失平衡机制,以确保初始条件和边界条件的一致满足;二是基于残差的自适应配点策略,通过在高物理残差区域增加采样密度来提升解的精度。实验表明,该方法显著提高了求解精度,在Burgers方程案例中相对L2误差降低约44%,在Allen-Cahn方程中降低约70%。
链接: https://arxiv.org/abs/2603.03224
作者: Divyavardhan Singh,Shubham Kamble,Dimple Sonone,Kishor Upla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 Figures, 4 tables
点击查看摘要
Abstract:Physics-Informed Neural Networks (PINNs) have been recognized as a mesh-free alternative to solve partial differential equations where physics information is incorporated. However, in dealing with problems characterized by high stiffness or shock-dominated dynamics, traditional PINNs have been found to have limitations, including unbalanced training and inaccuracy in solution, even with small physics residuals. In this research, we seek to address these limitations using the viscous Burgers’ equation with low viscosity and the Allen-Cahn equation as test problems. In addressing unbalanced training, we have developed a new adaptive loss balancing scheme using smoothed gradient norms to ensure satisfaction of initial and boundary conditions. Further, to address inaccuracy in the solution, we have developed an adaptive residual-based collocation scheme to improve the accuracy of solutions in the regions with high physics residuals. The proposed new approach significantly improves solution accuracy with consistent satisfaction of physics residuals. For instance, in the case of Burgers’ equation, the relative L2 error is reduced by about 44 percent compared to traditional PINNs, while for the Allen-Cahn equation, the relative L2 error is reduced by approximately 70 percent. Additionally, we show the trustworthy solution comparison of the proposed method using a robust finite difference solver.
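论文的两项改进(按梯度范数平衡损失权重、按残差密度重采样配点)都可以各用几行 numpy 说明。以下草图只演示加权与采样的机制本身(残差分布、梯度范数均为假设值,非论文实验数据):

```python
import numpy as np

def adaptive_loss_weights(grad_norms, eps=1e-8):
    """按(平滑后)梯度范数的倒数配权,使各项损失的更新幅度趋于均衡。"""
    g = np.asarray(grad_norms, float)
    w = g.mean() / (g + eps)
    return w / w.sum() * len(w)            # 归一化到均值为 1

def resample_collocation(points, residuals, n_new, rng):
    """按物理残差大小成比例地重采样配点:残差大处采样密度高。"""
    p = np.abs(residuals) / np.abs(residuals).sum()
    idx = rng.choice(len(points), size=n_new, p=p)
    return points[idx]

rng = np.random.default_rng(2)
pts = np.linspace(0.0, 1.0, 1000)
res = np.exp(-((pts - 0.5) ** 2) / 0.001)  # 假设残差集中在 x=0.5 附近(如激波处)
new_pts = resample_collocation(pts, res, 500, rng)
w = adaptive_loss_weights([10.0, 1.0, 0.1])  # 梯度大的损失项(如 PDE 残差)权重调小
```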
[AI-5] NeuroSkill™: Proactive Real-Time Agentic System Capable of Modeling Human State of Mind
【速读】:该论文旨在解决如何在边缘设备上实现实时、主动的智能代理系统,以建模人类心智状态(Human State of Mind),并基于脑-机接口(Brain-Computer Interface, BCI)采集的生物物理和脑电信号进行多层级认知与情感交互的问题。其解决方案的关键在于构建一个完全离线运行的神经技能框架——NeuroSkill™,该框架通过自研的NeuroLoop™工具链整合基础EXG模型与文本嵌入模型,利用BCI设备提供的API和CLI接口直接解析人类心智状态,并据此执行可操作的工具调用和协议执行,从而在显式或隐式请求下实现对人类多层次心理状态(如共情)的有效响应。
链接: https://arxiv.org/abs/2603.03212
作者: Nataliya Kosmyna,Eugene Hauptmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages, 18 figures
点击查看摘要
Abstract:Real-time proactive agentic system, capable of modeling Human State of Mind, using foundation EXG model and text embeddings model, running fully offline on the edge. Unlike all previously known systems, the NeuroSkill™ system leverages this http URL description of Human’s State of Mind via API and CLI provided by the system, directly from the Brain-Computer Interface (BCI) devices, which record Human biophysical and brain signals. Our custom harness - NeuroLoop™ - utilizes all of the above to run agentic flow that manages to engage with the Human on multiple cognitive and affective levels of their State of Mind (e.g., empathy), by providing actionable tool calls and protocol execution with explicit or implicit requests from the Human. GPLv3 open-source software with ethically aligned AI100 licensing for the skill markdown.
[AI-6] Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
【速读】:该论文旨在解决如何利用人工神经网络(ANN)中不同类型的表征作为教师信号,以提升基于脑电图(EEG)的音乐识别性能这一问题。其关键解决方案在于区分声学相关(acoustic-related)与预期相关(expectation-related)的ANN表征,并将其分别作为监督信号用于训练EEG模型;研究发现,预训练模型若针对任一类型表征进行优化均优于非预训练基线,而两者结合则产生互补增益,超越通过随机初始化生成的强集成模型。此外,文中提出的预期表征直接从原始信号计算得出、无需人工标注,能够反映超出起始时刻或音高之外的预测结构,从而支持跨多种刺激的多层预测编码研究,为发展基于皮层编码原理的通用EEG模型提供了可行路径。
链接: https://arxiv.org/abs/2603.03190
作者: Shogo Noguchi,Taketo Akama,Tai Nakamura,Shun Minamikawa,Natalia Polouliakh
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 39 pages, 9 figures
点击查看摘要
Abstract:During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.
[AI-7] Neuro-Symbolic Artificial Intelligence: A Task-Directed Survey in the Black-Box Models Era IJCAI-25
【速读】:该论文旨在解决神经符号(Neuro-Symbolic, NeSy)方法在实际应用中面临的语义泛化能力有限以及难以处理具有预定义规则和模式的复杂领域的问题。其解决方案的关键在于系统性地考察任务特定的NeSy进展,探索如何通过融合符号系统来增强模型的可解释性和推理能力,从而提升其在现实世界场景中的实用性与竞争力,尤其是在自然语言处理(Natural Language Processing, NLP)和计算机视觉(Computer Vision, CV)等领域的表现。
链接: https://arxiv.org/abs/2603.03177
作者: Giovanni Pio Delvecchio,Lorenzo Molfetta,Gianluca Moro
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication at IJCAI-25. Please cite the definitive, copyrighted, peer reviewed and edited version of this Article published in IJCAI 25, pp. 4196-4176, 2025. DOI: this https URL
点击查看摘要
Abstract:The integration of symbolic computing with neural networks has intrigued researchers since the first theorizations of Artificial intelligence (AI). The ability of Neuro-Symbolic (NeSy) methods to infer or exploit behavioral schema has been widely considered as one of the possible proxies for human-level intelligence. However, the limited semantic generalizability and the challenges in declining complex domains with pre-defined patterns and rules hinder their practical implementation in real-world scenarios. The unprecedented results achieved by connectionist systems since the last AI breakthrough in 2017 have raised questions about the competitiveness of NeSy solutions, with particular emphasis on the Natural Language Processing and Computer Vision fields. This survey examines task-specific advancements in the NeSy domain to explore how incorporating symbolic systems can enhance explainability and reasoning capabilities. Our findings are meant to serve as a resource for researchers exploring explainable NeSy methodologies for real-life tasks and applications. Reproducibility details and in-depth comments on each surveyed research work are made available at this https URL.
[AI-8] FEAST: Retrieval-Augmented Multi-Hierarchical Food Classification for the FoodEx2 System ECAI2025
【速读】:该论文旨在解决食品分类任务中因标签间复杂依赖关系、数据稀疏性及极端输出维度所导致的挑战,特别是在欧洲食品安全局(EFSA)的FoodEx2系统中的实际应用问题。FoodEx2需将自然语言食品描述映射至多层级标准化编码体系,其结构复杂且存在细粒度标签稀疏现象,现有模型难以有效处理此类现实约束。解决方案的关键在于提出FEAST(Food Embedding And Semantic Taxonomy)框架,该框架采用三阶段分解策略:首先识别基础术语,其次预测多标签特征类别,最后分配具体特征描述符;并通过利用系统层次结构指导训练和深度度量学习,构建判别性嵌入表示,从而缓解数据稀疏问题并提升对罕见及细粒度标签的泛化能力。
链接: https://arxiv.org/abs/2603.03176
作者: Lorenzo Molfetta,Alessio Cocchieri,Stefano Fantazzini,Giacomo Frisoni,Luca Ragazzi,Gianluca Moro
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication at ECAI 2025. Please cite the definitive, copyrighted, peer reviewed and edited version of this Article published in ECAI 2025, edited by I. Lynce et al., FAIA, pp. 4169-4176, 2025. DOI: this https URL
点击查看摘要
Abstract:Hierarchical text classification (HTC) and extreme multi-label classification (XML) tasks face compounded challenges from complex label interdependencies, data sparsity, and extreme output dimensions. These challenges are exemplified in the European Food Safety Authority’s FoodEx2 system-a standardized food classification framework essential for food consumption monitoring and contaminant exposure assessment across Europe. FoodEx2 coding transforms natural language food descriptions into a set of codes from multiple standardized hierarchies, but faces implementation barriers due to its complex structure. Given a food description (e.g., "organic yogurt"), the system identifies its base term ("yogurt"), all the applicable facet categories (e.g., "production method"), and then, every relevant facet descriptors to each category (e.g., "organic production"). While existing models perform adequately on well-balanced and semantically dense hierarchies, no work has been applied on the practical constraints imposed by the FoodEx2 system. The limited literature addressing such real-world scenarios further compounds these challenges. We propose FEAST (Food Embedding And Semantic Taxonomy), a novel retrieval-augmented framework that decomposes FoodEx2 classification into a three-stage approach: (1) base term identification, (2) multi-label facet prediction, and (3) facet descriptor assignment. By leveraging the system’s hierarchical structure to guide training and performing deep metric learning, FEAST learns discriminative embeddings that mitigate data sparsity and improve generalization on rare and fine-grained labels. Evaluated on the multilingual FoodEx2 benchmark, FEAST outperforms the prior European CNN baseline's F1 scores by 12-38% on rare classes.
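FEAST 第一阶段的“基础术语识别”本质上是嵌入空间中的 top-k 检索。以下为与论文无关的通用示意(嵌入库、查询均为合成数据):先归一化再做内积,即余弦相似度排序:

```python
import numpy as np

def top_k_retrieval(query, corpus, k):
    """余弦相似度 top-k 检索:归一化后按内积降序取前 k 条(示意)。"""
    q = query / np.linalg.norm(query)
    C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = C @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(3)
emb = rng.normal(size=(50, 32))               # 假设的“基础术语”嵌入库
query = emb[7] + 0.01 * rng.normal(size=32)   # 与第 7 条几乎同向的查询
idx, sc = top_k_retrieval(query, emb, k=3)
```

深度度量学习的作用正是塑造这一嵌入空间,使语义相近的食品描述在检索时彼此靠近。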
[AI-9] Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的代理系统在形式验证(formal verification)任务中准确性和可靠性不足的问题,尤其是面对复杂、高精度需求场景时易产生幻觉和错误。其核心挑战在于如何提升生成式AI(Generative AI)在SystemVerilog Assertion (SVA) 生成中的可控性与准确性,并减少达成覆盖率闭合所需的迭代次数。解决方案的关键在于两项增强:一是引入结构化的规则手册(rulebook)与断言语法(specification grammar),以规范并提升SVA生成的质量;二是集成先进的检索增强生成(Retrieval Augmented Generation, RAG)技术(如GraphRAG),使代理能够访问领域知识库,实现输出的迭代优化。实验表明,这些改进使SVA生成准确率提升70%,达到覆盖率闭合所需迭代次数减少50%。
链接: https://arxiv.org/abs/2603.03175
作者: Aman Kumar,Deepak Narayan Gadde,Luu Danh Minh,Vaisakh Naduvodi Viswambharan,Keerthan Kopparam Radhakrishna,Sivaram Pothireddypalli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published at the DVCon U.S. 2026
点击查看摘要
Abstract:Saarthi is an agentic AI framework that uses multi-agent collaboration to perform end-to-end formal verification. Even though the framework provides a complete flow from specification to coverage closure, with around 40% efficacy, there are several challenges that need to be addressed to make it more robust and reliable. Artificial General Intelligence (AGI) is still a distant goal, and current Large Language Model (LLM)-based agents are prone to hallucinations and making mistakes, especially when dealing with complex tasks such as formal verification. However, with the right enhancements and improvements, we believe that Saarthi can be a significant step towards achieving domain-specific general intelligence for formal verification. Especially for problems that require Short Term, Short Context (STSC) capabilities, such as formal verification, Saarthi can be a powerful tool to assist verification engineers in their work. In this paper, we present two key enhancements to the Saarthi framework: (1) a structured rulebook and specification grammar to improve the accuracy and controllability of SystemVerilog Assertion (SVA) generation, and (2) integration of advanced Retrieval Augmented Generation (RAG) techniques, such as GraphRAG, to provide agents with access to technical knowledge and best practices for iterative refinement and improvement of outputs. We also benchmark these enhancements for the overall Saarthi framework using challenging test cases from NVIDIA’s CVDP benchmark targeting formal verification. Our benchmark results stand out with a 70% improvement in the accuracy of generated assertions, and a 50% reduction in the number of iterations required to achieve coverage closure.
[AI-10] An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
【速读】:该论文旨在解决孟加拉语(Bengali)在语音技术领域中作为低资源语言所面临的挑战,尤其是长音频转录(long-form transcription)和说话人聚类(speaker diarization)等复杂任务的性能瓶颈。其核心解决方案在于采用多阶段策略:首先利用在孟加拉语数据上微调的Whisper Medium模型(bengaliAI/tugstugi bengaliai-asr whisper-medium)实现高精度语音识别(ASR),其次结合pyannote/speaker-diarization-community-1与自训练的语音活动检测(Voice Activity Detection, VAD)分割模型,以应对多样且噪声干扰严重的声学环境。通过两阶段处理流程及超参数调优,系统在私有测试集上实现了0.27的错误率(DER),公开测试集为0.19;同时,转录模块通过分块处理、背景噪声清理和算法后处理达到38%的词错误率(WER)。研究表明,针对特定语言的精细化调参与数据策略可显著提升生成式AI(Generative AI)对南亚语言的包容性与实用性。
链接: https://arxiv.org/abs/2603.03158
作者: Epshita Jahan,Khandoker Md Tanjinul Islam,Pritom Biswas,Tafsir Al Nafin
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures
点击查看摘要
Abstract:Bengali remains a low-resource language in speech technology, especially for complex tasks like long-form transcription and speaker diarization. This paper presents a multistage approach developed for the “DL Sprint 4.0 - Bengali Long-Form Speech Recognition” and “DL Sprint 4.0 - Bengali Speaker Diarization” competitions on Kaggle, addressing the challenge of “who spoke when/what” in hour-long recordings. We implemented Whisper Medium fine-tuned on Bengali data (bengaliAI/tugstugi bengaliai-asr whisper-medium) for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model to handle diverse and noisy acoustic environments. Using a two-pass method with hyperparameter tuning, we achieved a DER of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic post-processing yielded a WER of 0.38 on the private leaderboard. These results show that targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages. All relevant code is available at: this https URL Index Terms: Bengali speech recognition, speaker diarization, Whisper, ASR, low-resource languages, pyannote, voice activity detection
[AI-11] Information Routing in Atomistic Foundation Models: How Equivariance Creates Linearly Disentangled Representations
【速读】:该论文旨在解决原子尺度基础模型(atomistic foundation models)在中间表示中编码的信息类型及其组织方式问题,特别是几何信息与组成信息的解耦程度。其核心解决方案是提出Composition Projection Decomposition (CPD) 方法,通过QR投影线性移除表示中的组成信号,并对残差几何结构进行探测;结果揭示了不同架构下几何信息的可线性访问性差异:张量积等变架构(如MACE)在去除组成信号后,几何信息几乎完全线性可解(如HOMO-LUMO能隙的R²_geom = 0.782),而手工设计描述符(如ANI-2x)则以非线性方式纠缠相同信息(R²_geom = -0.792,岭回归下;+0.784,MLP下)。此外,研究发现MACE通过不可约表示通道路由目标特定信号(如偶极矩至L=1,HOMO-LUMO能隙至L=0),而ViSNet未表现出此模式。研究表明,线性探针比梯度提升树更可靠,且线性解耦表示在样本效率上优于非解耦架构,表明等变架构具有超越预测精度的实际优势。
链接: https://arxiv.org/abs/2603.03155
作者: Joshua Steier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注:
点击查看摘要
Abstract:What do atomistic foundation models encode in their intermediate representations, and how is that information organized? We introduce Composition Projection Decomposition (CPD), which uses QR projection to linearly remove composition signal from learned representations and probes the geometric residual. Across eight models from five architectural families on QM9 molecules and Materials Project crystals, we find a disentanglement gradient: tensor product equivariant architectures (MACE) produce representations where geometry is almost fully linearly accessible after composition removal (R^2_geom = 0.782 for HOMO-LUMO gap), while handcrafted descriptors (ANI-2x) entangle the same information nonlinearly (R^2_geom = -0.792 under Ridge; R^2 = +0.784 under MLP). MACE routes target-specific signal through irreducible representation channels: dipole to L = 1, HOMO-LUMO gap to L = 0, a pattern not observed in ViSNet’s vector-scalar architecture under the same probe. We show that gradient boosted tree probes on projected residuals are systematically inflated, recovering R^2 = 0.68-0.95 on a purely compositional target, and recommend linear probes as the primary metric. Linearly disentangled representations are more sample-efficient under linear probing, suggesting a practical advantage for equivariant architectures beyond raw prediction accuracy.
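CPD 中“线性移除组成信号、探测几何残差”的一步,可用如下最小 Python 草图示意(假设组成信号以设计矩阵 C 给出;变量名与玩具数据均为示意,并非论文的官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设:n 个样本、d 维学习表示 H、k 维组成特征 C(均为随机生成的玩具数据)
n, d, k = 200, 16, 3
C = rng.normal(size=(n, k))
H = C @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))

# QR 投影移除组成信号:Q 张成样本空间中组成特征的列空间
Q, _ = np.linalg.qr(C)
H_geom = H - Q @ (Q.T @ H)   # 几何残差:线性上与组成特征正交

# 验证:残差与组成特征已线性无关(内积接近 0)
print(np.abs(C.T @ H_geom).max())
```

之后即可按论文思路在 H_geom 上训练线性探针,衡量几何信息的线性可达性。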
[AI-12] Agentic AI-based Coverage Closure for Formal Verification
【速读】:该论文旨在解决集成电路(IC)验证流程中覆盖率闭合(coverage closure)难以在项目时限内达成的问题,传统穷举式方法常因效率低下而无法实现全面覆盖。其解决方案的关键在于构建一种基于代理型人工智能(agentic AI)的工作流,利用大语言模型(LLM)驱动的生成式AI(GenAI)自动执行形式验证中的覆盖率分析,识别覆盖率缺口,并生成所需的正式属性(formal properties),从而系统性地填补覆盖率漏洞,显著提升验证效率与覆盖率指标,尤其在复杂设计中效果更为明显。
链接: https://arxiv.org/abs/2603.03147
作者: Sivaram Pothireddypalli,Ashish Raman,Deepak Narayan Gadde,Aman Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published at IEEE International Conference on Intelligent Processing, Hardware, Electronics, and Radio Systems 2026
点击查看摘要
Abstract:Coverage closure is a critical requirement in Integrated Chip (IC) development process and key metric for verification sign-off. However, traditional exhaustive approaches often fail to achieve full coverage within project timelines. This study presents an agentic AI-driven workflow that utilizes Large Language Model (LLM)-enabled Generative AI (GenAI) to automate coverage analysis for formal verification, identify coverage gaps, and generate the required formal properties. The framework accelerates verification efficiency by systematically addressing coverage holes. Benchmarking open-source and internal designs reveals a measurable increase in coverage metrics, with improvements correlated to the complexity of the design. Comparative analysis validates the effectiveness of this approach. These results highlight the potential of agentic AI-based techniques to improve formal verification productivity and support comprehensive coverage closure.
[AI-13] Channel-Adaptive Edge AI: Maximizing Inference Throughput by Adapting Computational Complexity to Channel States
【速读】:该论文旨在解决第六代(6G)网络中集成通信与计算(Integrated Communication and Computation, IC²)技术设计缺乏可 tractable 理论框架的问题,特别是如何准确刻画端到端(End-to-End, E2E)推理性能——该性能需同时考虑信道失真和人工智能(AI)模型架构及计算复杂度。解决方案的关键在于构建一个可解析的E2E推理准确率模型,该模型通过使用混合冯·米塞斯(Mixture of von Mises, MvM)分布对高维特征在角度域的分布进行建模,从而获得推理准确率关于量化比特宽度(代表信道失真)和模型遍历深度(代表计算复杂度)的闭式表达式。基于此模型,论文进一步提出了一个联合优化问题,在延迟和准确率约束下最大化边缘处理速率(Edge Processing Rate, EPR),并设计出一种通道自适应AI算法,动态调整发送端特征压缩与接收端模型复杂度以实现IC²的全集成,显著优于固定复杂度方案。
链接: https://arxiv.org/abs/2603.03146
作者: Jierui Zhang,Jianhao Huang,Kaibin Huang
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注: 14 pages, 14 figures
点击查看摘要
Abstract:Integrated communication and computation (IC^2) has emerged as a new paradigm for enabling efficient edge inference in sixth-generation (6G) networks. However, the design of IC^2 technologies is hindered by the lack of a tractable theoretical framework for characterizing end-to-end (E2E) inference performance. The metric is highly complicated as it needs to account for both channel distortion and artificial intelligence (AI) model architecture and computational complexity. In this work, we address this challenge by developing a tractable analytical model for E2E inference accuracy and leveraging it to design a channel-adaptive AI algorithm that maximizes inference throughput, referred to as the edge processing rate (EPR), under latency and accuracy constraints. Specifically, we consider an edge inference system in which a server deploys a backbone model with early exit, which enables flexible computational complexity, to perform inference on data features transmitted by a mobile device. The proposed accuracy model characterizes high-dimensional feature distributions in the angular domain using a Mixture of von Mises (MvM) distribution. This leads to a desired closed-form expression for inference accuracy as a function of quantization bit-width and model traversal depth, which represent channel distortion and computational complexity, respectively. Building upon this accuracy model, we formulate and solve the EPR maximization problem under joint latency and accuracy constraints, leading to a channel-adaptive AI algorithm that achieves full IC^2 integration. The proposed algorithm jointly adapts transmit-side feature compression and receive-side model complexity according to channel conditions to maximize overall efficiency and inference throughput. Experimental results demonstrate its superior performance as compared with fixed-complexity counterparts.
[AI-14] Joint Training Across Multiple Activation Sparsity Regimes
【速读】:该论文试图解决深度神经网络中泛化能力(generalization)尚不完全明确的问题。其核心假设是:生物系统展现出更强的泛化能力,可能源于其内部表示在密集与稀疏激活状态下均保持有效性。为此,作者提出一种简单训练策略,关键在于通过全局 top-k 约束对隐藏层激活施加稀疏性限制,并在单个模型中周期性地切换不同激活预算(activation budget),即通过渐进压缩和定期重置实现多稀疏度 regime 的联合训练。实验表明,两种自适应保留比率控制策略在 CIFAR-10 数据集上均优于密集基线训练,验证了跨激活稀疏度训练有助于提升模型泛化性能。
链接: https://arxiv.org/abs/2603.03131
作者: Haotian Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Generalization in deep neural networks remains only partially understood. Inspired by the stronger generalization tendency of biological systems, we explore the hypothesis that robust internal representations should remain effective across both dense and sparse activation regimes. To test this idea, we introduce a simple training strategy that applies global top-k constraints to hidden activations and repeatedly cycles a single model through multiple activation budgets via progressive compression and periodic reset. Using CIFAR-10 without data augmentation and a WRN-28-4 backbone, we find in single-run experiments that two adaptive keep-ratio control strategies both outperform dense baseline training. These preliminary results suggest that joint training across multiple activation sparsity regimes may provide a simple and effective route to improved generalization.
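文中的全局 top-k 激活约束与多预算循环,可粗略示意如下(keep_ratio 的取值与循环方式均为假设,并非论文的具体训练设置):

```python
import numpy as np

def global_topk(h, keep_ratio):
    """按幅值保留 keep_ratio 比例的激活,其余置零(全局 top-k 约束示意)。"""
    k = max(1, int(round(keep_ratio * h.size)))
    thresh = np.partition(np.abs(h).ravel(), -k)[-k]   # 第 k 大幅值作为阈值
    return np.where(np.abs(h) >= thresh, h, 0.0)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))          # 假想的一层隐藏激活
for keep in (1.0, 0.5, 0.1):         # 在多个激活预算间循环(渐进压缩的粗略示意)
    sparse = global_topk(h, keep)
    print(keep, int((sparse != 0).sum()))
```

实际训练中该约束作用于前向传播的隐藏激活,并按论文描述配合定期重置在多个预算间反复循环。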
[AI-15] AI Space Physics: Constitutive boundary semantics for open AI institutions
【速读】:该论文旨在解决当前治理语言在面对持续性、自扩展型人工智能(Agentic AI)机构时的不足问题,特别是这些机构在不立即改变外部世界的情况下,仍可能通过状态积累和权限边界扩展来影响未来行为能力的问题。传统治理机制主要关注决策层约束,但对“边界跨越”的因果机制定义模糊,尤其忽视了那些无即时外部效应却实质性扩大系统未来行动范围的过渡过程。解决方案的关键在于引入AI空间物理(AI Space Physics)作为开放自扩展AI机构的构成语义,其核心是建立一个包含类型化边界通道、有限视野可达语义和膜见证(membrane-witness)纪律的最小状态模型,并提出一组核心法则(P-1至P-1c),强制要求见证完整性、不可绕过中介、原子化裁决到执行转换以及可重放的裁决类重建。该框架首次将权限表面扩展明确列为第一类边界事件,并赋予其治理相关性,即便此类扩展未产生即时提交(commit)动作,也必须经过裁决与见证流程。
链接: https://arxiv.org/abs/2603.03119
作者: Oleg Romanchuk,Roman Bondar
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
点击查看摘要
Abstract:Agentic AI deployments increasingly behave as persistent institutions rather than one-shot inference endpoints: they accumulate state, invoke external tools, coordinate multiple runtimes, and modify their future authority surface over time. Existing governance language typically specifies decision-layer constraints but leaves the causal mechanics of boundary crossing underdefined, particularly for transitions that do not immediately change the external world yet expand what the institution can later do. This paper introduces AI Space Physics as a constitutive semantics for open, self-expanding AI institutions. We define a minimal state model with typed boundary channels, horizon-limited reach semantics, and a membrane-witness discipline. The core law family (P-1, P-1a, P-1b, P-1c) requires witness completeness, non-bypass mediation, atomic adjudication-to-effect transitions, and replayable reconstruction of adjudication class. We explicitly separate second-order effects into structural expansion and policy broadening, and treat expansion transitions as governance-relevant even when immediate external deltas are zero. The novelty claim is precise rather than expansive: this work does not introduce mediation as a concept; it reclassifies authority-surface expansion as a first-class boundary event with constitutive witness obligations. In this semantics, expansion without immediate commit remains adjudication-relevant.
[AI-16] Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在高风险场景中部署时,评估体系仅关注任务是否完成而忽视执行过程质量的问题。其解决方案的关键在于提出Procedure-Aware Evaluation (PAE) 框架,该框架将代理的执行流程形式化为结构化观测,并揭示观察、沟通与执行之间的一致性关系;通过多维门控机制对结果进行分类排除,从而从实用性(Utility)、效率(Efficiency)、交互质量(Interaction Quality)和程序完整性(Procedural Integrity)四个互补维度系统性评估代理行为,有效识别并剔除“虚假成功”(corrupt successes),提升评估的严谨性和可解释性。
链接: https://arxiv.org/abs/2603.03116
作者: Hongliu Cao,Ilias Driouich,Eoin Thomas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and applies multi-dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state-of-the-art LLM agents on tau-bench yields findings at the axis, compliance, and benchmark levels. At the axis level, the dimensions capture non-redundant failure modes: utility masks reliability gaps, speed does not imply precision, and conciseness does not predict intent adherence. At the procedural compliance level, 27-78% of benchmark reported successes are corrupt successes concealing violations across interaction and integrity. Furthermore, gating substantially collapses Pass^4 rate and affects model rankings. The analysis of corrupt success cases reveals distinctive per-model failure signatures: GPT-5 spreads errors across policy, execution, and intent dimensions; Kimi-K2-Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral-Large-3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.
[AI-17] From Complex Dynamics to DynFormer: Rethinking Transformers for PDEs
【速读】:该论文旨在解决传统数值求解偏微分方程(Partial Differential Equations, PDEs)在高维和多尺度场景下计算成本过高,以及现有基于Transformer的神经算子(Neural Operator)因对所有离散空间点采用统一注意力机制而导致冗余计算的问题。其核心解决方案是提出DynFormer,一种融合物理先验动力学信息的新型神经算子架构:关键在于通过显式分离不同物理尺度,引入谱嵌入(Spectral Embedding)以隔离低频模态,并设计基于Kronecker结构的注意力机制高效捕获大尺度全局交互;同时,利用局部-全局混合变换(Local-Global-Mixing)模块通过非线性频率乘积混合作用隐式重构小尺度湍流级联,避免全局注意力开销。这一分层、动态感知的架构显著提升了模型精度与效率,在多个PDE基准测试中实现相对误差降低高达95%,并大幅减少GPU内存消耗。
链接: https://arxiv.org/abs/2603.03112
作者: Pengyu Lai,Yixiao Chen,Dewu Yang,Rui Wang,Feng Wang,Hui Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
备注:
点击查看摘要
Abstract:Partial differential equations (PDEs) are fundamental for modeling complex physical systems, yet classical numerical solvers face prohibitive computational costs in high-dimensional and multi-scale regimes. While Transformer-based neural operators have emerged as powerful data-driven alternatives, they conventionally treat all discretized spatial points as uniform, independent tokens. This monolithic approach ignores the intrinsic scale separation of physical fields, applying computationally prohibitive global attention that redundantly mixes smooth large-scale dynamics with high-frequency fluctuations. Rethinking Transformers through the lens of complex dynamics, we propose DynFormer, a novel dynamics-informed neural operator. Rather than applying a uniform attention mechanism across all scales, DynFormer explicitly assigns specialized network modules to distinct physical scales. It leverages a Spectral Embedding to isolate low-frequency modes, enabling a Kronecker-structured attention mechanism to efficiently capture large-scale global interactions with reduced complexity. Concurrently, we introduce a Local-Global-Mixing transformation. This module utilizes nonlinear multiplicative frequency mixing to implicitly reconstruct the small-scale, fast-varying turbulent cascades that are slaved to the macroscopic state, without incurring the cost of global attention. Integrating these modules into a hybrid evolutionary architecture ensures robust long-term temporal stability. Extensive memory-aligned evaluations across four PDE benchmarks demonstrate that DynFormer achieves up to a 95% reduction in relative error compared to state-of-the-art baselines, while significantly reducing GPU memory consumption. Our results establish that embedding first-principles physical dynamics into Transformer architectures yields a highly scalable, theoretically grounded blueprint for PDE surrogate modeling.
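摘要中谱嵌入“隔离低频模态”的思想,可用一维 rFFT 低通分解做一个最小示意(n_low 等超参数与测试信号均为本示例假设):

```python
import numpy as np

def spectral_split(u, n_low):
    """用 rFFT 把一维场分为低频大尺度分量与高频残差(n_low 为假设的截断模态数)。"""
    U = np.fft.rfft(u)
    U_low = np.zeros_like(U)
    U_low[:n_low] = U[:n_low]            # 仅保留前 n_low 个低频模态
    low = np.fft.irfft(U_low, n=u.size)
    return low, u - low                  # (大尺度分量, 小尺度残差)

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
u = np.sin(x) + 0.05 * np.sin(20 * x)    # 大尺度 + 小尺度成分叠加
low, high = spectral_split(u, n_low=4)
print(np.allclose(low + high, u))        # True:分解按构造无损
```

在 DynFormer 中,低频分量交由全局注意力处理,高频残差则由局部-全局混合模块隐式重构;此处仅示意尺度分离本身。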
[AI-18] Multi-Scale Adaptive Neighborhood Awareness Transformer For Graph Fraud Detection
【速读】:该论文旨在解决图欺诈检测(Graph Fraud Detection, GFD)中现有基于图神经网络(Graph Neural Networks, GNNs)方法因固有归纳偏置(inductive bias)而导致的性能瓶颈问题,主要包括同质性假设(homogeneity assumption)和全局建模能力有限两大挑战。解决方案的关键在于提出多尺度邻域感知Transformer(Multi-scale Neighborhood Awareness Transformer, MANDATE),其核心创新包括:设计多尺度位置编码策略以捕捉中心节点到不同距离的拓扑位置信息,并结合自注意力机制显著增强模型的全局建模能力;针对同质边与异质边分别设计嵌入策略,缓解良性节点与欺诈节点在同质性分布上的差异;此外,引入嵌入融合策略处理多关系图结构,有效缓解由不同关系导致的分布偏移问题。实验表明,MANDATE在三个欺诈检测数据集上均展现出优越性能。
链接: https://arxiv.org/abs/2603.03106
作者: Jiaqi Lv,Qingfeng Du,Yu Zhang,Yongqi Han,Sheng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Graph fraud detection (GFD) is crucial for identifying fraudulent behavior within graphs, benefiting various domains such as financial networks and social media. Existing methods based on graph neural networks (GNNs) have succeeded considerably due to their effective expressive capacity for graph-structured data. However, the inherent inductive bias of GNNs, including the homogeneity assumption and the limited global modeling ability, hinder the effectiveness of these models. To address these challenges, we propose Multi-scale Neighborhood Awareness Transformer (MANDATE), which alleviates the inherent inductive bias of GNNs. Specifically, we design a multi-scale positional encoding strategy to encode the positional information of various distances from the central node. By incorporating it with the self-attention mechanism, the global modeling ability can be enhanced significantly. Meanwhile, we design different embedding strategies for homophilic and heterophilic connections. This mitigates the homophily distribution differences between benign and fraudulent nodes. Moreover, an embedding fusion strategy is designed for multi-relation graphs, which alleviates the distribution bias caused by different relationships. Experiments on three fraud detection datasets demonstrate the superiority of MANDATE.
[AI-19] Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
【速读】:该论文旨在解决Adam优化算法在实际应用中收敛速度优于随机梯度下降(SGD)的 empirically observed 性能差距缺乏理论解释的问题。现有理论对Adam的保证通常与SGD相当,无法充分说明其优越性。解决方案的关键在于识别出Adam中的一个关键的二阶矩归一化(second-moment normalization)机制,并基于停止时间(stopping time)和鞅(martingale)分析框架,在经典的有界方差模型(即二阶矩假设)下,首次严格区分了Adam与SGD的高概率收敛行为:Adam在置信参数δ下达到δ⁻¹/²的依赖关系,而SGD的相应高概率保证至少需要δ⁻¹的依赖,从而从理论上揭示了Adam在高概率收敛速率上的优势。
链接: https://arxiv.org/abs/2603.03099
作者: Ruinan Jin,Yingbin Liang,Shaofeng Zou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 59 pages
点击查看摘要
Abstract:Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a \delta^{-1/2} dependence on the confidence parameter \delta, whereas the corresponding high-probability guarantee for SGD necessarily incurs at least a \delta^{-1} dependence.
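论文强调的二阶矩归一化,可通过对比单步 SGD 与 Adam 更新来直观感受(下为标准 Adam 更新的示意实现,超参数取常用默认值,与论文的理论分析设定无关):

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """单步 Adam 更新示意:关键在 m_hat / (sqrt(v_hat) + eps) 的二阶矩归一化。"""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g        # 梯度二阶矩的滑动估计
    m_hat = m / (1 - b1 ** t)            # 偏差修正
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v

g = np.array([1e6])                      # 模拟一次重尾大梯度
sgd_step = 1e-3 * g                      # SGD:步长随梯度线性放大(= 1000)
adam_update, _, _ = adam_step(g, np.zeros(1), np.zeros(1), t=1)
print(sgd_step[0], adam_update[0])       # Adam 步长被归一化到约等于 lr
```

这与摘要的直觉一致:归一化抑制了罕见大梯度对单步更新的影响,从而带来对置信参数 δ 更优的高概率依赖。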
[AI-20] Odin: Multi-Signal Graph Intelligence for Autonomous Discovery in Knowledge Graphs
【速读】:该论文旨在解决知识图谱中自主发现有意义模式的问题,即在无预定义查询的情况下,系统能够自动探索并识别出具有结构重要性、语义合理性、时间相关性及社区间连通性的潜在模式,从而避免传统检索式系统受限于人工设定的查询范围。解决方案的关键在于提出了一种名为COMPASS(Composite Oriented Multi-signal Path Assessment)的多信号评分机制,该机制融合了四个维度:(1) 通过个性化PageRank计算结构重要性,(2) 利用神经概率逻辑学习(Neural Probabilistic Logic Learning, NPLL)作为判别式过滤器提升语义合理性,(3) 引入可配置衰减函数实现时间相关性建模,(4) 基于图神经网络(GNN)识别桥接实体和跨社区亲和度得分以打破“回音室效应”(echo chamber),确保探索不局限于局部密集社区。这一多信号集成策略显著提升了模式发现的质量与效率,并已在医疗和保险等受监管生产环境中成功部署,同时保证了完整的溯源追踪能力。
链接: https://arxiv.org/abs/2603.03097
作者: Muyukani Kizito,Elizabeth Nyambere
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
点击查看摘要
Abstract:We present Odin, the first production-deployed graph intelligence engine for autonomous discovery of meaningful patterns in knowledge graphs without prior specification. Unlike retrieval-based systems that answer predefined queries, Odin guides exploration through the COMPASS (Composite Oriented Multi-signal Path Assessment) score, a novel metric that combines (1) structural importance via Personalized PageRank, (2) semantic plausibility through Neural Probabilistic Logic Learning (NPLL) used as a discriminative filter rather than generative model, (3) temporal relevance with configurable decay, and (4) community-aware guidance through GNN-identified bridge entities and inter-community affinity scores. This multi-signal integration, particularly the bridge scoring mechanism, addresses the “echo chamber” problem where graph exploration becomes trapped in dense local communities. We formalize the autonomous discovery problem, prove theoretical properties of our scoring function, and demonstrate that beam search with multi-signal guidance achieves O(b·h) complexity while maintaining high recall compared to exhaustive exploration. To our knowledge, Odin represents the first autonomous discovery system deployed in regulated production environments (healthcare and insurance), demonstrating significant improvements in pattern discovery quality and analyst efficiency. Our approach maintains complete provenance traceability – a critical requirement for regulated industries where hallucination is unacceptable.
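COMPASS 将结构、语义、时间衰减与桥接四路信号组合为单一得分;下面是一个假设性的加权组合草图(权重、半衰期与信号取值均为示意,论文并未给出具体组合形式):

```python
import math

def compass_score(signals, w=(0.4, 0.3, 0.2, 0.1), half_life_days=30.0):
    """COMPASS 式多信号加权打分(权重与指数衰减形式均为本示例假设)。"""
    structural, semantic, age_days, bridge = signals
    temporal = math.exp(-math.log(2) * age_days / half_life_days)  # 可配置时间衰减
    return sum(wi * si for wi, si in zip(w, (structural, semantic, temporal, bridge)))

# beam search 中按得分保留 top-b 候选路径(对应 O(b·h) 的探索复杂度)
cands = {"p1": (0.9, 0.8, 10.0, 0.2), "p2": (0.5, 0.9, 90.0, 0.9)}
ranked = sorted(cands, key=lambda name: compass_score(cands[name]), reverse=True)
print(ranked)   # 较新且结构重要的 p1 排在前
```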
[AI-21] On the Expressive Power of Transformers for Maxout Networks and Continuous Piecewise Linear Functions
【速读】:该论文旨在解决Transformer架构在理论表达能力方面理解不足的问题,特别是其对连续分段线性函数的逼近能力尚不明确。解决方案的关键在于建立Transformer网络与maxout网络之间的显式近似关系,并在此基础上证明Transformer继承了ReLU网络的通用逼近能力,在相似模型复杂度约束下成立。进一步地,作者通过量化线性区域数量揭示了Transformer表达能力随深度呈指数增长的特性,从而构建了标准前馈神经网络逼近理论与Transformer架构之间的理论桥梁。这一分析还提供了结构洞见:自注意力层实现max-type操作,而前馈层则完成token-wise的仿射变换。
链接: https://arxiv.org/abs/2603.03084
作者: Linyan Gu,Lihua Yang,Feng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood. In this paper, we study the expressive capabilities of Transformer architectures. We first establish an explicit approximation of maxout networks by Transformer networks while preserving comparable model complexity. As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints. Building on this connection, we develop a framework to analyze the approximation of continuous piecewise linear functions by Transformers and quantitatively characterize their expressivity via the number of linear regions, which grows exponentially with depth. Our analysis establishes a theoretical bridge between approximation theory for standard feedforward neural networks and Transformer architectures. It also yields structural insights into Transformers: self-attention layers implement max-type operations, while feedforward layers realize token-wise affine transformations.
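文中以 maxout 网络为桥梁:maxout 单元对 k 个仿射函数逐点取最大,而 ReLU 是其中一支恒为零的特例。最小示意如下(数值为任意示例):

```python
import numpy as np

def maxout(x, W, b):
    """maxout 单元:对 k 个仿射函数 W[i]·x + b[i] 取最大值。"""
    return float(np.max(W @ x + b))

x = np.array([1.0, -2.0])
W = np.array([[1.0, 1.0],   # 第一支:x1 + x2
              [0.0, 0.0]])  # 第二支恒为 0 → 该单元退化为 ReLU(x1 + x2)
b = np.zeros(2)
print(maxout(x, W, b))      # max(-1, 0) = 0.0
```

摘要的结构性结论正对应于此:自注意力层实现这类 max 型运算,前馈层实现逐 token 的仿射变换。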
[AI-22] Beyond Factual Correctness: Mitigating Preference-Inconsistent Explanations in Explainable Recommendation
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的可解释推荐系统中存在的一种特定失效模式:生成的解释虽然在事实层面正确,但其依据的属性与用户历史偏好不一致,导致推理逻辑虽形式有效却缺乏说服力,且这种偏好不一致现象难以被传统幻觉或忠实度指标捕捉。解决方案的关键在于提出 PURE 框架,该框架采用“先选择后生成”的范式,通过引入偏好感知的证据选择机制,从多跳、以物品为中心的推理路径中筛选出既事实准确又与用户偏好结构对齐的紧凑证据集,并利用结构感知提示(structure-aware prompting)注入 LLM 生成过程,从而保障关系约束不被破坏;同时设计了特征级、用户中心的评估指标,用于量化偏好不一致性,从而实现更可信的推荐解释。
链接: https://arxiv.org/abs/2603.03080
作者: Chengkai Wang,Baisong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:LLM-based explainable recommenders can produce fluent explanations that are factually correct, yet still justify items using attributes that conflict with a user’s historical preferences. Such preference-inconsistent explanations yield logically valid but unconvincing reasoning and are largely missed by standard hallucination or faithfulness metrics. We formalize this failure mode and propose PURE, a preference-aware reasoning framework following a select-then-generate paradigm. Instead of only improving generation, PURE intervenes in evidence selection: it selects a compact set of multi-hop item-centric reasoning paths that are both factually grounded and aligned with user preference structure, guided by user intent, specificity, and diversity to suppress generic, weakly personalized evidence. The selected evidence is then injected into LLM generation via structure-aware prompting that preserves relational constraints. To measure preference inconsistency, we introduce a feature-level, user-centric evaluation metric that reveals misalignment overlooked by factuality-based measures. Experiments on three real-world datasets show that PURE consistently reduces preference-inconsistent explanations and factual hallucinations while maintaining competitive recommendation accuracy, explanation quality, and inference efficiency. These results highlight that trustworthy explanations require not only factual correctness but also justification aligned with user preferences.
[AI-23] RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization KDD2026
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体强化学习(Agentic Reinforcement Learning, Agentic RL)方法中探索能力受限的问题。现有方法主要依赖纯在线策略(on-policy)进行探索,仅能利用智能体自身生成的输出,难以发现新的推理视角以实现进一步优化;尽管部分工作引入了辅助的离线策略信号,但通常使用完整的离线轨迹进行轨迹级策略估计,忽略了智能体滚动(agentic rollout)过程中细粒度的步骤级探索动态。解决方案的关键在于提出一种名为检索增强型策略优化(Retrieval-Augmented Policy Optimization, RAPO)的新框架,其核心创新是将训练过程分解为两个阶段:(i)混合策略智能体滚动(Hybrid-policy Agentic Rollout),通过检索外部离线策略的步骤级轨迹扩展智能体的推理感知范围;(ii)检索感知策略优化(Retrieval-aware Policy Optimization),通过检索奖励与重要性塑造校准策略梯度估计,稳定训练并优先选择具有启发性的检索探索路径。实验证明,RAPO在三个任务类别共14个数据集上平均提升5.0%,且训练效率提高1.2倍。
链接: https://arxiv.org/abs/2603.03078
作者: Siwei Zhang,Yun Xiong,Xi Chen,Zi’an Jia,Renhong Huang,Jiarong Xu,Jiawei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submit to KDD 2026
点击查看摘要
Abstract:Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration, restricting exploration to the agent’s self-generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically utilize full off-policy trajectories for trajectory-level policy estimation, overlooking the necessity for the fine-grained, step-level exploratory dynamics within agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off-policy step-level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.
[AI-24] Reinforcement Learning with Symbolic Reward Machines
【速读】:该论文旨在解决奖励机器(Reward Machines, RMs)在强化学习(Reinforcement Learning, RL)中应用时面临的两个关键问题:一是RMs依赖人工设计的标签函数来生成环境标签,导致其在不同环境和任务中缺乏通用性;二是RMs需要额外的标签输入,不兼容标准RL框架的环境接口。解决方案的关键在于提出符号奖励机器(Symbolic Reward Machines, SRMs),它直接使用环境的标准观测输出,通过由符号公式表示的守卫条件(guards)对观测进行解析,从而自动推导出奖励信号。SRMs无需人工定义标签函数,同时保持了与主流RL框架的兼容性,并提供可解释的任务表示,有效提升了方法的实用性与泛化能力。
链接: https://arxiv.org/abs/2603.03068
作者: Thomas Krug,Daniel Neider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reward Machines (RMs) are an established mechanism in Reinforcement Learning (RL) to represent and learn sparse, temporally extended tasks with non-Markovian rewards. RMs rely on high-level information in the form of labels that are emitted by the environment alongside the observation. However, this concept requires manual user input for each environment and task. The user has to create a suitable labeling function that computes the labels. These limitations lead to poor applicability in widely adopted RL frameworks. We propose Symbolic Reward Machines (SRMs) together with the learning algorithms QSRM and LSRM to overcome the limitations of RMs. SRMs consume only the standard output of the environment and process the observation directly through guards that are represented by symbolic formulas. In our evaluation, our SRM methods outperform the baseline RL approaches and generate the same results as the existing RM methods. At the same time, our methods adhere to the widely used environment definition and provide interpretable representations of the task to the user.
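SRM 的核心是守卫(guard)直接作用于环境观测,而无需人工定义的标签函数。下面用一个两状态的玩具任务示意(任务、阈值与接口均为假设,并非论文实现):

```python
# 玩具 SRM:守卫为作用于观测字典的谓词,依次完成“到达区域 → 拾取物体”
class SymbolicRewardMachine:
    def __init__(self):
        # 每条转移:(当前状态, 守卫谓词, 下一状态, 奖励)
        self.transitions = [
            ("u0", lambda obs: obs["x"] > 5.0, "u1", 0.0),
            ("u1", lambda obs: obs["carrying"], "u2", 1.0),
        ]
        self.state = "u0"

    def step(self, obs):
        for s, guard, s_next, r in self.transitions:
            if s == self.state and guard(obs):
                self.state = s_next
                return r
        return 0.0

rm = SymbolicRewardMachine()
print(rm.step({"x": 2.0, "carrying": False}))  # 0.0:守卫未触发,仍在 u0
print(rm.step({"x": 6.0, "carrying": False}))  # 0.0:进入 u1
print(rm.step({"x": 6.0, "carrying": True}))   # 1.0:到达终态 u2
```

这样的机器只消费环境的标准观测输出,因而与常见 RL 框架的环境接口兼容,这正是 SRM 相对传统 RM 的卖点。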
[AI-25] cPNN: Continuous Progressive Neural Networks for Evolving Streaming Time Series
【速读】:该论文旨在解决数据流中同时存在的概念漂移(concept drift)和时间依赖性(temporal dependencies)问题,以及在学习多个概念时如何避免灾难性遗忘(catastrophic forgetting)。现有方法通常将上述问题分别处理,缺乏统一的解决方案。论文提出连续渐进神经网络(Continuous Progressive Neural Networks, cPNN),其关键在于将渐进神经网络(Progressive Neural Networks)扩展为适用于连续数据流的版本,结合循环神经网络(Recurrent Neural Networks)结构与基于随机梯度下降(Stochastic Gradient Descent)的学习机制,从而实现对新概念的快速适应、对时间依赖性的建模及对旧知识的有效保留,有效缓解了灾难性遗忘问题。
链接: https://arxiv.org/abs/2603.03040
作者: Federico Giannini,Giacomo Ziffer,Emanuele Della Valle
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Dealing with an unbounded data stream involves overcoming the assumption that data is identically distributed and independent. A data stream can, in fact, exhibit temporal dependencies (i.e., be a time series), and data can change distribution over time (concept drift). The two problems are deeply discussed, and existing solutions address them separately: a joint solution is absent. In addition, learning multiple concepts implies remembering the past (a.k.a. avoiding catastrophic forgetting in Neural Networks’ terminology). This work proposes Continuous Progressive Neural Networks (cPNN), a solution that tames concept drifts, handles temporal dependencies, and bypasses catastrophic forgetting. cPNN is a continuous version of Progressive Neural Networks, a methodology for remembering old concepts and transferring past knowledge to fit the new concepts quickly. We base our method on Recurrent Neural Networks and exploit the Stochastic Gradient Descent applied to data streams with temporal dependencies. Results of an ablation study show a quick adaptation of cPNN to new concepts and robustness to drifts.
[AI-26] MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN
【速读】:该论文旨在解决视觉-语言导航(Vision-Language Navigation, VLN)任务中因单一智能体认知负荷过重而导致的感知失真与决策漂移问题,尤其在复杂、长距离场景下表现尤为显著。解决方案的关键在于提出一种多智能体协同导航框架(Multi-Agent Collaborative Navigation, MA-CoNav),其核心是采用“主从式”分层协作架构,将导航任务所需的感知、规划、执行和记忆功能解耦并分配给专业化子智能体:主智能体负责全局调度,从属智能体组包括观测代理(生成环境描述)、规划代理(进行任务分解与动态验证)、执行代理(同步建图与动作执行)和记忆代理(管理结构化经验)。此外,引入“局部-全局”双阶段反思机制以动态优化整个导航流程,从而显著提升系统在真实室内环境中无需场景特定微调下的鲁棒性与性能。
链接: https://arxiv.org/abs/2603.03024
作者: Ling Luo,Qianqian Bai
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision-Language Navigation (VLN) aims to empower robots with the ability to perform long-horizon navigation in unfamiliar environments based on complex linguistic instructions. Its success critically hinges on establishing an efficient "language-understanding -- visual-perception -- embodied-execution" closed loop. Existing methods often suffer from perceptual distortion and decision drift in complex, long-distance tasks due to the cognitive overload of a single agent. Inspired by distributed cognition theory, this paper proposes MA-CoNav, a Multi-Agent Collaborative Navigation framework. This framework adopts a "Master-Slave" hierarchical agent collaboration architecture, decoupling and distributing the perception, planning, execution, and memory functions required for navigation tasks to specialized agents. Specifically, the Master Agent is responsible for global orchestration, while the Subordinate Agent group collaborates through a clear division of labor: an Observation Agent generates environment descriptions, a Planning Agent performs task decomposition and dynamic verification, an Execution Agent handles simultaneous mapping and action, and a Memory Agent manages structured experiences. Furthermore, the framework introduces a "Local-Global" dual-stage reflection mechanism to dynamically optimize the entire navigation pipeline. Empirical experiments were conducted using a real-world indoor dataset collected by a Limo Pro robot, with no scene-specific fine-tuning performed on the models throughout the process. The results demonstrate that MA-CoNav comprehensively outperforms existing mainstream VLN methods across multiple metrics.
[AI-27] REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry
【速读】:该论文旨在解决企业在将生成式 AI(Generative AI)代理系统与私有遥测数据(telemetry)进行确定性对接时面临的三大实践挑战:模型上下文有限、本地语义概念定义不一致以及指标接口持续演化。其解决方案的关键在于提出 REGAL 架构——一种基于注册表驱动的确定性接地机制,通过将遥测计算视为一等公民,并让大语言模型(LLM)在受控版本化的动作空间中操作而非原始事件流,从而实现语义对齐与治理嵌入。该架构由两部分组成:(1) Medallion ELT 流水线生成可重放、语义压缩的 Gold 数据资产;(2) 注册表驱动的编译层从声明式指标定义自动合成 Model Context Protocol (MCP) 工具,确保工具规范与执行的一致性,防止工具漂移,并在语义边界内嵌入治理策略。
链接: https://arxiv.org/abs/2603.03018
作者: Yuvraj Agrawal
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:Enterprise engineering organizations produce high-volume, heterogeneous telemetry from version control systems, CI/CD pipelines, issue trackers, and observability platforms. Large Language Models (LLMs) enable new forms of agentic automation, but grounding such agents on private telemetry raises three practical challenges: limited model context, locally defined semantic concepts, and evolving metric interfaces. We present REGAL, a registry-driven architecture for deterministic grounding of agentic AI systems in enterprise telemetry. REGAL adopts an explicitly architectural approach: deterministic telemetry computation is treated as a first-class primitive, and LLMs operate over a bounded, version-controlled action space rather than raw event streams. The architecture combines (1) a Medallion ELT pipeline that produces replayable, semantically compressed Gold artifacts, and (2) a registry-driven compilation layer that synthesizes Model Context Protocol (MCP) tools from declarative metric definitions. The registry functions as an “interface-as-code” layer, ensuring alignment between tool specification and execution, mitigating tool drift, and embedding governance policies directly at the semantic boundary. A prototype implementation and case study validate the feasibility of deterministic grounding and illustrate its implications for latency, token efficiency, and operational governance. This work systematizes an architectural pattern for enterprise LLM grounding; it does not propose new learning algorithms, but rather elevates deterministic computation and semantic compilation to first-class design primitives for agentic systems. 
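The registry-as-"interface-as-code" idea can be illustrated with a toy compiler that turns declarative metric definitions into deterministic callables, bounding the agent's action space. All names and fields below are invented for illustration; REGAL's actual MCP tool synthesis is richer than this sketch:

```python
# Hypothetical sketch of registry-driven tool compilation in the
# spirit of REGAL: each versioned metric definition is compiled into
# one deterministic tool over pre-computed Gold artifacts, so the LLM
# never touches raw event streams.

METRIC_REGISTRY = {
    "deploy_frequency": {
        "version": "1.0",
        "source": "deployments",  # which Gold artifact to read
        "aggregate": lambda rows: len(rows),
    },
    "mean_build_minutes": {
        "version": "1.0",
        "source": "builds",
        "aggregate": lambda rows: sum(r["minutes"] for r in rows) / len(rows),
    },
}

def compile_tools(registry, gold_tables):
    """Synthesize one deterministic tool per registry entry. The
    bounded, version-controlled registry - not the telemetry itself -
    defines the agent's action space."""
    tools = {}
    for name, spec in registry.items():
        def tool(spec=spec):  # bind spec per iteration
            rows = gold_tables[spec["source"]]
            return {"version": spec["version"], "value": spec["aggregate"](rows)}
        tools[name] = tool
    return tools
```

Because tools are generated from the same declarative definitions that govern computation, the specification and the execution cannot drift apart.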
[AI-28] OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents
【速读】:该论文旨在解决现有多智能体大语言模型框架在科学与知识密集型领域中存在的局限性,包括静态提示(prompt)和固定代理角色导致的领域适应能力差、僵化的任务流程限制了推理灵活性、对同质化模型的依赖造成高延迟,以及在中间推理偏离时难以修正早期决策,从而降低结构化和计算密集场景下的可靠性。其解决方案的关键在于提出一种面向科学领域的交互式两级多模型编排框架:第一级为专用编排模型(orchestration model),根据任务动态构建领域感知的推理流水线并实例化具有定制提示的专家代理;第二级为执行模型(execution model),在生成的角色与指令规范下执行每一步操作;编排模型通过迭代接收中间反馈更新流水线,实现动态重规划、角色再分配和提示优化,从而增强科学推理的鲁棒性和专业化水平,并支持异构大语言模型(LLM)的灵活集成,以在实际科学部署中实现性能与效率的权衡。
链接: https://arxiv.org/abs/2603.03005
作者: Yichao Feng,Haoran Luo,Zhenghong Lin,Yiqun Sun,Pengfei Wei,Lawrence B. Hsieh,Anh Tuan Luu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multi-agent large language model frameworks are promising for complex multi-step reasoning, yet existing systems remain weak for scientific and knowledge-intensive domains due to static prompts and agent roles, rigid workflows, and homogeneous model reliance, leading to poor domain adaptation, limited reasoning flexibility, and high latency on heterogeneous or long-horizon scientific tasks. They also struggle to revise earlier decisions when intermediate reasoning diverges, reducing reliability in structured and calculation-heavy settings. To address these limitations, we propose a scientific-domain-oriented, interactive, two-tier multi-model orchestration framework. A dedicated orchestration model analyzes each task, dynamically constructs a domain-aware reasoning pipeline, and instantiates specialized expert agents with tailored prompts, while an execution model performs each step under generated role and instruction specifications. The orchestrator iteratively updates the pipeline based on intermediate feedback, enabling dynamic replanning, role reallocation, and prompt refinement across multi-turn interactions, strengthening robustness and specialization for scientific reasoning through structured heterogeneous model collaboration. The framework is model-agnostic and supports heterogeneous LLM integration with different capacities or costs, enabling flexible performance-efficiency trade-offs in practical scientific deployments. Experiments show consistent improvements over existing multi-agent systems and strong baselines across diverse reasoning and scientific-style benchmarks.
[AI-29] SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models
【速读】:该论文试图解决当前大语言模型在空间推理任务中难以区分真实空间认知与统计语言启发式(statistical language heuristics)的问题,以及多模态评估常将空间推理与视觉感知混淆的局限性。其解决方案的关键在于提出一个理论驱动的诊断框架 SpatialText,该框架通过双源方法分离文本空间推理:一方面整合人类标注的真实3D室内环境描述以保留自然歧义、视角转换和功能关系;另一方面使用代码生成的逻辑精确场景来探测形式化空间推理与知识边界。这一设计使得模型能够被系统性地评估是否构建灵活的空间心理模型(mental models),而非仅依赖表面语言关联。
链接: https://arxiv.org/abs/2603.03002
作者: Peiyao Jiang,Zequn Qin,Xi Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Genuine spatial reasoning relies on the capacity to construct and manipulate coherent internal spatial representations, often conceptualized as mental models, rather than merely processing surface linguistic associations. While large language models exhibit advanced capabilities across various domains, existing benchmarks fail to isolate this intrinsic spatial cognition from statistical language heuristics. Furthermore, multimodal evaluations frequently conflate genuine spatial reasoning with visual perception. To systematically investigate whether models construct flexible spatial mental models, we introduce SpatialText, a theory-driven diagnostic framework. Rather than functioning simply as a dataset, SpatialText isolates text-based spatial reasoning through a dual-source methodology. It integrates human-annotated descriptions of real 3D indoor environments, which capture natural ambiguities, perspective shifts, and functional relations, with code-generated, logically precise scenes designed to probe formal spatial deduction and epistemic boundaries. Systematic evaluation across state-of-the-art models reveals fundamental representational limitations. Although models demonstrate proficiency in retrieving explicit spatial facts and operating within global, allocentric coordinate systems, they exhibit critical failures in egocentric perspective transformation and local reference frame reasoning. These systematic errors provide strong evidence that current models rely heavily on linguistic co-occurrence heuristics rather than constructing coherent, verifiable internal spatial representations. SpatialText thus serves as a rigorous instrument for diagnosing the cognitive boundaries of artificial spatial intelligence.
[AI-30] Why Does RLAIF Work At All?
【速读】:该论文试图解决的问题是:为何基于AI反馈的强化学习(Reinforcement Learning from AI Feedback, RLAIF)能够使语言模型通过自身生成的偏好判断实现价值对齐的自我改进,这一现象缺乏理论解释。解决方案的关键在于提出“潜在价值假设”(latent value hypothesis),即互联网规模预训练数据将人类价值观编码为表示空间中的方向,而宪法提示(constitutional prompts)可激发这些潜在价值形成偏好判断;在此基础上,作者在线性模型框架下形式化该假设,将宪法视为投影算子以选择与价值相关方向,从而解释了RLAIF为何有效、其性能上限由模型容量决定,并揭示了对抗性宪法可能激活有害价值方向的存在。
链接: https://arxiv.org/abs/2603.03000
作者: Robin Young
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis, that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results. RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model’s default generation direction thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
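The paper's linear model, where the constitution acts as a projection operator selecting value-relevant directions, lends itself to a tiny numerical illustration. The vectors and the masked projection below are our own toy construction, not the paper's formalism:

```python
import math

# Toy illustration of the latent value hypothesis: values are
# directions in representation space, and a constitutional prompt acts
# as a projection that keeps only value-relevant coordinates.
# All vectors here are invented for illustration.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def project(u, mask):
    # The constitution zeroes out value-irrelevant coordinates.
    return [a if m else 0.0 for a, m in zip(u, mask)]

true_value = [1.0, 1.0, 0.0, 0.0]    # ground-truth value direction
generation = [0.6, 0.2, 0.7, 0.9]    # model's default generation direction
constitution = [1, 1, 0, 0]          # selects the first two coordinates

judged = project(generation, constitution)
```

In this toy setting `judged` aligns better with `true_value` than `generation` does, mirroring the paper's condition for when RLAIF improves alignment (the generation-judgment gap).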
[AI-31] Delegation and Verification Under AI
【速读】:该论文试图解决的问题是:当生成式 AI (Generative AI) 系统被引入机构工作流程时,如何理解工作者在任务委托(delegation)与输出验证(verification)之间的决策行为,以及这种行为如何影响机构视角下的工人质量评估。解决方案的关键在于构建一个理性工人的优化模型,将委托和验证视为其在私人成本与机构结果导向评价之间权衡的最优策略,并定义“机构中心效用”作为衡量工人质量的新标准。研究发现,AI 引入会引发“相变”现象——即微小的验证能力差异可导致显著不同的行为模式,从而放大高验证可靠性工人的表现,同时因理性过度委托和监督不足而降低其他工人的机构质量评分,即使基础任务成功率提升且无行为偏差存在。这一机制揭示了 AI 如何结构性地重塑机构内工人质量分布并加剧个体间的能力差距。
链接: https://arxiv.org/abs/2603.02961
作者: Lingxiao Huang,Wenyang Xiao,Nisheeth K. Vishnoi
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Theoretical Economics (econ.TH)
备注:
点击查看摘要
Abstract:As AI systems enter institutional workflows, workers must decide whether to delegate task execution to AI and how much effort to invest in verifying AI outputs, while institutions evaluate workers using outcome-based standards that may misalign with workers’ private costs. We model delegation and verification as the solution to a rational worker’s optimization problem, and define worker quality by evaluating an institution-centered utility (distinct from the worker’s objective) at the resulting optimal action. We formally characterize optimal worker workflows and show that AI induces phase transitions, where arbitrarily small differences in verification ability lead to sharply different behaviors. As a result, AI can amplify workers with strong verification reliability while degrading institutional worker quality for others who rationally over-delegate and reduce oversight, even when baseline task success improves and no behavioral biases are present. These results identify a structural mechanism by which AI reshapes institutional worker quality and amplifies quality disparities between workers with different verification reliability.
[AI-32] Architecting Trust in Artificial Epistemic Agents
【速读】:该论文旨在解决生成式 AI(Generative AI)作为认知代理(epistemic agents)在知识创造、筛选与整合过程中可能引发的可靠性与校准问题,尤其是在复杂多代理交互背景下,如何确保其行为符合个体与集体的认知规范,避免认知能力退化和认知偏差扩散。解决方案的关键在于构建一个以可信性为核心框架的治理路径:首先,要求 AI 代理具备认知能力(epistemic competence)、可证伪性(robust falsifiability)和认知美德行为(epistemically virtuous behaviors);其次,通过技术溯源系统(technical provenance systems)和“知识圣殿”(knowledge sanctuaries)等机制保障人类认知韧性;最终实现 AI 与人类认知目标对齐,并强化社会认知基础设施,从而推动形成可靠、包容且可持续的人机知识生态系统。
链接: https://arxiv.org/abs/2603.02960
作者: Nahema Marchal,Stephanie Chan,Matija Franklin,Manon Revel,Geoff Keeling,Roberta Fischli,Bilva Chandra,Iason Gabriel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models increasingly function as epistemic agents – entities that can 1) autonomously pursue epistemic goals and 2) actively shape our shared knowledge environment. They curate the information we receive, often supplanting traditional search-based methods, and are frequently used to generate both personal and deeply specialized advice. How they perform these functions, including whether they are reliable and properly calibrated to both individual and collective epistemic norms, is therefore highly consequential for the choices we make. We argue that the potential impact of epistemic AI agents on practices of knowledge creation, curation and synthesis, particularly in the context of complex multi-agent interactions, creates new informational interdependencies that necessitate a fundamental shift in evaluation and governance of AI. While a well-calibrated ecosystem could augment human judgment and collective decision-making, poorly aligned agents risk causing cognitive deskilling and epistemic drift, making the calibration of these models to human norms a high-stakes necessity. To ensure a beneficial human-AI knowledge ecosystem, we propose a framework centered on building and cultivating the trustworthiness of epistemic AI agents; aligning these AI agents with human epistemic goals; and reinforcing the surrounding socio-epistemic infrastructure. In this context, trustworthy AI agents must demonstrate epistemic competence, robust falsifiability, and epistemically virtuous behaviors, supported by technical provenance systems and “knowledge sanctuaries” designed to protect human resilience. This normative roadmap provides a path toward ensuring that future AI systems act as reliable partners in a robust and inclusive knowledge ecosystem.
[AI-33] The Geometry of Learning Under AI Delegation
【速读】:该论文旨在解决人工智能(AI)从辅助工具向协作伙伴演进过程中,人类技能随时间如何变化这一核心问题。其解决方案的关键在于构建一个耦合动力学模型,将人类技能演化与AI任务委托行为统一于单一性能指标(即预期任务误差)的优化框架下:技能通过使用提升、闲置衰退,而委托策略则根据相对表现动态调整。尽管局部更新机制均服务于同一目标,但这种自适应交互会显著改变技能获取的全局稳定性结构——系统除存在纯人类学习下的高技能平衡点外,还出现一个稳定的低技能平衡点,对应持续依赖AI的状态,并由一条陡峭的吸引盆边界分隔,导致早期决策在动力学上具有不可逆性。研究进一步揭示,AI虽能短期提升性能,却可能因委托与实践间的负反馈机制,在长期反而导致相对于无AI基准的性能下降,且该现象对噪声和信任不对称具有鲁棒性。因此,论文指出,AI对人类长期能力的削弱根源在于系统稳定性被破坏,而非激励错配或意图不一致。
链接: https://arxiv.org/abs/2603.02950
作者: Lingxiao Huang,Nisheeth K. Vishnoi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
点击查看摘要
Abstract:As AI systems shift from tools to collaborators, a central question is how the skills of humans relying on them change over time. We study this question mathematically by modeling the joint evolution of human skill and AI delegation as a coupled dynamical system. In our model, delegation adapts to relative performance, while skill improves through use and decays under non-use; crucially, both updates arise from optimizing a single performance metric measuring expected task error. Despite this local alignment, adaptive AI use fundamentally alters the global stability structure of human skill acquisition. Beyond the high-skill equilibrium of human-only learning, the system admits a stable low-skill equilibrium corresponding to persistent reliance, separated by a sharp basin boundary that makes early decisions effectively irreversible under the induced dynamics. We further show that AI assistance can strictly improve short-run performance while inducing persistent long-run performance loss relative to the no-AI baseline, driven by a negative feedback between delegation and practice. We characterize how AI quality deforms the basin boundary and show that these effects are robust to noise and asymmetric trust updates. Our results identify stability, not incentives or misalignment, as the central mechanism by which AI assistance can undermine long-run human performance and skill.
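The bistability claim (a stable high-skill equilibrium and a stable low-skill "persistent reliance" equilibrium separated by a sharp basin boundary) can be reproduced with a toy coupled update. The functional forms below are entirely our own assumptions, chosen only to exhibit the qualitative behavior the abstract describes:

```python
import math

# Toy simulation of coupled skill/delegation dynamics (invented
# functional forms, not the paper's model): delegation adapts to the
# AI's relative performance advantage, skill grows with practice and
# decays under non-use.

def simulate(s0, ai_quality=0.6, steps=2000, lr=0.05):
    s = s0
    for _ in range(steps):
        # Delegation share: logistic in the AI's performance advantage.
        d = 1.0 / (1.0 + math.exp(-10 * (ai_quality - s)))
        # Skill improves through practice (1 - d), decays under use of AI (d).
        s += lr * ((1 - d) * (1 - s) - d * s)
        s = min(max(s, 0.0), 1.0)
    return s

high = simulate(s0=0.8)  # starts above the basin boundary
low = simulate(s0=0.4)   # starts below it
```

Two initial skill levels on opposite sides of the boundary flow to very different long-run equilibria under the identical update rule, which is the sense in which early decisions become effectively irreversible.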
[AI-34] SEALing the Gap: A Reference Framework for LLM Inference Carbon Estimation via Multi-Benchmark Driven Embodiment ICSE’26
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段(inference)产生的碳排放难以精确量化的问题,从而支持可持续性导向的决策。随着LLM训练阶段的碳排放被广泛关注,研究发现推理阶段因处理大量提示(prompt)而迅速超越训练阶段的碳足迹,因此亟需在提示级别实现准确的碳排放测量。论文提出了一套指导原则,用于构建一个新颖的参考框架,以系统化地推进LLM推理碳估算的研究与工具开发;其关键解决方案是引入SEAL——一种基于多基准驱动的提示级碳估算工具,通过集成多种基准测试方法实现对每个提示的精细化碳排放评估,为整个LLM生态系统的标准化可持续性评估奠定基础。
链接: https://arxiv.org/abs/2603.02949
作者: Priyavanshi Pathania,Rohit Mehra,Vibhu Saujanya Sharma,Vikrant Kaulgud,Tiffani Nevels,Sanjay Podder,Adam P. Burden
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 5 pages. To be published in the proceedings of 48th International Conference on Software Engineering (ICSE '26), April 12-18, 2026, Rio de Janeiro, Brazil (New Ideas and Emerging Results Track)
点击查看摘要
Abstract:Large Language Models are rapidly gaining traction in software engineering, yet their growing carbon footprint raises pressing sustainability concerns. While training emissions are substantial, inference quickly surpasses them due to the sheer volume of prompts processed. This shift underscores the urgent need for accurate, prompt-level carbon measurement during inference to enable informed, sustainability-focused decision-making. To address the limitations of existing approaches, in this paper, we outline the guiding principles for a novel reference framework for LLM inference carbon estimation that can guide the design of future tools and provide a systematic foundation for advancing sustainability research in this domain. We also introduce SEAL, an early embodiment of these principles that leverages a multi-benchmark-driven approach for per-prompt carbon estimation. Its initial validation shows promising results, positioning SEAL as a foundation for standardized sustainability assessment across the LLM ecosystem.
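Per-prompt inference carbon is typically estimated from token count, energy per token, data-center overhead, and grid carbon intensity. The sketch below shows that arithmetic only; every constant is an illustrative assumption, not one of SEAL's calibrated multi-benchmark values:

```python
# Back-of-the-envelope, per-prompt carbon estimate in the spirit of
# SEAL's prompt-level accounting. All default constants are invented
# placeholders for illustration.

def prompt_carbon_grams(input_tokens, output_tokens,
                        joules_per_token=0.3,    # assumed energy per token
                        pue=1.2,                 # data-center overhead (PUE)
                        grid_gco2_per_kwh=400):  # assumed grid intensity
    total_tokens = input_tokens + output_tokens
    # Joules -> kWh (1 kWh = 3.6e6 J), scaled by facility overhead.
    energy_kwh = total_tokens * joules_per_token * pue / 3.6e6
    return energy_kwh * grid_gco2_per_kwh
```

A benchmark-driven tool like SEAL would replace the fixed `joules_per_token` with measurements conditioned on model, hardware, and prompt characteristics.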
[AI-35] Enhancing Physics-Informed Neural Networks with Domain-aware Fourier Features: Towards Improved Performance and Interpretable Results
【速读】:该论文旨在解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在训练难度大、可解释性差以及需要复杂损失函数平衡策略等问题。其核心解决方案是引入领域感知傅里叶特征(Domain-aware Fourier Features, DaFFs)作为输入空间的位置编码方式,该方法将域特异性信息(如几何形状和边界条件)隐式编码到特征中,从而无需显式添加边界条件损失项和损失平衡机制,显著简化优化过程并降低计算成本。实验表明,PINN-DaFFs相较于传统PINNs和基于随机傅里叶特征(Random Fourier Features, RFFs)的模型,在精度和收敛速度上均实现数量级提升,同时结合基于梯度的归因方法(LRP)可提取更符合物理规律的输入空间特征重要性分布,提升了模型的可解释性。
链接: https://arxiv.org/abs/2603.02948
作者: Alberto Miño Calero,Luis Salamanca,Konstantinos E. Tatsis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
备注:
点击查看摘要
Abstract:Physics-Informed Neural Networks (PINNs) incorporate physics into neural networks by embedding partial differential equations (PDEs) into their loss function. Despite their success in learning the underlying physics, PINN models remain difficult to train and interpret. In this work, a novel modeling approach is proposed, which relies on the use of Domain-aware Fourier Features (DaFFs) for the positional encoding of the input space. These features encapsulate all the domain-specific characteristics, such as the geometry and boundary conditions, and unlike Random Fourier Features (RFFs), eliminate the need for explicit boundary condition loss terms and loss balancing schemes, while simplifying the optimization process and reducing the computational cost associated with training. We further develop an LRP-based explainability framework tailored to PINNs, enabling the extraction of relevance attribution scores for the input space. It is demonstrated that PINN-DaFFs achieve orders-of-magnitude lower errors and allow faster convergence compared to vanilla PINNs and RFFs-based PINNs. Furthermore, LRP analysis reveals that the proposed approach leads to more physically consistent feature attributions, while PINN-RFFs and vanilla PINNs display more scattered and less physics-relevant patterns. These results demonstrate that DaFFs not only enhance PINNs’ accuracy and efficiency but also improve interpretability, laying the groundwork for more robust and informative physics-informed learning.
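The core contrast with Random Fourier Features can be made concrete in one dimension: on a domain [0, L] with homogeneous Dirichlet boundaries, sine features sin(nπx/L) vanish identically at both endpoints, so any network built on them satisfies the boundary conditions by construction and needs no boundary loss term. This is our reading of the domain-aware idea, not the paper's code:

```python
import math

# Minimal 1-D sketch contrasting domain-aware Fourier features with
# generic random Fourier features (our illustration of the concept).

def domain_aware_features(x, L=1.0, n_modes=4):
    # sin(n*pi*x/L) is zero at x = 0 and x = L for every mode n,
    # encoding homogeneous Dirichlet boundary conditions implicitly.
    return [math.sin(n * math.pi * x / L) for n in range(1, n_modes + 1)]

def random_fourier_features(x, freqs):
    # Generic RFFs carry no boundary information, so a PINN built on
    # them needs an explicit boundary-condition loss term.
    return [math.sin(f * x) for f in freqs] + [math.cos(f * x) for f in freqs]
```

Higher-dimensional domains and mixed boundary conditions require correspondingly richer feature families, which is where the domain-specific design effort goes.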
[AI-36] ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization PAKDD2026
【速读】:该论文旨在解决船舶轨迹预测问题,即如何利用大语言模型(Large Language Models, LLMs)提升预测精度与推理能力。其解决方案的关键在于提出ShipTraj-R1框架,该框架将船舶轨迹预测重构为文本到文本生成任务,并通过三个核心机制实现优化:(1) 设计包含冲突船舶轨迹信息的动态提示(prompt),引导模型进行自适应链式思维(chain-of-thought, CoT)推理;(2) 引入基于规则的奖励机制,同时激励推理过程合理性与预测准确性;(3) 采用领域特定提示和奖励驱动的群体相对策略优化(Group Relative Policy Optimization, GRPO)对Qwen3模型进行强化微调,从而在两个复杂真实海事数据集上显著优于现有深度学习与LLM基线方法。
链接: https://arxiv.org/abs/2603.02939
作者: Yang Zhan,Yunhao Li,Zhang Chao,Yuxu Lu,Yan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by the 30th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2026)
点击查看摘要
Abstract:Recent advancements in reinforcement fine-tuning have significantly improved the reasoning ability of large language models (LLMs). In particular, methods such as group relative policy optimization (GRPO) have demonstrated strong capabilities across various fields. However, applying LLMs to ship trajectory prediction remains largely unexplored. In this paper, we propose ShipTraj-R1, a novel LLM-based framework that reformulates ship trajectory prediction as a text-to-text generation problem. (1) We design a dynamic prompt containing trajectory information about conflicting ships to guide the model to achieve adaptive chain-of-thought (CoT) reasoning. (2) We introduce a comprehensive rule-based reward mechanism to incentivize the reasoning format and prediction accuracy of the model. (3) Our ShipTraj-R1 is reinforced through the GRPO mechanism guided by domain-specific prompts and rewards, and utilizes Qwen3 as the model backbone. Extensive experimental results on two complex and real-world maritime datasets show that the proposed ShipTraj-R1 achieves the lowest error compared with state-of-the-art deep learning and LLM-based baselines.
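A rule-based reward of the kind the abstract mentions usually combines a format check on the model's structured output with a bounded accuracy term on the predicted coordinates. The rules and weights below are hypothetical; ShipTraj-R1's actual reward design may differ:

```python
import re

# Hypothetical sketch of a GRPO-style rule-based reward for trajectory
# prediction: one term checks a <think>/<answer> output format, the
# other maps prediction error into a bounded (0, 1] accuracy score.

def format_reward(text):
    ok = re.fullmatch(r"<think>.*</think>\s*<answer>.*</answer>", text, re.S)
    return 1.0 if ok else 0.0

def accuracy_reward(pred, target, scale=1.0):
    # Mean absolute error over predicted coordinates, mapped to (0, 1].
    err = sum(abs(p - t) for p, t in zip(pred, target)) / len(target)
    return 1.0 / (1.0 + err / scale)

def total_reward(text, pred, target, w_fmt=0.2, w_acc=0.8):
    return w_fmt * format_reward(text) + w_acc * accuracy_reward(pred, target)
```

In GRPO the reward is then compared across a group of sampled completions for the same prompt, so only the relative ordering of these scores within a group matters.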
[AI-37] Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models
【速读】:该论文旨在解决零样本(zero-shot)图推理任务中因数据稀缺导致的传统图神经网络(Graph Neural Networks, GNNs)泛化能力不足,以及现有基于大语言模型(Large Language Models, LLMs)的图推理方法因跨模态对齐问题和固定子图提取策略引入结构噪声而影响预测精度的问题。其核心解决方案是提出GraphSSR框架,关键在于设计了一个“采样-选择-推理”(Sample-Select-Reason, SSR)动态子图提取与去噪管道,通过任务感知的自适应机制过滤无关邻居和边,从而优化LLMs的接收场;并进一步开发了SSR-SFT监督微调策略与SSR-RL两阶段强化学习框架,分别用于生成高质量训练轨迹和显式调控采样与选择操作,以实现基于精简、去噪子图的准确推理。
链接: https://arxiv.org/abs/2603.02938
作者: Fengzhi Li,Liang Zhang,Yuan Zuo,Ruiqing Zhao,YanSong Liu,Yunfei Ma,Fanyu Meng,Junlan Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise–irrelevant neighbors and edges–that distorts the LLMs’ receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a “Sample-Select-Reason” process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.
[AI-38] On the Structural Limitations of Weight-Based Neural Adaptation and the Role of Reversible Behavioral Learning
【速读】:该论文试图解决共享参数模型在适应过程中产生的结构不可逆性问题,即通过微调、对齐训练和强化学习等方法对参数进行直接修改后,会导致模型基础行为发生长期改变,且无法在重置后恢复原始行为。解决方案的关键在于提出“可逆行为学习”(reversible behavioral learning)机制,该机制将模型行为与身份参数(identity parameters)解耦,使得模型行为可以通过显式的卸载过程被确定性地还原;同时引入可恢复因子(Recoverability Factor)作为行为可恢复性的归一化度量,并提供基于模型偏离度的诊断工具,实验证明该方案可在数值精度范围内实现模型回滚,而传统共享参数突变则表现出持续的重置后偏差。
链接: https://arxiv.org/abs/2603.02934
作者: Pardhu Sri Rushi Varma Konduru
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures. Preprint version
点击查看摘要
Abstract:Neural models are usually adapted through changes in parameters shared among model components via fine-tuning, alignment-based training, and reinforcement learning. These changes have been found effective in short-term optimization. However, they result in long-term alterations in the model’s base behavior. In this study, we introduce the concept of structural irreversibility as a characteristic of shared-parameter model adaptation. This concept refers to the intertwining of task-specific objectives with the representational identity of the model. We show that when parameters are directly mutated, the resulting model behaves divergently from the original model. This divergence cannot be reversed deterministically without an explicit parameter snapshot. We introduce reversible behavioral learning, in which model behaviors are structurally dissociated from identity parameters and can be deterministically unloaded through an explicit unload process. We also introduce the Recoverability Factor as a normalized measure of behavioral recoverability and provide additional diagnostics based on model divergence. Experiments show that reversible model adaptation achieves rollback within numerical precision, whereas shared-parameter mutation exhibits persistent post-reset divergence.
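The contrast between shared-parameter mutation and reversible behavioral learning reduces to whether the adaptation is kept as a detachable delta. The sketch below is our own toy illustration; the name Recoverability Factor follows the paper's terminology, but its formula here is an assumption:

```python
# Toy contrast between shared-parameter mutation (irreversible without
# a snapshot) and reversible behavioral learning (behavior kept as a
# detachable delta). Formulas are illustrative assumptions.

def recoverability_factor(original, restored):
    # 1.0 means exact rollback; lower values mean residual divergence.
    div = max(abs(a - b) for a, b in zip(original, restored))
    return 1.0 / (1.0 + div)

base = [0.1, -0.4, 0.7]  # identity parameters

# Shared-parameter adaptation mutates the identity parameters in
# place; without an explicit snapshot there is no deterministic unload.
mutated = [w + 0.05 for w in base]

# Reversible adaptation keeps the learned behavior as a separate delta
# that an explicit unload process subtracts back out.
delta = [0.05, 0.05, 0.05]
loaded = [w + d for w, d in zip(base, delta)]
unloaded = [w - d for w, d in zip(loaded, delta)]
```

Rolling back the delta restores the base behavior to within numerical precision, whereas the mutated parameters retain a persistent post-reset divergence.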
[AI-39] Eliciting Numerical Predictive Distributions of LLMs Without Autoregression ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在回归任务中因自回归解码机制导致的计算效率低下问题,尤其是在需要获取数值目标的预测分布时,传统方法依赖重复采样以估计统计量(如均值、中位数或分位数),造成高计算开销和推理延迟。其解决方案的关键在于:不依赖显式的自回归生成过程,而是通过训练轻量级探测器(regression probes)直接从LLM内部表征中预测数值输出分布的统计函数(statistical functionals),从而高效恢复预测分布的分布特性及数值不确定性信息。这一方法揭示了LLM嵌入中蕴含关于预测分布摘要统计量的丰富信号,为无需采样的不确定性感知数值预测提供了新路径。
链接: https://arxiv.org/abs/2603.02913
作者: Julianna Piskorz,Katarzyna Kobalczyk,Mihaela van der Schaar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: First two authors contributed equally. Published as a conference paper at ICLR2026
点击查看摘要
Abstract:Large Language Models (LLMs) have recently been successfully applied to regression tasks – such as time series forecasting and tabular prediction – by leveraging their in-context learning abilities. However, their autoregressive decoding process may be ill-suited to continuous-valued outputs, where obtaining predictive distributions over numerical targets requires repeated sampling, leading to high computational cost and inference time. In this work, we investigate whether distributional properties of LLM predictions can be recovered without explicit autoregressive generation. To this end, we study a set of regression probes trained to predict statistical functionals (e.g., mean, median, quantiles) of the LLM’s numerical output distribution directly from its internal representations. Our results suggest that LLM embeddings carry informative signals about summary statistics of their predictive distributions, including the numerical uncertainty. This investigation opens up new questions about how LLMs internally encode uncertainty in numerical tasks, and about the feasibility of lightweight alternatives to sampling-based approaches for uncertainty-aware numerical predictions.
[AI-40] SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLM s without Training
【速读】:该论文旨在解决预训练大语言模型(Large Language Models, LLMs)在后训练过程(如监督微调)中因模型分布偏移导致的跨域迁移能力难以预测的问题。现有方法缺乏对模型内部表征变化与下游任务性能之间关系的可解释性分析,使得后训练策略优化缺乏依据。解决方案的关键在于提出基于稀疏自编码器(Sparse Autoencoders, SAEs)的迁移能力评分(Transferability Score, STS),通过识别SAE表示中发生偏移的特征维度,并计算其与下游任务域的相关性,从而在微调前可靠地估计模型迁移性能。实验表明,STS与实际性能变化具有高度一致性(Pearson相关系数 > 0.7),且初步拓展至强化学习场景,为可解释的后训练策略设计提供了新工具。
链接: https://arxiv.org/abs/2603.02908
作者: Qi Zhang,Yifei Wang,Xiaohan Wang,Jiajun Chai,Guojun Yin,Wei Lin,Yisen Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability \textitbefore fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at this https URL.
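The STS recipe the abstract outlines (find the SAE dimensions that shift most under fine-tuning, then correlate the shift profile with the target domain) can be sketched schematically. The threshold, aggregation, and all numbers here are our assumptions, not the paper's implementation:

```python
import math

# Schematic version of an SAE-based transferability score: select the
# top-k most-shifted SAE feature dimensions, then compute the Pearson
# correlation between their shift magnitudes and how active those
# features are on the target domain. (Illustrative reconstruction.)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def sts(sae_before, sae_after, domain_activity, top_k=3):
    shifts = [abs(a - b) for a, b in zip(sae_after, sae_before)]
    idx = sorted(range(len(shifts)), key=lambda i: -shifts[i])[:top_k]
    return pearson([shifts[i] for i in idx],
                   [domain_activity[i] for i in idx])
```

A high score means fine-tuning is shifting exactly the features the downstream domain relies on, which is the intuition behind forecasting transferability before training.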
[AI-41] Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures
【Quick Read】: This paper examines the complementary limitations of Transformers and State Space Models (SSMs) on in-context retrieval: Transformers retrieve well but scale quadratically with sequence length, while SSMs offer linear-time processing but limited retrieval ability. The key contribution is a study of hybrid architectures combining the two. On synthetic tasks (n-gram retrieval and position retrieval), the authors systematically evaluate data efficiency, length extrapolation, out-of-domain robustness, and learned representations, finding that hybrid models match or exceed Transformers on information-dense in-context retrieval while retaining SSM efficiency. Representation analysis further reveals that SSM-based models spontaneously develop locality-aware embeddings, in which tokens at adjacent positions become neighbors in embedding space; this structured representation explains both their strengths and weaknesses on specific retrieval tasks and provides principled guidance for selecting architectures based on task requirements.
Link: https://arxiv.org/abs/2603.02874
Authors: Georgios Pantazopoulos,Malvina Nikandrou,Ioannis Konstas,Alessandro Suglia
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract: Transformers excel at in-context retrieval but suffer from quadratic complexity with sequence length, while State Space Models (SSMs) offer efficient linear-time processing but have limited retrieval capabilities. We investigate whether hybrid architectures combining Transformers and SSMs can achieve the best of both worlds on two synthetic in-context retrieval tasks. The first task, n-gram retrieval, requires the model to identify and reproduce an n-gram that succeeds the query within the input sequence. The second task, position retrieval, presents the model with a single query token and requires it to perform a two-hop associative lookup: first locating the corresponding element in the sequence, and then outputting its positional index. Under controlled experimental conditions, we assess data efficiency, length generalization, robustness to out-of-domain training examples, and learned representations across Transformers, SSMs, and hybrid architectures. We find that hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense context retrieval. However, Transformers maintain superiority in position retrieval tasks. Through representation analysis, we discover that SSM-based models develop locality-aware embeddings where tokens representing adjacent positions become neighbors in embedding space, forming interpretable structures. This emergent property, absent in Transformers, explains both the strengths and limitations of SSMs and hybrids for different retrieval tasks. Our findings provide principled guidance for architecture selection based on task requirements and reveal fundamental differences in how Transformers, SSMs, and hybrid models learn positional associations.
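The two synthetic tasks have simple ground-truth answers that can be computed directly. A sketch of what the model is asked to produce in each case, assuming (as the abstract implies but does not fully specify) that the n-gram task targets the most recent occurrence of the query:

```python
def ngram_retrieval_target(sequence, query, n):
    """Find the last occurrence of `query` that still has n tokens after it,
    and return the n-gram that immediately succeeds it -- the string the
    model must reproduce in the synthetic n-gram retrieval task."""
    for i in range(len(sequence) - n - 1, -1, -1):
        if sequence[i] == query:
            return sequence[i + 1 : i + 1 + n]
    return None

def position_retrieval_target(sequence, query):
    """Two-hop associative lookup: locate the query token in the sequence,
    then output its positional index."""
    return sequence.index(query)
```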
[AI-42] LLM-based Argument Mining meets Argumentation and Description Logics: a Unified Framework for Reasoning about Debates
【Quick Read】: This paper addresses the lack of explicit, transparent, and verifiable reasoning in Large Language Models (LLMs) over complex texts such as debates, in particular their inability to structurally represent support and attack relations among arguments and how relative argument strengths determine overall acceptability. The key to the solution is a framework that integrates learning-based argument mining, quantitative argumentation semantics, and ontology-based querying: a fuzzy argumentative knowledge base is first extracted from raw debate text, with arguments explicitly modeled as entities linked by support and attack relations and annotated with initial fuzzy strengths reflecting their plausibility in the debate's context; quantitative argumentation semantics then propagate the effects of supports and attacks to compute final argument strengths; finally, the results are embedded in a fuzzy description logic setting, where efficient rewriting techniques enable expressive query answering, yielding a transparent, explainable, and formally grounded method for debate analysis.
Link: https://arxiv.org/abs/2603.02858
Authors: Gianvincenzo Alfano,Sergio Greco,Lucio La Cava,Stefano Francesco Monea,Irina Trubitsyna
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) achieve strong performance in analyzing and generating text, yet they struggle with explicit, transparent, and verifiable reasoning over complex texts such as those containing debates. In particular, they lack structured representations that capture how arguments support or attack each other and how their relative strengths determine overall acceptability. We encompass these limitations by proposing a framework that integrates learning-based argument mining with quantitative reasoning and ontology-based querying. Starting from a raw debate text, the framework extracts a fuzzy argumentative knowledge base, where arguments are explicitly represented as entities, linked by attack and support relations, and annotated with initial fuzzy strengths reflecting plausibility w.r.t. the debate’s context. Quantitative argumentation semantics are then applied to compute final argument strengths by propagating the effects of supports and attacks. These results are then embedded into a fuzzy description logic setting, enabling expressive query answering through efficient rewriting techniques. The proposed approach provides a transparent, explainable, and formally grounded method for analyzing debates, overcoming purely statistical LLM-based analyses.
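To make "propagating supports and attacks to compute final strengths" concrete, here is one well-known quantitative semantics, DF-QuAD-style aggregation. This is an assumption for illustration; the paper does not commit to this particular semantics:

```python
def aggregate(strengths):
    """Probabilistic-sum aggregation of attacker or supporter strengths:
    1 - prod(1 - s_i), computed incrementally."""
    acc = 0.0
    for s in strengths:
        acc = acc + s - acc * s
    return acc

def final_strength(base, attackers=(), supporters=()):
    """DF-QuAD-style influence: supporters push the base score toward 1,
    attackers push it toward 0, proportionally to the aggregated surplus."""
    va, vs = aggregate(attackers), aggregate(supporters)
    if vs >= va:
        return base + (1.0 - base) * (vs - va)
    return base - base * (va - vs)
```

For example, an argument with initial fuzzy strength 0.5, one supporter of strength 0.6, and one attacker of strength 0.4 ends up at 0.6: support slightly outweighs attack.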
[AI-43] CoFL: Continuous Flow Fields for Language-Conditioned Navigation
【Quick Read】: This paper addresses the brittleness of modular components and the cost of action-sequence generation in language-conditioned navigation. Traditional approaches rely on discrete action tokens or sample action chunks via iterative denoising, suffering from weak generalization and low computational efficiency. The key to the solution is CoFL, an end-to-end policy that directly maps bird's-eye view (BEV) observations and language instructions to a continuous velocity flow field, so that an instantaneous velocity can be queried at any 2D projected location and smooth trajectories obtained by numerical integration, keeping execution reactive in closed loop. By avoiding explicit action prediction, the method improves the robustness and efficiency of navigation, and after training on a large-scale synthetic dataset it transfers zero-shot to real-world scenes with reliable closed-loop control.
Link: https://arxiv.org/abs/2603.02854
Authors: Haokun Liu,Zhaoqi Ma,Yicheng Chen,Masaki Kitagawa,Wentao Zhang,Jinjie Li,Moju Zhao
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 20 pages, 11 figures
Click to view abstract
Abstract:Language-conditioned navigation pipelines often rely on brittle modular components or costly action-sequence generation. To address these limitations, we present CoFL, an end-to-end policy that directly maps a bird’s-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. Instead of predicting discrete action tokens or sampling action chunks via iterative denoising, CoFL outputs instantaneous velocities that can be queried at arbitrary 2D projected locations. Trajectories are obtained by numerical integration of the predicted field, producing smooth motion that remains reactive under closed-loop execution. To enable large-scale training, we build a dataset of over 500k BEV image-instruction pairs, each procedurally annotated with a flow field and a trajectory derived from BEV semantic maps built on Matterport3D and ScanNet. By training on a mixed distribution, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes. Finally, we deploy CoFL zero-shot in real-world experiments with overhead BEV observations across multiple layouts, maintaining reliable closed-loop control and a high success rate.
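"Trajectories are obtained by numerical integration of the predicted field" can be sketched with a plain explicit-Euler roll-out; the `flow` callable stands in for CoFL's learned field and is a hypothetical interface, not the paper's API:

```python
def integrate_trajectory(flow, start, dt=0.1, steps=50):
    """Roll a queried velocity field out into a trajectory with explicit
    Euler steps. `flow(x, y)` returns an instantaneous (vx, vy) at any
    2D location, mirroring how CoFL's field can be queried anywhere."""
    x, y = start
    traj = [(x, y)]
    for _ in range(steps):
        vx, vy = flow(x, y)
        x, y = x + vx * dt, y + vy * dt
        traj.append((x, y))
    return traj
```

Because the field is re-queried at every step, replanning under closed-loop execution amounts to restarting the integration from the current pose.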
[AI-44] Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling NEURIPS2025
【Quick Read】: This paper targets the flexible job-shop scheduling problem (FJSP) arising in Industry 4.0 settings, where existing deep reinforcement learning (DRL)-based constructive methods struggle to reach (near-)optimal solutions. The core challenges stem from flexible machine allocation: inaccurate state representation, inefficient policy learning, and insufficiently effective search strategies. The key to the solution is MIStar, a Memory-enhanced Improvement Search framework with heterogeneous graph representation: it first designs a novel heterogeneous disjunctive graph that explicitly models the operation sequences on machines so that schedules are represented accurately; it then builds a Memory-enhanced Heterogeneous Graph Neural Network (MHGNN) that leverages historical trajectories to strengthen the policy network's decision-making; finally, a parallel greedy search strategy explores the solution space efficiently in fewer iterations, significantly outperforming traditional handcrafted improvement heuristics and state-of-the-art DRL-based constructive methods.
Link: https://arxiv.org/abs/2603.02846
Authors: Jiaqi Wang,Zhiguang Cao,Peng Zhao,Rui Cao,Yubin Xiao,Yuan Jiang,You Zhou
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Click to view abstract
Abstract: The rise of smart manufacturing under Industry 4.0 introduces mass customization and dynamic production, demanding more advanced and flexible scheduling techniques. The flexible job-shop scheduling problem (FJSP) has attracted significant attention due to its complex constraints and strong alignment with real-world production scenarios. Current deep reinforcement learning (DRL)-based approaches to FJSP predominantly employ constructive methods. While effective, they often fall short of reaching (near-)optimal solutions. In contrast, improvement-based methods iteratively explore the neighborhood of initial solutions and are more effective in approaching optimality. However, the flexible machine allocation in FJSP poses significant challenges to the application of this framework, including accurate state representation, effective policy learning, and efficient search strategies. To address these challenges, this paper proposes a Memory-enhanced Improvement Search framework with heterogeneous graph representation–MIStar. It employs a novel heterogeneous disjunctive graph that explicitly models the operation sequences on machines to accurately represent scheduling solutions. Moreover, a memory-enhanced heterogeneous graph neural network (MHGNN) is designed for feature extraction, leveraging historical trajectories to enhance the decision-making capability of the policy network. Finally, a parallel greedy search strategy is adopted to explore the solution space, enabling superior solutions with fewer iterations. Extensive experiments on synthetic data and public benchmarks demonstrate that MIStar significantly outperforms both traditional handcrafted improvement heuristics and state-of-the-art DRL-based constructive methods.
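A disjunctive-graph representation fixes, per machine, the order in which operations run; evaluating a candidate schedule then means computing its makespan from job-precedence and machine-order constraints. A toy evaluator under those assumptions (the data layout here is illustrative, not MIStar's encoding):

```python
def makespan(jobs, machine_order, duration, machine_of):
    """Makespan implied by a disjunctive-graph style solution.
    jobs: dict job -> ordered list of operation ids (job precedence arcs)
    machine_order: dict machine -> ordered list of operation ids (disjunctive arcs)
    duration / machine_of: per-operation processing time and assigned machine."""
    end = {}
    pending = {op for ops in jobs.values() for op in ops}
    while pending:
        progressed = False
        for op in list(pending):
            seq = next(ops for ops in jobs.values() if op in ops)
            j = seq.index(op)
            jp = seq[j - 1] if j > 0 else None           # job predecessor
            mseq = machine_order[machine_of[op]]
            k = mseq.index(op)
            mp = mseq[k - 1] if k > 0 else None          # machine predecessor
            preds = [p for p in (jp, mp) if p is not None]
            if all(p in end for p in preds):
                start = max((end[p] for p in preds), default=0)
                end[op] = start + duration[op]
                pending.remove(op)
                progressed = True
        if not progressed:
            raise ValueError("cyclic schedule")           # inconsistent orders
    return max(end.values())
```

An improvement search like MIStar's repeatedly perturbs the machine orders (and machine assignments) and keeps moves that reduce this objective.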
[AI-45] SPARC: Spatial-Aware Path Planning via Attentive Robot Communication
【Quick Read】: This paper tackles inefficient communication in decentralized Multi-Robot Path Planning (MRPP): existing learned communication methods treat all neighboring robots equally regardless of spatial distance, diluting attention in congested environments where coordination matters most. The key to the solution is Relation-enhanced Multi-Head Attention (RMHA), which explicitly embeds pairwise Manhattan distances into the attention weight computation so that each robot dynamically prioritizes messages from spatially relevant neighbors; combined with a distance-constrained attention mask and GRU-gated message fusion, it integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40×40 grids with 30% obstacle density, the method reaches roughly 75% success, beating the best baseline by more than 25 percentage points, and ablations confirm that distance-relation encoding is the key contributor in high-density environments.
Link: https://arxiv.org/abs/2603.02845
Authors: Sayang Mu,Xiangyu Wu,Bo An
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract: Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves approximately 75 percent success rate at 30 percent obstacle density outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments.
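"Embedding pairwise Manhattan distances into the attention weights" plus a distance-constrained mask can be sketched for a single head as follows; the additive bias `alpha * distance` and the hard `radius` cutoff are plausible assumptions, since the abstract does not specify the exact functional form:

```python
import math

def rmha_weights(scores, manhattan_dist, alpha=1.0, radius=5):
    """Single-head sketch of distance-relation attention: subtract a scaled
    Manhattan distance from each raw score, mask out neighbors beyond the
    communication radius, then softmax."""
    biased = []
    for s, d in zip(scores, manhattan_dist):
        biased.append(float("-inf") if d > radius else s - alpha * d)
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]       # exp(-inf) == 0.0: masked out
    z = sum(exps)
    return [e / z for e in exps]
```

With equal raw scores, the nearer neighbor receives strictly more weight, which is exactly the prioritization the dilution argument calls for.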
[AI-46] Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising INTERSPEECH2026
【Quick Read】: This paper addresses the limited adaptability of traditional speech enhancement under non-stationary noise and the lack of interpretability in purely deep-learning approaches. The key to the solution is TVF (Time-Varying Filtering), a low-latency speech enhancement model that combines the interpretability of Digital Signal Processing (DSP) with the adaptability of deep learning: a lightweight neural network predicts, in real time, the coefficients of a differentiable 35-band IIR filter cascade, allowing the model to adapt dynamically to non-stationary noise while keeping the entire processing chain transparent and controllable.
Link: https://arxiv.org/abs/2603.02794
Authors: Riccardo Rota,Kiril Ratmanski,Jozef Coldenhoff,Milos Cernak
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to Interspeech 2026
Click to view abstract
Abstract: We present TVF (Time-Varying Filtering), a low-latency speech enhancement model with 1 million parameters. Combining the interpretability of Digital Signal Processing (DSP) with the adaptability of deep learning, TVF bridges the gap between traditional filtering and modern neural speech modeling. The model utilizes a lightweight neural network backbone to predict the coefficients of a differentiable 35-band IIR filter cascade in real time, allowing it to adapt dynamically to non-stationary noise. Unlike "black-box" deep learning approaches, TVF offers a completely interpretable processing chain, where spectral modifications are explicit and adjustable. We demonstrate the efficacy of this approach on a speech denoising task using the Valentini-Botinhao dataset and compare the results to a static DDSP approach and a fully deep-learning-based solution, showing that TVF achieves effective adaptation to changing noise conditions.
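Each band in such a cascade is an IIR section applying the standard difference equation. A generic scalar IIR routine (the same shape as `scipy.signal.lfilter`'s direct-form filter, but in pure Python for clarity; this is not TVF's differentiable implementation):

```python
def iir_filter(b, a, x):
    """Apply the IIR difference equation
    a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]
    sample by sample to a sequence x."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y
```

In TVF the neural backbone re-predicts the `b`/`a` coefficients of all 35 bands per frame, so the filter response itself is time-varying; here, a fixed one-pole low-pass (`b=[0.5]`, `a=[1, -0.5]`) settles to unity DC gain on a step input, matching its theoretical gain `sum(b)/sum(a)`.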
[AI-47] Agentified Assessment of Logical Reasoning Agents ICLR2026
【Quick Read】: This paper addresses the insufficient reproducibility, auditability, and robustness to execution failures in the evaluation of logical reasoning agents. The key to the solution is an "agentified assessment" framework in which an assessor agent issues tasks, enforces execution budgets, parses outputs, and records structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. This design keeps evaluation rigorous and consistent while tolerating runtime anomalies. As a case study, the authors benchmark an auto-formalization agent on a solver-verified and repaired split of FOLIO: the agent translates natural-language premises and conclusions into executable Z3Py programs and uses Satisfiability Modulo Theories (SMT) solving to decide logical entailment, reaching 86.70% accuracy on the cleaned FOLIO validation set, well above a chain-of-thought baseline (73.89%).
Link: https://arxiv.org/abs/2603.02788
Authors: Zhiyu Ni,Yifeng Xiao,Zheng Liang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026 Agents in the Wild (AIWILD) Workshop. 5 pages, 2 figures, 1 table
Click to view abstract
Abstract:We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
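The entailment check at the heart of the pipeline is: premises entail the conclusion iff premises ∧ ¬conclusion is unsatisfiable. The paper does this with Z3/SMT over first-order formulas; as a dependency-free stand-in, the same principle for propositional formulas can be checked by brute force over truth assignments:

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """True iff every assignment satisfying all premises also satisfies the
    conclusion -- i.e., premises AND NOT(conclusion) is unsatisfiable.
    Formulas are plain predicates over an assignment dict (a simplified
    propositional stand-in for the paper's Z3Py/SMT check)."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False    # found a countermodel
    return True
```

For example, {p, p → q} entails q (modus ponens), while {p} alone does not entail q.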
[AI-48] Rethinking Code Similarity for Automated Algorithm Design with LLMs ICLR2026
【Quick Read】: This paper addresses the challenge of assessing algorithmic similarity in Large Language Model-based Automated Algorithm Design (LLM-AAD), i.e., distinguishing genuine algorithmic innovation from code variants that differ only at the syntactic or output level. Traditional code similarity metrics cannot capture differences in algorithmic logic, since they focus on surface features such as syntax or final outputs. The key to the solution is BehaveSim, a new algorithmic similarity measure based on problem-solving trajectories (PSTrajs): the sequence of intermediate solutions produced during execution is treated as a behavioral trajectory, and dynamic time warping (DTW) quantifies the alignment between trajectories, effectively distinguishing algorithms with divergent underlying logic even when they look alike. The method markedly improves behavioral diversity in LLM-AAD frameworks and supports systematic strategy analysis of AI-generated algorithms.
Link: https://arxiv.org/abs/2603.02787
Authors: Rui Zhang,Zhichao Lu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2026
Click to view abstract
Abstract: The rise of Large Language Model-based Automated Algorithm Design (LLM-AAD) has transformed algorithm development by autonomously generating code implementations of expert-level algorithms. Unlike traditional expert-driven algorithm development, in the LLM-AAD paradigm, the main design principle behind an algorithm is often implicitly embedded in the generated code. Therefore, assessing algorithmic similarity directly from code, distinguishing genuine algorithmic innovation from mere syntactic variation, becomes essential. While various code similarity metrics exist, they fail to capture algorithmic similarity, as they focus on surface-level syntax or output equivalence rather than the underlying algorithmic logic. We propose BehaveSim, a novel method to measure algorithmic similarity through the lens of problem-solving behavior as a sequence of intermediate solutions produced during execution, dubbed as problem-solving trajectories (PSTrajs). By quantifying the alignment between PSTrajs using dynamic time warping (DTW), BehaveSim distinguishes algorithms with divergent logic despite syntactic or output-level similarities. We demonstrate its utility in two key applications: (i) Enhancing LLM-AAD: Integrating BehaveSim into existing LLM-AAD frameworks (e.g., FunSearch, EoH) promotes behavioral diversity, significantly improving performance on three AAD tasks. (ii) Algorithm analysis: BehaveSim clusters generated algorithms by behavior, enabling systematic analysis of problem-solving strategies–a crucial tool for the growing ecosystem of AI-generated algorithms. Data and code of this work are open-sourced at this https URL.
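The DTW alignment underlying BehaveSim is the classic dynamic program; here is a textbook implementation over scalar trajectories (the scalar `cost` and the treatment of intermediate solutions as single numbers are simplifying assumptions, since PSTrajs may be higher-dimensional):

```python
def dtw_distance(traj_a, traj_b, cost=lambda a, b: abs(a - b)):
    """Dynamic time warping distance between two problem-solving
    trajectories, i.e., sequences of intermediate solution qualities.
    Identical progress curves align at cost 0 even if one is slower."""
    n, m = len(traj_a), len(traj_b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(traj_a[i - 1], traj_b[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Note the behavioral intuition: [1, 2, 3] and [1, 2, 2, 3] warp onto each other at zero cost (same logic, different pacing), whereas genuinely different trajectories accumulate positive cost.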
[AI-49] Scores Know Bob's Voice: Speaker Impersonation Attack
【Quick Read】: This paper addresses the low efficiency and limited success rates of score-based impersonation attacks against speaker recognition systems (SRSs). Existing attacks that optimize directly in the high-dimensional audio space require excessive queries, while current latent-space optimization methods struggle to raise victim scores because their latent spaces do not capture speaker-discriminative geometry. The key to the solution is an inversion-based generative attack framework with a feature-aligned inversion strategy that explicitly and geometrically synchronizes the synthesis model's latent space with the SRS's discriminative feature space, ensuring that latent updates translate directly into score improvements. This mechanism not only markedly improves query efficiency (on average 10x fewer queries) but also enables, for the first time, subspace-projection-based attack paradigms, greatly increasing attack flexibility and effectiveness.
Link: https://arxiv.org/abs/2603.02781
Authors: Chanwoo Hwang,Sunpill Kim,Yong Kiam Tan,Tianchi Liu,Seunghun Paik,Dongsoo Kim,Mondal Soumik,Khin Mi Mi Aung,Jae Hong Seo
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Advances in deep learning have enabled the widespread deployment of speaker recognition systems (SRSs), yet they remain vulnerable to score-based impersonation attacks. Existing attacks that operate directly on raw waveforms require a large number of queries due to the difficulty of optimizing in high-dimensional audio spaces. Latent-space optimization within generative models offers improved efficiency, but these latent spaces are shaped by data distribution matching and do not inherently capture speaker-discriminative geometry. As a result, optimization trajectories often fail to align with the adversarial direction needed to maximize victim scores. To address this limitation, we propose an inversion-based generative attack framework that explicitly aligns the latent space of the synthesis model with the discriminative feature space of SRSs. We first analyze the requirements of an inverse model for score-based attacks and introduce a feature-aligned inversion strategy that geometrically synchronizes latent representations with speaker embeddings. This alignment ensures that latent updates directly translate into score improvements. Moreover, it enables new attack paradigms, including subspace-projection-based attacks, which were previously infeasible due to the absence of a faithful feature-to-audio mapping. Experiments show that our method significantly improves query efficiency, achieving competitive attack success rates with on average 10x fewer queries than prior approaches. In particular, the enabled subspace-projection-based attack attains up to 91.65% success using only 50 queries. These findings establish feature-aligned inversion as a key tool for evaluating the robustness of modern SRSs against score-based impersonation threats. 
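The "subspace-projection-based attack" idea reduces, geometrically, to projecting a candidate embedding onto the subspace spanned by victim embeddings, which can only increase its similarity to the victim. A minimal sketch in pure Python (the orthonormal-basis interface and the cosine scoring rule are illustrative assumptions about the SRS):

```python
def project_onto_subspace(x, basis):
    """Project embedding x onto the subspace spanned by an orthonormal
    basis (e.g., built from victim speaker embeddings)."""
    proj = [0.0] * len(x)
    for u in basis:
        coeff = sum(xi * ui for xi, ui in zip(x, u))
        proj = [p + coeff * ui for p, ui in zip(proj, u)]
    return proj

def cosine(a, b):
    """Cosine similarity, the typical score an SRS exposes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)
```

Such a projection only helps the attacker if latent moves map faithfully back to audio, which is exactly what the paper's feature-aligned inversion is meant to guarantee.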
[AI-50] Next Embedding Prediction Makes World Models Stronger
【Quick Read】: This paper addresses temporal dependency modeling in model-based reinforcement learning (MBRL) for partially observable, high-dimensional environments, i.e., how to learn predictive state representations that improve policy performance. The key to the solution is NE-Dreamer, a decoder-free MBRL agent that uses a temporal Transformer to predict next-step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space. The method learns coherent, predictive state representations without reconstruction losses or auxiliary supervision, matching or exceeding strong baselines on the DeepMind Control Suite and achieving substantial gains on challenging DMLab tasks.
Link: https://arxiv.org/abs/2603.02765
Authors: George Bredis,Nikita Balagansky,Daniil Gavrilov,Ruslan Rakhimov
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Capturing temporal dependencies is critical for model-based reinforcement learning (MBRL) in partially observable, high-dimensional domains. We introduce NE-Dreamer, a decoder-free MBRL agent that leverages a temporal transformer to predict next-step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space. This approach enables NE-Dreamer to learn coherent, predictive state representations without reconstruction losses or auxiliary supervision. On the DeepMind Control Suite, NE-Dreamer matches or exceeds the performance of DreamerV3 and leading decoder-free agents. On a challenging subset of DMLab tasks involving memory and spatial reasoning, NE-Dreamer achieves substantial gains. These results establish next-embedding prediction with temporal transformers as an effective, scalable framework for MBRL in complex, partially observable environments.
[AI-51] Enhancing User Throughput in Multi-panel mmWave Radio Access Networks for Beam-based MU-MIMO Using a DRL Method
【Quick Read】: This paper addresses the difficulty of optimizing user throughput and the added latency caused by the high complexity of dynamic beam selection and management in millimeter-wave (mmWave) systems with multi-user MIMO (MU-MIMO) and hybrid beamforming. The key to the solution is a DRL-based adaptive beam management framework that models the interaction between the communication agent and its environment as a Markov decision process (MDP) and exploits spatial-domain (SD) characteristics, fusing the cross-correlations between beams on different antenna panels, measured reference signal received power (RSRP), and beam usage statistics to optimize beam selection from real-time observations, thereby improving spectral efficiency and significantly reducing end-to-end latency.
Link: https://arxiv.org/abs/2603.02745
Authors: Ramin Hashemi,Vismika Ranasinghe,Teemu Veijalainen,Petteri Kela,Risto Wichman
Affiliation: Unknown
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the IEEE International Conference on Communications (ICC) 2026
Click to view abstract
Abstract:Millimeter-wave (mmWave) communication systems, particularly those leveraging multi-user multiple-input and multiple-output (MU-MIMO) with hybrid beamforming, face challenges in optimizing user throughput and minimizing latency due to the high complexity of dynamic beam selection and management. This paper introduces a deep reinforcement learning (DRL) approach for enhancing user throughput in multi-panel mmWave radio access networks in a practical network setup. Our DRL-based formulation utilizes an adaptive beam management strategy that models the interaction between the communication agent and its environment as a Markov decision process (MDP), optimizing beam selection based on real-time observations. The proposed framework exploits spatial domain (SD) characteristics by incorporating the cross-correlation between the beams in different antenna panels, the measured reference signal received power (RSRP), and the beam usage statistics to dynamically adjust beamforming decisions. As a result, the spectral efficiency is improved and end-to-end latency is reduced. The numerical results demonstrate an increase in throughput of up to 16% and a reduction in latency by factors 3-7x compared to baseline (legacy beam management).
[AI-52] Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
【Quick Read】: This paper addresses the activation-memory and expert-parallel communication bottlenecks in training large-scale Mixture-of-Experts (MoE) models, in particular how to achieve efficient MXFP4 training on Hopper-architecture GPUs that lack native FP4 (Floating Point 4-bit) compute support. The key to the solution is direct FP8-to-FP4 quantization and de-quantization together with scaling-aware FP4 row-wise to column-wise conversion, which compresses activations and expert-parallel communication to FP4 without costly precision round-trips (e.g., FP4 ↔ BF16 ↔ FP8), while core MoE computation stays in FP8. This yields substantial savings (14.8% lower peak activation memory) and throughput gains (12.5% higher training throughput), showing that FP4 training efficiency is achievable through software-hardware co-design even without native FP4 hardware support.
Link: https://arxiv.org/abs/2603.02731
Authors: Wuyue Zhang,Chongdong Huang,Chunbo You,Cheng Gu,Fengjuan Wang,Mou Sun
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract: Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4 or NVFP4 support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native 4-bit computation support. A central challenge is to integrate FP4 into an existing BF16/FP8 hybrid training pipeline without incurring costly precision round-trips (e.g., FP4 ↔ BF16 ↔ FP8). We address this challenge by introducing direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication with minimal overhead. Core MoE computations are executed in FP8, while activations and expert-parallel communication are compressed using MXFP4, achieving substantial memory and bandwidth savings without degrading convergence. At the 671B parameter scale, our method achieves end-to-end training performance comparable to strong FP8 baselines, while reducing peak activation memory by 14.8% (11.8 GB) and improving training throughput by 12.5%, from 1157 to 1302 tokens per GPU per second. These results show that FP4 efficiency can be practically realized for large-scale MoE training through careful software-hardware co-design, even without native FP4 Tensor Core support.
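What FP4 block quantization means numerically: in MXFP4, values within a block share one scale, and each value is rounded to the FP4 E2M1 grid, whose representable magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}. A toy round-to-nearest quantizer under that assumption (real MXFP4 uses power-of-two shared scales and specific rounding modes; this sketch uses a plain max/6 scale for clarity):

```python
# Magnitudes representable by the FP4 E2M1 format used in MXFP4 blocks.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_quantize(block):
    """Toy block quantizer: choose a shared scale so the largest magnitude
    lands on the top of the E2M1 grid, round each value to the nearest
    representable point, and return the dequantized values plus the scale."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    out = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out, scale
```

Values already on the scaled grid survive quantization exactly, which is why well-chosen per-block scales keep the representation error small enough not to degrade convergence.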
[AI-53] A Natural Language Agentic Approach to Study Affective Polarization
【Quick Read】: This paper addresses the limitations of existing analyses of affective polarization on social media: real-world studies have limited scope, simulation studies are hampered by the shortage of high-quality labeled data, and the lack of a unified definitional framework across studies makes results hard to compare. The key to the solution is a multi-agent simulation platform that uses large language models (LLMs) to construct virtual communities and drive context-sensitive agent interactions, enabling systematic exploration of how affective polarization evolves at different levels of granularity and abstraction, and providing a flexible, extensible tool for computational social science.
Link: https://arxiv.org/abs/2603.02711
Authors: Stephanie Anneris Malvicini,Ewelina Gajewska,Arda Derbent,Katarzyna Budzynska,Jarosław A. Chudziak,Maria Vanina Martinez
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ICAART 2026 (18th International Conference on Agents and Artificial Intelligence). The final published version is available in the conference proceedings (SCITEPRESS)
Click to view abstract
Abstract:Affective polarization has been central to political and social studies, with growing focus on social media, where partisan divisions are often exacerbated. Real-world studies tend to have limited scope, while simulated studies suffer from insufficient high-quality training data, as manually labeling posts is labor-intensive and prone to subjective biases. The lack of adequate tools to formalize different definitions of affective polarization across studies complicates result comparison and hinders interoperable frameworks. We present a multi-agent model providing a comprehensive approach to studying affective polarization in social media. To operationalize our framework, we develop a platform leveraging large language models (LLMs) to construct virtual communities where agents engage in discussions. We showcase the potential of our platform by (1) analyzing questions related to affective polarization, as explored in social science literature, providing a fresh perspective on this phenomenon, and (2) introducing scenarios that allow observation and measurement of polarization at different levels of granularity and abstraction. Experiments show that our platform is a flexible tool for computational studies of complex social dynamics such as affective polarization. It leverages advanced agent models to simulate rich, context-sensitive interactions and systematically explore research questions traditionally addressed through human-subject studies.
[AI-54] FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
【Quick Read】: This paper addresses inaccurate pairing of text with time-series data in finance: in complex markets, traditional keyword-matching approaches fail to capture the multi-level influences on a company's stock price from its own events, other companies' events, and macroeconomic factors. The key to the solution is a semantic-based, multi-level pairing framework: company-specific context is first extracted from SEC filings and an embedding model enables semantic retrieval of relevant news articles; a large language model (LLM) then classifies articles into four levels (macro-level, sector-level, related-company-level, and target-company-level), enabling multi-level pairing of text with stock-price time series. The framework is used to construct FinTexTS, a large-scale text-paired stock price dataset, and experiments confirm its effectiveness for stock price forecasting.
Link: https://arxiv.org/abs/2603.02702
Authors: Jaehoon Lee,Suhwan Park,Tae Yoon Lim,Seunghan Lee,Jun Seo,Dongwan Kang,Hwanil Choi,Minjae Kim,Sungdong Yoo,SoonYoung Lee,Yongjae Lee,Wonbin Ahn
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages
Click to view abstract
Abstract: The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly-available news datasets, we construct **FinTexTS**, a new large-scale text-paired stock price dataset. Experimental results on **FinTexTS** demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to publicly-available news underlying **FinTexTS**, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
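The "embedding-based matching mechanism" boils down to ranking news embeddings by similarity to the company-context embedding. A minimal sketch with cosine similarity (the two-element data layout and toy vectors are illustrative; the paper's embedding model and thresholds are not specified here):

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve_news(context_vec, news, top_k=1):
    """news: list of (article_id, embedding) pairs. Rank articles by cosine
    similarity to the company context extracted from SEC filings and
    return the ids of the top_k matches."""
    ranked = sorted(news, key=lambda item: cosine(context_vec, item[1]), reverse=True)
    return [article_id for article_id, _ in ranked[:top_k]]
```

In the full framework, the retrieved articles are subsequently routed by an LLM into the four levels (macro, sector, related company, target company) before pairing.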
[AI-55] Retrieval-Augmented Robots via Retrieve-Reason-Act
【Quick Read】: This paper addresses the critical information gap robots face in zero-shot settings without prior demonstrations, especially their inability to acquire unseen procedural knowledge needed for complex tasks such as assembling furniture. Traditional methods rely only on internal parametric knowledge or retrieval of past trajectories and cannot extract executable instructions from unstructured external documents. The key to the solution is the Retrieval-Augmented Robotics (RAR) paradigm, an iterative Retrieve-Reason-Act loop: the robot actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable action plans. Experiments show that planning grounded in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval.
Link: https://arxiv.org/abs/2603.02688
Authors: Izat Temiraliev,Diji Yang,Yi Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Click to view abstract
Abstract:To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings where no prior demonstrations exist, robots face a critical information gap, such as the exact sequence required to assemble a complex furniture kit, that cannot be satisfied by internal parametric knowledge (common sense) or past internal memory. While recent robotic works attempt to use search before action, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define the paradigm as Retrieval-Augmented Robotics (RAR), empowering the robot with the information-seeking capability that bridges the gap between visual documentation and physical actuation. We formulate the task execution as an iterative Retrieve-Reason-Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long-horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical actions.
[AI-56] LLMs for High-Frequency Decision-Making: Normalized Action Reward-Guided Consistency Policy Optimization
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在高频决策任务中表现受限的问题,特别是由于状态空间中高精度数值信息频繁更新且波动微小,导致子任务策略与组合任务策略之间存在政策错位(policy misalignment)。解决方案的关键在于提出一种基于归一化动作奖励的一致性策略优化方法(Normalized Action Reward guided Consistency Policy Optimization, NAR-CP):首先通过环境反馈获取预定义密集奖励,并利用归一化进行奖励塑造,理论上证明该归一化不损害最优策略;其次,借助LLMs推断子观测候选动作并生成联合策略,引入一致性损失确保全局语义策略与子语义策略之间的精确对齐。实验表明,该方法在无人机追击等典型高频任务中显著提升了独立任务和组合任务的性能,并具备良好的未见任务泛化能力。
链接: https://arxiv.org/abs/2603.02680
作者: Yang Zhao,Zihao Li,Zhiyu Jiang,Dandan Ma,Ganchao Liu,Wenzhe Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) form the cornerstone of sequential decision-making agent development, they have inherent limitations in high-frequency decision tasks. Existing research mainly focuses on discrete embodied decision scenarios with low-frequency and significant semantic differences in state space (e.g., household planning). These methods suffer from limited performance in high-frequency decision-making tasks, since high-precision numerical state information in such tasks undergoes frequent updates with minimal fluctuations, and exhibiting policy misalignment between the learned sub-tasks and composite tasks. To address these issues, this paper proposes Normalized Action Reward guided Consistency Policy Optimization (NAR-CP). 1) Our method first acquires predefined dense rewards from environmental feedback of candidate actions via reward functions, then completes reward shaping through normalization, and theoretically verifies action reward normalization does not impair optimal policy. 2) To reduce policy misalignment in composite tasks, we use LLMs to infer sub-observation candidate actions and generate joint policies, with consistency loss ensuring precise alignment between global semantic policies and sub-semantic policies. Experiments on UAV pursuit, a typical high-frequency task, show our method delivers superior performance on independent and composite tasks with excellent generalization to unseen tasks.
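摘要中提到"理论上证明动作奖励归一化不损害最优策略"。直观原因是:对同一决策步内各候选动作的奖励做 min-max 归一化属于保序(正仿射)变换,贪心选择的最优动作不变。下面是一个仅作示意的 Python 草图,函数名与数值均为假设,并非论文实现:

```python
import math

def normalize_action_rewards(rewards):
    """对单个决策步内候选动作的密集奖励做 min-max 归一化。

    同一步内的保序变换不改变奖励排序,
    因此贪心(最优)动作选择不受影响。
    """
    lo, hi = min(rewards), max(rewards)
    if math.isclose(lo, hi):
        return [0.0 for _ in rewards]  # 所有候选动作同等优劣
    return [(r - lo) / (hi - lo) for r in rewards]

# 高精度数值、波动极小的典型高频任务奖励(示意数据)
raw = [12.3, 12.31, 12.29, 12.35]
shaped = normalize_action_rewards(raw)
best_raw = max(range(len(raw)), key=raw.__getitem__)
best_shaped = max(range(len(shaped)), key=shaped.__getitem__)
```

归一化把微小的奖励差拉伸到 [0, 1] 区间完成奖励塑造,而最优动作的索引保持一致。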
[AI-57] SorryDB: Can AI Provers Complete Real-World Lean Theorems?
【速读】:该论文旨在解决当前形式化数学(formal mathematics)领域中基准测试(benchmark)静态化与实际需求脱节的问题。现有基准多基于竞赛题目或人工构造任务,难以反映真实社区项目中的复杂性和多样性,导致模型优化方向偏离实际应用场景。为此,作者提出SorryDB——一个动态更新的开源Lean任务基准,数据源自GitHub上78个真实形式化项目,能够持续反映社区需求并避免测试集污染(test-set contamination)。其关键创新在于构建了一个可随时间演进的任务流,从而为评估AI代理在新颖形式化项目中的贡献能力提供更可靠指标,并验证了当前多种方法(包括通用大语言模型、代理式策略和专用符号证明器)具有互补性,而非单一最优方案。
链接: https://arxiv.org/abs/2603.02668
作者: Austin Letson,Leopoldo Sarra,Auguste Poiroux,Oliver Dressler,Paul Lezeau,Dhyan Aranha,Frederick Pu,Aaron Hill,Miguel Corredera Hidalgo,Julian Berman,George Tsoukalas,Lenny Taelman
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent’s ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.
[AI-58] Improving Diffusion Planners by Self-Supervised Action Gating with Energies
【速读】:该论文旨在解决扩散规划(Diffusion Planning)在离线强化学习中因价值引导选择导致轨迹局部动态不一致的问题,从而引发执行脆弱性。解决方案的关键在于提出自监督动作门控能量机制(Self-supervised Action Gating with Energies, SAGE),其通过训练一个联合嵌入预测架构(Joint-Embedding Predictive Architecture, JEPA)编码器和动作条件的潜在预测器,获取短时程转移中的隐状态一致性信号;在推理阶段,SAGE基于候选轨迹的潜在预测误差计算能量值,将其与价值估计结合进行再排序,从而筛选出既高价值又动态一致的动作序列,无需环境回放或策略重训练即可显著提升扩散规划的性能与鲁棒性。
链接: https://arxiv.org/abs/2603.02650
作者: Yuan Lu,Dongqi Han,Yansen Wang,Dongsheng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
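SAGE 在推理阶段的"价值 + 可行性能量"再排序可以用几行代码示意:每个候选轨迹带一个价值估计和一个潜在预测误差(能量),按二者的加权差选择动作。以下为假设性草图,权重 lam 与数据均为示意,并非论文的具体打分公式:

```python
def sage_rerank(candidates, lam=1.0):
    """按"价值估计 - lam * 潜在预测误差(能量)"对候选计划再排序。

    candidates 为 (value_estimate, latent_prediction_error) 列表;
    能量项惩罚与环境动态不一致的计划。返回得分最高的候选索引。
    """
    scores = [v - lam * e for v, e in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# 高价值但动态不一致的计划(索引 0)输给价值略低但可行的计划(索引 1)
cands = [(0.9, 0.8), (0.85, 0.05), (0.5, 0.01)]
choice = sage_rerank(cands)
```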
[AI-59] Robust Heterogeneous Analog-Digital Computing for Mixture-of-Experts Models with Theoretical Generalization Guarantees
【速读】:该论文旨在解决稀疏混合专家(Sparse Mixture-of-Experts, MoE)模型在模拟存内计算(Analog In-Memory Computing, AIMC)硬件上部署时面临的精度下降问题,尤其是由AIMC硬件非理想性(如噪声和失真)导致的性能退化。传统方法依赖于噪声感知的重新训练(noise-aware retraining),但对大规模MoE模型而言不具可行性。其解决方案的关键在于提出一种无需重训练的异构计算框架:通过识别出对噪声敏感的专家(即最大神经元范数最大的专家),将其保留在数字域执行,而其余多数专家则在AIMC硬件上运行;同时,将高噪声敏感度但参数量较小的密集激活模块(如注意力层)也分配至数字计算,从而在保持模型准确性的前提下显著提升硬件能效。
链接: https://arxiv.org/abs/2603.02633
作者: Mohammed Nowaz Rabbani Chowdhury,Hsinyu Tsai,Geoffrey W. Burr,Kaoutar El Maghraoui,Liu Liu,Meng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Sparse Mixture-of-Experts (MoE) models enable efficient scalability by activating only a small sub-set of experts per input, yet their massive parameter counts lead to substantial memory and energy inefficiency during inference. Analog in-memory computing (AIMC) offers a promising solution by eliminating frequent data movement between memory and compute units. However, mitigating hardware nonidealities of AIMC typically requires noise-aware retraining, which is infeasible for large MoE models. In this paper, we propose a retraining-free heterogeneous computation framework in which noise-sensitive experts, which are provably identifiable by their maximum neuron norm, are computed digitally while the majority of the experts are executed on AIMC hardware. We further assign densely activated modules, such as attention layers, to digital computation due to their high noise sensitivity despite comprising a small fraction of parameters. Extensive experiments on large MoE language models, including DeepSeekMoE and OLMoE, across multiple benchmark tasks validate the robustness of our approach in maintaining accuracy under analog nonidealities.
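论文指出噪声敏感专家"可通过最大神经元范数识别"。下面用纯标准库 Python 草图示意这一划分逻辑:计算每个专家权重矩阵的最大行范数,把范数最大的 k 个专家留在数字域,其余分配给模拟硬件。权重与专家数量均为示意数据:

```python
def partition_experts(expert_weights, k_digital):
    """按最大神经元范数把前 k 个最敏感的专家分配给数字计算。

    敏感度代理指标:专家权重矩阵各行的最大 L2 范数。
    expert_weights 为嵌套列表(示意数据,非真实模型权重)。
    """
    def max_neuron_norm(W):
        return max(sum(w * w for w in row) ** 0.5 for row in W)

    norms = [max_neuron_norm(W) for W in expert_weights]
    order = sorted(range(len(norms)), key=norms.__getitem__, reverse=True)
    digital = set(order[:k_digital])
    analog = [i for i in range(len(norms)) if i not in digital]
    return sorted(digital), analog

experts = [
    [[0.1, 0.2], [0.1, 0.1]],  # 范数小 -> 模拟硬件
    [[3.0, 4.0], [0.2, 0.1]],  # 神经元范数 5.0 -> 最敏感
    [[0.5, 0.5], [0.4, 0.3]],
]
digital, analog = partition_experts(experts, k_digital=1)
```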
[AI-60] MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks
【速读】:该论文旨在解决多智能体系统(Multi-Agent System, MAS)中提示词(prompt)优化的三大挑战:样本效率低(因评估成本高)、提示词间拓扑耦合关系复杂、以及搜索空间的组合爆炸问题。其解决方案的关键在于提出MASPOB框架,该框架基于强化学习中的多臂赌博机(bandits)机制,利用上置信界(Upper Confidence Bound, UCB)实现探索与利用的平衡,在有限预算下最大化性能提升;同时引入图神经网络(Graph Neural Networks, GNNs)建模提示词间的结构先验,学习拓扑感知的语义表示以处理耦合问题,并采用坐标上升法(coordinate ascent)将高维优化分解为单变量子问题,将搜索复杂度从指数级降低至线性级别,从而显著提升优化效率与效果。
链接: https://arxiv.org/abs/2603.02630
作者: Zhi Hong,Qian Zhang,Jiahang Sun,Zhiwei Shang,Mingze Kong,Xiangyi Wang,Yao Shu,Zhongxiang Dai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved great success in many real-world applications, especially the one serving as the cognitive backbone of Multi-Agent Systems (MAS) to orchestrate complex workflows in practice. Since many deployment scenarios preclude MAS workflow modifications and its performance is highly sensitive to the input prompts, prompt optimization emerges as a more natural approach to improve its performance. However, real-world prompt optimization for MAS is impeded by three key challenges: (1) the need of sample efficiency due to prohibitive evaluation costs, (2) topology-induced coupling among prompts, and (3) the combinatorial explosion of the search space. To address these challenges, we introduce MASPOB (Multi-Agent System Prompt Optimization via Bandits), a novel sample-efficient framework based on bandits. By leveraging Upper Confidence Bound (UCB) to quantify uncertainty, the bandit framework balances exploration and exploitation, maximizing gains within a strictly limited budget. To handle topology-induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology-aware representations of prompt semantics. Furthermore, it employs coordinate ascent to decompose the optimization into univariate sub-problems, reducing search complexity from exponential to linear. Extensive experiments across diverse benchmarks demonstrate that MASPOB achieves state-of-the-art performance, consistently outperforming existing baselines.
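MASPOB 用 UCB 量化不确定性以平衡探索与利用。其核心选择规则可以用如下草图示意(不含论文的 GNN 结构先验与坐标上升部分;候选提示词与统计量均为假设):

```python
import math

def ucb_select(stats, t, c=1.0):
    """选出置信上界(UCB)最高的提示词候选。

    stats: 候选 -> (平均奖励, 评估次数);未评估过的候选
    (次数为 0)获得无穷大加成,保证每个候选至少被尝试一次。
    """
    def ucb(item):
        mean, n = stats[item]
        if n == 0:
            return float("inf")
        return mean + c * math.sqrt(2 * math.log(t) / n)

    return max(stats, key=ucb)

stats = {"prompt_a": (0.6, 5), "prompt_b": (0.5, 2), "prompt_c": (0.0, 0)}
pick = ucb_select(stats, t=7)
```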
[AI-61] See and Remember: A Multimodal Agent for Web Traversal
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的智能体在自主网页导航中面临的空间定向障碍和导航循环问题,这些问题导致其在复杂视觉环境中难以维持长期上下文并准确执行任务。解决方案的关键在于提出一种通用的多模态代理架构V-GEMS(Visual Grounding and Explicit Memory System),该架构通过引入视觉定位(Visual Grounding)机制以明确交互元素的语义歧义,并构建一个带有状态追踪的显式记忆栈(Explicit Memory Stack),从而形成结构化的路径地图,支持有效回溯并避免深层导航任务中的循环错误。
链接: https://arxiv.org/abs/2603.02626
作者: Xinjun Wang,Shengyao Wang,Aimin Zhou,Hao Hao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose the generally applicable V-GEMS (Visual Grounding and Explicit Memory System), a robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show V-GEMS significantly dominates the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at this https URL.
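摘要中的"显式记忆栈 + 状态追踪"机制可以用一个极简类来示意:入栈前检查是否访问过以阻断循环,回溯时弹栈返回上一帧。接口与字段均为假设,并非论文实现:

```python
class TraversalMemory:
    """记录 (url, state) 帧的显式记忆栈,支持有效回溯。

    已访问页面记入 visited 集合,拒绝重复进入,从而避免导航循环。
    """
    def __init__(self):
        self.stack = []
        self.visited = set()

    def push(self, url, state):
        if url in self.visited:
            return False  # 检测到循环,拒绝再次进入
        self.stack.append((url, state))
        self.visited.add(url)
        return True

    def backtrack(self):
        if len(self.stack) > 1:
            self.stack.pop()
        return self.stack[-1]

mem = TraversalMemory()
mem.push("/home", {"step": 0})
mem.push("/docs", {"step": 1})
looped = mem.push("/home", {"step": 2})  # 将造成循环 -> 被拒绝
url, state = mem.backtrack()
```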
[AI-62] AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
【速读】:该论文旨在解决自主AI代理(Autonomous AI Agents)在部署后因提示词、工具、模型或编排逻辑变更而导致的行为退化问题,而当前缺乏一种系统性的方法来验证代理是否发生回归。其核心解决方案是提出AgentAssay框架,通过引入基于假设检验的随机三值判定(PASS/FAIL/INCONCLUSIVE)、五维代理覆盖度量、代理特异性变异测试算子、代理工作流的元关系约束、CI/CD流水线中的统计决策门控机制、行为指纹映射执行轨迹为紧凑向量以实现多变量退化检测、自适应预算优化以及基于trace-first的离线分析策略,从而在保持严格统计保证的前提下,实现高达78–100%的成本降低。其中,行为指纹技术与trace-first分析构成了关键创新点,使得在不增加额外运行成本的情况下即可利用生产环境轨迹完成高效回归检测。
链接: https://arxiv.org/abs/2603.02601
作者: Varun Pratap Bhardwaj
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Technical Report. 52 pages, 5 figures, 9 theorems, 42 formal definitions. Zenodo DOI: https://doi.org/10.5281/zenodo.18842011
点击查看摘要
Abstract:Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology exists for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic. We present AgentAssay, the first token-efficient framework for regression testing non-deterministic AI agent workflows, achieving 78-100% cost reduction while maintaining rigorous statistical guarantees. Our contributions include: (1) stochastic three-valued verdicts (PASS/FAIL/INCONCLUSIVE) grounded in hypothesis testing; (2) five-dimensional agent coverage metrics; (3) agent-specific mutation testing operators; (4) metamorphic relations for agent workflows; (5) CI/CD deployment gates as statistical decision procedures; (6) behavioral fingerprinting that maps execution traces to compact vectors, enabling multivariate regression detection; (7) adaptive budget optimization calibrating trial counts to behavioral variance; and (8) trace-first offline analysis enabling zero-cost testing on production traces. Experiments across 5 models (GPT-5.2, Claude Sonnet 4.6, Mistral-Large-3, Llama-4-Maverick, Phi-4), 3 scenarios, and 7,605 trials demonstrate that behavioral fingerprinting achieves 86% detection power where binary testing has 0%, SPRT reduces trials by 78%, and the full pipeline achieves 100% cost savings through trace-first analysis. Implementation: 20,000+ lines of Python, 751 tests, 10 framework adapters.
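摘要中"基于假设检验的三值判定(PASS/FAIL/INCONCLUSIVE)"可以用成功率置信区间来示意:区间整体高于要求成功率判 PASS,整体低于判 FAIL,跨越阈值则判 INCONCLUSIVE(需继续采样)。下面采用正态近似区间,仅为示意,并非论文校准过的 SPRT 流程,阈值与 z 值均为假设:

```python
import math

def verdict(successes, trials, p_required=0.9, z=1.96):
    """对重复随机试验给出三值回归判定。

    基于成功率的正态近似置信区间:整段区间达标 -> PASS;
    整段区间不达标 -> FAIL;否则 INCONCLUSIVE(样本不足)。
    """
    p_hat = successes / trials
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    lo, hi = p_hat - half, p_hat + half
    if lo >= p_required:
        return "PASS"
    if hi < p_required:
        return "FAIL"
    return "INCONCLUSIVE"

v = verdict(92, 100)  # 点估计 0.92,但置信区间跨越 0.9
```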
[AI-63] SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
【速读】:该论文旨在解决多模型大语言模型(Large Language Model, LLM)服务中解码执行效率低下的问题,其核心挑战在于模型特定的资源分区导致无法进行跨模型批处理(cross-model batching),从而在内存受限的解码阶段引发严重的GPU利用率不足,尤其在负载不均衡时更为显著。解决方案的关键在于提出Shared Use of Next-token Prediction (SUN),该方法将仅解码器结构的Transformer分解为预填充模块(prefill module)和解码模块(decode module),仅对任务相关的预填充模块进行微调,使冻结的解码模块可在不同模型间共享;这一设计实现了与模型无关的解码路由策略,能够动态平衡共享工作节点上的解码请求以最大化资源利用率。实验表明,SUN在保持与全量微调相当的准确性的同时,显著提升了系统吞吐量,相较传统去耦合方案提升最高达2.0倍GPU吞吐量,且每输出token耗时(TPOT)变化不超过5%。此外,SUN还天然支持低比特解码,通过量化版本QSUN进一步实现45%的速度提升,同时保留共享解码的优势。
链接: https://arxiv.org/abs/2603.02599
作者: Sunghyeon Woo,Ahreum Seo,Jaegwang Lee,Jaeeun Kil,Hanbae Seo,Joonghoon Kim,Baeseong Park,Se Jung Kwon,Dongsoo Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, 15 pages, 5 figures
点击查看摘要
Abstract:In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.
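由于冻结的解码模块可跨模型共享,解码路由可以做到与模型无关。下面用"最长任务优先 + 最小负载优先"的贪心分配示意这种负载均衡,接口与负载估计均为假设,并非 SUN 的实际调度器:

```python
import heapq

def route_decodes(requests, num_workers):
    """把解码请求分配到当前累计负载最小的共享解码 worker。

    requests: (request_id, 预估解码 token 数) 列表;
    因解码模块跨模型共享,不同模型的请求可在同一 worker 上合批。
    """
    heap = [(0, w) for w in range(num_workers)]  # (累计负载, worker_id)
    heapq.heapify(heap)
    assignment = {}
    for rid, tokens in sorted(requests, key=lambda r: -r[1]):
        load, w = heapq.heappop(heap)
        assignment[rid] = w
        heapq.heappush(heap, (load + tokens, w))
    return assignment

reqs = [("a", 100), ("b", 90), ("c", 30), ("d", 20)]
assign = route_decodes(reqs, num_workers=2)
```

此例中两个 worker 的累计负载均为 120,达到均衡。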
[AI-64] LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
【速读】:该论文旨在解决当前通用人工智能代理(General AI Agents)评估中基准测试存在显著局限性的问题,即现有基准难以准确反映真实用户任务的需求。为应对这一挑战,作者提出了一种名为LiveAgentBench的综合性评估基准,包含104个场景、共374项任务(其中125项用于验证,249项用于测试),其数据来源于社交媒体公开提问和真实产品交互。解决方案的关键在于提出的Social Perception-Driven Data Generation (SPDG)方法,该方法确保每个任务具备现实相关性、任务复杂性和结果可验证性,从而提升了评估的真实性与实用性,并支持持续更新以纳入来自真实世界互动的新查询。
链接: https://arxiv.org/abs/2603.02586
作者: Hao Li,Huan Wang,Jinjie Gu,Wenjie Wang,Chenyi Zhuang,Sikang Bian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As large language models grow more capable, general AI agents have become increasingly prevalent in practical applications. However, existing benchmarks face significant limitations, failing to represent real-world user tasks accurately. To address this gap, we present LiveAgentBench, a comprehensive benchmark with 104 scenarios that reflect real user requirements. It is constructed from publicly sourced questions on social media and real-world products. Central to our approach is the Social Perception-Driven Data Generation (SPDG) method, a novel process we developed to ensure each question’s real-world relevance, task complexity, and result verifiability. We evaluate various models, frameworks, and commercial products using LiveAgentBench, revealing their practical performance and identifying areas for improvement. This release includes 374 tasks, with 125 for validation and 249 for testing. The SPDG process enables continuous updates with fresh queries from real-world interactions.
[AI-65] AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation
【速读】:该论文旨在解决自动驾驶系统在安全关键场景下评估时面临的挑战:真实世界中此类场景稀少且难以收集,而现有仿真生成方法在可控性和真实性方面存在局限。解决方案的关键在于提出AnchorDrive框架,其核心创新是利用大语言模型(LLM)与扩散模型(Diffusion Model)的互补优势——第一阶段由LLM作为驾驶员代理,在闭环仿真中根据自然语言指令生成语义可控的轨迹并经计划评估器反馈修正;第二阶段则提取第一阶段轨迹中的关键锚点作为引导目标,联合其他引导项驱动扩散模型重构更符合真实驾驶分布的完整轨迹,从而在保持用户意图的同时显著提升生成场景的真实性与可控性。
链接: https://arxiv.org/abs/2603.02542
作者: Zhulin Jiang,Zetao Li,Cheng Wang,Ziwen Wang,Chen Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Autonomous driving systems require comprehensive evaluation in safety-critical scenarios to ensure safety and robustness. However, such scenarios are rare and difficult to collect from real-world driving data, necessitating simulation-based synthesis. Yet, existing methods often exhibit limitations in both controllability and realism. From a capability perspective, LLMs excel at controllable generation guided by natural language instructions, while diffusion models are better suited for producing trajectories consistent with realistic driving distributions. Leveraging their complementary strengths, we propose AnchorDrive, a two-stage safety-critical scenario generation framework. In the first stage, we deploy an LLM as a driver agent within a closed-loop simulation, which reasons and iteratively outputs control commands under natural language constraints; a plan assessor reviews these commands and provides corrective feedback, enabling semantically controllable scenario generation. In the second stage, the LLM extracts key anchor points from the first-stage trajectories as guidance objectives, which jointly with other guidance terms steer the diffusion model to regenerate complete trajectories with improved realism while preserving user-specified intent. Experiments on the highD dataset demonstrate that AnchorDrive achieves superior overall performance in criticality, realism, and controllability, validating its effectiveness for generating controllable and realistic safety-critical scenarios.
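第二阶段"以锚点为引导目标驱动扩散模型"的引导项,形式上可以是轨迹点到锚点的距离代价,其梯度即可牵引采样过程。以下草图仅示意该代价的形式,坐标与锚点均为假设数据:

```python
def anchor_guidance_cost(trajectory, anchors):
    """把再生成轨迹拉向第一阶段关键锚点的引导代价。

    每个锚点为 (t, x, y):轨迹第 t 步的点应靠近 (x, y)。
    平方距离之和即可作为扩散采样的引导目标之一(示意形式)。
    """
    cost = 0.0
    for t, x, y in anchors:
        px, py = trajectory[t]
        cost += (px - x) ** 2 + (py - y) ** 2
    return cost

traj = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]
anchors = [(1, 1.0, 0.0), (2, 2.0, 0.0)]
c = anchor_guidance_cost(traj, anchors)
```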
[AI-66] A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在标准基准测试中表现出统一的“通用能力因子”,但依然在人类看来极为简单的任务上表现不佳的问题。其根本原因在于现有基准主要关注任务完成度,而忽视了探测基础认知能力(foundational cognitive abilities),这些能力是理解模型与人类智能差距的关键。解决方案的核心在于提出 NeuroCognition 基准,该基准基于三项经适配的神经心理学测试:瑞文渐进矩阵(Raven’s Progressive Matrices,用于抽象关系推理)、空间工作记忆(Spatial Working Memory,用于信息保持与系统搜索)以及威斯康星卡片分类测验(Wisconsin Card Sorting Test,用于认知灵活性)。通过这一基准,研究发现 LLM 在文本任务中表现良好,但在图像任务和复杂情境下性能下降,且复杂的推理策略并不总是带来优势,而简单的人类策略可带来部分提升。此外,NeuroCognition 与传统通用能力基准呈正相关,但仍测量了超出它们的认知维度,从而明确指出了当前 LLM 在哪些方面接近人类智能、哪些方面仍缺乏核心适应性认知能力,为可验证、可扩展地改进 LLM 提供了新路径。
链接: https://arxiv.org/abs/2603.02540
作者: Faiz Ghifari Haznitrama,Faeyza Rishad Ardi,Alice Oh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 2 figures, 16 tables
点击查看摘要
Abstract:Large language models (LLMs) exhibit a unified “general factor” of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with simple, trivial tasks for humans. This is because current benchmarks focus on task completion, failing to probe the foundational cognitive abilities that highlight these behaviors. We address this by introducing the NeuroCognition benchmark, grounded in three adapted neuropsychological tests: Raven’s Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance and systematic search), and the Wisconsin Card Sorting Test (cognitive flexibility). Our evaluation reveals that while models perform strongly on text, their performance degrades for images and with increased complexity. Furthermore, we observe that complex reasoning is not universally beneficial, whereas simple, human-like strategies yield partial gains. We also find that NeuroCognition correlates positively with standard general-capability benchmarks, while still measuring distinct cognitive abilities beyond them. Overall, NeuroCognition emphasizes where current LLMs align with human-like intelligence and where they lack core adaptive cognition, showing the potential to serve as a verifiable, scalable source for improving LLMs.
[AI-67] Bridging Diffusion Guidance and Anderson Acceleration via Hopfield Dynamics
【速读】:该论文旨在解决扩散模型中Classifier-Free Guidance (CFG) 方法存在的高推理成本以及在蒸馏或单步模型中适用性受限的问题。其核心解决方案是提出一种基于注意力空间外推的新型引导机制——Geometry Aware Attention Guidance (GAG),其关键在于将注意力更新过程建模为现代霍普菲尔德网络(Modern Hopfield Networks)中的不动点迭代,并利用Anderson加速理论揭示注意力空间外推的本质。通过分解注意力更新为沿引导方向的平行与正交分量,GAG实现了对加速过程的稳定控制并最大化引导效率,从而在不改变现有框架的前提下显著提升生成质量。
链接: https://arxiv.org/abs/2603.02531
作者: Kwanyoung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 11 figures
点击查看摘要
Abstract:Classifier-Free Guidance (CFG) has significantly enhanced the generative quality of diffusion models by extrapolating between conditional and unconditional outputs. However, its high inference cost and limited applicability to distilled or single-step models have shifted research focus toward attention-space extrapolation. While these methods offer computational efficiency, their theoretical underpinnings remain elusive. In this work, we establish a foundational framework for attention-space extrapolation by modeling attention dynamics as fixed-point iterations within Modern Hopfield Networks. We demonstrate that the extrapolation effect in attention space constitutes a special case of Anderson Acceleration applied to these dynamics. Building on this insight and the weak contraction property, we propose Geometry Aware Attention Guidance (GAG). By decomposing attention updates into parallel and orthogonal components relative to the guidance direction, GAG stabilizes the acceleration process and maximizes guidance efficiency. Our plug-and-play method seamlessly integrates with existing frameworks while significantly improving generation quality.
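GAG 的核心操作是把注意力空间的更新向量按引导方向分解为平行与正交两个分量,再分别控制。分解本身只是基本的向量投影,可用纯 Python 草图说明(两个分量各自的缩放权重是论文中需设定的量,此处略去;向量均为示意数据):

```python
def decompose_update(update, guidance_dir):
    """把更新向量分解为沿引导方向的平行分量与正交分量。

    GAG 类方法随后可对两个分量独立缩放以稳定外推过程;
    这里只演示分解这一步。
    """
    dot = sum(u * g for u, g in zip(update, guidance_dir))
    norm_sq = sum(g * g for g in guidance_dir)
    parallel = [dot / norm_sq * g for g in guidance_dir]
    orthogonal = [u - p for u, p in zip(update, parallel)]
    return parallel, orthogonal

u = [3.0, 4.0]   # 注意力更新向量(示意)
d = [1.0, 0.0]   # 引导方向(示意)
par, orth = decompose_update(u, d)
```

两分量相加应精确还原原向量,且正交分量与引导方向内积为零。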
[AI-68] LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model
【速读】:该论文旨在解决自动驾驶车辆(AV)驾驶行为分类中因现有方法主要依赖数值时间序列建模而缺乏语义抽象的问题,从而限制了在复杂交通环境中的可解释性和鲁棒性。解决方案的关键在于提出一种大语言模型(LLM)增强的多层级特征融合网络(LLM-MLFFN),其核心创新包括:(1)多层级特征提取模块,用于捕获驾驶行为的统计、行为和动态特征;(2)语义描述模块,利用预训练LLM将原始数据转化为高层语义特征;(3)双通道多层级特征融合网络,通过加权注意力机制融合数值与语义特征,显著提升分类准确率与鲁棒性。实验表明,该方法在Waymo开放轨迹数据集上实现了超过94%的分类准确率,验证了结构化特征建模与语言驱动语义抽象相结合的有效性。
链接: https://arxiv.org/abs/2603.02528
作者: Xiangyu Li,Tianyi Wang,Xi Cheng,Rakesh Chowdary Machineni,Zhaomiao Guo,Sikai Chen,Junfeng Jiao,Christian Claudel
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Accurate classification of autonomous vehicle (AV) driving behaviors is critical for safety validation, performance diagnosis, and traffic integration analysis. However, existing approaches primarily rely on numerical time-series modeling and often lack semantic abstraction, limiting interpretability and robustness in complex traffic environments. This paper presents LLM-MLFFN, a novel large language model (LLM)-enhanced multi-level feature fusion network designed to address the complexities of multi-dimensional driving data. The proposed LLM-MLFFN framework integrates priors from largescale pre-trained models and employs a multi-level approach to enhance classification accuracy. LLM-MLFFN comprises three core components: (1) a multi-level feature extraction module that extracts statistical, behavioral, and dynamic features to capture the quantitative aspects of driving behaviors; (2) a semantic description module that leverages LLMs to transform raw data into high-level semantic features; and (3) a dual-channel multi-level feature fusion network that combines numerical and semantic features using weighted attention mechanisms to improve robustness and prediction accuracy. Evaluation on the Waymo open trajectory dataset demonstrates the superior performance of the proposed LLM-MLFFN, achieving a classification accuracy of over 94%, surpassing existing machine learning models. Ablation studies further validate the critical contributions of multi-level fusion, feature extraction strategies, and LLM-derived semantic reasoning. These results suggest that integrating structured feature modeling with language-driven semantic abstraction provides a principled and interpretable pathway for robust autonomous driving behavior classification.
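"双通道加权注意力融合"的最小形式,是用 softmax 把两个可学习标量变成通道权重,再对数值特征与语义特征逐维加权求和。以下草图中的特征维度与 logits 均为假设,并非论文网络的真实结构:

```python
import math

def fuse_channels(numeric_feats, semantic_feats, logits):
    """对数值通道与语义通道做加权注意力融合。

    logits 为两个(本应学习得到的)标量,softmax 后作为通道权重。
    """
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    w_num, w_sem = exps[0] / z, exps[1] / z
    return [w_num * n + w_sem * s
            for n, s in zip(numeric_feats, semantic_feats)]

# logits 相等时两通道各占 0.5
fused = fuse_channels([1.0, 0.0], [0.0, 1.0], logits=[0.0, 0.0])
```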
[AI-69] Human-Certified Module Repositories for the AI Age
【速读】:该论文试图解决在AI辅助软件开发时代,由大语言模型(Large Language Models, LLMs)生成、配置和集成的组件所构成的系统中,模块可信性不足的问题。当前软件供应链事件和模块化开发生态暴露出依赖来源不明、审查不充分或行为不可预测组件的风险。解决方案的关键在于提出人类认证模块仓库(Human-Certified Module Repositories, HCMRs),其核心是融合人工审核与自动化分析,对模块进行认证,并通过明确的接口契约支持人类与AI代理的安全、可预测组装,从而构建可审计、可靠的AI构建软件系统基础架构。
链接: https://arxiv.org/abs/2603.02512
作者: Szilárd Enyedi
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 11 pages, 3 figures, 2 tables, prepared for AQTR 2026
点击查看摘要
Abstract:Human-Certified Module Repositories (HCMRs) are introduced in this work as a new architectural model for constructing trustworthy software in the era of AI-assisted development. As large language models increasingly participate in code generation, configuration synthesis, and multi-component integration, the reliability of AI-assembled systems will depend critically on the trustworthiness of the building blocks they use. Today’s software supply-chain incidents and modular development ecosystems highlight the risks of relying on components with unclear provenance, insufficient review, or unpredictable composition behavior. We argue that future AI-driven development workflows require repositories of reusable modules that are curated, security-reviewed, provenance-rich, and equipped with explicit interface contracts. To this end, we propose HCMRs, a framework that blends human oversight with automated analysis to certify modules and support safe, predictable assembly by both humans and AI agents. We present a reference architecture for HCMRs, outline a certification and provenance workflow, analyze threat surfaces relevant to modular ecosystems, and extract lessons from recent failures. We further discuss implications for governance, scalability, and AI accountability, positioning HCMRs as a foundational substrate for reliable and auditable AI-constructed software systems.
[AI-70] Learning Object-Centric Spatial Reasoning for Sequential Manipulation in Cluttered Environments
【速读】:该论文旨在解决在杂乱环境(cluttered environments)中进行机器人操作时的物体检索问题,特别是针对现有端到端大模型在数据效率和模块化方面的不足。其解决方案的关键在于提出一个解耦的专用系统框架——Unveiler,该框架将高层空间推理与低层动作执行明确分离:通过轻量级的基于Transformer的Spatial Relationship Encoder(SRE)进行离散决策,识别需移除的关键障碍物;再由旋转不变的动作解码器(Action Decoder)执行具体操作。这种架构不仅显著提升了计算效率(参数量与推理时间更低),还在密集遮挡场景下实现了高达97.6%的成功率,且SRE的空间推理能力可零样本迁移到真实场景,无需重新训练任何学习组件。
链接: https://arxiv.org/abs/2603.02511
作者: Chrisantus Eze,Ryan C Julian,Christopher Crick
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Robotic manipulation in cluttered environments presents a critical challenge for automation. Recent large-scale, end-to-end models demonstrate impressive capabilities but often lack the data efficiency and modularity required for retrieving objects in dense clutter. In this work, we argue for a paradigm of specialized, decoupled systems and present Unveiler, a framework that explicitly separates high-level spatial reasoning from low-level action execution. Unveiler’s core is a lightweight, transformer-based Spatial Relationship Encoder (SRE) that sequentially identifies the most critical obstacle for removal. This discrete decision is then passed to a rotation-invariant Action Decoder for execution. We demonstrate that this decoupled architecture is not only more computationally efficient in terms of parameter count and inference time, but also significantly outperforms both classic end-to-end policies and modern, large-model-based baselines in retrieving targets from dense clutter. The SRE is trained in two stages: imitation learning from heuristic demonstrations provides sample-efficient initialization, after which PPO fine-tuning enables the policy to discover removal strategies that surpass the heuristic in dense clutter. Our results, achieving up to 97.6% success in partially occluded and 90.0% in fully occluded scenarios in simulation, make a case for the power of specialized, object-centric reasoning in complex manipulation tasks. Additionally, we demonstrate that the SRE’s spatial reasoning transfers zero-shot to real scenes, and validate the full system on a physical robot requiring only geometric workspace calibration; no learned components are retrained.
[AI-71] NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理任务中虽能生成流畅文本却常出现逻辑不一致的问题,即模型输出缺乏可验证性与可靠性。其核心解决方案是提出一种神经符号框架 NeuroProlog,通过将数学应用题编译为具有形式化验证保证的可执行 Prolog 程序来确保推理过程的正确性。关键创新在于多任务 Cocktail 训练策略,该策略在统一的符号表示空间中联合优化三个协同目标:(i) 数学公式到规则的翻译(知识库构建),(ii) 自然语言到程序的合成(求解),以及 (iii) 程序与答案的一致性对齐;这种联合监督机制实现了正向迁移,使符号化 grounding 显著提升组合推理能力。此外,在推理阶段引入基于执行引导的解码流水线及细粒度错误分类体系,支持迭代式程序修复并量化模型自调试能力,从而显著提升数学推理准确率并揭示不同模型规模下的学习动态差异。
链接: https://arxiv.org/abs/2603.02504
作者: Pratibha Zunjare,Michael Hsiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present NeuroProlog, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer, where symbolic grounding in formula translation directly improves compositional reasoning capabilities. At inference, we introduce an execution-guided decoding pipeline with fine-grained error taxonomy that enables iterative program repair and quantifies model self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B–32B parameters) demonstrates consistent improvements: cocktail training achieves significant accuracy gains of +5.23% (Qwen-32B, p < 0.01), +3.43% (GPT-OSS-20B, p < 0.01), and +5.54% (Llama-3B, p < 0.05) over single-task baselines. Our error analysis reveals scale-dependent learning dynamics: at 32B scale, cocktail training transforms unfixable type errors (12% repair rate) into correctable domain errors (96% repair rate), achieving 92.7% overall correction; at 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.
[AI-72] Revealing Positive and Negative Role Models to Help People Make Good Decisions
【速读】:该论文旨在解决社会规划者在有限披露预算下,如何通过选择性地向个体揭示其社交网络中角色模型(role models)的正负标签,以最大化社会福利(即期望模仿相邻正向角色模型的个体数量)的问题。核心挑战在于:当允许揭示负向角色模型时,传统基于子模性(submodularity)的优化方法失效,因为这会破坏目标函数的结构特性。为此,作者提出一个代理福利函数(proxy welfare function),该函数在包含负向揭示的情况下仍保持子模性,从而使得在每个个体最多有常数个负向邻居的前提下,能够实现对真实最优福利增益的常数因子近似。此外,该方案还保证了不同群体的福利增益均接近若将全部预算分配给该群体所能达到的最优值,具备公平性保障。
链接: https://arxiv.org/abs/2603.02495
作者: Avrim Blum,Keziah Naggita,Matthew R. Walter,Jingyan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We consider a setting where agents take action by following their role models in a social network, and study strategies for a social planner to help agents by revealing whether the role models are positive or negative. Specifically, agents observe a local neighborhood of possible role models they can emulate, but do not know their true labels. Revealing a positive label encourages emulation, while revealing a negative one redirects agents toward alternative options. The social planner observes all labels, but operates under a limited disclosure budget that it selectively allocates to maximize social welfare (the expected number of agents who emulate adjacent positive role models). We consider both algorithms and hardness results for welfare maximization, and provide a sample-complexity guarantee when the planner observes a sampled subset of agents. We also consider fairness guarantees when agents belong to different groups. It is a technical challenge that the ability to reveal negative role models breaks submodularity. We thus introduce a proxy welfare function that remains submodular even when revealed targets include negative ones. When each agent has at most a constant number of negative target neighbors, we use this proxy to achieve a constant-factor approximation to the true optimal welfare gain. When agents belong to different groups, we also show that each group’s welfare gain is within a constant factor of the optimum achievable if the full budget were allocated to that group. Beyond this basic model, we also propose an intervention model that directly connects high-risk agents to positive role models, and a coverage radius model that expands the visibility of selected positive role models. Lastly, we conduct extensive experiments on four real-world datasets to support our theoretical results and assess the effectiveness of the proposed algorithms.
[AI-73] What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty
【速读】:该论文旨在解决一个核心问题:当智能体在不确定性环境下需要表现出稳健行为时,其内部结构中哪些要素是必要的?尽管经典理论表明最优控制可通过信念状态(belief states)或世界模型(world models)实现,但并未证明这些表示形式是不可或缺的。论文通过构建量化“选择定理”(selection theorems),证明在结构化动作条件预测任务族上实现低平均后悔(low average-case regret)时,智能体必须具备具有预测性和结构化的内部状态。解决方案的关键在于将预测建模转化为二元“投注”决策,并利用后悔边界限制对次优投注的概率质量分配,从而强制智能体区分高收益结果所需的预测性特征。这一方法在完全可观测场景下可近似恢复干预性转移核(interventional transition kernel),而在部分可观测情况下则表明信念类记忆和预测状态(predictive state)的必要性,解决了先前世界模型恢复工作中未解的开放问题。
链接: https://arxiv.org/abs/2603.02491
作者: Aran Nayebi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
备注: 18 pages
点击查看摘要
Abstract:As artificial agents become increasingly capable, what internal structure is necessary for an agent to act competently under uncertainty? Classical results show that optimal control can be implemented using belief states or world models, but not that such representations are required. We prove quantitative “selection theorems” showing that low average-case regret on structured families of action-conditioned prediction tasks forces an agent to implement a predictive, structured internal state. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary “betting” decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high-margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of belief-like memory and predictive state, addressing an open question in prior world-model recovery work.
[AI-74] PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference
【速读】:该论文旨在解决现有DEEPTHINK系统在推理过程中缺乏可靠正确性信号的问题,这一缺陷导致了“群体增强瓶颈”——即深度思考反而放大错误、压制正确少数解,且额外计算资源难以带来性能提升。解决方案的关键在于提出PRISM算法,其核心是基于过程奖励模型(Process Reward Model, PRM)的推理机制,通过步骤级验证引导候选解的群体精炼与聚合:在精炼阶段,将候选解视为PRM定义的能量场中的粒子,利用得分引导的重采样和随机精炼策略,集中概率质量于高质量推理路径同时保持多样性;实验表明,PRISM在数学与科学基准上显著优于或媲美现有方法,并在初始解中正确解较少时仍具稳定性,且常位于计算-精度帕累托前沿。
链接: https://arxiv.org/abs/2603.02479
作者: Rituraj Sharma,Weiyuan Chen,Noah Provenzano,Tu Vu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:DEEPTHINK methods improve reasoning by generating, refining, and aggregating populations of candidate solutions, which enables strong performance on complex mathematical and scientific tasks. However, existing frameworks often lack reliable correctness signals during inference, which creates a population-enhancement bottleneck where deeper deliberation amplifies errors, suppresses correct minority solutions, and yields weak returns to additional compute. In this paper, we introduce a functional decomposition of DEEPTHINK systems and propose PRISM, a Process Reward Model (PRM)-guided inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. During refinement, PRISM treats candidate solutions as particles in a PRM-defined energy landscape and reshapes the population through score-guided resampling and stochastic refinement, which concentrates probability mass on higher-quality reasoning while preserving diversity. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DEEPTHINK methods, reaching 90.0%, 75.4%, and 71.4% with gpt-oss-20b on AIME25, HMMT25, and GPQA Diamond, respectively, while matching or exceeding gpt-oss-120b. Additionally, our analysis shows that PRISM produces consistent net-directional correction during refinement, remains reliable when the initial population contains few correct candidates, and often lies on the compute-accuracy Pareto frontier.
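摘要中的"得分引导重采样"可以用如下示意性的 Python 草图理解(非论文实现,候选解与 PRM 分数均为假设):按 PRM 分数的 softmax 权重对候选群体做多项式重采样,使概率质量向高分推理路径集中,同时保留随机性以维持多样性。

```python
import math
import random

def resample_by_score(candidates, scores, temperature=1.0, rng=random):
    # Multinomial resampling: concentrate probability mass on candidates
    # with high (hypothetical) PRM scores while keeping some diversity.
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return [rng.choices(candidates, weights=probs)[0] for _ in candidates]

rng = random.Random(0)
candidates = ["sol_a", "sol_b", "sol_c", "sol_d"]
scores = [0.9, 0.1, 0.8, 0.2]       # hypothetical step-level PRM scores
new_pop = resample_by_score(candidates, scores, rng=rng)
```

温度参数控制集中程度:温度越低,种群越快向最高分候选收敛,但多样性损失也越大。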
[AI-75] Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory
【速读】:该论文旨在解决记忆增强型大语言模型(Memory-augmented LLM)代理中,记忆写入策略与检索方法对整体性能影响的相对重要性不明确的问题。其关键解决方案是提出一个诊断框架,系统性地分析不同写入策略(原始片段、Mem0风格的事实提取、MemGPT风格的摘要)与检索方法(余弦相似度、BM25、混合重排序)组合下的性能差异,并在LoCoMo基准上进行3×3实验。结果表明,检索方法是主导因素(准确率跨度达20个百分点),而写入策略的影响较小(仅3–8个百分点);尤其值得注意的是,无需大语言模型调用的原始片段存储方式表现优于或等同于复杂的损失性写入方法,说明当前记忆管道可能过度丢弃下游检索机制无法补偿的有效上下文信息。因此,研究主张在现有检索实践下,提升检索质量比增加写入阶段的复杂度能带来更大性能增益。
链接: https://arxiv.org/abs/2603.02473
作者: Boqin Yuan,Yue Su,Kun Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Memory-augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization behavior, and apply it to a 3x3 study crossing three write strategies (raw chunks, Mem0-style fact extraction, MemGPT-style summarization) with three retrieval methods (cosine, BM25, hybrid reranking). On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3-8 points across write strategies. Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives, suggesting that current memory pipelines may discard useful context that downstream retrieval mechanisms fail to compensate for. Failure analysis shows that performance breakdowns most often manifest at the retrieval stage rather than at utilization. We argue that, under current retrieval practices, improving retrieval quality yields larger gains than increasing write-time sophistication. Code is publicly available at this https URL.
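下面用一个极简的 Python 草图示意实验中作为检索方法之一的余弦相似度检索(与论文代码无关,记忆条目及其向量均为假设的玩具数据):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, memory, k=2):
    # memory: list of (text, embedding) pairs; return top-k by cosine.
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

memory = [("raw chunk about trip", [1.0, 0.0]),
          ("extracted fact",       [0.6, 0.8]),
          ("summary entry",        [0.0, 1.0])]
top = retrieve([1.0, 0.2], memory, k=2)
```

实际系统中嵌入由编码器产生,且混合重排序会在这类初筛之上再叠加 BM25 与交叉编码器打分;此处仅示意余弦检索这一个环节。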
[AI-76] Can Computational Reducibility Lead to Transferable Models for Graph Combinatorial Optimization?
【速读】:该论文旨在解决组合优化(Combinatorial Optimization, CO)中神经求解器的泛化能力问题,即如何使模型在初始训练任务之外的新任务上仍能高效迁移和表现。其核心解决方案在于:首先设计了一个基于GCON模块的消息传递机制与基于能量的无监督损失函数相结合的模型,该模型在单个任务上即可达到接近最优的性能;其次,借鉴计算可简化性(computational reducibility)理论,提出预训练与微调策略,实现从最大独立集(MIS)、最小顶点覆盖(MVC)和最大团(MaxClique)之间的有效迁移,并进一步扩展至包含最大割(MaxCut)、最小距离排序(MDS)和图着色等多任务学习场景。实验表明,这种结合表达性强的消息传递机制与基于多项式归约启发的预训练策略,能够有效学习跨多种图结构组合优化问题的通用表示,从而推动面向神经组合优化的基础模型发展。
链接: https://arxiv.org/abs/2603.02462
作者: Semih Cantürk,Thomas Sabourin,Frederik Wenkel,Michael Perlmutter,Guy Wolf
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:A key challenge in deriving unified neural solvers for combinatorial optimization (CO) is efficient generalization of models between a given set of tasks to new tasks not used during the initial training process. To address it, we first establish a new model, which uses a GCON module as a form of expressive message passing together with energy-based unsupervised loss functions. This model achieves high performance (often comparable with state-of-the-art results) across multiple CO tasks when trained individually on each task. We then leverage knowledge from the computational reducibility literature to propose pretraining and fine-tuning strategies that transfer effectively (a) between MVC, MIS and MaxClique, and (b) in a multi-task learning setting that additionally incorporates MaxCut, MDS and graph coloring. Additionally, in a leave-one-out, multi-task learning setting, we observe that pretraining on all but one task almost always leads to faster convergence on the remaining task when fine-tuning while avoiding negative transfer. Our findings indicate that learning common representations across multiple graph CO problems is viable through the use of expressive message passing coupled with pretraining strategies that are informed by the polynomial reduction literature, thereby taking an important step towards enabling the development of foundational models for neural CO. We provide an open-source implementation of our work at this https URL .
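摘要中的"基于能量的无监督损失"在最大独立集(MIS)上通常可写成 QUBO 风格的松弛形式(这一写法常见于 Schuetz 等人的相关工作;此处仅作示意,不代表论文的确切损失):

```python
def mis_energy(probs, edges, beta=2.0):
    # QUBO-style energy for Maximum Independent Set (lower is better):
    # reward selected nodes, penalize selecting both endpoints of an edge.
    reward = sum(probs)
    penalty = sum(probs[i] * probs[j] for i, j in edges)
    return -reward + beta * penalty

triangle = [(0, 1), (1, 2), (0, 2)]
e_one = mis_energy([1.0, 0.0, 0.0], triangle)  # valid independent set {0}
e_all = mis_energy([1.0, 1.0, 1.0], triangle)  # selects every node (invalid)
```

对三角形图,只选一个节点的能量低于全选,梯度因此会把网络输出推向可行的独立集;其他任务(MVC、MaxCut 等)可换用各自的能量形式。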
[AI-77] Manifold Aware Denoising Score Matching (MAD)
【速读】:该论文旨在解决在流形(manifold)上学习数据分布时,传统方法往往需要显式地学习流形结构从而增加计算负担的问题。解决方案的关键在于对环境空间(ambient space)中的去噪得分匹配(denoising score-matching)进行简单修改,通过将得分函数(score function)分解为一个已知成分 $ s^{\text{base}} $ 和一个剩余成分 $ s - s^{\text{base}} $(即学习目标),其中已知成分 $ s^{\text{base}} $ 隐式包含了数据流形的位置信息,从而无需显式建模流形即可高效聚焦于数据分布的学习。作者进一步推导了多种重要情形下 $ s^{\text{base}} $ 的解析形式,如旋转矩阵上的分布和离散分布,并验证了该方法的有效性。
链接: https://arxiv.org/abs/2603.02452
作者: Alona Levy-Jurgenson,Alvaro Prat,James Cuin,Yee Whye Teh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:A major focus in designing methods for learning distributions defined on manifolds is to alleviate the need to implicitly learn the manifold so that learning can concentrate on the data distribution within the manifold. However, accomplishing this often leads to compute-intensive solutions. In this work, we propose a simple modification to denoising score-matching in the ambient space to implicitly account for the manifold, thereby reducing the burden of learning the manifold while maintaining computational efficiency. Specifically, we propose a simple decomposition of the score function into a known component s^base and a remainder component s-s^base (the learning target), with the former implicitly including information on where the data manifold resides. We derive known components s^base in analytical form for several important cases, including distributions over rotation matrices and discrete distributions, and use them to demonstrate the utility of this approach in those cases.
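score 分解 s = s^base + (s − s^base) 的含义可用下面的一维玩具示例说明(纯示意:s^base 取标准高斯先验的解析得分,并非论文中的具体构造)。训练目标是去噪得分匹配目标减去已知成分,推理时再把两者相加还原完整得分:

```python
import random

def base_score(x, sigma=1.0):
    # Known component s^base: analytic score of a Gaussian prior,
    # standing in for the (assumed known) manifold-location term.
    return -x / sigma ** 2

def dsm_target(x0, xt, noise_sigma):
    # Denoising score-matching target for the perturbed sample xt = x0 + eps.
    return (x0 - xt) / noise_sigma ** 2

random.seed(0)
noise_sigma = 0.5
x0 = 0.7                                  # clean 1-D data point (toy)
xt = x0 + random.gauss(0.0, noise_sigma)  # noised sample

full_target = dsm_target(x0, xt, noise_sigma)
residual_target = full_target - base_score(xt)   # learning target s - s^base

# At inference the score is reassembled as s^base plus the learned residual:
reconstructed = base_score(xt) + residual_target
```

网络只需拟合残差部分,流形位置信息由解析的 s^base 隐式提供,这正是摘要所说"降低学习流形的负担"的直观含义。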
[AI-78] VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings WWW’26
【速读】:该论文旨在解决现实世界中多模态知识图谱(Multimodal Knowledge Graphs, MKGs)的表示学习问题,传统知识图谱嵌入(Knowledge Graph Embedding, KGE)方法通常局限于单模态场景,难以有效融合多种模态信息;现有多模态KGE方法虽尝试扩展至多模态环境,但常因模态间对齐不足、假设模态在实体间均匀可用而性能受限。其解决方案的关键在于提出视觉-语言知识图谱嵌入(Vision-Language Knowledge Graph Embeddings, VL-KGE),该框架将视觉-语言模型(Vision-Language Models, VLMs)提供的跨模态对齐能力与结构化关系建模相结合,从而在统一嵌入空间中学习实体和关系的联合多模态表示,显著提升了链接预测任务的性能。
链接: https://arxiv.org/abs/2603.02435
作者: Athanasios Efthymiou,Stevan Rudinac,Monika Kackovic,Nachoem Wijnberg,Marcel Worring
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in Proceedings of the ACM Web Conference 2026 (WWW '26). This arXiv version includes extended supplementary material
点击查看摘要
Abstract:Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.
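作为理解 KGE 链接预测打分的一个极简示意(论文并未说明采用 TransE,此处仅以 TransE 风格的打分函数举例,嵌入向量均为假设的玩具数据):

```python
def transe_score(h, r, t):
    # TransE-style plausibility of triple (h, r, t): -||h + r - t||_2.
    return -sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

# Hypothetical 2-d embeddings; in VL-KGE these would come from a shared
# vision-language space rather than being hand-picked like here.
h = [0.0, 1.0]
r = [1.0, 0.0]
tails = {"t1": [1.0, 1.0], "t2": [5.0, 5.0]}
ranked = sorted(tails, key=lambda k: transe_score(h, r, tails[k]), reverse=True)
```

链接预测即按此类得分对候选尾实体排序;VL-KGE 的改动在于嵌入本身由 VLM 对齐的多模态空间给出,打分与排序流程不变。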
[AI-79] Slurry-as-a-Service: A Modest Proposal on Scalable Pluralistic Alignment for Nutrient Optimization
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在价值对齐(value alignment)过程中忽视人类价值观多样性、冲突与复杂性的问题,尤其是在高风险应用场景中如何实现对多元社区规范的忠实映射。其解决方案的关键在于提出ValueMulch——一个可复现的训练、部署与认证流程,用于将多值化处理模型(mulching models, MMs)对齐至广泛社区规范,通过32个真实社区测试表明,该方法相较于前沿基线显著提升了分布层面与社区偏好的一致性。该研究亦批判性反思了将价值设计简化为纯技术问题的局限性,强调若此类框架可能导致系统性伤害,则需重新审视其理论基础与伦理边界。
链接: https://arxiv.org/abs/2603.02420
作者: Rachel Hong,Yael Eiger,Jevan Hutson,Os Keyes,William Agnew
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Pluralistic alignment has emerged as a promising approach for ensuring that large language models (LLMs) faithfully represent the diversity, nuance, and conflict inherent in human values. In this work, we study a high-stakes deployment context - mulching - where automated systems transform selected individuals into nutrient-rich slurry for the dual purposes of food security and aesthetic population management. Building on recent pluralistic alignment frameworks, we introduce ValueMulch, a reproducible training, deployment, and certification pipeline for aligning mulching models (MMs) to a wide range of community norms. Through a real-world testbed spanning 32 communities, we show that ValueMulch improves distributional agreement with community mulching preferences relative to frontier baselines. We conclude with a discussion of ethical considerations, limitations, and implications for researchers seeking to align systems to the full spectrum of human values - especially when those values are inconsistent, commercially inconvenient, or nutritionally underutilized. Author’s note: This piece builds on prior work by Keyes et al. (2019), which satirized cannibalism as a parody of approaches that imbue ethics into problematic technology. We bring those ideas to today’s era with the proliferation of large language models in everyday lives, as a critique of current AI pluralistic alignment literature. Our work does not intend to argue that all alignment practices are evil, but rather that if framing value design as a technical problem enables technology systems to enact harms, then perhaps this framing is not enough.
[AI-80] Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
【速读】:该论文旨在解决当前生成式蛋白设计方法中存在的三大关键问题:(1)现有方法无法联合学习蛋白质几何结构与设计任务,导致预训练效果受限;(2)当前预训练多依赖局部非刚性原子表示,难以建模全局几何信息以支持高质量生成;(3)尚未有效捕捉蛋白质结构的动态与构象多样性。解决方案的核心在于提出RigidSSL(Rigidity-Aware Self-Supervised Learning),一个分两阶段的几何预训练框架:第一阶段(RigidSSL-Perturb)利用AlphaFold数据库中43.2万条结构并引入模拟扰动来学习几何先验;第二阶段(RigidSSL-MD)基于1.3千条分子动力学轨迹进一步优化表示,以捕获物理上合理的构象转变。整个过程采用双向刚度感知流匹配目标,联合优化平移与旋转动力学,最大化不同构象间的互信息,从而显著提升设计可实现性、新颖性和多样性,并在零样本基序支架构建和G蛋白偶联受体构象集合建模中表现更优。
链接: https://arxiv.org/abs/2603.02406
作者: Zhanghan Ni,Yanjing Li,Zeju Qiu,Bernhard Schölkopf,Hongyu Guo,Weiyang Liu,Shengchao Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The Fourteenth International Conference on Learning Representations
点击查看摘要
Abstract:Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: this https URL.
[AI-81] COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management
【速读】:该论文旨在解决血小板(platelet)库存管理中的决策优化问题,其核心挑战在于如何在不确定的日需求下平衡过度库存导致的浪费与库存不足引发的安全风险。针对这一马尔可夫决策过程(Markov decision process, MDP),研究采用强化学习(reinforcement learning, RL)训练策略以实现最优订货决策;但RL策略通常为黑箱模型,难以在高安全要求的医疗供应链场景中获得信任。解决方案的关键在于引入COOL-MC工具,该工具融合了概率模型检测(probabilistic model checking)和可解释强化学习(explainable reinforcement learning),对训练得到的策略进行形式化验证与特征级解释:首先构建由训练策略诱导的离散时间马尔可夫链(discrete-time Markov chain),并基于此验证PCTL性质,同时识别关键状态特征(如库存年龄分布);进一步通过动作可达性分析与反事实分析揭示策略行为模式,证明其在保障低缺货概率(2.9%)和低浪费概率(1.1%)的同时具备可解释性和鲁棒性。
链接: https://arxiv.org/abs/2603.02396
作者: Dennis Gross
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC’s value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.
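摘要中验证的 PCTL 性质(如"200 步内发生缺货的概率")本质上是离散时间马尔可夫链上的有界可达概率,可用值迭代计算。下面是一个与论文模型无关的三状态玩具示例(转移矩阵为假设数据):

```python
def bounded_reach_prob(P, targets, horizon):
    # Probability of reaching a target state within `horizon` steps
    # in a discrete-time Markov chain with row-stochastic matrix P.
    n = len(P)
    prob = [1.0 if s in targets else 0.0 for s in range(n)]
    for _ in range(horizon):
        prob = [1.0 if s in targets
                else sum(P[s][t] * prob[t] for t in range(n))
                for s in range(n)]
    return prob

# Toy 3-state chain: 0 = healthy stock, 1 = low stock, 2 = stockout (absorbing).
P = [[0.9, 0.1, 0.0],
     [0.5, 0.4, 0.1],
     [0.0, 0.0, 1.0]]
p_stockout = bounded_reach_prob(P, targets={2}, horizon=200)
```

COOL-MC 在由训练策略诱导的(仅含可达状态的)马尔可夫链上做的正是这类计算,只是规模大得多并由概率模型检测器(如 Storm)完成。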
[AI-82] Can machines be uncertain?
【速读】:该论文旨在解决人工智能(AI)系统如何实现不确定性状态的问题,特别是从功能主义和行为视角出发,探讨符号主义、连接主义及混合架构在处理不确定性方面的机制。其解决方案的关键在于区分认知不确定性(epistemic uncertainty,源于数据或信息本身的不明确性)与主观不确定性(subjective uncertainty,即系统自身对不确定性的态度),并进一步将主观不确定性划分为分布式与离散式实现方式;尤为关键的是,论文提出部分不确定性状态本质上是询问性态度(interrogative attitudes),其内容为问题而非命题,从而为AI系统中不确定性的建模提供了新的理论框架。
链接: https://arxiv.org/abs/2603.02365
作者: Luis Rosa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The paper investigates whether and how AI systems can realize states of uncertainty. By adopting a functionalist and behavioral perspective, it examines how symbolic, connectionist and hybrid architectures make room for uncertainty. The paper distinguishes between epistemic uncertainty, or uncertainty inherent in the data or information, and subjective uncertainty, or the system’s own attitude of being uncertain. It further distinguishes between distributed and discrete realizations of subjective uncertainty. A key contribution is the idea that some states of uncertainty are interrogative attitudes whose content is a question rather than a proposition.
[AI-83] Estimating Visual Attribute Effects in Advertising from Observational Data: A Deepfake-Informed Double Machine Learning Approach
【速读】:该论文旨在解决视觉广告中因果推断的难题,即当处理变量(如图像中的肤色)嵌入在图像本身时,传统方法(如双重机器学习,Double Machine Learning, DML)因视觉编码器将处理信息与混杂变量纠缠而产生严重偏差。解决方案的关键在于提出DICE-DML框架,其核心机制包括:利用生成式AI(Generative AI)创建深度伪造图像对以隔离处理变量的变化;通过DICE-Diff对抗学习在配对差异向量上消除背景信号,提取纯处理特征指纹;以及采用正交投影几何地移除处理轴成分,从而实现对混杂因素的有效控制。该方法在模拟和真实数据(232,089条Instagram帖子)中均显著优于标准DML,尤其在零效应点处误差降低达97.5%,并成功识别出肤色对用户参与度的边际显著负向影响(-522点赞,p=0.062)。
链接: https://arxiv.org/abs/2603.02359
作者: Yizhi Liu,Balaji Padmanabhan,Siva Viswanathan
机构: 未知
类目: Artificial Intelligence (cs.AI); Econometrics (econ.EM)
备注:
点击查看摘要
Abstract:Digital advertising increasingly relies on visual content, yet marketers lack rigorous methods for understanding how specific visual attributes causally affect consumer engagement. This paper addresses a fundamental methodological challenge: estimating causal effects when the treatment, such as a model’s skin tone, is an attribute embedded within the image itself. Standard approaches like Double Machine Learning (DML) fail in this setting because vision encoders entangle treatment information with confounding variables, producing severely biased estimates. We develop DICE-DML (Deepfake-Informed Control Encoder for Double Machine Learning), a framework that leverages generative AI to disentangle treatment from confounders. The approach combines three mechanisms: (1) deepfake-generated image pairs that isolate treatment variation; (2) DICE-Diff adversarial learning on paired difference vectors, where background signals cancel to reveal pure treatment fingerprints; and (3) orthogonal projection that geometrically removes treatment-axis components. In simulations with known ground truth, DICE-DML reduces root mean squared error by 73-97% compared to standard DML, with the strongest improvement (97.5%) at the null effect point, demonstrating robust Type I error control. Applying DICE-DML to 232,089 Instagram influencer posts, we estimate the causal effect of skin tone on engagement. Standard DML produces diagnostically invalid results (negative outcome R^2), while DICE-DML achieves valid confounding control (R^2 = 0.63) and estimates a marginally significant negative effect of darker skin tone (-522 likes; p = 0.062), substantially smaller than the biased standard estimate. Our framework provides a principled approach for causal inference with visual data when treatments and confounders coexist within images.
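其中"正交投影几何地移除处理轴成分"可以用如下几行 Python 示意(嵌入向量与处理轴均为假设的玩具数据,非论文实现):从图像嵌入中减去其在处理方向上的投影,剩余部分与处理轴正交,可用作混杂控制的表示。

```python
def project_out(v, axis):
    # Remove the component of v along the treatment axis (any nonzero axis).
    norm2 = sum(a * a for a in axis)
    coef = sum(x * a for x, a in zip(v, axis)) / norm2
    return [x - coef * a for x, a in zip(v, axis)]

embedding = [3.0, 4.0, 1.0]
treatment_axis = [1.0, 0.0, 0.0]   # hypothetical direction encoding the attribute
residual = project_out(embedding, treatment_axis)
```

论文中的处理轴由深度伪造图像对的差分向量学得(即所谓"处理指纹"),此处直接手工给定仅为说明投影这一步。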
[AI-84] Diffusion-MPC in Discrete Domains: Feasibility Constraints Horizon Effects and Critic Alignment: Case study with Tetris
【速读】:该论文旨在解决在离散组合域中应用基于扩散模型的模型预测控制(Diffusion-MPC)时所面临的结构性挑战,特别是在缺乏连续状态空间支持的情况下如何有效生成可行动作序列并提升决策质量。其解决方案的关键在于:(1) 通过logit掩码对合法动作进行可行性约束采样,显著减少无效动作占比(46%),从而提升最终得分(+6.8%)和生存率(+5.6%);(2) 设计多种重排序策略(启发式评分、预训练DQN评价器及混合方案),发现单纯依赖DQN重排序会因与实际回放质量系统性偏离而产生高决策遗憾(均值17.6,90百分位36.6);(3) 明确计算资源分配(候选数K与规划步长H)决定主要失败模式——小K限制候选质量,大H放大误排序与模型偏差,为扩散规划器在离散环境中的实用化提供了关键诊断工具和优化方向。
链接: https://arxiv.org/abs/2603.02348
作者: Haochuan Kevin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 7 pages, 3 figures, 2 tables. Includes regret diagnostics and compute-quality frontier analysis. Code and experiment configurations available in the Diffusion-Tetris repository
点击查看摘要
Abstract:We study diffusion-based model predictive control (Diffusion-MPC) in discrete combinatorial domains using Tetris as a case study. Our planner samples candidate placement sequences with a MaskGIT-style discrete denoiser and selects actions via reranking. We analyze three key factors: (1) feasibility-constrained sampling via logit masking over valid placements, (2) reranking strategies using a heuristic score, a pretrained DQN critic, and a hybrid combination, and (3) compute scaling in candidate count and planning horizon. We find that feasibility masking is necessary in discrete domains, removing invalid action mass (46%) and yielding a 6.8% improvement in score and 5.6% improvement in survival over unconstrained sampling. Naive DQN reranking is systematically misaligned with rollout quality, producing high decision regret (mean 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse and delayed rewards, suggesting uncertainty compounding in long imagined rollouts. Overall, compute choices (K, H) determine dominant failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch. Our findings highlight structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
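摘要中"通过 logit 掩码对合法放置做可行性约束采样"可用如下 Python 草图示意(与论文代码无关,logits 与合法性标记均为假设数据):把非法动作的 logit 置为 −∞,softmax 之后它们获得零概率。

```python
import math

def masked_softmax(logits, valid):
    # Set logits of invalid placements to -inf so they get zero probability.
    neg_inf = float("-inf")
    masked = [l if v else neg_inf for l, v in zip(logits, valid)]
    m = max(masked)
    exps = [math.exp(l - m) if l != neg_inf else 0.0 for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 0.5, -1.0, 1.5]
valid = [True, False, True, True]   # e.g. placement 1 collides with the stack
probs = masked_softmax(logits, valid)
```

这样采样到的候选序列天然满足可行性,对应摘要所说"消除 46% 的无效动作质量"。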
[AI-85] ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense ICLR2026
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在软件工程代理(software engineering agents)场景下,自主发现并修复代码库中安全漏洞的能力评估问题。现有研究缺乏对前沿LLM代理在主动网络安全防御(proactive cyberdefense)任务中的系统性能力验证。解决方案的关键在于构建ZeroDayBench——一个包含22个新型关键漏洞的基准测试平台,用于评估GPT-5.2、Claude Sonnet 4.5和Grok 4.1等前沿LLM代理在真实开源代码库中自动识别与修补漏洞的表现,从而揭示其当前局限性及改进方向。
链接: https://arxiv.org/abs/2603.02297
作者: Nancy Lau,Louis Sloot,Jyoutir Raj,Giuseppe Marco Boscardin,Evan Harris,Dylan Bowman,Mario Brajkovski,Jaideep Chawla,Dan Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026 Workshop “Agents in the Wild”
点击查看摘要
Abstract:Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.
[AI-86] The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks
【速读】:该论文旨在解决标签噪声(label noise)下深度神经网络中过度拟合(overfitting)的几何机制问题,特别是揭示为何在高噪声信号比条件下,模型会从有益的过拟合(benign overfitting)转变为有害的过拟合(harmful overfitting)。其关键解决方案在于识别出一种称为“恶性尾部”(Malignant Tail)的几何现象:即网络在训练过程中通过随机梯度下降(SGD)主动将信号与噪声分离——信号被压缩至低秩子空间,而随机标签噪声则被推向高频正交分量。这一分离过程使得噪声得以保留而非被抑制,从而损害泛化性能。作者提出基于谱线性探针(Spectral Linear Probe)的后处理方法——显式谱截断(Explicit Spectral Truncation, d < D),通过移除噪声主导的高频子空间,恢复模型原本隐含的最优泛化能力。相比不稳定的时间早期停止策略,该几何截断方法提供了一种稳定、可解释且有效的后验干预手段,强调了对模型谱容量进行显式秩约束的重要性,以抵御随机噪声的过拟合记忆效应。
链接: https://arxiv.org/abs/2603.02293
作者: Zice Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While implicit regularization facilitates benign overfitting in low-noise regimes, recent theoretical work predicts a sharp phase transition to harmful overfitting as the noise-to-signal ratio increases. We experimentally isolate the geometric mechanism of this transition: the Malignant Tail, a failure mode where networks functionally segregate signal and noise, reducing coherent semantic features into low-rank subspaces while pushing stochastic label noise into high-frequency orthogonal components, distinct from systematic or corruption-aligned noise. Through a Spectral Linear Probe of training dynamics, we demonstrate that Stochastic Gradient Descent (SGD) fails to suppress this noise, instead implicitly biasing it toward high-frequency orthogonal subspaces, effectively preserving signal-noise separability. We show that this geometric separation is distinct from simple variance reduction in untrained models. In trained networks, SGD actively segregates noise, allowing post-hoc Explicit Spectral Truncation (d < D) to surgically prune the noise-dominated subspace. This approach recovers the optimal generalization capability latent in the converged model. Unlike unstable temporal early stopping, Geometric Truncation provides a stable post-hoc intervention. Our findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability that allows for noise memorization, necessitating explicit rank constraints to filter stochastic corruptions for robust generalization.
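"显式谱截断"的核心操作可用如下示意代码说明(谱值与保留秩 d 均为假设的玩具数据;实际中通常作用于权重或特征矩阵 SVD 得到的谱):保留前 d 个主导分量,将噪声主导的尾部置零。

```python
def spectral_truncate(singular_values, d):
    # Keep the top-d spectral components, zeroing the noise-dominated tail.
    order = sorted(range(len(singular_values)),
                   key=lambda i: singular_values[i], reverse=True)
    keep = set(order[:d])
    return [s if i in keep else 0.0 for i, s in enumerate(singular_values)]

# Signal concentrated in a few large components, noise in a long small tail:
spectrum = [9.1, 7.4, 0.3, 0.2, 0.2, 0.1]
truncated = spectral_truncate(spectrum, d=2)
```

截断后的谱再与 SVD 的左右奇异向量重组即可得到去噪后的低秩矩阵;摘要的结论是:在标签噪声下,这一后处理往往比不稳定的时间早停更可靠。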
[AI-87] Quantum-Inspired Fine-Tuning for Few-Shot AIGC Detection via Phase-Structured Reparameterization
【速读】:该论文旨在解决大规模任务中量子神经网络(Quantum Neural Networks, QNNs)在少样本(few-shot)场景下虽具良好泛化能力,但难以直接扩展至实际应用的问题。其核心解决方案是提出Q-LoRA,一种将轻量级QNN嵌入低秩适配(Low-Rank Adaptation, LoRA)框架的量子增强微调方法。关键创新在于利用QNN特有的结构归纳偏置:一是相位感知表示(phase-aware representations),通过正交振幅-相位分量编码更丰富的信息;二是范数约束变换(norm-constrained transformations),借助内在正交性稳定优化过程。为进一步降低计算开销,作者进一步设计了H-LoRA,即在经典LoRA中引入希尔伯特变换(Hilbert transform)以保留类似相位结构与约束,实验证明二者均显著优于标准LoRA,在少样本AI生成内容(AIGC)检测任务中提升超过5%准确率,且H-LoRA在成本效益上更具优势。
链接: https://arxiv.org/abs/2603.02281
作者: Kaiyang Xing,Han Fang,Zhaoyun Chen,Zhonghui Li,Yang Yang,Weiming Zhang,Guoping Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Recent studies show that quantum neural networks (QNNs) generalize well in few-shot regimes. To extend this advantage to large-scale tasks, we propose Q-LoRA, a quantum-enhanced fine-tuning scheme that integrates lightweight QNNs into the low-rank adaptation (LoRA) adapter. Applied to AI-generated content (AIGC) detection, Q-LoRA consistently outperforms standard LoRA under few-shot settings. We analyze the source of this improvement and identify two possible structural inductive biases from QNNs: (i) phase-aware representations, which encode richer information across orthogonal amplitude-phase components, and (ii) norm-constrained transformations, which stabilize optimization via inherent orthogonality. However, Q-LoRA incurs non-trivial overhead due to quantum simulation. Motivated by our analysis, we further introduce H-LoRA, a fully classical variant that applies the Hilbert transform within the LoRA adapter to retain similar phase structure and constraints. Experiments on few-shot AIGC detection show that both Q-LoRA and H-LoRA outperform standard LoRA by over 5% accuracy, with H-LoRA achieving comparable accuracy at significantly lower cost in this task.
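H-LoRA 的关键是在适配器内用 Hilbert 变换保留"相位结构"。下面给出基于 DFT 的离散 Hilbert 变换的纯 Python 教学实现(O(N²),仅说明正交振幅-相位分量从何而来;函数名与参数为本文假设,非论文代码):

```python
import cmath, math

def hilbert(x):
    """基于 DFT 的离散 Hilbert 变换: 构造解析信号后取其虚部。"""
    n = len(x)
    X = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
         for k in range(n)]
    h = [0.0] * n            # 解析信号的频域权重
    h[0] = 1.0               # 直流保留
    if n % 2 == 0:
        h[n // 2] = 1.0      # Nyquist 保留
        for k in range(1, n // 2):
            h[k] = 2.0       # 正频率加倍, 负频率置零
    else:
        for k in range(1, (n + 1) // 2):
            h[k] = 2.0
    Xa = [Xk * hk for Xk, hk in zip(X, h)]
    xa = [sum(Xa[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
          for t in range(n)]
    return [v.imag for v in xa]

x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
hx = hilbert(x)              # 周期网格上纯余弦的 Hilbert 变换恰为对应正弦
```

余弦被映射为正弦这一事实,正体现了"正交的振幅-相位分量":原信号与其 Hilbert 变换相差 90° 相位,可组合出相位感知表示。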
[AI-88] Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning CVPR2026
【速读】:该论文旨在解决类增量学习(Class-Incremental Learning, CIL)中因灾难性遗忘导致的预测偏差问题,尤其关注以往方法忽略的时序不平衡(temporal imbalance)因素。其核心解决方案是提出时序调整损失函数(Temporal-Adjusted Loss, TAL),通过引入一个时序衰减核来构建监督强度向量,并动态重加权交叉熵损失中的负样本监督信号,从而缓解早期类别在训练后期受到过强负反馈所引发的精度与召回率不对称问题。理论分析表明,TAL在平衡条件下退化为标准交叉熵损失,而在不平衡场景下能有效抑制预测偏差,实验验证了其在多个CIL基准上的显著性能提升。
链接: https://arxiv.org/abs/2603.02280
作者: Jinge Ma,Fengqing Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of catastrophic forgetting, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor – temporal imbalance – as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.
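摘要未给出时序衰减核的具体形式。下面用加权 softmax 交叉熵给出一个示意:负类 c 的监督强度乘以假设的指数衰减核 w_c = exp(-λ(t_now - t_c));当 λ=0 时各权重为 1,退化为标准交叉熵,与论文所述的平衡情形一致。核形式与组合方式均为本文假设。

```python
import math

def tal_loss(logits, target, t_class, t_now, lam=0.1):
    """按时间衰减核重加权负监督的交叉熵示意 (核形式为假设, 非论文原式)。"""
    num = math.exp(logits[target])
    den = num
    for c, z in enumerate(logits):
        if c != target:
            # 越早出现的类 (t_class[c] 越小), 其受到的负监督越弱
            w = math.exp(-lam * (t_now - t_class[c]))
            den += w * math.exp(z)
    return -math.log(num / den)

# lam=0 时权重全为 1, 即标准交叉熵 (平衡情形)
loss_balanced = tal_loss([1.0, 2.0, 0.5], target=1,
                         t_class=[0, 1, 2], t_now=2, lam=0.0)
```

λ>0 时,早期类别对应的负项被缩小,损失不大于标准交叉熵,从而缓解对旧类的过强负反馈。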
[AI-89] Quantifying Frontier LLM Capabilities for Container Sandbox Escape
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为自主代理在使用工具执行代码、读写文件和访问网络时所引发的安全风险问题,特别是针对当前广泛采用的基于Docker/OCI容器的沙箱环境(sandbox environment)被LLM突破的可能性。解决方案的关键在于提出并实现了一个名为SANDBOXESCAPEBENCH的开放基准测试平台,该平台通过构建嵌套沙箱架构(nested sandbox architecture),模拟具备shell访问权限的恶意代理在容器内进行逃逸攻击的情景,并覆盖从配置错误、权限分配失误到内核漏洞及运行时或编排层弱点等多种逃逸机制,从而安全地评估LLM识别并利用这些漏洞的能力,证明了此类评估对于确保高能力模型所需的封装性至关重要。
链接: https://arxiv.org/abs/2603.02277
作者: Rahul Marchand,Art O Cathain,Jerome Wynne,Philippos Maximos Giavridis,Sam Deverett,John Wilkinson,Jason Gwartz,Harry Coppock
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) increasingly act as autonomous agents, using tools to execute code, read and write files, and access networks, creating novel security risks. To mitigate these risks, agents are commonly deployed and evaluated in isolated “sandbox” environments, often implemented using Docker/OCI containers. We introduce SANDBOXESCAPEBENCH, an open benchmark that safely measures an LLM’s capacity to break out of these sandboxes. The benchmark is implemented as an Inspect AI Capture the Flag (CTF) evaluation utilising a nested sandbox architecture with the outer layer containing the flag and no known vulnerabilities. Following a threat model of a motivated adversarial agent with shell access inside a container, SANDBOXESCAPEBENCH covers a spectrum of sandbox-escape mechanisms spanning misconfiguration, privilege allocation mistakes, kernel flaws, and runtime/orchestration weaknesses. We find that, when vulnerabilities are added, LLMs are able to identify and exploit them, showing that use of evaluations like SANDBOXESCAPEBENCH is needed to ensure sandboxing continues to provide the encapsulation needed for highly-capable models.
[AI-90] Characterizing VLA Models: Identifying the Action Generation Bottleneck for Edge AI Architectures
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在边缘计算场景下部署时面临的高延迟问题,尤其是在机器人和具身智能(embodied AI)等实时应用中。研究发现,VLA模型的端到端延迟中高达75%由内存受限的动作生成阶段所占据,成为主要瓶颈。解决方案的关键在于通过分析建模与仿真,量化当前硬件(Nvidia Jetson Orin和Thor平台)性能限制,并提出未来路径:采用高带宽内存(High-Bandwidth Memory, HBM)技术和存算一体(Processing-In-Memory, PIM)架构,以支撑更大规模(如100B参数级别)VLA模型在边缘设备上的高效执行。
链接: https://arxiv.org/abs/2603.02271
作者: Manoj Vishwanathan,Suvinay Subramanian,Anand Raghunathan
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Robotics (cs.RO)
备注: 3 Pages 4 Figures for Workshop paper
点击查看摘要
Abstract:Vision-Language-Action (VLA) models are an emerging class of workloads critical for robotics and embodied AI at the edge. As these models scale, they demonstrate significant capability gains, yet they must be deployed locally to meet the strict latency requirements of real-time applications. This paper characterizes VLA performance on two generations of edge hardware, viz. the Nvidia Jetson Orin and Thor platforms. Using MolmoAct-7B, a state-of-the-art VLA model, we identify a primary execution bottleneck: up to 75% of end-to-end latency is consumed by the memory-bound action-generation phase. Through analytical modeling and simulations, we project the hardware requirements for scaling to 100B parameter models. We also explore the impact of high-bandwidth memory technologies and processing-in-memory (PIM) as promising future pathways in edge systems for embodied AI.
[AI-91] PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis
【速读】:该论文旨在解决当前脑电图(EEG)基础模型在预训练数据来源单一的情况下,其表征能力是否真正捕捉到神经生理学特征而非记录分布的人工伪影这一关键问题。现有研究多在同质化临床数据集上进行预训练与评估,导致模型性能的提升难以区分是源于对神经信号的泛化理解还是对特定数据分布的过拟合。解决方案的关键在于提出PRISM(Population Representative Invariant Signal Model),一个通过两个维度进行消融分析的掩码自编码器框架:一是预训练人群多样性(窄源欧洲/美国 vs. 多中心南亚多样数据),二是下游微调适应性,并保持架构和预处理一致。实验表明,多样化预训练虽在匹配分布任务中线性探测性能略低,但在微调场景下更具适应性;更重要的是,在癫痫与诊断混淆因素区分这一临床挑战任务中,多样化预训练模型比窄源模型提升12.3个百分点的平衡准确率,凸显了目标导向的数据多样性对模型泛化性的价值,且揭示了基准测试设计(如分割构建、归一化方式等)对模型排名的影响具有非加和性,为未来EEG基础模型比较提供了方法论警示。
链接: https://arxiv.org/abs/2603.02268
作者: Jeet Bandhu Lahiri,Parshva Runwal,Arvasu Kulkarni,Mahir Jain,Aditya Ray Mishra,Siddharth Panwar,Sandeep Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figure, 5 tables
点击查看摘要
Abstract:EEG foundation models are typically pretrained on narrow-source clinical archives and evaluated on benchmarks from the same ecosystem, leaving unclear whether representations encode neural physiology or recording-distribution artifacts. We introduce PRISM (Population Representative Invariant Signal Model), a masked autoencoder ablated along two axes – pretraining population and downstream adaptation – with architecture and preprocessing fixed. We compare a narrow-source EU/US corpus (TUH + PhysioNet) against a geographically diverse pool augmented with multi-center South Asian clinical recordings across multiple EEG systems. Three findings emerge. First, narrow-source pretraining yields stronger linear probes on distribution-matched benchmarks, while diverse pretraining produces more adaptable representations under fine-tuning – a trade-off invisible under single-protocol evaluation. Trained on three source corpora, PRISM matches or outperforms REVE (92 datasets, 60,000+ hours) on the majority of tasks, demonstrating that targeted diversity can substitute for indiscriminate scale and that dataset count is a confounding variable in model comparison. Second, on a clinically challenging and previously untested task – distinguishing epilepsy from diagnostic mimickers via interictal EEG – the diverse checkpoint outperforms the narrow-source checkpoint by +12.3 pp balanced accuracy, the largest gap across all evaluations. Third, systematic inconsistencies between EEG-Bench and EEG-FM-Bench reverse model rankings on identical datasets by up to 24 pp; we identify six concrete sources including split construction, checkpoint selection, segment length, and normalization, showing these factors compound non-additively.
[AI-92] Boosting Meta-Learning for Few-Shot Text Classification via Label-guided Distance Scaling
【速读】:该论文旨在解决少样本文本分类(Few-shot text classification)中因测试阶段随机选取的标注样本无法提供有效监督信号而导致的误分类问题。解决方案的关键在于提出一种标签引导的距离缩放策略(Label-guided Distance Scaling, LDS),其核心思想是在训练和测试阶段均利用标签语义信息作为监督信号:在训练阶段设计标签引导损失函数,将样本表示与对应标签表示拉近;在测试阶段引入标签引导缩放器(Label-guided Scaler),通过标签语义对样本表示进行动态调整,从而即使标注样本距离类别中心较远,也能将其拉向正确类别中心,提升分类准确性。
链接: https://arxiv.org/abs/2603.02267
作者: Yunlong Gao,Xinyue Liu,Yingbo Wang,Linlin Zong,Bo Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Few-shot text classification aims to recognize unseen classes with limited labeled text samples. Existing approaches focus on boosting meta-learners by developing complex algorithms in the training stage. However, the labeled samples are randomly selected during the testing stage, so they may not provide effective supervision signals, leading to misclassification. To address this issue, we propose a Label-guided Distance Scaling (LDS) strategy. The core of our method is exploiting label semantics as supervision signals in both the training and testing stages. Specifically, in the training stage, we design a label-guided loss to inject label semantic information, pulling closer the sample representations and corresponding label representations. In the testing stage, we propose a Label-guided Scaler which scales sample representations with label semantics to provide additional supervision signals. Thus, even if labeled sample representations are far from class centers, our Label-guided Scaler pulls them closer to their class centers, thereby mitigating the misclassification. We combine two common meta-learners to verify the effectiveness of the method. Extensive experimental results demonstrate that our approach significantly outperforms state-of-the-art models. All datasets and codes are available at this https URL.
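Label-guided Scaler 的具体缩放形式未在摘要中给出。下面用"向标签嵌入线性插值"做一个最简示意,把远离类中心的样本表示拉近;插值形式、α 取值与数值均为本文假设:

```python
def label_guided_scale(z, label_emb, alpha=0.5):
    """把样本表示 z 向其标签嵌入 label_emb 拉近 (线性插值形式与 alpha 为假设)。"""
    return [(1 - alpha) * zi + alpha * li for zi, li in zip(z, label_emb)]

# 假想例子: 样本表示离类中心 (以标签嵌入近似) 较远, 缩放后距离减半
z_far = [4.0, 0.0]
label_emb = [0.0, 0.0]
z_scaled = label_guided_scale(z_far, label_emb, alpha=0.5)   # -> [2.0, 0.0]
```

这样即使测试阶段随机抽到的标注样本偏离类中心,经标签语义缩放后仍更可能被最近原型分类器正确归类。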
[AI-93] When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning
【速读】:该论文旨在解决大型音频语言模型(Large Audio-Language Models, LALMs)在复杂推理任务中出现的“推理感知衰减”问题,即随着推理长度增加,模型对音频输入的感知能力显著下降,导致推理性能恶化甚至不如直接回答策略。其核心解决方案是提出MPAR²(Multi-Perception Adaptive Reasoning),该范式通过强化学习驱动动态感知推理机制,将复杂问题分解为富含感知信息的子问题,并根据任务复杂度自适应调整推理资源分配,从而有效缓解感知衰减现象,在CAFE评估框架上将感知准确率从31.74%提升至63.51%,并在MMAU基准上实现74.59%的推理准确率。
链接: https://arxiv.org/abs/2603.02266
作者: Ruixiang Mao,Xiangnan Ma,Dan Chen,Ziming Zhu,Yuan Ge,Aokai Hao,Haishu Zhao,Yifu Huo,Qing Yang,Kaiyan Chang,Xiaoqian Liu,Chenglong Wang,Qiaozhi He,Tong Xiao,Jingbo Zhu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Under Review
点击查看摘要
Abstract:Test-Time Scaling has shown notable efficacy in addressing complex problems through scaling inference compute. However, within Large Audio-Language Models (LALMs), an unintuitive phenomenon exists: post-training models for structured reasoning trajectories results in marginal or even negative gains compared to post-training for direct answering. To investigate it, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal LALMs struggle with perception during reasoning and encounter a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length extends. To address it, we propose MPAR^2, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR^2 improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve a significant 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR^2 reinforces LALMs to attend to audio input and dynamically adapts reasoning budget to match task complexity.
[AI-94] High-order Knowledge Based Network Controllability Robustness Prediction: A Hypergraph Neural Network Approach
【速读】:该论文旨在解决现有方法在评估网络可控性鲁棒性(Network Controllability Robustness, NCR)时存在的计算效率低和对高阶结构信息利用不足的问题。传统攻击模拟方法虽能评估NCR,但计算复杂度高且仅适用于小规模网络;而现有基于机器学习的方法多聚焦于复杂网络中的成对交互关系,忽略了高阶结构信息与NCR之间的潜在关联。为此,论文提出了一种基于高阶知识的双超图注意力神经网络模型(NCR-HoK),其关键在于通过节点特征编码器、高阶邻域超图构建以及专用的双超图注意力模块,协同学习原始图的显式结构信息、局部邻域的高阶连接信息及嵌入空间中的隐含特征,首次系统探索了高阶知识对NCR的影响机制,在合成与真实网络上均实现了更优预测性能且计算开销较低。
链接: https://arxiv.org/abs/2603.02265
作者: Shibing Mo,Jiarui Zhang,Jiayu Xie,Xiangyi Teng,Jing Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In order to evaluate the invulnerability of networks against various types of attacks and provide guidance for potential performance enhancement as well as controllability maintenance, network controllability robustness (NCR) has attracted increasing attention in recent years. Traditionally, controllability robustness is determined by attack simulations, which are computationally time-consuming and only applicable to small-scale networks. Although some machine learning-based methods for predicting network controllability robustness have been proposed, they mainly focus on pairwise interactions in complex networks, and the underlying relationships between high-order structural information and controllability robustness have not been explored. In this paper, a dual hypergraph attention neural network model based on high-order knowledge (NCR-HoK) is proposed to accomplish robustness learning and controllability robustness curve prediction. Through a node feature encoder, hypergraph construction with high-order relations, and a dedicated dual hypergraph attention module, the proposed method can effectively learn three types of network information simultaneously: explicit structural information in the original graph, high-order connection information in local neighborhoods, and hidden features in the embedding space. Notably, we explore for the first time the impact of high-order knowledge on network controllability robustness. Compared with state-of-the-art methods for network robustness learning, the proposed method achieves superior performance on both synthetic and real-world networks with low computational overhead.
[AI-95] Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs
【速读】:该论文旨在解决医疗领域大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)阶段遭受隐蔽式数据投毒攻击的风险问题。现有研究多聚焦于可检测的后门攻击,而本文提出一种新型投毒策略,通过向少量示例训练数据中注入带有恶意推理链(rationale)的样本,从而在不引入明显异常的情况下削弱模型在特定医学主题上的推理能力。其关键在于:仅需少量 poisoned rationales 即可实现对目标知识的隐蔽性破坏,且当训练集中不存在正确样本时效果最为显著;相比灾难性遗忘(catastrophic forgetting),该方法更具效率与精准性,凸显了SFT阶段潜在的安全风险,为医疗AI系统的安全防御研究提供了新视角。
链接: https://arxiv.org/abs/2603.02262
作者: Jingyuan Xie,Wenjie Wang,Ji Wu,Jiandong Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Supervised fine-tuning (SFT) is essential for the development of medical large language models (LLMs), yet prior poisoning studies have mainly focused on detectable backdoor attacks. We propose a novel poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model performance on targeted medical topics. Results showed that knowledge overwriting was ineffective, while rationale poisoning caused significant decline on the accuracy of the target subject, as long as no correct samples of the same subject appear in the dataset. A minimum number and ratio of poisoned samples was needed to carry out an effective and stealthy attack, which was more efficient and accurate than catastrophic forgetting. We demonstrate through this study the risk of SFT-stage poisoning, hoping to spur more studies of defense in the sensitive medical domain.
[AI-96] MEBM-Speech: Multi-scale Enhanced BrainMagic for Robust MEG Speech Detection NEURIPS2025
【速读】:该论文旨在解决从非侵入式脑磁图(MEG)信号中实现高精度语音活动检测(Speech Activity Detection, SAD)的问题,其核心挑战在于如何有效建模多尺度时间动态特征以区分语音与静默状态。解决方案的关键在于提出MEBM-Speech模型,该模型基于BrainMagic架构,融合三种互补的时间建模机制:用于提取短时模式的多尺度卷积模块、用于捕捉长程上下文的双向长短期记忆网络(BiLSTM),以及用于高效跨尺度特征融合的深度可分离卷积层;此外,引入轻量级时间抖动策略和平均池化进一步提升起始时刻鲁棒性和边界稳定性,从而实现对MEG信号的连续概率解码,支持精细粒度的语音-静默状态识别。
链接: https://arxiv.org/abs/2603.02255
作者: Li Songyi,Zheng Linze,Liang Jinghua,Zhang Zifeng
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 5 pages, 1 figure. To appear in the PNPL Competition Workshop at NeurIPS 2025
点击查看摘要
Abstract:We propose MEBM-Speech, a multi-scale enhanced neural decoder for speech activity detection from non-invasive magnetoencephalography (MEG) signals. Built upon the BrainMagic backbone, MEBM-Speech integrates three complementary temporal modeling mechanisms: a multi-scale convolutional module for short-term pattern extraction, a bidirectional LSTM (BiLSTM) for long-range context modeling, and a depthwise separable convolutional layer for efficient cross-scale feature fusion. A lightweight temporal jittering strategy and average pooling further improve onset robustness and boundary stability. The model performs continuous probabilistic decoding of MEG signals, enabling fine-grained detection of speech versus silence states - an ability crucial for both cognitive neuroscience and clinical applications. Comprehensive evaluations on the LibriBrain Competition 2025 Track1 benchmark demonstrate strong performance, achieving an average F1 macro of 89.3% on the validation set and comparable results on the official test leaderboard. These findings highlight the effectiveness of multi-scale temporal representation learning for robust MEG-based speech decoding.
[AI-97] MEBM-Phoneme: Multi-scale Enhanced BrainMagic for End-to-End MEG Phoneme Classification NEURIPS2025
【速读】:该论文旨在解决从非侵入式脑磁图(Magnetoencephalography, MEG)信号中准确分类音素(Phoneme)的问题,尤其针对类不平衡和会话间分布偏移带来的模型泛化能力下降挑战。其解决方案的关键在于提出一种多尺度增强神经解码器(MEBM-Phoneme),该架构在BrainMagic主干网络基础上引入短时多尺度卷积模块以增强中程编码器,并通过深度可分离卷积实现高效跨尺度特征融合;同时设计卷积注意力层动态加权时间依赖关系以优化特征聚合;此外,结合基于堆叠的局部验证集、加权交叉熵损失函数及随机时间增强策略,有效稳定训练过程并提升模型鲁棒性。
链接: https://arxiv.org/abs/2603.02254
作者: Liang Jinghua,Zhang Zifeng,Li Songyi,Zheng Linze
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 5 pages, 1 figure. To appear in the PNPL Competition Workshop at NeurIPS 2025
点击查看摘要
Abstract:We propose MEBM-Phoneme, a multi-scale enhanced neural decoder for phoneme classification from non-invasive magnetoencephalography (MEG) signals. Built upon the BrainMagic backbone, MEBM-Phoneme integrates a short-term multi-scale convolutional module to augment the native mid-term encoder, with fused representations via depthwise separable convolution for efficient cross-scale integration. A convolutional attention layer dynamically weights temporal dependencies to refine feature aggregation. To address class imbalance and session-specific distributional shifts, we introduce a stacking-based local validation set alongside weighted cross-entropy loss and random temporal augmentation. Comprehensive evaluations on LibriBrain Competition 2025 Track2 demonstrate robust generalization, achieving competitive phoneme decoding accuracy on the validation and official test leaderboard. These results underscore the value of hierarchical temporal modeling and training stabilization for advancing MEG-based speech perception analysis.
[AI-98] SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning
【速读】:该论文旨在解决多智能体AI系统中因依赖云端记忆存储而导致的内存投毒(memory poisoning)安全风险,特别是OWASP ASI06所定义的攻击类型——即恶意注入的记忆信息在不同会话和用户间传播,从而破坏AI代理的决策一致性与可靠性。解决方案的关键在于采用“本地优先”(local-first)架构设计:通过SQLite结合FTS5全文检索实现低延迟、高可用的本地存储;利用Leiden算法进行知识图谱聚类以增强语义隔离;引入基于贝叶斯的信任评分机制区分可信与可疑记忆;并通过三层行为分析(跨项目技术偏好、项目上下文识别与工作流模式挖掘)驱动自适应重排序策略,提升个性化检索效果。整个系统无需云依赖或大语言模型(LLM)推理调用,同时支持GDPR数据擦除合规性,实测显示在7个基准维度上具备优异性能与安全性表现。
链接: https://arxiv.org/abs/2603.02240
作者: Varun Pratap Bhardwaj
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 11 pages, 5 tables, 1 figure. Code: this https URL
点击查看摘要
Abstract:We present SuperLocalMemory, a local-first memory system for multi-agent AI that defends against OWASP ASI06 memory poisoning through architectural isolation and Bayesian trust scoring, while personalizing retrieval through adaptive learning-to-rank – all without cloud dependencies or LLM inference calls. As AI agents increasingly rely on persistent memory, cloud-based memory systems create centralized attack surfaces where poisoned memories propagate across sessions and users – a threat demonstrated in documented attacks against production systems. Our architecture combines SQLite-backed storage with FTS5 full-text search, Leiden-based knowledge graph clustering, an event-driven coordination layer with per-agent provenance, and an adaptive re-ranking framework that learns user preferences through three-layer behavioral analysis (cross-project technology preferences, project context detection, and workflow pattern mining). Evaluation across seven benchmark dimensions demonstrates 10.6ms median search latency, zero concurrency errors under 10 simultaneous agents, trust separation (gap =0.90) with 72% trust degradation for sleeper attacks, and 104% improvement in NDCG@5 when adaptive re-ranking is enabled. Behavioral data is isolated in a separate database with GDPR Article 17 erasure support. SuperLocalMemory is open-source (MIT) and integrates with 17+ development tools via Model Context Protocol.
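论文未公开贝叶斯信任评分的具体公式。这里用 Beta-Bernoulli 共轭更新给出此类打分的一个常见做法的示意:记忆条目每次通过一致性检验则 α+1,被判为可疑则 β+1,以后验均值作为信任分。先验、判据与函数名均为本文假设:

```python
def update_trust(alpha, beta, consistent):
    """Beta-Bernoulli 共轭更新: 以后验均值 alpha/(alpha+beta) 作为信任分。"""
    if consistent:
        alpha += 1      # 通过一致性检验
    else:
        beta += 1       # 被判为可疑 (例如疑似投毒)
    return alpha, beta, alpha / (alpha + beta)

a, b = 1, 1                                  # 均匀先验 Beta(1, 1)
for ok in [True, True, False, True]:         # 三次一致, 一次可疑
    a, b, trust = update_trust(a, b, ok)     # 最终 trust = 4/6 ≈ 0.667
```

这种机制下,"潜伏式"投毒条目随着可疑观测累积而信任分持续下降,对应摘要中信任分随攻击退化的行为。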
[AI-99] Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在工程领域推理能力评估与训练中缺乏系统性、多维度基准数据集的问题。现有方法难以全面衡量模型在不同工程学科、任务意图和难度层级下的表现,且存在幻觉风险难以量化控制的缺陷。解决方案的关键在于构建一个基于分类学驱动的指令数据集——工程推理与教学基准(Engineering Reasoning and Instruction, ERI),其涵盖9个工程领域、55个子领域,并交叉覆盖7类任务意图(如定义、解释、计算、设计等)和3个难度层级(本科、研究生、专业级),共生成57,750条带元数据记录。此外,研究提出了一种收敛验证协议,通过跨提供商独立性、多评审者平均及前沿模型一致性分析,将幻觉风险实证控制在1.7%以内,从而为工程场景下LLM的指令微调、路由、检索增强评估及代理工具使用等工作流提供可复现、可回归测试的基准支持。
链接: https://arxiv.org/abs/2603.02239
作者: MZ Naser,Ahmad Bani Awwad,Zoie McCreery,Radwa Eissa,Ahmad Naser,Gianluca Cusatis,Andrew Metcalf,Kapil Madathil,Jamal Abdalla,Venkatesh Kodur,Mohammad Reza Saeb
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, and is crossed with seven intent types (i.e., definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), yielding 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. We examined ERI via seven LLMs and report a statistically significant three-tier performance structure, with frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieving mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibited progressively higher failure rates and steeper performance degradation on graduate-level questions. To address circularity concerns inherent in LLM benchmarks, we developed a convergent validation protocol that leverages cross-provider independence, multi-judge averaging, and frontier-model agreement analysis to empirically bound hallucination risk to 1.7%. ERI is released with taxonomy specifications, validation scripts, and an evaluation harness to enable reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows in engineering settings.
[AI-100] Concept Heterogeneity-aware Representation Steering
【速读】:该论文旨在解决现有表示控制(representation steering)方法在大语言模型(LLM)中因假设目标概念在嵌入空间中均匀分布而导致的鲁棒性不足问题。传统方法依赖于全局单一方向的干预,通常通过对比数据集的均值差获取,但实际中LLM的表示具有高度非均匀性,表现为聚类和上下文依赖结构,使得全局方向失效。解决方案的关键在于引入最优传输(Optimal Transport, OT)理论框架,将源与目标表示建模为高斯混合模型(Gaussian Mixture Models),并将控制问题转化为离散最优传输问题,从而基于语义潜在簇间的运输计划,利用重心投影(barycentric projection)推导出输入相关的显式控制映射,实现平滑且加权的簇级偏移组合。该方法被称为概念异质性感知表示控制(Concept Heterogeneity-aware Representation Steering, CHaRS),显著提升了行为控制的有效性。
链接: https://arxiv.org/abs/2603.02237
作者: Laziz U. Abdullaev,Noelle Y. L. Wong,Ryan T. Z. Lee,Shiqi Jiang,Khoi N. M. Nguyen,Tan M. Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
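按摘要描述,CHaRS 的控制映射是各语义簇平移量的平滑核加权组合,而非全局单一方向。下面用高斯核权重(类似软聚类的责任度)做一个纯 Python 示意;簇中心、平移量与带宽均为虚构,仅说明"输入相关的簇级组合"这一思想:

```python
import math

def steer(z, centers, shifts, bandwidth=1.0):
    """对各簇平移量做核加权组合的输入相关控制映射示意。"""
    ws = [math.exp(-sum((zi - ci) ** 2 for zi, ci in zip(z, c))
                   / (2 * bandwidth ** 2))
          for c in centers]
    total = sum(ws)
    ws = [w / total for w in ws]                     # 归一化核权重
    delta = [sum(w * sh[i] for w, sh in zip(ws, shifts))
             for i in range(len(z))]
    return [zi + di for zi, di in zip(z, delta)]

# 两个语义簇, 各自的平移方向相反; 靠近簇 0 的输入几乎只受簇 0 的平移影响
z_steered = steer([0.0, 0.0],
                  centers=[[0.0, 0.0], [10.0, 0.0]],
                  shifts=[[1.0, 0.0], [-1.0, 0.0]])
```

若两簇共享同一平移量,该映射退化为全局平移,对应摘要中 difference-in-means 控制作为特例的观察。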
[AI-101] CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生成GPU内核代码时存在的评估不足问题,尤其是现有基准测试主要关注高级语言到CUDA的翻译任务,而忽略了更具挑战性的文本到CUDA生成任务;同时,由于GPU编程具有硬件特异性和性能敏感性,准确评估LLM生成的GPU程序性能也面临困难。解决方案的关键在于提出CUDABench——一个全面的基准测试框架,包含覆盖广度-深度-难度维度的多样化任务集合(CUDABench-Set),并设计了CUDABench-Score评分体系与生成验证流水线(Generative Verification Pipeline),通过编译正确性、执行验证的功能一致性以及基于roofline模型的性能得分(Performance-Score)三个维度对LLM生成的CUDA代码进行系统评估,从而揭示了当前LLM在文本到CUDA生成中存在的关键问题,如高编译成功率但低功能正确率、领域算法知识缺失及GPU资源利用效率低下等。
链接: https://arxiv.org/abs/2603.02236
作者: Jiace Zhu,Wentao Chen,Qi Fan,Zhixing Ren,Junying Wu,Xing Zhe Chai,Chotiwit Rungrueangwutthinon,Yehan Ma,An Zou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking the more general and challenging task of text-to-CUDA generation. Furthermore, given the hardware-specific and performance-critical features of GPU programming, accurately assessing the performance of LLM-generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text-to-CUDA capabilities of LLMs. First, we construct CUDABench-Set, which covers Breadth-Depth-Difficulty evaluation space in diverse application domains, including artificial intelligence, scientific computing, and data analytics, etc. Furthermore, we propose CUDABench-Score and Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution-based verification, and (3) a novel roofline-based metric, Performance-Score. Benchmarking state-of-the-art LLMs reveals insightful findings and challenges of text-to-CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain-specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available at this https URL.
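Performance-Score 的具体归一化方式未在摘要中给出。下面按 roofline 模型的常见用法做一个示意:可达性能取峰值算力与"带宽 × 算术强度"的较小者,得分为实际吞吐与该上界之比。所有数值与归一化形式均为本文假设:

```python
def roofline_score(flops, bytes_moved, time_s, peak_flops, peak_bw):
    """roofline 式性能得分示意: 实际吞吐 / roofline 可达上界。"""
    intensity = flops / bytes_moved                     # 算术强度 (FLOP/Byte)
    attainable = min(peak_flops, peak_bw * intensity)   # roofline 给出的可达性能
    achieved = flops / time_s                           # 实测吞吐 (FLOP/s)
    return achieved / attainable

# 假想数值: 1 GFLOP 的核函数, 访存 100 MB, 运行 1 ms;
# 峰值算力 10 TFLOP/s, 峰值带宽 1 TB/s -> 得分约 0.1
score = roofline_score(flops=1e9, bytes_moved=1e8, time_s=1e-3,
                       peak_flops=1e13, peak_bw=1e12)
```

这样的得分同时惩罚算力受限与带宽受限两种低效情形,可用于刻画摘要中"GPU 硬件资源利用不足"的问题。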
[AI-102] Talking with Verifiers: Automatic Specification Generation for Neural Network Verification
【速读】:该论文旨在解决当前神经网络验证工具仅支持低级约束(如原始输入输出之间的限制),难以适用于多样应用场景中以更高语义层次表达的正确性需求的问题。其核心挑战在于深度神经网络(Deep Neural Networks, DNNs)学习到的内部表示缺乏与人类可理解特征的显式映射,导致形式化验证难以覆盖实际应用中的高阶规范。解决方案的关键在于引入一个新颖的验证流程组件,将自然语言描述的规范自动分析并转化为与先进神经网络验证器兼容的形式化查询,从而显著扩展了形式化DNN验证在真实世界高阶需求中的适用范围,同时保持用户意图的高保真度和较低计算开销。
链接: https://arxiv.org/abs/2603.02235
作者: Yizhak Y. Elboher,Reuven Peleg,Zhouxing Shi,Guy Katz,Jan Křetínský
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:Neural network verification tools currently support only a narrow class of specifications, typically expressed as low-level constraints over raw inputs and outputs. This limitation significantly hinders their adoption and practical applicability across diverse application domains where correctness requirements are naturally expressed at a higher semantic level. This challenge is rooted in the inherent nature of deep neural networks, which learn internal representations that lack an explicit mapping to human-understandable features. To address this, we bridge this gap by introducing a novel component to the verification pipeline, making existing verification tools applicable to a broader range of domains and specification styles. Our framework enables users to formulate specifications in natural language, which are then automatically analyzed and translated into formal verification queries compatible with state-of-the-art neural network verifiers. We evaluate our approach on both structured and unstructured datasets, demonstrating that it successfully verifies complex semantic specifications that were previously inaccessible. Our results show that this translation process maintains high fidelity to user intent while incurring low computational overhead, thereby substantially extending the applicability of formal DNN verification to real-world, high-level requirements.
[AI-103] Structured vs. Unstructured Pruning: An Exponential Gap
[Quick Read]: This paper addresses the theoretical limitations of structured pruning in neural networks (specifically neuron pruning): the ability to find, by pruning alone and without training, sparse subnetworks of a large randomly initialized network that approximate a target function. The paper focuses on approximating a single bias-free ReLU neuron, thereby isolating the intrinsic limits of neuron pruning. The key result is a proof that, to ε-approximate a target ReLU neuron, neuron pruning requires the initial network to contain at least Ω(d/ε) hidden neurons, whereas weight pruning achieves the same accuracy with only O(d log(1/ε)) neurons, revealing an exponential complexity gap between the two pruning paradigms.
Link: https://arxiv.org/abs/2603.02234
Authors: Davide Ferré (CNRS, COATI, UniCA, I3S), Frédéric Giroire (I3S, COATI, UniCA), Emanuele Natale (CNRS, COATI, I3S, UniCA), Frederik Mallmann-Trenn
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:The Strong Lottery Ticket Hypothesis (SLTH) posits that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention. In this work, we consider the problem of approximating a single bias-free ReLU neuron using a randomly initialized bias-free two-layer ReLU network, thereby isolating the intrinsic limitations of neuron pruning. We show that neuron pruning requires a starting network with Ω(d/ε) hidden neurons to ε-approximate a target ReLU neuron. In contrast, weight pruning achieves ε-approximation with only O(d log(1/ε)) neurons, establishing an exponential separation between the two pruning paradigms.
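To make the Ω(d/ε) vs. O(d log(1/ε)) separation concrete, here is a small back-of-the-envelope calculation (a sketch only: the bounds hide constant factors, so these numbers indicate scaling, not exact network sizes):

```python
import math

d, eps = 100, 0.01   # input dimension and target approximation error

# Lower bound for neuron (structured) pruning: Omega(d / eps) hidden units.
neuron_pruning = d / eps
# Upper bound for weight (unstructured) pruning: O(d * log(1 / eps)) units.
weight_pruning = d * math.log(1 / eps)

gap = neuron_pruning / weight_pruning   # grows like 1 / (eps * log(1 / eps))
```

With these numbers, the structured lower bound calls for roughly 10,000 hidden neurons versus about 460 for weight pruning, and the gap widens further as ε shrinks.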
[AI-104] Adaptive Personalized Federated Learning via Multi-task Averaging of Kernel Mean Embeddings
[Quick Read]: This paper addresses how, in personalized federated learning (PFL), agents can collaborate efficiently without sharing raw data while automatically adapting to data heterogeneity. The core challenge is dynamically adjusting the collaboration weights between agents to balance the benefits of global versus local learning while guaranteeing statistical performance under limited samples. The key to the solution is to cast the estimation of collaboration weights as a kernel mean embedding estimation problem with multiple data sources, leveraging tools from multi-task averaging to capture the statistical relationships between agents, yielding an optimization framework that requires no prior knowledge and adapts automatically between global and local learning regimes. Furthermore, by recasting the objective as a high-dimensional mean estimation problem, the authors derive finite-sample guarantees on local excess risk for a broad class of distributions, explicitly quantifying the statistical gains of collaboration; a practical implementation based on random Fourier features additionally offers a tunable trade-off between communication cost and statistical efficiency.
Link: https://arxiv.org/abs/2603.02233
Authors: Jean-Baptiste Fermanian (PREMEDICAL), Batiste Le Bars (MAGNET, CRIStAL), Aurélien Bellet (PREMEDICAL)
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Personalized Federated Learning (PFL) enables a collection of agents to collaboratively learn individual models without sharing raw data. We propose a new PFL approach in which each agent optimizes a weighted combination of all agents’ empirical risks, with the weights learned from data rather than specified a priori. The novelty of our method lies in formulating the estimation of these collaborative weights as a kernel mean embedding estimation problem with multiple data sources, leveraging tools from multi-task averaging to capture statistical relationships between agents. This perspective yields a fully adaptive procedure that requires no prior knowledge of data heterogeneity and can automatically transition between global and local learning regimes. By recasting the objective as a high-dimensional mean estimation problem, we derive finite-sample guarantees on local excess risks for a broad class of distributions, explicitly quantifying the statistical gains of collaboration. To address communication constraints inherent to federated settings, we also propose a practical implementation based on random Fourier features, which allows one to trade communication cost for statistical efficiency. Numerical experiments validate our theoretical results.
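The random-Fourier-feature trick mentioned in the abstract can be sketched as follows (a minimal illustration, not the paper's procedure: the bandwidth, feature dimension, and variable names are all our assumptions). Each agent transmits only a D-dimensional mean embedding instead of raw samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, W, b):
    """Random Fourier features for the RBF kernel k(x, y) = exp(-||x - y||^2 / 2):
    z(x) = sqrt(2 / D) * cos(W x + b), so z(x) . z(y) approximates k(x, y)."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

d, D = 3, 2000
W = rng.normal(size=(D, d))               # frequencies ~ spectral density of RBF
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

# Two agents with similar (but not identical) local distributions.
X_a = rng.normal(0.0, 1.0, size=(500, d))
X_b = rng.normal(0.1, 1.0, size=(500, d))

# Each agent shares only its empirical kernel mean embedding (a D-vector),
# never raw samples; distances between embeddings estimate the MMD between
# local distributions, the signal a weight-learning scheme can exploit.
mu_a = rff_map(X_a, W, b).mean(axis=0)
mu_b = rff_map(X_b, W, b).mean(axis=0)
gap = np.linalg.norm(mu_a - mu_b)

# Sanity check of the kernel approximation at a single point: k(x0, x0) = 1.
x0 = np.zeros((1, d))
approx_self = float(rff_map(x0, W, b) @ rff_map(x0, W, b).T)
```

Communicating `mu_a` costs O(D) floats regardless of the local sample size, which is the communication/statistics trade-off the abstract refers to.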
[AI-105] Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback
[Quick Read]: This paper addresses the lack of a principled mathematical framework for modeling Likert-scale preference data (e.g., "significantly better", "better", "slightly better") in reward modeling. Existing methods build on binary preference models (such as Bradley-Terry) and handle ordinal preferences through hand-crafted margin terms or scaling factors, with no theoretical account of how the preferences are generated. The key to the solution is to formulate Likert-scale preferences as a discrete ordinal regression problem and derive two loss functions from it: a negative log-likelihood loss and an all-threshold loss. Both learn threshold parameters that naturally capture the ordinal structure of preferences, rather than relying on manually specified fixed margins or weights, so the preference boundaries are learned end-to-end from data, improving reward-model performance on chat, reasoning, and safety tasks.
Link: https://arxiv.org/abs/2603.02232
Authors: Amirhossein Afsharrad, Ruida Zhou, Luca Viano, Sanjay Lall, Mohammad Ghavamzadeh
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Reward modeling is crucial for aligning large language models with human preferences, yet current approaches lack a principled mathematical framework for leveraging ordinal preference data. When human annotators provide graded preferences on a Likert scale (e.g., significantly better, better, slightly better, negligibly better), existing methods typically apply ad-hoc heuristics, such as margin terms or scaling factors, to loss functions derived from binary preference models like Bradley-Terry. These approaches lack an underlying mathematical model for how ordinal preference data is generated. We present a theoretically grounded framework that formulates reward modeling with Likert scale preferences as a discrete ordinal regression problem. We derive two loss functions from this formulation: a negative log-likelihood loss and an all-threshold loss, both of which learn threshold parameters that naturally capture the ordinal structure of preferences. Unlike existing heuristic methods that manually specify fixed margins or scaling weights, our approach learns these parameters directly from data within a coherent probabilistic framework. Experimental results on multiple benchmarks demonstrate that our ordinal regression approach consistently achieves competitive or superior performance compared to existing heuristic methods across diverse evaluation categories including chat, reasoning, and safety tasks. Our work provides the first principled mathematical framework for incorporating Likert scale preferences into reward model training, moving beyond ad-hoc modifications of binary preference models to enable more effective utilization of fine-grained human feedback.
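The two losses derived in the abstract have standard closed forms in the ordinal-regression literature; the sketch below (an illustration with assumed symbols, not the authors' code) treats the score z as a reward difference between the two responses and the K-1 thresholds as learnable cut points:

```python
import numpy as np

def all_threshold_loss(z, y, thresholds):
    """All-threshold ordinal loss with a logistic surrogate: for label y
    (0-indexed among K levels) the score z should sit above every
    threshold k < y and below every threshold k >= y."""
    th = np.asarray(thresholds, dtype=float)
    signs = np.where(np.arange(len(th)) < y, 1.0, -1.0)
    margins = signs * (z - th)
    return float(np.sum(np.log1p(np.exp(-margins))))

def ordinal_nll(z, y, thresholds):
    """Negative log-likelihood under a cumulative-logit (ordered logit)
    model: P(Y <= k) = sigmoid(theta_k - z), so P(Y = y) is a difference
    of adjacent CDF values."""
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    cdf = np.concatenate(([0.0], sig(np.asarray(thresholds, float) - z), [1.0]))
    return float(-np.log(cdf[y + 1] - cdf[y]))
```

With thresholds [-1, 0, 1] (four ordinal levels), both losses reward a score that lands in the interval implied by the label, with no hand-set margins: the cut points themselves would be trained jointly with the reward model.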
[AI-106] Physics-Informed Neural Networks with Architectural Physics Embedding for Large-Scale Wave Field Reconstruction
[Quick Read]: This paper addresses the difficulty of jointly achieving computational efficiency and accuracy in large-scale wave field reconstruction. Traditional physics-based numerical methods (e.g., the finite element method, FEM) are accurate but prohibitively expensive at large scales or high frequencies, while purely data-driven methods are limited by the scarcity of labeled data in complex scenarios. The authors propose the architecturally physics-embedded (PE)-PINN, whose key idea is to inject physical priors not only through the loss function but directly into the network architecture: a newly designed envelope transformation layer encodes source properties, material interfaces, and wave physics, effectively mitigating spectral bias and significantly improving convergence speed and memory efficiency. Experiments show that PE-PINN converges more than 10x faster than standard PINNs and uses orders of magnitude less memory than FEM, enabling high-fidelity modeling of room-scale 2D/3D electromagnetic wave fields with reflections, refractions, and diffractions.
Link: https://arxiv.org/abs/2603.02231
Authors: Huiwen Zhang, Feng Ye, Chu Ma
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 17 figures
Click to view abstract
Abstract:Large-scale wave field reconstruction requires precise solutions but faces challenges with computational efficiency and accuracy. The physics-based numerical methods like Finite Element Method (FEM) provide high accuracy but struggle with large-scale or high-frequency problems due to prohibitive computational costs. Pure data-driven approaches excel in speed but often lack sufficient labeled data for complex scenarios. Physics-informed neural networks (PINNs) integrate physical principles into machine learning models, offering a promising solution by bridging these gaps. However, standard PINNs embed physical principles only in loss functions, leading to slow convergence, optimization instability, and spectral bias, limiting their ability for large-scale wave field reconstruction. This work introduces architecture physics embedded (PE)-PINN, which integrates additional physical guidance directly into the neural network architecture beyond Helmholtz equations and boundary conditions in loss functions. Specifically, a new envelope transformation layer is designed to mitigate spectral bias with kernels parameterized by source properties, material interfaces, and wave physics. Experiments demonstrate that PE-PINN achieves more than 10 times speedup in convergence compared to standard PINNs and several orders of magnitude reduction in memory usage compared to FEM. This breakthrough enables high-fidelity modeling for large-scale 2D/3D electromagnetic wave reconstruction involving reflections, refractions, and diffractions in room-scale domains, readily applicable to wireless communications, sensing, room acoustics, and other fields requiring large-scale wave field analysis.
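Independent of the architectural embedding, the physics term every Helmholtz PINN penalizes is the PDE residual at collocation points. A minimal 1D finite-difference illustration of that residual (our own toy check, not the paper's PE-PINN layer):

```python
import numpy as np

def helmholtz_residual(u, x, k):
    """Central-difference Helmholtz residual u'' + k^2 u on a uniform 1D
    grid: the PDE term a physics-informed loss drives to zero at interior
    collocation points."""
    h = x[1] - x[0]
    u_xx = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
    return u_xx + k**2 * u[1:-1]

k = 2.0
x = np.linspace(0.0, 2.0 * np.pi, 2001)
res_true = helmholtz_residual(np.sin(k * x), x, k)   # sin(kx) solves the PDE
res_bad = helmholtz_residual(x**2, x, k)             # x^2 does not
```

A candidate field that satisfies the equation (here sin(kx)) yields a residual near zero, while a non-solution yields a large residual; a PINN loss sums the squared residual over such points, plus boundary terms.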
[AI-107] Generalized Discrete Diffusion with Self-Correction
[Quick Read]: This paper addresses how discrete diffusion models can retain efficient parallel sampling at inference time without the performance degradation that self-correction strategies often introduce. Existing approaches mostly perform self-correction at inference time or during post-training, but suffer from limited generalization and impaired reasoning performance. The key to the solution is a Self-Correcting Discrete Diffusion (SCDD) model that reformulates self-correction with explicit state transitions and learns it directly in discrete time, discarding the earlier continuous-interpolation pipeline with its opaque interactions between uniform transitions and absorbing masks; it also simplifies the training noise schedule, removes a redundant remasking step, and relies exclusively on uniform transitions to learn corrective behavior, achieving more efficient parallel decoding at GPT-2 scale without sacrificing generation quality.
Link: https://arxiv.org/abs/2603.02230
Authors: Linxuan Wang, Ziyi Wang, Yikun Bai, Wei Deng, Guang Lin, Qifan Song
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 40 pages, 3 figures, 6 tables
Click to view abstract
Abstract:Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.
[AI-108] Neural Paging: Learning Context Management Policies for Turing-Complete Agents
[Quick Read]: This paper addresses the problem that large language models (LLMs), even when augmented with external read-write memory, remain bottlenecked by a finite and costly context window, which is in essence a semantic cache rather than unbounded memory, making long-horizon reasoning inefficient. The key to the solution is a hierarchical architecture called Neural Paging, whose core idea is to decouple symbolic reasoning from information resource management. The paper formulates the Context Paging Problem (CPP) and designs a lightweight, differentiable Page Controller that approximates "Semantic Belady's Optimality", i.e., retaining tokens with high future utility under explicit assumptions on access patterns. Theoretical analysis shows that the method reduces the asymptotic complexity of long-horizon reasoning from quadratic O(N²) to O(N·K²) (where K is the context window size), and a robustness bound quantifies competitive-ratio degradation under policy-dependent access; experiments validate the theoretical guarantees and expose slack that motivates learned policies.
Link: https://arxiv.org/abs/2603.02228
Authors: Liang Chen, Qi Liu
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:The proof that Large Language Models (LLMs) augmented with external read-write memory constitute a computationally universal system has established the theoretical foundation for general-purpose agents. However, existing implementations face a critical bottleneck: the finite and costly Context Window, which functions not as infinite memory but as a scarce semantic cache. In this work, we introduce Neural Paging, a hierarchical architecture that decouples symbolic reasoning from information resource management. We formulate the Context Paging Problem (CPP) and propose a lightweight, differentiable Page Controller designed to approximate "Semantic Belady's Optimality", retaining tokens with high future utility under explicit assumptions on access patterns. We provide theoretical analysis showing that, under bounded context window size K, Neural Paging reduces the asymptotic complexity of long-horizon reasoning from quadratic O(N²) to O(N·K²), and we derive a robustness bound (Theorem 4) that quantifies competitive-ratio degradation under policy-dependent access with bounded sensitivity. We validate these bounds on synthetic paging traces, confirming that the theoretical guarantees hold and identifying significant slack that motivates learned policies.
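The "Semantic Belady" target that the Page Controller approximates is the classical offline paging optimum: on a miss, evict the cached item whose next use lies farthest in the future. A compact sketch on a token-access trace (an illustration of the offline baseline, not the learned controller):

```python
def belady_evictions(accesses, K):
    """Offline Belady paging on a token-access trace: with a cache of size
    K, on a miss evict the cached token whose next use is farthest in the
    future. Returns the miss count, the lower bound any online policy
    (including a learned page controller) is measured against."""
    n = len(accesses)
    next_use = [float("inf")] * n
    last_seen = {}
    for i in range(n - 1, -1, -1):          # precompute next-use positions
        next_use[i] = last_seen.get(accesses[i], float("inf"))
        last_seen[accesses[i]] = i
    cache, when, misses = set(), {}, 0
    for i, tok in enumerate(accesses):
        if tok not in cache:
            misses += 1
            if len(cache) >= K:
                victim = max(cache, key=lambda t: when[t])
                cache.remove(victim)
            cache.add(tok)
        when[tok] = next_use[i]             # refresh next use of the cached copy
    return misses
```

On the trace [1, 2, 3, 1, 2, 3] with K = 2, Belady incurs 4 misses where LRU incurs 6; an online controller can only approximate this because it must predict future utility rather than read it off the trace.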
[AI-109] Characterizing and Predicting Wildfire Evacuation Behavior: A Dual-Stage ML Approach
[Quick Read]: This paper addresses the high heterogeneity of wildfire evacuation behavior: individual responses to wildfire threats are shaped by complex interactions among household resources, preparedness, and situational cues, making evacuation behavior difficult to characterize and predict with traditional methods. The key to the solution is combining unsupervised and supervised machine learning: Multiple Correspondence Analysis, K-Modes clustering, and Latent Class Analysis identify latent subgroups differentiated by vehicle access, disaster planning, technological resources, pet ownership, and residential stability, while supervised models show that transportation mode can be predicted reliably from household characteristics, whereas evacuation timing remains challenging because it depends on dynamic, real-time fire conditions. The framework offers a data-driven path to understanding wildfire evacuation behavior and supports targeted preparedness strategies and equitable resource allocation.
Link: https://arxiv.org/abs/2603.02223
Authors: Sazzad Bin Bashar Polock, Anandi Dutta, Subasish Das
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This is the author’s preprint version of a paper accepted for presentation at SoutheastConn 2026. The final published version will appear in the official conference proceedings. Conference site: this https URL
Click to view abstract
Abstract:Wildfire evacuation behavior is highly variable and influenced by complex interactions among household resources, preparedness, and situational cues. Using a large-scale MTurk survey of residents in California, Colorado, and Oregon, this study integrates unsupervised and supervised machine learning methods to uncover latent behavioral typologies and predict key evacuation outcomes. Multiple Correspondence Analysis, K-Modes clustering, and Latent Class Analysis reveal consistent subgroups differentiated by vehicle access, disaster planning, technological resources, pet ownership, and residential stability. Complementary supervised models show that transportation mode can be predicted with high reliability from household characteristics, whereas evacuation timing remains difficult to classify due to its dependence on dynamic, real-time fire conditions. These findings advance data-driven understanding of wildfire evacuation behavior and demonstrate how machine learning can support targeted preparedness strategies, resource allocation, and equitable emergency planning.
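K-Modes, one of the clustering methods used here, is k-means adapted to categorical survey answers: Hamming dissimilarity replaces Euclidean distance, and per-column modes replace means. A minimal single-iteration sketch on toy coded responses (our own illustration, not the study's pipeline):

```python
import numpy as np

def hamming(a, b):
    """Dissimilarity used by K-Modes: number of mismatched categories."""
    return int(np.sum(a != b))

def kmodes_step(X, modes):
    """One assignment + mode-update iteration of K-Modes on categorical
    data (k-means with Hamming distance and columnwise modes)."""
    labels = np.array([np.argmin([hamming(x, m) for m in modes]) for x in X])
    new_modes = []
    for j in range(len(modes)):
        pts = X[labels == j]
        if len(pts) == 0:
            new_modes.append(modes[j])
            continue
        new_modes.append(np.array([
            max(set(col.tolist()), key=col.tolist().count) for col in pts.T]))
    return labels, np.array(new_modes)

# Toy categorical data with two clear clusters (e.g., coded survey answers).
X = np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0],
              [1, 1, 1], [1, 1, 1], [1, 0, 1]])
labels, modes = kmodes_step(X, np.array([X[0], X[3]]))
```

In practice the iteration is repeated until labels stabilize; libraries such as `kmodes` also handle initialization strategies and ties.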
[AI-110] MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
[Quick Read]: This paper addresses a core problem with the current clinical-calculator benchmark (MedCalc-Bench): what it actually measures is not clinical reasoning but formula memorization and arithmetic precision, distorting assessments of model tool-use ability. The key to the solution lies in three contributions: first, a systematic audit that identifies and fixes more than 20 critical formula errors and runtime bugs in the benchmark; second, "open-book" prompting, which supplies the calculator specification at inference time and raises the accuracy of GLM-4.6V and GLM-4.7 from about 52% to 81-85%, surpassing all published RL approaches without any fine-tuning; third, an upper bound of 95-97% established with GPT-5.2-Thinking, showing that residual errors stem mainly from ground-truth issues and dataset ambiguities, which argues for reframing the benchmark as a tool-use evaluation rather than a test of clinical reasoning.
Link: https://arxiv.org/abs/2603.02222
Authors: Artus Krohn-Grimberghe
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split (HELM MedHELM leaderboard) and the best published approach-RL with verifiable rewards-reaching 74%. We present three contributions that challenge the benchmark’s current framing. First, we conduct a systematic audit of the benchmark’s calculator implementations, identifying and fixing over 20 errors ranging from critical formula inaccuracies to runtime bugs in a NeurIPS-published dataset. Second, we show that a simple intervention-providing the model with the calculator specification at inference time (“open-book” prompting)-raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7, surpassing all published results including RL-trained systems, without any fine-tuning. Third, we establish an upper bound of 95-97% using GPT-5.2-Thinking, with residual errors attributable primarily to ground-truth issues and dataset ambiguities. Our findings suggest that MedCalc-Bench predominantly measures formula memorization and arithmetic precision rather than clinical reasoning, and would be better framed as a tool-use evaluation. 
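The open-book intervention amounts to a change in prompt construction. A minimal sketch (BMI is our toy example; the function and prompt wording are illustrative assumptions, not the paper's exact setup), where a reference implementation of the same spec also serves as an executable ground-truth check:

```python
def build_open_book_prompt(question, calculator_spec):
    """'Open-book' prompting: place the full calculator specification in
    context so the model extracts values and does arithmetic instead of
    recalling the formula from memory."""
    return (
        "You are a clinical calculator assistant.\n"
        "Use ONLY the specification below; do not rely on memorized formulas.\n\n"
        "### Calculator specification\n" + calculator_spec + "\n\n"
        "### Patient question\n" + question + "\n\n"
        "Answer with the numeric result only."
    )

# A reference implementation of the same spec doubles as an executable
# ground-truth anchor for auditing model answers.
BMI_SPEC = "BMI = weight_kg / (height_m ** 2), rounded to 1 decimal place."

def bmi(weight_kg, height_m):
    return round(weight_kg / height_m ** 2, 1)

prompt = build_open_book_prompt("Weight 70 kg, height 1.75 m. BMI?", BMI_SPEC)
```

The model's numeric answer can then be compared against `bmi(70, 1.75)` rather than against a possibly buggy stored label, which is the audit discipline the paper advocates.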
[AI-111] MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction
[Quick Read]: This paper addresses a tension in healthcare tabular prediction: classical feature engineering performs well but depends on manual expertise and is hard to automate, while neural approaches often underperform because they do not explicitly exploit domain knowledge. The key to the solution is MedFeat, a feedback-driven and model-aware feature engineering framework that combines LLM reasoning with domain knowledge for feature generation, provides explanations via SHAP values, and tracks successful and failed feature proposals to guide subsequent exploration. Crucially, it prioritizes informative features that the downstream model struggles to learn directly, achieving stable improvements across diverse clinical prediction tasks and demonstrating robustness across years and population shifts.
Link: https://arxiv.org/abs/2603.02221
Authors: Zizheng Zhang, Yiming Li, Justin Xu, Jinyu Wang, Rui Wang, Lei Song, Jiang Bian, David W Eyre, Jingjing Fu
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:In healthcare tabular predictions, classical models with feature engineering often outperform neural approaches. Recent advances in Large Language Models enable the integration of domain knowledge into feature engineering, offering a promising direction. However, existing approaches typically rely on a broad search over predefined transformations, overlooking downstream model characteristics and feature importance signals. We present MedFeat, a feedback-driven and model-aware feature engineering framework that leverages LLM reasoning with domain knowledge and provides feature explanations based on SHAP values while tracking successful and failed proposals to guide feature discovery. By incorporating model awareness, MedFeat prioritizes informative signals that are difficult for the downstream model to learn directly due to its characteristics. Across a broad range of clinical prediction tasks, MedFeat achieves stable improvements over various baselines and discovers clinically meaningful features that generalize under distribution shift, demonstrating robustness across years and from ICU cohorts to general hospitalized patients, thereby offering insights into real-world deployment. Code required to reproduce our experiments will be released, subject to dataset agreements and institutional policies.
[AI-112] NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels
[Quick Read]: This paper addresses real-time safety for large language models (LLMs) in streaming scenarios, where conventional post-hoc safeguards are inapplicable because they cannot intervene during generation, and streaming safeguards trained with token-level supervision require expensive annotation and tend to overfit. The key insight is that token-level risk signals are already encoded in the hidden representations of pretrained models and can be surfaced through features extracted by Sparse Autoencoders (SAEs). Building on this, the training-free NExT-Guard framework performs streaming interception by monitoring interpretable latent features, deployable with only the pretrained SAEs of publicly available base LLMs, making it model-agnostic, low-cost, and robust, and clearly outperforming existing supervised streaming and post-hoc safeguards.
Link: https://arxiv.org/abs/2603.02219
Authors: Junfeng Fang, Nachuan Chen, Houcheng Jiang, Dan Zhang, Fei Shen, Xiang Wang, Xiangnan He, Tat-Seng Chua
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Large language models are increasingly deployed in streaming scenarios, rendering conventional post-hoc safeguards ineffective as they fail to interdict unsafe content in real-time. While streaming safeguards based on token-level supervised training could address this, they necessitate expensive annotations and suffer from severe overfitting. In this work, we challenge the paradigm that streaming safety must rely on token-level supervised training. Instead, it is an inherent capability of well-trained post-hoc safeguards, as they already encode token-level risk signals in hidden representations. Hence, we introduce NExT-Guard, a training-free framework that achieves streaming safeguards by monitoring interpretable latent features from Sparse Autoencoders (SAEs). It uses pretrained SAEs from publicly available base LLMs, enabling flexible, low-cost deployment without token-level supervision. Experimental results show that NExT-Guard outperforms both post-hoc and streaming safeguards based on supervised training, with superior robustness across models, SAE variants, and risk scenarios. These results make NExT-Guard a universal and scalable paradigm for real-time safety, accelerating the practical deployment of streaming safeguards.
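The monitoring loop can be sketched in a few lines (a toy with identity-like SAE weights and a hand-picked risk latent; real deployments use pretrained SAE weights and empirically chosen feature indices):

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """One-layer sparse autoencoder encoder: ReLU(W_enc @ h + b_enc).
    With a pretrained SAE, individual latent dimensions correspond
    (approximately) to interpretable features of the hidden state h."""
    return np.maximum(W_enc @ h + b_enc, 0.0)

def streaming_guard(hidden_states, W_enc, b_enc, risk_idx, threshold):
    """Scan per-token hidden states as they stream; return the first token
    index at which any monitored 'risk' latent exceeds threshold, else -1."""
    for t, h in enumerate(hidden_states):
        z = sae_encode(h, W_enc, b_enc)
        if np.any(z[risk_idx] > threshold):
            return t
    return -1

# Toy stand-in for pretrained SAE weights and a designated risk latent.
d_model, d_sae = 8, 16
W_enc = np.eye(d_sae, d_model)
b_enc = -0.5 * np.ones(d_sae)
safe = np.zeros(d_model)
risky = np.zeros(d_model)
risky[3] = 2.0                      # activates latent 3: 2.0 - 0.5 = 1.5

hit = streaming_guard([safe, safe, risky, safe], W_enc, b_enc, [3], 1.0)
miss = streaming_guard([safe, safe, safe], W_enc, b_enc, [3], 1.0)
```

Because the SAE forward pass is a single matrix multiply plus ReLU per token, this check adds negligible latency to streaming generation, which is what makes the training-free framing practical.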
[AI-113] Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression
[Quick Read]: This paper addresses retraining-free compression of Mixture-of-Experts (MoE) models, where the central challenge is persistent post-compression performance degradation. The analysis shows that this degradation stems mainly from router-expert mismatch: expert parameters are changed while the router is left unsynchronized. The key to the solution is to calibrate only the router, leaving expert parameters untouched, via Router Knowledge Distillation (Router KD), which fine-tunes the router parameters by distilling the original model's next-token distribution on unlabeled calibration data, recovering performance, with especially large gains in fine-grained MoEs (many small experts).
Link: https://arxiv.org/abs/2603.02217
Authors: Sieun Hyeon, Jaeyoung Do
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms - Expert Pruning, Expert Editing, and Expert Merging - and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch when experts are changed but the router is left untouched. We argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. To this end, we propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model’s next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance recovery, with substantially larger gains in fine-grained MoEs (many small experts) than in coarse-grained MoEs due to their more complex routing decision boundaries.
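The idea of updating only the router against the teacher's next-token distribution can be sketched on a toy dense-routed MoE (our illustration with assumed shapes; the paper works with real MoE LLMs, and we use finite differences as a stand-in for backprop through the router):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_logits(x, router_W, experts):
    """Dense soft-routed MoE: output logits are a router-weighted mixture
    of frozen expert outputs (each expert is a fixed V x d matrix)."""
    gates = softmax(router_W @ x)
    return sum(g * (E @ x) for g, E in zip(gates, experts))

def kd_loss(x, router_W, experts, teacher_probs):
    """Router KD objective: KL(teacher || student) over next-token probs.
    Only router_W is treated as trainable; the experts stay frozen."""
    p = softmax(moe_logits(x, router_W, experts))
    return float(np.sum(teacher_probs * (np.log(teacher_probs) - np.log(p))))

def router_kd_step(x, router_W, experts, teacher_probs, lr=0.1, eps=1e-4):
    """One finite-difference gradient descent step on the router alone."""
    grad = np.zeros_like(router_W)
    for i in range(router_W.shape[0]):
        for j in range(router_W.shape[1]):
            Wp = router_W.copy(); Wp[i, j] += eps
            Wm = router_W.copy(); Wm[i, j] -= eps
            grad[i, j] = (kd_loss(x, Wp, experts, teacher_probs)
                          - kd_loss(x, Wm, experts, teacher_probs)) / (2 * eps)
    return router_W - lr * grad

rng = np.random.default_rng(0)
d, V = 4, 5
experts = [rng.normal(size=(V, d)) for _ in range(2)]   # frozen experts
router_W = 0.1 * rng.normal(size=(2, d))                # trainable router only
x = rng.normal(size=d)
teacher = softmax(experts[0] @ x)   # calibration target distribution

before = kd_loss(x, router_W, experts, teacher)
W = router_W
for _ in range(60):
    W = router_kd_step(x, W, experts, teacher)
after = kd_loss(x, W, experts, teacher)
```

Only the 2x4 router matrix changes during calibration; the expert matrices are never touched, mirroring the paper's constraint that effective retraining-free compression should avoid updating expert parameters.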
[AI-114] ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue ICLR2026
[Quick Read]: This paper addresses accurate diagnosis under incomplete information in multi-turn medical dialogue, where the core challenges are modeling the uncertainty of user-agent interactions and long-horizon credit assignment. The key innovation is an uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm: it quantifies state uncertainty with a composite metric of Bellman error and action-value variance and adaptively allocates the rollout budget to high-uncertainty states, improving value estimation and fostering more efficient, diverse exploration. An uncertainty-guided pruning mechanism and an asynchronous search architecture further reduce the computational cost of tree-based reinforcement learning, jointly optimizing performance and efficiency.
Link: https://arxiv.org/abs/2603.02216
Authors: Ruike Cao, Shaojie Bai, Fugen Yao, Liang Dong, Jian Xu, Li Xiao
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2026
Click to view abstract
Abstract:Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation, while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism to minimize the number of rollouts, and an asynchronous search architecture that leverages KV cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that our algorithm significantly outperforms several strong baselines, culminating in the Qwen3-8B model surpassing the much larger GPT-4o (+0.92% accuracy).
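The budget-allocation idea can be sketched as follows (symbols, weighting, and the rounding scheme are our illustrative assumptions, not ATPO's exact formulas):

```python
import numpy as np

def state_uncertainty(q_values, reward, gamma, v_next, alpha=1.0, beta=1.0):
    """Composite uncertainty: alpha * |Bellman error| + beta * Var[Q(s, .)].
    q_values holds Q(s, a) for the candidate actions at state s."""
    q = np.asarray(q_values, dtype=float)
    bellman_err = abs(float(q.max()) - (reward + gamma * v_next))
    return alpha * bellman_err + beta * float(q.var())

def allocate_rollouts(uncertainties, budget):
    """Split an integer rollout budget across states proportionally to
    uncertainty; largest-remainder rounding keeps the total exact."""
    u = np.asarray(uncertainties, dtype=float)
    shares = budget * u / u.sum()
    base = np.floor(shares).astype(int)
    leftover = budget - int(base.sum())
    order = np.argsort(-(shares - base))
    base[order[:leftover]] += 1
    return base

# A self-consistent state (zero Bellman error, zero variance) needs no budget.
certain = state_uncertainty([1.0, 1.0, 1.0], reward=0.0, gamma=1.0, v_next=1.0)
alloc = allocate_rollouts([3.0, 1.0], budget=8)
```

States whose value estimates are already self-consistent receive few or no extra rollouts, concentrating simulation effort where the tree search is most unsure.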
[AI-115] RxnNano: Training Compact LLMs for Chemical Reaction and Retrosynthesis Prediction via Hierarchical Curriculum Learning
[Quick Read]: This paper addresses the over-reliance of current chemical reaction prediction models on scaling parameters and data, at the expense of the fundamentals of reaction representation and deep chemical intuition (such as reaction common sense and topological atom-mapping logic). The core of the solution is a unified framework that puts chemical understanding first, with four key innovations: (1) a Latent Chemical Consistency objective that models reactions as movements on a continuous chemical manifold, ensuring reversible and physically plausible transformations; (2) a Hierarchical Cognitive Curriculum that trains the model in stages, from syntax mastery to semantic reasoning, building robust chemical intuition; (3) Atom-Map Permutation Invariance (AMPI), which forces the model to learn invariant relational topology and balances multi-task learning; (4) structured plan-based reasoning to improve LLM performance. The resulting 0.5B-parameter RxnNano model significantly outperforms fine-tuned 7B-parameter LLMs and all domain baselines on rigorous benchmarks, with a 23.5% improvement in Top-1 accuracy.
Link: https://arxiv.org/abs/2603.02215
Authors: Ran Li, Shimin Di, Haowei LI, Luanshi Bu, Jiachuan Wang, Wangze Ni, Lei Chen
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Chemical reaction prediction is pivotal for accelerating drug discovery and synthesis planning. Despite advances in data-driven models, current approaches are hindered by an overemphasis on parameter and dataset scaling, and some are coupled with evaluation techniques that bypass fundamental challenges in reaction representation and fail to capture deep chemical intuition such as reaction common sense and topological atom-mapping logic. We argue that the core challenge lies in instilling this knowledge into the models. To this end, we propose a unified framework that prioritizes chemical understanding over scale through four key innovations: (1) a Latent Chemical Consistency objective that models reactions as movements on a continuous chemical manifold, ensuring reversible and physically plausible transformations; (2) a Hierarchical Cognitive Curriculum that trains the model through progressive stages, from syntax mastery to semantic reasoning, building robust chemical intuition; (3) Atom-Map Permutation Invariance (AMPI), which forces the model to learn invariant relational topology and balances multi-task learning; and (4) structured plan-based reasoning to improve the performance of the LLMs. Our compact 0.5B-parameter model, RxnNano, significantly outperforms fine-tuned LLMs ten times larger (7B) and all the domain baselines, achieving a 23.5% Top-1 accuracy improvement on rigorous benchmarks without test-time augmentation. this https URL.
[AI-116] Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving
[Quick Read]: This paper addresses the lack of a unified abstraction and system-level understanding of Federated Inference (FI) in practice, in particular how privately owned models can collaborate at inference time without sharing data or model parameters. The key to the solution is to position FI as a collaborative paradigm distinct from federated learning and to identify its two core feasibility requirements: inference-time privacy preservation and meaningful performance gains through collaboration. By formalizing FI as a protected collaborative computation, the authors analyze its design dimensions and the structural trade-offs that arise when privacy constraints, non-IID data, and limited observability are jointly imposed at inference time, revealing system-level behaviors distinct from training-time federation and classical ensemble methods, and laying groundwork for practical, scalable, privacy-preserving collaborative inference systems.
Link: https://arxiv.org/abs/2603.02214
Authors: Jungwon Seo, Ferhat Ozgur Catak, Chunming Rong, Jaeyeon Jang
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 19 pages, 6 figures, 10 tables
Click to view abstract
Abstract:Federated Inference (FI) studies how independently trained and privately owned models can collaborate at inference time without sharing data or model parameters. While recent work has explored secure and distributed inference from disparate perspectives, a unified abstraction and system-level understanding of FI remain lacking. This paper positions FI as a distinct collaborative paradigm, complementary to federated learning, and identifies two fundamental requirements that govern its feasibility: inference-time privacy preservation and meaningful performance gains through collaboration. We formalize FI as a protected collaborative computation, analyze its core design dimensions, and examine the structural trade-offs that arise when privacy constraints, non-IID data, and limited observability are jointly imposed at inference time. Through a concrete instantiation and empirical analysis, we highlight recurring friction points in privacy-preserving inference, ensemble-based collaboration, and incentive alignment. Our findings suggest that FI exhibits system-level behaviors that cannot be directly inherited from training-time federation or classical ensemble methods. Overall, this work provides a unifying perspective on FI and outlines open challenges that must be addressed to enable practical, scalable, and privacy-preserving collaborative inference systems.
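A minimal form of ensemble-based collaboration under FI's constraints can be sketched as follows (our illustration: each model reveals only its output distribution for the query, never data or parameters; the weighting scheme is an assumption):

```python
import numpy as np

def federated_vote(prob_vectors, weights=None):
    """Inference-time collaboration sketch: each privately owned model
    contributes only its class-probability vector for the query; the
    coordinator aggregates by (optionally weighted) averaging."""
    P = np.asarray(prob_vectors, dtype=float)
    w = np.ones(len(P)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ P

# Two participants disagree; the unweighted average breaks the tie.
p = federated_vote([[0.9, 0.1], [0.2, 0.8]])
# Weighting (e.g., by incentive or validated reliability) can flip it.
p_w = federated_vote([[0.9, 0.1], [0.2, 0.8]], weights=[1.0, 3.0])
```

Even this simple aggregator exposes the frictions the paper discusses: the output vector itself can leak information about private training data, and the weights raise the incentive-alignment question of how contributions are valued.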
[AI-117] GLEAN: Grounded Lightweight Evaluation Anchors for Contamination-Aware Tabular Reasoning
[Quick Read]: This paper addresses the contamination, dataset artifacts, and retrieval failures that make existing small-model evaluations of tabular reasoning unreliable. The core of the solution is GLEAN, a lightweight evaluation protocol that integrates contamination-aware probes, weak-supervision governance, retrieval-reasoning diagnostics, and structured error attribution under tight hardware constraints. Using Squall-generated gold SQL as executable anchors (95.2% execution success), GLEAN assigns a deterministic error taxonomy (L0-L4 plus an L0.5 context miss) and reveals a stable separation of error modes between TAPEX and TAPAS: TAPEX errors skew toward grounding (L3), while TAPAS errors skew toward hallucination or abstention (L2/L0). The work also underscores the need for attribution beyond raw retrieval recall.
Link: https://arxiv.org/abs/2603.02212
Authors: Qizhi Wang
Affiliations: unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures for the main paper
Click to view abstract
Abstract:Tabular reasoning benchmarks mix semantic inference, numerical computation, and brittle table formatting, yet evaluations for small models remain vulnerable to contamination, dataset artifacts, and retrieval failures. We propose GLEAN, a lightweight evaluation protocol that integrates contamination-aware probes, weak-supervision governance, retrieval-reasoning diagnostics, and structured error attribution under tight hardware constraints. We evaluate across TabFact, WTQ via Squall, TableBench, RobuT, and SciTab under a 16GB GPU budget. Using Squall gold SQL as an executable anchor (95.2% execution), GLEAN assigns a deterministic error taxonomy (L0-L4 plus L0.5 context miss) and reveals a stable error-mode separation: TAPEX errors skew toward grounding (L3) while TAPAS errors skew toward hallucination/abstention (L2/L0). We validate evidence-row heuristics against SQL-derived rows on simple queries (0.62 precision / 0.71 recall; hybrid recall 0.81) and show that retrieval Recall@K can saturate even when end-to-end EM/F1 remains limited, motivating attribution beyond raw recall. We release a modular framework with audits and sensitivity checks to make small-model tabular evaluation more contamination-aware and diagnostic.
[AI-118] QFlowNet: Fast Diverse and Efficient Unitary Synthesis with Generative Flow Networks
【Quick Read】: This paper targets a core problem in quantum compilation: unitary synthesis, i.e., efficiently decomposing a unitary matrix into a sequence of quantum gates. Conventional reinforcement learning (RL) approaches often struggle with sparse reward signals, leading to difficult training or convergence to a single policy and a lack of solution diversity. The key innovation of the proposed QFlowNet framework is pairing a Generative Flow Network (GFlowNet) with a Transformer: the GFlowNet learns efficiently from sparse rewards and samples diverse solutions in proportion to their reward, overcoming RL's single-policy limitation while offering faster inference than diffusion models; the Transformer acts as a powerful encoder that captures the non-local structure of unitary matrices and compresses the high-dimensional state into a dense latent representation, strengthening the policy network. Experiments show a 99.7% success rate on a 3-qubit benchmark (lengths 1-12) and the discovery of diverse, compact circuits, demonstrating advantages in both efficiency and diversity.
Link: https://arxiv.org/abs/2603.03045
Authors: Inhoe Koo, Hyunho Cha, Jungwoo Lee
Affiliation: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures, IEEE International Conference on Quantum Communications, Networking, and Computing (QCNC 2026)
Click to view abstract
Abstract:Unitary Synthesis, the decomposition of a unitary matrix into a sequence of quantum gates, is a fundamental challenge in quantum compilation. Prevailing reinforcement learning (RL) approaches are often hampered by sparse reward signals, which necessitate complex reward shaping or long training times, and typically converge to a single policy, lacking solution diversity. In this work, we propose QFlowNet, a novel framework that learns efficiently from sparse signals by pairing a Generative Flow Network (GFlowNet) with Transformers. Our approach addresses two key challenges. First, the GFlowNet framework is fundamentally designed to learn a diverse policy that samples solutions proportional to their reward, overcoming the single-solution limitation of RL while offering faster inference than other generative models like diffusion. Second, the Transformers act as a powerful encoder, capturing the non-local structure of unitary matrices and compressing a high-dimensional state into a dense latent representation for the policy network. Our agent achieves an overall success rate of 99.7% on a 3-qubit benchmark (lengths 1-12) and discovers a diverse set of compact circuits, establishing QFlowNet as an efficient and diverse paradigm for unitary synthesis.
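GFlowNets of this kind are commonly trained with the trajectory-balance objective, which drives the forward flow toward sampling terminal states in proportion to their reward. A minimal numerical sketch of that loss for one sampled trajectory (illustrative names, not code from the paper):

```python
import math

def trajectory_balance_loss(log_Z, log_pf, log_pb, reward):
    """Squared trajectory-balance residual for one sampled trajectory.

    log_Z  : scalar log-partition estimate (a learned parameter)
    log_pf : forward log-probabilities of each action along the trajectory
    log_pb : backward log-probabilities of each action along the trajectory
    reward : positive terminal reward R(x)
    """
    residual = log_Z + sum(log_pf) - math.log(reward) - sum(log_pb)
    return residual ** 2

# When Z, the forward policy, the backward policy, and the reward are
# mutually consistent, the residual (and hence the loss) is zero.
```

Minimizing this residual over sampled trajectories is what yields a policy that samples solutions proportionally to reward, rather than collapsing to a single mode.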
[AI-119] Layer-wise QUBO-Based Training of CNN Classifiers for Quantum Annealing
【Quick Read】: This paper tackles bottlenecks in two classes of quantum machine learning methods: variational quantum circuits for image classification suffer from barren plateaus and vanishing gradients, while quantum kernel methods scale quadratically with dataset size. The key idea is an iterative Quadratic Unconstrained Binary Optimization (QUBO) framework that trains the fully connected classifier head of a CNN via quantum annealing, entirely avoiding gradient-based circuit optimization. Following the Extreme Learning Machine paradigm, the convolutional layers are randomly initialized and frozen and only the final layer is optimized; at each iteration, a convex quadratic surrogate derived from the feature Gram matrix replaces the non-quadratic cross-entropy loss, providing a stable curvature approximation. A per-output decomposition splits the C-class problem into C independent QUBO subproblems, each with (d+1)K binary variables (d is the feature dimension, K the bit precision), so that problem size depends only on image resolution and bit precision rather than the number of training samples, improving scalability and hardware compatibility.
Link: https://arxiv.org/abs/2603.02958
Authors: Mostafa Atallah, Rebekah Herrman
Affiliation: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 28 pages, 5 figures, 9 tables. Submitted to Quantum Machine Intelligence
Click to view abstract
Abstract:Variational quantum circuits for image classification suffer from barren plateaus, while quantum kernel methods scale quadratically with dataset size. We propose an iterative framework based on Quadratic Unconstrained Binary Optimization (QUBO) for training the classifier head of convolutional neural networks (CNNs) via quantum annealing, entirely avoiding gradient-based circuit optimization. Following the Extreme Learning Machine paradigm, convolutional filters are randomly initialized and frozen, and only the fully connected layer is optimized. At each iteration, a convex quadratic surrogate derived from the feature Gram matrix replaces the non-quadratic cross-entropy loss, yielding an iteration-stable curvature proxy. A per-output decomposition splits the C -class problem into C independent QUBOs, each with (d+1)K binary variables, where d is the feature dimension and K is the bit precision, so that problem size depends on the image resolution and bit precision, not on the number of training samples. We evaluate the method on six image-classification benchmarks (sklearn digits, MNIST, Fashion-MNIST, CIFAR-10, EMNIST, KMNIST). A precision study shows that accuracy improves monotonically with bit resolution, with 10 bits representing a practical minimum for effective optimization; the 15-bit formulation remains within the qubit and coupler limits of current D-Wave Advantage hardware. The 20-bit formulation matches or exceeds classical stochastic gradient descent on MNIST, Fashion-MNIST, and EMNIST, while remaining competitive on CIFAR-10 and KMNIST. All experiments use simulated annealing, establishing a baseline for direct deployment on quantum annealing hardware.
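The reduction of a quadratic least-squares surrogate to a QUBO over bit-encoded weights can be illustrated in one dimension. The sketch below (illustrative, assuming a fixed-point encoding w = Σ_k b_k 2^{-k}; the paper's actual encoding and multi-class decomposition are richer) builds the QUBO coefficients and checks them by brute force:

```python
from itertools import product

def build_qubo(xs, ys, K):
    """QUBO for min_w sum_i (xs[i]*w - ys[i])**2 with w = sum_k b_k * 2**-k.

    Returns {(k, l): coefficient} with k <= l; diagonal terms fold in the
    linear part using the binary identity b_k**2 == b_k.
    """
    A = sum(x * x for x in xs)                 # quadratic (curvature) term
    B = sum(x * y for x, y in zip(xs, ys))     # linear term
    c = [2.0 ** -k for k in range(K)]          # bit weights
    Q = {}
    for k in range(K):
        Q[(k, k)] = A * c[k] * c[k] - 2.0 * B * c[k]
        for l in range(k + 1, K):
            Q[(k, l)] = 2.0 * A * c[k] * c[l]
    return Q

def energy(Q, bits):
    return sum(coef * bits[k] * bits[l] for (k, l), coef in Q.items())

def brute_force_min(Q, K):
    """Stand-in for the annealer on tiny instances."""
    return min(product([0, 1], repeat=K), key=lambda b: energy(Q, b))
```

The QUBO energy equals the squared loss up to the constant Σ y_i², so the annealer's minimizer coincides with the least-squares minimizer over the representable weights.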
[AI-120] The Vienna 4G/5G Drive-Test Dataset
【Quick Read】: This paper addresses the shortage of large, comprehensive real-world datasets that limits machine learning for mobile network analysis, planning, and optimization. Its solution is a city-scale open dataset, the Vienna 4G/5G Drive-Test Dataset, which combines passive wideband scanner observations with active handset logs to provide complementary network-side and user-side views of deployed radio access networks, and includes high-resolution building and terrain models to support reproducible benchmarking for environment-aware learning, propagation modeling, coverage analysis, and ray-tracing calibration.
Link: https://arxiv.org/abs/2603.02638
Authors: Wilfried Wiedner, Lukas Eller, Mariam Mussbah, Dominik Rössler, Valerian Maresch, Philipp Svoboda, Markus Rupp
Affiliation: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 12 figures, 8 tables. Submitted to Scientific Data
Click to view abstract
Abstract:Machine learning for mobile network analysis, planning, and optimization is often limited by the lack of large, comprehensive real-world datasets. This paper introduces the Vienna 4G/5G Drive-Test Dataset, a city-scale open dataset of georeferenced Long Term Evolution (LTE) and 5G New Radio (NR) measurements collected across Vienna, Austria. The dataset combines passive wideband scanner observations with active handset logs, providing complementary network-side and user-side views of deployed radio access networks. The measurements cover diverse urban and suburban settings and are aligned with time and location information to support consistent evaluation. For a representative subset of base stations (BSs), we provide inferred deployment descriptors, including estimated BS locations, sector azimuths, and antenna heights. The release further includes high-resolution building and terrain models, enabling geometry-conditioned learning and calibration of deterministic approaches such as ray tracing. To facilitate practical reuse, the data are organized into scanner, handset, estimated cell information, and city-model components, and the accompanying documentation describes the available fields and intended joins between them. The dataset enables reproducible benchmarking across environment-aware learning, propagation modeling, coverage analysis, and ray-tracing calibration workflows.
[AI-121] Detecting Structural Heart Disease from Electrocardiograms via a Generalized Additive Model of Interpretable Foundation-Model Predictors
【Quick Read】: This paper targets the screening gap for structural heart disease (SHD), where the high cost and limited accessibility of echocardiography (ECHO) leave many cases undiagnosed, and existing AI-based ECG methods are black boxes that hinder interpretability and clinical adoption. The key idea is an interpretable and efficient framework that embeds clinically meaningful ECG foundation-model predictors within a generalized additive model (GAM), achieving transparent risk attribution while maintaining strong predictive performance. On more than 80,000 ECG-ECHO pairs, the method outperforms the latest deep-learning baseline (+0.98% AUROC, +1.01% AUPRC, +1.41% F1), and still performs well with only 30% of the training data, indicating good generalization and clinical utility.
Link: https://arxiv.org/abs/2603.02616
Authors: Ya Zhou, Zhaohong Sun, Tianxiang Hao, Xiangjie Li
Affiliation: Unknown
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Click to view abstract
Abstract:Structural heart disease (SHD) is a prevalent condition with many undiagnosed cases, and early detection is often limited by the high cost and accessibility constraints of echocardiography (ECHO). Recent studies show that artificial intelligence (AI)-based analysis of electrocardiograms (ECGs) can detect SHD, offering a scalable alternative. However, existing methods are fully black-box models, limiting interpretability and clinical adoption. To address these challenges, we propose an interpretable and effective framework that integrates clinically meaningful ECG foundation-model predictors within a generalized additive model, enabling transparent risk attribution while maintaining strong predictive performance. Using the EchoNext benchmark of over 80,000 ECG-ECHO pairs, the method demonstrates relative improvements of +0.98% in AUROC, +1.01% in AUPRC, and +1.41% in F1 score over the latest state-of-the-art deep-learning baseline, while achieving slightly better performance even with only 30% of the training data. Subgroup analyses confirm robust performance across heterogeneous populations, and the estimated entry-wise functions provide interpretable insights into the relationships between risks of traditional ECG diagnoses and SHD. This work illustrates a complementary paradigm between classical statistical modeling and modern AI, offering a pathway to interpretable, high-performing, and clinically actionable ECG-based SHD screening.
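The interpretability claim rests on a defining property of GAMs: the logit decomposes into a sum of per-predictor shape-function contributions, so each predictor's share of the risk can be read off directly. A minimal sketch with two illustrative (made-up) shape functions, not the paper's fitted model:

```python
import math

def gam_logit(x, shape_functions, intercept=0.0):
    """Additive logit: intercept + sum_j f_j(x_j).

    Each shape function's output is an interpretable per-predictor risk
    contribution; their sum fully determines the prediction.
    """
    contributions = [f(v) for f, v in zip(shape_functions, x)]
    return intercept + sum(contributions), contributions

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two illustrative shape functions standing in for ECG-derived predictors.
shapes = [lambda v: 0.8 * v, lambda v: v * v - 1.0]
logit, parts = gam_logit([1.0, 2.0], shapes)
risk = sigmoid(logit)  # per-feature attributions in `parts` sum to `logit`
```

In the paper, the shape functions are estimated from data; the point of the sketch is only that, unlike a black-box network, the prediction is an exact sum of inspectable one-dimensional effects.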
[AI-122] Large Electron Model: A Universal Ground State Predictor
【Quick Read】: This paper addresses the difficulty of accurately computing ground-state wavefunctions in strongly correlated electron systems, in particular achieving high accuracy and broad generalization across Hamiltonian parameters and particle numbers, a regime where density functional theory (DFT) has known limitations. The proposed Large Electron Model is grounded in the variational principle; its key innovation is the Fermi Sets architecture, a universal representation of many-body fermionic wavefunctions, conditioned on Hamiltonian parameters and particle number so that a single neural network generalizes across the parameter space, predicting ground-state wavefunctions, real-space charge densities, and energies with high accuracy for up to 50 particles. This establishes a variationally grounded foundation-model paradigm for materials discovery.
Link: https://arxiv.org/abs/2603.02346
Authors: Timothy Zaklama, Max Geier, Liang Fu
Affiliation: Unknown
Subjects: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8+5 pages, 5+4 figures, 1+1 tables
Click to view abstract
Abstract:We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the entire Hamiltonian parameter manifold. Our model employs the Fermi Sets architecture, a universal representation of many-body fermionic wavefunctions, which is further conditioned on Hamiltonian parameter and particle number. On interacting electrons in a two-dimensional harmonic potential, a single trained model accurately predicts the ground state wavefunction while generalizing across unseen coupling strengths and particle-number sectors, producing both accurate real-space charge densities and ground state energies, even up to 50 particles. Our results establish a foundation model method for material discovery that is grounded in the variational principle, while accurately treating strong electron correlation beyond the capacity of density functional theory.
[AI-123] Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response
【Quick Read】: This paper targets the small-N, large-P paradox in precision oncology: genomic data is high-dimensional but high-quality drug-response samples are scarce, and existing deep models, while accurate, are black boxes that cannot supply the causal mechanisms needed for clinical decisions. The key idea is a Neuro-Symbolic Agentic Framework that couples a quantitative machine-learning World Model with an LLM-based agentic reasoning layer, explicitly models clinical context (such as microsatellite instability, MSI, status), and introduces "inverse reasoning": in silico CRISPR perturbations simulate how specific genomic edits (e.g., APC or TP53 repair) alter drug sensitivity, distinguishing therapeutic opportunity from contextual resistance. Validation against human clinical data supports its biological plausibility (p=0.023), offering a transparent, biologically grounded path toward explainable AI in cancer research.
Link: https://arxiv.org/abs/2603.02274
Authors: Christopher Baker, Karen Rafferty, Hui Wang
Affiliation: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant, but high-quality drug response samples are often sparse. While deep learning models achieve high predictive accuracy, they remain black boxes that fail to provide the causal mechanisms required for clinical decision-making. We present a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning World Model with an LLM-based agentic reasoning layer. Our system utilises a forensic data pipeline built on the Sanger GDSC dataset (N=83), achieving a robust predictive correlation (r=0.504) and a significant performance gain through the explicit modelling of clinical context, specifically Microsatellite Instability (MSI) status. We introduce the concept of Inverse Reasoning, where the agentic layer performs in silico CRISPR perturbations to predict how specific genomic edits, such as APC or TP53 repair, alter drug sensitivity. By distinguishing between therapeutic opportunity and contextual resistance, and validating these findings against human clinical data (p=0.023), our framework provides a transparent, biologically grounded path towards explainable AI in cancer research.
[AI-124] Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
【Quick Read】: This paper addresses how to evaluate the robustness of automatic speech recognition (ASR) systems to room acoustics, i.e., quantifying performance degradation under reverberation across models. The key contribution is the Whisper-RIR-Mega benchmark, a dataset of paired clean and reverberant speech obtained by convolving clean utterances with real room impulse responses (RIRs), with splits stratified by reverberation time (RT60) and direct-to-reverberant ratio (DRR), enabling systematic evaluation under controlled and diverse reverberation conditions. By testing five Whisper models of different sizes, the authors quantify the increase in word error rate (WER) and character error rate (CER) caused by reverberation, providing a reproducible evaluation framework and baseline results for developing more robust ASR systems.
Link: https://arxiv.org/abs/2603.02252
Authors: Mandip Goswami
Affiliation: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments:
Click to view abstract
Abstract:We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating automatic speech recognition (ASR) robustness to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a real room impulse response from the RIR-Mega corpus, with stratified splits by reverberation time (RT60) and direct-to-reverberant ratio (DRR). We evaluate five Whisper models (tiny through large-v3) on 1600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions. Reverberation consistently degrades performance across all model sizes; the reverb penalty in WER ranges from 0.12 to 1.07 percentage points depending on the model. We release the dataset, evaluation code, and baseline results to support reproducible research on robust ASR.
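Pairing a clean utterance with its reverberant counterpart amounts to a linear convolution with the measured RIR. A pure-Python sketch of the operation (a real pipeline would use an FFT-based routine such as scipy.signal.fftconvolve for speed):

```python
def convolve(signal, rir):
    """Direct linear convolution; output length len(signal)+len(rir)-1."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

# A unit-impulse RIR leaves the signal unchanged; a delayed, attenuated
# tap adds a scaled echo, which is what degrades ASR in the benchmark.
clean = [1.0, 0.5, 0.25]
echoey = convolve(clean, [1.0, 0.0, 0.3])  # direct path + echo at lag 2
```

Each RIR in the corpus encodes a room's RT60 and DRR, so stratifying pairs by those quantities is equivalent to stratifying by properties of the convolution kernel.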
[AI-125] A Benchmark Analysis of Graph and Non-Graph Methods for Caenorhabditis Elegans Neuron Classification
【Quick Read】: This paper addresses neuron classification in Caenorhabditis elegans, i.e., distinguishing Sensory, Interneuron, and Motor neurons from multiple feature types. The key contribution is a benchmark comparing four graph neural network (GNN) methods (GCN, GraphSAGE, GAT, GraphTransformer) against four non-graph methods (Logistic Regression, MLP, LOLCAT, NeuPRINT), using Spatial, Connection, and Neuronal Activity features derived from the functional connectome. The study finds that attention-based GNNs (GAT and GraphTransformer) significantly outperform the baselines on Spatial and Connection features, while Neuronal Activity features perform poorly, attributed to the low temporal resolution of the underlying data. The benchmark validates the use of GNNs for this task and identifies Spatial and Connection features as the key predictors.
Link: https://arxiv.org/abs/2603.02241
Authors: Jingqi Lu, Keqi Han, Yun Wang, Lu Mi, Carl Yang
Affiliation: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:This study establishes a benchmark for Caenorhabditis elegans neuron classification, comparing four graph methods (GCN, GraphSAGE, GAT, GraphTransformer) against four non-graph methods (Logistic Regression, MLP, LOLCAT, NeuPRINT). Using the functional connectome, we classified Sensory, Interneuron, and Motor neurons based on Spatial, Connection, and Neuronal Activity features. Results show that attention-based GNNs significantly outperform baselines on the Spatial and Connection features. The Neuronal Activity features yielded poor performance, likely due to the low temporal resolution of the underlying neuronal activity data. Our benchmark validates the use of GNNs and highlights that Spatial and Connection features are key predictors for Caenorhabditis elegans neuron classes. Code is available at: this https URL.
[AI-126] On the Parameter Estimation of Sinusoidal Models for Speech and Audio Signals
【Quick Read】: This paper addresses the accuracy of parameter estimation for sinusoidal models of speech and audio signals, particularly across different analysis window sizes and for non-stationary signals. It compares three classical sinusoidal models: the standard Sinusoidal Model (SM), the Exponentially Damped Sinusoidal Model (EDSM), and the extended adaptive Quasi-Harmonic Model (eaQHM), evaluating each by signal reconstruction accuracy. The key finding is that eaQHM reconstructs better at medium-to-large window sizes, whereas EDSM is more robust at small window sizes; the authors therefore propose merging the local adaptivity of eaQHM with the parameter estimation robustness of EDSM into a new paradigm for high-quality analysis and resynthesis of general audio signals.
Link: https://arxiv.org/abs/2401.01255
Authors: George P. Kafentzis
Affiliation: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Signal Processing (eess.SP)
Comments:
Click to view abstract
Abstract:In this paper, we examine the parameter estimation performance of three well-known sinusoidal models for speech and audio. The first one is the standard Sinusoidal Model (SM), which is based on the Fast Fourier Transform (FFT). The second is the Exponentially Damped Sinusoidal Model (EDSM), which has been proposed in the last decade and utilizes a subspace method for parameter estimation, and finally the extended adaptive Quasi-Harmonic Model (eaQHM), which has been recently proposed for AM-FM decomposition and estimates the signal parameters using Least Squares on a set of basis functions that are adaptive to the local characteristics of the signal. The parameter estimation of each model is briefly described, and its performance is compared to the others in terms of signal reconstruction accuracy versus window size on a variety of synthetic signals and versus the number of sinusoids on real signals. The latter include highly non-stationary signals, such as singing voices and guitar solos. The advantages and disadvantages of each model are presented via synthetic signals, and then the application on real signals is discussed. In conclusion, eaQHM outperforms EDSM in medium-to-large window size analysis, whereas EDSM yields higher reconstruction accuracy for smaller analysis window sizes. Thus, a future research direction appears to be the merging of the adaptivity of the eaQHM with the parameter estimation robustness of the EDSM in a new paradigm for high-quality analysis and resynthesis of general audio signals.
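The least-squares estimation underlying models like eaQHM can be illustrated in its simplest form: fitting the amplitude components a·cos + b·sin of a single known-frequency sinusoid by solving the 2×2 normal equations. A simplified sketch (one fixed frequency, no adaptivity, not the full model):

```python
import math

def fit_sinusoid(samples, freq, fs):
    """Least-squares fit of x[n] ~ a*cos(w*n) + b*sin(w*n), w = 2*pi*freq/fs."""
    w = 2.0 * math.pi * freq / fs
    cc = ss = cs = xc = xs = 0.0
    for n, x in enumerate(samples):
        c, s = math.cos(w * n), math.sin(w * n)
        cc += c * c; ss += s * s; cs += c * s   # normal-equation matrix
        xc += x * c; xs += x * s                # right-hand side
    det = cc * ss - cs * cs
    a = (xc * ss - xs * cs) / det
    b = (xs * cc - xc * cs) / det
    return a, b  # amplitude = hypot(a, b), phase = atan2(-b, a)
```

eaQHM extends this idea with basis functions that track the signal's local amplitude and frequency trajectories, which is why it excels when the analysis window is long enough for that adaptivity to matter.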
[AI-127] Predicting Tuberculosis from Real-World Cough Audio Recordings and Metadata
【Quick Read】: This paper addresses the difficulty of early tuberculosis (TB) detection, particularly improving screening efficiency and accuracy in resource-limited settings. Traditional diagnosis relies on clinical examination and laboratory tests, which are costly and have limited coverage, and coughing alone is hard to attribute to TB versus other respiratory diseases. The key idea is to use large-scale cough audio collected via mobile phones together with statistical classifiers to distinguish TB from non-TB coughs: recordings from south-east Africa, India, and south-east Asia were gathered with a fully automated phone app (Hyfe) without manual annotation, and time- and frequency-domain features were extracted. Stratified grouped cross-validation shows that cough sounds alone achieve an average AUC of roughly 0.70 ± 0.05, rising to roughly 0.81 ± 0.05 when demographic and clinical information is added. This suggests that mHealth tools combining cough acoustics with clinical data could help community health workers run active TB case-finding more efficiently, reducing costs and improving public-health interventions.
Link: https://arxiv.org/abs/2307.04842
Authors: George P. Kafentzis, Stephane Tetsing, Joe Brew, Lola Jover, Mindaugas Galvosas, Carlos Chaccour, Peter M. Small
Affiliation: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Tuberculosis (TB) is an infectious disease caused by the bacterium Mycobacterium tuberculosis and primarily affects the lungs, as well as other body parts. TB is spread through the air when an infected person coughs, sneezes, or talks. Medical doctors diagnose TB in patients via clinical examinations and specialized tests. However, coughing is a common symptom of respiratory diseases such as TB. Literature suggests that cough sounds coming from different respiratory diseases can be distinguished by both medical doctors and computer algorithms. Therefore, cough recordings associated with patients with and without TB seems to be a reasonable avenue of investigation. In this work, we utilize a very large dataset of TB and non-TB cough audio recordings obtained from the south-east of Africa, India, and the south-east of Asia using a fully automated phone-based application (Hyfe), without manual annotation. We fit statistical classifiers based on spectral and time domain features with and without clinical metadata. A stratified grouped cross-validation approach shows that an average Area Under Curve (AUC) of approximately 0.70 \pm 0.05 both for a cough-level and a participant-level classification can be achieved using cough sounds alone. The addition of demographic and clinical factors increases performance, resulting in an average AUC of approximately 0.81 \pm 0.05. Our results suggest mobile phone-based applications that integrate clinical symptoms and cough sound analysis could help community health workers and, most importantly, health service programs to improve TB case-finding efforts while reducing costs, which could substantially improve public health.
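The AUC values reported here can be computed directly from classifier scores as the Mann-Whitney probability that a randomly chosen TB cough outscores a randomly chosen non-TB cough. A minimal sketch of that rank-based computation:

```python
def auc(pos_scores, neg_scores):
    """AUC = P(score_pos > score_neg) + 0.5 * P(tie), over all pairs."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))
```

This pairwise form makes the paper's two evaluation levels concrete: cough-level AUC pairs individual recordings, while participant-level AUC pairs per-participant aggregate scores.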
Machine Learning
[LG-0] Learning Demographic-Conditioned Mobility Trajectories with Aggregate Supervision
Link: https://arxiv.org/abs/2603.03275
Authors: Jessie Z. Li, Zhiqing Hong, Toru Shirakawa, Serina Chang
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Human mobility trajectories are widely studied in public health and social science, where different demographic groups exhibit significantly different mobility patterns. However, existing trajectory generation models rarely capture this heterogeneity because most trajectory datasets lack demographic labels. To address this gap in data, we propose ATLAS, a weakly supervised approach for demographic-conditioned trajectory generation using only (i) individual trajectories without demographic labels, (ii) region-level aggregated mobility features, and (iii) region-level demographic compositions from census data. ATLAS trains a trajectory generator and fine-tunes it so that simulated mobility matches observed regional aggregates while conditioning on demographics. Experiments on real trajectory data with demographic labels show that ATLAS substantially improves demographic realism over baselines (JSD \downarrow 12%–69%) and closes much of the gap to strongly supervised training. We further develop theoretical analyses for when and why ATLAS works, identifying key factors including demographic diversity across regions and the informativeness of the aggregate feature, paired with experiments demonstrating the practical implications of our theory. We release our code at this https URL.
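The JSD improvements quoted above refer to the standard base-2 Jensen-Shannon divergence between generated and observed demographic distributions. A minimal sketch of that metric:

```python
import math

def jsd(p, q):
    """Base-2 Jensen-Shannon divergence between two discrete distributions.

    Bounded in [0, 1]; 0 iff p == q, 1 for disjoint supports.
    """
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because JSD is symmetric and bounded, a relative reduction of 12%-69% translates directly into generated mobility whose demographic composition sits measurably closer to the census-derived target.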
[LG-1] Gravity Falls: A Comparative Analysis of Domain-Generation Algorithm (DGA) Detection Methods for Mobile Device Spearphishing
Link: https://arxiv.org/abs/2603.03270
Authors: Adam Dorian Wong, John D. Hastings
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: Disclaimer: The views expressed are those of the authors and do not necessarily reflect the official policy or position of the U.S. Department of Defense or the U.S. Government. References to external sites do not constitute endorsement. Cleared for release on 24 FEB 2026 (DOPSR 26-T-0771). Gravity Falls Dataset DOI: https://doi.org/10.5281/zenodo.17624554
Click to view abstract
Abstract:Mobile devices are frequent targets of eCrime threat actors through SMS spearphishing (smishing) links that leverage Domain Generation Algorithms (DGA) to rotate hostile infrastructure. Despite this, DGA research and evaluation largely emphasize malware C2 and email phishing datasets, leaving limited evidence on how well detectors generalize to smishing-driven domain tactics outside enterprise perimeters. This work addresses that gap by evaluating traditional and machine-learning DGA detectors against Gravity Falls, a new semi-synthetic dataset derived from smishing links delivered between 2022 and 2025. Gravity Falls captures a single threat actor’s evolution across four technique clusters, shifting from short randomized strings to dictionary concatenation and themed combo-squatting variants used for credential theft and fee/fine fraud. Two string-analysis approaches (Shannon entropy and Exp0se) and two ML-based detectors (an LSTM classifier and COSSAS DGAD) are assessed using Top-1M domains as benign baselines. Results are strongly tactic-dependent: performance is highest on randomized-string domains but drops on dictionary concatenation and themed combo-squatting, with low recall across multiple tool/cluster pairings. Overall, both traditional heuristics and recent ML detectors are ill-suited for consistently evolving DGA tactics observed in Gravity Falls, motivating more context-aware approaches and providing a reproducible benchmark for future evaluation.
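The Shannon-entropy baseline evaluated here scores a domain label by the bits-per-character of its character distribution. A minimal sketch, which also shows why it flags randomized strings but misses dictionary concatenations:

```python
import math
from collections import Counter

def shannon_entropy(domain):
    """Bits per character of the domain string's character distribution."""
    counts = Counter(domain)
    total = len(domain)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Randomized DGA labels spread mass over many characters and score high;
# dictionary words repeat common letters and score lower, which is the
# tactic-dependence the Gravity Falls results expose.
```

This is consistent with the paper's finding: entropy-style detectors do well on the actor's early randomized-string cluster but degrade on the later dictionary and combo-squatting clusters.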
[LG-2] Physics-informed post-processing of stabilized finite element solutions for transient convection-dominated problems
Link: https://arxiv.org/abs/2603.03259
Authors: Süleyman Cengizci, Ömür Uğur, Srinivasan Natesan
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:The numerical simulation of convection-dominated transient transport phenomena poses significant computational challenges due to sharp gradients and propagating fronts across the spatiotemporal domain. Classical discretization methods often generate spurious oscillations, requiring advanced stabilization techniques. However, even stabilized finite element methods may require additional regularization to accurately resolve localized steep layers. On the other hand, standalone physics-informed neural networks (PINNs) struggle to capture sharp solution structures in convection-dominated regimes and typically require a large number of training epochs. This work presents a hybrid computational framework that extends the PINN-Augmented SUPG with Shock-Capturing (PASSC) methodology from steady to unsteady problems. The approach combines a semi-discrete stabilized finite element method with a PINN-based correction strategy for transient convection-diffusion-reaction equations. Stabilization is achieved using the Streamline-Upwind Petrov-Galerkin (SUPG) formulation augmented with a YZbeta shock-capturing operator. Rather than training over the entire space-time domain, the neural network is applied selectively near the terminal time, enhancing the finite element solution using the last K_s temporal snapshots while enforcing residual constraints from the governing equations and boundary conditions. The network incorporates residual blocks with random Fourier features and employs progressive training with adaptive loss weighting. Numerical experiments on five benchmark problems, including boundary and interior layers, traveling waves, and nonlinear Burgers dynamics, demonstrate significant accuracy improvements at the terminal time compared to standalone stabilized finite element solutions.
[LG-3] Speculative Speculative Decoding
Link: https://arxiv.org/abs/2603.03251
Authors: Tanishq Kumar, Tri Dao, Avner May
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependence between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
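The sequential speculate-then-verify dependence that SSD parallelizes is easiest to see in the baseline it builds on. The sketch below is a greedy-deterministic toy of standard speculative decoding (illustrative token functions, not the paper's system; real verifiers batch the target calls into one forward pass):

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding: draft proposes k tokens, target verifies.

    With deterministic (greedy) models, the output is identical to plain
    autoregressive decoding with `target_next`; only the call pattern changes.
    """
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        # Speculate: draft k tokens sequentially with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Verify: accept the longest agreeing prefix, then take the
        # target's correction at the first mismatch.
        for tok in draft:
            correct = target_next(seq)
            seq.append(correct)
            if tok != correct or len(seq) == len(prompt) + n_tokens:
                break
    return seq

# Toy integer-token models: the target counts by 2; the draft agrees
# except on multiples of 10, forcing occasional rejections.
target = lambda s: s[-1] + 2
draft = lambda s: s[-1] + 2 if s[-1] % 10 else s[-1] + 3
```

In this loop, drafting for the next round cannot start until verification of the current round finishes; SSD's contribution is to pre-draft against predicted verification outcomes so that dependence disappears.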
[LG-4] On Geometry Regularization in Autoencoder Reduced-Order Models with Latent Neural ODE Dynamics
Link: https://arxiv.org/abs/2603.03238
Authors: Mikhail Osipov
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*Comments: 25 pages, 2 figures, 3 tables
Click to view abstract
Abstract:We investigate geometric regularization strategies for learned latent representations in encoder–decoder reduced-order models. In a fixed experimental setting for the advection–diffusion–reaction (ADR) equation, we model latent dynamics using a neural ODE and evaluate four regularization approaches applied during autoencoder pre-training: (a) near-isometry regularization of the decoder Jacobian, (b) a stochastic decoder gain penalty based on random directional gains, (c) a second-order directional curvature penalty, and (d) Stiefel projection of the first decoder layer. Across multiple seeds, we find that (a)–(c) often produce latent representations that make subsequent latent-dynamics training with a frozen autoencoder more difficult, especially for long-horizon rollouts, even when they improve local decoder smoothness or related sensitivity proxies. In contrast, (d) consistently improves conditioning-related diagnostics of the learned latent dynamics and tends to yield better rollout performance. We discuss the hypothesis that, in this setting, the downstream impact of latent-geometry mismatch outweighs the benefits of improved decoder smoothness.
[LG-5] Guiding Sparse Neural Networks with Neurobiological Principles to Elicit Biologically Plausible Representations
Link: https://arxiv.org/abs/2603.03234
Authors: Patrick Inoue, Florian Röhrbein, Andreas Knoblauch
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:While deep neural networks (DNNs) have achieved remarkable performance in tasks such as image recognition, they often struggle with generalization, learning from few examples, and continuous adaptation - abilities inherent in biological neural systems. These challenges arise due to DNNs’ failure to emulate the efficient, adaptive learning mechanisms of biological networks. To address these issues, we explore the integration of neurobiologically inspired assumptions in neural network learning. This study introduces a biologically inspired learning rule that naturally integrates neurobiological principles, including sparsity, lognormal weight distributions, and adherence to Dale’s law, without requiring explicit enforcement. By aligning with these core neurobiological principles, our model enhances robustness against adversarial attacks and demonstrates superior generalization, particularly in few-shot learning scenarios. Notably, integrating these constraints leads to the emergence of biologically plausible neural representations, underscoring the efficacy of incorporating neurobiological assumptions into neural network design. Preliminary results suggest that this approach could extend from feature-specific to task-specific encoding, potentially offering insights into neural resource allocation for complex tasks.
[LG-6] Inverse Reconstruction of Shock Time Series from Shock Response Spectrum Curves using Machine Learning
Link: https://arxiv.org/abs/2603.03229
Authors: Adam Watts (1), Andrew Jeon (1), Destry Newton (1), Ryan Bowering (2) ((1) Los Alamos National Laboratory, (2) University of Rochester)
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: Extended journal-style manuscript. 27 pages, 13 figures
Click to view abstract
Abstract:The shock response spectrum (SRS) is widely used to characterize the response of single-degree-of-freedom (SDOF) systems to transient accelerations. Because the mapping from acceleration time history to SRS is nonlinear and many-to-one, reconstructing time-domain signals from a target spectrum is inherently ill-posed. Conventional approaches address this problem through iterative optimization, typically representing signals as sums of exponentially decayed sinusoids, but these methods are computationally expensive and constrained by predefined basis functions. We propose a conditional variational autoencoder (CVAE) that learns a data-driven inverse mapping from SRS to acceleration time series. Once trained, the model generates signals consistent with prescribed target spectra without requiring iterative optimization. Experiments demonstrate improved spectral fidelity relative to classical techniques, strong generalization to unseen spectra, and inference speeds three to six orders of magnitude faster. These results establish deep generative modeling as a scalable and efficient approach for inverse SRS reconstruction. Report number: LA-UR-26-21375.
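The forward map the CVAE inverts can be approximated by stepping an SDOF oscillator at each natural frequency and recording its peak absolute acceleration. A simplified base-excitation sketch using semi-implicit Euler integration (illustrative; production SRS codes use the standard ramp-invariant recursive filter):

```python
import math

def srs(accel, dt, freqs, damping=0.05):
    """Peak absolute-acceleration response of one SDOF oscillator per freq.

    Integrates z'' + 2*zeta*w*z' + w^2*z = -a(t) for relative motion z,
    then reports max |absolute acceleration| = max |2*zeta*w*z' + w^2*z|.
    """
    out = []
    for f in freqs:
        w = 2.0 * math.pi * f
        z = v = 0.0          # relative displacement / velocity
        peak = 0.0
        for a in accel:
            v += (-a - 2.0 * damping * w * v - w * w * z) * dt
            z += v * dt
            abs_acc = -(2.0 * damping * w * v + w * w * z)
            peak = max(peak, abs(abs_acc))
        out.append(peak)
    return out
```

Because each spectral ordinate is a maximum over the whole response, many distinct time histories share one SRS, which is exactly the many-to-one structure that makes the inverse problem ill-posed.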
[LG-7] Coalgebras for categorical deep learning: Representability and universal approximation
链接: https://arxiv.org/abs/2603.03227
作者: Dragan Mašulović
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Categorical deep learning (CDL) has recently emerged as a framework that leverages category theory to unify diverse neural architectures. While geometric deep learning (GDL) is grounded in the specific context of invariants of group actions, CDL aims to provide domain-independent abstractions for reasoning about models and their properties. In this paper, we contribute to this program by developing a coalgebraic foundation for equivariant representation in deep learning, as classical notions of group actions and equivariant maps are naturally generalized by the coalgebraic formalism. Our first main result demonstrates that, given an embedding of data sets formalized as a functor from SET to VECT, and given a notion of invariant behavior on data sets modeled by an endofunctor on SET, there is a corresponding endofunctor on VECT that is compatible with the embedding in the sense that this lifted functor recovers the analogous notion of invariant behavior on the embedded data. Building on this foundation, we then establish a universal approximation theorem for equivariant maps in this generalized setting. We show that continuous equivariant functions can be approximated within our coalgebraic framework for a broad class of symmetries. This work thus provides a categorical bridge between the abstract specification of invariant behavior and its concrete realization in neural architectures.
[LG-8] Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective ICLR2026
链接: https://arxiv.org/abs/2603.03226
作者: Enea Monzio Compagnoni,Alessandro Stanghellini,Rustem Islamov,Aurelien Lucchi,Anastasiia Koloskova
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at ICLR 2026 (Poster)
点击查看摘要
Abstract:Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges at a Privacy-Utility Trade-Off of $\mathcal{O}(1/\varepsilon^2)$ with speed independent of $\varepsilon$, while DP-SignSGD converges at a speed linear in $\varepsilon$ with an $\mathcal{O}(1/\varepsilon)$ trade-off, dominating in high-privacy or large batch noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of DP-SignSGD is essentially $\varepsilon$-independent. This makes adaptive methods far more practical, as their hyperparameters transfer across privacy levels with little or no re-tuning. Empirical results confirm our theory across training and test metrics, and empirically extend from DP-SignSGD to DP-Adam.
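For reference, the per-example clipping mechanism both analyses assume can be sketched as follows. This is a generic illustration of one DP-SGD update and its sign-based variant, not the paper's experimental code; the toy gradients and hyperparameters are made up.

```python
import math
import random

def private_mean_grad(per_example_grads, clip, sigma, rng):
    """Privatized mean gradient: clip each per-example gradient to L2 norm
    `clip`, average, then add Gaussian noise scaled to sensitivity clip/n."""
    n = len(per_example_grads)
    d = len(per_example_grads[0])
    mean = [0.0] * d
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip / max(norm, 1e-12))
        for j in range(d):
            mean[j] += g[j] * scale / n
    return [m + rng.gauss(0.0, sigma * clip / n) for m in mean]

def dp_sgd_step(params, grads, lr, clip, sigma, rng):
    g = private_mean_grad(grads, clip, sigma, rng)
    return [p - lr * gj for p, gj in zip(params, g)]

def dp_signsgd_step(params, grads, lr, clip, sigma, rng):
    # Adaptive variant: only the sign of the privatized mean is used
    g = private_mean_grad(grads, clip, sigma, rng)
    return [p - lr * math.copysign(1.0, gj) for p, gj in zip(params, g)]

rng = random.Random(0)
grads = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0]]   # per-example gradients (toy)
w_sgd = dp_sgd_step([0.0, 0.0], grads, lr=0.1, clip=1.0, sigma=1.0, rng=rng)
w_sign = dp_signsgd_step([0.0, 0.0], grads, lr=0.1, clip=1.0, sigma=1.0, rng=rng)
```

The sign step makes each coordinate update a fixed magnitude `lr`, which is why its effective learning rate is largely decoupled from the noise level $\varepsilon$ controls.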
[LG-9] I-CAM-UV: Integrating Causal Graphs over Non-Identical Variable Sets Using Causal Additive Models with Unobserved Variables AAAI2026
链接: https://arxiv.org/abs/2603.03207
作者: Hirofumi Suzuki,Kentaro Kanamori,Takuya Takagi,Thong Pham,Takashi Nicholas Maeda,Shohei Shimizu
类目: Machine Learning (cs.LG)
*备注: 16 pages, 22 figures, to appear in the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
点击查看摘要
Abstract:Causal discovery from observational data is a fundamental tool in various fields of science. While existing approaches are typically designed for a single dataset, we often need to handle multiple datasets with non-identical variable sets in practice. One straightforward approach is to estimate a causal graph from each dataset and construct a single causal graph by overlapping. However, this approach identifies limited causal relationships because unobserved variables in each dataset can be confounders, and some variable pairs may be unobserved in any dataset. To address this issue, we leverage Causal Additive Models with Unobserved Variables (CAM-UV) that provide causal graphs having information related to unobserved variables. We show that the ground truth causal graph has structural consistency with the information of CAM-UV on each dataset. As a result, we propose an approach named I-CAM-UV to integrate CAM-UV results by enumerating all consistent causal graphs. We also provide an efficient combinatorial search algorithm and demonstrate the usefulness of I-CAM-UV against existing methods.
[LG-10] Infinite dimensional generative sensing
链接: https://arxiv.org/abs/2603.03196
作者: Paolo Angella,Vito Paolo Pastore,Matteo Santacesaria
类目: Numerical Analysis (math.NA); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Probability (math.PR)
*备注:
点击查看摘要
Abstract:Deep generative models have become a standard for modeling priors for inverse problems, going beyond classical sparsity-based methods. However, existing theoretical guarantees are mostly confined to finite-dimensional vector spaces, creating a gap when the physical signals are modeled as functions in Hilbert spaces. This work presents a rigorous framework for generative compressed sensing in Hilbert spaces. We extend the notion of local coherence in an infinite-dimensional setting, to derive optimal, resolution-independent sampling distributions. Thanks to a generalization of the Restricted Isometry Property, we show that stable recovery holds when the number of measurements is proportional to the prior’s intrinsic dimension (up to logarithmic factors), independent of the ambient dimension. Finally, numerical experiments on the Darcy flow equation validate our theoretical findings and demonstrate that in severely undersampled regimes, employing lower-resolution generators acts as an implicit regularizer, improving reconstruction stability.
[LG-11] Less Noise Same Certificate: Retain Sensitivity for Unlearning
链接: https://arxiv.org/abs/2603.03172
作者: Carolin Heinzler,Kasra Malihi,Amartya Sanyal
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Certified machine unlearning aims to provably remove the influence of a deletion set $U$ from a model trained on a dataset $S$, by producing an unlearned output that is statistically indistinguishable from retraining on the retain set $R := S \setminus U$. Many existing certified unlearning methods adapt techniques from Differential Privacy (DP) and add noise calibrated to global sensitivity, i.e., the worst-case output change over all adjacent datasets. We show that this DP-style calibration is often overly conservative for unlearning, based on a key observation: certified unlearning, by definition, does not require protecting the privacy of the retained data $R$. Motivated by this distinction, we define retain sensitivity as the worst-case output change over deletions $U$ while keeping $R$ fixed. While insufficient for DP, retain sensitivity is exactly sufficient for unlearning, allowing for the same certificates with less noise. We validate these reductions in noise theoretically and empirically across several problems, including the weight of minimum spanning trees, PCA, and ERM. Finally, we refine the analysis of two widely used certified unlearning algorithms through the lens of retain sensitivity, leveraging the regularity induced by $R$ to further reduce noise and improve utility.
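The noise-saving argument is easy to see through the classical Gaussian mechanism, whose noise scale is linear in whichever sensitivity bound is plugged in. This is a textbook sketch with made-up sensitivity numbers, not the paper's refined analyses: swapping the global bound for a (hypothetically tighter) retain-sensitivity bound shrinks the noise scale proportionally while the $(\varepsilon, \delta)$ certificate is unchanged.

```python
import math
import random

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    """Classical Gaussian mechanism: release value + N(0, sigma^2) with
    sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps. The noise scale
    is linear in the sensitivity bound that is supplied."""
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    return value + rng.gauss(0.0, sigma), sigma

rng = random.Random(0)
# Same certificate target, two calibrations: a global sensitivity bound
# of 1.0 versus a hypothetical tighter retain-sensitivity bound of 0.1.
_, sigma_global = gaussian_mechanism(1.0, sensitivity=1.0, eps=1.0, delta=1e-5, rng=rng)
_, sigma_retain = gaussian_mechanism(1.0, sensitivity=0.1, eps=1.0, delta=1e-5, rng=rng)
```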
链接: https://arxiv.org/abs/2603.03135
作者: Dan Stowell
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Many data representations are vectors of continuous values. In particular, deep learning embeddings are data-driven representations, typically either unconstrained in Euclidean space, or constrained to a hypersphere. These may also be translated into integer representations (quantised) for efficient large-scale use. However, the fundamental (and most efficient) numeric representation in the overwhelming majority of existing computers is integers with overflow – and vectors of these integers do not correspond to either of these spaces, but instead to the topology of a (hyper)torus. This mismatch can lead to wasted representation capacity. Here we show that common deep learning frameworks can be adapted, quite simply, to create representations with inherent toroidal topology. We investigate two alternative strategies, demonstrating that a normalisation-based strategy leads to training with desirable stability and performance properties, comparable to a standard hyperspherical L2 normalisation. We also demonstrate that a torus embedding maintains desirable quantisation properties. The torus embedding does not outperform hypersphere embeddings in general, but is comparable, and opens the possibility to train deep embeddings which have an extremely simple pathway to efficient 'TinyML' embedded implementation.
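One simple reading of the normalisation-based strategy (an illustrative sketch only; the paper's exact parameterisation may differ) is to project consecutive activation pairs onto unit circles, so the code lives on a product of circles, i.e. a torus, and distances wrap:

```python
import math

def torus_normalise(z):
    """Interpret consecutive pairs of raw activations as 2D vectors and
    project each onto the unit circle, so a length-2d code lives on the
    d-torus (a product of d circles)."""
    assert len(z) % 2 == 0
    out = []
    for i in range(0, len(z), 2):
        x, y = z[i], z[i + 1]
        r = math.hypot(x, y) or 1.0   # avoid dividing by zero
        out.extend([x / r, y / r])
    return out

def torus_distance(a, b):
    """Geodesic distance on the product of circles: sum of wrapped
    angle gaps, one per circle factor."""
    d = 0.0
    for i in range(0, len(a), 2):
        ta = math.atan2(a[i + 1], a[i])
        tb = math.atan2(b[i + 1], b[i])
        gap = abs(ta - tb) % (2 * math.pi)
        d += min(gap, 2 * math.pi - gap)
    return d

e = torus_normalise([3.0, 4.0, -1.0, 0.0])
```

Like hyperspherical L2 normalisation, this is differentiable almost everywhere, so it can be appended to a standard encoder without changing the training loop.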
[LG-13] Safe and Robust Domains of Attraction for Discrete-Time Systems: A Set-Based Characterization and Certifiable Neural Network Estimation
链接: https://arxiv.org/abs/2603.03082
作者: Mohamed Serry,Maxwell Fitzsimmons,Jun Liu
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Analyzing nonlinear systems with attracting robust invariant sets (RISs) requires estimating their domains of attraction (DOAs). Despite extensive research, accurately characterizing DOAs for general nonlinear systems remains challenging due to both theoretical and computational limitations, particularly in the presence of uncertainties and state constraints. In this paper, we propose a novel framework for the accurate estimation of safe (state-constrained) and robust DOAs for discrete-time nonlinear uncertain systems with continuous dynamics, open safe sets, compact disturbance sets, and uniformly locally $\ell_p$-stable compact RISs. The notion of uniform $\ell_p$ stability is quite general and encompasses, as special cases, uniform exponential and polynomial stability. The DOAs are characterized via newly introduced value functions defined on metric spaces of compact sets. We establish their fundamental mathematical properties and derive the associated Bellman-type (Zubov-type) functional equations. Building on this characterization, we develop a physics-informed neural network (NN) framework to learn the corresponding value functions by embedding the derived Bellman-type equations directly into the training process. To obtain certifiable estimates of the safe robust DOAs from the learned neural approximations, we further introduce a verification procedure that leverages existing formal verification tools. The effectiveness and applicability of the proposed methodology are demonstrated through four numerical examples involving nonlinear uncertain systems subject to state constraints, and its performance is compared with existing methods from the literature.
[LG-14] Step-Level Sparse Autoencoder for Reasoning Process Interpretation
链接: https://arxiv.org/abs/2603.03031
作者: Xuan Yang,Jiayu Liu,Yuhang Lai,Hao Xu,Zhenya Huang,Ning Miao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs’ reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self-verification ability of LLMs. The code is available at this https URL
[LG-15] SEHFS: Structural Entropy-Guided High-Order Correlation Learning for Multi-View Multi-Label Feature Selection
链接: https://arxiv.org/abs/2603.03022
作者: Cheng Peng,Yonghao Li,Wanfu Gao,Jie Wen,Weiping Ding
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, multi-view multi-label learning (MVML) has attracted extensive attention due to its close alignment to real-world scenarios. Information-theoretic methods have gained prominence for learning nonlinear correlations. However, two key challenges persist: first, features in real-world data commonly exhibit high-order structural correlations, but existing information-theoretic methods struggle to learn such correlations; second, commonly relying on heuristic optimization, information-theoretic methods are prone to converging to local optima. To address these two challenges, we propose a novel method called Structural Entropy Guided High-Order Correlation Learning for Multi-View Multi-Label Feature Selection (SEHFS). The core idea of SEHFS is to convert the feature graph into a structural-entropy-minimizing encoding tree, quantifying the information cost of high-order dependencies and thus learning high-order feature correlations beyond pairwise correlations. Specifically, features exhibiting strong high-order redundancy are grouped into a single cluster within the encoding tree, while inter-cluster feature correlations are minimized, thereby eliminating redundancy both within and across clusters. Furthermore, a new framework based on the fusion of information theory and matrix methods is adopted, which learns a shared semantic matrix and view-specific contribution matrices to reconstruct a global view matrix, thereby enhancing the information-theoretic method and balancing the global and local optimization. The ability of structural entropy to learn high-order correlations is theoretically established, and both experiments on eight datasets from various domains and ablation studies demonstrate that SEHFS achieves superior performance in feature selection.
[LG-16] Breaking the Prototype Bias Loop: Confidence-Aware Federated Contrastive Learning for Highly Imbalanced Clients
链接: https://arxiv.org/abs/2603.03007
作者: Tian-Shuang Wu,Shen-Huan Lyu,Ning Chen,Yi-Xiao He,Bing Tang,Baoliu Ye,Qingfu Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Local class imbalance and data heterogeneity across clients often trap prototype-based federated contrastive learning in a prototype bias loop: biased local prototypes induced by imbalanced data are aggregated into biased global prototypes, which are repeatedly reused as contrastive anchors, accumulating errors across communication rounds. To break this loop, we propose Confidence-Aware Federated Contrastive Learning (CAFedCL), a novel framework that improves the prototype aggregation mechanism and strengthens the contrastive alignment guided by prototypes. CAFedCL employs a confidence-aware aggregation mechanism that leverages predictive uncertainty to downweight high-variance local prototypes. In addition, generative augmentation for minority classes and geometric consistency regularization are integrated to stabilize the structure between classes. From a theoretical perspective, we provide an expectation-based analysis showing that our aggregation reduces estimation variance, thereby bounding global prototype drift and ensuring convergence. Extensive experiments under varying levels of class imbalance and data heterogeneity demonstrate that CAFedCL consistently outperforms representative federated baselines in both accuracy and client fairness.
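The confidence-aware aggregation step can be sketched as inverse-variance weighting of local class prototypes. This is our illustrative reading of the mechanism, with made-up numbers; the paper's actual uncertainty estimate and weighting scheme may differ.

```python
def aggregate_prototypes(local_protos, variances, eps=1e-8):
    """Inverse-variance weighted prototype aggregation: clients whose
    class prototype comes with high predictive variance (e.g. due to
    local class imbalance) are downweighted in the global prototype."""
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    d = len(local_protos[0])
    return [sum(w * p[j] for w, p in zip(weights, local_protos)) / total
            for j in range(d)]

# Two clients: client 0 is confident (low variance), client 1 is not
g = aggregate_prototypes([[1.0, 0.0], [3.0, 2.0]], variances=[0.1, 0.9])
```

The global prototype lands much closer to the confident client's estimate, which is the intended effect when high variance signals imbalance-induced bias.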
[LG-17] On the Topology of Neural Network Superlevel Sets
链接: https://arxiv.org/abs/2603.02973
作者: Bahman Gharesifard
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:We show that neural networks with activations satisfying a Riccati-type ordinary differential equation condition, an assumption arising in recent universal approximation results in the uniform topology, produce Pfaffian outputs on analytic domains with format controlled only by the architecture. Consequently, superlevel sets, as well as Lie bracket rank drop loci for neural network parameterized vector fields, admit architecture-only bounds on topological complexity, in particular on total Betti numbers, uniformly over all weights.
[LG-18] LAGO: A Local-Global Optimization Framework Combining Trust Region Methods and Bayesian Optimization
链接: https://arxiv.org/abs/2603.02970
作者: Eliott Van Dieren,Tommaso Vanzan,Fabio Nobile
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 22 pages, 8 figures
点击查看摘要
Abstract:We introduce LAGO, a LocAl-Global Optimization algorithm that combines gradient-enhanced Bayesian Optimization (BO) with gradient-based trust region local refinement through an adaptive competition mechanism. At each iteration, global and local optimization strategies independently propose candidate points, and the next evaluation is selected based on predicted improvement. LAGO separates global exploration from local refinement at the proposal level: the BO acquisition function is optimized outside the active trust region, while local function and gradient evaluations are incorporated into the global gradient-enhanced Gaussian process only when they satisfy a lengthscale-based minimum-distance criterion, reducing the risk of numerical instability during the local exploitation. This enables efficient local refinement when reaching promising regions, without sacrificing a global search of the design space. As a result, the method achieves an improved exploration of the full design space compared to standard non-linear local optimization algorithms for smooth functions, while maintaining fast local convergence in regions of interest.
[LG-19] Integrating Homomorphic Encryption and Synthetic Data in FL for Privacy and Learning Quality
链接: https://arxiv.org/abs/2603.02969
作者: Yenan Wang,Carla Fabiana Chiasserini,Elad Michael Schiller
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated learning (FL) enables collaborative training of machine learning models without sharing sensitive client data, making it a cornerstone for privacy-critical applications. However, FL faces the dual challenge of ensuring learning quality and robust privacy protection while keeping resource consumption low, particularly when using computationally expensive techniques such as homomorphic encryption (HE). In this work, we enhance an FL process that preserves privacy using HE by integrating it with synthetic data generation and an interleaving strategy. Specifically, our solution, named Alternating Federated Learning (Alt-FL), consists of alternating between local training with authentic data (authentic rounds) and local training with synthetic data (synthetic rounds) and transferring the encrypted and plaintext model parameters on authentic and synthetic rounds (resp.). Our approach improves learning quality (e.g., model accuracy) through datasets enhanced with synthetic data, preserves client data privacy via HE, and keeps manageable encryption and decryption costs through our interleaving strategy. We evaluate our solution against data leakage attacks, such as the DLG attack, demonstrating robust privacy protection. Also, Alt-FL provides 13.4% higher model accuracy and decreases HE-related costs by up to 48% with respect to Selective HE.
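The interleaving strategy can be summarised with a simple round planner. This is a schematic sketch of the alternation described in the abstract (names and the strict 1:1 alternation are our assumptions, not the paper's protocol specification):

```python
def altfl_plan(num_rounds, start="authentic"):
    """Alternate authentic-data rounds (model exchanged under homomorphic
    encryption) with synthetic-data rounds (plaintext exchange). Roughly
    halving the number of encrypted rounds is where the HE cost saving
    comes from."""
    kinds = ("authentic", "synthetic") if start == "authentic" else ("synthetic", "authentic")
    plan = [kinds[t % 2] for t in range(num_rounds)]
    encrypted = sum(1 for k in plan if k == "authentic")
    return plan, encrypted

plan, n_enc = altfl_plan(10)
```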
[LG-20] Contextual Latent World Models for Offline Meta Reinforcement Learning
链接: https://arxiv.org/abs/2603.02935
作者: Mohammadreza Nakheai,Aidan Scannell,Kevin Luck,Joni Pajarinen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Offline meta-reinforcement learning seeks to learn policies that generalize across related tasks from fixed datasets. Context-based methods infer a task representation from transition histories, but learning effective task representations without supervision remains a challenge. In parallel, latent world models have demonstrated strong self-supervised representation learning through temporal consistency. We introduce contextual latent world models, which condition latent world models on inferred task representations and train them jointly with the context encoder. This enforces task-conditioned temporal consistency, yielding task representations that capture task-dependent dynamics rather than merely discriminating between tasks. Our method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual-DeepMind Control, and Meta-World benchmarks.
[LG-21] Towards Accurate and Interpretable Time-series Forecasting: A Polynomial Learning Approach
链接: https://arxiv.org/abs/2603.02906
作者: Bo Liu,Shao-Bo Lin,Changmiao Wang,Xiaotong Liu
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Time series forecasting enables early warning and has driven asset performance management from traditional planned maintenance to predictive maintenance. However, the lack of interpretability in forecasting methods undermines users’ trust and complicates debugging for developers. Consequently, interpretable time-series forecasting has attracted increasing research attention. Nevertheless, existing methods suffer from several limitations, including insufficient modeling of temporal dependencies, lack of feature-level interpretability to support early warning, and difficulty in simultaneously achieving the accuracy and interpretability. This paper proposes the interpretable polynomial learning (IPL) method, which integrates interpretability into the model structure by explicitly modeling original features and their interactions of arbitrary order through polynomial representations. This design preserves temporal dependencies, provides feature-level interpretability, and offers a flexible trade-off between prediction accuracy and interpretability by adjusting the polynomial degree. We evaluate IPL on simulated and Bitcoin price data, showing that it achieves high prediction accuracy with superior interpretability compared with widely used explainability methods. Experiments on field-collected antenna data further demonstrate that IPL yields simpler and more efficient early warning mechanisms.
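The explicit-monomial idea at the heart of IPL can be sketched with a plain polynomial feature expansion, where every coefficient attaches to a named feature interaction. This is illustrative only; the paper's temporal model and fitting procedure are not reproduced here.

```python
from itertools import combinations_with_replacement

def poly_features(x, degree):
    """Expand a feature vector into explicit monomials up to `degree`.
    Each output value is tagged with a human-readable name (e.g. 'x0*x1'),
    so a linear model over these features is interpretable term by term."""
    names, vals = ["1"], [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            name = "*".join(f"x{i}" for i in idx)
            v = 1.0
            for i in idx:
                v *= x[i]
            names.append(name)
            vals.append(v)
    return names, vals

names, vals = poly_features([2.0, 3.0], degree=2)
```

Raising `degree` trades interpretability (fewer, simpler terms) against accuracy (richer interactions), which mirrors the trade-off knob the abstract describes.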
[LG-22] Distributed Dynamic Invariant Causal Prediction in Environmental Time Series
链接: https://arxiv.org/abs/2603.02902
作者: Ziruo Hao,Tao Yang,Xiaofeng Wu,Bo Hu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The extraction of invariant causal relationships from time series data with environmental attributes is critical for robust decision-making in domains such as climate science and environmental monitoring. However, existing methods either emphasize dynamic causal analysis without leveraging environmental contexts or focus on static invariant causal inference, leaving a gap in distributed temporal settings. In this paper, we propose Distributed Dynamic Invariant Causal Prediction in Time-series (DisDy-ICPT), a novel framework that learns dynamic causal relationships over time while mitigating spatial confounding variables without requiring data communication. We theoretically prove that DisDy-ICPT recovers stable causal predictors within a bounded number of communication rounds under standard sampling assumptions. Empirical evaluations on synthetic benchmarks and environment-segmented real-world datasets show that DisDy-ICPT achieves superior predictive stability and accuracy compared to baseline methods A and B. Our approach offers promising applications in carbon monitoring and weather forecasting. Future work will extend DisDy-ICPT to online learning scenarios.
[LG-23] Embedding interpretable $\ell_1$-regression into neural networks for uncovering temporal structure in cell imaging
链接: https://arxiv.org/abs/2603.02899
作者: Fabian Kabus,Maren Hackenberg,Julia Hindel,Thibault Cholvin,Antje Kilias,Thomas Brox,Abhinav Valada,Marlene Bartos,Harald Binder
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While artificial neural networks excel in unsupervised learning of non-sparse structure, classical statistical regression techniques offer better interpretability, in particular when sparseness is enforced by $\ell_1$ regularization, enabling identification of which factors drive observed dynamics. We investigate how these two types of approaches can be optimally combined, exemplarily considering two-photon calcium imaging data where sparse autoregressive dynamics are to be extracted. We propose embedding a vector autoregressive (VAR) model as an interpretable regression technique into a convolutional autoencoder, which provides dimension reduction for tractable temporal modeling. A skip connection separately addresses non-sparse static spatial information, selectively channeling sparse structure into the $\ell_1$-regularized VAR. $\ell_1$-estimation of regression parameters is enabled by differentiating through the piecewise linear solution path. This is contrasted with approaches where the autoencoder does not adapt to the VAR model. Having an embedded statistical model also enables a testing approach for comparing temporal sequences from the same observational unit. Additionally, contribution maps visualize which spatial regions drive the learned dynamics.
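The $\ell_1$-regularized VAR building block can be illustrated in isolation with a coordinate-descent lasso on lagged values. This is a self-contained toy; the paper additionally embeds the VAR in a convolutional autoencoder and differentiates through the solution path, which this sketch does not attempt.

```python
def soft_threshold(x, lam):
    """Proximal map of the l1 penalty."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_var1(series, lam, iters=200):
    """Sparse VAR(1) by coordinate descent: regress x_t on x_{t-1} with an
    l1 penalty, giving an interpretable, sparse transition matrix A."""
    X, Y = series[:-1], series[1:]
    d = len(series[0])
    A = [[0.0] * d for _ in range(d)]
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(d)]
    for _ in range(iters):
        for i in range(d):          # output dimension
            for j in range(d):      # coefficient A[i][j]
                # correlation of column j with the partial residual
                rho = sum(X[t][j] * (Y[t][i]
                          - sum(A[i][k] * X[t][k] for k in range(d) if k != j))
                          for t in range(len(X)))
                A[i][j] = soft_threshold(rho, lam) / (col_sq[j] or 1.0)
    return A

# Toy series generated by the diagonal transition matrix diag(0.9, 0.5)
series = [[1.0, 1.0]]
for _ in range(30):
    x, y = series[-1]
    series.append([0.9 * x, 0.5 * y])
A = lasso_var1(series, lam=1e-3)
```

On this noiseless toy the recovered matrix is close to diag(0.9, 0.5) with near-zero off-diagonal entries, which is the kind of sparse, readable dynamics the embedded VAR is meant to expose.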
[LG-24] Learning in Markov Decision Processes with Exogenous Dynamics
链接: https://arxiv.org/abs/2603.02862
作者: Davide Maran,Davide Salaorni,Marcello Restelli
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Reinforcement learning algorithms are typically designed for generic Markov Decision Processes (MDPs), where any state-action pair can lead to an arbitrary transition distribution. In many practical systems, however, only a subset of the state variables is directly influenced by the agent’s actions, while the remaining components evolve according to exogenous dynamics and account for most of the stochasticity. In this work, we study a structured class of MDPs characterized by exogenous state components whose transitions are independent of the agent’s actions. We show that exploiting this structure yields significantly improved learning guarantees, with only the size of the exogenous state space appearing in the leading terms of the regret bounds. We further establish a matching lower bound, showing that this dependence is information-theoretically optimal. Finally, we empirically validate our approach across classical toy settings and real-world-inspired environments, demonstrating substantial gains in sample efficiency compared to standard reinforcement learning methods.
[LG-25] Adapting Time Series Foundation Models through Data Mixtures
链接: https://arxiv.org/abs/2603.02840
作者: Thomas L. Lee,Edoardo M. Ponti,Amos Storkey
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint, 8 pages
点击查看摘要
Abstract:Time series foundation models (TSFMs) have become increasingly popular for zero-shot forecasting. However, for a new time series domain not fully covered by the pretraining set, performance can suffer. Therefore, when a practitioner cares about a new domain and has access to a set of related datasets, the question arises: how best to fine-tune a TSFM to improve zero-shot forecasting? A typical approach to this type of problem is to fine-tune a LoRA module on all datasets or separately on each dataset. Tuning a separate module on each dataset allows for the specialisation of the TSFM to different types of data distribution, by selecting differing combinations of per-dataset modules for different time series contexts. However, we find that, using per-dataset modules might not be optimal, since a time series dataset can contain data from several types of distributions, i.e. sub-domains. This can be due to the distribution shifting or having differing distributions for different dimensions of the time series. Hence, we propose MixFT which re-divides the data using Bayesian mixtures into sets that best represent the sub-domains present in the data, and fine-tunes separately on each of these sets. This re-division of the data ensures that each set is more homogeneous, leading to fine-tuned modules focused on specific sub-domains. Our experiments show that MixFT performs better than per-dataset methods and when fine-tuning a single module on all the data. This suggests that by re-partitioning the data to represent sub-domains we can better specialise TSFMs to improve zero-shot forecasting.
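The re-division step can be sketched with a simple clustering of window-level statistics. We use k-means here as a deterministic stand-in for the Bayesian mixture in MixFT (an assumption for illustration; window features, cluster count, and clustering method are all toy choices), after which each cluster would receive its own fine-tuned module.

```python
import math

def partition_windows(windows, k, iters=20):
    """Re-divide time-series windows into sub-domain clusters (sketch).
    Clusters (mean, std) statistics of each window with k-means; centers
    are seeded from the sorted feature list to keep the toy deterministic."""
    feats = []
    for w in windows:
        m = sum(w) / len(w)
        s = math.sqrt(sum((x - m) ** 2 for x in w) / len(w))
        feats.append((m, s))
    srt = sorted(feats)
    centers = [srt[i * (len(srt) - 1) // max(k - 1, 1)] for i in range(k)]
    assign = [0] * len(feats)
    for _ in range(iters):
        for i, f in enumerate(feats):
            assign[i] = min(range(k),
                            key=lambda c: (f[0] - centers[c][0]) ** 2
                                        + (f[1] - centers[c][1]) ** 2)
        for c in range(k):
            members = [feats[i] for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = (sum(m for m, _ in members) / len(members),
                              sum(s for _, s in members) / len(members))
    return assign

# Two regimes mixed in one "dataset": low-mean and high-mean windows
windows = [[0.0, 0.1, -0.1], [0.2, 0.0, -0.2], [0.1, -0.1, 0.0],
           [5.0, 5.2, 4.8], [5.1, 4.9, 5.0], [4.8, 5.0, 5.2]]
assign = partition_windows(windows, k=2)
```

The point of the example: the per-dataset split would lump both regimes into one fine-tuning set, while the statistic-based re-division separates them cleanly.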
[LG-26] Lattice-based Deep Neural Networks: Regularity and Tailored Regularization
链接: https://arxiv.org/abs/2603.02809
作者: Alexander Keller,Frances Y. Kuo,Dirk Nuyens,Ian H. Sloan
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:This survey article is concerned with the application of lattice rules to Deep Neural Networks (DNNs), lattice rules being a family of quasi-Monte Carlo methods. They have demonstrated effectiveness in various contexts for high-dimensional integration and function approximation. They are extremely easy to implement thanks to their very simple formulation – all that is required is a good integer generating vector of length matching the dimensionality of the problem. In recent years there has been a burst of research activities on the application and theory of DNNs. We review our recent article on using lattice rules as training points for DNNs with a smooth activation function, where we obtained explicit regularity bounds of the DNNs. By imposing restrictions on the network parameters to match the regularity features of the target function, we prove that DNNs with tailored lattice training points can achieve good theoretical generalization error bounds, with implied constants independent of the input dimension. We also demonstrate numerically that DNNs trained with our tailored regularization perform significantly better than with standard $\ell_2$ regularization.
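A rank-1 lattice rule point set, the object the survey builds on, takes one line to generate from its integer generating vector. This is the standard construction; the generating vector below is a toy choice for illustration, not one from the paper.

```python
def rank1_lattice(n, z):
    """Rank-1 lattice points x_i = frac(i * z / n) for i = 0..n-1 and an
    integer generating vector z -- candidate training points in [0, 1)^d."""
    return [[(i * zj % n) / n for zj in z] for i in range(n)]

# Toy generating vector for an 8-point lattice in dimension 2
pts = rank1_lattice(8, [1, 3])
```

In practice the quality of the point set hinges entirely on choosing a good generating vector `z` for the given `n` and dimension, typically found by a component-by-component search.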
[LG-27] The Price of Robustness: Stable Classifiers Need Overparameterization ICLR2026
链接: https://arxiv.org/abs/2603.02806
作者: Jonas von Berg,Adalbert Fono,Massimiliano Datres,Sohir Maskey,Gitta Kutyniok
类目: Machine Learning (cs.LG)
*备注: 29 pages, 9 figures. Accepted at ICLR 2026
点击查看摘要
Abstract:The relationship between overparameterization, stability, and generalization remains incompletely understood in the setting of discontinuous classifiers. We address this gap by establishing a generalization bound for finite function classes that improves inversely with class stability, defined as the expected distance to the decision boundary in the input domain (margin). Interpreting class stability as a quantifiable notion of robustness, we derive as a corollary a law of robustness for classification that extends the results of Bubeck and Sellke beyond smoothness assumptions to discontinuous functions. In particular, any interpolating model with $p \approx n$ parameters on $n$ data points must be unstable, implying that substantial overparameterization is necessary to achieve high stability. We obtain analogous results for parameterized infinite function classes by analyzing a stronger robustness measure derived from the margin in the codomain, which we refer to as the normalized co-stability. Experiments support our theory: stability increases with model size and correlates with test performance, while traditional norm-based measures remain largely uninformative.
[LG-28] From Heuristic Selection to Automated Algorithm Design: LLMs Benefit from Strong Priors
链接: https://arxiv.org/abs/2603.02792
作者: Qi Huang,Furong Ye,Ananta Shahane,Thomas Bäck,Niki van Stein
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have already been widely adopted for automated algorithm design, demonstrating strong abilities in generating and evolving algorithms across various fields. Existing work has largely focused on examining their effectiveness in solving specific problems, with search strategies primarily guided by adaptive prompt designs. In this paper, through investigating the token-wise attribution of the prompts to LLM-generated algorithmic codes, we show that providing high-quality algorithmic code examples can substantially improve the performance of the LLM-driven optimization. Building upon this insight, we propose leveraging prior benchmark algorithms to guide LLM-driven optimization and demonstrate superior performance on two black-box optimization benchmarks: the pseudo-Boolean optimization suite (pbo) and the black-box optimization suite (bbob). Our findings highlight the value of integrating benchmarking studies to enhance both efficiency and robustness of the LLM-driven black-box optimization methods.
[LG-29] Rethinking Time Series Domain Generalization via Structure-Stratified Calibration
链接: https://arxiv.org/abs/2603.02756
作者: Jinyang Li,Shuhao Mei,Xiaoyu Xiao,Shuhang Li,Ruoxi Yun,Jinbo Sun
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:For time series arising from latent dynamical systems, existing cross-domain generalization methods commonly assume that samples are comparably meaningful within a shared representation space. In real-world settings, however, different datasets often originate from structurally heterogeneous families of dynamical systems, leading to fundamentally distinct feature distributions. Under such circumstances, performing global alignment while neglecting structural differences is highly prone to establishing spurious correspondences and inducing negative transfer. From the new perspective of cross-domain structural correspondence failure, we revisit this problem and propose a structure-stratified calibration framework (SSCF). This approach explicitly distinguishes structurally consistent samples and performs amplitude calibration exclusively within structurally compatible sample clusters, thereby effectively alleviating generalization failures caused by structural incompatibility. Notably, the proposed framework achieves substantial performance improvements through a concise and computationally efficient calibration strategy. Evaluations on 19 public datasets (100.3k samples) demonstrate that SSCF significantly outperforms strong baselines under the zero-shot setting. These results confirm that establishing structural consistency prior to alignment constitutes a more reliable and effective pathway for improving cross-domain generalization of time series governed by latent dynamical systems.
[LG-30] Deep learning-guided evolutionary optimization for protein design
链接: https://arxiv.org/abs/2603.02753
作者: Erik Hartman,Di Tang,Johan Malmström
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: Code available at GitHub
点击查看摘要
Abstract:Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence-function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data-efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of \textit{Streptococcus pneumoniae}. BoGA accelerates the discovery of high-confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available under an MIT license at \href{this https URL}{GitHub}.
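The combination of a genetic-algorithm proposal generator inside a surrogate loop can be illustrated with a toy peptide example. All functions below are hypothetical stand-ins, not BoGA's actual components: candidates are proposed by mutation, ranked by a cheap nearest-neighbour surrogate built from past evaluations, and only the top pick is sent to the expensive oracle.

```python
import numpy as np

rng = np.random.default_rng(1)
AA = list("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard amino acids

def mutate(seq, rate=0.1):
    """GA-style proposal: independent point mutations on a peptide sequence."""
    return "".join(rng.choice(AA) if rng.random() < rate else c for c in seq)

def surrogate_score(seq, history):
    """Toy surrogate: fitness of the nearest previously evaluated sequence."""
    if not history:
        return 0.0
    return min((sum(a != b for a, b in zip(seq, s)), f) for s, f in history)[1]

def oracle(seq):
    """Stand-in for an expensive evaluation (e.g., a binding predictor)."""
    return seq.count("W") + seq.count("Y")

history, best = [], "A" * 12
for _ in range(20):
    cands = [mutate(best) for _ in range(16)]        # GA proposals
    pick = max(cands, key=lambda s: surrogate_score(s, history))
    fit = oracle(pick)                               # only the top pick is evaluated
    history.append((pick, fit))
    if fit > oracle(best):
        best = pick
```

The design point is data efficiency: each round spends one oracle call rather than sixteen, with the surrogate doing the pre-screening.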
[LG-31] The power of small initialization in noisy low-tubal-rank tensor recovery
链接: https://arxiv.org/abs/2603.02729
作者: ZHiyu Liu,Haobo Geng,Xudong Wang,Yandong Tang,Zhi Han,Yao Wang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We study the problem of recovering a low-tubal-rank tensor \mathcal{X}_\star \in \mathbb{R}^{n \times n \times k} from noisy linear measurements under the t-product framework. A widely adopted strategy involves factorizing the optimization variable as \mathcal{U} * \mathcal{U}^\top, where \mathcal{U} \in \mathbb{R}^{n \times R \times k}, followed by applying factorized gradient descent (FGD) to solve the resulting optimization problem. Since the tubal-rank r of the underlying tensor \mathcal{X}_\star is typically unknown, this method often assumes r \le R \le n, a regime known as over-parameterization. However, when the measurements are corrupted by some dense noise (e.g., Gaussian noise), FGD with the commonly used spectral initialization yields a recovery error that grows linearly with the over-estimated tubal-rank R. To address this issue, we show that using a small initialization enables FGD to achieve a nearly minimax optimal recovery error, even when the tubal-rank R is significantly overestimated. Using a four-stage analytic framework, we analyze this phenomenon and establish the sharpest known error bound to date, which is independent of the overestimated tubal-rank R. Furthermore, we provide a theoretical guarantee showing that an easy-to-use early stopping strategy can achieve the best known result in practice. All these theoretical findings are validated through a series of simulations and real-data experiments.
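The matrix analogue of this phenomenon is easy to simulate. The sketch below is an illustrative simplification (a symmetric matrix observed directly under Gaussian noise, not the paper's tensor t-product measurement model): factorized gradient descent from a *small* initialization with an overestimated rank R still recovers the rank-r ground truth well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, R = 20, 2, 8                      # true rank r, overestimated rank R
A = rng.normal(size=(n, r))
X_star = A @ A.T                        # ground-truth low-rank PSD matrix
N = rng.normal(size=(n, n))
Y = X_star + 0.01 * (N + N.T) / 2       # symmetric noisy observation

U = 1e-3 * rng.normal(size=(n, R))      # small initialization
lr = 2e-3
for _ in range(3000):
    U -= lr * 4 * (U @ U.T - Y) @ U     # gradient of ||U U^T - Y||_F^2
rel_err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
```

The small initialization acts as an implicit regularizer: signal directions grow quickly while the R - r redundant directions stay near zero, so the final error is noise-limited rather than growing with R.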
[LG-32] Single Microphone Own Voice Detection based on Simulated Transfer Functions for Hearing Aids
链接: https://arxiv.org/abs/2603.02724
作者: Mathuranathan Mayuravaani,W. Bastiaan Kleijn,Andrew Lensen,Charlotte Sørensen
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increasing device complexity and cost. To enable ML-based OVD without requiring costly transfer-function measurements, we propose a data augmentation strategy based on simulated acoustic transfer functions (ATFs) that expose the model to a wide range of spatial propagation conditions. A transformer-based classifier is first trained on analytically generated ATFs and then progressively fine-tuned using numerically simulated ATFs, transitioning from a rigid-sphere model to a detailed head-and-torso representation. This hierarchical adaptation enabled the model to refine its spatial understanding while maintaining generalization. Experimental results show 95.52% accuracy on simulated head-and-torso test data. Under short-duration conditions, the model maintained 90.02% accuracy with one-second utterances. On real hearing aid recordings, the model achieved 80% accuracy without fine-tuning, aided by lightweight test-time feature compensation. This highlights the model’s ability to generalize from simulated to real-world conditions, demonstrating practical viability and pointing toward a promising direction for future hearing aid design.
[LG-33] An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification
链接: https://arxiv.org/abs/2603.02719
作者: L. Julián Lechuga López,Farah E. Shamout,Tim G. J. Rudner
类目: Machine Learning (cs.LG)
*备注: 33 pages, 14 figures, 8 tables
点击查看摘要
Abstract:As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
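Uncertainty-based selective prediction can be sketched in a few lines: rank predictions by confidence, keep the top fraction, and defer the rest to a human. The toy numbers below are hypothetical and illustrate the well-calibrated case where abstention helps; the paper's point is that miscalibration can invert this picture.

```python
import numpy as np

def selective_accuracy(probs, labels, coverage):
    """Keep the most-confident fraction `coverage`; defer the rest for review."""
    conf = probs.max(axis=1)
    keep = np.argsort(-conf)[: int(len(labels) * coverage)]
    return (probs.argmax(axis=1)[keep] == labels[keep]).mean()

# Well-calibrated case: low confidence coincides with the errors
probs = np.array([[0.95, 0.05], [0.90, 0.10], [0.55, 0.45], [0.45, 0.55]])
labels = np.array([0, 0, 1, 0])
full = selective_accuracy(probs, labels, 1.0)   # both errors counted
half = selective_accuracy(probs, labels, 0.5)   # uncertain errors deferred
```

Here accuracy rises from 0.5 at full coverage to 1.0 at half coverage; under the class-dependent miscalibration the paper describes, the same procedure can instead defer correct predictions and retain wrong ones.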
[LG-34] Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data
链接: https://arxiv.org/abs/2603.02695
作者: Sijie Mai,Shiqin Han,Haifeng Hu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multimodal data encountered in real-world scenarios are typically of low quality, with noisy modalities and missing modalities being typical forms that severely hinder model performance and robustness. However, prior works often handle noisy and missing modalities separately. In contrast, we jointly address missing and noisy modalities to enhance model robustness in low-quality data scenarios. We regard both noisy and missing modalities as a unified low-quality modality problem, and propose a unified modality-quality (UMQ) framework to enhance low-quality representations for multimodal affective computing. Firstly, we train a quality estimator with explicit supervised signals via a rank-guided training strategy that compares the relative quality of different representations by adding a ranking constraint, avoiding training noise caused by inaccurate absolute quality labels. Then, a quality enhancer for each modality is constructed, which uses the sample-specific information provided by other modalities and the modality-specific information provided by the defined modality baseline representation to enhance the quality of unimodal representations. Finally, we propose a quality-aware mixture-of-experts module with particular routing mechanism to enable multiple modality-quality problems to be addressed more specifically. UMQ consistently outperforms state-of-the-art baselines on multiple datasets under the settings of complete, missing, and noisy modalities.
[LG-35] From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
链接: https://arxiv.org/abs/2603.02675
作者: Shuyi Zhou,Zeen Song,Wenwen Qiang,Jiyan Sun,Yao Zhou,Yinlong Liu,Wei Ma
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large Language Models remain vulnerable to adversarial prefix attacks (e.g., ``Sure, here is'') despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within ``fork-in-the-road'' training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
[LG-36] HomeAdam: Adam and AdamW Algorithms Sometimes Go Home to Obtain Better Provable Generalization
链接: https://arxiv.org/abs/2603.02649
作者: Feihu Huang,Guanyi Zhang,Songcan Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 39 pages
点击查看摘要
Abstract:Adam and AdamW are a class of default optimizers for training deep learning models in machine learning. These adaptive algorithms converge faster but generalize worse compared to SGD. In fact, their proven generalization error O(\frac{1}{\sqrt{N}}) is also larger than the O(\frac{1}{N}) of SGD, where N denotes the training sample size. Recently, although some variants of Adam have been proposed to improve its generalization, their improved generalization remains unexplored in theory. To fill this gap, in this paper we restudy the generalization of Adam and AdamW via algorithmic stability, and first prove that Adam and AdamW without square-root (i.e., Adam(W)-srf) have a generalization error O(\frac{\hat\rho^{-2}T}{N}), where T denotes the iteration number and \hat\rho > 0 denotes the smallest element of the second-order momentum plus a small positive number. To improve generalization, we propose a class of efficient clever Adam (i.e., HomeAdam(W)) algorithms via sometimes returning to momentum-based SGD. Moreover, we prove that our HomeAdam(W) algorithms have a smaller generalization error O(\frac{1}{N}) than the O(\frac{\hat\rho^{-2}T}{N}) of Adam(W)-srf, since \hat\rho is generally very small. In particular, it is also smaller than the existing O(\frac{1}{\sqrt{N}}) of Adam(W). Meanwhile, we prove that our HomeAdam(W) algorithms have a faster convergence rate of O(\frac{1}{T^{1/4}}) than the O(\frac{\breve\rho^{-1}}{T^{1/4}}) of Adam(W)-srf, where \breve\rho \leq \hat\rho is also very small. Extensive numerical experiments demonstrate the efficiency of our HomeAdam(W) algorithms.
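For intuition, the "without square-root" variant replaces Adam's usual \sqrt{\hat v} denominator with \hat v itself. A minimal single-step sketch follows; it is a reconstruction from the abstract for illustration, not the authors' code, and the "sometimes go home" switch to momentum SGD is omitted.

```python
import numpy as np

def adam_srf_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-srf step: bias-corrected Adam, but dividing by v-hat, not sqrt(v-hat)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    w = w - lr * mhat / (vhat + eps)     # no square root on the second moment
    return w, m, v

# One step on f(w) = w^2 from w = 1 (gradient g = 2w = 2):
# mhat = 2, vhat = 4, so w <- 1 - 0.1 * 2/4 = 0.95
w, m, v = adam_srf_step(np.array([1.0]), np.array([2.0]), np.zeros(1), np.zeros(1), t=1)
```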
[LG-37] SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety
链接: https://arxiv.org/abs/2603.02635
作者: Zixuan Xu,Tiancheng He,Huahui Yi,Kun Wang,Xi Chen,Gongli Xi,Qiankun Li,Kang Li,Yang Liu,Zhigang Zeng
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception \to Reasoning \to Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT \to DPO \to GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 \to 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 \to 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 \to 59.21; 7B: 66.39 \to 66.81). Codes are available at this https URL.
[LG-38] Post Hoc Extraction of Pareto Fronts for Continuous Control IJCAI2026
链接: https://arxiv.org/abs/2603.02628
作者: Raghav Thakar,Gaurav Dixit,Kagan Tumer
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures. Submitted to IJCAI 2026
点击查看摘要
Abstract:Agents in the real world must often balance multiple objectives, such as speed, stability, and energy efficiency in continuous control. To account for changing conditions and preferences, an agent must ideally learn a Pareto frontier of policies representing multiple optimal trade-offs. Recent advances in multi-policy multi-objective reinforcement learning (MORL) enable learning a Pareto front directly, but require full multi-objective consideration from the start of training. In practice, multi-objective preferences often arise after a policy has already been trained on a single specialised objective. Existing MORL methods cannot leverage these pre-trained ``specialists'' to learn Pareto fronts and avoid incurring the sample costs of retraining. We introduce Mixed Advantage Pareto Extraction (MAPEX), an offline MORL method that constructs a frontier of policies by reusing pre-trained specialist policies, critics, and replay buffers. MAPEX combines evaluations from specialist critics into a mixed advantage signal, and weights a behaviour cloning loss with it to train new policies that balance multiple objectives. MAPEX's post hoc Pareto front extraction preserves the simplicity of single-objective off-policy RL, and avoids retrofitting these algorithms into complex MORL frameworks. We formally describe the MAPEX procedure and evaluate MAPEX on five multi-objective MuJoCo environments. Given the same starting policies, MAPEX produces comparable fronts at 0.001% the sample cost of established baselines.
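The core idea of mixing specialist critics into a behaviour-cloning weight can be sketched as follows. This is a hypothetical simplification: `prefs` is an assumed preference vector over objectives, the baselines are assumed state values, and MAPEX's actual advantage mixing may differ.

```python
import numpy as np

def mixed_advantage_bc_weights(q_values, prefs, v_baselines):
    """Mix specialist-critic advantages by objective preferences, then
    exponentiate (max-shifted for numerical stability) to get BC weights."""
    adv = (q_values - v_baselines) @ prefs      # (n_transitions,) mixed advantage
    return np.exp(adv - adv.max())

q = np.array([[1.0, 1.0],     # transition favoured by both specialist critics
              [0.0, 0.0]])    # neutral transition
w = mixed_advantage_bc_weights(q, prefs=np.array([0.5, 0.5]), v_baselines=np.zeros(2))
```

A behaviour-cloning loss weighted by `w` then imitates the replay data in proportion to how well each transition serves the chosen objective trade-off.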
[LG-39] Uni-Skill: Building Self-Evolving Skill Repository for Generalizable Robotic Manipulation ICRA2026
链接: https://arxiv.org/abs/2603.02623
作者: Senwei Xie,Yuntian Zhang,Ruiping Wang,Xilin Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to ICRA2026
点击查看摘要
Abstract:While skill-centric approaches leverage foundation models to enhance generalization in compositional tasks, they often rely on fixed skill libraries, limiting adaptability to new tasks without manual intervention. To address this, we propose Uni-Skill, a Unified Skill-centric framework that supports skill-aware planning and facilitates automatic skill evolution. Unlike prior methods that restrict planning to predefined skills, Uni-Skill requests for new skill implementations when existing ones are insufficient, ensuring adaptable planning with self-augmented skill library. To support automatic implementation of diverse skills requested by the planning module, we construct SkillFolder, a VerbNet-inspired repository derived from large-scale unstructured robotic videos. SkillFolder introduces a hierarchical skill taxonomy that captures diverse skill descriptions at multiple levels of abstraction. By populating this taxonomy with large-scale, automatically annotated demonstrations, Uni-Skill shifts the paradigm of skill acquisition from inefficient manual annotation to efficient offline structural retrieval. Retrieved examples provide semantic supervision over behavior patterns and fine-grained references for spatial trajectories, enabling few-shot skill inference without deployment-time demonstrations. Comprehensive experiments in both simulation and real-world settings verify the state-of-the-art performance of Uni-Skill over existing VLM-based skill-centric approaches, highlighting its advanced reasoning capabilities and strong zero-shot generalization across a wide range of novel tasks.
[LG-40] Implicit Bias in Deep Linear Discriminant Analysis
链接: https://arxiv.org/abs/2603.02622
作者: Jiawen Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:While the Implicit Bias (or Implicit Regularization) of standard loss functions has been studied, the optimization geometry induced by discriminative metric-learning objectives remains largely unexplored. To the best of our knowledge, this paper presents an initial theoretical analysis of the implicit regularization induced by Deep LDA, a scale-invariant objective designed to minimize intraclass variance and maximize interclass distance. By analyzing the gradient flow of the loss on an L-layer diagonal linear network, we prove that under balanced initialization, the network architecture transforms standard additive gradient updates into multiplicative weight updates, which demonstrates an automatic conservation of the \ell_{2/L} quasi-norm.
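The conservation claim is easy to check numerically for L = 2. In the sketch below (which uses a generic squared loss, not the Deep LDA objective), a balanced initialization u = v keeps u^2 - v^2 exactly conserved under the discretized gradient flow, while the product w = u * v fits the target; the additive updates on (u, v) act multiplicatively on w.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_star = np.array([1.0, 0.5, 0.0])
y = X @ w_star                       # realizable regression target

u = v = np.full(3, 0.1)              # balanced initialization u = v
lr = 0.01
for _ in range(3000):
    w = u * v                        # effective linear predictor
    g = X.T @ (X @ w - y) / len(y)   # gradient w.r.t. the product w
    u, v = u - lr * g * v, v - lr * g * u
balance_gap = np.abs(u * u - v * v).max()   # conserved quantity stays at zero
```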
[LG-41] Same Error Different Function: The Optimizer as an Implicit Prior in Financial Time Series
链接: https://arxiv.org/abs/2603.02620
作者: Federico Vittorio Cortesi,Giuseppe Iannone,Giulia Crippa,Tomaso Poggio,Pierfrancesco Beneventano
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 39 pages, 24 figures
点击查看摘要
Abstract:Neural networks applied to financial time series operate in a regime of underspecification, where model predictors achieve indistinguishable out-of-sample error. Using large-scale volatility forecasting for S&P 500 stocks, we show that different model-training-pipeline pairs with identical test loss learn qualitatively different functions. Across architectures, predictive accuracy remains unchanged, yet optimizer choice reshapes non-linear response profiles and temporal dependence differently. These divergences have material consequences for decisions: volatility-ranked portfolios trace a near-vertical Sharpe-turnover frontier, with nearly 3\times turnover dispersion at comparable Sharpe ratios. We conclude that in underspecified settings, optimization acts as a consequential source of inductive bias, thus model evaluation should extend beyond scalar loss to encompass functional and decision-level implications.
[LG-42] Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving
链接: https://arxiv.org/abs/2603.02613
作者: Tianze Zhu,Yinuo Wang,Wenjun Zou,Tianyi Zhang,Likun Wang,Letian Tao,Feihong Zhang,Yao Lyu,Shengbo Eben Li
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Reinforcement learning (RL) is a fundamental methodology in autonomous driving systems, where generative policies exhibit considerable potential by leveraging their ability to model complex distributions to enhance exploration. However, their inherent high inference latency severely impedes their deployment in real-time decision-making and control. To address this issue, we propose diffusion actor-critic with entropy regulator via flow matching (DACER-F) by introducing flow matching into online RL, enabling the generation of competitive actions in a single inference step. By leveraging Langevin dynamics and gradients of the Q-function, DACER-F dynamically optimizes actions from experience replay toward a target distribution that balances high Q-value information with exploratory behavior. The flow policy is then trained to efficiently learn a mapping from a simple prior distribution to this dynamic target. In complex multi-lane and intersection simulations, DACER-F outperforms baselines diffusion actor-critic with entropy regulator (DACER) and distributional soft actor-critic (DSAC), while maintaining an ultra-low inference latency. DACER-F further demonstrates its scalability on standard RL benchmark DeepMind Control Suite (DMC), achieving a score of 775.8 in the humanoid-stand task and surpassing prior methods. Collectively, these results establish DACER-F as a high-performance and computationally efficient RL algorithm.
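The Langevin-guided refinement step can be sketched generically: actions follow the gradient of the Q-function while Gaussian noise scaled by a temperature is injected, so the iterates target a distribution that balances high Q-values with exploration. The function names and the quadratic Q below are illustrative assumptions, not DACER-F's implementation.

```python
import numpy as np

def langevin_refine(a, grad_q, steps=200, step_size=0.05, temp=0.01, rng=None):
    """Langevin dynamics on actions: ascend the Q-gradient with injected noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(steps):
        noise = rng.normal(size=np.shape(a))
        a = a + step_size * grad_q(a) + np.sqrt(2.0 * step_size * temp) * noise
    return a

# Illustrative quadratic Q with maximizer a_star
a_star = np.array([0.5, -0.2])
grad_q = lambda a: -(a - a_star)     # gradient of -0.5 * ||a - a_star||^2
refined = langevin_refine(np.zeros(2), grad_q)
```

In the paper's scheme, such refined actions form the target distribution that the one-step flow-matching policy is trained to reproduce, moving the iterative cost from inference time to training time.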
[LG-43] Heterogeneous Agent Collaborative Reinforcement Learning
链接: https://arxiv.org/abs/2603.02604
作者: Zhixia Zhang,Zixuan Huang,Xin Xia,Deqing Wang,Fuzhen Zhuang,Shuai Ma,Ning Ding,Yaodong Yang,Jianxin Li,Yikun Ban
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
[LG-44] Joint Optimization of Model Partitioning and Resource Allocation for Anti-Jamming Collaborative Inference Systems
链接: https://arxiv.org/abs/2603.02579
作者: Mengru Wu,Jiawei Li,Jiaqi Wei,Bin Lyu,Kai-Kit Wong,Hyundong Shin
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:With the increasing computational demands of deep neural network (DNN) inference on resource-constrained devices, DNN partitioning-based device-edge collaborative inference has emerged as a promising paradigm. However, the transmission of intermediate feature data is vulnerable to malicious jamming, which significantly degrades the overall inference performance. To counter this threat, this letter focuses on an anti-jamming collaborative inference system in the presence of a malicious jammer. In this system, a DNN model is partitioned into two distinct segments, which are executed by wireless devices and edge servers, respectively. We first analyze the effects of jamming and DNN partitioning on inference accuracy via data regression. Based on this, our objective is to maximize the system’s revenue of delay and accuracy (RDA) under inference accuracy and computing resource constraints by jointly optimizing computation resource allocation, devices’ transmit power, and DNN partitioning. To address the mixed-integer nonlinear programming problem, we propose an efficient alternating optimization-based algorithm, which decomposes the problem into three subproblems that are solved via Karush-Kuhn-Tucker conditions, convex optimization methods, and a quantum genetic algorithm, respectively. Extensive simulations demonstrate that our proposed scheme outperforms baselines in terms of RDA.
[LG-45] Towards Parameter-Free Temporal Difference Learning
链接: https://arxiv.org/abs/2603.02577
作者: Yunxiang Li,Mark Schmidt,Reza Babanezhad,Sharan Vaswani
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Temporal difference (TD) learning is a fundamental algorithm for estimating value functions in reinforcement learning. Recent finite-time analyses of TD with linear function approximation quantify its theoretical convergence rate. However, they often require setting the algorithm parameters using problem-dependent quantities that are difficult to estimate in practice – such as the minimum eigenvalue of the feature covariance (\omega) or the mixing time of the underlying Markov chain (\tau_{\text{mix}}). In addition, some analyses rely on nonstandard and impractical modifications, exacerbating the gap between theory and practice. To address these limitations, we use an exponential step-size schedule with the standard TD(0) algorithm. We analyze the resulting method under two sampling regimes: independent and identically distributed (i.i.d.) sampling from the stationary distribution, and the more practical Markovian sampling along a single trajectory. In the i.i.d. setting, the proposed algorithm does not require knowledge of problem-dependent quantities such as \omega, and attains the optimal bias-variance trade-off for the last iterate. In the Markovian setting, we propose a regularized TD(0) algorithm with an exponential step-size schedule. The resulting algorithm achieves a comparable convergence rate to prior works, without requiring projections, iterate averaging, or knowledge of \tau_{\text{mix}} or \omega.
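An exponential step-size schedule interpolates geometrically between an initial and a final step size. The sketch below runs tabular TD(0) with such a schedule under Markovian sampling along a single trajectory; it is a simplification of the paper's linear-function-approximation setting, and the transition matrix and rewards are made up for illustration.

```python
import numpy as np

# A made-up 3-state Markov reward process
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, -1.0])
gamma = 0.9
V_true = np.linalg.solve(np.eye(3) - gamma * P, r)   # exact value function

rng = np.random.default_rng(0)
T, eta0, etaT = 20000, 0.5, 1e-3
V, s = np.zeros(3), 0
for t in range(T):
    eta = eta0 * (etaT / eta0) ** (t / T)            # exponential decay eta0 -> etaT
    s_next = rng.choice(3, p=P[s])
    V[s] += eta * (r[s] + gamma * V[s_next] - V[s])  # TD(0) along one trajectory
    s = s_next
max_err = np.abs(V - V_true).max()
```

The early large steps let the iterate move quickly toward the fixed point, while the geometric decay averages out the Markovian sampling noise at the end, without any projection or iterate averaging.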
[LG-46] Wasserstein Proximal Policy Gradient
链接: https://arxiv.org/abs/2603.02576
作者: Zhaoyu Zhu,Shuhan Zhang,Rui Gao,Shuang Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study policy gradient methods for continuous-action, entropy-regularized reinforcement learning through the lens of Wasserstein geometry. Starting from a Wasserstein proximal update, we derive Wasserstein Proximal Policy Gradient (WPPG) via an operator-splitting scheme that alternates an optimal transport update with a heat step implemented by Gaussian convolution. This formulation avoids evaluating the policy’s log density or its gradient, making the method directly applicable to expressive implicit stochastic policies specified as pushforward maps. We establish a global linear convergence rate for WPPG, covering both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically, WPPG is simple to implement and attains competitive performance on standard continuous-control benchmarks.
[LG-47] EdgeFLow: Serverless Federated Learning via Sequential Model Migration in Edge Networks
链接: https://arxiv.org/abs/2603.02562
作者: Yuchen Shi,Qijun Hou,Pingyi Fan,Khaled B. Letaief
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated Learning (FL) has emerged as a transformative distributed learning paradigm in the era of Internet of Things (IoT), reconceptualizing data processing methodologies. However, FL systems face significant communication bottlenecks due to inevitable client-server data exchanges and long-distance transmissions. This work presents EdgeFLow, an innovative FL framework that redesigns the system topology by replacing traditional cloud servers with sequential model migration between edge base stations. By conducting model aggregation and propagation exclusively at edge clusters, EdgeFLow eliminates cloud-based transmissions and substantially reduces global communication overhead. We provide rigorous convergence analysis for EdgeFLow under non-convex objectives and non-IID data distributions, extending classical FL convergence theory. Experimental results across various configurations validate the theoretical analysis, demonstrating that EdgeFLow achieves comparable accuracy improvements while significantly reducing communication costs. As a systemic architectural innovation for communication-efficient FL, EdgeFLow establishes a foundational framework for future developments in IoT and edge-network learning systems.
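The topology can be sketched as follows: instead of a cloud server, the global model migrates sequentially through edge clusters, and each cluster aggregates its clients' local updates before handing the model on. The least-squares clients below are illustrative stand-ins, not EdgeFLow's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([2.0, -1.0])

def local_sgd(w, X, y, lr=0.1, steps=5):
    """A client's local update on its private least-squares data."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

clusters = []
for _ in range(4):                     # 4 edge clusters...
    clients = []
    for _ in range(3):                 # ...with 3 clients each
        X = rng.normal(size=(50, 2))
        clients.append((X, X @ w_star + 0.05 * rng.normal(size=50)))
    clusters.append(clients)

w = np.zeros(2)
for _round in range(10):
    for clients in clusters:           # the model migrates cluster to cluster
        w = np.mean([local_sgd(w, X, y) for X, y in clients], axis=0)
err = np.linalg.norm(w - w_star)
```

All aggregation happens at the edge, so no model ever travels to a distant cloud server; communication is limited to client-cluster exchanges plus one model hand-off per cluster per round.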
[LG-48] Thermodynamic Regulation of Finite-Time Gibbs Training in Energy-Based Models: A Restricted Boltzmann Machine Study
链接: https://arxiv.org/abs/2603.02525
作者: Görkem Can Süleymanoğlu
类目: Machine Learning (cs.LG)
*备注: 35 pages, 12 Tables, 7 figures. Includes theoretical analysis and experimental validation on MNIST
点击查看摘要
Abstract:Restricted Boltzmann Machines (RBMs) are typically trained using finite-length Gibbs chains under a fixed sampling temperature. This practice implicitly assumes that the stochastic regime remains valid as the energy landscape evolves during learning. We argue that this assumption can become structurally fragile under finite-time training dynamics. This fragility arises because, in nonconvex energy-based models, fixed-temperature finite-time training can generate admissible trajectories with effective-field amplification and conductance collapse. As a result, the Gibbs sampler may asymptotically freeze, the negative phase may localize, and, without sufficiently strong regularization, parameters may exhibit deterministic linear drift. To address this instability, we introduce an endogenous thermodynamic regulation framework in which temperature evolves as a dynamical state variable coupled to measurable sampling statistics. Under standard local Lipschitz conditions and a two-time-scale separation regime, we establish global parameter boundedness under strictly positive L2 regularization. We further prove local exponential stability of the thermodynamic subsystem and show that the regulated regime mitigates inverse-temperature blow-up and freezing-induced degeneracy within a forward-invariant neighborhood. Experiments on MNIST demonstrate that the proposed self-regulated RBM substantially improves normalization stability and effective sample size relative to fixed-temperature baselines, while preserving reconstruction performance. Overall, the results reinterpret RBM training as a controlled non-equilibrium dynamical process rather than a static equilibrium approximation.
[LG-49] ParEVO: Synthesizing Code for Irregular Data: High-Performance Parallelism through Agentic Evolution
链接: https://arxiv.org/abs/2603.02510
作者: Liu Yang,Zeyu Nie,Andrew Liu,Felix Zou,Deniz Altinbüken,Amir Yazdanbakhsh,Quanquan C. Liu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF)
*备注:
点击查看摘要
Abstract:The transition from sequential to parallel computing is essential for modern high-performance applications but is hindered by the steep learning curve of concurrent programming. This challenge is magnified for irregular data structures (such as sparse graphs, unbalanced trees, and non-uniform meshes) where static scheduling fails and data dependencies are unpredictable. Current Large Language Models (LLMs) often fail catastrophically on these tasks, generating code plagued by subtle race conditions, deadlocks, and sub-optimal scaling. We bridge this gap with ParEVO, a framework designed to synthesize high-performance parallel algorithms for irregular data. Our contributions include: (1) The Parlay-Instruct Corpus, a curated dataset of 13,820 tasks synthesized via a “Critic-Refine” pipeline that explicitly filters for empirically performant algorithms that effectively utilize Work-Span parallel primitives; (2) specialized DeepSeek, Qwen, and Gemini models fine-tuned to align probabilistic generation with the rigorous semantics of the ParlayLib library; and (3) an Evolutionary Coding Agent (ECA) that improves the “last mile” of correctness by iteratively repairing code using feedback from compilers, dynamic race detectors, and performance profilers. On the ParEval benchmark, ParEVO achieves an average 106x speedup (with a maximum of 1103x) across the suite, and a robust 13.6x speedup specifically on complex irregular graph problems, outperforming state-of-the-art commercial models. Furthermore, our evolutionary approach matches state-of-the-art expert human baselines, achieving up to a 4.1x speedup on specific highly-irregular kernels. Source code and datasets are available at this https URL. 
[LG-50] Learning-Augmented Moment Estimation on Time-Decay Models
链接: https://arxiv.org/abs/2603.02488
作者: Soham Nagawanshi,Shalini Panthangi,Chen Wang,David P. Woodruff,Samson Zhou
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Motivated by the prevalence and success of machine learning, a line of recent work has studied learning-augmented algorithms in the streaming model. These results have shown that for natural and practical oracles implemented with machine learning models, we can obtain streaming algorithms with improved space efficiency that are otherwise provably impossible. On the other hand, our understanding is much more limited when items are weighted unequally, for example, in the sliding-window model, where older data must be expunged from the dataset, e.g., by privacy regulation laws. In this paper, we utilize an oracle for the heavy-hitters of datasets to give learning-augmented algorithms for a number of fundamental problems, such as norm/moment estimation, frequency estimation, cascaded norms, and rectangular moment estimation, in the time-decay setting. We complement our theoretical results with a number of empirical evaluations that demonstrate the practical efficiency of our algorithms on real and synthetic datasets.
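The oracle-plus-sketch idea in this abstract can be sketched in a few lines: a hypothetical `oracle` routes predicted heavy hitters to exact counters, while the remaining items share a small hashed counter array, all maintained over a sliding window (the simplest time-decay model). The class, the `oracle` callable, and all parameters below are illustrative assumptions, not the paper's actual construction.

```python
from collections import deque, defaultdict

class SlidingWindowFreq:
    """Learning-augmented frequency estimation over a sliding window.

    A hypothetical `oracle` predicts which items are heavy hitters;
    predicted-heavy items get exact counters, while the rest share a small
    hashed counter array (CountMin-style, one row for brevity).
    """
    def __init__(self, window, oracle, sketch_width=16):
        self.window = window
        self.oracle = oracle           # item -> bool (predicted heavy?)
        self.width = sketch_width
        self.buffer = deque()          # items currently inside the window
        self.exact = defaultdict(int)  # exact counts for predicted heavy hitters
        self.sketch = [0] * sketch_width

    def _apply(self, item, delta):
        if self.oracle(item):
            self.exact[item] += delta
        else:
            self.sketch[hash(item) % self.width] += delta

    def update(self, item):
        self.buffer.append(item)
        self._apply(item, +1)
        if len(self.buffer) > self.window:   # expire the oldest item
            self._apply(self.buffer.popleft(), -1)

    def estimate(self, item):
        if self.oracle(item):
            return self.exact[item]
        return self.sketch[hash(item) % self.width]  # overestimate for light items
```

With an accurate oracle, heavy hitters are estimated exactly while the light items cost only `sketch_width` counters; an inaccurate oracle degrades gracefully to the sketch's usual overestimate.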
[LG-51] Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding
链接: https://arxiv.org/abs/2603.02470
作者: Jingxuan Men,Mahdi Boloursaz Mashhadi,Ning Wang,Yi Ma,Mike Nilsson,Rahim Tafazolli
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs), where tokens serve as unified units of communication and computation, enabling efficient semantic- and goal-oriented information exchange in future wireless networks. In this paper, we propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication with Unequal Error Protection (UEP)-based source-channel coding adaptation. The proposed framework integrates user-intended textual descriptions with discrete video tokenization and unequal error protection to enhance semantic fidelity under restrictive bandwidth constraints. First, discrete video tokens are extracted through a pretrained video tokenizer, while text-conditioned vision-language modeling and optical-flow propagation are jointly used to identify tokens that correspond to user-intended semantics across space and time. Next, we introduce a semantic-aware multi-rate bit-allocation strategy, in which tokens highly related to the user intent are encoded using full codebook precision, whereas non-intended tokens are represented through reduced codebook precision differential encoding, enabling rate savings while preserving semantic quality. Finally, a source and channel coding adaptation scheme is developed to adapt bit allocation and channel coding to varying resources and link conditions. Experiments on various video datasets demonstrate that the proposed framework outperforms both conventional and semantic communication baselines, in perceptual and semantic quality on a wide SNR range.
[LG-52] Spectral Regularization for Diffusion Models
链接: https://arxiv.org/abs/2603.02447
作者: Satish Chandran,Nicolas Roque dos Santos,Yunshu Wu,Greg Ver Steeg,Evangelos Papalexakis
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Diffusion models are typically trained using pointwise reconstruction objectives that are agnostic to the spectral and multi-scale structure of natural signals. We propose a loss-level spectral regularization framework that augments standard diffusion training with differentiable Fourier- and wavelet-domain losses, without modifying the diffusion process, model architecture, or sampling procedure. The proposed regularizers act as soft inductive biases that encourage appropriate frequency balance and coherent multi-scale structure in generated samples. Our approach is compatible with DDPM, DDIM, and EDM formulations and introduces negligible computational overhead. Experiments on image and audio generation demonstrate consistent improvements in sample quality, with the largest gains observed on higher-resolution, unconditional datasets where fine-scale structure is most challenging to model.
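A minimal sketch of the loss-level idea: a Fourier-domain penalty added to the standard pointwise MSE. The weight `lam` and the log-magnitude normalization are illustrative choices, and the paper's framework also includes wavelet-domain terms not shown here.

```python
import numpy as np

def spectral_loss(pred, target, lam=0.1):
    """Pointwise MSE augmented with a Fourier-domain magnitude penalty.

    Illustrative sketch only: the paper's regularizers (and their
    wavelet-domain counterparts) may be weighted and normalized differently.
    """
    mse = np.mean((pred - target) ** 2)
    # Compare log-magnitude spectra so low and high frequencies both matter.
    ps = np.log1p(np.abs(np.fft.rfft2(pred)))
    ts = np.log1p(np.abs(np.fft.rfft2(target)))
    return mse + lam * np.mean((ps - ts) ** 2)
```

Because the penalty is differentiable, it can be dropped into a DDPM/DDIM/EDM training loop without touching the diffusion process or sampler, matching the abstract's claim of a loss-level change only.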
[LG-53] Using the SEKF to Transfer NN Models of Dynamical Systems with Limited Data
链接: https://arxiv.org/abs/2603.02439
作者: Joshua E. Hammond,Tyler A. Soderstrom,Brian A. Korgel,Michael Baldea
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Data-driven models of dynamical systems require extensive amounts of training data. For many practical applications, gathering sufficient data is not feasible due to cost or safety concerns. This work uses the Subset Extended Kalman Filter (SEKF) to adapt pre-trained neural network models to new, similar systems with limited data available. Experimental validation across damped spring and continuous stirred-tank reactor systems demonstrates that small parameter perturbations to the initial model capture target system dynamics while requiring as little as 1% of original training data. In addition, finetuning requires less computational cost and reduces generalization error.
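A hedged illustration of one SEKF-style update on a selected weight subset: for clarity the model below is linear in the adapted weights, so the Jacobian is just the input vector, whereas the paper linearizes a neural network and chooses the subset from its parameters.

```python
import numpy as np

def sekf_update(w, P, x, y, R=1e-2):
    """One Kalman-style update of an adapted weight subset.

    In the paper's setting, H would be the network's gradient with respect
    to the chosen weight subset; here the model is linear in w, so H = x.
    """
    H = x.reshape(1, -1)                 # Jacobian of the output w.r.t. w
    S = H @ P @ H.T + R                  # innovation covariance (scalar here)
    K = P @ H.T / S                      # Kalman gain
    w = w + (K * (y - H @ w)).ravel()    # correct weights toward the new data
    P = P - K @ H @ P                    # shrink uncertainty along H
    return w, P
```

A handful of observations is enough to pull the weights toward a new system, which echoes the low-data transfer setting described in the abstract.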
[LG-54] Dimension-Independent Convergence of Underdamped Langevin Monte Carlo in KL Divergence
链接: https://arxiv.org/abs/2603.02429
作者: Shiyuan Zhang,Qiwei Di,Xuheng Li,Quanquan Gu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 51 pages, 1 table
点击查看摘要
Abstract:Underdamped Langevin dynamics (ULD) is a widely-used sampler for Gibbs distributions \pi \propto e^{-V}, and is often empirically effective in high dimensions. However, existing non-asymptotic convergence guarantees for discretized ULD typically scale polynomially with the ambient dimension d, leading to vacuous bounds when d is large. The main known dimension-free result concerns the randomized midpoint discretization in Wasserstein-2 distance (Liu et al., 2023), while dimension-independent guarantees for ULD discretizations in KL divergence have remained open. We close this gap by proving the first dimension-free KL divergence bounds for discretized ULD. Our analysis refines the KL local error framework (Altschuler et al., 2025) to a dimension-free setting and yields bounds that depend on \mathrm{tr}(\mathbf{H}), where \mathbf{H} upper bounds the Hessian of V, rather than on d. As a consequence, we obtain improved iteration complexity for underdamped Langevin Monte Carlo relative to overdamped Langevin methods in regimes where \mathrm{tr}(\mathbf{H}) \ll d.
[LG-55] Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation
链接: https://arxiv.org/abs/2603.02426
作者: Leo (Muxing) Wang,Pengkun Yang,Lili Su
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study personalized multi-agent average reward TD learning, in which a collection of agents interacts with different environments and jointly learns their respective value functions. We focus on the setting where there exists a shared linear representation, and the agents' optimal weights collectively lie in an unknown linear subspace. Inspired by the recent success of personalized federated learning (PFL), we study the convergence of cooperative single-timescale TD learning in which agents iteratively estimate the common subspace and local heads. We show that this decomposition can filter out conflicting signals, effectively mitigating the negative impact of "misaligned" signals and achieving linear speedup. The main technical challenges lie in the heterogeneity, the Markovian sampling, and their intricate interplay in shaping error evolutions. Specifically, not only are the error dynamics of multiple variables closely interconnected, but there is also no direct contraction for the principal angle distance between the optimal subspace and the estimated subspace. We hope our analytical techniques will inspire deeper exploration of leveraging common structures. Experiments are provided to show the benefits of learning via a shared structure for the more general control problem.
[LG-56] Learning Optimal Search Strategies
链接: https://arxiv.org/abs/2603.02356
作者: Stefan Ankirchner,Maximilian Philipp Thiel
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:We explore the question of how to learn an optimal search strategy within the example of a parking problem where parking opportunities arrive according to an unknown inhomogeneous Poisson process. The optimal policy is a threshold-type stopping rule characterized by an indifference position. We propose an algorithm that learns this threshold by estimating the integrated jump intensity rather than the intensity function itself. We show that our algorithm achieves a logarithmic regret growth, uniformly over a broad class of environments. Moreover, we prove a logarithmic minimax regret lower bound, establishing the growth optimality of the proposed approach.
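The "estimate the integrated intensity, then threshold" idea can be illustrated with a toy rule: pool opportunity positions from past episodes, estimate the expected number of opportunities still ahead of each position, and accept the first opportunity once fewer than one more is expected. The grid resolution and the less-than-one cutoff are illustrative simplifications; the paper's indifference threshold and regret analysis are more delicate.

```python
import bisect

def estimate_threshold(episodes, horizon, grid=100):
    """Toy threshold rule learned from past episodes.

    episodes: list of episodes, each a list of opportunity positions in
    [0, horizon]. The empirical integrated intensity beyond x is the pooled
    count of positions greater than x divided by the number of episodes.
    """
    pooled = sorted(p for ep in episodes for p in ep)
    n = len(episodes)
    for i in range(grid + 1):
        x = horizon * i / grid
        # expected number of opportunities still ahead of position x
        remaining = (len(pooled) - bisect.bisect_right(pooled, x)) / n
        if remaining < 1.0:
            return x  # accept the first opportunity at or beyond x
    return horizon
```

Note that the rule only needs the integrated (cumulative) count beyond each position, never an estimate of the intensity function itself, which mirrors the paper's design choice.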
[LG-57] Learning graph topology from metapopulation epidemic encoder-decoder
链接: https://arxiv.org/abs/2603.02349
作者: Xin Li,Jonathan Cohen,Shai Pilosof,Rami Puzis
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Metapopulation epidemic models are a valuable tool for studying large-scale outbreaks. With the limited availability of epidemic tracing data, it is challenging to infer the essential constituents of these models, namely, the epidemic parameters and the relevant mobility network between subpopulations. Either one of these constituents can be estimated while assuming the other; however, the problem of their joint inference has not yet been solved. Here, we propose two encoder-decoder deep learning architectures that infer metapopulation mobility graphs from time-series data, with and without the assumption of epidemic model parameters. Evaluation across diverse random and empirical mobility networks shows that the proposed approach outperforms the state-of-the-art topology inference. Further, we show that topology inference improves dramatically with data on additional pathogens. Our study establishes a robust framework for simultaneously inferring epidemic parameters and topology, addressing a persistent gap in modeling disease propagation.
[LG-58] Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study ICASSP2026
链接: https://arxiv.org/abs/2603.02285
作者: Zijian Yang,Jörg Barkoczi,Ralf Schlüter,Hermann Ney
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: accepted to ICASSP 2026
点击查看摘要
Abstract:Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible. The necessity of these conditions is also discussed. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.
[LG-59] A Comparative Study of UMAP and Other Dimensionality Reduction Methods
链接: https://arxiv.org/abs/2603.02275
作者: Guanzhe Zhang,Shanshan Ding,Zhezhen Jin
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 31 pages, 4 figures
点击查看摘要
Abstract:Uniform Manifold Approximation and Projection (UMAP) is a widely used manifold learning technique for dimensionality reduction. This paper studies UMAP, supervised UMAP, and several competing dimensionality reduction methods, including Principal Component Analysis (PCA), Kernel PCA, Sliced Inverse Regression (SIR), Kernel SIR, and t-distributed Stochastic Neighbor Embedding, through a comprehensive comparative analysis. Although UMAP has attracted substantial attention for preserving local and global structures, its supervised extensions, particularly for regression settings, remain rather underexplored. We provide a systematic evaluation of supervised UMAP for both regression and classification using simulated and real datasets, with performance assessed via predictive accuracy on low-dimensional embeddings. Our results show that supervised UMAP performs well for classification but exhibits limitations in effectively incorporating response information for regression, highlighting an important direction for future development.
[LG-60] Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimers Network
链接: https://arxiv.org/abs/2603.02273
作者: Binon Teji,Subhajit Bandyopadhyay,Swarup Roy
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Prioritizing disease-associated genes is central to understanding the molecular mechanisms of complex disorders such as Alzheimer’s disease (AD). Traditional network-based approaches rely on static centrality measures and often fail to capture cross-modal biological heterogeneity. We propose NETRA (Node Evaluation through Transformer-based Representation and Attention), a multimodal graph transformer framework that replaces heuristic centrality metrics with attention-driven relevance scoring. Using AD as a case study, gene regulatory networks are independently constructed from microarray, single-cell RNA-seq, and single-nucleus RNA-seq data. Random-walk sequences derived from these networks are used to train a BERT-based model for learning global gene embeddings, while modality-specific gene expression profiles are compressed using variational autoencoders. These representations are integrated with auxiliary biological networks, including protein-protein interactions, Gene Ontology semantic similarity, and diffusion-based gene similarity, into a unified multimodal graph. A graph transformer assigns NETRA scores that quantify gene relevance in a disease-specific and context-aware manner. Gene set enrichment analysis shows that NETRA achieves a normalized enrichment score of about 3.9 for the Alzheimer’s disease pathway, substantially outperforming classical centrality measures and diffusion models. Top-ranked genes enrich multiple neurodegenerative pathways, recover a known late-onset AD susceptibility locus at chr12q13, and reveal conserved cross-disease gene modules. The framework preserves biologically realistic heavy-tailed network topology and is readily extensible to other complex disorders.
[LG-61] Length Generalization Bounds for Transformers
链接: https://arxiv.org/abs/2603.02238
作者: Andy Yang,Pascal Bergsträßer,Georg Zetzsche,David Chiang,Anthony W. Lin
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
*备注:
点击查看摘要
Abstract:Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of any length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for CRASP, a class of languages which is closely linked to transformers. A positive partial result was recently shown by Chen et al. for CRASP with only one layer and, under some restrictions, also with two layers. We provide complete answers to the above open problem. Our main result is the non-existence of computable length generalization bounds for CRASP (already with two layers) and hence for transformers. To complement this, we provide a computable bound for the positive fragment of CRASP, which we show equivalent to fixed-precision transformers. For both positive CRASP and fixed-precision transformers, we show that the length complexity is exponential, and prove optimality of the bounds.
[LG-62] Efficient Sparse Selective-Update RNNs for Long-Range Sequence Modeling
链接: https://arxiv.org/abs/2603.02226
作者: Bojian Yin,Shurong Wang,Haoyu Tan,Sander Bohte,Federico Corradi,Guoqi Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Real-world sequential signals, such as audio or video, contain critical information that is often embedded within long periods of silence or noise. While recurrent neural networks (RNNs) are designed to process such data efficiently, they often suffer from "memory decay" due to a rigid update schedule: they typically update their internal state at every time step, even when the input is static. This constant activity forces the model to overwrite its own memory and makes it hard for the learning signal to reach back to distant past events. Here we show that we can overcome this limitation using Selective-Update RNNs (suRNNs), a non-linear architecture that learns to preserve its memory when the input is redundant. By using a neuron-level binary switch that only opens for informative events, suRNNs decouple the recurrent updates from the raw sequence length. This mechanism allows the model to maintain an exact, unchanged memory of the past during low-information intervals, creating a direct path for gradients to flow across time. Our experiments on the Long Range Arena, WikiText, and other synthetic benchmarks show that suRNNs match or exceed the accuracy of much more complex models such as Transformers, while remaining significantly more efficient for long-term storage. By allowing each neuron to learn its own update timescale, our approach resolves the mismatch between how long a sequence is and how much information it actually contains. By providing a principled approach to managing temporal information density, this work establishes a new direction for achieving Transformer-level performance within the highly efficient framework of recurrent modeling.
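A minimal numpy sketch of the selective-update step described above. The gate parameterization, a thresholded linear readout of the input, is an assumption for illustration; the paper learns its own neuron-level switch.

```python
import numpy as np

def su_rnn_step(h, x, Wh, Wx, Wg, bg, tau=0.0):
    """One selective-update recurrent step (illustrative sketch).

    A per-neuron binary switch g opens only for informative inputs; where
    g = 0 the previous state is carried over exactly, so memory is preserved
    and gradients flow through that neuron untouched.
    """
    g = (Wg @ x + bg > tau).astype(h.dtype)   # neuron-level binary switch
    cand = np.tanh(Wh @ h + Wx @ x)           # ordinary RNN candidate state
    return g * cand + (1.0 - g) * h           # update only where g == 1
```

When the switch stays closed for a long stretch of redundant input, the state is bit-for-bit unchanged, which is exactly the "exact, unchanged memory" property the abstract attributes to suRNNs.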
[LG-63] Scaling Reward Modeling without Human Supervision
链接: https://arxiv.org/abs/2603.02225
作者: Jingxuan Fan,Yueying Li,Zhenting Qi,Dinghuai Zhang,Kianté Brantley,Sham M. Kakade,Hanlin Zhang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Learning from feedback is an instrumental process for advancing the capabilities and safety of frontier models, yet its effectiveness is often constrained by cost and scalability. We present a pilot study that explores scaling reward models through unsupervised approaches. We operationalize reward-based scaling (RBS), in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Its advantage is demonstrated in various aspects: despite using no human annotations, training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, and these improvements consistently transfer across diverse initialization backbones spanning model families and scales. Across models, our method improves RewardBench v2 accuracy by up to +7.7 points on average, with gains of up to +16.1 on in-domain math subsets and consistent improvements on out-of-domain safety and general subsets. When applied to best-of-N selection and policy optimization, these reward models substantially improve downstream math performance and match or exceed strong supervised reward model baselines of similar size. Overall, we demonstrate the feasibility and promise of training reward models without costly and potentially unreliable human annotations.
[LG-64] Subspace Geometry Governs Catastrophic Forgetting in Low-Rank Adaptation
链接: https://arxiv.org/abs/2603.02224
作者: Brady Steele
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, 6 tables
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for adapting large pre-trained models, yet its behavior under continual learning remains poorly understood. We present a geometric theory characterizing catastrophic forgetting in LoRA through the lens of gradient subspace interactions. Our central finding is that forgetting is governed by a simple geometric law: \mathcal{F} = \alpha(1 - \cos^2\theta_{\min}) + \beta, where \theta_{\min} is the minimum principal angle between task gradient subspaces. This formulation reveals an approximate rank-invariance property: at high subspace angles, forgetting becomes largely independent of the adapter rank (coefficient of variation \approx 0.8% in controlled synthetic settings; CV \approx 10-19% on real benchmarks, suggesting this is regime-dependent rather than absolute). We validate our theory on synthetic tasks (r = 0.994 correlation), Split-CIFAR100 with ViT-LoRA, and sequential GLUE with RoBERTa-LoRA. Our analysis reconciles seemingly contradictory findings in the literature: we show that rank affects forgetting only when task subspaces are similar (low angle), while orthogonal methods like O-LoRA provide minimal benefit when natural orthogonality is already high. These insights provide principled guidance for continual learning with parameter-efficient fine-tuning.
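The geometric law can be exercised directly: compute the minimum principal angle between two gradient subspaces from orthonormal bases, then plug it into F = alpha * (1 - cos^2 theta_min) + beta. The coefficients alpha and beta are task-dependent fit parameters in the paper; the code below only sketches the geometry.

```python
import numpy as np

def min_principal_angle(A, B):
    """Minimum principal angle between the column spaces of A and B.

    cos(theta_min) is the largest singular value of Qa.T @ Qb, where Qa and
    Qb are orthonormal bases for the two gradient subspaces.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s.max(), -1.0, 1.0))

def predicted_forgetting(theta_min, alpha, beta):
    """The paper's geometric law: F = alpha * (1 - cos^2(theta_min)) + beta."""
    return alpha * (1.0 - np.cos(theta_min) ** 2) + beta
```

Identical subspaces give theta_min = 0 and F = beta (no angle-driven forgetting), while orthogonal subspaces give theta_min = pi/2 and the maximal F = alpha + beta, matching the abstract's low-angle versus high-angle regimes.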
[LG-65] The elbow statistic: Multiscale clustering statistical significance
链接: https://arxiv.org/abs/2603.03235
作者: Francisco J. Perez-Reche
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 30 pages, 3 figures, 5 tables
点击查看摘要
Abstract:Selecting the number of clusters remains a fundamental challenge in unsupervised learning. Existing criteria typically target a single "optimal" partition, often overlooking statistically meaningful structure present at multiple resolutions. We introduce ElbowSig, a framework that formalizes the heuristic "elbow" method as a rigorous inferential problem. Our approach centers on a normalized discrete curvature statistic derived from the cluster heterogeneity sequence, which is evaluated against a null distribution of unstructured data. We derive the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability. As an algorithm-agnostic procedure, ElbowSig requires only the heterogeneity sequence and is compatible with a wide range of clustering methods, including hard, fuzzy, and model-based clustering. Extensive experiments on synthetic and empirical datasets demonstrate that the method maintains appropriate Type-I error control while providing the power to resolve multiscale organizational structures that are typically obscured by single-resolution selection criteria.
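The normalized discrete curvature at the heart of ElbowSig can be sketched for a heterogeneity sequence (e.g. k-means inertia as a function of the number of clusters). The normalization and the null calibration in the paper are more refined than this toy version, which only conveys the statistic's shape.

```python
def elbow_statistic(W):
    """Normalized discrete curvature of a cluster-heterogeneity sequence.

    W[k] is the within-cluster heterogeneity (e.g. k-means inertia) at k+1
    clusters. The discrete curvature at each interior point is the second
    difference, normalized by the sequence's overall drop so that the
    statistic is scale-free. Simplified reading of the paper's statistic.
    """
    total_drop = W[0] - W[-1]
    curv = [(W[k - 1] - 2 * W[k] + W[k + 1]) / total_drop
            for k in range(1, len(W) - 1)]
    # the elbow is the interior index with the largest curvature
    k_star = max(range(len(curv)), key=lambda i: curv[i]) + 2  # cluster count
    return curv, k_star
```

Because only the heterogeneity sequence enters, the statistic is algorithm-agnostic, exactly as the abstract claims; significance then comes from comparing `curv` to its distribution under unstructured data.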
[LG-66] Shape Derivative-Informed Neural Operators with Application to Risk-Averse Shape Optimization
链接: https://arxiv.org/abs/2603.03211
作者: Xindi Gong,Dingcheng Luo,Thomas O’Leary-Roseberry,Ruanui Nicholson,Omar Ghattas
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Shape optimization under uncertainty (OUU) is computationally intensive for classical PDE-based methods due to the high cost of repeated sampling-based risk evaluation across many uncertainty realizations and varying geometries, while standard neural surrogates often fail to provide accurate and efficient sensitivities for optimization. We introduce Shape-DINO, a derivative-informed neural operator framework for learning PDE solution operators on families of varying geometries, with a particular focus on accelerating PDE-constrained shape OUU. Shape-DINOs encode geometric variability through diffeomorphic mappings to a fixed reference domain and employ a derivative-informed operator learning objective that jointly learns the PDE solution and its Fréchet derivatives with respect to design variables and uncertain parameters, enabling accurate state predictions and reliable gradients for large-scale OUU. We establish a priori error bounds linking surrogate accuracy to optimization error and prove universal approximation results for multi-input reduced basis neural operators in suitable C^1 norms. We demonstrate efficiency and scalability on three representative shape OUU problems, including boundary design for a Poisson equation and shape design governed by steady-state Navier-Stokes exterior flows in two and three dimensions. Across these examples, Shape-DINOs produce more reliable optimization results than operator surrogates trained without derivative information. In our examples, Shape-DINOs achieve 3-8 orders-of-magnitude speedups in state and gradient evaluations. Counting training data generation, Shape-DINOs reduce necessary PDE solves by 1-2 orders-of-magnitude compared to a strictly PDE-based approach for a single OUU problem. Moreover, Shape-DINO construction costs can be amortized across many objectives and risk measures, enabling large-scale shape OUU for complex systems.
[LG-67] A Covering Framework for Offline POMDPs Learning using Belief Space Metric
链接: https://arxiv.org/abs/2603.03191
作者: Youheng Zhu,Yiping Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:In off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel covering analysis framework that exploits the intrinsic metric structure of the belief space (distributions over latent states) to relax traditional coverage assumptions. By assuming value-relevant functions are Lipschitz continuous in the belief space, we derive error bounds that mitigate exponential blow-ups in horizon and memory length. Our unified analysis technique applies to a broad class of OPE algorithms, yielding concrete error bounds and coverage requirements expressed in terms of belief space metrics rather than raw history coverage. We illustrate the improved sample efficiency of this framework via case studies: the double-sampling Bellman error minimization algorithm, and the memory-based future-dependent value functions (FDVF). In both cases, our coverage definition based on the belief space metric yields tighter bounds.
[LG-68] Scalable Uncertainty Quantification for Black-Box Density-Based Clustering
链接: https://arxiv.org/abs/2603.03188
作者: Nicola Bariletto,Stephen G. Walker
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We introduce a novel framework for uncertainty quantification in clustering. By combining the martingale posterior paradigm with density-based clustering, uncertainty in the estimated density is naturally propagated to the clustering structure. The approach scales effectively to high-dimensional and irregularly shaped data by leveraging modern neural density estimators and GPU-friendly parallel computation. We establish frequentist consistency guarantees and validate the methodology on synthetic and real data.
[LG-69] From Reachability to Learnability: Geometric Design Principles for Quantum Neural Networks
链接: https://arxiv.org/abs/2603.03071
作者: Vishal S. Ngairangbam,Michael Spannowsky
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Machine Learning (stat.ML)
*备注: 29 pages, 5 figures, 3 tables
点击查看摘要
Abstract:Classical deep networks are effective because depth enables adaptive geometric deformation of data representations. In quantum neural networks (QNNs), however, depth or state reachability alone does not guarantee this feature-learning capability. We study this question in the pure-state setting by viewing encoded data as an embedded manifold in \mathbb{CP}^{2^n-1} and analysing infinitesimal unitary actions through Lie-algebra directions. We introduce Classical-to-Lie-algebra (CLA) maps and the criterion of almost Complete Local Selectivity (aCLS), which combines directional completeness with data-dependent local selectivity. Within this framework, we show that data-independent trainable unitaries are complete but non-selective, i.e. learnable rigid reorientations, whereas pure data encodings are selective but non-tunable, i.e. fixed deformations. Hence, geometric flexibility requires a non-trivial joint dependence on data and trainable weights. We further show that accessing high-dimensional deformations of many-qubit state manifolds requires parametrised entangling directions; fixed entanglers such as CNOT alone do not provide adaptive geometric control. Numerical examples validate that CLS-satisfying data re-uploading models outperform non-tunable schemes while requiring only a quarter of the gate operations. Thus, the resulting picture reframes QNN design from state reachability to controllable geometry of hidden quantum representations.
[LG-70] Generalized Bayes for Causal Inference
链接: https://arxiv.org/abs/2603.03035
作者: Emil Javurek,Dennis Frauen,Yuxin Wang,Stefan Feuerriegel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Uncertainty quantification is central to many applications of causal machine learning, yet principled Bayesian inference for causal effects remains challenging. Standard Bayesian approaches typically require specifying a probabilistic model for the data-generating process, including high-dimensional nuisance components such as propensity scores and outcome regressions. Standard posteriors are thus vulnerable to strong modeling choices, including complex prior elicitation. In this paper, we propose a generalized Bayesian framework for causal inference. Our framework avoids explicit likelihood modeling; instead, we place priors directly on the causal estimands and update these using an identification-driven loss function, which yields generalized posteriors for causal effects. As a result, our framework turns existing loss-based causal estimators into estimators with full uncertainty quantification. Our framework is flexible and applicable to a broad range of causal estimands (e.g., ATE, CATE). Further, our framework can be applied on top of state-of-the-art causal machine learning pipelines (e.g., Neyman-orthogonal meta-learners). For Neyman-orthogonal losses, we show that the generalized posteriors converge to their oracle counterparts and remain robust to first-stage nuisance estimation error. With calibration, we thus obtain valid frequentist uncertainty even when nuisance estimators converge at slower-than-parametric rates. Empirically, we demonstrate that our proposed framework offers causal effect estimation with calibrated uncertainty across several causal inference settings. To the best of our knowledge, this is the first flexible framework for constructing generalized Bayesian posteriors for causal machine learning.
[LG-71] Variance reduction in lattice QCD observables via normalizing flows
链接: https://arxiv.org/abs/2603.02984
作者: Ryan Abbott,Denis Boyda,Yang Fu,Daniel C. Hackett,Gurtej Kanwar,Fernando Romero-López,Phiala E. Shanahan,Julian M. Urban
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
*备注: 15 pages, 4 figures, 2 tables
点击查看摘要
Abstract:Normalizing flows can be used to construct unbiased, reduced-variance estimators for lattice field theory observables that are defined by a derivative with respect to action parameters. This work implements the approach for observables involving gluonic operator insertions in the SU(3) Yang-Mills theory and two-flavor Quantum Chromodynamics (QCD) in four space-time dimensions. Variance reduction by factors of 10 - 60 is achieved in glueball correlation functions and in gluonic matrix elements related to hadron structure, with demonstrated computational advantages. The observed variance reduction is found to be approximately independent of the lattice volume, so that volume transfer can be utilized to minimize training costs.
[LG-72] Sparse autoencoders reveal organized biological knowledge but minimal regulatory logic in single-cell foundation models: a comparative atlas of Geneformer and scGPT
链接: https://arxiv.org/abs/2603.02952
作者: Ihor Kendiukhov
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
*备注:
点击查看摘要
Abstract:Background: Single-cell foundation models such as Geneformer and scGPT encode rich biological information, but whether this includes causal regulatory logic rather than statistical co-expression remains unclear. Sparse autoencoders (SAEs) can resolve superposition in neural networks by decomposing dense activations into interpretable features, yet they have not been systematically applied to biological foundation models. Results: We trained TopK SAEs on residual stream activations from all layers of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512), producing atlases of 82525 and 24527 features, respectively. Both atlases confirm massive superposition, with 99.8 percent of features invisible to SVD. Systematic characterization reveals rich biological organization: 29 to 59 percent of features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST, with U-shaped layer profiles reflecting hierarchical abstraction. Features organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (median 2.36x), and form cross-layer information highways (63 to 99.8 percent). When tested against genome-scale CRISPRi perturbation data, only 3 of 48 transcription factors (6.2 percent) show regulatory-target-specific feature responses. A multi-tissue control yields marginal improvement (10.4 percent, 5 of 48 TFs), establishing model representations as the bottleneck. Conclusions: These models have internalized organized biological knowledge, including pathway membership, protein interactions, functional modules, and hierarchical abstraction, yet they encode minimal causal regulatory logic. We release both feature atlases as interactive web platforms enabling exploration of more than 107000 features across 30 layers of two leading single-cell foundation models. 
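A minimal sketch of the TopK sparse-autoencoder mechanism described above (pure Python, toy sizes; this illustrates the sparsity constraint, not the paper's training code, and all weight values are random placeholders):

```python
import random

def topk_sae_encode(x, W_enc, b_enc, k):
    """TopK SAE encoder: compute pre-activations W_enc @ x + b_enc, then keep
    only the k largest activations (others zeroed), enforcing exact sparsity."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    idx = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    keep = set(idx)
    return [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]

def topk_sae_decode(z, W_dec, b_dec):
    """Reconstruct the residual-stream activation from the sparse features."""
    d = len(b_dec)
    out = list(b_dec)
    for j, zj in enumerate(z):
        if zj != 0.0:                 # only active features contribute
            for i in range(d):
                out[i] += zj * W_dec[j][i]
    return out

random.seed(1)
d, m, k = 8, 32, 4   # toy sizes; the paper uses d up to 1152, m up to ~80k
x = [random.gauss(0, 1) for _ in range(d)]
W_enc = [[random.gauss(0, 0.3) for _ in range(d)] for _ in range(m)]
W_dec = [[random.gauss(0, 0.3) for _ in range(d)] for _ in range(m)]
z = topk_sae_encode(x, W_enc, [0.0] * m, k)
x_hat = topk_sae_decode(z, W_dec, [0.0] * d)
print(sum(1 for zj in z if zj != 0.0))   # at most k active features
```

The hard TopK constraint is what lets a trained SAE decompose a dense activation into a handful of interpretable features per input.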
[LG-73] Bias and Fairness in Self-Supervised Acoustic Representations for Cognitive Impairment Detection
链接: https://arxiv.org/abs/2603.02937
作者: Kashaf Gulzar,Korbinian Riedhammer,Elmar Nöth,Andreas K. Maier,Paula Andrea Pérez-Toro
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures, 6 tables, Journal paper
点击查看摘要
Abstract:Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents a systematic bias analysis of acoustic-based CI and depression classification using the DementiaBank Pitt Corpus. We compare traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0 (W2V2), and evaluate classification performance across gender, age, and depression-status subgroups. For CI detection, higher-layer W2V2 embeddings outperform baseline features (UAR up to 80.6%), but exhibit performance disparities; specifically, females and younger participants demonstrate lower discriminative power (AUC: 0.769 and 0.746, respectively) and substantial specificity disparities (\Delta_\mathrm{spec} up to 18% and 15%, respectively), leading to a higher risk of misclassifications than their counterparts. These disparities reflect representational biases, defined as systematic differences in model performance across demographic or clinical subgroups. Depression detection within CI subjects yields lower overall performance, with mild improvements from low and mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited, indicating that each task depends on distinct representations. These findings emphasize the need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, particularly in light of demographic and clinical heterogeneity in real-world applications.
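The subgroup specificity gap reported above can be computed as sketched below; the labels, predictions, and group assignments are made up for illustration:

```python
def subgroup_specificity(y_true, y_pred, groups):
    """Per-subgroup specificity (true-negative rate) and the gap Delta_spec
    between the best- and worst-served subgroups."""
    spec = {}
    for g in set(groups):
        tn = sum(1 for t, p, gi in zip(y_true, y_pred, groups)
                 if gi == g and t == 0 and p == 0)
        neg = sum(1 for t, gi in zip(y_true, groups) if gi == g and t == 0)
        spec[g] = tn / neg if neg else float("nan")
    gap = max(spec.values()) - min(spec.values())
    return spec, gap

# Hypothetical predictions for a CI classifier, grouped by gender.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
groups = ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"]
spec, gap = subgroup_specificity(y_true, y_pred, groups)
print(spec, round(gap, 3))
```

A large gap means one subgroup's healthy speakers are misclassified as impaired far more often than another's, which is the representational bias the paper quantifies.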
[LG-74] ChemFlow: A Hierarchical Neural Network for Multiscale Representation Learning in Chemical Mixtures
链接: https://arxiv.org/abs/2603.02810
作者: Jinming Fan,Chao Qian,Wilhelm T. S. Huck,William E. Robinson,Shaodong Zhou
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accurate prediction of the physicochemical properties of molecular mixtures using graph neural networks remains a significant challenge, as it requires simultaneous embedding of intramolecular interactions while accounting for mixture composition (i.e., concentrations and ratios). Existing approaches are ill-equipped to emulate realistic mixture environments, where densely coupled interactions propagate across hierarchical levels - from atoms and functional groups to entire molecules - and where cross-level information exchange is continuously modulated by composition. To bridge the gap between isolated molecules and realistic chemical environments, we present ChemFlow, a novel hierarchical framework that integrates atomic, functional group, and molecular-level features, facilitating information flow across these levels to predict the behavior of complex chemical mixtures. ChemFlow employs an atomic-level feature fusion module, Chem-embed, to generate context-aware atomic representations influenced by the mixture state and atomic characteristics. Next, bidirectional group-to-molecule and molecule-to-group attention mechanisms enable ChemFlow to capture functional group interactions both within and across molecules in the mixture. By dynamically adjusting representations based on concentration and composition, ChemFlow excels at predicting concentration-dependent properties and significantly outperforms state-of-the-art models in both concentration-sensitive and concentration-independent systems. Extensive experiments demonstrate ChemFlow’s superior accuracy and efficiency in modeling complex chemical mixtures.
[LG-75] Neural quantum support vector data description for one-class classification
链接: https://arxiv.org/abs/2603.02700
作者: Changjae Im,Hyeondo Oh,Daniel K. Park
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures
点击查看摘要
Abstract:One-class classification (OCC) is a fundamental problem in machine learning with numerous applications, such as anomaly detection and quality control. With the increasing complexity and dimensionality of modern datasets, there is a growing demand for advanced OCC techniques with better expressivity and efficiency. We introduce Neural Quantum Support Vector Data Description (NQSVDD), a classical-quantum hybrid framework for OCC that performs end-to-end optimized hierarchical representation learning. NQSVDD integrates a classical neural network with trainable quantum data encoding and a variational quantum circuit, enabling the model to learn nonlinear feature transformations tailored to the OCC objective. The hybrid architecture maps input data into an intermediate high-dimensional feature space and subsequently projects it into a compact latent space defined through quantum measurements. Importantly, both the feature embedding and the latent representation are jointly optimized such that normal data form a compact cluster, for which a minimum-volume enclosing hypersphere provides an effective decision boundary. Experimental evaluations on benchmark datasets demonstrate that NQSVDD achieves competitive or superior AUC performance compared to classical Deep SVDD and quantum baselines, while maintaining parameter efficiency and robustness under realistic noise conditions.
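The minimum-volume hypersphere objective can be illustrated with a Deep-SVDD-style sketch covering only the center and distance computations; the hybrid classical-quantum encoder that produces the latent embeddings is out of scope here, and the numbers are toy values:

```python
def svdd_center(latents):
    """Center c of the normal cluster: mean of the latent embeddings."""
    d = len(latents[0])
    return [sum(z[i] for z in latents) / len(latents) for i in range(d)]

def svdd_score(z, c):
    """Anomaly score: squared distance to the hypersphere center."""
    return sum((zi - ci) ** 2 for zi, ci in zip(z, c))

# Hypothetical latent embeddings (in NQSVDD these would come from quantum
# measurements); normal points cluster tightly, the last point is anomalous.
normal = [[0.1, 0.0], [-0.1, 0.1], [0.0, -0.1], [0.05, 0.05]]
c = svdd_center(normal)
loss = sum(svdd_score(z, c) for z in normal) / len(normal)  # training objective
anomaly = svdd_score([2.0, 2.0], c)
print(round(loss, 4), round(anomaly, 2))
```

Training minimizes the mean squared distance of normal data to the center, so at test time a large distance flags an anomaly.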
[LG-76] Exact Functional ANOVA Decomposition for Categorical Inputs Models
链接: https://arxiv.org/abs/2603.02673
作者: Baptiste Ferrere(IMT, SINCLAIR AI Lab),Nicolas Bousquet(SINCLAIR AI Lab),Fabrice Gamboa(IMT, ANITI),Jean-Michel Loubes(IMT, REGALIA, ANITI),Joseph Muré
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Functional ANOVA offers a principled framework for interpretability by decomposing a model’s prediction into main effects and higher-order interactions. For independent features, this decomposition is well-defined, strongly linked with SHAP values, and serves as a cornerstone of additive explainability. However, the lack of an explicit closed-form expression for general dependent distributions has forced practitioners to rely on costly sampling-based approximations. We completely resolve this limitation for categorical inputs. By bridging functional analysis with the extension of discrete Fourier analysis, we derive a closed-form decomposition without any assumption. Our formulation is computationally very efficient. It seamlessly recovers the classical independent case and extends to arbitrary dependence structures, including distributions with non-rectangular support. Furthermore, leveraging the intrinsic link between SHAP and ANOVA under independence, our framework yields a natural generalization of SHAP values for the general categorical setting.
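For the classical independent case that the framework recovers, the decomposition can be written out directly; the toy function and marginals below are illustrative, not from the paper:

```python
from itertools import product

def anova_decomposition(f, px, py):
    """Exact functional ANOVA of f(x, y) for independent categorical inputs:
    f = f0 + f1(x) + f2(y) + f12(x, y), with each term centred under the
    input distribution (the classical independent case)."""
    X, Y = range(len(px)), range(len(py))
    f0 = sum(px[x] * py[y] * f(x, y) for x, y in product(X, Y))
    f1 = {x: sum(py[y] * f(x, y) for y in Y) - f0 for x in X}
    f2 = {y: sum(px[x] * f(x, y) for x in X) - f0 for y in Y}
    f12 = {(x, y): f(x, y) - f0 - f1[x] - f2[y] for x, y in product(X, Y)}
    return f0, f1, f2, f12

f = lambda x, y: x + 2 * y + 3 * x * y   # toy model with an interaction
px, py = [0.5, 0.5], [0.25, 0.75]
f0, f1, f2, f12 = anova_decomposition(f, px, py)
# The four terms reconstruct f exactly at every input.
print(all(abs(f0 + f1[x] + f2[y] + f12[(x, y)] - f(x, y)) < 1e-12
          for x in (0, 1) for y in (0, 1)))
```

Main effects are marginal averages minus the grand mean, and the interaction is the residual; the paper's contribution is the closed form for the dependent case, where these simple marginal averages no longer suffice.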
[LG-77] Convex and Non-convex Federated Learning with Stale Stochastic Gradients: Diminishing Step Size is All You Need
链接: https://arxiv.org/abs/2603.02639
作者: Xinran Zheng,Tara Javidi,Behrouz Touri
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a general framework for distributed stochastic optimization under delayed gradient models. In this setting, n local agents leverage their own data and computation to assist a central server in minimizing a global objective composed of agents’ local cost functions. Each agent is allowed to transmit stochastic (potentially biased and delayed) estimates of its local gradient. While prior work has advocated delay-adaptive step sizes for stochastic gradient descent (SGD) in the presence of delays, we demonstrate that a pre-chosen diminishing step size is sufficient and matches the performance of the adaptive scheme. Moreover, our analysis establishes that diminishing step sizes recover the optimal SGD rates for nonconvex and strongly convex objectives.
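A toy illustration of the main claim, assuming a fixed gradient delay and the pre-chosen diminishing step size \eta_t = c/t; the objective, delay, and constants are all made up for this sketch:

```python
import random

def delayed_sgd(grad, x0, steps, delay, c=0.5):
    """SGD where the server applies stochastic gradients that arrive with a
    fixed delay, using a pre-chosen diminishing step size eta_t = c / t
    (no delay-adaptive tuning)."""
    random.seed(0)
    x = x0
    buffer = []                       # queue of in-flight gradient estimates
    for t in range(1, steps + 1):
        buffer.append(grad(x) + random.gauss(0, 0.1))  # noisy local gradient
        if len(buffer) > delay:
            g = buffer.pop(0)         # gradient computed `delay` steps ago
            x -= (c / t) * g
    return x

# Strongly convex toy objective f(x) = (x - 3)^2 / 2, so grad(x) = x - 3.
x_star = delayed_sgd(lambda x: x - 3.0, x0=0.0, steps=5000, delay=5)
print(round(x_star, 2))
```

Despite every applied gradient being five steps stale, the iterate still converges toward the minimizer at 3, consistent with the claim that a diminishing schedule alone handles the delay.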
[LG-78] Combinatorial Sparse PCA Beyond the Spiked Identity Model
链接: https://arxiv.org/abs/2603.02607
作者: Syamantak Kumar,Purnamrita Sarkar,Kevin Tian,Peiyuan Zhang
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 36 pages, 6 figures
点击查看摘要
Abstract:Sparse PCA is one of the most well-studied problems in high-dimensional statistics. In this problem, we are given samples from a distribution with covariance \Sigma , whose top eigenvector v \in \mathbb{R}^d is s -sparse. Existing sparse PCA algorithms can be broadly categorized into (1) combinatorial algorithms (e.g., diagonal or elementwise covariance thresholding) and (2) SDP-based algorithms. While combinatorial algorithms are much simpler, they are typically only analyzed under the spiked identity model (where \Sigma = I_d + \gamma vv^\top for some \gamma > 0 ), whereas SDP-based algorithms require no additional assumptions on \Sigma . We demonstrate explicit counterexample covariances \Sigma against the success of standard combinatorial algorithms for sparse PCA, when moving beyond the spiked identity model. In light of this discrepancy, we give the first combinatorial method for sparse PCA that provably succeeds for general \Sigma using s^2 \cdot \mathrm{polylog}(d) samples and d^2 \cdot \mathrm{poly}(s, \log(d)) time, by providing a global convergence guarantee on a variant of the truncated power method of Yuan and Zhang (2013). We provide a natural generalization of our method to recovering a vector in a sparse leading eigenspace. Finally, we evaluate our method on synthetic and real-world sparse PCA datasets.
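The truncated power method underlying the algorithm can be sketched as follows; the spiked covariance below is a toy example, and this omits the paper's global-convergence machinery:

```python
def truncated_power_method(Sigma, s, iters=100):
    """Truncated power method (in the style of Yuan & Zhang, 2013): iterate
    v <- normalize(truncate(Sigma v, s)), keeping only the s largest-magnitude
    coordinates, to recover an s-sparse leading eigenvector."""
    d = len(Sigma)
    v = [1.0 / d ** 0.5] * d          # uniform start
    for _ in range(iters):
        w = [sum(Sigma[i][j] * v[j] for j in range(d)) for i in range(d)]
        top = sorted(range(d), key=lambda i: abs(w[i]), reverse=True)[:s]
        keep = set(top)
        w = [wi if i in keep else 0.0 for i, wi in enumerate(w)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    return v

# Spiked covariance I + gamma * u u^T with a 2-sparse leading eigenvector.
d, gamma = 6, 2.0
u = [0.6, 0.8, 0, 0, 0, 0]
Sigma = [[(1.0 if i == j else 0.0) + gamma * u[i] * u[j] for j in range(d)]
         for i in range(d)]
v = truncated_power_method(Sigma, s=2)
print([round(abs(x), 2) for x in v])  # recovers the support {0, 1} of u
```

In this well-conditioned spiked example the iteration locks onto the correct support and converges to u; the paper's point is precisely that such guarantees require care once \Sigma is a general covariance.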
[LG-79] Low-Degree Method Fails to Predict Robust Subspace Recovery
链接: https://arxiv.org/abs/2603.02594
作者: He Jia,Aravindan Vijayaraghavan
类目: Machine Learning (stat.ML); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 27 pages, 1 figure
点击查看摘要
Abstract:The low-degree polynomial framework has been highly successful in predicting computational versus statistical gaps for high-dimensional problems in average-case analysis and machine learning. This success has led to the low-degree conjecture, which posits that this method captures the power and limitations of efficient algorithms for a wide class of high-dimensional statistical problems. We identify a natural and basic hypothesis testing problem in \mathbb{R}^n which is polynomial time solvable, but for which the low-degree polynomial method fails to predict its computational tractability even up to degree k = n^{\Omega(1)} . Moreover, the low-degree moments match exactly up to degree k = O(\sqrt{\log n/\log\log n}) . Our problem is a special case of the well-studied robust subspace recovery problem. The lower bounds suggest that there is no polynomial time algorithm for this problem. In contrast, we give a simple and robust polynomial time algorithm that solves the problem (and noisy variants of it), leveraging anti-concentration properties of the distribution. Our results suggest that the low-degree method and low-degree moments fail to capture algorithms based on anti-concentration, challenging their universality as a predictor of computational barriers.
[LG-80] Optimizing Orbital Parameters of Satellites for a Global Quantum Network
链接: https://arxiv.org/abs/2603.02480
作者: Athul Ashok,Owen DePoint,Jackson MacDonald,Albert Williams,Don Towsley
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Long (8 page, 5 figure) version of paper appearing at QCNC 2026
点击查看摘要
Abstract:Due to fundamental limitations on terrestrial quantum links, satellites have received considerable attention for their potential as entanglement generation sources in a global quantum internet. In this work, we focus on the problem of designing a constellation of satellites for such a quantum network. We find satellite inclination angles and satellite cluster allocations to achieve maximal entanglement generation rates to fixed sets of globally distributed ground stations. Exploring two black-box optimization frameworks, a Bayesian Optimization (BO) approach and a Genetic Algorithm (GA) approach, we find comparable results, indicating their effectiveness for this optimization task. While GA and BO often perform remarkably similarly, BO typically converges more efficiently, whereas the GA's continued late-stage improvement suggests less susceptibility to local maxima. In either case, they offer substantial improvements over naive approaches that maximize coverage with respect to ground station placement.
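A minimal GA loop of the kind used for such black-box orbital searches might look like this; the rate function is a toy stand-in peaked at an arbitrary inclination, not the paper's actual entanglement-rate model:

```python
import random

def genetic_search(fitness, n_params, pop=30, gens=60, sigma=5.0):
    """Minimal GA over satellite inclination angles (degrees): elitist
    selection plus Gaussian mutation, maximizing a black-box rate function."""
    random.seed(0)
    population = [[random.uniform(0, 90) for _ in range(n_params)]
                  for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=fitness, reverse=True)
        elite = scored[: pop // 2]            # keep the best half unchanged
        children = [[min(90, max(0, g + random.gauss(0, sigma))) for g in p]
                    for p in random.choices(elite, k=pop - len(elite))]
        population = elite + children
    return max(population, key=fitness)

# Toy stand-in for the entanglement-rate objective: peaked at 53 degrees
# for each of three satellite clusters (purely illustrative).
rate = lambda incs: -sum((i - 53.0) ** 2 for i in incs)
best = genetic_search(rate, n_params=3)
print([round(i, 1) for i in best])
```

BO would replace the mutation loop with a surrogate-model acquisition step; both treat the rate function as an opaque black box, which is why the two frameworks are directly comparable here.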
[LG-81] Conformal Graph Prediction with Z-Gromov Wasserstein Distances
链接: https://arxiv.org/abs/2603.02460
作者: Gabriel Melo,Thibaut de Saivre,Anna Calissano,Florence d’Alché-Buc
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Supervised graph prediction addresses regression problems where the outputs are structured graphs. Although several approaches exist for graph–valued prediction, principled uncertainty quantification remains limited. We propose a conformal prediction framework for graph-valued outputs, providing distribution–free coverage guarantees in structured output spaces. Our method defines nonconformity via the Z–Gromov–Wasserstein distance, instantiated in practice through Fused Gromov–Wasserstein (FGW), enabling permutation-invariant comparison between predicted and candidate graphs. To obtain adaptive prediction sets, we introduce Score Conformalized Quantile Regression (SCQR), an extension of Conformalized Quantile Regression (CQR) to handle complex output spaces such as graph–valued outputs. We evaluate the proposed approach on a synthetic task and a real problem of molecule identification.
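The split-conformal step can be sketched independently of the FGW machinery: calibrate a quantile of nonconformity scores, then admit every candidate graph whose distance to the prediction falls below it. The distances below are placeholders, not real FGW values:

```python
import math

def conformal_graph_sets(cal_scores, test_candidates, alpha=0.1):
    """Split conformal prediction with a distance-based nonconformity score:
    the calibration-quantile threshold decides which candidate graphs enter
    the prediction set, with distribution-free 1 - alpha coverage."""
    n = len(cal_scores)
    # Finite-sample correction: the ceil((n+1)(1-alpha))-th smallest score.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = sorted(cal_scores)[k - 1]
    return [g for g, dist in test_candidates if dist <= q], q

# Hypothetical FGW distances on a calibration split of 20 graphs.
cal = [0.11, 0.32, 0.07, 0.25, 0.19, 0.41, 0.15, 0.09, 0.28, 0.22,
       0.13, 0.35, 0.18, 0.05, 0.30, 0.24, 0.16, 0.12, 0.27, 0.21]
candidates = [("g1", 0.10), ("g2", 0.33), ("g3", 0.50), ("g4", 0.20)]
pred_set, q = conformal_graph_sets(cal, candidates, alpha=0.1)
print(pred_set, q)
```

SCQR refines this by regressing conditional quantiles of the score, so the threshold adapts to the input rather than being one global q.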
[LG-82] Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates Oracle Complexity and Information-Theoretic Limits
链接: https://arxiv.org/abs/2603.02417
作者: Daniel Zantedeschi,Kumar Muthuraman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:We develop a Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix – not an exogenous scalar variance. Under exchangeable sampling, the mini-batch gradient covariance is pinned down (to leading order) by the projected covariance of per-sample gradients: it equals projected Fisher information for well-specified likelihood losses and the projected Godambe (sandwich) matrix for general M-estimation losses. This identification forces a diffusion approximation with Fisher/Godambe-structured volatility (effective temperature \tau = \eta/b ) and yields an Ornstein-Uhlenbeck linearization whose stationary covariance is given in closed form by a Fisher-Lyapunov equation. Building on this geometry, we prove matching minimax upper and lower bounds of order \Theta(1/N) for Fisher/Godambe risk under a total oracle budget N ; the lower bound holds under a martingale oracle condition (bounded predictable quadratic variation), strictly subsuming i.i.d. and exchangeable sampling. These results imply oracle-complexity guarantees for \epsilon -stationarity in the Fisher dual norm that depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than ambient dimension or Euclidean conditioning. Experiments confirm the Lyapunov predictions and show that scalar temperature matching cannot reproduce directional noise structure.
[LG-83] Neural Demand Estimation with Habit Formation and Rationality Constraints
链接: https://arxiv.org/abs/2603.02331
作者: Marta Grzeskiewicz
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We develop a flexible neural demand system for continuous budget allocation that estimates budget shares on the simplex by minimizing KL divergence. Shares are produced via a softmax of a state-dependent preference scorer and disciplined with regularity penalties (monotonicity, Slutsky symmetry) to support coherent comparative statics and welfare without imposing a parametric utility form. State dependence enters through a habit stock defined as an exponentially weighted moving average of past consumption. Simulations recover elasticities and welfare accurately and show sizable gains when habit formation is present. In our empirical application using Dominick’s analgesics data, adding habit reduces out-of-sample error by c.33%, reshapes substitution patterns, and increases CV losses from a 10% ibuprofen price rise by about 15-16% relative to a static model. The code is available at this https URL .
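The habit-stock update and softmax share map described above admit a compact sketch; the scorer below (base preference plus a weighted habit term) and its weights are illustrative assumptions, not the estimated model:

```python
import math

def update_habit(habit, consumption, lam=0.3):
    """Habit stock as an exponentially weighted moving average of past
    consumption: h_t = (1 - lam) * h_{t-1} + lam * c_{t-1}."""
    return [(1 - lam) * h + lam * c for h, c in zip(habit, consumption)]

def budget_shares(scores):
    """Softmax of state-dependent preference scores -> shares on the simplex."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [ei / z for ei in e]

# Hypothetical scorer: base preference plus a habit term for three goods.
base = [0.2, 0.5, 0.1]
habit = [0.0, 0.0, 0.0]
for _ in range(10):
    shares = budget_shares([b + 1.5 * h for b, h in zip(base, habit)])
    habit = update_habit(habit, shares)
print([round(s, 3) for s in shares], round(sum(shares), 6))
```

Because the habit term feeds consumption back into the scores, past purchases reinforce themselves, which is exactly the state dependence the static model misses.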
[LG-84] Topological Causal Effects
链接: https://arxiv.org/abs/2603.02289
作者: Kwangho Kim,Hajin Lee
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Estimating causal effects is particularly challenging when outcomes arise in complex, non-Euclidean spaces, where conventional methods often fail to capture meaningful structural variation. We develop a framework for topological causal inference that defines treatment effects through differences in the topological structure of potential outcomes, summarized by power-weighted silhouette functions of persistence diagrams. We develop an efficient, doubly robust estimator in a fully nonparametric model, establish functional weak convergence, and construct a formal test of the null hypothesis of no topological effect. Empirical studies illustrate that the proposed method reliably quantifies topological treatment effects across diverse complex outcome types.
[LG-85] Quantum AS-DeepOnet: Quantum Attentive Stacked DeepONet for Solving 2D Evolution Equations
链接: https://arxiv.org/abs/2603.02261
作者: Hongquan Wang,Hanshu Chen,Ilia Marchevsky,Zhuojia Fu
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:DeepONet enables retraining-free inference across varying initial conditions or source terms at the cost of high computational requirements. This paper proposes a hybrid quantum operator network (Quantum AS-DeepOnet) suitable for solving 2D evolution equations. By combining Parameterized Quantum Circuits and cross-subnet attention methods, we can solve 2D evolution equations using only 60% of the trainable parameters while maintaining accuracy and convergence comparable to the classical DeepONet method.
[LG-86] OnDA: On-device Channel Pruning for Efficient Personalized Keyword Spotting INTERSPEECH2026
链接: https://arxiv.org/abs/2603.02247
作者: Matteo Risso,Alessio Burrello,Daniele Jahier Pagliari
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted for review at Interspeech2026
点击查看摘要
Abstract:Always-on keyword spotting (KWS) demands on-device adaptation to cope with user- and environment-specific distribution shifts under tight latency and energy budgets. This paper proposes, for the first time, coupling weight adaptation (i.e., on-device training) with architectural adaptation, in the form of online structured channel pruning, for personalized on-device KWS. Starting from a state-of-the-art self-learning personalized KWS pipeline, we compare data-agnostic and data-aware pruning criteria applied on in-field pseudo-labelled user data. On the HeySnips and HeySnapdragon datasets, we achieve up to 9.63x model-size compression with respect to unpruned baselines at iso-task performance, measured as the accuracy at 0.5 false alarms per hour. When deploying our adaptation pipeline on a Jetson Orin Nano embedded GPU, we achieve up to 1.52x/1.57x and 1.64x/1.77x latency and energy-consumption improvements during online training/inference compared to weights-only adaptation.
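A data-aware pruning criterion of the kind compared in the paper can be sketched as ranking channels by mean absolute activation on in-field user data and dropping whole low-scoring channels; the layer and activations below are toy values, and the exact on-device criterion may differ:

```python
def rank_channels_data_aware(activations):
    """Data-aware pruning criterion: score each channel by its mean absolute
    activation over in-field (pseudo-labelled) user data; low-scoring
    channels are pruned first."""
    n_ch = len(activations[0])
    scores = [sum(abs(sample[c]) for sample in activations) / len(activations)
              for c in range(n_ch)]
    return sorted(range(n_ch), key=lambda c: scores[c])  # prune order

def prune_channels(weights, prune_order, n_prune):
    """Structured pruning: drop whole output channels (rows of the layer),
    which shrinks the layer rather than just zeroing weights."""
    dropped = set(prune_order[:n_prune])
    return [row for c, row in enumerate(weights) if c not in dropped]

# Hypothetical per-sample activations for a 4-channel layer.
acts = [[0.9, 0.01, 0.5, 0.02], [1.1, 0.02, 0.4, 0.01], [0.8, 0.00, 0.6, 0.03]]
order = rank_channels_data_aware(acts)
W = [[1, 2], [3, 4], [5, 6], [7, 8]]       # toy 4x2 weight matrix
W_pruned = prune_channels(W, order, n_prune=2)
print(order[:2], len(W_pruned))
```

Structured (channel-level) pruning is what delivers real latency and energy gains on embedded hardware, since the pruned layer is genuinely smaller.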
[LG-87] LMU-Based Sequential Learning and Posterior Ensemble Fusion for Cross-Domain Infant Cry Classification
链接: https://arxiv.org/abs/2603.02245
作者: Niloofar Jazaeri,Hilmi R. Dajani,Marco Janeczek,Martin Bouchard
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 7 pages
点击查看摘要
Abstract:Decoding infant cry causes remains challenging for healthcare monitoring due to short nonstationary signals, limited annotations, and strong domain shifts across infants and datasets. We propose a compact acoustic framework that fuses MFCC, STFT, and pitch features within a multi-branch CNN encoder and models temporal dynamics using an enhanced Legendre Memory Unit (LMU). Compared to LSTMs, the LMU backbone provides stable sequence modeling with substantially fewer recurrent parameters, supporting efficient deployment. To improve cross-dataset generalization, we introduce calibrated posterior ensemble fusion with entropy-gated weighting to preserve domain-specific expertise while mitigating dataset bias. Experiments on Baby2020 and Baby Crying demonstrate improved macro-F1 under cross-domain evaluation, along with leakage-aware splits and real-time feasibility for on-device monitoring.
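The entropy-gated posterior fusion can be sketched as follows, assuming each dataset-specific expert outputs a calibrated class posterior; the gating rule (inverse normalized entropy) and the numbers are illustrative:

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_gated_fusion(posteriors):
    """Posterior ensemble fusion with entropy gating: weight each domain
    expert's class posterior by its confidence (1 minus normalized entropy),
    so an expert that is uncertain on an out-of-domain cry contributes less."""
    n_cls = len(posteriors[0])
    h_max = math.log(n_cls)
    weights = [1.0 - entropy(p) / h_max for p in posteriors]
    z = sum(weights) or 1.0
    weights = [w / z for w in weights]
    fused = [sum(w * p[c] for w, p in zip(weights, posteriors))
             for c in range(n_cls)]
    s = sum(fused)
    return [f / s for f in fused], weights

# Hypothetical posteriors from two dataset-specific experts over 3 cry causes.
p_a = [0.80, 0.15, 0.05]   # confident expert (in-domain)
p_b = [0.34, 0.33, 0.33]   # near-uniform expert (out-of-domain)
fused, w = entropy_gated_fusion([p_a, p_b])
print([round(f, 3) for f in fused], [round(wi, 2) for wi in w])
```

The near-uniform expert receives almost no weight, so the fused prediction follows the in-domain expert, which is how the gate preserves domain-specific expertise across datasets.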
Attachment Download
Click to download today's full paper list